We have some less than reliable used laptops to image.
Sometimes on a few machines the PXE boot starts and then fails, and it takes another boot to get it going again. I mention this because the problem described below was not observed before with more predictable machines.
If 15 machines are booted, and set to go into receive mode, and udp-sender is run on the imaging server, and 4 clients need to be rebooted to restart PXE, we have noticed that we see a list of only the first 11 machines on the ready list from udp-sender, even after the remaining 4 have come up and show that UDP receiver is ready (although those 4 do not display "hit any key to start").
It seems like there is a window of a few minutes before the udp-sender will stop listening for more udp-receivers to say they are ready.
I checked the options and I don't see any that are designed to increase how long it will wait to see more machines responding as ready.
Are there any suggestions?
For now, I think we'll wait until all machines show udp-receiver ready and then run udp-sender, but I'd hope this will not be necessary.
--Donald Teed
Since my earlier email, I've also witnessed this:
Booting up all 15 machines, PXE boot succeeds and each shows the UDP receiver ready. I run udp-sender --file fileimage.gz on the server and then only a portion of them (8) show as ready in the list under udp-sender. Again, the ones that have not shown up in the ready list (7) do not show "Hit any key to start", or similar, on the client.
All machines were powered off, the process repeated and then all 15 did show as ready. All machines completed imaging successfully. I didn't use full-duplex in either run, as I earlier suspected this behaviour might be related to that option.
--Donald Teed
begin Thursday 29 April 2004 18:50, Donald Teed quote:
If 15 machines are booted, and set to go into receive mode, and udp-sender is run on the imaging server, and 4 clients need to be rebooted to restart PXE, we have noticed that we see a list of only the first 11 machines on the ready list from udp-sender, even after the remaining 4 have come up and show that UDP receiver is ready (although those 4 do not display "hit any key to start").
No idea what is going on here; possibly it is related to the problems seen earlier (slow card initialization).
It seems like there is a window of a few minutes before the udp-sender will stop listening for more udp-receivers to say they are ready.
No, UDP-Sender listens for new receivers all the time.
However, when it first starts up, it also sends out a probe packet (hello) to get any receivers to register, that might have started up before.
The idea is simple:
- when receiver starts, it sends out CONNECT packet
- when sender starts, it sends out HELLO packet
- on receipt of HELLO, receiver tries CONNECT again.
With this, it does not matter whether receivers or senders are started up first, at least in theory:
- if receiver starts first, it's the sender's HELLO that triggers the successful rendez-vous
- if on the other hand the sender starts first, it's the receiver CONNECT that triggers the rendez-vous.
Now, in your case, what may be happening is the following:
- due to the flakiness of the card, the card may not yet be ready to send once udp-receiver starts up, and thus the CONNECT is not really sent. In case you start the sender afterwards, this doesn't matter, because by the time the sender sends its HELLO, the receiver is ready to receive that message and reply to it.
- for those receivers that needed a reboot: if these start after the sender, there is no further HELLO, so a lost CONNECT is never retried.
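The reasoning above can be checked with a small toy model. This is only a sketch in Python, not udpcast code: the names Receiver, nic_ready_delay and registered_receivers are hypothetical, and the timings (a card that needs 5 seconds before it can send, a sender started at t=30, four machines coming back at t=90) are made-up assumptions chosen to mirror the 11-of-15 case.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    name: str
    start_time: float       # when udp-receiver starts, in seconds
    nic_ready_delay: float  # assumed: seconds after start before the card can send

    def can_send(self, now: float) -> bool:
        return now >= self.start_time + self.nic_ready_delay


def registered_receivers(receivers, sender_start, hello_interval=None, horizon=60.0):
    """Return the names of receivers the sender has heard from by `horizon`."""
    seen = set()

    # Initial CONNECT: sent once, at receiver start. It only counts if the
    # sender is already running and the card can transmit at that instant.
    for r in receivers:
        if r.start_time >= sender_start and r.can_send(r.start_time):
            seen.add(r.name)

    # HELLO: one at sender start, plus repeats if an interval is given.
    hello_times = [sender_start]
    if hello_interval is not None:
        t = sender_start + hello_interval
        while t <= horizon:
            hello_times.append(t)
            t += hello_interval

    # Each HELLO makes every running receiver with a working card CONNECT again.
    for t in hello_times:
        for r in receivers:
            if r.start_time <= t and r.can_send(t):
                seen.add(r.name)
    return seen


# 11 machines boot cleanly at t=0; 4 need a reboot and only come up at t=90.
receivers = [Receiver(f"pc{i:02d}", 0.0, 5.0) for i in range(11)]
receivers += [Receiver(f"pc{i:02d}", 90.0, 5.0) for i in range(11, 15)]

single_hello = registered_receivers(receivers, sender_start=30.0)
repeated_hello = registered_receivers(
    receivers, sender_start=30.0, hello_interval=3.0, horizon=120.0
)
print(len(single_hello), len(repeated_hello))
```

In this model the single HELLO at t=30 reaches only the 11 machines that were already up, while the 4 late machines lose their one CONNECT to the card delay. Repeating the HELLO gives them another chance once their cards are ready.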
I checked the options and I don't see any that are designed to increase how long it will wait to see more machines responding as ready.
There is the "--rexmit-hello-interval 3000" option which instructs the sender to keep on resending its HELLO packets until transmission is started. The number is the interval, in milliseconds, between two HELLO packets. This might solve the issue.
udp-sender --rexmit-hello-interval 3000 --file fileimage.gz
Regards,
Alain
On Thu, 29 Apr 2004, Alain Knaff wrote:
[snip]
The idea is simple:
- when receiver starts, it sends out CONNECT packet
- when sender starts, it sends out HELLO packet
- on receipt of HELLO, receiver tries CONNECT again.
With this, it does not matter whether receivers or senders are started up first, at least in theory:
- if receiver starts first, it's the sender's HELLO that triggers the
successful rendez-vous
- if on the other hand the sender starts first, it's the receiver
CONNECT that triggers the rendez-vous.
Now, in your case, what may be happening is the following:
- due to the flakiness of the card, the card may not yet be ready to
send once udp-receiver starts up, and thus the CONNECT is not really sent. In case you start the sender afterwards, this doesn't matter, because by the time the sender sends its HELLO, the receiver is ready to receive that message and reply to it.
- for those receivers that needed a reboot: if these start after the
sender, there is no HELLO.
If a CONNECT can trigger the rendez-vous, then if I notice a certain number of machines not connecting, I should be able to simply reboot them and have them try this again. The weird thing was that we tried that, and the same 4 machines did not rendezvous while 11 were standing by ready. That was what led me to conclude there was a window of time to rendez-vous and it had elapsed. However on a third session the 4 missed machines were included in a new batch and did get imaged OK.
I checked the options and I don't see any that are designed to increase how long it will wait to see more machines responding as ready.
There is the "--rexmit-hello-interval 3000" option which instructs the sender to keep on resending its HELLO packets until transmission is started. The number is the interval, in milliseconds, between two HELLO packets. This might solve the issue.
udp-sender --rexmit-hello-interval 3000 --file fileimage.gz
OK, cool, that might be useful.
There are a few things I need to test. I can try substituting the switch involved. In general the client machines are a little unpredictable since they were carried around daily by University students for 2 or 3 years.
--Donald Teed
begin Thursday 29 April 2004 21:20, Donald Teed quote:
If a CONNECT can trigger the rendez-vous, then if I notice a certain number of machines not connecting, I should be able to simply reboot them and have them try this again. The weird thing was that we tried that, and the same 4 machines did not rendezvous while 11 were standing by ready.
You know, you neither confirmed nor denied that you usually start up the sender after the receivers.
So let's just suppose you always start up the sender after receivers, except for those where first PXE fails:
In that case, your observed behaviour is consistent with machines that NEVER send out that first CONNECT after reboot. If, due to some construction limitations, the card is not operational within the first 5 seconds after driver activation, the first CONNECT would ALWAYS fall within that window. By rebooting the machines, you'd trigger another driver removal and re-insertion, which again would make the card unavailable during a short time, and the CONNECT would again be dropped.
Interesting things to test (in order to confirm or deny the hypothesis):
1. Start the sender first
   - do _all_ machines now fail? If yes, I think that's excellent confirmation that the first CONNECT after reboot never makes it.
   - do only some of the machines fail, and always the same ones after a _complete_ restart of the experiment? If yes, the problem seems to depend not only on the card model, but on each card individually.
   - do only some of the machines fail, and always different ones after a complete restart of the experiment? If yes, we do have a true mystery ;-)
2. Run a tcpdump on the server, and see what packets you get (port 9000 and 9001) from which machines.
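For the tcpdump test, one way to see which machines stay silent is to tally the capture per source address. The following is a minimal Python sketch that assumes the one-line-per-packet text format that "tcpdump -n" prints for IPv4 UDP. The helper names, the addresses and the sample lines are all made up for illustration; they are not output from a real udpcast session.

```python
import re
from collections import Counter

# Matches the "IP src.port > dst.port:" part of a tcpdump -n output line.
LINE_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def packets_per_source(lines, server_ip):
    """Count captured packets per source IP, ignoring the server's own packets."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an IPv4 packet line
        src_ip = m.group(1)
        if src_ip != server_ip:
            counts[src_ip] += 1
    return counts


def silent_hosts(expected_ips, counts):
    """Return the expected clients from which no packet was captured."""
    return sorted(ip for ip in expected_ips if counts[ip] == 0)


# Hypothetical capture, e.g. saved from:
#   tcpdump -n udp port 9000 or udp port 9001
sample = [
    "18:50:01.000001 IP 192.168.0.1.9001 > 192.168.0.255.9000: UDP, length 16",
    "18:50:01.000950 IP 192.168.0.21.9000 > 192.168.0.1.9001: UDP, length 16",
    "18:50:01.001200 IP 192.168.0.22.9000 > 192.168.0.1.9001: UDP, length 16",
]
clients = ["192.168.0.21", "192.168.0.22", "192.168.0.23"]

counts = packets_per_source(sample, server_ip="192.168.0.1")
missing = silent_hosts(clients, counts)
print(missing)
```

Comparing the list of silent hosts across several complete restarts would show whether it is always the same machines that fail to get a packet through.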
That was what led me to conclude there was a window of time to rendez-vous and it had elapsed.
Nope, there is no such window.
However on a third session the 4 missed machines were included in a new batch and did get imaged OK.
good.
I checked the options and I don't see any that are designed to increase how long it will wait to see more machines responding as ready.
There is the "--rexmit-hello-interval 3000" option which instructs the sender to keep on resending its HELLO packets until transmission is started. The number is the interval, in milliseconds, between two HELLO packets. This might solve the issue.
udp-sender --rexmit-hello-interval 3000 --file fileimage.gz
OK, cool, that might be useful.
There are a few things I need to test. I can try substituting the switch involved.
Could help. But from what I've read in the various newsgroups, this particular problem (card initialization) has more to do with the cards themselves than the switch.
In general the client machines are a little unpredictable since they were carried around daily by University students for 2 or 3 years.
--Donald Teed
Alain
We had tried starting in both ways: server/sender first and client/receiver first.
Anyway, the '--rexmit-hello-interval 3000' option did fix our problem. We are using full-duplex and everything is running fine with other defaults.
It was never consistently ignoring any client, so it was hard to know exactly why the packet wasn't heard. What we observe with the rexmit-hello-interval is that out of a pool of 15 clients, it can take a few "phone calls" to reach the other end. In the last session I witnessed today, out of 15, only 6 initially connected as "Ready", and then another 2, then another 3, and so on until we had a full allotment. Combining this with the min-clients switch worked well to automate the launch reliably.
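For anyone scripting this, the combination described above could be assembled along these lines. This is a hedged Python sketch: sender_command is a hypothetical helper, and the "--min-clients" spelling is taken from the wording in this thread, so it is an assumption that should be checked against the help output of the installed udp-sender, since the option name may differ between udpcast versions.

```python
import shlex


def sender_command(image, hello_interval_ms=3000, min_clients=None, full_duplex=False):
    """Build an argv list for udp-sender; flag names are assumptions from the thread."""
    argv = ["udp-sender", "--rexmit-hello-interval", str(hello_interval_ms)]
    if min_clients is not None:
        # Start automatically once this many clients have registered.
        argv += ["--min-clients", str(min_clients)]
    if full_duplex:
        argv.append("--full-duplex")
    argv += ["--file", image]
    return argv


argv = sender_command("fileimage.gz", min_clients=15, full_duplex=True)
command_line = shlex.join(argv)  # shell-quoted form, for display or logging
print(command_line)
```

The argv list can be handed to subprocess.run(argv) as it stands, which avoids shell quoting problems if the image path contains spaces.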
Thanks for the quick solution...
--Donald Teed