I'm looking at using udpcast to broadcast large disk images (10+ GB) to a very large network of machines (1,000-10,000 receivers) over a mostly switched, partially segmented gig-ethernet network.
Needless to say, the network of machines is all production-critical and I cannot get access to perform real testing. However testing it on my home network I can see some potential problems:
- If _any_ receiver is misbehaving or unreachable then this stops all transmissions. Is there a way to get udpcast to drop troublesome receivers in this situation (other than unicast)?
- Has anyone used the --ttl option to multicast over routers? Does it work (the manpage is unclear)? Does it need special routers?
- Any other scaling tips? Should I try to go for the full set of machines at once or break up the broadcast into groups of machines?
If anyone has used udpcast on such large networks, can you share any experiences?
Rich.
Hi Rich, I tried to solve a problem much smaller than yours but still had incredible difficulty. I was moving 10GB datasets out to 64 receivers over a flat switched network using multicast. Unfortunately, for reasons I never tracked down, files of this size would always get corrupted along the way even though all the receivers had received all packets (i.e. the md5sum would be different across all the different machines). Eventually I ended up using small BitTorrent clients instead of udpcast, since BitTorrent checks the hash of each block. This also makes the process take about twice as long, but better to get correct data slow than corrupt data fast!
Hope you have better luck, -Michael
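[Editor's note] The per-block hashing that made BitTorrent reliable here can be approximated after a udpcast transfer too. The following is only a minimal sketch, not anything udpcast or BitTorrent ships: the function names and the 1 MiB block size are assumptions chosen for illustration. It builds a list of SHA-1 digests on the sender and reports which blocks of a received copy do not match.

```python
import hashlib

BLOCK_SIZE = 1 << 20  # 1 MiB per block; an arbitrary choice for this sketch


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return a list of SHA-1 hex digests, one per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha1(block).hexdigest())
    return digests


def corrupt_blocks(expected, path, block_size=BLOCK_SIZE):
    """Return indices of blocks in `path` whose digest differs from `expected`."""
    actual = block_hashes(path, block_size)
    bad = [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
    # Blocks that exist on only one side (length mismatch) also count as bad.
    shorter, longer = sorted((len(expected), len(actual)))
    bad.extend(range(shorter, longer))
    return bad
```

The digest list is small (20 bytes of hash per megabyte of image), so it could be shipped to every receiver by any reliable channel, and only the blocks reported as bad would need to be fetched again.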
On Thu, Apr 24, 2008 at 11:06:03AM -0400, Michael Holroyd wrote:
I tried to solve a problem much smaller than yours but still had incredible difficulty. I was moving 10GB datasets out to 64 receivers over a flat switched network using multicast. Unfortunately, for reasons I never tracked down, files of this size would always get corrupted along the way even though all the receivers had received all packets (i.e. the md5sum would be different across all the different machines).
I haven't seen this problem (my tests are probably too small-scale) but I notice that the protocol doesn't do any sort of error detection for the dataBlocks. So we're relying on UDP's 16-bit checksum and maybe Ethernet's CRC32. The UDP checksum is known to be weak (and is optional over IPv4), and Ethernet's CRC only covers each individual link, so data corrupted inside a switch, router or NIC can still arrive with a valid CRC.
Shouldn't be too hard to add a more robust checksum to the packets. Is anyone interested in a patch? I might have a go at one later.
Rich.
PS. My cheap-ish consumer switch slows down from gigabit-ethernet to 10 Mbps as soon as I ask it to do multicast or broadcast. Is this normal?
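[Editor's note] To make the "more robust checksum" idea above concrete, here is a minimal sketch of wrapping each data block with a strong digest. It is not udpcast's wire format and not the patch discussed in this thread: the 32-bit sequence-number header, the use of SHA-256, and the names `wrap` and `unwrap` are all assumptions made for illustration.

```python
import hashlib
import struct

HEADER = struct.Struct("!I")  # hypothetical header: 32-bit block sequence number
DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def wrap(seq, payload):
    """Build a packet: header + payload + SHA-256 over header and payload."""
    header = HEADER.pack(seq)
    digest = hashlib.sha256(header + payload).digest()
    return header + payload + digest


def unwrap(packet):
    """Return (seq, payload), or raise ValueError if the packet is damaged."""
    if len(packet) < HEADER.size + DIGEST_LEN:
        raise ValueError("packet too short")
    body, digest = packet[:-DIGEST_LEN], packet[-DIGEST_LEN:]
    if hashlib.sha256(body).digest() != digest:
        raise ValueError("checksum mismatch")
    (seq,) = HEADER.unpack_from(body)
    return seq, body[HEADER.size:]
```

A receiver that gets a ValueError would treat the packet as lost and request a retransmission, the same way it would for a packet that never arrived. The cost is 32 extra bytes per packet plus the hashing time on both ends.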
Hi Rich, I'd be very interested in a patch. I might have worked on one myself (I'd already edited the source to turn off the absurd stdout spam), but unfortunately my sender is a Windows box and I wasn't in the mood to battle compiling udpcast for Windows. The behavior I saw was that 50-55 of my receivers would have the same md5sum, but it was still the *wrong* md5sum. The outliers could very plausibly be due to checksums being unreliable, and perhaps there was some bit-flipping that occurred before the first router sent everything out on multicast. Let me know how it goes if you decide to try it out, -Michael
Just a quick addition to my earlier message. I just retested the imaging after opening both ports 9000 and 9001 on both senders and receivers. I sent the image to 19 receivers, and this time it went through with no errors and the files were the correct size. So, at least in this case, it would appear that port 9001 is playing a part in this.
Udpcast mailing list Udpcast@udpcast.linux.lu https://lll.lgl.lu/mailman/listinfo/udpcast
+----------------------------------------------------------+ Michael D. Setzer II - Computer Science Instructor Guam Community College Computer Center mailto:mikes@kuentos.guam.net mailto:msetzerii@gmail.com http://www.guam.net/home/mikes Guam - Where America's Day Begins +----------------------------------------------------------+
http://setiathome.berkeley.edu (Original) Number of Seti Units Returned: 19,471 Processing time: 32 years, 290 days, 12 hours, 58 minutes (Total Hours: 287,489)
BOINC@HOME CREDITS SETI 5,269,727.070797 | EINSTEIN 1,573,038.609732 | ROSETTA 480,077.992597
I had a similar problem recently on a much smaller scale. I was testing udpcast in a classroom to send a just-under-7GB ntfsclone image file to various machines. I had one sender and 4 receivers, and it worked fine twice. Then I tried with 8 receivers, and it failed. There were no error messages, and I did it twice with the same result. I'm not sure what the cause is.
I am planning on doing some more testing, and it might be a problem with SELinux and ports. To get it to work, I had to open UDP ports 9000 and 9001 on the sender, and UDP port 9000 on the receiver. Perhaps the receiver also needs 9001?
The files on all the receivers seem to be the same size. Before this, I was using a script to download the file via ftp using ncftp. On the successful runs, it would udpcast the files to the Linux partition, and then run the script to restore the new XP partition. The script would then show the file as being the same, skip the download, and go straight to the restore. On the failed batch, it started downloading the file via ftp, since the files were not the same.
I've used udpcast to image 19 machines from one sender using udpcast images, and have not noticed any errors, and have had the systems run the disk test on boot with no errors.
So, I don't know if it is a size problem, or ports, or a kernel option, or something else. I'll try some more things, and try to see what is different between a good file and a bad one, and whether it is always the same.
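[Editor's note] For seeing what is different between a good file and a bad one, a chunk-by-chunk comparison shows where the damage sits and how much of it there is. This is only a sketch under stated assumptions: the function name and the 64 KiB chunk size are made up for illustration, and both files must be readable from the same machine.

```python
def differing_ranges(path_a, path_b, chunk_size=64 * 1024):
    """Return merged (start, end) byte ranges, end exclusive, at chunk
    granularity, in which the two files differ."""
    ranges = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            span = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][1] == offset:
                    # Extend the previous range when damage is contiguous.
                    ranges[-1] = (ranges[-1][0], offset + span)
                else:
                    ranges.append((offset, offset + span))
            offset += span
    return ranges
```

Running it on the bad copies from several receivers would show whether the same offsets are damaged everywhere (pointing at the sender or a shared path) or different offsets on each machine (pointing at individual receivers or links).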
+----------------------------------------------------------+ Michael D. Setzer II - Computer Science Instructor Guam Community College Computer Center mailto:mikes@kuentos.guam.net mailto:msetzerii@gmail.com http://www.guam.net/home/mikes Guam - Where America's Day Begins +----------------------------------------------------------+