I've managed to get reliable transfers to all 85 machines at this point (bad NIC in the sender, I think).
I'm still bottlenecking at 45 Mbit or so, though, despite the fact that all machines are on switched copper gigabit.
If FEC is turned on (either 8x8/128 or 16x4/128), receivers disconnect throughout the transfer until only ~10 remain at the end.
If FEC is turned off, I get high retransmission rates (15-25%), but the transfer at least succeeds to all hosts. Again, I'm sending about 20 GB of data, which should take about an hour at 45 Mbit. In practice, it takes about an hour and a half, and sometimes close to two hours.
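(Checking the arithmetic on that: 20 GB is roughly 160,000 Mbit, and 160,000 Mbit / 45 Mbit/s is about 3,600 s, i.e. just under an hour; a 90-minute run works out to an effective rate of only about 30 Mbit/s.)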
For simple point-to-point tests with udpcast (between 1 and 5 receivers), I can reliably achieve higher bandwidths (sending 128 megabytes at 80 Mbit), both with and without FEC.
It sounds like my issue is similar to Ramon Bastiaan's: http://lll.lgl.lu/pipermail/udpcast/2004-September/000298.html
Ramon, have you had any luck?
George,
I'm afraid not really... well, a little. I'll explain why and what.
It seemed that our problem was the combination of udpcast using synchronous writes to disk by default and the fact that we pipe the data through tar.
After some test scenarios (casting /dev/zero from the sender to /dev/null on the receiver, casting tarred data from the sender to /dev/null on the receiver), we found that the slowdown/bottleneck was being caused by the synchronous writes. When we set udp-receiver to --nosync, we saw a significant speed increase.
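For reference, the test scenarios looked roughly like this (paths and options reproduced from memory, so double-check them against your udpcast version):

    # raw network test: no disk involved on either end
    udp-sender --file /dev/zero           # on the server
    udp-receiver --file /dev/null         # on a node

    # disk-read test: tarred data read from disk on the server, discarded on the node
    tar cf - -C /image . | udp-sender     # on the server
    udp-receiver --file /dev/null         # on a node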
We are running a Linux cluster of 275 machines which we install from a central server using SystemImager and udpcast.
Because SystemImager's imaging tools don't compress or image the files, we need to cast an entire filesystem (lots of files) over the network. And because udpcast only supports sending/receiving a single file and writing to a single file descriptor, it can only write asynchronously to one file. Because of this we pipe all the files through tar on both the receiver and the sender.
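Concretely, our setup is along these lines (the image path and the target mount point are placeholders; use whatever SystemImager puts on your systems):

    # on the server: stream the image tree through tar into udp-sender
    tar cf - -C /var/lib/systemimager/images/node . | udp-sender

    # on each node: read from udp-receiver's stdout and unpack with tar
    udp-receiver | tar xpf - -C /a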
This is when the problem arose. When we didn't use tar, we could get high speeds and the (network/hard disk) hardware seemed to become the limiting factor, but only when using --nosync writes. Because we use tar (which obviously has no --nosync option), tar itself became the bottleneck.
Tar writes synchronously to disk by default and there seems to be no way to disable this. As an alternative Alain Knaff suggested using cpio, but we never got around to it. That would have required some massive changes to both the server and client casting setup for SystemImager, and by then we had got the speed up to 100 Mbps (still only a tenth of the link's capacity, but better than before).
This was enough for us, since we are casting an image of approximately 2 GB. I can imagine it being a bigger problem when you're casting 20 GB on a daily basis.
So, to cut a long story short:
Try adding the --nosync argument to your udp-receivers, and see what happens.
When you don't pipe the data through an external program you should be able to get (very) fast speeds (at least we did). If you _do_ pipe the data through tar or something similar, we found that the maximum lies around 100 Mbps, for the reasons explained above.
Another approach could be to first send the data as one big tar to your clients and extract it after the transmission, or to try piping through cpio instead of tar (I haven't tried that one myself).
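The big-tar variant would look something like this (placeholder paths; each client needs enough scratch space to hold the tarball before unpacking):

    # on the server: send a pre-made tarball as a single file
    udp-sender --file /srv/images/node.tar

    # on each client: receive asynchronously to local disk, then unpack
    udp-receiver --nosync --file /tmp/node.tar
    tar xpf /tmp/node.tar -C /a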
Hope this helps you a bit. I know it can be a frustrating problem ;). Let me know how things work out for you.
Kind regards, Ramon.
George Coulouris wrote:
I've managed to get reliable transfers to all 85 machines at this point (bad NIC in the sender, I think).
I'm still bottlenecking at 45 Mbit or so, though, despite the fact that all machines are on switched copper gigabit.
If FEC is turned on (either 8x8/128 or 16x4/128), receivers disconnect throughout the transfer until only ~10 remain at the end.
If FEC is turned off, I get high retransmission rates (15-25%), but the transfer at least succeeds to all hosts. Again, I'm sending about 20 GB of data, which should take about an hour at 45 Mbit. In practice, it takes about an hour and a half, and sometimes close to two hours.
For simple point-to-point tests with udpcast (between 1 and 5 receivers), I can reliably achieve higher bandwidths (sending 128 megabytes at 80 Mbit), both with and without FEC.
It sounds like my issue is similar to Ramon Bastiaan's: http://lll.lgl.lu/pipermail/udpcast/2004-September/000298.html
Ramon, have you had any luck?
Ramon Bastiaans wrote: [snip]
Try adding the --nosync argument to your udp-receivers, and see what happens.
I'll give this a shot, and then try enabling FEC, and then try increasing the bitrate.
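Concretely, I'm planning to run something along these lines (option syntax taken from the udpcast man page as I understand it, so treat it as a sketch; note that --nosync only affects how udp-receiver writes an output file, so it may not change much while the data is piped straight into tar):

    # on each node
    udp-receiver --nosync | tar xpf - -C /target

    # on the sender, once that behaves: re-enable FEC and raise the rate cap
    tar cf - -C /source . | udp-sender --fec 8x8/128 --max-bitrate 80m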
When you don't pipe the data through an external program you should be able to get (very) fast speeds (at least we did). If you _do_ pipe the data through tar or something similar, we found that the maximum lies around 100 Mbps, for the reasons explained above.
We do use tar in all cases, since we're moving trees around. 100 Mbps is a lot better than the ~30 Mbit/s we're achieving now!
Thanks for your help!
Regards, George Coulouris
George Coulouris wrote:
Ramon Bastiaans wrote: [snip]
Try adding the --nosync argument to your udp-receivers, and see what happens.
Sadly, --nosync didn't seem to have any effect. There is a strangeness I've noticed in the log about halfway through the transfer:
...
bytes= 10 540 294 128 re-xmits=2147464 ( 29.6%) slice=0112 73 709 551 615 - 59
bytes= 10 540 457 200 re-xmits=2147484 (-29.-6%) slice=0112 73 709 551 615 - 59
...
bytes= 19 628 001 280 re-xmits=3351819 ( -6.-9%) slice=0112 73 709 551 615 - 31
Transfer complete.
Disconnecting #0
...
On Friday 25 February 2005 04:22, George Coulouris wrote:
George Coulouris wrote:
Ramon Bastiaans wrote: [snip]
Try adding the --nosync argument to your udp-receivers, and see what happens.
Sadly, --nosync didn't seem to have any effect. There is a strangeness I've noticed in the log about halfway through the transfer:
...
bytes= 10 540 294 128 re-xmits=2147464 ( 29.6%) slice=0112 73 709 551 615 - 59
bytes= 10 540 457 200 re-xmits=2147484 (-29.-6%) slice=0112 73 709 551 615 - 59
...
bytes= 19 628 001 280 re-xmits=3351819 ( -6.-9%) slice=0112 73 709 551 615 - 31
Transfer complete.
Disconnecting #0
...
This was due to a confusion between signed and unsigned variables. However, the only consequence of the bug was the messed-up display.
It is fixed in today's version (20050226).
Regards,
Alain
On Thu, 24 Feb 2005, Ramon Bastiaans wrote: [...]
Because SystemImager's imaging tools don't compress or image the files, we need to cast an entire filesystem (lots of files) over the network. And because udpcast only supports sending/receiving a single file and writing to a single file descriptor, it can only write asynchronously to one file. Because of this we pipe all the files through tar on both the receiver and the sender.
The idea is to send a whole partition, if possible. This can be significantly faster than using tar (see below). We used to clone whole disks or partitions on our 128-node cluster, which was pretty fast: about 20 MByte/s over two Fast Ethernet links (we used our own tool, though, not udpcast).
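With udpcast, the equivalent would presumably be something like this (I haven't verified the exact options, since we used our own tool; /dev/sda2 is just a placeholder for whichever partition you clone):

    # on the master node
    udp-sender --file /dev/sda2

    # on each target node
    udp-receiver --nosync --file /dev/sda2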
This is when the problem arose. When we didn't use tar, we could get high speeds and the (network/hard disk) hardware seemed to become the limiting factor, but only when using --nosync writes. Because we use tar (which obviously has no --nosync option), tar itself became the bottleneck.
The problem with tar is that it has to deal with each file individually, which causes many movements of the disk's head. Each move to the track where the next file or its corresponding inode is located brings some latency, which ultimately reduces throughput.
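As a rough illustration (made-up but plausible numbers, not measurements): if every file costs on the order of 10 ms in head movements (seeking to the inode and then to the data), the disk can handle at most about 100 files per second. With an average file size of 100 KB, that caps the effective rate at roughly 10 MByte/s, no matter how fast the platters can stream sequentially.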
If you clone a single file (like a whole partition), then there are very few head movements, and those only go to the next track on the disk. Hence, you get higher throughput than with tar.
Of course, cloning a whole partition also copies empty blocks, which is not strictly necessary. Therefore, you clone more data than when using tar, but you can do it with higher throughput. Whether this is actually a win depends on a number of factors, most importantly the fill rate of your partition.
- Felix