Kyle Cordes wrote:
To try to track down the udpcast corrupt file problem, I ran some more tests. This time I used a ~50GB file, a sender, and only 1 receiver.
side       bytes
sender:    53687091200
receiver:  53686091776
In all of my large-file runs, udp-receiver comes up a bit "short": it is missing some of the data. It never comes up "long".
I created a 50 GB test file with predictable text data in it, using this ugly little program:
#include <stdio.h>

// 16 bytes per entry.
int main(void)
{
    long long gb = 1024 * 1024 * 1024;
    long long m = 50 * gb;
    long long i;

    for (i = 0; i < m; i += 16) {
        printf("%.15lld\n", i);
    }
}
so that I could easily look at the files. I found that the received file ended with the same data as the sent file; in other words, the problem is *not* a matter of terminating early, or of some other problem in finishing up the transfer.
Rather, it's much earlier. According to "cmp":
differ: byte 2098176010, line 131136001
That's a little under 2 GB of the way into a 50 GB file.
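As a cross-check on that offset, here is a minimal sketch in C that maps a cmp byte number back to the line it falls on and the text the generator should have written there. It assumes the 16-byte entry layout of the program above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Each line of the test file is 15 digits plus a newline. */
#define ENTRY_SIZE 16LL

/* 1-based line number containing a 1-based byte offset (cmp's numbering). */
long long line_of_byte(long long byte)
{
    assert(byte >= 1);
    return (byte - 1) / ENTRY_SIZE + 1;
}

/* 1-based column of that byte within its line. */
long long column_of_byte(long long byte)
{
    assert(byte >= 1);
    return (byte - 1) % ENTRY_SIZE + 1;
}

/* Text the generator wrote on the given 1-based line (15 digits + NUL). */
void expected_text(long long line, char out[16])
{
    assert(line >= 1);
    snprintf(out, 16, "%.15lld", (line - 1) * ENTRY_SIZE);
}
```

For byte 2098176010 this gives line 131136001, column 10, on the entry 000002098176000, which agrees with the line number cmp printed. The same arithmetic on the file sizes says the receiver is 999424 bytes short, which is exactly 62464 entries.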
Strangely, I ran repeated tests with 10 GB files, and didn't get any corruption.
Alain - it would warm my heart to see you ack these messages, even if you don't have a solution at hand.
I do get your messages, but for the moment I am somewhat busy on another project (preparing the release of mtools version 4 with Unicode support). However, in about two weeks' time I'll be more available to check out what is going on.
The strange thing is, we do use udpcast for duplicating entire disks, most of which are larger than 50GB by now, and we never did notice any ill effect. A large piece of data missing in the middle would have been pretty obvious, but we have never seen any of this so far.
So apparently, it only happens under certain circumstances... and we need to understand exactly which circumstances trigger this.
I appreciate your work on this subject, and I'm pretty confident that within a couple more tests, you'll have identified what is going on (... making it easier for me to fix...)
One suggestion (careful: this may take some time, and needs *huge* amounts of disk space): try running udpcast under strace (strace -fo log.send udp-sender ... and strace -fo log.recv udp-receiver ...), and try to locate the system calls around the place where the missing data occurs. The strace output should contain reads and writes whose parameter is your textual data. The stretch of output between the read or write of 000002098175984 and that of 000002098225152 is the interesting one here...
Actually, to be precise, as udpcast reads and writes in largish chunks, you'll not see a read or write for every line. So the last read or write before the error will probably have a number less than 000002098175984, and the next write will have a number larger than 000002098225152, but you get the gist of it.
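To find the exact boundaries of each missing stretch without reading through the data by eye, a small scanner over the received file may help. This is only a sketch: find_gaps is a hypothetical helper, and it assumes the file was produced by the 16-byte-per-entry generator and that any missing stretch is a whole number of entries (if not, the entries after the gap are misaligned and the scanner will produce many spurious reports).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRY_SIZE 16   /* 15 digits plus a newline */

/*
 * Read 16-byte entries from `in` and report every place where the number
 * on a line is not the previous number plus 16. Each discontinuity is
 * printed to `report`; the return value is how many were found.
 */
long long find_gaps(FILE *in, FILE *report)
{
    char entry[ENTRY_SIZE];
    long long expected = 0;   /* the generator starts counting at 0 */
    long long gaps = 0;

    assert(in != NULL && report != NULL);
    while (fread(entry, 1, ENTRY_SIZE, in) == (size_t)ENTRY_SIZE) {
        entry[ENTRY_SIZE - 1] = '\0';            /* replace the newline */
        long long value = strtoll(entry, NULL, 10);
        if (value != expected) {
            fprintf(report, "expected %.15lld, found %.15lld (jump of %lld bytes)\n",
                    expected, value, value - expected);
            gaps++;
        }
        expected = value + ENTRY_SIZE;           /* resynchronise after a gap */
    }
    return gaps;
}
```

Call it with the received file opened through fopen in "rb" mode and stdout as the report stream. On a 32-bit system the program probably needs to be built with -D_FILE_OFFSET_BITS=64 to open a file this large.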
Another weird thing is that although the problem happens relatively "early" in the file, it only occurs above a certain minimum file size... just as if the file were being corrupted after the fact (say, after 10GB have been transferred). It might be interesting to do a cmp midway through and see whether the difference is already there "from the beginning" (for instance, you may start your cmp as soon as your received file reaches 2GB in size).
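The midway comparison could be sketched like this: compare only the first limit bytes of the two files, where limit is however much the receiver has written so far. first_difference is a hypothetical name, and GNU cmp's -n option should do the same job where it is available.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Compare two streams byte by byte, stopping after `limit` bytes or when
 * either stream runs out of data. Returns the 1-based offset of the first
 * differing byte (the numbering cmp uses), or 0 if none was seen.
 */
long long first_difference(FILE *a, FILE *b, long long limit)
{
    long long pos = 0;

    assert(a != NULL && b != NULL);
    while (pos < limit) {
        int ca = getc(a);
        int cb = getc(b);
        if (ca == EOF || cb == EOF)
            return 0;        /* out of data before any difference */
        pos++;
        if (ca != cb)
            return pos;
    }
    return 0;
}
```

A result of 0 while the transfer is still running, followed by a nonzero result for the same range once it has finished, would point to the data being damaged after it was first written.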
And, do several runs with the same input file always produce the error at the exact same spot?
Regards,
Alain