Kyle Cordes wrote:
Alain Knaff wrote:
The strange thing is, we do use udpcast for duplicating entire disks, most of which are larger than 50GB by now, and we never noticed any ill effect. A large piece of data missing in the middle would have been pretty obvious, but we have never seen anything like this so far.
Alain,
By any chance do you typically use it like so on your large files?
udp-receiver | some-process ?
Contrary to my earlier findings, in ongoing testing I have found that if I use it like this:
udp-receiver --file foo
I sometimes get bad results; and things like this:
udp-receiver --pipe "lzop -d" --file foo
also sometimes get bad results.
But I noticed that my real scripts do this:
udp-receiver | lzop -d | pg_restore
and I tested like this:
udp-receiver | lzop -d >foo
... and I get correct results. To 5 or 6 receivers. Every night.
Also, I found that having lzop (or another common compression tool) in the loop acts as a guard against data integrity problems: if udp-receiver skips or damages data, the stream fails lzop's checksums and the whole process fails.
Thus, it looks like there is some issue that comes into play with --file, but not when simply letting the data fall out on stdout.
I'm setting the issue aside for the moment, but later I may beat on it a little more to try to track down the specifics of the failure.
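To illustrate why a checksum in the pipeline catches skipped data, here is a minimal sketch in C. It uses Adler-32 as a stand-in; it does not reproduce lzop's actual file format or checksum handling, and the function name adler32_sum is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32 over a buffer: two running sums modulo 65521.
 * Hypothetical helper for illustration, not code from lzop or udpcast. */
uint32_t adler32_sum(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;  /* sum of all bytes, plus one */
        b = (b + a) % 65521u;        /* sum of the running values of a */
    }
    return (b << 16) | a;
}
```

If the receiver drops even a single byte, the sum computed over what arrived no longer matches the one recorded by the sender, so the decompressor can refuse the stream instead of silently passing it on.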
I also think that perhaps the "--pipe" and "--file" features are unnecessary; that udp-receiver would be better by being simpler, and simply assume that the user will redirect the output where they need it.
I have a suspicion that there may be a bug in some versions of the Linux kernel as far as seek is concerned, that seek is not thread-safe.
Udpcast uses lseek(fd, 0, SEEK_CUR) to read the current file position for statistics printing. Theoretically, this should be harmless, as it should have no influence on the file position. But I suspect that what it really does is read the file position, do some stuff, and then _write_back_ that same position, leading to corruption if a read or write in a different thread happens in between (the file position would be reset to just before that read or write).
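One way to probe this suspicion outside of udpcast is a small stress test. The sketch below is only that: one thread polls lseek(fd, 0, SEEK_CUR) in a tight loop while another writes numbered blocks, and the file is checked afterwards. The names seek_race_check and poll_position, the block size, and the temporary file path are all assumptions made for this example; none of it is taken from the udpcast sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

struct poller_args {
    int fd;
    atomic_int stop;
};

/* Poller thread: only ever asks for the current position. */
static void *poll_position(void *p)
{
    struct poller_args *args = p;
    while (!atomic_load(&args->stop))
        (void)lseek(args->fd, 0, SEEK_CUR);
    return NULL;
}

/* Returns 0 if the file is intact, 1 if size or contents are wrong,
 * -1 on a setup or I/O error. */
int seek_race_check(int blocks)
{
    char path[] = "/tmp/seek-race-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                       /* file vanishes once fd is closed */

    struct poller_args args = { .fd = fd, .stop = 0 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, poll_position, &args) != 0) {
        close(fd);
        return -1;
    }

    unsigned char buf[BLOCK_SIZE];
    int result = 0;
    /* Writer: block i is filled with the byte value (i & 0xff). */
    for (int i = 0; i < blocks && result == 0; i++) {
        memset(buf, i & 0xff, sizeof buf);
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            result = -1;
    }
    atomic_store(&args.stop, 1);
    pthread_join(tid, NULL);

    /* A rewound position would show up as a short file ... */
    if (result == 0) {
        struct stat st;
        if (fstat(fd, &st) == -1)
            result = -1;
        else if (st.st_size != (off_t)blocks * BLOCK_SIZE)
            result = 1;
    }
    /* ... or as a block holding the wrong fill byte. */
    for (int i = 0; i < blocks && result == 0; i++) {
        if (pread(fd, buf, sizeof buf, (off_t)i * BLOCK_SIZE)
                != (ssize_t)sizeof buf) {
            result = 1;
            break;
        }
        for (size_t j = 0; j < sizeof buf; j++) {
            if (buf[j] != (unsigned char)(i & 0xff)) {
                result = 1;
                break;
            }
        }
    }
    close(fd);
    return result;
}
```

On a kernel where the position query is harmless, seek_race_check() should return 0 for any block count. A return value of 1 would point in the direction described above, although a clean run does not prove the absence of a race.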
Could you try out whether you still get the problem if you comment out the contents of the printFilePosition function in statistics.c ?
What if you replace those contents with:
static void printFilePosition(int fd) {
    if(fd != -1) {
        /* query the position through a duplicate descriptor */
        int fd2 = dup(fd);
        if(fd2 != -1) {
#ifdef HAVE_LSEEK64
            loff_t offset = lseek64(fd2, 0, SEEK_CUR);
            if(offset != -1)
                printLongNum(offset);
#else
            off_t offset = lseek(fd2, 0, SEEK_CUR);
            if(offset != -1)
                /* cast so the argument matches the format specifier */
                fprintf(stderr, "%10ld", (long) offset);
#endif
            close(fd2);
        }
    }
}
(Trying to read the position from a _copy_ of the file descriptor)
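For reference, POSIX specifies that a descriptor returned by dup() shares its file offset with the original, so the copy reports the same position. The following sketch checks that on a scratch file; position_via_dup and dup_shares_offset are names made up for this example and are not part of udpcast.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the current offset of fd through a duplicate descriptor.
 * Returns -1 if dup() or lseek() fails. */
long long position_via_dup(int fd)
{
    int fd2 = dup(fd);
    if (fd2 == -1)
        return -1;
    off_t pos = lseek(fd2, 0, SEEK_CUR);
    close(fd2);
    return (long long)pos;
}

/* Returns 0 if the duplicate and the original agree on the offset,
 * 1 if they disagree, -1 on a setup error. */
int dup_shares_offset(void)
{
    char path[] = "/tmp/dup-offset-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    if (write(fd, "hello", 5) != 5) {   /* moves the offset to 5 */
        close(fd);
        return -1;
    }
    long long via_dup = position_via_dup(fd);
    long long direct = (long long)lseek(fd, 0, SEEK_CUR);
    close(fd);
    return (via_dup == 5 && direct == 5) ? 0 : 1;
}
```

Because the offset is shared, this experiment would mainly tell us whether the suspected write-back is tied to the descriptor being queried. If the problem sits in the shared open file description, the copy may behave the same way as the original.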
Regards,
Alain