We use udpcast in an environment somewhat based on systemimager, but
we have run into a very sad situation where a particular model of switch
decides it doesn't want to pass the multicast data packets and none of them
get through. It is an extremely hard problem to duplicate because it seems
to work 99% of the time.
The end result when we hit the problem is the udp-receiver "waits forever".
So I've begun to consider setting some additional timeouts in the system
to try to work around this switch problem.
I investigated receive-timeout and start-timeout. Since I'm never able
to reproduce the problem "on demand", I emulated the problem by using
iptables to block the data stream packets.
I found that receive-timeout isn't in play in a situation where not one
data channel packet has been sent. However, start-timeout is in play.
There seem to be a couple of select()-like situations that use the start-timeout.
Prior to the "Connected as" message in udp-receiver, if you hit start-timeout
there, udp-receiver will exit with an exit code that can be captured by a
script for a re-try. However, although the selectWithConsole() call in
dispatchMessage() returns 0 on select timeout, the caller
(netReceiverMain()) doesn't test the return value of the timed-out transfer.
It turns out selectWithConsole is where I hit the timeout for the "no
multicast data packets transferred" problem.
After digging further, I realized this is likely because there are threads
involved and this makes it more complicated to handle status.
I'm rusty with my C but I came up with a workaround to get us going. I am
not suggesting that this is a good solution, but it does solve it for me
and could be used as an illustration of my problem. Maybe some of you
experts can quickly come up with the correct solution.
Basically, I exit with 100 if zero bytes were transferred. Then I can test
that easily in our scripts and re-try if needed in some sort of loop.
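For anyone who wants the idea without reading the patch, here is a rough sketch of both halves: mapping "zero bytes transferred" to exit code 100, and a retry wrapper that re-runs a command only when it sees that code. The names exit_status_for(), run_with_retry() and EXIT_NO_DATA are hypothetical and exist only in this example. In practice the retry lives in our shell scripts, so the C wrapper below just shows the same logic.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Exit code used by the workaround when nothing was received. */
#define EXIT_NO_DATA 100

/* Hypothetical helper: choose the process exit status from the number of
 * bytes that were actually transferred. */
int exit_status_for(unsigned long long bytes_received)
{
    return bytes_received == 0 ? EXIT_NO_DATA : EXIT_SUCCESS;
}

/* Hypothetical wrapper: run cmd through the shell up to max_tries times,
 * retrying only when it exits with EXIT_NO_DATA. Returns the last exit
 * status, or -1 if the command could not be run or did not exit normally. */
int run_with_retry(const char *cmd, int max_tries)
{
    int status = -1;

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        int raw = system(cmd);
        if (raw == -1 || !WIFEXITED(raw))
            return -1;

        status = WEXITSTATUS(raw);
        if (status != EXIT_NO_DATA)
            break;  /* success or an unrelated failure: do not retry */

        fprintf(stderr, "attempt %d of %d: no data received\n",
                attempt, max_tries);
    }
    return status;
}
```

The command string passed to run_with_retry() would be whatever udp-receiver invocation the site already uses. Any exit status other than 100 stops the loop immediately, so real errors are not retried.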
I realize the correct solution is to exit with an error code if the
selectWithConsole() call in dispatchMessage() times out, but it looked
hard to deal with when combined with my C-programming rust.
Other suggestions welcome. Attached is my "patch" for illustration only,
not as a suggested "fix" for the problem.