This is the mail archive of the cygwin-developers@cygwin.com mailing list for the Cygwin project.



Re: TCP connections can occasionally fail because of a winsock bug

On Thursday 15 November 2001 14:21, you wrote:
> I've dug deeply enough into this to determine that I believe the
> problem is caused by a bug in winsock.  I can get the problem to
> manifest itself completely independently from Cygwin.  See the full
> description in the attached program, which one of my coworkers with an
> MSDN subscription is going to forward to Microsoft to see what they
> have to say about it.

For what it's worth, we recently encountered this problem in the ONC RPC 
library. The original Sun code, and every revision I've been able to find, 
binds a local port even for TCP. We see the same behavior: the bind does 
not fail, and the failure occurs on the connect. 

We depend on RPC heavily, and would see delays on startup when the initial 
clnt_create would fail repeatedly. The RPC code tries to use a pool of local 
ports, and will increment and retry if the bind fails -- but the bind never 
fails.
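
As a rough illustration of the retry loop described above, here is a minimal
sketch in Python. It is not the ONC RPC source: the pool bounds 600-1023 and
the try_bind callable (which returns 0 or an errno value) are assumptions made
for the example.

```python
import errno


def bind_from_pool(try_bind, start, low=600, high=1023):
    """Walk the pool [low, high] from `start`, wrapping around.

    try_bind(port) is a hypothetical callable that returns 0 on success
    or an errno value on failure.  Returns (port, 0) on success or
    (None, err) on failure.
    """
    count = high - low + 1
    port = start
    err = errno.EADDRINUSE
    for _ in range(count):
        err = try_bind(port)
        if err != errno.EADDRINUSE:
            break  # success, or an error that retrying will not fix
        # Only "address in use" moves on to the next port in the pool.
        port = low + (port - low + 1) % count
    return (port, 0) if err == 0 else (None, err)
```

If try_bind always reports success, as the bind does in the failing case, the
loop returns the first port it tried and never advances, so the retry logic
gives no protection against a later failure in connect.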

This is not a cygwin issue; we are using the MKS/DataFocus NutCracker 
toolkit. DataFocus provided the ported ONC RPC code but does not support it.  
We have been tinkering with it in-house. In this case, eliminating the bind 
gives some improvement. 

There are other issues we are dealing with. I've forwarded a couple of the 
emails to another programmer at work who is also working on NT/2000 socket 
issues.

Interestingly enough, on Linux, the bind also fails unless the process has 
root privileges. However, the code only iterates on EADDRINUSE and the return 
value is not checked, so the connect succeeds. 

I also wrote a native testcase with the WSA calls and got the same results. 
I did note that the OS expires the port eventually, but it takes 5 to 20 
minutes. 

I believe the root of the problem is that both the remote host address and 
local port are used to determine if the connection is unique. bind would fail 
if anything other than INADDR_ANY is used, so at the time of the bind it isn't 
known if the combination is unique. Only in connect, when the remote host 
address is known, can the combination fail.
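
The hypothesis above can be pictured with a toy model. This is only a sketch
of the reasoning, not how winsock is implemented: the ToyStack class, its
method names, and the inclusion of the remote port in the key are all
assumptions made for illustration.

```python
class ToyStack:
    """Toy model: bind sees only the local port, connect sees the pair."""

    def __init__(self):
        self.bound = set()      # local ports held by open sockets
        self.lingering = set()  # closed connections not yet expired

    def bind(self, local_port):
        # The remote endpoint is unknown here, so only open sockets
        # holding the same local port can make the bind fail.
        if local_port in self.bound:
            return False
        self.bound.add(local_port)
        return True

    def connect(self, local_port, remote_addr, remote_port):
        # Now the full combination is known and can collide.
        return (local_port, remote_addr, remote_port) not in self.lingering

    def close(self, local_port, remote_addr, remote_port):
        self.bound.discard(local_port)
        self.lingering.add((local_port, remote_addr, remote_port))

    def expire(self):
        # Stands in for the OS eventually releasing the old entries.
        self.lingering.clear()


stack = ToyStack()
stack.bind(700)
stack.connect(700, "server", 111)
stack.close(700, "server", 111)
print(stack.bind(700))                    # True: bind cannot see the clash
print(stack.connect(700, "server", 111))  # False: the clash shows up here
```

In this model a connect from the same local port to a different host would
still succeed, which is the sense in which the bind alone cannot tell whether
the combination is unique.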

Our problem was exacerbated by the fact that several apps are typically 
started at the same time on one station, and they are all trying to make RPC 
connections to the server machine. The ONC RPC algo uses the pid to calculate 
which port to try first. With several clients starting and making several 
connections each, there would be groups of used ports. If a connection timed 
out, and the next attempt moved into a cluster of ports being used by another 
app, the clnt_create would fail many times before it finally iterated into 
fresh territory.
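
To make the clustering concrete, here is a small sketch of a pid-based
starting port. The constants follow the classic bindresvport scheme as best
recalled (a pool of 600 through 1023, starting at pid modulo the pool size);
whether the NutCracker port uses the same numbers is an assumption, and both
function names are made up for the example.

```python
START_PORT = 600
END_PORT = 1023
NUM_PORTS = END_PORT - START_PORT + 1  # 424 ports in the assumed pool


def first_port(pid):
    """First port a process with this pid would try."""
    return START_PORT + pid % NUM_PORTS


def attempts_until_free(pid, used):
    """Count sequential tries before reaching a port not in `used`."""
    port = first_port(pid)
    for attempt in range(1, NUM_PORTS + 1):
        if port not in used:
            return attempt
        port = START_PORT + (port - START_PORT + 1) % NUM_PORTS
    return None  # every port in the pool is taken


# Apps started together get adjacent pids, hence adjacent first ports.
print([first_port(pid) for pid in (1000, 1001, 1002, 1003)])
# [752, 753, 754, 755]

# If neighbouring apps already hold ports 752..771, a client starting
# at 752 needs 21 tries to get past the cluster.
print(attempts_until_free(1000, set(range(752, 772))))  # 21
```

Since each failed try in the real case is a connect that has to fail or time
out, a long run of used ports translates directly into startup delay.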

