Sunday, August 12, 2007

MTU too large, communication stalls (under research)

I have been researching this problem for some time now, found a workaround, but no problem reason and no good permanent solution:

Environment and problem: LAN1 is connected to LAN2 via a LINUX PC with two network interfaces, running NAT in both directions. On LAN2, a VPN device has a site-to-site VPN tunnel over the internet to a remote LAN3. When computers on LAN1 tries to talk to computers on LAN3, it works well on some computers, while on others, all comms stalls after a few packets. There are other LANs connected to LAN2 via the same VPN device where the problem never occurs.

Clue: When connecting to LAN2 over a VPN client from a remote location, the communication to LAN3 always works.

Technical details: Using the Ethereal Wireshark network protocol analyzer, a bunch of TCP DUP ACK messages occurs around the time of the freeze.

Workaround: On computers where this is a problem, if the MTU value of the network interface used is modified to 1300 (or 514 hex), the problem goes away. This is done here:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\ID for Adapter
For a reference on this value, see Microsoft Knowledge Base art. no.
314053.

Potential negative effects of workaround: Smaller MTUs cause more network overhead because smaller and thereby more network packets must be transmitted, all with their own network packet header. In practice though, I could see no negative effect in my case.

Problem cause: Not found. The MTU that should have been used for the overall transfer should have been determined by the lowest MTU given by any router along the path the packets take. My assumption is therefore that some router "out there" is set up with a MTU larger than the network packet size it can actually handle (is this possible?). It works for the first few packets, then the router gets overloaded and freezes.

Permanent solution NOT found: A permanent solution where the MTU value of each client computer is untouched has not been found. Given that my assumption is right, a permanent solution can only be found if the router working under false MTU conditions can be identified and accordingly modified.

Further research: I want to try setting the MTU of the interfaces on the LINUX computer that routes between LAN1 and LAN2 to 1300. If my theory holds, this should lower the overall packet size so that the problem goes away. Since traffic load between the two LANs are never significant, it is acceptible in my case to live with that, even though it does not address the problem at the router that I assume is causing the real problem.

UPDATES ON MY RESEARCH WILL (may) FOLLOW

No comments: