TIME_WAIT and its design implications for protocols and scalable client server systems

When building TCP client server systems it's easy to make simple mistakes which can severely limit scalability. One of these mistakes is failing to take into account the TIME_WAIT state. In this blog post I'll explain why TIME_WAIT exists, the problems that it can cause, how you can work around it, and when you shouldn't.
TIME_WAIT is an often misunderstood state in the TCP state transition diagram. It's a state that some sockets can enter and remain in for a relatively long time; if you have enough sockets in TIME_WAIT then your ability to create new socket connections may be affected, and this can limit the scalability of your client server system. There is often some misunderstanding about how and why a socket ends up in TIME_WAIT in the first place; there shouldn't be, it's not magical. As can be seen from the TCP state transition diagram below, TIME_WAIT is the final state that TCP clients usually end up in.

[Figure: TCP state transition diagram - normal transitions]

Although the state transition diagram shows TIME_WAIT as the final state for clients, it doesn't have to be the client that ends up in TIME_WAIT. In fact, it's the final state for the peer that initiates the "active close", and this can be either the client or the server. So, what does it mean to issue the "active close"?

A TCP peer initiates an "active close" if it is the first peer to call Close() on the connection. In many protocols and client/server designs this is the client. In HTTP and FTP servers this is often the server. The actual sequence of events that leads to a peer ending up in TIME_WAIT is as follows.

[Figure: TCP state transition diagram - connection closure transitions]

Now that we know how a socket ends up in TIME_WAIT it's useful to understand why this state exists and why it can be a potential problem.

TIME_WAIT is often also known as the 2MSL wait state. This is because the socket that transitions to TIME_WAIT stays there for a period that is 2 x Maximum Segment Lifetime in duration. The MSL is the maximum amount of time that any segment (for all intents and purposes, a datagram that forms part of the TCP protocol) can remain valid on the network before being discarded. This time limit is ultimately bounded by the TTL field in the IP datagram that is used to transmit the TCP segment. Different implementations select different values for MSL; common values are 30 seconds, 1 minute or 2 minutes. RFC 793 specifies MSL as 2 minutes and Windows systems default to this value; the resulting TIME_WAIT interval can be tuned using the TcpTimedWaitDelay registry setting.

The reason that TIME_WAIT can affect system scalability is that one socket in a TCP connection that is shut down cleanly will stay in the TIME_WAIT state for around 4 minutes. If many connections are being opened and closed quickly then sockets in TIME_WAIT may begin to accumulate on a system; you can view sockets in TIME_WAIT using netstat. There are a finite number of socket connections that can be established at one time, and one of the things that limits this number is the number of available local ports. If too many sockets are in TIME_WAIT you will find it difficult to establish new outbound connections because there are no free local ports left to use for the new connections. But why does TIME_WAIT exist at all?
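If you'd rather monitor TIME_WAIT accumulation programmatically than eyeball netstat output, the sketch below walks the Windows TCP connection table with GetExtendedTcpTable() and counts the entries that are in TIME_WAIT. It's an illustrative sketch only: the function name is mine and error handling is minimal.

// A minimal sketch that counts sockets in TIME_WAIT on Windows using
// GetExtendedTcpTable(); link with iphlpapi.lib and ws2_32.lib.

#include <winsock2.h>
#include <iphlpapi.h>

#include <vector>

int CountTimeWaitSockets()
{
    ULONG size = 0;

    // First call with a null buffer to discover the required buffer size.
    GetExtendedTcpTable(nullptr, &size, FALSE, AF_INET, TCP_TABLE_OWNER_PID_ALL, 0);

    std::vector<unsigned char> buffer(size);

    auto *pTable = reinterpret_cast<MIB_TCPTABLE_OWNER_PID *>(buffer.data());

    if (GetExtendedTcpTable(pTable, &size, FALSE, AF_INET, TCP_TABLE_OWNER_PID_ALL, 0) != NO_ERROR)
    {
        return -1;
    }

    int numInTimeWait = 0;

    for (DWORD i = 0; i < pTable->dwNumEntries; ++i)
    {
        if (pTable->table[i].dwState == MIB_TCP_STATE_TIME_WAIT)
        {
            ++numInTimeWait;
        }
    }

    return numInTimeWait;
}

Calling this periodically and logging the result (or exposing it as a perfmon-style counter) makes it easy to see whether TIME_WAIT accumulation coincides with your connection failures.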

There are two reasons for the TIME_WAIT state. The first is to prevent delayed segments from one connection being misinterpreted as being part of a subsequent connection. Any segments that arrive whilst a connection is in the 2MSL wait state are discarded.

[Figure: why TIME_WAIT exists - a delayed segment from an earlier connection arriving during a subsequent connection between the same end points]

In the diagram above we have two connections from end point 1 to end point 2. The address and port of each end point is the same in each connection. The first connection terminates with the active close initiated by end point 2. If end point 2 wasn't kept in TIME_WAIT for long enough to ensure that all segments from the previous connection had been invalidated then a delayed segment (with appropriate sequence numbers) could be mistaken for part of the second connection...

Note that it is very unlikely that delayed segments will cause problems like this. Firstly, the address and port of each end point need to be the same; this is normally unlikely as the client's port is usually selected for you by the operating system from the ephemeral port range and thus changes between connections. Secondly, the sequence numbers for the delayed segments need to be valid in the new connection, which is also unlikely. However, should both of these things occur then TIME_WAIT will prevent the new connection's data from being corrupted.

The second reason for the TIME_WAIT state is to implement TCP's full-duplex connection termination reliably. If the final ACK from end point 2 is dropped then end point 1 will resend the final FIN. If the connection had transitioned to CLOSED on end point 2 then the only response possible would be to send an RST, as the retransmitted FIN would be unexpected. This would cause end point 1 to receive an error even though all data was transmitted correctly.

Unfortunately the way some operating systems implement TIME_WAIT appears to be slightly naive. Only a connection which exactly matches the socket that's in TIME_WAIT needs to be blocked to give the protection that TIME_WAIT affords; that is, a connection identified by client address, client port, server address and server port. However, some operating systems impose a more stringent restriction and prevent the local port number being reused whilst that port number is included in a connection that is in TIME_WAIT. If enough sockets end up in TIME_WAIT then new outbound connections cannot be established as there are no local ports left to allocate to the new connection.

Windows does not do this; it only prevents the establishment of outbound connections which exactly match connections that are in TIME_WAIT.

Inbound connections are less affected by TIME_WAIT. Whilst a connection that is actively closed by a server goes into TIME_WAIT exactly as a client connection does, the local port that the server is listening on is not prevented from being part of a new inbound connection. On Windows the well known port that the server is listening on can form part of subsequently accepted connections, and if a new connection is established from a remote address and port that currently form part of a connection that is in TIME_WAIT for this local address and port then the connection is allowed as long as the new sequence number is larger than the final sequence number from the connection that is currently in TIME_WAIT. However, TIME_WAIT accumulation on a server may affect performance and resource usage: the connections that are in TIME_WAIT need to be timed out eventually, doing so requires some work, and until the TIME_WAIT state ends each connection is still taking up a (small) amount of resources on the server.

Given that TIME_WAIT affects outbound connection establishment due to the depletion of local port numbers, and that these connections usually use local ports that are assigned automatically by the operating system from the ephemeral port range, the first thing that you can do to improve the situation is make sure that you're using a decent sized ephemeral port range. On Windows you do this by adjusting the MaxUserPort registry setting; see here for details. Note that by default many Windows systems have an ephemeral port range of only around 4000 ports, which is likely too low for many client server systems.

Whilst it's possible to reduce the length of time that sockets spend in TIME_WAIT, this often doesn't actually help. Given that TIME_WAIT is only a problem when many connections are being established and actively closed, adjusting the 2MSL wait period often simply leads to a situation where more connections can be established and closed in a given time, and so you have to keep adjusting the 2MSL downwards until it's so low that you could begin to get problems due to delayed segments appearing to be part of later connections. This would only become likely if you were connecting to the same remote address and port and were using all of the local port range very quickly, or if you were connecting to the same remote address and port and were binding your local port to a fixed value.

Changing the 2MSL delay is usually a machine wide configuration change. You can instead attempt to work around TIME_WAIT at the socket level with the SO_REUSEADDR socket option. This allows a socket to be created whilst an existing socket with the same address and port already exists. The new socket essentially hijacks the old socket. You can use SO_REUSEADDR to allow sockets to be created whilst a socket with the same port is already in TIME_WAIT, but this can also cause problems such as denial of service attacks or data theft. On Windows platforms another socket option, SO_EXCLUSIVEADDRUSE, can help prevent some of the downsides of SO_REUSEADDR, see here, but in my opinion it's better to avoid these attempts at working around TIME_WAIT and instead design your system so that TIME_WAIT isn't a problem.
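For reference, this is roughly what setting these two options looks like in Winsock. It's a minimal sketch with hypothetical helper names and no error handling; both options need to be set before bind() is called.

// A minimal sketch of the socket options discussed above; Winsock, link with
// ws2_32.lib. Both options must be set before bind(). Error handling omitted.

#include <winsock2.h>

void UseExclusiveAddress(SOCKET listeningSocket)
{
    BOOL option = TRUE;

    // SO_EXCLUSIVEADDRUSE stops anyone else from binding to the same address
    // and port, which avoids the socket hijacking problems that SO_REUSEADDR
    // can introduce.
    setsockopt(
        listeningSocket,
        SOL_SOCKET,
        SO_EXCLUSIVEADDRUSE,
        reinterpret_cast<const char *>(&option),
        sizeof(option));
}

void AllowAddressReuse(SOCKET s)
{
    BOOL option = TRUE;

    // SO_REUSEADDR allows bind() to succeed even whilst a socket with the same
    // address and port is in TIME_WAIT; as noted above, this works around
    // TIME_WAIT rather than avoiding it and has security downsides.
    setsockopt(
        s,
        SOL_SOCKET,
        SO_REUSEADDR,
        reinterpret_cast<const char *>(&option),
        sizeof(option));
}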

The TCP state transition diagrams above both show orderly connection termination. There's another way to terminate a TCP connection and that's by aborting the connection and sending an RST rather than a FIN. This is usually achieved by setting the SO_LINGER socket option to 0. This causes pending data to be discarded and the connection to be aborted with an RST, rather than the pending data being transmitted and the connection closed cleanly with a FIN. It's important to realise that when a connection is aborted any data that might be in flight between the peers is discarded and the RST is delivered straight away, usually as an error which represents the fact that the "connection has been reset by the peer". The remote peer knows that the connection was aborted and neither peer enters TIME_WAIT.
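The mechanics of an abortive close are simple enough; the sketch below shows the usual Winsock approach, with the helper name being mine and error handling omitted.

// A minimal sketch of an abortive close using SO_LINGER; Winsock, link with
// ws2_32.lib. With l_onoff set to 1 and l_linger set to 0, closesocket()
// discards any pending data and sends an RST, so neither peer enters TIME_WAIT.

#include <winsock2.h>

void AbortConnection(SOCKET s)
{
    linger lingerOption;

    lingerOption.l_onoff = 1;    // enable the linger option
    lingerOption.l_linger = 0;   // a zero timeout turns the close into an abort

    setsockopt(
        s,
        SOL_SOCKET,
        SO_LINGER,
        reinterpret_cast<const char *>(&lingerOption),
        sizeof(lingerOption));

    closesocket(s);              // sends an RST rather than a FIN
}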

Of course a new incarnation of a connection that has been aborted using RST could become a victim of the delayed segment problem that TIME_WAIT prevents, but the conditions required for this to become a problem are highly unlikely anyway; see above for more details. To prevent a connection that has been aborted from causing the delayed segment problem both peers would have to transition to TIME_WAIT, as the connection closure could potentially be caused by an intermediary, such as a router. However, this doesn't happen and both ends of the connection are simply closed.

There are several things that you can do to avoid TIME_WAIT being a problem for you. Some of these assume that you have the ability to change the protocol that is spoken between your client and server but often, for custom server designs, you do.

For a server that never establishes outbound connections of its own you need not worry unduly, apart from the resource and performance implications of maintaining connections in TIME_WAIT.

For a server that does establish outbound connections as well as accepting inbound connections, the golden rule is to ensure that if a TIME_WAIT needs to occur it ends up on the other peer and not on the server. The best way to do this is to never initiate an active close from the server, no matter what the reason. If your peer times out, abort the connection with an RST rather than closing it. If your peer sends invalid data, abort the connection, etc. The idea is that if your server never initiates an active close it can never accumulate TIME_WAIT sockets and therefore will never suffer from the scalability problems that they cause. Although it's easy to see how you can abort connections when error situations occur, what about normal connection termination? Ideally you should design into your protocol a way for the server to tell the client that it should disconnect, rather than simply having the server instigate an active close. So if the server needs to terminate a connection it sends an application level "we're done" message which the client takes as a reason to close the connection. If the client fails to close the connection in a reasonable time then the server aborts the connection.
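A minimal sketch of that idea is shown below, assuming a hypothetical blocking, line-based protocol; the "SESSION-COMPLETE" message and the 5 second grace period are illustrative choices rather than anything your protocol has to use.

// A minimal sketch of server-side connection termination that keeps TIME_WAIT
// off the server; Winsock, blocking calls, link with ws2_32.lib.

#include <winsock2.h>

void TerminateSessionFromServer(SOCKET s)
{
    // Tell the client that we're done and that it should close the connection;
    // if the client performs the active close then the TIME_WAIT ends up there.
    const char done[] = "SESSION-COMPLETE\r\n";

    send(s, done, sizeof(done) - 1, 0);

    // Give the client a short while to close its end.
    DWORD timeoutMillis = 5000;

    setsockopt(
        s,
        SOL_SOCKET,
        SO_RCVTIMEO,
        reinterpret_cast<const char *>(&timeoutMillis),
        sizeof(timeoutMillis));

    char buffer[128];

    if (recv(s, buffer, sizeof(buffer), 0) == 0)
    {
        // recv() returning 0 means the client closed first; we can now close
        // cleanly and, as the passive closer, we don't enter TIME_WAIT.
        closesocket(s);
    }
    else
    {
        // The client is slow or misbehaving; abort with an RST so that no
        // TIME_WAIT accumulates on the server.
        linger lingerOption = { 1, 0 };

        setsockopt(
            s,
            SOL_SOCKET,
            SO_LINGER,
            reinterpret_cast<const char *>(&lingerOption),
            sizeof(lingerOption));

        closesocket(s);
    }
}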

On the client things are slightly more complicated; after all, someone has to initiate an active close to terminate a TCP connection cleanly, and if it's the client then that's where the TIME_WAIT will end up. However, having the TIME_WAIT end up on the client has several advantages. Firstly, if the client ends up with connectivity issues due to the accumulation of sockets in TIME_WAIT it's just one client; other clients will not be affected. Secondly, it's inefficient to rapidly open and close TCP connections to the same server, so it makes sense beyond the issue of TIME_WAIT to try and maintain connections for longer periods of time rather than shorter periods of time. Don't design a protocol whereby a client connects to the server every minute and does so by opening a new connection. Instead use a persistent connection design and only reconnect when the connection fails. If intermediary routers refuse to keep the connection open without data flow then you could either implement an application level ping, use TCP keep alive or just accept that the router is resetting your connection; the good thing being that you're not accumulating TIME_WAIT sockets. If the work that you do on a connection is naturally short lived then consider some form of "connection pooling" design whereby the connection is kept open and reused. Finally, if you absolutely must open and close connections rapidly from a client to the same server then perhaps you could design an application level shutdown sequence into your protocol and follow it with an abortive close. Your client could send an "I'm done" message, your server could then send a "goodbye" message and the client could then abort the connection.

TIME_WAIT exists for a reason and working around it by shortening the 2MSL period or allowing address reuse using SO_REUSEADDR are not always a good idea. If you're able to design your protocol with TIME_WAIT avoidance in mind then you can often avoid the problem entirely.

If you want more information about TIME_WAIT its implications and ways to work around it then this article is very informative, as is this one.

Note that The Server Framework ships with some examples that clearly show the various options that you have for connection termination. See here for more details.

The initial version of this article was unclear in several places and contained some errors. Thanks to jwoyame for pointing out the potential errors in my reasoning in his comments below and for encouraging me to revisit my research and rework this article to improve its clarity and correctness.


18 Comments

"For a server the golden rule is to always ensure that if a TIME_WAIT needs to occur that it ends up on the client and not the server."

I smiled when I read this. I wish this article existed last year when I went through great pain to solve the TIME_WAIT problem at work.

As always, great post, and keep up the great work. :)

... Alan

All too often I see servers with TIME_WAIT issues where the customer has designed the protocol themselves and has failed to take into account the potential TIME_WAIT issues...

Of course it's fine when you can apply my golden rule and adjust things so that the TIME_WAIT occurs on the client, and often the protocol dictates that you can't, it's just that not enough people are aware of the rule and why it's useful :)

Could you elaborate a bit on how web servers deal with this issue?

In HTTP 0.9 each request would occur on a new TCP connection which would be closed by the server once it had sent the response. This resulted in the server initiating the active close and ending up with a socket in TIME_WAIT.

In HTTP 1.1 the default is for the server to support persistent connections; that is the client can open a connection and send a series of requests on the same TCP connection. Because of this the server does not close the connection after each response. The client is then free to initiate the active close and it will end up with the TIME_WAIT.

Len: This is the best article on the TIME_WAIT server burden that I could find. Thank you!

There are a couple statements that I was wondering if you could clarify on:

1. "Only a connection which exactly matches the socket that's in TIME_WAIT need by blocked to give the protection that TIME_WAIT affords. This means a connection that is identified by client address, client port, server address and server port. However, most operating systems impose a more stringent restriction and prevent the local port number being reused whilst that port number is included in a connection that is in TIME_WAIT. If enough sockets end up in TIME_WAIT then new connections cannot be established as there are no local ports left to allocate to the new connection."

I was under the impression that the reuse restriction for the local port number only applies if you want to create a new socket to bind to your local port, i.e. a new listening socket; so if you wanted to restart your server without waiting for TIME_WAIT to expire, you need to set the SO_REUSEADDR option. I don't see that it would be possible to run out of local ports, as the local port number for all client connections are identical (port 80, port 21, ...)

2. "The best way to do this is to never initiate an active close from the server, no matter what the reason. If your client times out, abort the connection with an RST rather than closing it. If your client sends invalid data, abort the connection, etc."

I like the philosophy of this guideline a lot. But since there is no TIME_WAIT, wouldn't you still have to worry about the connection getting reestablished after the RST and having old segments arrive late to the party: If a client goes berserk and starts sending corrupt packets, the server responds with a RST, and then the user restarts the client application, my understanding says it would theoretically be possible for corrupt packets still on the network to invade the new, clean session...


Thanks again for posting this.

1) Yes, that's true, but that won't help when creating a new outbound connection if you happen to select the same ephemeral port for the client side and the target server is the same...

2) Yes it probably is theoretically possible... I'll have to think more on that one...

Thanks Len.

1) If I understand correctly then, TIME_WAIT only limits the number of available sockets for client-side outbound connections.
For servers, there is no scalability concern about hitting a hard upper bound on the number of connections, but the accumulation of sockets in TIME_WAIT does affect performance due to the memory and search costs of maintaining a list of disconnected sockets sitting in the TIME_WAIT period on the server.

Hopefully this is the case, so that having TIME_WAIT occur on the server is simply a performance concern rather than a connection-rejection concern. But it is still nice to have TIME_WAIT take place on the client side...

I need to go back to my notes and testing to take a look at this as what you're suggesting is obviously the logical conclusion; it's not what I've seen in all cases though.

Thanks for the insightful comments.

No problem. Of course any problems with RST would be EXTREMELY rare. If one did appear, then it's time to buy lottery tickets or something. There have to be two perfect conditions both in place:

1. The client would have to timeout, send invalid data, or perform some other behavior severe enough to trigger the server to drop the connection on its own side.

2. There would have to be client packets bouncing around between the routers or buffered somewhere long enough that a new connection from that same client would take place before they are received. If the RST was sent as a result of a server-triggered connection timeout, the client likely has not sent anything in a very long time anyway.

Secondly, the consequences of the second connection getting old packets are not severe even if this extreme alignment of stars occurred... a 3rd connection would probably be just fine.

Anyway, just some fun stuffs to think about :-P

Jon Woyame

Yeah, I was going to say that the RST issue would be pretty unlikely.

In addition to your 2 conditions the new connection needs to be a new incarnation of the reset connection, so the peer's address AND port need to be the same as the reset connection.

This seems likely only if the peer were to be binding to the same port on its side; which is unusual. And then, of course, the stars need to align for the sequence numbers in the new connection to be valid in the old one.

If the peer wasn't explicitly binding to the same port then either the OS would need to be reusing local ports in a FIFO manner, which seems to be an unlikely design choice, or the peer would need to be cycling through all available ephemeral ports very quickly. I suppose embedded systems or OSes with a small number of ephemeral ports would be more susceptible.

I'm putting together some test code and investigating further at the moment and I expect I'll update the blog posting to clarify the issues that you've raised.

I did not even think about your two other conditions at the time, but yes, choosing the same ephemeral port + hitting the proper sequence number makes the RST issue far more unlikely!

The tweaks you made to the article address the first concern very well. It wouldn't be a stretch to say this is the most thorough coverage of TIME_WAIT available; well done.

Thanks a lot for this article and the comments, very useful.

While troubleshooting a performance issue I came across unexpected server behaviour in regards to TIME_WAIT that I cannot explain, maybe someone here can shed some light...

It seems that in this particular scenario with a server running IIS 7, when a client is done and wants to close the connection, the server initiates the active close and sends the first FIN packet (as expected per standard HTTP). Normally this should result in a TIME_WAIT state on the server, however here it does not; the socket just moves from ESTABLISHED to vanished. I read a post that IIS sends a RST instead of FIN but it looks like this was changed, as when I capture the session I only see FIN. Even though the session looks quite normal and the capture confirms the server is doing the active close, there is no TIME_WAIT state, neither on the server nor on the client.

I see exactly the same on several machines running the same setup. I didn't even manage to produce a single TIME_WAIT on these machines by putting them in the client role.

All of these are running Win2008r2 and IIS7; maybe some TIME_WAIT suppression mechanism was implemented, but how, if the server is initiating the active close?

Have you tried taking a Wireshark (or similar) log of the whole shutdown sequence? Does the client send an RST?

Yes I did capture the whole shutdown (on the client side) and there is no RST, the client just replies with its FIN. In fact I am not using any browser software to perform these tests in case you are thinking some of them close the session with a RST; I am simply doing telnet to port 80 so the whole conversation is not more than a few packets.

In the meantime we found that rebooting any of these servers brings them back to expected operation (i.e. now we see the TIME_WAIT status after the session is closed). Again this is confusing as no changes were made to the registry or the application settings prior to the reboot (at least not to our knowledge :), no updates installed either... and it's not just a Windows box that is confused, we see this on multiple devices that follow strict version control.

At this stage we are a bit lost; our conclusions point to someone at MS deciding they know better than the RFC and have somehow bypassed the whole TIME_WAIT state, and then for some reason they decided it's not that good of an idea and brought it back again, all of this without our admins knowing of any hot fixes or changes on the system. Sure it sounds like a lot of nonsense but this is what we see :)


I do remember seeing some information somewhere, the DisconnectEx() MSDN page, perhaps, which mentioned that Windows would terminate a TIME_WAIT early in situations where all TCBs were in use. Perhaps that's what you're seeing?

If you call GetExtendedTcpTable() when the problem is occurring and dump the table are there any suspicious looking entries?

Your post confuses and conflates sockets with ports throughout. It is the *port* that is in the TIME_WAIT state. The socket is gone. You need to fix this.

Whilst what you say is technically correct I don't actually think that changing the article to refer to ports rather than sockets would improve the readability (or correctness) of the article for most readers.

There's an interesting article here: http://blogs.technet.com/b/networking/archive/2010/08/11/how-tcp-time-wait-assassination-works.aspx which talks about how the Windows TCP stack allows TIME_WAIT assassination on the server side. This is an interesting twist which means that TIME_WAIT is even less of an issue on the server side (for Windows servers at least) than I expected.

In summary; a server side, inbound socket in TIME_WAIT would cause problems for a duplicate inbound connection (where source address AND port match the TIME_WAIT connection) BUT if that machine is also running Windows then the Initial Sequence Number of the new connection is guaranteed to be higher than the last sequence number of the old connection and so the server side stack will allow the connection to occur even though the socket is in TIME_WAIT. If the ISN was NOT higher then the connection would be blocked.
