How to support 10,000 or more concurrent TCP connections

Using a modern Windows operating system it’s pretty easy to build a server system that can support many thousands of connections if you design the system to use the correct Windows APIs. The key to server scalability is to always keep in mind the Four Horsemen of Poor Performance as described by Jeff Darcy in his document on High Performance Server Architecture. These are:

  • Data copies
  • Context switches
  • Memory allocation
  • Lock contention

I’ll look at context switches first, as IMHO this is where outdated designs often rear their heads first. On Windows systems you MUST be using I/O Completion Ports and overlapped I/O if you want the best scalability. Using the IOCP API correctly means that you can service thousands of concurrent connections with a handful of threads.

So, we’ll assume we’re using a modern overlapped I/O, I/O Completion Port based design; something similar to what The Server Framework provides, perhaps. Using an IOCP allows you to limit the number of threads that are eligible to run at the same time, and the IOCP uses a last in, first out mechanism to ensure that the thread that was most recently active is the next one to be given new work, which avoids having to page a cold thread’s memory back in.

The latest versions of Windows allow you to reduce context switching even further by enabling various options on each socket. I’ve spoken in detail about how FILE_SKIP_COMPLETION_PORT_ON_SUCCESS can help to reduce context switching on my blog, here. The gist is that with this option enabled you can process overlapped operation completions on the same thread that issued the operation if the operation can complete at the time it’s issued. This removes unnecessary context switching. The Server Framework supports this mode of operation on operating systems that support it.

Next, on to data copies. As Jeff says in his document, one of the best ways to avoid data copies is to use reference counted buffers and to manage them in terms of the amount of data present in them, building buffer chains where necessary. The Server Framework has always worked in terms of flexible, reference counted buffers. Many server designs can accumulate inbound data into a single buffer, with no buffer copying required, simply by reissuing a read on a connection with the buffer that was passed to you when the previous read completed. In this way it’s easy to accumulate ‘complete messages’ and process them without needing to copy data.

Since the buffers are reference counted you can easily pass them off to other threads for processing, or keep them hanging around until you’re done with them. The CMultiBufferHandle class allows you to use scatter/gather I/O so that you can transmit common blocks of data without needing to copy the common data and the CBufferHandle class allows you to broadcast data buffers to multiple connections without needing to copy the data for each connection.

Memory allocation during connection processing is minimal and custom buffer allocators can reduce this even further. The Server Framework ships with several different allocators and it’s easy to implement your own if you need to and simply plug it in to the framework code. By default the buffer and socket allocators pool memory for reuse which helps reduce contention and improve performance.

Once you’re processing your messages, several tricks can be employed to optimise your memory allocation. A favourite of mine is to use custom memory allocators that use scratch space in the buffer that the message was initially read into. This scratch space can then provide all of the dynamic memory needed during message processing, avoiding traditional memory management and its potential lock contention.

Lock contention within the framework itself is limited. The design of the socket object is such that you can design a server where there’s never any need to access a common collection of connection objects simply to obtain per connection data from it. Each connection can have as much user data as you like associated with it and this can all be accessed from the connection without the need for any locks. The locks in the buffer allocators are probably the most likely locks to result in contention but here you can select from several different strategies, including a lock free allocator.

Of course, once you’re out of the framework code and into your own code you still have to be careful not to fall foul of the Four Horsemen of Poor Performance, but you can rest assured that The Server Framework is designed very much with these issues in mind. I believe that the only way to make sure that you maintain scalability and performance is to test for it at all stages of development; it’s important to set up scalability tests from day one and to run them regularly using real world usage scenarios. I’ll talk about how to go about setting up these kinds of tests in a later blog post.