Performance, allocators, pooling and 6.4

2011-02-14

I’ve been doing a lot of performance related work over the past few months. Much of it has originally stemmed from work that I have been doing for our Online Game Company client. They want the asynchronous ENet-based server that we’ve built for them to be the fastest it can be and so I’ve been looking at all aspects of performance within the framework. The testing hasn’t brought anything too embarrassing to light; the I/O recursion limiter that was introduced in 6.3 affected performance more than I would have liked, but that’s now fixed and they weren’t using it anyway.

Their server makes heavy use of our pooling buffer allocators and so contention for their locks is high and these, and the datagram socket allocators, are key areas that need looking at. I’ve made a few adjustments that work well in simpler datagram servers; we now reuse buffers and sockets immediately if we can rather than putting them back into their allocators and then extracting them again a short timer later. Unfortunately their server is complicated enough that both the buffer and the socket aren’t available for immediate reuse. I’ve been building and testing and profiling several new buffer allocators over the past few months. Many show good results in isolated performance tests and then go on to perform less spectacularly in real world profiling. I’m now down to two allocators that work better than the standard allocator, the first is the lock free allocator that’s been in the framework for years and the second is a new allocator that uses thread-local storage to maintain thread specific pools. Both of these allocators out perform the standard allocator for non-trivial server designs but both are unable to maintain a list of allocated buffers. To be able to compare like with like the ‘in-use’ list for the standard allocator is now optional.

The original buffer allocator design maintains two lists of buffers, one that is a list of buffers in the pool, a free-list from which buffers are removed, if available, to avoid the need to allocate new buffers from the memory allocator. The buffer allocator can then be configured to limit the size of the free-list so that it doesn’t use too much memory but can deal with peaks and troughs in demand. The second list is an ‘in-use’ list which is used for diagnostic purposes. When a server is cleanly shut down the ‘in-use’ list should be empty by the time the allocator is destroyed, if it isn’t then you have some buffers that are being leaked. This has proven to be useful in the past.

The lock free buffer allocator has never supported the concept of an ‘in-use’ list as it uses the Microsoft SLIST lock free stack. This allows you to push and pop items from the head of the list but doesn’t allow the random access to list elements that is required for maintaining an ‘in-use’ list.

The new thread-local storage allocator maintains per thread free-lists which allow the allocator to operate with no locking and no waiting if buffers are present in the free-list when an allocation is required or if space is available when a buffer is released to the free-list. Locking occurs only when the per thread free-list is full or empty and at this point the per thread allocator accesses other threads’ free-lists. To ensure that locking is only required when a thread needs to access an other thread’s free-list the actual free-lists are split in two. Each thread can access their private free-list without any need for locking and can swap it for their public free-list using InterlockedExchange() if the new free-list is still not suitable the allocator can lock the list of per thread allocators and walk the list of all of the allocator’s public free-lists and attempt to find a suitable one by swapping its current free-list for the one that it wants to look at, again using InterlockedExchange(). This all works surprisingly well, it avoids the problem of having a group of threads which only ever have buffers released to their allocators and the opposite problem of having some threads which only ever allocate; both of these scenarios defeat the per-thread free-lists and degenerate to straight forward new and delete calls without the free-list swapping code.

The thread-local storage allocator performs as well as the lock free allocator in most server designs and has the edge in situations where contention for the allocator is very very high; the lock free allocator burns CPU in these situations whereas the TLS allocator causes threads to block on the lock and so allows other threads to do stuff… This is a very hard situation to force the allocators into though and so the TLS and lock free allocators are, at present, almost equally useful.

The TLS allocator is also unable to maintain an ‘in-use’ list as this list would need to be a single list that is accessed from all of the per-thread allocators and this would introduce potential contention to each allocation and release and somewhat defeat the object of the isolated per-thread allocators.

I’m hoping to simplify how you select an appropriate allocator in 6.4. Ideally there will be one class that you use and this can be configured to be either a ‘standard’ allocator with an ‘in-use’ list for debugging or one of the more scalable allocators for production use. The ’low contention allocator’ that I blogged about last year is likely to be removed as it’s too hard to configure and too easy to force it into acting like a giant spin lock around the allocators.

Of course the best way to improve performance is to simply not do the work. I’m currently working on a new version of the asynchronous ENet implementation that avoids much of the buffer allocator action that the current version requires. The new version performs sub-allocations for reliable tracking data and other datagram book-keeping within the buffers that are used for the datagrams themselves rather than in separate buffers. This works very well and that, and the simplification of many internal data structures should make the new ENet code considerably more performant.

Now that I understand the issues around the buffer allocator design with regards to the ‘in-use’ list I hope to make some progress on improving the socket allocators. These MUST have an ‘in-use’ list as a server or connection manager MUST be able to abort all of the connections that are still been established when it is destroyed. The allocator is not necessarily the best place for this list, however, and there have been some issue with the socket allocator’s ‘in-use’ list in the past so I’m quite keen to adjust how this works… Due to how socket allocators are used any performance improvements are most likely to be seen more in datagram servers than stream servers as the socket’s in stream servers are relatively long lived and so the socket allocators are less of a hot spot.

The changes mentioned here are currently scheduled for release in version 6.4 of The Server Framework which does not currently have a release date.