Winsock Registered I/O, I/O Completion Port Performance

Continuing my analysis of the performance of the new Winsock Registered I/O API, RIO, compared to more traditional networking APIs, we finally get to the point where we compare I/O completion port designs.

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview, though lately most of my testing has been using Windows Server 2012 RC. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. It took a pair of 10 Gigabit network cards to fully appreciate the performance advantages of using RIO with the simpler server designs and now I can present my findings for the more complex server designs.

Our test system

Our test system consists of a dual Xeon E5620 @ 2.40GHz (16 cores in 2 NUMA nodes with 16GB of memory, running Windows Server 2012 RC) and an Intel Core i7-3930K @ 3.20GHz (12 cores with 32GB of memory, running Windows 7). These are connected by a pair of Intel 10 Gigabit AT2 cards wired back to back. The code under test runs on the dual Xeon and the datagram generator runs on the Core i7.

How to test RIO's performance

The test results presented here are from running the same kind of tests that we ran back in March against the more complex server designs.

An update to the RIO IOCP server design

The original RIO IOCP server design that I presented back in March had a bug in it, which is now fixed. The bug was due to a lack of detail in the RIO API documentation and an assumption on my part: it seems that calling RIOReceive() on a given request queue is not thread safe. In our IOCP server design we call RIONotify() as soon as we have dequeued the available completions and then call RIOReceive() each time we have finished with a datagram and can issue another read into the buffer that is now available, so it's likely that multiple threads will call RIOReceive() on the same request queue at the same time. I've witnessed some failures due to the number of reads permitted being exceeded, and also performance degradations. Both of these issues are fixed by locking around the calls to RIOReceive(); in a design which used multiple request queues you would have one lock per queue. The locking causes some inter-thread contention but the API does not appear to be usable without it. It would be useful if the documentation were explicit about this.
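As an illustration, here's a minimal sketch of the serialisation involved, assuming a single socket; the function table, request queue and buffer management are assumed to exist elsewhere and all of the names are illustrative rather than taken from my actual server code.

```cpp
#include <winsock2.h>
#include <mswsock.h>
#include <mutex>

// Illustrative globals; in a real server these would live in per-socket state.

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;   // obtained via WSAIoctl() at start up
extern RIO_RQ g_requestQueue;                // the request queue for this socket

std::mutex g_rqLock;                         // one lock per request queue

// Issue a read, serialising access to the request queue, since RIOReceive()
// does not appear to be safe to call concurrently on the same queue.

bool PostReceive(RIO_BUF *pBuffer, void *pContext)
{
   std::lock_guard<std::mutex> lock(g_rqLock);

   return ::g_rio.RIOReceive(g_requestQueue, pBuffer, 1, 0, pContext) == TRUE;
}
```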

The results

Running the traditional, multi-threaded, IOCP UDP server with 8 I/O threads and no "per datagram workload" we achieved a rate of 384,000 datagrams per second whilst using 33% of the 10 Gigabit link. Pushing the link harder would be possible but, as I discovered here, would simply result in packet loss and increased memory usage on the machine running the datagram generator. The test took almost a minute and we processed around 99% of the transmitted datagrams. The counters that we added to the server to help us visualise the performance show that all 8 of the available I/O threads were processing datagrams and they each processed a roughly similar number of datagrams.

The multi-threaded RIO server, again with 8 I/O threads and no workload, achieved 492,000 datagrams per second, using 43% of the link and processing 100% of the transmitted datagrams. Only 4 of the 8 threads were used and the work was not split evenly between them: one of the 4 did practically nothing, two did about the same amount and one handled a small number of datagrams. The threads dequeued, on average, 5 datagrams at a time, with a minimum of 1 datagram at a time; the maximum was 1000 on one thread and around 500 on the other three.

So the simplest RIO-based IOCP server is almost a third faster than the traditional server design and uses fewer threads. Digging into the perfmon logs, which you can download here, you'll see that the RIO server's threads perform far fewer context switches per second than the traditional server's and that the system's interrupts per second are also far lower. Both of these indicate that the RIO server will scale far better than the traditional server.

A hybrid RIO design?

Whilst the RIO server is considerably faster than the traditional server and uses fewer threads to do the work there's still scope for improvement. The fact that we're dequeuing, on average, only 5 datagrams at a time implies that we could possibly do more work with fewer threads; what's probably happening is that the work is ping-ponging between threads when more could be done on fewer of them. The RIO API makes it easy for us to prevent this: we can use the IOCP notification as a trigger to put the I/O thread into "polling" mode unless we retrieve more than a tunable number of datagrams in one call to RIODequeueCompletion(). Since the completion queue is effectively locked until we call RIONotify() again, we can defer calling it if we retrieve some datagrams but not enough for us to feel that another thread need get involved in servicing the IOCP. We can then spin until we retrieve 0 datagrams, at which point we call RIONotify() and wait on the IOCP for more work. This hybrid design scales well: as soon as an I/O thread retrieves more than the tunable "receive threshold" it calls RIONotify() and another I/O thread can become active whilst we deal with the datagrams that we have retrieved, and as the workload drops so does the number of threads in use. Reducing the number of active I/O threads is a good thing if each datagram requires that some form of shared state be examined or modified, and it also helps with the contention on the per-request queue RIOReceive() lock.
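Something like the following sketch captures the shape of the hybrid I/O thread's loop; the completion queue, IOCP handle, threshold value and the datagram processing helper are all assumptions on my part and error handling is omitted.

```cpp
#include <winsock2.h>
#include <mswsock.h>

static const ULONG RIO_BATCH_SIZE = 1000;    // max results dequeued per call
static const ULONG RECEIVE_THRESHOLD = 100;  // the tunable "receive threshold"

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;
extern RIO_CQ g_completionQueue;
extern HANDLE g_iocp;

void ProcessResults(RIORESULT *pResults, ULONG count);   // hypothetical helper

void HybridIoThread()
{
   RIORESULT results[RIO_BATCH_SIZE];

   DWORD numberOfBytes;
   ULONG_PTR completionKey;
   OVERLAPPED *pOverlapped;

   // Block until a RIONotify() notification tells us there's work to do.

   while (::GetQueuedCompletionStatus(g_iocp, &numberOfBytes, &completionKey, &pOverlapped, INFINITE))
   {
      // Poll the completion queue; it's effectively ours alone until we
      // call RIONotify() again.

      for (;;)
      {
         const ULONG numResults = ::g_rio.RIODequeueCompletion(g_completionQueue, results, RIO_BATCH_SIZE);

         if (numResults == 0)
         {
            // Nothing left; re-arm notification and go back to the IOCP.

            ::g_rio.RIONotify(g_completionQueue);
            break;
         }

         if (numResults > RECEIVE_THRESHOLD)
         {
            // Enough work that another thread should get involved; re-arm
            // notification before we deal with the batch we already hold.

            ::g_rio.RIONotify(g_completionQueue);
            ProcessResults(results, numResults);
            break;
         }

         // Below the threshold; deal with the datagrams and keep polling.

         ProcessResults(results, numResults);
      }
   }
}
```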

I'll present the hybrid server design in a later article but for now, let's look at the results. The hybrid server achieved 501,000 datagrams per second and processed 98% of the transmitted datagrams, using all 8 of its I/O threads but doing 99% of the work on two of them. The average number of datagrams dequeued per call was 1000 and all threads dequeued the maximum number at least once.

A more complex hybrid design?

I've also looked at the performance of a design which uses a dedicated reader thread to issue the RIOReceive() calls, with buffers passed to it via a queue from which it can remove chunks of buffers without needing to lock. This minimises the I/O threads' lock contention but didn't actually affect performance that much in my tests.
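As a sketch of the general idea, with illustrative names and assuming a swap-the-whole-queue approach (the real design may differ): the I/O threads return buffers to a shared vector under a lock and the reader thread swaps the entire vector out in one locked operation, issuing the receives with no lock held. Since only the reader thread calls RIOReceive(), the request queue itself then needs no lock at all.

```cpp
#include <winsock2.h>
#include <mswsock.h>
#include <mutex>
#include <vector>

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;
extern RIO_RQ g_requestQueue;

std::mutex g_queueLock;
std::vector<RIO_BUF *> g_pendingBuffers;     // buffers ready to be reposted

// Called from the I/O threads once a datagram has been dealt with.

void ReturnBuffer(RIO_BUF *pBuffer)
{
   std::lock_guard<std::mutex> lock(g_queueLock);

   g_pendingBuffers.push_back(pBuffer);
}

// The dedicated reader thread; it takes the whole chunk of pending buffers
// in one locked swap and then issues the receives with no lock held.

void ReaderThread()
{
   std::vector<RIO_BUF *> localBuffers;

   for (;;)
   {
      {
         std::lock_guard<std::mutex> lock(g_queueLock);

         localBuffers.swap(g_pendingBuffers);
      }

      for (RIO_BUF *pBuffer : localBuffers)
      {
         ::g_rio.RIOReceive(g_requestQueue, pBuffer, 1, 0, pBuffer);
      }

      localBuffers.clear();

      // A real design would block on an event or semaphore here when
      // there's nothing to do rather than spinning.
   }
}
```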

A traditional design using GetQueuedCompletionStatusEx()

It seems a little unfair to use a "clever" design for the RIO server yet stick with GetQueuedCompletionStatus(), which can only ever return a single completion per call, for the traditional server design. Switching to a design which uses GetQueuedCompletionStatusEx() allows us to retrieve multiple completions per call and more closely resembles the RIO API's RIODequeueCompletion() call. This doesn't allow us to spin a single thread on the completion queue in the same way that our hybrid design can for RIO, but it does move the traditional server fractionally closer to the RIO server design.
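For reference, the shape of such a loop would be something like this sketch; the IOCP handle, batch size and completion handler are assumptions on my part.

```cpp
#include <winsock2.h>

static const ULONG MAX_ENTRIES = 64;   // illustrative batch size

extern HANDLE g_iocp;

void ProcessCompletion(const OVERLAPPED_ENTRY &entry);   // hypothetical helper

void IoThread()
{
   OVERLAPPED_ENTRY entries[MAX_ENTRIES];

   ULONG numEntries = 0;

   // Retrieve up to MAX_ENTRIES completions per call rather than one at a
   // time as GetQueuedCompletionStatus() would.

   while (::GetQueuedCompletionStatusEx(g_iocp, entries, MAX_ENTRIES, &numEntries, INFINITE, FALSE))
   {
      for (ULONG i = 0; i < numEntries; ++i)
      {
         // entries[i].lpCompletionKey and lpOverlapped identify the
         // operation; process the datagram and post the next read.

         ProcessCompletion(entries[i]);
      }
   }
}
```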

Unfortunately this doesn't help us much: we get slightly fewer datagrams per second, at 360,000, and use slightly less of the link, at 31%. The threads spread the work evenly and dequeue, on average, around 65 datagrams per call.

Whilst running these tests with no work per datagram is not especially realistic it does isolate the cost of calling the different APIs. Also, all of the server designs give similar, slightly reduced, figures when run with a modest per-datagram workload of "500". See the earlier tests for how the single threaded server designs' performance degraded with load.

Some Conclusions

Bear in mind that these results are specific to the test machine I was running on and that we're testing on a release candidate of the Windows Server 2012 operating system. Even so, the figures are impressive. The lack of kernel mode transitions allows much more CPU to be used for real work on each datagram that arrives. As with the simpler server designs, registering the I/O buffers once at program start up reduces the work done per operation and also means that your server will use a known amount of non-paged pool rather than a completely variable amount. Though non-paged pool is more plentiful than it was pre-Vista, this is likely still an advantage.
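To make the point about up-front registration concrete, here's a minimal sketch, with illustrative sizes and names, of registering a single large slab of buffer space once at start up and then slicing it into RIO_BUFs:

```cpp
#include <winsock2.h>
#include <mswsock.h>

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;

static const DWORD BUFFER_SIZE = 1024;       // per-datagram buffer size
static const DWORD BUFFER_COUNT = 100000;    // sized for the expected load

RIO_BUFFERID RegisterSlabAtStartUp()
{
   // One allocation and one registration; the cost, and the non-paged pool
   // usage, is paid once here rather than on every I/O operation.

   char *pSlab = static_cast<char *>(
      ::VirtualAlloc(nullptr, BUFFER_SIZE * BUFFER_COUNT,
                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));

   return ::g_rio.RIORegisterBuffer(pSlab, BUFFER_SIZE * BUFFER_COUNT);
}

// Individual operations then reference slices of the registered slab; the
// nth buffer is simply an offset into it.

RIO_BUF MakeSlice(RIO_BUFFERID id, DWORD n)
{
   RIO_BUF buf;

   buf.BufferId = id;
   buf.Offset = n * BUFFER_SIZE;
   buf.Length = BUFFER_SIZE;

   return buf;
}
```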

The RIO API isn't especially complicated but your server designs will need to be different. The IOCP designs presented here are simple yet easy to scale. The hybrid server design is possibly the best performing and it's probably pretty easy to build a system whereby enabling the hybrid design is a configuration option, allowing for flexible performance tuning dependent on workload.

Translating these performance gains into TCP servers may be more complex as the RIO API imposes some restrictions, or at least some up-front decisions, on your designs, especially in terms of the number of outstanding read and write operations per socket. This probably means that a TCP design will be more complex, especially if it's aiming to be a generic solution...

5 Comments

Hi!

It seems to me that Overlapped and Registered I/O are completely incompatible.
For example: an AcceptEx() call with an accepting socket created with WSA_FLAG_REGISTERED_IO | WSA_FLAG_OVERLAPPED (the listening socket doesn't matter) will ignore the receive buffer completely and the completion will report 0 bytes; only by removing the RIO flag does the receive buffer get used again.
It also appears to me that RIOReceive() does not work on a socket used with AcceptEx(). MSDN explicitly states that the accepted socket can only be used with a limited set of Winsock functions, and the RIO functions are not listed there.

Moreover, when using WSASocket() with both flags, WSA_FLAG_REGISTERED_IO | WSA_FLAG_OVERLAPPED, WSARecv() will not work on the accepted socket either. The call returns 0 but it never completes; only by removing the RIO flag will the call actually complete.

It looks like RIO is not really compatible with Overlapped I/O. Too bad, because I think AcceptEx() is too important to be sacrificed for RIO, and so RIO cannot really be used with TCP at all. And I believe AcceptEx() is not the only useful overlapped function that can no longer be used with RIO.

I haven't done much with TCP-based RIO systems as yet, as my initial thoughts on the design of a general purpose TCP RIO system led me to believe that it was too complex to manage an arbitrary number of connections with an arbitrary number of concurrent operations. Also, I figure that RIO is less useful if the underlying networking protocol stack is adding latency.

In summary, I'm not especially surprised that RIO doesn't play that well with TCP and AcceptEx as I would expect the primary use case to have been UDP.

Len,

Thanks a lot for publishing the results of your experiments on the (at that time) bleeding edge. I have quite a few detail questions so I can understand what to make of the results.

Combining what you write in several posts on your RIO tests, it seems you are able to receive UDP packets at rates of up to about 500kpps on a "Dual Xeon E5620 @ 2.40GHz (16 cores in 2 Numa nodes".

One E5620 has 4 cores, 8 with Hyper-Threading. As you have 2 of them, it's 2x4 physical cores, or 2x8 with Hyper-Threading. When you write that 8 cores are being used (although only 2 are being used heavily), are those 8 cores all on one NUMA node, or distributed across both NUMA nodes?

Did you send the packets to just one IP/Port combo, or did you use multiple ones?

In the latter case (which would allow for more parallelism), were you using RSS (Receive Side Scaling)? (And were you REALLY: http://serverfault.com/a/694153/124382)

What was the size of the UDP packets? I grepped the sources of your simple, blocking packet generator which you publish in one of the posts, and which you say is easily able to saturate a 1Gb link to about 100% (so I assume the packet generator used for your 10Gb tests was different?). And in those sources it seems you are using a 1024 byte UDP payload. In other words, Ethernet frames with a size of 1024+8+38=1070 bytes. In one of the posts you further state that with RIO, you were able to saturate the 10Gb links, which wasn't possible before.

But at 1070 bytes per frame, you'd have to do 1.17Mpps to saturate a 10GbE link, if I'm not mistaken:

10,000,000,000/(1,070*8)=1,168,224

So I'm not sure how this adds up. If you'd fill in the gaps for me, and correct me where I'm wrong, that would be great.

Thanks
Eugene

Correction to my comment (I'm sorry, there's a lot of info in your articles, but it's sometimes hard to figure out what figures are for what test, and what are the final, best case results - a table of the results would be great):

For one of the tests you write

"492,000 datagrams per second, using 43% of the link and processing 100% of the transmitted datagrams"

Which adds up more or less with a 1024 byte payload:

(10,000,000,000*0.43)/((1,024+8+38)*8)=502,336

So I guess you weren't able to saturate a 10Gb link in these tests - something which I thought I'd read in one of the posts.

Answers to the other questions in my previous comment could clarify the circumstances (one IP/Port combo? RSS? etc.)

My "% of link" values came from the task manager's networking tab. I wasn't trying that hard to get maximum performance, just to compare relative performance between the old and new APIs, and so I wasn't that thorough in my testing.

I don't think I ever saturated the 10Gb link.

I was using a single IP/port and so you should get better results spreading across multiple ports or NICs.

I didn't explore RSS that much; I don't think I got it to work and, besides, I was looking for relative performance rather than the absolute best performance.

I think you could take the example code that you can download from these postings and move forward quite a bit if you tuned for your specific scenario.
