Winsock Registered I/O, I/O Completion Port Performance

Continuing my analysis of the performance of the new Winsock Registered I/O API, RIO, compared to more traditional networking APIs, we finally get to the point where we compare I/O completion port designs.

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview, though lately most of my testing has been using Windows Server 2012 RC. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. It took a pair of 10 Gigabit network cards to fully appreciate the performance advantages of using RIO with the simpler server designs and now I can present my findings for the more complex server designs.

Our test system

Our test system consists of a dual Xeon E5620 @ 2.40GHz (16 cores in 2 NUMA nodes with 16GB of memory, running Windows Server 2012 RC) and an Intel Core i7-3930K @ 3.20GHz (12 cores with 32GB of memory, running Windows 7). These are connected by a pair of Intel 10 Gigabit AT2 cards wired back to back. The code under test runs on the dual Xeon and the datagram generator runs on the Core i7.

How to test RIO's performance

The test results presented here are from running the same kind of tests that we ran back in March against the more complex server designs.

An update to the RIO IOCP server design

The original RIO IOCP server design that I presented back in March had a bug in it, which is now fixed. The bug was due to a lack of detail in the RIO API documentation and an assumption on my part: it seems that calling RIOReceive() on a given request queue is not thread safe. In our IOCP server design we call RIONotify() as soon as we have dequeued the available completions and then call RIOReceive() each time we have finished with a datagram and can issue another read into the buffer that is now available, so it's likely that multiple threads will call RIOReceive() on the same request queue at the same time. I've witnessed some failures due to the number of reads permitted being exceeded, and also performance degradations. Both of these issues are fixed by locking around the calls to RIOReceive(); in a design which used multiple request queues you would have one lock per queue. The locking causes some inter-thread contention but the API does not appear to be usable without it. It would be useful if the documentation were explicit about this.
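As an illustration, here's a minimal sketch of the serialisation involved, assuming a single socket; the function table, request queue and buffer management are assumed to exist elsewhere and all of the names are illustrative rather than taken from my actual server code.

```cpp
#include <winsock2.h>
#include <mswsock.h>
#include <mutex>

// Illustrative globals; in a real server these would live in per-socket state.

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;   // obtained via WSAIoctl() at start up
extern RIO_RQ g_requestQueue;                // the request queue for this socket

std::mutex g_rqLock;                         // one lock per request queue

// Issue a read, serialising access to the request queue, since RIOReceive()
// does not appear to be safe to call concurrently on the same queue.

bool PostReceive(RIO_BUF *pBuffer, void *pContext)
{
   std::lock_guard<std::mutex> lock(g_rqLock);

   return ::g_rio.RIOReceive(g_requestQueue, pBuffer, 1, 0, pContext) == TRUE;
}
```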

The results

Running the traditional, multi-threaded, IOCP UDP server with 8 I/O threads and no "per datagram workload" we achieved a rate of 384,000 datagrams per second whilst using 33% of the 10 Gigabit link. Pushing the link harder would be possible but, as I discovered here, would simply result in packet loss and increased memory usage on the machine running the datagram generator. The test took almost a minute and we processed around 99% of the transmitted datagrams. The counters that we added to the server to help us visualise the performance show that all 8 of the available I/O threads were processing datagrams and they each processed a roughly similar number of datagrams.

The multi-threaded RIO server, again with 8 I/O threads and no workload, achieved 492,000 datagrams per second, using 43% of the link and processing 100% of the transmitted datagrams. Only 4 of the 8 threads were used and the work was not split evenly between them: one of the 4 did practically nothing, two did about the same amount and one handled a small number of datagrams. The threads dequeued, on average, 5 datagrams at a time, with a minimum of 1 datagram at a time; the maximum was 1000 on one thread and around 500 on the other three.

So the simplest RIO-based IOCP server is almost a third faster than the traditional server design and uses fewer threads. Digging into the perfmon logs, which you can download here, you'll see that the RIO server's threads perform far fewer context switches per second than the traditional server's and that the system's interrupts per second are also far lower. Both of these indicate that the RIO server will scale far better than the traditional server.

A hybrid RIO design?

Whilst the RIO server is considerably faster than the traditional server and uses fewer threads to do the work there's still scope for improvement. The fact that we're dequeuing, on average, only 5 datagrams at a time implies that we could possibly do more work with fewer threads; what's probably happening is that the work is ping-ponging between threads when more could be done on fewer of them. The RIO API makes it easy for us to prevent this: we can use the IOCP notification as a trigger to put the I/O thread into "polling" mode unless we retrieve more than a tunable number of datagrams in one call to RIODequeueCompletion(). Since the completion queue is effectively locked until we call RIONotify() again, we can defer calling it if we retrieve some datagrams but not enough for us to feel that another thread need get involved in servicing the IOCP. We can then spin until we retrieve 0 datagrams, at which point we call RIONotify() and wait on the IOCP for more work. This hybrid design scales well: as soon as an I/O thread retrieves more than the tunable "receive threshold" it calls RIONotify() and another I/O thread can become active whilst we deal with the datagrams that we have retrieved, and as the workload drops so does the number of threads in use. Reducing the number of active I/O threads is a good thing if each datagram requires that some form of shared state be examined or modified, and it also helps with the contention on the per-request queue RIOReceive() lock.
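Something like the following sketch captures the shape of the hybrid I/O thread's loop; the completion queue, IOCP handle, threshold value and the datagram processing helper are all assumptions on my part and error handling is omitted.

```cpp
#include <winsock2.h>
#include <mswsock.h>

static const ULONG RIO_BATCH_SIZE = 1000;    // max results dequeued per call
static const ULONG RECEIVE_THRESHOLD = 100;  // the tunable "receive threshold"

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;
extern RIO_CQ g_completionQueue;
extern HANDLE g_iocp;

void ProcessResults(RIORESULT *pResults, ULONG count);   // hypothetical helper

void HybridIoThread()
{
   RIORESULT results[RIO_BATCH_SIZE];

   DWORD numberOfBytes;
   ULONG_PTR completionKey;
   OVERLAPPED *pOverlapped;

   // Block until a RIONotify() notification tells us there's work to do.

   while (::GetQueuedCompletionStatus(g_iocp, &numberOfBytes, &completionKey, &pOverlapped, INFINITE))
   {
      // Poll the completion queue; it's effectively ours alone until we
      // call RIONotify() again.

      for (;;)
      {
         const ULONG numResults = ::g_rio.RIODequeueCompletion(g_completionQueue, results, RIO_BATCH_SIZE);

         if (numResults == 0)
         {
            // Nothing left; re-arm notification and go back to the IOCP.

            ::g_rio.RIONotify(g_completionQueue);
            break;
         }

         if (numResults > RECEIVE_THRESHOLD)
         {
            // Enough work that another thread should get involved; re-arm
            // notification before we deal with the batch we already hold.

            ::g_rio.RIONotify(g_completionQueue);
            ProcessResults(results, numResults);
            break;
         }

         // Below the threshold; deal with the datagrams and keep polling.

         ProcessResults(results, numResults);
      }
   }
}
```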

I'll present the hybrid server design in a later article but for now, let's look at the results. The hybrid server achieved 501,000 datagrams per second and processed 98% of the transmitted datagrams, using all 8 of its I/O threads but doing 99% of the work on two of them. The average number of datagrams dequeued per call was 1000 and all threads dequeued the maximum number at least once.

A more complex hybrid design?

I've also looked at the performance of a design which uses a dedicated reader thread to issue the RIOReceive() calls, with buffers passed to it via a queue from which it can remove chunks of buffers without needing to lock. This minimises the I/O threads' lock contention but didn't actually affect performance that much in my tests.
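As a sketch of the general idea, with illustrative names and assuming a swap-the-whole-queue approach (the real design may differ): the I/O threads return buffers to a shared vector under a lock and the reader thread swaps the entire vector out in one locked operation, issuing the receives with no lock held. Since only the reader thread calls RIOReceive(), the request queue itself then needs no lock at all.

```cpp
#include <winsock2.h>
#include <mswsock.h>
#include <mutex>
#include <vector>

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;
extern RIO_RQ g_requestQueue;

std::mutex g_queueLock;
std::vector<RIO_BUF *> g_pendingBuffers;     // buffers ready to be reposted

// Called from the I/O threads once a datagram has been dealt with.

void ReturnBuffer(RIO_BUF *pBuffer)
{
   std::lock_guard<std::mutex> lock(g_queueLock);

   g_pendingBuffers.push_back(pBuffer);
}

// The dedicated reader thread; it takes the whole chunk of pending buffers
// in one locked swap and then issues the receives with no lock held.

void ReaderThread()
{
   std::vector<RIO_BUF *> localBuffers;

   for (;;)
   {
      {
         std::lock_guard<std::mutex> lock(g_queueLock);

         localBuffers.swap(g_pendingBuffers);
      }

      for (RIO_BUF *pBuffer : localBuffers)
      {
         ::g_rio.RIOReceive(g_requestQueue, pBuffer, 1, 0, pBuffer);
      }

      localBuffers.clear();

      // A real design would block on an event or semaphore here when
      // there's nothing to do rather than spinning.
   }
}
```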

A traditional design using GetQueuedCompletionStatusEx()

It seems a little unfair to use a "clever" design for the RIO server yet stick with GetQueuedCompletionStatus(), which can only ever return a single completion per call, for the traditional server design. Switching to a design which uses GetQueuedCompletionStatusEx() allows us to retrieve multiple completions per call and more closely resembles the RIO API's RIODequeueCompletion() call. This doesn't allow us to spin a single thread on the completion queue in the same way that our hybrid design can for RIO, but it does move the traditional server fractionally closer to the RIO server design.
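For reference, the shape of such a loop would be something like this sketch; the IOCP handle, batch size and completion handler are assumptions on my part.

```cpp
#include <winsock2.h>

static const ULONG MAX_ENTRIES = 64;   // illustrative batch size

extern HANDLE g_iocp;

void ProcessCompletion(const OVERLAPPED_ENTRY &entry);   // hypothetical helper

void IoThread()
{
   OVERLAPPED_ENTRY entries[MAX_ENTRIES];

   ULONG numEntries = 0;

   // Retrieve up to MAX_ENTRIES completions per call rather than one at a
   // time as GetQueuedCompletionStatus() would.

   while (::GetQueuedCompletionStatusEx(g_iocp, entries, MAX_ENTRIES, &numEntries, INFINITE, FALSE))
   {
      for (ULONG i = 0; i < numEntries; ++i)
      {
         // entries[i].lpCompletionKey and lpOverlapped identify the
         // operation; process the datagram and post the next read.

         ProcessCompletion(entries[i]);
      }
   }
}
```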

Unfortunately this doesn't help us much: we get slightly fewer datagrams per second, at 360,000, and use slightly less of the link, at 31%. The threads spread the work evenly and dequeue, on average, around 65 datagrams per call.

Whilst running these tests with no work per datagram is not especially realistic it does isolate the cost of calling the different APIs. Also, all of the server designs give similar, slightly reduced, figures when run with a modest per-datagram workload of "500". See the earlier tests for how the single threaded server designs' performance degraded with load.

Some Conclusions

Bear in mind that these results are specific to the test machine I was running on and that we're testing on a release candidate of the Windows Server 2012 operating system. Even so, the figures are impressive. The lack of kernel mode transitions allows much more CPU to be used for real work on each datagram that arrives. As with the simpler server designs, registering the I/O buffers once at program start up reduces the work done per operation and also means that your server will use a known amount of non-paged pool rather than a completely variable amount. Though non-paged pool is more plentiful than it was pre-Vista, this is likely still an advantage.
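To make the point about up-front registration concrete, here's a minimal sketch, with illustrative sizes and names, of registering a single large slab of buffer space once at start up and then slicing it into RIO_BUFs:

```cpp
#include <winsock2.h>
#include <mswsock.h>

extern RIO_EXTENSION_FUNCTION_TABLE g_rio;

static const DWORD BUFFER_SIZE = 1024;       // per-datagram buffer size
static const DWORD BUFFER_COUNT = 100000;    // sized for the expected load

RIO_BUFFERID RegisterSlabAtStartUp()
{
   // One allocation and one registration; the cost, and the non-paged pool
   // usage, is paid once here rather than on every I/O operation.

   char *pSlab = static_cast<char *>(
      ::VirtualAlloc(nullptr, BUFFER_SIZE * BUFFER_COUNT,
                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));

   return ::g_rio.RIORegisterBuffer(pSlab, BUFFER_SIZE * BUFFER_COUNT);
}

// Individual operations then reference slices of the registered slab; the
// nth buffer is simply an offset into it.

RIO_BUF MakeSlice(RIO_BUFFERID id, DWORD n)
{
   RIO_BUF buf;

   buf.BufferId = id;
   buf.Offset = n * BUFFER_SIZE;
   buf.Length = BUFFER_SIZE;

   return buf;
}
```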

The RIO API isn't especially complicated but your server designs will need to be different. The IOCP designs presented here are simple yet easy to scale. The hybrid server design is possibly the best performing and it's probably pretty easy to build a system whereby enabling the hybrid design is a configuration option, allowing for flexible performance tuning dependent on workload.

Translating these performance gains into TCP servers may be more complex as the RIO API imposes some restrictions, or at least some up-front decisions, on your designs, especially in terms of the number of outstanding read and write operations per socket. This probably means that a TCP design will be more complex, especially if it's aiming to be a generic solution...

5 Comments

Hi!

It seems to me that Overlapped and Registered I/O are completely incompatible.
For example: an AcceptEx() call with an accepting socket created with WSA_FLAG_REGISTERED_IO | WSA_FLAG_OVERLAPPED (the listening socket doesn't matter) will ignore the receive buffer completely and the completion will report 0 bytes; only by removing the RIO flag does the receive buffer get used again.
It also appears to me that RIOReceive() does not work on a socket used with AcceptEx(). MSDN explicitly states that the accepted socket can only be used with a limited set of Winsock functions, and the RIO functions are not listed there.

Moreover, when using WSASocket() with both flags, WSA_FLAG_REGISTERED_IO | WSA_FLAG_OVERLAPPED, WSARecv() will not work on the accepted socket either. The call returns 0 but it never completes; only by removing the RIO flag will the call actually complete.

It looks like RIO is not really compatible with Overlapped I/O. Too bad, because I think AcceptEx() is too important to be sacrificed for RIO, and so RIO cannot really be used with TCP at all. And I believe AcceptEx() is not the only useful overlapped function that can no longer be used with RIO.

I haven't done much with TCP-based RIO systems as yet, as my initial thoughts on the design of a general purpose TCP RIO system led me to believe that it was too complex to manage an arbitrary number of connections with an arbitrary number of concurrent operations. Also, I figure that RIO is less useful if the underlying networking protocol stack is adding latency.

In summary, I'm not especially surprised that RIO doesn't play that well with TCP and AcceptEx as I would expect the primary use case to have been UDP.

Len,

Thanks a lot for publishing the results of your experiments on the (at that time) bleeding edge. I have quite a few detail questions so I can understand what to make of the results.

Combining what you write in several posts on your RIO tests, it seems you are able to receive UDP packets at rates of up to about 500kpps on a "Dual Xeon E5620 @ 2.40GHz (16 cores in 2 Numa nodes".

One E5620 has 4 cores, 8 with Hyper-Threading. As you have 2 of them, it's 2x4 physical cores, or 2x8 with Hyper-Threading. When you write that 8 cores are being used (although only 2 are being used heavily), are those 8 cores all on one NUMA node, or distributed across both NUMA nodes?

Did you send the packets to just one IP/Port combo, or did you use multiple ones?

In the latter case (which would allow for more parallelism), were you using RSS (Receive Side Scaling)? (And were you REALLY: http://serverfault.com/a/694153/124382)

What was the size of the UDP packets? I grepped the sources of your simple, blocking packet generator which you publish in one of the posts, and which you say is easily able to saturate a 1Gb link to about 100% (so I assume the packet generator used for your 10Gb tests was different?). And in those sources it seems you are using a 1024 byte UDP payload. In other words, Ethernet frames with a size of 1024+8+38=1070 bytes. In one of the posts you further state that with RIO, you were able to saturate the 10Gb links, which wasn't possible before.

But at 1070 bytes per frame, you'd have to do 1.17Mpps to saturate a 10GbE link, if I'm not mistaken:

10,000,000,000/(1,070*8)=1,168,224

So I'm not sure how this adds up. If you'd fill in the gaps for me, and correct me where I'm wrong, that would be great.

Thanks
Eugene

Correction to my comment (I'm sorry, there's a lot of info in your articles, but it's sometimes hard to figure out what figures are for what test, and what are the final, best case results - a table of the results would be great):

For one of the tests you write

"492,000 datagrams per second, using 43% of the link and processing 100% of the transmitted datagrams"

Which adds up more or less with a 1024 byte payload:

(10,000,000,000*0.43)/((1,024+8+38)*8)=502,336

So I guess you weren't able to saturate a 10Gb link in these tests - something which I thought I'd read in one of the posts.

Answers to the other questions in my previous comment could clarify the circumstances (one IP/Port combo? RSS? etc.)

My "% of link" values came from the task manager's networking tab. I wasn't trying that hard to get maximum performance, just to compare relative performance between the old and new APIs, and so I wasn't that thorough in my testing.

I don't think I ever saturated the 10Gb link.

I was using a single IP/port and so you should get better results spreading across multiple ports or NICs.

I didn't explore RSS that much; I don't think I got it to work and, besides, I was looking for relative performance rather than the absolute best performance.

I think you could take the example code that you can download from these postings and move forward quite a bit if you tuned for your specific scenario.
