Winsock Registered I/O Archives

Continuing my analysis of the performance of the new Winsock Registered I/O API, RIO, compared to more traditional networking APIs, we finally get to the point where we compare I/O completion port designs.

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview, though lately most of my testing has been using Windows Server 2012 RC. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. It took a pair of 10 Gigabit network cards to fully appreciate the performance advantages of using RIO with the simpler server designs and now I can present my findings for the more complex server designs.
This article presents the sixth in my series of example servers for comparing the performance of the Winsock Registered I/O Networking extensions, RIO, and traditional Windows networking APIs. This example server is a traditional multi-threaded, IOCP based, UDP design that we can use to compare to the multi-threaded RIO IOCP UDP example server. I've been looking at the Winsock Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview, though lately most of my testing has been using Windows Server 2012 RC. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Winsock Registered I/O example servers here.
I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance.

I had my first attempt at performance comparisons back in March and whilst the results were in RIO's favour they weren't especially compelling given the new API you needed to learn and the fact that the resulting code would only run on Windows 8/Server 2012 and later. As I moved on from the simplest servers to more complex ones it became increasingly difficult to justify the code change for the performance improvement. I began to doubt my test environment and so upgraded the networking from teamed 1Gb connections to a pair of Intel 10 Gigabit AT2 cards wired back to back. It then became apparent that whilst my test server was fine in this configuration, I didn't have another machine that was powerful enough to do the networking justice... After adding another new machine to the test system I could finally drive the 10 Gigabit AT2 cards at full capacity and having just begun to run the original tests again I can now clearly see the advantage of using RIO.

Do bear in mind that I'm learning as I go here; RIO is a new API and there is precious little in the way of documentation about how and why to use it. Comments and suggestions are more than welcome. Feel free to put me straight if I've made some mistakes, and submit code to help make these examples better.
As I mentioned back in March I'm now doing some RIO performance testing with a 10 Gigabit Ethernet link as I couldn't push my test server hard enough with multiple 1 Gigabit links. This involves 2 Intel 10 Gigabit AT2 cards connected directly to each other. One of the cards is in our test server and the other has been tried in various other pieces of test hardware in an attempt to ramp up the bandwidth usage for the RIO tests. Unfortunately driving one of these network cards to its limits takes quite a lot of CPU and also requires that the card is in a PCI Express slot with x8 capabilities; our other test hardware, being slightly older, has plenty of slots capable of taking the card, but only a relatively low spec build machine had any of those slots wired up as x8. This left me in a situation which was fractionally better than teamed multiple 1 Gigabit links but nowhere near the full potential of the 10 Gigabit link.

I've just finished building a new developer workstation with an Asus P9X79 WS Intel X79 motherboard and a 3.20GHz Core i7 3930K processor. We now have the bus bandwidth and the processor power to push the link and stress the server nicely. At 50% link saturation we're running at around 10% CPU usage on the source machine (1 of the 12 logical cores pegged at 100% by the simple UDP traffic generator that we used in the earlier tests). On the test server we are also using around 10% CPU for my most complicated RIO IOCP design.

Rough results for this 50% loading of the link give 350,000 datagrams per second processed by a traditional IOCP UDP server with 8 threads as against 450,000 datagrams per second from a naive RIO IOCP server design and 920,000 per second from a more complicated design. Pushing the loading to 75%, by running the traffic generator twice, shows no improvement in datagram processing speed from the traditional design but the complicated RIO design moves up to 970,000 per second. These results should be taken with a pinch of salt as I haven't had a chance to check them over and test more thoroughly. I'll present the server designs in the same way that I've presented the previous designs when I publish the full results.

Looks like I'm finally in a position to see what RIO can really do...
When I switched to looking at the performance of the more advanced RIO server designs that use IOCP it quickly became apparent that even multiple 1 Gigabit connections weren't enough of a challenge to give me any meaningful figures; my traditional IOCP datagram servers were easily able to keep up and increasing the workload per datagram required such high workloads that the tests became meaningless. So, we've brought forward the purchase of the hardware that we intended to use for our private cloud scalability testing and we now have 2 Intel 10 Gigabit AT2 cards. Switch prices are still prohibitive for lab use and so these two cards are directly connected, point to point.

The good news is that we now have a 10 Gigabit network link between two of our test servers. The bad news is that I now have to work out how to use it: the traditional datagram generation program that I was previously using to test simply doesn't scale to saturate the new link.

Windows 8 Registered I/O Performance

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance.

Of course these comparisons should be taken as preliminary since we're working with a beta version of the operating system. However, though I wouldn't put much weight in the exact numbers until we have a non-beta OS to test on, it's useful to see how things are coming along and familiarise ourselves with the designs that might be required to take advantage of RIO once it ships. The main things to take away from these discussions on RIO's performance are the example server designs, the testing methods and a general understanding of why RIO performs better than the traditional Windows networking APIs. With these you can run your own tests, build your own servers and get value from using RIO where appropriate.

Please note: the results of using RIO with a 10 Gigabit Ethernet link are considerably more impressive than those shown in this article - please see here for how we ran these tests again with improved hardware.
Now that we have five example servers, four RIO designs and a traditional polled UDP design, we can begin to look at how the RIO API performs compared to the traditional APIs.

Sending a stream of datagrams

Before we can compare performance we need to be able to push the example servers hard. We do this by sending a stream of datagrams at them as fast as we can for a period of time. The servers start timing when they get the first datagram and then count the number of datagrams that they process. The test finishes by sending a series of smaller datagrams at the server. When the server sees one of these smaller datagrams it shuts down and reports the time taken, the number of datagrams processed and the rate at which they were processed.
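
The measurement scheme described above can be sketched without any networking code at all. This is a minimal, portable sketch and not the code from the example servers: the class name `DatagramCounter` and its members are hypothetical, and it assumes that any datagram whose size differs from the expected size is a shutdown signal.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper, not taken from the example servers. It starts timing
// on the first datagram, counts datagrams of the expected size and stops
// when a datagram of any other size arrives.
class DatagramCounter
{
public:
   using Clock = std::chrono::steady_clock;

   explicit DatagramCounter(std::size_t expectedSize)
      : m_expectedSize(expectedSize)
   {
   }

   // Returns true while the test is running and false once a datagram of an
   // unexpected size has ended it.
   bool OnDatagram(std::size_t size, Clock::time_point now)
   {
      if (m_done)
      {
         return false;
      }

      if (!m_started)
      {
         m_start = now;      // timing starts with the first datagram
         m_started = true;
      }

      if (size == m_expectedSize)
      {
         ++m_count;
         return true;
      }

      m_end = now;           // a smaller datagram ends the test
      m_done = true;
      return false;
   }

   std::size_t Count() const { return m_count; }

   double ElapsedSeconds() const
   {
      return m_done ? std::chrono::duration<double>(m_end - m_start).count() : 0.0;
   }

   double DatagramsPerSecond() const
   {
      const double seconds = ElapsedSeconds();
      return seconds > 0.0 ? static_cast<double>(m_count) / seconds : 0.0;
   }

private:
   std::size_t m_expectedSize;
   std::size_t m_count = 0;
   bool m_started = false;
   bool m_done = false;
   Clock::time_point m_start{};
   Clock::time_point m_end{};
};
```

Passing the timestamp in, rather than reading the clock inside the class, keeps the sketch easy to test with made-up times.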

All we need to be able to do to stress the servers is to send datagrams at a rate that gets close to 100% utilisation of a 1Gb Ethernet link. This is fairly simple to achieve using the traditional blocking sockets API.
   for (size_t i = 0; i < DATAGRAMS_TO_SEND; ++i)
   {
      if (SOCKET_ERROR == ::WSASendTo(
         s,
         &buf,
         1,
         &bytesSent,
         flags,
         reinterpret_cast<sockaddr *>(&addr),
         sizeof(addr),
         0,
         0))
      {
         ErrorExit("WSASendTo");
      }
   }

There's not much more to it than that. We use similar code to set up and clean up, but if you've been following along with the other examples then there's nothing that needs to be explained about that.
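
To get a feel for what "close to 100% utilisation of a 1Gb Ethernet link" means in datagrams per second, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and it assumes UDP over IPv4 with no IP options, no VLAN tag, and a payload that fits in a single frame without padding (roughly 18 to 1472 bytes with a 1500 byte MTU). The payload size used by the traffic generator is not stated in this article, so any value plugged in is an assumption.

```cpp
#include <cstdint>

// Fixed per-datagram overhead on the wire, in bytes.
constexpr std::uint64_t UDP_HEADER = 8;
constexpr std::uint64_t IPV4_HEADER = 20;        // assumes no IP options
constexpr std::uint64_t ETHERNET_HEADER = 14;    // assumes no VLAN tag
constexpr std::uint64_t ETHERNET_FCS = 4;
constexpr std::uint64_t PREAMBLE_AND_SFD = 8;
constexpr std::uint64_t INTERFRAME_GAP = 12;

// Bytes of link time consumed by one datagram with the given UDP payload.
constexpr std::uint64_t WireBytesPerDatagram(std::uint64_t payloadBytes)
{
   return payloadBytes + UDP_HEADER + IPV4_HEADER + ETHERNET_HEADER +
      ETHERNET_FCS + PREAMBLE_AND_SFD + INTERFRAME_GAP;
}

// Upper bound on datagrams per second for a link of the given speed.
constexpr std::uint64_t MaxDatagramsPerSecond(
   std::uint64_t linkBitsPerSecond,
   std::uint64_t payloadBytes)
{
   return linkBitsPerSecond / (8 * WireBytesPerDatagram(payloadBytes));
}
```

Under these assumptions `MaxDatagramsPerSecond(1000000000, 1472)` gives 81,274 datagrams per second for full-sized frames, and smaller payloads push the ceiling higher, which is why a single tight sending loop is enough for a 1Gb link.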

The code for this example can be downloaded from here. This code requires Visual Studio 11, but would work with earlier compilers if you have a Windows SDK that supports RIO. Note that Shared.h and Constants.h contain helper functions and tuning constants for ALL of the examples and so there will be code in there that is not used by this example. You should be able to unzip each example into the same directory structure so that they all share the same shared headers. This allows you to tune all of the examples the same so that any performance comparisons make sense. This program can be run on versions of Windows prior to Windows 8, which is useful for testing as you only need one machine set up with the beta of Windows 8 server.
This article presents the fifth in my series of example servers for comparing the performance of the Windows 8 Registered I/O Networking extensions, RIO, and traditional Windows networking APIs. This example server is a traditional polled UDP design that we can use to compare to the RIO polled UDP example server. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

A traditional polled UDP server

This server is probably the simplest UDP server you could have. It's pretty much just a tight loop around a blocking call to WSARecv(). There's none of the complexity required by RIO for registering memory buffers for I/O and so we use a single buffer that we create on the stack.
   do
   {
      workValue += DoWork(g_workIterations);

      if (SOCKET_ERROR == ::WSARecv(
         s,
         &buf,
         1,
         &bytesRecvd,
         &flags,
         0,
         0))
      {
         ErrorExit("WSARecv");
      }

      if (bytesRecvd == EXPECTED_DATA_SIZE)
      {
         g_packets++;
      }
      else
      {
         done = true;
      }
   }
   while (!done);
There is some added complexity to allow us to compare performance, and this is similar to the RIO server examples. We can add an arbitrary processing overhead to each datagram by setting g_workIterations to a non zero value and we count each datagram that arrives and stop the test when a datagram of an unexpected size is received.
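
The real DoWork() lives in Shared.h and isn't shown in this article, so the following is only a hypothetical stand-in that illustrates the idea of a tunable per-datagram cost. It assumes that the "work" is nothing more than a busy loop whose result is returned so that the caller can accumulate it, as the loop above does with workValue.

```cpp
#include <cstddef>

// Hypothetical stand-in for DoWork(); the real helper may do something
// quite different. Burns CPU in proportion to 'iterations'.
inline std::size_t DoWork(std::size_t iterations)
{
   // volatile so that the compiler cannot optimise the loop away
   volatile std::size_t sink = 0;

   for (std::size_t i = 0; i < iterations; ++i)
   {
      sink = sink + i;
   }

   return sink;
}
```

With an iteration count of zero the function returns immediately, which corresponds to measuring the raw speed at which datagrams can be pulled off the wire.
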
This article presents the fourth in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server, like the last example, uses the I/O Completion Port notification method to handle RIO completions, but where the last example used only a single thread to service the IOCP this one uses multiple threads to scale the load. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an I/O Completion Port for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using an IOCP for RIO completions allows you to easily scale your completion handling across multiple threads as we do here and this is the first of my example servers that allows for more than one thread to be used to process completions.
This article presents the third in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server uses the I/O Completion Port notification method to handle RIO completions, but only uses a single thread to service the IOCP. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an I/O Completion Port for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using an IOCP for RIO completions allows you to easily scale your completion handling across multiple threads, though in this first IOCP example server we use a single thread so as to allow us to compare the performance against the polled and event driven servers. The next example server will adapt this server for multiple threads and allow us to scale our completion processing across more CPUs.
This article presents the second in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server uses the event driven notification method to handle RIO completions. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an event for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using the event driven approach is similar to using the polling approach that I described in the previous article except that the server doesn't burn CPU in a tight polling loop.
I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Polling RIO for completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. The first is the simplest though it burns CPU time even when no datagrams are being received.

At its simplest a polled RIO server obtains datagrams to process like this:
   RIORESULT results[RIO_MAX_RESULTS];

   ULONG numResults = 0;

   do
   {
      numResults = g_rio.RIODequeueCompletion(
         g_queue,
         results,
         RIO_MAX_RESULTS);

      if (0 == numResults)
      {
         YieldProcessor();
      }
      else if (RIO_CORRUPT_CQ == numResults)
      {
         ErrorExit("RIODequeueCompletion");
      }
   }
   while (0 == numResults);
You then loop over the results array and process each result in turn before looping back to dequeue more completions.
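
The loop over the results array might look something like the sketch below. To keep it portable it uses a stand-in `Result` struct that mirrors the fields of RIORESULT rather than the real Windows type, and the `Tally` struct and `ProcessResults()` function are hypothetical names. It also assumes the same convention as the traditional server on this page: a datagram of the expected size is counted and any other size ends the test.

```cpp
#include <cstddef>
#include <cstdint>

// Portable stand-in that mirrors the fields of RIORESULT.
struct Result
{
   std::int32_t Status;
   std::uint32_t BytesTransferred;
   std::uint64_t SocketContext;
   std::uint64_t RequestContext;
};

// Hypothetical summary of one batch of completions.
struct Tally
{
   std::size_t datagrams = 0;
   bool shutdown = false;
};

// Processes one batch of dequeued completions in order.
inline Tally ProcessResults(
   const Result *results,
   std::size_t numResults,
   std::uint32_t expectedSize)
{
   Tally tally;

   for (std::size_t i = 0; i < numResults; ++i)
   {
      if (results[i].BytesTransferred != expectedSize)
      {
         tally.shutdown = true;   // unexpected size: stop the test
         break;
      }

      ++tally.datagrams;

      // A real server would presumably reissue a receive for this
      // buffer here so that the socket always has receives pending.
   }

   return tally;
}
```

The point of the batch interface is that one call can hand back many completions, so the per-datagram cost of getting from kernel space to this loop is shared across the whole batch.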

Getting to the point where you can call RIODequeueCompletion() takes a bit of setting up though...

Windows 8 Registered I/O Example UDP Servers

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance.

RIO API demonstration

The examples are simple in that they do the bare minimum to demonstrate the APIs in question but they are configurable so that you can tune them to the hardware on which you're running them. You can run them to compare the maximum speed at which you can pull UDP datagrams off of the wire using each API and then adjust the examples so that they do a specific amount of "work" with each datagram to simulate a slightly more realistic scenario.

Simplified error handling

Error handling is limited: we display an error and exit the program. We don't skip error checking, though; all API calls are checked for errors. The examples are each stand alone but can share two common header files. The first, Constants.h, contains all constants that are used to tune the examples. The second, Shared.h, contains inline helper functions which hide some of the complexity and allow the individual example programs to focus on the area of the API that they're demonstrating.

This is the index page

I will be blogging about the construction of the various examples over the next few weeks and updating this entry as an index page for all of the examples. I've listed the examples that I'll be talking about and I'll link to each blog post as they go live. Once I've presented the RIO examples I'll present the more traditional examples and finally some performance comparisons.

RIO server examples

  • RIO Polled UDP - A server which uses a single thread and a tight loop to poll for RIO completions.
  • RIO Event Driven UDP - A server which uses a single thread and event driven notifications to handle RIO completions.
  • RIO IOCP UDP - A server which uses a single thread and I/O Completion Port notifications to handle RIO completions.
  • RIO IOCP MT UDP - A server which uses a configurable number of threads and I/O Completion Port notifications to handle RIO completions.

Traditional server examples

  • Simple Polled UDP - A server which uses a single thread and a tight loop to poll WSARecv() for datagrams.
  • IOCP UDP - A server which uses a single thread and I/O Completion Port notifications to handle overlapped WSARecv() completions.
  • IOCP MT UDP - A server which uses a configurable number of threads and I/O Completion Port notifications to handle overlapped WSARecv() completions.

A simple UDP datagram traffic generator

  • Simple UDP traffic generator - A client which uses a single thread and a tight loop to send datagrams using WSASendTo(); this easily saturates a 1000BASE-T, 1Gb Ethernet connection.

Test scripts

  • Test scripts - These simple scripts create performance counter logs and run the test servers.

Performance Test results

  • The first tests - Where we compare the simple polled traditional server with the polled RIO server.

Join in

Comments and suggestions are more than welcome. I'm learning as I go here and I'm quite likely to have made some mistakes or come up with some erroneous conclusions, so feel free to put me straight and help make these examples better.

Windows 8 Registered I/O Buffer Strategies

One of the things that allows the Windows 8 Registered I/O Networking Extensions, RIO, to perform better than normal Winsock calls is the fact that the memory used for I/O operations is pre-registered with the API. This allows RIO to do all the necessary checks that the buffer memory is valid, etc. once, and then lock the buffer in memory until you de-register it. Compare this to normal Winsock networking where the memory needs to be checked and locked on each operation and already we have a whole load of work that simply isn't required for each I/O operation. As always, take a look at this video from Microsoft's BUILD conference for more in-depth details.

RIO buffers need to be registered before use

The recommended way to use RIORegisterBuffer() is to register large buffers and then use smaller slices from these buffers in your I/O calls, rather than registering each individual I/O buffer separately. This reduces the book-keeping costs as each registered buffer requires some memory to track its registration. It's also sensible to use page aligned memory for buffers that you register with RIORegisterBuffer() as the locking granularity of the operating system is page level so if you use a buffer that is not aligned on a page boundary you will lock the entire page that it occupies. This is especially important given that there's a limit to the number of I/O pages that can be locked at one time and I would imagine that buffers registered with RIORegisterBuffer() count against this limit.
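
The "one large registration, many small slices" approach comes down to some simple arithmetic. The sketch below is portable and only plans the layout; the names `Slice`, `BufferPlan` and `PlanBuffer()` are hypothetical. In a real server I'd expect the allocation to come from something like VirtualAlloc(), to be registered once with RIORegisterBuffer(), and each slice's offset and length to end up in a RIO_BUF alongside the returned buffer id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One I/O buffer carved out of the large registered buffer.
struct Slice
{
   std::size_t offset;
   std::size_t length;
};

// Hypothetical layout for a single large registration.
struct BufferPlan
{
   std::size_t allocationSize;   // whole pages, so no page is half used
   std::vector<Slice> slices;
};

constexpr std::size_t RoundUp(std::size_t value, std::size_t granularity)
{
   return ((value + granularity - 1) / granularity) * granularity;
}

// Plans one allocation that holds sliceCount buffers of sliceSize bytes,
// rounded up to a whole number of pages.
inline BufferPlan PlanBuffer(
   std::size_t sliceSize,
   std::size_t sliceCount,
   std::size_t pageSize)
{
   assert(sliceSize > 0 && pageSize > 0);

   BufferPlan plan;
   plan.allocationSize = RoundUp(sliceSize * sliceCount, pageSize);
   plan.slices.reserve(sliceCount);

   for (std::size_t i = 0; i < sliceCount; ++i)
   {
      plan.slices.push_back(Slice{ i * sliceSize, sliceSize });
   }

   return plan;
}
```

For example, ten 1024 byte slices with 4096 byte pages need 10240 bytes, which rounds up to three pages (12288 bytes) and a single registration rather than ten.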

Windows 8 Registered I/O and I/O Completion Ports

In my last blog post I introduced the Windows 8 Registered I/O Networking Extensions, RIO. As I explained, there are three ways to retrieve completions from RIO: polled, event driven and via an I/O Completion Port (IOCP). This makes RIO pretty flexible and allows it to be used in many different designs of servers. The polled scenario is likely aimed at very high performance UDP or High Frequency Trading style situations where you may be happy to burn CPU so as to process inbound datagrams as fast as possible. The event driven style may also help here, allowing you to wait efficiently rather than spin, but it's the IOCP style that currently interests me most as this promises to provide increased performance to more general purpose networking code.

Please bear in mind the caveats from my last blog post: this stuff is new, I'm still finding my way, the docs aren't in sync with the headers in the SDK and much of this is based on assumption and intuition.

How do RIO and IOCP work together?

RIO's completions arrive via a completion queue, which is a fixed-size data structure that is shared between user space and kernel space (via locked memory?) and which does not require a kernel mode transition to dequeue from (see this BUILD video for more details on RIO's internals). As I showed last time, you specify how you want to retrieve completions when you create the queue, either providing an event to be signalled, an IOCP to be posted to or nothing if you will simply poll the queue. When using an IOCP you get a notification sent to you when the completion queue is no longer empty after you have indicated that you want to receive completions by calling RIONotify().
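
The notification behaviour described above can be modelled in a few lines of portable code. This is a toy model and not the RIO API: `CompletionQueueModel` and its members are hypothetical names, plain ints stand in for completions, and a callback stands in for the packet posted to the IOCP. It assumes that a notification is one-shot (you must call Notify() again after each one) and that calling Notify() on a queue that already holds completions notifies straight away.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy model of a fixed-size completion queue with armed notifications.
class CompletionQueueModel
{
public:
   CompletionQueueModel(std::size_t capacity, std::function<void()> onNotify)
      : m_ring(capacity), m_onNotify(std::move(onNotify))
   {
      assert(capacity > 0);
   }

   // Models the kernel adding a completion; fails if the queue is full.
   bool Complete(int result)
   {
      if (m_count == m_ring.size())
      {
         return false;
      }

      m_ring[(m_head + m_count) % m_ring.size()] = result;
      ++m_count;
      FireIfArmed();
      return true;
   }

   // Models RIONotify(): ask for one notification when non-empty.
   void Notify()
   {
      m_armed = true;
      FireIfArmed();
   }

   // Models RIODequeueCompletion(): removes up to maxResults entries.
   std::size_t Dequeue(int *results, std::size_t maxResults)
   {
      std::size_t n = 0;

      while (n < maxResults && m_count > 0)
      {
         results[n++] = m_ring[m_head];
         m_head = (m_head + 1) % m_ring.size();
         --m_count;
      }

      return n;
   }

private:
   void FireIfArmed()
   {
      if (m_armed && m_count > 0)
      {
         m_armed = false;   // one-shot: must be re-armed by Notify()
         m_onNotify();
      }
   }

   std::vector<int> m_ring;
   std::function<void()> m_onNotify;
   std::size_t m_head = 0;
   std::size_t m_count = 0;
   bool m_armed = false;
};
```

The useful property of this scheme is that a burst of completions produces a single notification, after which the handler drains the queue in batches and only then asks to be notified again.
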
Most of the buzz being generated around the Windows 8 Developer Previews at the moment seems to be centred on the new Metro user interface and the Windows Runtime. Whilst both Metro and WinRT are key components of the next Windows release I find the Registered I/O Networking Extensions to be far more interesting, but then I guess I would...

What are the Registered I/O Networking Extensions?

The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking for increased networking performance with lower latency and jitter. These extensions are targeted primarily for server applications and use pre-registered data buffers and completion queues to increase performance. I assume that the increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go.

The RIO API is pretty simple and straightforward but servers that currently use I/O Completion Port based designs will need to change somewhat to take advantage of it and probably not all server designs will benefit from changing. RIO relies on you registering the memory that you will use as data buffers and knowing in advance how many pending operations a given socket will have at any time. This allows it to lock the data buffers in memory once, rather than on each operation, and removes the whole concept of the OVERLAPPED structure from the user space API. Since completion queue space is also of a fixed size you're also required to know how many sockets you will be allocating to a given queue and the maximum number of pending operations that these sockets will have. You can increase all of these limits after socket creation but, except for registering new data buffers, I expect that you're likely to take a performance hit for doing so.
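
The sizing requirement is easy to express as arithmetic. The sketch below is a hypothetical helper rather than part of the RIO API, and it assumes that every socket sharing a completion queue is given the same limits for outstanding receives and sends (the kind of per-socket limits that you pass when creating a request queue).

```cpp
#include <cstdint>

// Hypothetical helper: the number of completion queue entries needed if
// every pending operation on every socket could complete before any
// completions are dequeued.
constexpr std::uint64_t RequiredCompletionQueueSize(
   std::uint64_t sockets,
   std::uint64_t maxOutstandingReceivesPerSocket,
   std::uint64_t maxOutstandingSendsPerSocket)
{
   return sockets *
      (maxOutstandingReceivesPerSocket + maxOutstandingSendsPerSocket);
}
```

So 100 sockets that each allow 8 pending receives and 2 pending sends would need a completion queue with room for 1000 entries. A single UDP server socket is the easy case, as the queue only needs to be as large as that one socket's limits.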

I've been looking at the pre-release documentation and the headers from the latest Windows SDK and experimenting with the new RIO API. Note that at present the documentation is out of sync with the headers and there's little more than reference documentation, so much of what I have to say about RIO rests on assumptions and intuition drawn from the available information and my knowledge of how I/O Completion Port based networking currently works on pre-Windows 8 operating systems. In other words, don't rely on all of this to be correct.


About this Archive

This page is an archive of recent entries in the Winsock Registered I/O category.


I usually write about the development of The Server Framework, a super scalable, high performance, C++, I/O Completion Port based framework for writing servers and clients on Windows platforms.
