Windows 8 Registered I/O - Single threaded RIO Event Driven UDP Example Server

This article presents the second in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server uses the event driven notification method to handle RIO completions. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an event for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO: polling, event driven and via an I/O Completion Port. Using the event driven approach is similar to using the polling approach that I described in the previous article, except that the server doesn't burn CPU in a tight polling loop.
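For reference, the notification style is chosen when the completion queue is created. The following is a rough summary sketch rather than code from the example servers; the queueSize, hEvent, hIOCP and pOverlapped names are placeholders:

   // Polling - pass no notification structure at all and call
   // RIODequeueCompletion() in a loop.
   RIO_CQ polledQueue = g_rio.RIOCreateCompletionQueue(queueSize, NULL);

   // Event driven - the style used in this article; shown in full below.
   RIO_NOTIFICATION_COMPLETION eventCompletion;
   eventCompletion.Type = RIO_EVENT_COMPLETION;
   eventCompletion.Event.EventHandle = hEvent;
   eventCompletion.Event.NotifyReset = TRUE;
   RIO_CQ eventQueue = g_rio.RIOCreateCompletionQueue(queueSize, &eventCompletion);

   // I/O Completion Port - covered in the next article in the series.
   RIO_NOTIFICATION_COMPLETION iocpCompletion;
   iocpCompletion.Type = RIO_IOCP_COMPLETION;
   iocpCompletion.Iocp.IocpHandle = hIOCP;
   iocpCompletion.Iocp.CompletionKey = 0;
   iocpCompletion.Iocp.Overlapped = pOverlapped;
   RIO_CQ iocpQueue = g_rio.RIOCreateCompletionQueue(queueSize, &iocpCompletion);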

Creating an event driven RIO completion queue

We start by initialising things in the same way that we did with the earlier example RIO servers.
int _tmain(int argc, _TCHAR* argv[])
{
   SetupTiming("RIO Event Driven UDP");

   InitialiseWinsock();

   CreateRIOSocket();

   HANDLE hEvent = WSACreateEvent();

   if (hEvent == WSA_INVALID_EVENT)
   {
      ErrorExit("WSACreateEvent");
   }

   RIO_NOTIFICATION_COMPLETION completionType;

   completionType.Type = RIO_EVENT_COMPLETION;
   completionType.Event.EventHandle = hEvent;
   completionType.Event.NotifyReset = TRUE;

   g_queue = g_rio.RIOCreateCompletionQueue(
      RIO_PENDING_RECVS,
      &completionType);

   if (g_queue == RIO_INVALID_CQ)
   {
      ErrorExit("RIOCreateCompletionQueue");
   }
Once that is done we create an event and then create a RIO completion queue which uses the event for notification. The event is signalled when there are completions to process and reset when we call RIONotify().

Creating a RIO request queue

Creating the request queue and posting our receives is identical to the polled example. The only difference is how we handle the completions.
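For completeness, this is roughly what that code looks like. Treat it as a sketch rather than a verbatim copy of the download: the g_socket global, the EXTENDED_RIO_BUF structure and the RIO_PENDING_RECVS and RECV_BUFFER_SIZE constants are assumed to come from the shared headers, and the real examples use a helper to allocate the buffer space rather than a bare VirtualAlloc().

   g_requestQueue = g_rio.RIOCreateRequestQueue(
      g_socket,            // the socket created with WSA_FLAG_REGISTERED_IO
      RIO_PENDING_RECVS,   // max outstanding receives
      1,                   // max buffers per receive (RIO requires 1)
      0,                   // max outstanding sends (this example never sends)
      1,                   // max buffers per send
      g_queue,             // receive completion queue
      g_queue,             // send completion queue
      NULL);               // socket context

   if (g_requestQueue == RIO_INVALID_RQ)
   {
      ErrorExit("RIOCreateRequestQueue");
   }

   // Register one slab of memory with RIO and carve it into
   // RIO_PENDING_RECVS slices, posting a receive into each slice.
   char *pBufferSpace = reinterpret_cast<char *>(
      VirtualAlloc(
         NULL,
         RECV_BUFFER_SIZE * RIO_PENDING_RECVS,
         MEM_COMMIT | MEM_RESERVE,
         PAGE_READWRITE));

   RIO_BUFFERID bufferId = g_rio.RIORegisterBuffer(
      pBufferSpace,
      RECV_BUFFER_SIZE * RIO_PENDING_RECVS);

   if (bufferId == RIO_INVALID_BUFFERID)
   {
      ErrorExit("RIORegisterBuffer");
   }

   for (DWORD i = 0; i < RIO_PENDING_RECVS; ++i)
   {
      EXTENDED_RIO_BUF *pBuffer = new EXTENDED_RIO_BUF;

      pBuffer->BufferId = bufferId;
      pBuffer->Offset = i * RECV_BUFFER_SIZE;
      pBuffer->Length = RECV_BUFFER_SIZE;

      if (!g_rio.RIOReceive(g_requestQueue, pBuffer, 1, 0, pBuffer))
      {
         ErrorExit("RIOReceive");
      }
   }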

Calling RIODequeueCompletion() and processing results

The processing loop is, again, similar to the polled example. Unsurprisingly, rather than polling we wait on the event and dequeue the completions once the event is set. This reduces the amount of CPU used as there's no need to spin whilst waiting for new datagrams to process. The only complication is that we need to call RIONotify() to indicate that we're ready to process more completions. Note that in a real server you would probably want to wait on both your 'completions available' event and a 'we're ready to shut down' event so that you can shut the server down cleanly; there's a sketch of that below, after the main loop.
   bool done = false;

   DWORD recvFlags = 0;

   RIORESULT results[RIO_MAX_RESULTS];

   const INT notifyResult = g_rio.RIONotify(g_queue);

   if (notifyResult != ERROR_SUCCESS)
   {
      ErrorExit("RIONotify");
   }

   const DWORD waitResult = WaitForSingleObject(
      hEvent,
      INFINITE);

   if (waitResult != WAIT_OBJECT_0)
   {
      ErrorExit("WaitForSingleObject");
   }

   ULONG numResults = g_rio.RIODequeueCompletion(
      g_queue,
      results,
      RIO_MAX_RESULTS);

   if (0 == numResults ||
       RIO_CORRUPT_CQ == numResults)
   {
      ErrorExit("RIODequeueCompletion");
   }

   StartTiming();

   int workValue = 0;

   bool running = true;

   do
   {
      for (DWORD i = 0; i < numResults; ++i)
      {
         EXTENDED_RIO_BUF *pBuffer = reinterpret_cast<EXTENDED_RIO_BUF *>(results[i].RequestContext);

         if (results[i].BytesTransferred == EXPECTED_DATA_SIZE)
         {
            g_packets++;

            workValue += DoWork(g_workIterations);

            if (!g_rio.RIOReceive(
               g_requestQueue,
               pBuffer,
               1,
               recvFlags,
               pBuffer))
            {
               ErrorExit("RIOReceive");
            }

            done = false;
         }
         else
         {
            done = true;
         }
      }

      if (!done)
      {
         const INT notifyResult = g_rio.RIONotify(g_queue);

         if (notifyResult != ERROR_SUCCESS)
         {
            ErrorExit("RIONotify");
         }

         const DWORD waitResult = WaitForSingleObject(
            hEvent,
            INFINITE);

         if (waitResult != WAIT_OBJECT_0)
         {
            ErrorExit("WaitForSingleObject");
         }

         numResults = g_rio.RIODequeueCompletion(
            g_queue,
            results,
            RIO_MAX_RESULTS);

         if (0 == numResults ||
             RIO_CORRUPT_CQ == numResults)
         {
            ErrorExit("RIODequeueCompletion");
         }
      }
   }
   while (!done);

   StopTiming();

   PrintTimings();

   return workValue;
}
As before, the structure of the processing loop is complicated somewhat by the fact that we want to start and stop the timing for the performance testing, and the DoWork() function can be used to add 'processing overhead' to each datagram. This can be configured using the g_workIterations constant, which is defined in Constants.h. With this set to 0 there is no overhead and we can compare how quickly each API can receive datagrams. Setting larger values will affect how the various multi-threaded examples perform and can be useful if you're unable to saturate the test machine's network interfaces.
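As mentioned above, a real server would want a clean way to break out of this loop. A minimal sketch of that, assuming a separate, manually reset, shutdown event (hShutdownEvent is just a placeholder name):

   HANDLE handles[2] = { hEvent, hShutdownEvent };

   const DWORD waitResult = WaitForMultipleObjects(
      2,
      handles,
      FALSE,         // wake when either handle is signalled
      INFINITE);

   if (waitResult == WAIT_OBJECT_0)
   {
      // completions are available; dequeue and process them as above
   }
   else if (waitResult == WAIT_OBJECT_0 + 1)
   {
      // shutdown requested; leave the loop and clean up
   }
   else
   {
      ErrorExit("WaitForMultipleObjects");
   }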

This example can be optimised slightly so that we revert to straight polling for as long as RIODequeueCompletion() returns at least one result. We'll look at this variation after we've studied the performance of the example shown here.
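A rough outline of that variation (my sketch of the idea, not the code from the download) looks like this:

   for (;;)
   {
      ULONG numResults = g_rio.RIODequeueCompletion(
         g_queue,
         results,
         RIO_MAX_RESULTS);

      if (RIO_CORRUPT_CQ == numResults)
      {
         ErrorExit("RIODequeueCompletion");
      }

      if (numResults != 0)
      {
         // process the results and repost the receives, then go straight
         // back to RIODequeueCompletion() without waiting on the event
         continue;
      }

      // the queue is empty; request a notification and wait on the event
      // (exit condition omitted for brevity)
      if (ERROR_SUCCESS != g_rio.RIONotify(g_queue))
      {
         ErrorExit("RIONotify");
      }

      if (WAIT_OBJECT_0 != WaitForSingleObject(hEvent, INFINITE))
      {
         ErrorExit("WaitForSingleObject");
      }
   }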

The code for this example can be downloaded from here. This code requires Visual Studio 11, but would work with earlier compilers if you have a Windows SDK that supports RIO. Note that Shared.h and Constants.h contain helper functions and tuning constants for ALL of the examples and so there will be code in there that is not used by this example. You should be able to unzip each example into the same directory structure so that they all share the same headers. This allows you to tune all of the examples in the same way so that any performance comparisons make sense.

Join in

Comments and suggestions are more than welcome. I'm learning as I go here and I'm quite likely to have made some mistakes or come up with some erroneous conclusions, so feel free to put me straight and help make these examples better.

9 Comments

Great post! Thanks for sharing the code.
I have some questions:

playing around with your code, if I set 'RIO_PENDING_RECVS' to 1 and I have a program that sends 10,000 UDP datagrams immediately, I will receive only about 50 datagrams instead of 10,000. Why is that? With IOCPs, even if you have only 1 pending OVERLAPPED reading a UDP socket, you'll receive almost *all* (if not all, often) of the 10,000 packets!
And you only have to allocate 1 OVERLAPPED and the buffer for reading!

Instead, with RIO, it seems you must have a lot of pending receives to grab all of those UDP datagrams. Well, having only 1 pending receive and grabbing only 50 UDP packets of the 10,000 sent is a very poor result. Can't I improve that?

I have also tried with TCP connections, and of course I receive *all* of the 10,000 TCP packets sent, even with only 1 pending receive. Of course this is because of the different nature of TCP and UDP, but losing 9,950 UDP packets by having only 1 pending receive seems very bad.

I changed 'RIO_PENDING_RECVS' to 256 and received almost 300 datagrams; with 4096 I received almost 4,000 of the 10,000; with 4096*2 almost 8,000. Finally, I can receive *all* 10,000 datagrams by setting 'RIO_PENDING_RECVS' to 4096*4, which needs quite a big memory allocation for "only" 10,000 datagrams.

I cannot understand what kind of relation there is between those numbers; why does this happen? I guess this is because the consumer (the RIO server) can't consume data (UDP datagrams) as fast as the producer (my application, which writes 10,000 datagrams).
So, if you have a small request queue, the server just can't cope with the large number of incoming packets.

But, again, why would this happen, and why don't I have any of these troubles with IOCPs?

This is also bad because if I have a UDP server and I don't have any clue how many UDP datagrams I'll receive, how can I set those values? I risk allocating much more memory than necessary.

Firstly; Try running your "normal UDP" test after turning off recv buffering in the network stack...

RIO is fast, in part, because you don't have any additional buffer copies going on as the data rises up the networking stack. The inbound datagrams go straight into your buffer space (which is why it needs to be preallocated and registered so that it's "locked" in place and can be accessed directly by the kernel).

Secondly; how realistic is it to actually have 10,000 UDP datagrams arriving 'immediately'. If it's very realistic and that level of load will be maintained then you need to tune your server appropriately. Can your "normal" IOCP server handle a sustained load like that? Does it do any real work?

You can ignore any comparison with TCP. TCP is designed to deliver ALL of the data stream, in order. The peers have flow control between them in case the receiver's recv buffer fills up (and see above re the network stack's recv buffer).

To recv all datagrams with RIO UDP you need enough pending recvs to deal with the desired "burst load" of datagrams. You then need to issue new recvs FAST to make up for the ones that have completed and that you're processing. Ideally you do this by processing the datagram and reusing the buffer; if you can't do that fast enough, issue more recvs. If you're working on the kind of system where the performance gain given by RIO is required then I don't see that the cost of memory would be an issue...

Your tests where you increase the number of pending recvs are only relevant for a given workload. That is they will depend on how long it takes for you to reuse the buffer that has just completed. If you're reusing the buffers slowly then you either miss datagrams or need to issue more recvs to start with.

You could probably tune the system to handle peak loads by issuing more recvs at the time the peak load is identified BUT IMHO you should just decide that the max load that you want to handle is X and make sure you have X memory available and registered as RIO buffers (with pending recvs) from the start.

Anyway, I work through all of this in the examples:

a) look at the IOCP based RIO examples as they perform best.

b) look at the tunable workload.

c) look at your requirements for burst load datagram handling and spec a decent amount of memory and a fast enough CPU for the machine.

IMHO RIO designs are for special purpose, high performance, systems and generally I've found that people who want these kind of systems have appropriately deep pockets.

Yeah, I agree with you.
Usually you use RIO when your system can cope with the memory it requires, etc., and when the application needs that kind of performance gain. Otherwise you could use normal IOCPs in most cases.

What do you mean by "turning off recv buffering in the network stack"? Turning off the (UDP?) socket's buffering at a system level?
Is this why in your code in 'Shared.h' you have the functions 'SetSocketSendBufferToMaximum' and 'SetSocketRecvBufferToMaximum'? But as their names imply, they set the socket buffers to maximum instead of turning them off. So why do you have those functions in the code?

Yeah well, I just tried TCP to test whether that was some kind of problem with RIO itself.

So, basically, just to sum up our discussion: I need a lot of pending recvs in order to exploit the real RIO performance gains. That's because, unlike with IOCPs, with the RIO facility the system doesn't do much work itself; it just copies data straight into the buffers I gave it and notifies me. If I give the system only 1 buffer, it can't cope with all of those packets arriving so fast.
Right?

"turning off recv buffering", set SO_RCVBUF to 0 which disables the network stack's recv buffer... Yes, I do the opposite in the IOCP UDP example and set the network stack's buffer to maximum to enable it to 'help out'.

All the IOCP is used for with the RIO API is to notify you that one or more pending recvs have completed. You provide the buffer space for datagrams to be received and the RIO API can't do anything except throw away any datagrams that arrive when you don't have any recvs pending.

And... The best way to get performance from RIO is to avoid as many user mode to kernel mode transitions as possible. So use GetQueuedCompletionStatusEx() to retrieve LOTS of completions in one go, for one transition.

And also, having ONE pending recv for a UDP server can only lead to reduced performance with or without the RIO API.

I see. So your point is that even with the old IOCPs, one should allocate N OVERLAPPEDs and set them pending with the WSARecvFrom() API, instead of having just 1 of them pending, right? At least for UDP sockets.
Is that true even if one uses only 1 thread for the IOCP?

Of course this won't apply to TCP sockets or pipes, because they're a stream, so it doesn't make sense to have multiple OVERLAPPEDs pending for read data, and 1 is enough.
What about multiple pending OVERLAPPEDs for writing, instead? I guess that could provide some performance gain even with TCP sockets.

For a UDP system with a single 'well known port' and multiple clients then yes, you need multiple pending receives. In fact, any IOCP based UDP system would benefit from multiple pending recvs.

With TCP you may get better performance with multiple pending recvs on a single connection but you need to make sure you sequence the recvs correctly before processing them to ensure you maintain the stream's ordering.

There's never a reason to restrict the number of writes on either UDP or TCP, though it is wise to ensure you manage the number of outstanding writes pending and add flow control if necessary. A TCP connection is most efficient if you keep the recv window full and writes pending on the sender... Just not too many writes.

In summary, I have never seen any advantage in artificially restricting the number of reads or writes that you can have pending on ANY IOCP system.

I see. What about single-threaded IOCP queues?
If I have only 1 thread processing IOCP completions (which is basically what I have with RIO), would multiple pending requests still be a good thing?

My first thought is that if you use 'GetQueuedCompletionStatus' you don't get such benefits, because you can complete only 1 OVERLAPPED at a time.

But with 'GetQueuedCompletionStatusEx' you will see benefits, even with single threaded IOCP queues, because you in fact retrieve multiple OVERLAPPED packets.

What do you think about this?

The number of threads servicing the IOCP is not relevant, IMHO.

The advantage of using GetQueuedCompletionStatusEx() is that you do one transition from user mode to kernel mode and bring back a batch of completions, rather than one transition PER completion. This may help performance as it reduces the transitions.
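A minimal sketch of that batched retrieval (hIOCP and the batch size are placeholders):

   OVERLAPPED_ENTRY entries[64];
   ULONG numEntries = 0;

   if (GetQueuedCompletionStatusEx(
      hIOCP,         // the I/O completion port
      entries,
      64,            // maximum completions to bring back in one transition
      &numEntries,
      INFINITE,      // wait until at least one completion is available
      FALSE))        // not an alertable wait
   {
      for (ULONG i = 0; i < numEntries; ++i)
      {
         // entries[i].lpOverlapped and entries[i].dwNumberOfBytesTransferred
         // identify and describe each completed operation
      }
   }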

If you have a single UDP socket on a 'well known port' then you should have multiple recvs pending to a) take advantage of the potential GQCSEx() improvement and b) to help ensure you receive all datagrams in times of high burst load.

