Windows 8 Registered I/O Networking Extensions


Most of the buzz being generated around the Windows 8 Developer Previews at the moment seems to be centred on the new Metro user interface and the Windows Runtime. Whilst both Metro and WinRT are key components of the next Windows release I find the Registered I/O Networking Extensions to be far more interesting, but then I guess I would…

What are the Registered I/O Networking Extensions?

The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking with increased performance and lower latency and jitter. These extensions are targeted primarily at server applications and use pre-registered data buffers and completion queues to increase performance. I assume that the increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued; instead RIO relies on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go.

The RIO API itself is fairly simple and straightforward, but servers that currently use I/O Completion Port based designs will need to change somewhat to take advantage of it, and not all server designs will necessarily benefit from changing. RIO relies on you registering the memory that you will use as data buffers and knowing in advance how many pending operations a given socket will have at any time. This allows it to lock the data buffers in memory once, rather than on each operation, and removes the whole concept of the OVERLAPPED structure from the user space API. Since completion queue space is also of a fixed size, you’re also required to know how many sockets you will be allocating to a given queue and the maximum number of pending operations that these sockets will have. You can increase all of these limits after socket creation but, except for registering new data buffers, I expect that you’re likely to take a performance hit for doing so.

I’ve been looking at the pre-release documentation and the headers from the latest Windows SDK and experimenting with the new RIO API. Note that at present the documentation is out of sync with the headers and there’s little more than reference documentation available, so much of what I have to say about RIO is based on assumptions and intuition drawn from the available information and my knowledge of how I/O Completion Port based networking works on pre-Windows 8 operating systems. In other words, don’t rely on all of this to be correct.

How do you access the RIO API?

RIO is a Microsoft-specific extension to the Windows Sockets specification in the same way that AcceptEx(), ConnectEx(), etc. are, and the API is accessed in a similar way. You don’t link to the functions directly; you obtain them via a call to WSAIoctl(). Since RIO presents a whole API rather than a single extension function, and since that API is either available in its entirety or not at all, you obtain a table of the API’s function pointers rather than individual function pointers as with AcceptEx() and ConnectEx(). You do this by calling WSAIoctl() with an opcode of SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER and an id of WSAID_MULTIPLE_RIO. The result is a populated RIO_EXTENSION_FUNCTION_TABLE; see here for more details.

It took me a few attempts to get the WSAIoctl() call to work as this is the first extension API and the first use of SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER and I was unable to find any examples of its usage. Anyway, your call to WSAIoctl() should look like this:

RIO_EXTENSION_FUNCTION_TABLE rio;

GUID functionTableId = WSAID_MULTIPLE_RIO;

DWORD dwBytes = 0;

if (0 != WSAIoctl(
   s,
   SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER,
   &functionTableId,
   sizeof(GUID),
   &rio,
   sizeof(rio),
   &dwBytes,
   0,
   0))
{
   const DWORD lastError = ::WSAGetLastError();

   // handle error...
}
else
{
   // all ok, we have access to RIO
}

Note that you can use the cbSize member of the function table to detect additions to the API if it is changed in later versions of Windows.
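For example, something along these lines; this is just a sketch of the kind of check I mean:

// cbSize tells us how much of the function table the provider actually
// filled in; if it's smaller than the table that we compiled against then
// the later entries in our copy of the table won't be valid
if (rio.cbSize < sizeof(RIO_EXTENSION_FUNCTION_TABLE))
{
   // only use the function pointers that we know were populated
}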

First impressions of the RIO API

Looking at the preliminary on-line documentation a couple of things immediately jumped out at me:

  • None of RIOReceive(), RIOReceiveEx(), RIOSend() and RIOSendEx() support scatter/gather I/O. That is, they all take a single RIO_BUF structure rather than an array of them, whereas the standard Winsock overlapped send and receive functions take arrays of WSABUF structures, allowing for scatter/gather I/O.
  • Completions are signalled via an event. Each completion queue can have its own event associated with it and the pattern for retrieving completions appears to be: issue operations, wait on the event, retrieve completions. Whilst this is probably the most performant method for small numbers of connections, it leaves you having to scale it yourself, which is likely to be non-trivial.

Luckily the header files are not consistent with the documentation: they DO include support for scatter/gather I/O and they also allow completion notification either via an event or via an IOCP, which makes me pretty sure that the code is right and the docs are wrong. Anyway, I’m getting ahead of myself…

How does RIO work?

As I mentioned above, RIO provides increased performance by working with pre-locked buffers, fixed sized completion queues and reduced user mode to kernel mode transitions. You enable the RIO extensions on a socket by creating the socket with the WSA_FLAG_REGISTERED_IO flag. It seems that this can be combined with WSA_FLAG_OVERLAPPED, which is convenient, as RIO provides no alternatives to AcceptEx() and ConnectEx() and so it’s likely that your sockets will require both WSA_FLAG_REGISTERED_IO and WSA_FLAG_OVERLAPPED.

SOCKET s = ::WSASocket(
   AF_INET,
   SOCK_STREAM,
   IPPROTO_TCP,
   NULL,
   0,
   WSA_FLAG_REGISTERED_IO);

Once you have your socket you need to create a registered I/O socket descriptor on the socket. You do this by calling RIOCreateRequestQueue(). The documentation for this function is out of sync with the headers; the call actually looks like this:

ULONG maxOutstandingReceive = 10;
ULONG maxReceiveDataBuffers = 1;
ULONG maxOutstandingSend = 10;
ULONG maxSendDataBuffers = 2;

void *pContext = 0;
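
// recvQueue and sendQueue are the RIO_CQ completion queue handles, created
// with RIOCreateCompletionQueue() as shown below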

RIO_RQ requestQueue = rio.RIOCreateRequestQueue(
   s,
   maxOutstandingReceive,
   maxReceiveDataBuffers,
   maxOutstandingSend,
   maxSendDataBuffers,
   recvQueue,
   sendQueue,
   pContext);

This is where you place limits on the number of outstanding requests (and the number of buffers that can be used with each request) and where you associate your per-connection context that will be returned to you with each completion; this is the same as the “completion key” in regular IOCP designs. You also need a receive queue and a send queue (you can use one queue for both); these queues are created by a call to RIOCreateCompletionQueue(). Again the documentation is out of sync; the call looks like this:

HANDLE hEvent = WSACreateEvent();

RIO_NOTIFICATION_COMPLETION type;

type.Type = RIO_EVENT_COMPLETION;
type.Event.EventHandle = hEvent;
type.Event.NotifyReset = TRUE;

RIO_CQ queue = rio.RIOCreateCompletionQueue(queueSize, &type);

Which creates a completion queue of a specified size which uses an event to signal that it’s no longer empty. When you create a request queue the number of outstanding operations is used to ensure that the queue size is suitable for all the sockets associated with it. You can resize both completion queues and request queues at a later time if you need to but I would imagine that it’s better not to base your design on doing so.
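Based on the headers, the resize calls appear to look something like this, though as I say it’s probably best not to rely on doing this; the new sizes here are purely illustrative:

const DWORD newQueueSize = 2 * queueSize;

const ULONG newMaxOutstandingReceive = 20;
const ULONG newMaxOutstandingSend = 20;

// grow the completion queue and then allow more pending operations on the
// socket's request queue; both calls appear to return success or failure
if (!rio.RIOResizeCompletionQueue(queue, newQueueSize))
{
   // handle error
}

if (!rio.RIOResizeRequestQueue(requestQueue, newMaxOutstandingReceive, newMaxOutstandingSend))
{
   // handle error
}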

As an alternative you can use an IOCP for completion notification.

HANDLE hIOCP = CreateIoCompletionPort(
   INVALID_HANDLE_VALUE,
   0,
   0,
   0);

void *pCompletionKey = 0;      // whatever you want back as the completion key

OVERLAPPED overlapped;

RIO_NOTIFICATION_COMPLETION type;

type.Type = RIO_IOCP_COMPLETION;
type.Iocp.IocpHandle = hIOCP;
type.Iocp.CompletionKey = pCompletionKey;
type.Iocp.Overlapped = &overlapped;

RIO_CQ queue = rio.RIOCreateCompletionQueue(queueSize, &type);

This makes it easier to scale the use of RIO with a pool of IOCP threads processing completions from one or more queues. Of course the overlapped structure should likely be dynamically allocated so that it can last until the queue is closed.
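A worker thread servicing the IOCP might then look something like this; a minimal sketch using the hIOCP and queue from above, with error handling and shutdown ignored:

for (;;)
{
   DWORD numberOfBytes = 0;
   ULONG_PTR completionKey = 0;
   OVERLAPPED *pOverlapped = 0;

   // block until the completion queue tells us, via the IOCP, that it has
   // completions available; RIONotify() needs to have been called on the
   // queue for this notification to be generated
   if (!::GetQueuedCompletionStatus(
      hIOCP,
      &numberOfBytes,
      &completionKey,
      &pOverlapped,
      INFINITE))
   {
      break;
   }

   RIORESULT results[64];

   const ULONG numResults = rio.RIODequeueCompletion(queue, results, 64);

   for (ULONG i = 0; i < numResults; ++i)
   {
      // each RIORESULT gives us back the per-operation context passed to
      // RIOReceive()/RIOSend() and the per-connection context passed to
      // RIOCreateRequestQueue(), plus the status and bytes transferred
   }

   // ask for another notification before we wait again
   rio.RIONotify(queue);
}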

The header file parameter annotations suggest that the completionType parameter is optional, thus there’s a third way to create a completion queue.

RIO_CQ queue = rio.RIOCreateCompletionQueue(queueSize, 0);

Which seems to provide a polled interface…
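Presumably you then just spin on RIODequeueCompletion() and process whatever it returns; something like this minimal sketch, assuming that an empty queue simply returns zero results:

for (;;)
{
   RIORESULT results[16];

   const ULONG numResults = rio.RIODequeueCompletion(queue, results, 16);

   if (0 == numResults)
   {
      // nothing to do yet; spin, yield or back off as appropriate
      continue;
   }

   for (ULONG i = 0; i < numResults; ++i)
   {
      // process results[i]...
   }
}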

Receiving data using RIO

Once the socket is connected using normal connection methods you can send and receive using RIO. The two receive functions available in the Windows Developer Preview SDK differ from the documentation in that they DO support scatter/gather I/O. The simplest recv call looks like this:

RIO_BUF buffer;

buffer.BufferId = id;
buffer.Offset = 0;
buffer.Length = bufferSize;

DWORD flags = 0;

void *pOperationContext = 0;

rio.RIOReceive(requestQueue,
   &buffer,
   1,
   flags,
   pOperationContext);

The RIO_BUF structure allows us to create a buffer slice from a registered data buffer. This lets us register large buffers, which is more efficient, and lets us slice them up into blocks that are more appropriate to use. A buffer is registered like this:

const DWORD bufferSize = 4096;

char *pBuffer = new char[bufferSize];

RIO_BUFFERID id = rio.RIORegisterBuffer(pBuffer, bufferSize);

Note that, of course, the buffer must outlive the buffer registration and that it would be better to allocate memory that is page aligned, using VirtualAlloc(), as buffer registration locks the buffer in memory and the locking granularity is page sized. See the documentation for more details.
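Something along these lines, perhaps; a sketch of registering a VirtualAlloc() allocated buffer:

SYSTEM_INFO systemInfo;

::GetSystemInfo(&systemInfo);

// allocate a whole number of pages so that no page is shared with other
// allocations and locked in memory unnecessarily
const DWORD bufferSize = systemInfo.dwPageSize * 10;

char *pBuffer = reinterpret_cast<char *>(
   ::VirtualAlloc(0, bufferSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));

RIO_BUFFERID id = rio.RIORegisterBuffer(pBuffer, bufferSize);

// ... slice the buffer up with RIO_BUF structures and use it ...

rio.RIODeregisterBuffer(id);

::VirtualFree(pBuffer, 0, MEM_RELEASE);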

The operation context is your per-operation data; this is what you would have previously used your ’extended’ OVERLAPPED structure for. In a real design this is likely a pointer to a reference counted ‘operation data’ object which knows about the buffer slices being used and the operation, in this case a read, that is being executed.
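For example, something along these lines; the type is purely illustrative, RIO simply hands the pointer back to you in the completion:

// purely illustrative per-operation context
struct OperationData
{
   enum Operation
   {
      Read,
      Write
   } operation;

   RIO_BUF buffer;         // the buffer slice used for this operation

   long referenceCount;    // plus whatever else your design needs; a pointer
                           // to the connection, a reference count, etc.
};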

So, what happens when RIOReceive() completes? Well, if we’re dealing with the event based completion mechanism then, at present, you need to call RIONotify() to tell the RIO API that you want to receive notifications when completions occur (once again the docs for this function are out of sync with the implementation). You then wait on your event until it’s signalled and then call RIODequeueCompletion() to retrieve the completion results. Like GetQueuedCompletionStatusEx(), you can remove multiple completions in a single call, in this case by passing an array of RIORESULT structures.

RIORESULT result;

ULONG numResults = rio.RIODequeueCompletion(queue, &result, 1);

Here we request a single completion, but for better performance it’s probably best to always work with arrays of completions. We can now wait for another completion, but, at least at present, we need to call RIONotify() again to request more notifications; we can’t simply reset our event and wait again. It seems strange that we have to call RIONotify() manually when RIO_NOTIFICATION_COMPLETION has a boolean member called NotifyReset, but, at present, we do. I would expect that by setting NotifyReset to TRUE the act of dequeuing completions would reset the event AND enable further notifications, thus avoiding another potential kernel mode transition.
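Putting that together, the event driven path currently looks something like this; a sketch using the queue and hEvent from above and assuming that NotifyReset deals with resetting the event:

rio.RIONotify(queue);                  // request notification of completions

for (;;)
{
   if (WAIT_OBJECT_0 != ::WaitForSingleObject(hEvent, INFINITE))
   {
      break;
   }

   RIORESULT results[32];

   ULONG numResults;

   // drain the queue an array at a time
   while (0 != (numResults = rio.RIODequeueCompletion(queue, results, 32)))
   {
      for (ULONG i = 0; i < numResults; ++i)
      {
         // process results[i]...
      }
   }

   // and, at present, we have to ask to be notified again before we wait
   rio.RIONotify(queue);
}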

There are some flags that you can specify in your RIOReceive() call. The most interesting is RIO_MSG_WAITALL which causes the recv call to only complete when the buffer slice supplied is full, an error occurs, or the connection is terminated. This would be very useful for servers that work with messages which are of a fixed length, or are framed with a length prefix. By supplying a buffer of the appropriate size and specifying RIO_MSG_WAITALL you’ll get a single completion when the buffer is full. This is considerably better than getting a completion with a partial buffer and then needing to adjust the start position of the buffer and reissue the read so that you can read the rest of the message into the buffer. The reduced number of completions that need to be processed in this scenario, especially with large messages, would likely turn into a huge performance gain.

Note that currently, in a multi-buffer slice read operation, RIO_MSG_WAITALL will cause a completion to occur when the first buffer slice is full, not when all buffer slices supplied in the call are full.
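For example, for a protocol that uses a fixed size message header you might do something like this; a sketch, reusing the buffer registration from above, with the header size purely illustrative:

const ULONG headerSize = 8;            // illustrative fixed message header size

RIO_BUF headerBuffer;

headerBuffer.BufferId = id;
headerBuffer.Offset = 0;
headerBuffer.Length = headerSize;

// the read will only complete once the whole header has arrived (or the
// connection fails), so there's no need to reissue reads for partial headers
if (!rio.RIOReceive(requestQueue, &headerBuffer, 1, RIO_MSG_WAITALL, pOperationContext))
{
   // handle error
}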

Sending data using RIO

Sending data is pretty much the same as receiving it; you pass an array of buffer slices to RIOSend() and process the completion in the normal way.

RIO_BUF sendBuffer;

sendBuffer.BufferId = id;
sendBuffer.Offset = 0;
sendBuffer.Length = bufferSize;

// memcpy your data into the buffer slice...

if (!rio.RIOSend(requestQueue, &sendBuffer, 1, flags, pContext))
{
   // handle error
}

Conclusions

I’ve only scratched the surface of RIO here and the fact that the documentation is out of sync with the actual implementation means that this could all change before Windows 8 is released, but… Although RIO will likely mean that your design is more complicated than a “normal” IOCP design, I expect the performance gains will be worth it for certain types of networking applications. Being able to pre-lock all of your data buffers in memory and pre-assign your completion queues likely means that your server will be more robust, with failures due to the I/O page lock limit and non-paged pool exhaustion becoming a thing of the past. Processing completions on dedicated threads using the eventing version of the API is likely to result in higher performance for applications that suit that design, whilst using the IOCP based notification system will scale more easily.

I can see two use cases for RIO and I’m sure there are many more. The first is for high performance, jitter free, low latency connections where you have a small number of connections to deal with and want the best performance possible. The second is for servers with many thousands of concurrent connections where the performance gains from the streamlined send and recv APIs and the correspondingly reduced kernel transitions lead to higher throughput on all connections.

We certainly intend to incorporate support for RIO into The Server Framework and right now we’re investigating the best way to do this. I’ll be blogging more about RIO over the next few weeks, why not subscribe to our RSS feed or follow us on Twitter.

Code is here

Code - updated 15th April 2023

Full source can be found here on GitHub.

This isn’t production code; error handling is simply “panic and run away”.

This code is licensed with the MIT license.