October 2011 Archives

Windows 8 Registered I/O Buffer Strategies

One of the things that allows the Windows 8 Registered I/O Networking Extensions, RIO, to perform better than normal Winsock calls is the fact that the memory used for I/O operations is pre-registered with the API. This allows RIO to do all the necessary checks (that the buffer memory is valid, and so on) once, and then lock the buffer in memory until you de-register it. Compare this to normal Winsock networking, where the memory needs to be checked and locked on each operation, and already we have a whole load of work that simply isn't required for each I/O operation. As always, take a look at this video from Microsoft's BUILD conference for more in-depth details.

RIO buffers need to be registered before use

The recommended way to use RIORegisterBuffer() is to register large buffers and then use smaller slices from these buffers in your I/O calls, rather than registering each individual I/O buffer separately. This reduces the book-keeping costs, as each registered buffer requires some memory to track its registration. It's also sensible to use page-aligned memory for buffers that you register with RIORegisterBuffer(). The locking granularity of the operating system is page level, so if you use a buffer that is not aligned on a page boundary you will lock the whole of every page that it touches. This is especially important given that there's a limit to the number of I/O pages that can be locked at one time, and I would imagine that buffers registered with RIORegisterBuffer() count against this limit.
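
The arithmetic behind "one large registration, many small slices" can be sketched as follows. This is a portable illustration rather than working RIO code: Slice, RoundUpToPages and SlabSlicer are hypothetical names, and the comments only mark where the real allocation and the single RIORegisterBuffer() call would sit. The offset and length pair is meant to correspond to what you would place in a RIO_BUF alongside the buffer id.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper types for illustration; none of these names come from the RIO headers.
// A Slice is the offset/length pair that would accompany the registered buffer's id.
struct Slice
{
   std::size_t offset;
   std::size_t length;
};

// Round a byte count up to a whole number of pages.
inline std::size_t RoundUpToPages(std::size_t bytes, std::size_t pageSize)
{
   return ((bytes + pageSize - 1) / pageSize) * pageSize;
}

// Hands out fixed-size slices from one large region that would be registered once.
class SlabSlicer
{
   public:
      SlabSlicer(std::size_t sliceSize, std::size_t sliceCount, std::size_t pageSize)
         :  m_sliceSize(sliceSize),
            m_slabSize(RoundUpToPages(sliceSize * sliceCount, pageSize)),
            m_next(0)
      {
         assert(sliceSize > 0 && pageSize > 0);
         // On Windows this is where you would allocate m_slabSize bytes of
         // page-aligned memory and register the whole region with a single
         // RIORegisterBuffer() call.
      }

      std::size_t SlabSize() const { return m_slabSize; }

      // Fills in the next unused slice; returns false once the slab is exhausted.
      bool Allocate(Slice &slice)
      {
         if (m_next + m_sliceSize > m_slabSize)
         {
            return false;
         }
         slice.offset = m_next;
         slice.length = m_sliceSize;
         m_next += m_sliceSize;
         return true;
      }

   private:
      const std::size_t m_sliceSize;
      const std::size_t m_slabSize;
      std::size_t m_next;
};
```

Because the slab is rounded up to whole pages, asking for six 1024 byte slices with a 4096 byte page size yields an 8192 byte slab that can hand out eight slices. The pages are going to be locked in their entirety anyway, so it seems sensible to use all of them.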

Windows 8 Registered I/O and I/O Completion Ports

In my last blog post I introduced the Windows 8 Registered I/O Networking Extensions, RIO. As I explained, there are three ways to retrieve completions from RIO: polled, event driven and via an I/O Completion Port (IOCP). This makes RIO pretty flexible and allows it to be used in many different designs of servers. The polled scenario is likely aimed at very high performance UDP or High Frequency Trading style situations where you may be happy to burn CPU so as to process inbound datagrams as fast as possible. The event driven style may also help here, allowing you to wait efficiently rather than spin, but it's the IOCP style that interests me most at present as this promises to provide increased performance to more general purpose networking code.

Please bear in mind the caveats from my last blog post: this stuff is new, I'm still finding my way, the docs aren't in sync with the headers in the SDK and much of this is based on assumption and intuition.

How do RIO and IOCP work together?

RIO's completions arrive via a completion queue, which is a fixed-size data structure that is shared between user space and kernel space (via locked memory?) and which does not require a kernel mode transition to dequeue from (see this BUILD video for more details on RIO's internals). As I showed last time, you specify how you want to retrieve completions when you create the queue: you provide an event to be signalled, an IOCP to be posted to, or nothing if you will simply poll the queue. When using an IOCP you first indicate that you want to receive completions by calling RIONotify(); a notification is then sent to you when the completion queue is no longer empty.
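
The "arm, get notified, drain, re-arm" cycle described above can be modelled with a toy, single-threaded queue. This is not the RIO API: ModelCompletionQueue and DrainAndRearm are hypothetical names, Notify() and Dequeue() merely stand in for RIONotify() and RIODequeueCompletion(), and the model assumes that each call to the notify function buys you at most one notification.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// A toy stand-in for a RIO completion queue that is bound to an IOCP.
// Hypothetical model for illustration only; results are plain ints.
class ModelCompletionQueue
{
   public:
      ModelCompletionQueue() : m_armed(false), m_notifications(0) {}

      // Stand-in for the kernel completing an operation.
      void Complete(int result)
      {
         m_results.push_back(result);
         FireIfArmed();
      }

      // Stand-in for RIONotify(): request one notification for when the
      // queue is not empty (assumed to fire at once if it already is).
      void Notify()
      {
         m_armed = true;
         FireIfArmed();
      }

      // Stand-in for RIODequeueCompletion(): copy out up to 'max' results.
      std::size_t Dequeue(std::vector<int> &out, std::size_t max)
      {
         std::size_t count = 0;
         while (count < max && !m_results.empty())
         {
            out.push_back(m_results.front());
            m_results.pop_front();
            ++count;
         }
         return count;
      }

      std::size_t NotificationsSent() const { return m_notifications; }

   private:
      void FireIfArmed()
      {
         if (m_armed && !m_results.empty())
         {
            m_armed = false;     // assumed: one notification per Notify() call
            ++m_notifications;   // a real queue would post to the IOCP here
         }
      }

      std::deque<int> m_results;
      bool m_armed;
      std::size_t m_notifications;
};

// What a worker thread might do when the IOCP notification arrives:
// dequeue in batches until a short batch shows the queue is empty, then re-arm.
inline std::vector<int> DrainAndRearm(ModelCompletionQueue &queue, std::size_t batchSize)
{
   assert(batchSize > 0);
   std::vector<int> handled;
   while (queue.Dequeue(handled, batchSize) == batchSize)
   {
      // a full batch means that there may be more to come, so go around again
   }
   queue.Notify();
   return handled;
}
```

In this model three completions that arrive after a single Notify() produce one notification, and the worker then retrieves all three in two batches of at most two before re-arming.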

Most of the buzz being generated around the Windows 8 Developer Previews at the moment seems to be centred on the new Metro user interface and the Windows Runtime. Whilst both Metro and WinRT are key components of the next Windows release I find the Registered I/O Networking Extensions to be far more interesting, but then I guess I would...

What are the Registered I/O Networking Extensions?

The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking with lower latency and jitter. These extensions are targeted primarily at server applications and use pre-registered data buffers and completion queues to increase performance. I assume that the increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed-size completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go.

The RIO API is pretty simple and straightforward, but servers that currently use I/O Completion Port based designs will need to change somewhat to take advantage of it, and probably not all server designs will benefit from changing. RIO relies on you registering the memory that you will use as data buffers and knowing in advance how many pending operations a given socket will have at any time. This allows it to lock the data buffers in memory once, rather than on each operation, and removes the whole concept of the OVERLAPPED structure from the user space API. Since completion queue space is also of a fixed size, you're also required to know how many sockets you will be allocating to a given queue and the maximum number of pending operations that these sockets will have. You can increase all of these limits after socket creation but, except for registering new data buffers, I expect that you're likely to take a performance hit for doing so.
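
To make the up-front sizing concrete, here is a minimal sketch of the arithmetic. It assumes that a single completion queue is shared by the sends and receives of every socket and that the queue needs one entry per operation that could be pending at once; QueuePlan and RequiredCompletionQueueSize are hypothetical names, not part of the RIO API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the limits that you need to decide on up front.
struct QueuePlan
{
   std::uint32_t maxSockets;
   std::uint32_t maxOutstandingReceivesPerSocket;
   std::uint32_t maxOutstandingSendsPerSocket;
};

// Assumption: one completion queue is shared by every socket, for both sends
// and receives, so it needs room for every operation that could be pending.
inline std::uint64_t RequiredCompletionQueueSize(const QueuePlan &plan)
{
   const std::uint64_t perSocket =
      static_cast<std::uint64_t>(plan.maxOutstandingReceivesPerSocket) +
      plan.maxOutstandingSendsPerSocket;

   return perSocket * plan.maxSockets;
}
```

For example, 1000 sockets that each allow one pending receive and four pending sends would need a shared completion queue with room for 5000 entries. The per socket figures are the ones that you would supply when creating each socket's request queue.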

I've been looking at the pre-release documentation and the headers from the latest Windows SDK and experimenting with the new RIO API. Note that at present the documentation is out of sync with the headers and there's little more than reference documentation, so much of what I have to say about RIO is based on assumptions and intuition drawn from the available information and my knowledge of how I/O Completion Port based networking currently works on pre-Windows 8 operating systems. In other words, don't rely on all of this to be correct.

Some stream protocols have the concept of 'out of band' (OOB) data. This is a separate logical communication channel between the peers which enables data that is unrelated to the current data in the stream to be sent alongside the normal data stream. This is often a way for some data to jump ahead of the normal stream and arrive faster than if it were delivered via the normal data stream.

Winsock supports out of band data in a protocol independent way (see here), but accessing it from networking code that uses overlapped I/O rather than the old-fashioned BSD API is somewhat under-documented. By default, out of band data does not appear in the normal data stream; you have to read it explicitly by setting MSG_OOB in the flags of a call to WSARecv(). For non overlapped I/O designs you can use WSAAsyncSelect() or select() to explicitly check for the presence of out of band data. With overlapped I/O your options appear limited. It seems that you should be able to use an overlapped WSAIoctl() call with SIOCATMARK, but this returns immediately whether or not OOB data is present; it doesn't wait for OOB data to become available.

The solution is to post a separate, out of band, overlapped WSARecv() passing the MSG_OOB flag. This will only return on socket closure or when out of band data arrives. By using a distinct indicator in your 'per operation data' you can distinguish this read from normal reads and deal with it accordingly. Once you have processed the special out of band data read you can then post another to read subsequent out of band data.
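
A minimal sketch of this approach follows. OperationType, CompletionCounts, PerOperationData, HandleReadCompletion and PostOutOfBandRead are hypothetical names rather than anything from Winsock or The Server Framework. The Windows-only part shows one way that the overlapped WSARecv() with MSG_OOB might be posted, and the portable part shows the 'per operation data' tag being used to tell the two kinds of read apart when they complete.

```cpp
#include <cassert>

// Tag stored in the 'per operation data' so that a completion can be
// identified. Hypothetical names for illustration.
enum OperationType { NormalRead, OutOfBandRead };

struct CompletionCounts
{
   int normal;
   int outOfBand;
};

// Called when a read completes. Returns the kind of read to post next so that
// there is always one normal read and one out of band read pending.
inline OperationType HandleReadCompletion(OperationType type, CompletionCounts &counts)
{
   if (type == OutOfBandRead)
   {
      ++counts.outOfBand;   // process the urgent data here
   }
   else
   {
      ++counts.normal;      // process the normal stream data here
   }
   return type;
}

#ifdef _WIN32
#include <winsock2.h>

struct PerOperationData
{
   WSAOVERLAPPED overlapped;   // first, so that the OVERLAPPED pointer maps back to this struct
   OperationType type;
   char buffer[1024];
};

// Posts the separate out of band read. Returns true if the read completed
// immediately or is now pending.
inline bool PostOutOfBandRead(SOCKET s, PerOperationData &op)
{
   ZeroMemory(&op.overlapped, sizeof(op.overlapped));
   op.type = OutOfBandRead;

   WSABUF buf;
   buf.buf = op.buffer;
   buf.len = sizeof(op.buffer);

   DWORD flags = MSG_OOB;

   const int result = WSARecv(s, &buf, 1, NULL, &flags, &op.overlapped, NULL);

   return result == 0 || WSAGetLastError() == WSA_IO_PENDING;
}
#endif
```

When a completion is retrieved from the IOCP, the OVERLAPPED pointer can be cast back to the PerOperationData and its type passed to HandleReadCompletion(); the return value says which kind of read to post again.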

Latest release of The Server Framework: 6.5.1

Version 6.5.1 of The Server Framework was released today.

This is primarily a bug fix release, although we also add several new example clients and servers.

This release includes the following; see the release notes, here, for full details of all changes.

  • Bug fixes to The Core Framework which affect the use of the newly added "Read Again" functionality.
  • Fixes to the Hixie76 WebSockets protocol handler to improve interoperability.
  • Added outbound connection establishment support to the Hixie76 protocol handler.
  • Updated the WebSocket Echo Server Test and the Secure WebSocket Echo Server Test example clients to support the creation of both Hixie and HyBi connections.
  • Fixed a race condition in the WebSocket example clients that could cause a connection to "stall" - note that this was an issue with how the client code used the WebSocket layer and not an issue in The WebSockets Option Pack itself.
  • Added two new example servers; a secure, managed WebSocket server which hosts the CLR and routes complete WebSocket messages to managed code for processing and a version of this example which also hosts a simple HTTP server.
  • Added a new HTTP client. This is designed to stress test HTTP and HTTPS servers by creating thousands of concurrent connections and requesting various resources at controllable rates.

A little note to all you Chinese hackers

My server logs are showing that there are some people currently trying to hack this site. They appear to be mainly Chinese. I assume you think you might be able to download the source code to The Server Framework for free if you manage to hack my websites; after all the same IP addresses have been exploring my sites a lot and looking at lots of the documentation pages on here...

Anyway, you're wrong. The source doesn't live on these internet facing servers and never has done; all that's there are the publicly accessible examples.

Please go away.

And of course, to anyone who isn't trying to hack the site, happy browsing!

New client profile: MiX Telematics

We have a new client profile available here for a client that we've had since 2009 and who use The Server Framework in their vehicle tracking software.

About this Archive

This page is an archive of entries from October 2011 listed from newest to oldest.

I usually write about the development of The Server Framework, a super scalable, high performance, C++, I/O Completion Port based framework for writing servers and clients on Windows platforms.
