Development Archives

Dropping support for Visual Studio 2005 and 2008

The 7.0 release of The Server Framework, which is likely to be released early next year, will no longer support Visual Studio 2005 or Visual Studio 2008.

The 6.6.x releases will be the last to support these compilers.

Please get in touch immediately if this will be a problem for you.

Dropping support for Windows XP and Server 2003

The 7.0 release of The Server Framework, which is likely to be released early next year, will no longer support Windows XP or Windows Server 2003.

The 6.6.x releases will be the last to support these operating systems. Release 6.6.3 is due shortly and is a minor bug-fixing release. We may release subsequent 6.6.x bug fix releases, but no new development will occur on the 6.6 branch.

Removal of support for these operating systems allows us to clean up the code considerably and to remove lots of code that's required purely to work around 'interesting' twists in various Windows APIs pre-Vista.

New option pack: Streaming Media

We have a new Option Pack, The Streaming Media Option Pack. This allows you to easily add streaming of H.264 and MPEG audio and video to your clients and servers using RTSP, RTP and RTCP.

With more and more Internet Of Things devices supporting rich media streaming for remote monitoring it's becoming essential to have the ability to manage these media streams within your device management servers and clients. Whether it's recording device streams for later analysis or arbitrating between multiple clients and devices, manipulating streaming media is becoming more and more important.

As always, this Option Pack integrates seamlessly with The Server Framework's Core Framework and other options and allows you to quickly and easily add rich media support.

The latest release of The Server Framework, which is due later this month, adds support for Visual Studio 2013 and Windows 8.1 as well as a host of other major changes.

As we first mentioned here, release 6.6 of The Server Framework removes support for Visual Studio .Net (2002) and Visual Studio .Net (2003). The 2002 compiler is no longer supported by Microsoft and the 2003 compiler becomes unsupported in October this year. To be honest, I'm very pleased to see the back of them. Hopefully most users of the framework are using at least Visual Studio 2005, if you're not, get in touch now.

We're also dropping support for Windows 2000, which Microsoft stopped supporting in 2010. This means that Windows XP is the earliest operating system supported by version 6.6.

Finally, we're deprecating quite a bit of code. This will still be present in 6.6 but will, generally, be unavailable by default. You can add various defines to your Config.h to re-enable the deprecated code but be warned, it will be removed in a future release.

The following code is deprecated in 6.6

Release 6.6 of The Server Framework includes some breaking changes to the IService, IServiceCallbacks and IShutdownService interfaces. Many functions now return a ServiceTools::ExitCode, which allows you to fine-tune the exit code returned from your service under failure conditions. This exit code is reported to the Service Control Manager (SCM) when your service shuts down and is also returned from the exe if you run the service as a normal exe. These changes allow finer control of your service but can easily be ignored completely if you want things to stay the way they were.

Other functions take slightly different parameters.

Slightly more efficient locking

Another performance improvement in the forthcoming 6.6 release comes from a change in our default choice of locking primitive on most platforms. Note that the improvement is small and, according to our testing, it doesn't materialise on all hardware (though no performance degradation is seen either).

The change is to switch from using CRITICAL_SECTION objects to using Slim Reader/Writer (SRW) locks in exclusive (write) mode. You can read about the differences between these two locks in Kenny Kerr's MSDN article here. This change can't be applied to all uses of our CCriticalSection class, as SRW locks are not recursive, and so we have a whole new locking class hierarchy. The new CLockableObject locks use SRW locks on platforms that support them and drop back to using a CRITICAL_SECTION on XP and earlier. Then there's CReentrantLockableObject, which is, basically, a CCriticalSection with a new name and a slightly new interface. There are also classes for locks which track the thread that owns them (just so that they can tell you if you currently have them locked), as we use that functionality in a couple of places in The Server Framework for optimising code paths.
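
To make the shape of the new classes concrete, here's a minimal sketch of the Vista-and-later case. The framework's actual CLockableObject does more (not least the pre-Vista CRITICAL_SECTION fall-back), so treat the details here as illustrative assumptions rather than its real implementation:

   // A sketch only: an SRW lock used purely in exclusive mode acts as a
   // lighter weight, non-recursive alternative to a CRITICAL_SECTION.
   // Note that acquiring it twice on the same thread will deadlock.
   class CLockableObjectSketch
   {
      public :

         CLockableObjectSketch() { ::InitializeSRWLock(&m_lock); }

         void Lock() { ::AcquireSRWLockExclusive(&m_lock); }

         void Unlock() { ::ReleaseSRWLockExclusive(&m_lock); }

      private :

         SRWLOCK m_lock;

         // no copies...
         CLockableObjectSketch(const CLockableObjectSketch &rhs);
         CLockableObjectSketch &operator=(const CLockableObjectSketch &rhs);
   };

A scoped lock helper that calls Lock() in its constructor and Unlock() in its destructor then works exactly as it does with the CRITICAL_SECTION based classes.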

The new locks give a slight improvement in the time taken to acquire them and use fewer resources. They haven't yet been fully integrated into all libraries (in particular they are not yet in use throughout the Socket Tools library) and so the full effect of these changes can't yet be appreciated.

I've been working on a "big" new release for some time now; too long, actually. It has been steadily accumulating new features for over a year, but the arrival of my second son in July last year and masses of client work have meant that it has repeatedly been pushed onto the back burner. Well, no more: Release 6.6 is now in the final stages of development and testing (so I won't be adding more new features) and should see a release in Q2.

I'm planning a "what's new in 6.6" blog posting which will detail all of the major changes but first I'd like to show you the results of some performance tuning that I've been doing. Most people are familiar with the quote from Donald Knuth, "premature optimization is the root of all evil", and it's often used as a stick to beat people with when they want to tweak low level code to "make things faster". Yet the full quote is more interesting; "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.". I like to think that much of what I've done in 6.6 is in that 3% but even if it isn't it's still optimisation that comes for free to users of the framework. What's more, the first change also removes complexity and makes it much easier to write correct code using the framework.

Since explaining the changes is pretty heavy going, let's jump to the pictures of the results first.

Performance of a 6.5.9 OpenSSL Server

This is the "before" graph; a v6.5.9 Open SSL Server. 6.5.9-OpenSSLServerPerf.png

Performance of a 6.6 OpenSSL Server

And this is the "after" graph: a v6.6 server. [Image: 6.6-OpenSSLServerPerf.png]

The important things are the red and purple lines (bytes processed/sec), where higher values are better. The next most important are the faint dotted lines (thread context switches), where lower values are better.

What the graphs above show is that with the new changes in 6.6 a server can process more data in less time with fewer thread context switches.

These tests were run on a dual quad-core Xeon E5320 machine and pushed the box to around 80% CPU usage, the 6.6 test using slightly more CPU but for much less time. The results have been far less dramatic, but still positive, on our Core i7-3930K (6 cores, 12 threads), but it's hard to push that box above 30% CPU utilisation.

As I mentioned in the release notes for version 6.5.5, The Server Framework now supports Visual Studio 11 Beta. It also supports Visual Studio 2012 RC but there are a couple of new warnings that you may need to suppress in Warnings.h.

I haven't been able to locate any details of the differences in the native code generation and C++ compiler side of Visual Studio 2012 RC compared with Visual Studio 11 Beta. If anyone has any links or information, please let me know. I don't anticipate any problems; all of the libraries' unit tests pass and most of the server examples that I've tested so far are fine.

I will put out a new release of The Server Framework before Visual Studio 2012 ships, and this will finalise support for the new compiler and project file format. If you have any questions please get in touch.

Windows 8 Registered I/O Performance

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance.

Of course these comparisons should be taken as preliminary since we're working with a beta version of the operating system. However, though I wouldn't put much weight on the exact numbers until we have a non-beta OS to test on, it's useful to see how things are coming along and to familiarise ourselves with the designs that might be required to take advantage of RIO once it ships. The main things to take away from these discussions of RIO's performance are the example server designs, the testing methods and a general understanding of why RIO performs better than the traditional Windows networking APIs. With these you can run your own tests, build your own servers and get value from using RIO where appropriate.

Do bear in mind that I'm learning as I go here, RIO is a new API and there is precious little in the way of documentation about how and why to use the API. Comments and suggestions are more than welcome, feel free to put me straight if I've made some mistakes, and submit code to help make these examples better.

Please note: the results of using RIO with a 10 Gigabit Ethernet link are considerably more impressive than those shown in this article - please see here for how we ran these tests again with improved hardware.

Now that we have five example servers, four RIO designs and a traditional polled UDP design, we can begin to look at how the RIO API performs compared to the traditional APIs. Of course these comparisons should be taken as preliminary since we're working with a beta version of the operating system. However, though I wouldn't put much weight on the exact numbers until we have a non-beta OS to test on, it's useful to see how things are coming along and familiarise ourselves with the designs that might be required to take advantage of RIO once it ships.

Sending a stream of datagrams

Before we can compare performance we need to be able to push the example servers hard. We do this by sending a stream of datagrams at them as fast as we can for a period of time. The servers start timing when they get the first datagram and then count the number of datagrams that they process. The test finishes by sending a series of smaller datagrams at the server. When the server sees one of these smaller datagrams it shuts down and reports the time taken, the number of datagrams processed and the rate at which they were processed.

All we need to be able to do to stress the servers is to send datagrams at a rate that gets close to 100% utilisation of a 1Gb Ethernet link. This is fairly simple to achieve using the traditional blocking sockets API.
   // Blast datagrams at the server as fast as the blocking send allows;
   // the socket, buffer, flags and address set-up is elided here.
   for (size_t i = 0; i < DATAGRAMS_TO_SEND; ++i)
   {
      if (SOCKET_ERROR == ::WSASendTo(
         s,
         &buf,
         1,
         &bytesSent,
         flags,
         reinterpret_cast<sockaddr *>(&addr),
         sizeof(addr),
         0,
         0))
      {
         ErrorExit("WSASendTo");
      }
   }

There's not much more to it than that. We use similar code to set up and clean up, but if you've been following along with the other examples then there's nothing there that needs explaining.

The code for this example can be downloaded from here. This code requires Visual Studio 11, but would work with earlier compilers if you have a Windows SDK that supports RIO. Note that Shared.h and Constants.h contain helper functions and tuning constants for ALL of the examples and so there will be code in there that is not used by this example. You should be able to unzip each example into the same directory structure so that they all share the same shared headers. This allows you to tune all of the examples in the same way so that any performance comparisons make sense. This program can be run on versions of Windows prior to Windows 8, which is useful for testing as you then only need one machine set up with the Windows 8 Server beta.

This article presents the fifth in my series of example servers for comparing the performance of the Windows 8 Registered I/O Networking extensions, RIO, and traditional Windows networking APIs. This example server is a traditional polled UDP design that we can use to compare to the RIO polled UDP example server. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

A traditional polled UDP server

This server is probably the simplest UDP server you could have. It's pretty much just a tight loop around a blocking call to WSARecv(). There's none of the complexity required by RIO for registering memory buffers for I/O and so we use a single buffer that we create on the stack.
   do
   {
      // optionally burn some CPU to simulate per-datagram processing
      workValue += DoWork(g_workIterations);

      if (SOCKET_ERROR == ::WSARecv(
         s,
         &buf,
         1,
         &bytesRecvd,
         &flags,
         0,
         0))
      {
         ErrorExit("WSARecv");
      }

      if (bytesRecvd == EXPECTED_DATA_SIZE)
      {
         g_packets++;
      }
      else
      {
         // a smaller, 'end of test', datagram stops the server
         done = true;
      }
   }
   while (!done);
There is some added complexity to allow us to compare performance, and this is similar to the complexity in the RIO server examples. We can add an arbitrary processing overhead to each datagram by setting g_workIterations to a non-zero value, and we count each datagram that arrives, stopping the test when a datagram of an unexpected size is received.

This article presents the fourth in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server, like the last example, uses the I/O Completion Port notification method to handle RIO completions, but where the last example used only a single thread to service the IOCP this one uses multiple threads to scale the load. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an I/O Completion Port for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using an IOCP for RIO completions allows you to easily scale your completion handling across multiple threads as we do here and this is the first of my example servers that allows for more than one thread to be used to process completions.

This article presents the third in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server uses the I/O Completion Port notification method to handle RIO completions, but only uses a single thread to service the IOCP. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an I/O Completion Port for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using an IOCP for RIO completions allows you to easily scale your completion handling across multiple threads, though in this first IOCP example server we use a single thread so as to allow us to compare the performance against the polled and event driven servers. The next example server will adapt this server for multiple threads and allow us to scale our completion processing across more CPUs.

This article presents the second in my series of example servers using the Windows 8 Registered I/O Networking extensions, RIO. This example server uses the event driven notification method to handle RIO completions. I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Using an event for RIO completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. Using the event driven approach is similar to using the polling approach that I described in the previous article except that the server doesn't burn CPU in a tight polling loop.
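
In outline, and with the error handling trimmed, the event driven setup looks something like this; g_rio, g_queue, RIO_PENDING_RECVS and ErrorExit() follow the naming used in these examples, but treat this as a sketch rather than the example's actual code:

   HANDLE hEvent = ::WSACreateEvent();

   RIO_NOTIFICATION_COMPLETION completionType;

   completionType.Type = RIO_EVENT_COMPLETION;
   completionType.Event.EventHandle = hEvent;
   completionType.Event.NotifyReset = TRUE;   // reset the event inside RIONotify()

   g_queue = g_rio.RIOCreateCompletionQueue(RIO_PENDING_RECVS, &completionType);

   // issue receives, then, rather than spinning on RIODequeueCompletion(),
   // request a notification and wait until completions are available...

   if (0 != g_rio.RIONotify(g_queue))
   {
      ErrorExit("RIONotify");
   }

   if (WAIT_OBJECT_0 != ::WaitForSingleObject(hEvent, INFINITE))
   {
      ErrorExit("WaitForSingleObject");
   }

   // the completion queue is now non-empty; dequeue and process results
   // exactly as the polled server does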

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance. This series of blog posts describes each of the example servers in turn. You can find an index to all of the articles about the Windows 8 Registered I/O example servers here.

Polling RIO for completions

As I mentioned back in October, there are three ways to receive completion notifications from RIO; polling, event driven and via an I/O Completion Port. The first is the simplest though it burns CPU time even when no datagrams are being received.

At its simplest a polled RIO server obtains datagrams to process like this:
   RIORESULT results[RIO_MAX_RESULTS];

   ULONG numResults = 0;

   do
   {
      numResults = g_rio.RIODequeueCompletion(
         g_queue,
         results,
         RIO_MAX_RESULTS);

      if (0 == numResults)
      {
         YieldProcessor();   // no completions yet; hint that we're spinning
      }
      else if (RIO_CORRUPT_CQ == numResults)
      {
         ErrorExit("RIODequeueCompletion");
      }
   }
   while (0 == numResults);
You then loop over the results array and process each result in turn before looping back to dequeue more completions.

Getting to the point where you can call RIODequeueCompletion() takes a bit of setting up though...
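
Broadly, and with the socket binding and error checking stripped out, the setup involves creating a socket for registered I/O, obtaining the RIO function table, registering a buffer and creating the completion and request queues. The tuning constants below are the kind of thing that lives in Constants.h; the names and some of the details are my assumptions rather than a copy of the example's code:

   // a socket created for registered I/O...
   SOCKET s = ::WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP, 0, 0,
      WSA_FLAG_REGISTERED_IO);

   // ...the RIO functions themselves, which are reached via an extension
   // function table rather than being exported directly...
   GUID functionTableId = WSAID_MULTIPLE_RIO;
   DWORD bytes = 0;

   ::WSAIoctl(s, SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER,
      &functionTableId, sizeof(GUID), &g_rio, sizeof(g_rio), &bytes, 0, 0);

   // ...a registered buffer and a RIO_BUF that describes a slice of it...
   char *pBuffer = reinterpret_cast<char *>(
      ::VirtualAlloc(0, BUFFER_SIZE, MEM_COMMIT, PAGE_READWRITE));

   RIO_BUF buf;

   buf.BufferId = g_rio.RIORegisterBuffer(pBuffer, BUFFER_SIZE);
   buf.Offset = 0;
   buf.Length = EXPECTED_DATA_SIZE;

   // ...a completion queue with no notification mechanism, since we poll...
   g_queue = g_rio.RIOCreateCompletionQueue(RIO_PENDING_RECVS, 0);

   // ...and a request queue that completes into it, primed with a receive
   RIO_RQ rq = g_rio.RIOCreateRequestQueue(s, RIO_PENDING_RECVS, 1, 0, 1,
      g_queue, g_queue, 0);

   g_rio.RIOReceive(rq, &buf, 1, 0, &buf);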

Windows 8 Registered I/O Example UDP Servers

I've been looking at the Windows 8 Registered I/O Networking Extensions since October when they first made an appearance as part of the Windows 8 Developer Preview. Whilst exploring and understanding the new API I spent some time putting together some simple UDP servers using the various notification styles that RIO provides. I then put together some equally simple UDP servers using the "traditional" APIs so that I could compare performance.

RIO API demonstration

The examples are simple in that they do the bare minimum to demonstrate the APIs in question but they are configurable so that you can tune them to the hardware on which you're running them. You can run them to compare the maximum speed at which you can pull UDP datagrams off of the wire using each API and then adjust the examples so that they do a specific amount of "work" with each datagram to simulate a slightly more realistic scenario.

Simplified error handling

Error handling is limited; we display an error and exit the program, but we don't skip error checking - all API calls are checked for errors. The examples are each stand-alone but share two common header files. The first, Constants.h, contains all of the constants that are used to tune the examples. The second, Shared.h, contains inline helper functions which hide some of the complexity and allow the individual example programs to focus on the area of the API that they're demonstrating.

This is the index page

I will be blogging about the construction of the various examples over the next few weeks and updating this entry as an index page for all of the examples. I've listed the examples that I'll be talking about and I'll link to each blog post as they go live. Once I've presented the RIO examples I'll present the more traditional examples and finally some performance comparisons.

RIO server examples

  • RIO Polled UDP - A server which uses a single thread and a tight loop to poll for RIO completions.
  • RIO Event Driven UDP - A server which uses a single thread and event driven notifications to handle RIO completions.
  • RIO IOCP UDP - A server which uses a single thread and I/O Completion Port notifications to handle RIO completions.
  • RIO IOCP MT UDP - A server which uses a configurable number of threads and I/O Completion Port notifications to handle RIO completions.

Traditional server examples

  • Simple Polled UDP - A server which uses a single thread and a tight loop to poll WSARecv() for datagrams.
  • IOCP UDP - A server which uses a single thread and I/O Completion Port notifications to handle overlapped WSARecv() completions.
  • IOCP MT UDP - A server which uses a configurable number of threads and I/O Completion Port notifications to handle overlapped WSARecv() completions.

A simple UDP datagram traffic generator

  • Simple UDP traffic generator - A client which uses a single thread and a tight loop to send datagrams using WSASendTo(); this easily saturates a 1000BASE-T, 1Gb Ethernet connection.

Test scripts

  • Test scripts - These simple scripts create performance counter logs and run the test servers.

Performance Test results

  • The first tests - Where we compare the simple polled traditional server with the polled RIO server.

Join in

Comments and suggestions are more than welcome. I'm learning as I go here and I'm quite likely to have made some mistakes or come up with some erroneous conclusions, feel free to put me straight and help make these examples better.

Windows 8 Registered I/O Buffer Strategies

One of the things that allows the Windows 8 Registered I/O Networking Extensions, RIO, to perform better than normal Winsock calls is the fact that the memory used for I/O operations is pre-registered with the API. This allows RIO to do all the necessary checks that the buffer memory is valid, etc. once, and then lock the buffer in memory until you de-register it. Compare this to normal Winsock networking where the memory needs to be checked and locked on each operation and already we have a whole load of work that simply isn't required for each I/O operation. As always, take a look at this video from Microsoft's BUILD conference for more in-depth details.

RIO buffers need to be registered before use

The recommended way to use RIORegisterBuffer() is to register large buffers and then use smaller slices from these buffers in your I/O calls, rather than registering each individual I/O buffer separately. This reduces the book-keeping costs, as each registered buffer requires some memory to track its registration. It's also sensible to use page aligned memory for buffers that you register with RIORegisterBuffer(), as the locking granularity of the operating system is the page: if you use a buffer that is not aligned on a page boundary you will lock the entire page (or pages) that it occupies. This is especially important given that there's a limit to the number of I/O pages that can be locked at one time, and I would imagine that buffers registered with RIORegisterBuffer() count against this limit.
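
As an illustration, something along these lines registers one large, page aligned block and then carves it into slices; the sizes and names here are mine, for illustration only:

   const DWORD sliceSize = 4096;
   const DWORD sliceCount = 1000;

   // VirtualAlloc returns memory that is (at least) page aligned, so no
   // slice will cause a partial page to be locked
   char *pBuffer = reinterpret_cast<char *>(::VirtualAlloc(
      0, sliceSize * sliceCount, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));

   // one registration - and so one piece of book-keeping - for all slices
   RIO_BUFFERID id = g_rio.RIORegisterBuffer(pBuffer, sliceSize * sliceCount);

   for (DWORD i = 0; i < sliceCount; ++i)
   {
      RIO_BUF slice;

      slice.BufferId = id;
      slice.Offset = i * sliceSize;
      slice.Length = sliceSize;

      // add the slice to a pool for use in RIOReceive()/RIOSend() calls...
   }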

Windows 8 Registered I/O and I/O Completion Ports

In my last blog post I introduced the Windows 8 Registered I/O Networking Extensions, RIO. As I explained there, there are three ways to retrieve completions from RIO: polled, event driven and via an I/O Completion Port (IOCP). This makes RIO pretty flexible and allows it to be used in many different server designs. The polled scenario is likely aimed at very high performance UDP or High Frequency Trading style situations where you may be happy to burn CPU so as to process inbound datagrams as fast as possible. The event driven style may also help here, allowing you to wait efficiently rather than spin, but it's the IOCP style that interests me most at present as it promises increased performance for more general purpose networking code.

Please bear in mind the caveats from my last blog post: this stuff is new, I'm still finding my way, the docs aren't in sync with the headers in the SDK and much of this is based on assumption and intuition.

How do RIO and IOCP work together?

RIO's completions arrive via a completion queue, which is a fixed size data structure that is shared between user space and kernel space (via locked memory?) and which does not require a kernel mode transition to dequeue from (see this BUILD video for more details on RIO's internals). As I showed last time, you specify how you want to retrieve completions when you create the queue: either providing an event to be signalled, an IOCP to be posted to, or nothing if you will simply poll the queue. When using an IOCP you get a notification sent to you when the completion queue becomes non-empty after you have indicated that you want to receive completions by calling RIONotify().
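
Pulling that together, the IOCP plumbing looks roughly like this. The structure and function names come from the SDK headers but, given that the docs and headers are out of sync, treat the details as my current understanding rather than gospel:

   HANDLE hIOCP = ::CreateIoCompletionPort(INVALID_HANDLE_VALUE, 0, 0, 0);

   OVERLAPPED notificationOverlapped = { 0 };   // must outlive the queue

   RIO_NOTIFICATION_COMPLETION completionType;

   completionType.Type = RIO_IOCP_COMPLETION;
   completionType.Iocp.IocpHandle = hIOCP;
   completionType.Iocp.CompletionKey = 0;
   completionType.Iocp.Overlapped = &notificationOverlapped;

   g_queue = g_rio.RIOCreateCompletionQueue(RIO_PENDING_RECVS, &completionType);

   // after issuing receives, indicate that we want to be told when the
   // completion queue becomes non-empty...
   g_rio.RIONotify(g_queue);

   DWORD numberOfBytes = 0;
   ULONG_PTR completionKey = 0;
   OVERLAPPED *pOverlapped = 0;

   // ...and block on the IOCP; when this returns we drain the queue with
   // RIODequeueCompletion() and then call RIONotify() again
   ::GetQueuedCompletionStatus(hIOCP, &numberOfBytes, &completionKey,
      &pOverlapped, INFINITE);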

Most of the buzz being generated around the Windows 8 Developer Previews at the moment seems to be centred on the new Metro user interface and the Windows Runtime. Whilst both Metro and WinRT are key components of the next Windows release I find the Registered I/O Networking Extensions to be far more interesting, but then I guess I would...

What are the Registered I/O Networking Extensions?

The Registered I/O Networking Extensions, RIO, is a new API that has been added to Winsock to support high-speed networking with increased performance and lower latency and jitter. These extensions are targeted primarily at server applications and use pre-registered data buffers and completion queues to increase performance. I assume that the increased performance comes from avoiding the need to lock memory pages and copy OVERLAPPED structures into kernel space when individual requests are issued, instead relying on pre-locked buffers, fixed sized completion queues, optional event notification on completions and the ability to return multiple completions from kernel space to user space in one go.

The RIO API is pretty simple and straightforward, but servers that currently use I/O Completion Port based designs will need to change somewhat to take advantage of it, and probably not all server designs will benefit from changing. RIO relies on you registering the memory that you will use as data buffers and knowing in advance how many pending operations a given socket will have at any time. This allows it to lock the data buffers in memory once, rather than on each operation, and removes the whole concept of the OVERLAPPED structure from the user space API. Since completion queue space is also of a fixed size you're required to know how many sockets you will be allocating to a given queue and the maximum number of pending operations that these sockets will have. You can increase all of these limits after socket creation but, except for registering new data buffers, I expect that you're likely to take a performance hit for doing so.

I've been looking at the pre-release documentation and the headers from the latest Windows SDK and experimenting with the new RIO API. Note that at present the documentation is out of sync with the headers and there's little more than reference documentation so much of what I have to say about RIO is based on assumptions and intuition based on the available information and my knowledge of how I/O Completion Port based networking currently works on pre Windows 8 operating systems. In other words, don't rely on all of this to be correct.

The WebSocket protocol - Draft, HyBi 09

Due to client demand we're working on the WebSocket protocol again. Things have moved on since the work we did in December and this time the resulting option pack really will make it into the next release rather than simply being something that we tweak for each client that asks for it.

Back in December one of our gaming clients wanted WebSocket functionality in their game server so we did some work on the two versions of the spec that they wanted, the Hixie 76 draft and the HyBi 03 draft. The protocol has since gone through some changes and is now looking to be stabilising once again with the HyBi 09 draft.

The new version of the protocol still has a few outstanding questions, though, such as whether the existing design of the deflate-stream extension is actually worth having in the presence of frame data masking, see here... So we expect that there will be several more revisions of our implementation as the standard solidifies and our client's usage patterns emerge.

I'm in the process of completing a custom server development project for a client. The server deals with connections from thousands of embedded devices and allows them to download new firmware or configuration data and upload data that they've accumulated in their local memory. To enable maximum scalability the server uses asynchronous reads and writes to the file system as well as the network.

One feature of the server is the ability to configure it to create per session log files. This helps in debugging device communication issues by creating a separate log file for each device every time it connects. You can easily see the interaction between a specific device and the server, including full dumps of all data sent and received if desired. Again, the log system uses asynchronous file writes to allow us to scale well.

With some misconfiguration of the server and a heavy load (8000+ connected devices all doing data uploads and file downloads with full logging for all sessions) I managed to put the system into a mode whereby it was using non-paged pool in an uncontrolled manner. Each data transmission generated several log writes, all log write completions went through a single thread and we were writing to over 16,000 files at once. Each asynchronous write used a small amount of non-paged pool, but I was issuing writes too fast, and so I watched the non-paged pool usage grow to over 2GB on my Windows 7 development box and then watched the box fall over as various drivers failed due to lack of non-paged pool. Not a good design.

Back in 2009 I wrote about how I had added the ability to restrict the number of pending writes on an instance of the CAsyncFileWriter class. The idea being that in some logging situations you can generate log lines faster than you can write them, and if you have a potentially unlimited number of overlapped writes pending then you can run out of non-paged pool memory, which is a very bad situation to get into. The problem with that is that it's a limit per file writer. With this new server's per session logs we have thousands of file writers active at any one time, so limiting the number of writes that each writer can have pending isn't really enough; we need an overall limit for all of the writers to share.

Such a limiter was pretty easy to add: you simply pass an instance of the limiter in to each file writer and they all share a single limit. Profiling can show how large you can make the limit for a given box, and a general purpose "good enough for most machines" limit value is pretty easy to come up with.
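
The shape of such a limiter is straightforward. This sketch uses a Windows semaphore and invented names; it is not the framework's actual implementation:

   class CSharedWriteLimiter
   {
      public :

         explicit CSharedWriteLimiter(long maxPendingWrites)
            :  m_hSemaphore(::CreateSemaphore(0, maxPendingWrites, maxPendingWrites, 0))
         {
         }

         ~CSharedWriteLimiter() { ::CloseHandle(m_hSemaphore); }

         // called by any writer before it issues an asynchronous write;
         // blocks when the shared limit is reached, which is what makes
         // the logging effectively synchronous under pressure
         void AcquireWriteSlot() { ::WaitForSingleObject(m_hSemaphore, INFINITE); }

         // called when a write completes
         void ReleaseWriteSlot() { ::ReleaseSemaphore(m_hSemaphore, 1, 0); }

      private :

         const HANDLE m_hSemaphore;
   };

Each file writer then holds a reference to a single shared instance rather than owning a limit of its own.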

Running the badly configured server again with the new limiter showed that everything was working nicely, the server simply slowed down as the limit was reached and the asynchronous logging became, effectively, synchronous. The non-paged pool memory usage stayed reasonable and the server serviced all of the clients without problem.

The changes to the CAsyncFileWriter and the new limiter will be available in the next release, 6.4, which currently doesn't have a release date.

Automatic crash dump creation

The next release of The Server Framework, 6.4, includes code which allows a server to create a crash dump whilst it is running. Typically you might want to do this if some strange and possibly fatal exception gets thrown. Crash dumps are a great way to debug a server that is failing in production, the server generates the dump when something goes wrong and you can then load up the dump on your development machine and your debugger will be sitting on the line that caused the problem. You can read all about crash dumps here.

Being able to enable automatic crash dump creation in a server means that you don't have to rely on external tools to generate the dumps for you. By simply adjusting your Config.h you can generate a crash dump whenever an SEH exception is caught. Your server may be able to continue to run, but you're generating diagnostic data that will be useful in preventing those exceptions in future. Likewise you can enable crash dump generation from purecalls: simply install the purecall handler and adjust your Config.h to enable the dump creation.

Of course you can also use the crash dump generation code in other areas of your code if you need to. You can tune how much data is included in the dump, and hence the size of the resulting file, and you can determine where the files are generated. And, of course, the source is available so you can tweak the crash dump generation if you want to.
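
Under the hood this kind of dump generation is built on the DbgHelp MiniDumpWriteDump() API. Here's a minimal sketch of the raw API usage, with a fixed file name and none of the framework's integration or configurability; call it from an SEH filter or install it with SetUnhandledExceptionFilter():

   #include <windows.h>
   #include <dbghelp.h>

   #pragma comment(lib, "dbghelp.lib")

   LONG WINAPI DumpHandler(EXCEPTION_POINTERS *pExceptionPointers)
   {
      HANDLE hFile = ::CreateFile(TEXT("server.dmp"), GENERIC_WRITE, 0, 0,
         CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);

      if (hFile != INVALID_HANDLE_VALUE)
      {
         MINIDUMP_EXCEPTION_INFORMATION info;

         info.ThreadId = ::GetCurrentThreadId();
         info.ExceptionPointers = pExceptionPointers;
         info.ClientPointers = FALSE;

         // MiniDumpNormal keeps the file small; pass a larger MINIDUMP_TYPE
         // to include more data at the cost of a bigger file
         ::MiniDumpWriteDump(::GetCurrentProcess(), ::GetCurrentProcessId(),
            hFile, MiniDumpNormal, &info, 0, 0);

         ::CloseHandle(hFile);
      }

      return EXCEPTION_EXECUTE_HANDLER;
   }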

Once you are building code that can generate crash dumps you will probably want to set up a symbol server to store the exe and pdb files for every release that you create so that the dumps that your clients send to you can be used to debug properly. A good introduction to setting up and using a symbol server can be found here.

The CIOPool class's constructor takes a ThreadIdentifier value which is used to allow the threads in the pool to know that they're in the pool. The value is stored in thread local storage for each thread in the pool and can then be examined to determine if a thread belongs to the I/O pool. The value can also be queried and other threads can 'be treated as I/O pool threads' by setting the appropriate TLS slot to this value.

This is all used to allow us to skip the marshalling of I/O requests into the I/O pool if the request is being issued by a thread that's already in the pool. This allows the code to run slightly faster on pre-Vista machines in some circumstances.
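
In outline the mechanism is just thread local storage; the variable names below are illustrative rather than the framework's own:

   // during pool thread start-up; g_tlsSlot comes from a call to TlsAlloc()
   // and poolId is the pool's (non-zero) thread identifier
   ::TlsSetValue(g_tlsSlot, reinterpret_cast<void *>(poolId));

   // later, when an I/O request is issued, we check whether the calling
   // thread is one of the pool's threads and, if it is, skip marshalling
   // the request into the pool
   const bool onPoolThread =
      (reinterpret_cast<ULONG_PTR>(::TlsGetValue(g_tlsSlot)) == poolId);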

Unfortunately the value passed in is never checked for zero, and so incorrect configuration could lead to a thread pool which has zero as its thread identifier. This is highly unlikely, as most pools are never specifically configured with a thread identifier and simply use the default. If a thread identifier is set to zero then all threads will appear to be part of the I/O pool, as an uninitialised TLS slot contains zero. This means that NO I/O requests are marshalled, even on platforms where not doing so could cause problems.

In 6.4 I've removed the thread identifier from the constructor; it's now automatically generated and guaranteed to be non-zero. I've also changed the interface for how a thread acts as part of the I/O pool. This removes the design bugs and prevents incorrect configuration from causing I/O operations to be aborted if the thread that issued them exits before they complete (on pre-Vista systems).

Performance, allocators, pooling and 6.4

I've been doing a lot of performance related work over the past few months. Much of it originally stemmed from work that I've been doing for our Online Game Company client. They want the asynchronous ENet-based server that we've built for them to be the fastest it can be, and so I've been looking at all aspects of performance within the framework. The testing hasn't brought anything too embarrassing to light; the I/O recursion limiter that was introduced in 6.3 affected performance more than I would have liked, but that's now fixed, and they weren't using it anyway.

One Million TCP Connections...

We get the software and then we hold the company to ransom for .... ONE MILLION TCP CONNECTIONS!
It seems that C10K is old hat these days and that people are aiming a little higher. I've seen several questions on StackOverflow.com (and had an equal number of direct emails) asking about how people can achieve one million active TCP connections on a single Windows Server box. Or, in a more roundabout way, what is the theoretical maximum number of active TCP connections that a Windows Server box can handle?

I always respond in the same way; basically I think the question is misguided and even if a definitive answer were possible it wouldn't actually be that useful.

What the people asking these questions seem to ignore is that with a modern Windows Server operating system, a reasonable amount of RAM and a decent enough set of CPUs, it's likely that it's YOUR software that is the limiting factor in the equation.

The WebSocket protocol

I've spent the last few days implementing the WebSocket protocol (well, two versions of the draft standard, actually) and integrating it into an existing server for one of our clients. This has proved to be an interesting exercise. The protocol itself is pretty simple but, as ever, the devil is in the detail. I now have server side code that deals with both the Hixie 76 draft and the HyBi 03 draft of the protocol. Once the initial handshake (which is pretty much just HTTP) is out of the way the two drafts deal in terms of frames of data rather than a simple byte stream. The library that I've developed accumulates these frames in an I/O buffer until they're complete and then dispatches them to the layer of code above using a callback interface. Thus your server simply sits and waits for frames to arrive and then sends out frames of its own.

My approach to bugs

As the recent spate of bug fix and patch releases shows, I'm not scared of talking about the bugs that I find in the code of The Server Framework and pushing fixes out quickly. It's my belief that the most important thing to get out of a bug report is an improved process which will help prevent similar bugs from occurring in future, and the only way to achieve that is to be open about the bugs you find and equally open about how you then address them and try to prevent similar issues. Every bug is an opportunity to improve. Sometimes I wish I had fewer opportunities...

As I mentioned last time, supporting a large number of concurrent connections on a modern Windows operating system is reasonably straightforward if you get your initial design right: use an I/O Completion Port based design, minimise context switches, data copies and memory allocation and avoid lock contention... The Server Framework gives you this as a starting point and you can often use one of the many complete, fully functional, real world example servers to provide you with a whole server shell, complete with easy performance monitoring and SSL security, where you simply have to fill in your business logic.

Unfortunately it can be all too easy to squander the scalability and performance that you have at the start with poor design choices along the way. These design choices are often quite sensible when viewed from a single threaded or reasonably complex desktop development viewpoint but can be performance killers when writing scalable servers. It's also easy to over engineer your solution because of irrational performance fears; the over-engineering takes time and delays your delivery, and can often add complexity which then causes maintenance issues for years after. There's a fine line to walk between the two and I firmly believe that the only way to walk this line is to establish realistic performance tests from day 0.

Step one on the road to a high performance server is obviously to select The Server Framework to do your network I/O ;)

Step two is to get a shell of an application up and running so that you can measure its performance.

Step three is, of course, to measure.

There is no excuse for getting to your acceptance testing phase only to find that your code doesn't perform adequately under the desired load. What's more, at that point it's often too late to do anything about the problem without refactoring reams of code. Even if you have decent unit test coverage, refactoring towards performance is often a painful process. The various inappropriate design decisions that you can make tend to build on each other and the result can be difficult to unpick. The performance of the whole is likely to continue to suffer even as you replace individual poor performing components.

So the first thing you should do once you have your basic server operating, even if all it does is echo data back, is to build a client that can stress it and which you can develop in tandem with the server to ensure that real world usage scenarios can scale. The Server Framework comes with an example client that provides the basic shell of a high performance multiple client simulator. This allows you to set up tests to prove that your server can handle the loads that you need it to handle.

Starting with a basic echo server you can first base line the number of connections that it can handle on a given hardware platform. Then as you add functionality to the server you can performance test real world scenarios by sending and receiving appropriate sequences of messages. As your server grows in complexity you can ensure that your design decisions don't adversely affect performance to the point where you no longer meet your performance targets.

For example, you might find that adding a collection of data which connections need to access on every message causes an unnecessary synchronisation point across all connections which reduces the maximum number of active connections that you can handle from vastly above your performance target to somewhere very close to it... Knowing this as soon as the offending code is added to the code base means that the redesign (if deemed required) is less painful. Tracking this performance issue down later on and then fixing it might be considerably harder once the whole server workflow has come to depend on it.

I'm a big fan of unit testing and agile development and so I don't find this kind of 'incremental' acceptance testing to be anything but sensible and, in the world of high performance servers, essential.

You can download a compiled copy of the Echo Server test program from here, where I talk about using it to test servers developed using WASP.

Of course the key to this kind of testing is using realistic scenarios. When your test tools are written with as much scalability and performance as the server under test it's easy to create unrealistic test scenarios. One of the first problems that clients using the echo server test program to evaluate the performance of example servers have is that of simulating too many concurrent connection attempts. Whilst it's easy to generate 5000 concurrent connections and watch most servers fail to deal with them effectively it's not usually especially realistic. A far more realistic version of this scenario might be to handle a peak of 1000 connections per second for 5 seconds, perhaps whilst the server is already dealing with 25,000 concurrent connections that had arrived at a more modest connection rate. Likewise it's easy to send messages as fast as possible but that's often not how the server will actually be used. The Echo Server test program can be configured to establish connections and send data at predetermined rates which helps you build more realistic tests.

You should also be careful to make sure that you're not, in fact, simply testing the capabilities of the machines being used to run the test clients, or the network bandwidth between them and the server. With the easy availability of cloud computing resources such as Amazon's EC2 it's pretty easy to put together a network of machines to use to load test your server.

Once you have a suitable set of clients, each running a reasonable number of connections, you can begin to stress your server with repeatable, preferably scripted, tests. You can then automate the gathering of results using perfmon and your server's performance counters mixed in with the standard system counters.

Personally I tend to do two kinds of load tests. The first is to prove that we can achieve the client's target performance for the desired number of connections on given hardware. The second is to see what happens when we drive the server to destruction. These destruction tests are useful to know what kind of gap there is between target performance and server meltdown and also to ensure that server meltdown is protected against, either by actively limiting the number of connections that a server is willing to accept or by ensuring that performance degrades gracefully rather than spectacularly.

Knowledge is power, and when aiming to build super scalable, high performance code you need to gather as much knowledge as you can by measuring and performance testing your whole system from the very start. 

Using a modern Windows operating system it's pretty easy to build a server system that can support many thousands of connections if you design the system to use the correct Windows APIs. The key to server scalability is to always keep in mind the Four Horsemen of Poor Performance as described by Jeff Darcy in his document on High Performance Server Architecture. These are:

  • Data copies
  • Context switches
  • Memory allocation
  • Lock contention
I'll look at context switches first, as IMHO this is where outdated designs often rear their heads first. On Windows systems you MUST be using I/O Completion Ports and overlapped I/O if you want the best scalability. Using the IOCP API correctly can mean that you can service thousands of concurrent connections with a handful of threads.

So, we'll assume we're using a modern overlapped I/O, IO Completion Port based design; something similar to what The Server Framework provides, perhaps. Using an IOCP allows you to limit the number of threads that are eligible to run at the same time and the IOCP uses a last in, first out mechanism to ensure that the thread that was last active and processing is always the next one to be given new work, thus preventing a new thread's memory needing to be paged in. 

The latest versions of Windows allow you to reduce context switching even further by enabling various options on each socket. I've spoken in detail about how FILE_SKIP_COMPLETION_PORT_ON_SUCCESS can help to reduce context switching on my blog, here. The gist is that with this option enabled you can process overlapped operation completions on the same thread that issued the operation if the operation can complete at the time it's issued. This removes unnecessary context switching. The Server Framework supports this mode of operation on operating systems that support it.
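
The API usage itself is simple. A sketch, with the error handling and the surrounding framework machinery elided:

   // opt the socket out of IOCP notifications for operations that
   // complete immediately
   ::SetFileCompletionNotificationModes(
      reinterpret_cast<HANDLE>(s),
      FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

   DWORD bytesRecvd = 0;
   DWORD flags = 0;

   if (0 == ::WSARecv(s, &buf, 1, &bytesRecvd, &flags, &overlapped, 0))
   {
      // the read completed in-line and, with the flag set, no completion
      // will be queued to the IOCP; process the data on this thread
   }
   else if (WSA_IO_PENDING != ::WSAGetLastError())
   {
      // a real error; handle it
   }
   // otherwise the completion will arrive via the IOCP as usual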

Next, on to data copies. As Jeff says in his document, one of the best ways to avoid data copies is to use reference counted buffers and to manage them in terms of the amount of data present in them, building buffer chains where necessary. The Server Framework has always worked in terms of flexible, reference counted buffers. Many server designs can benefit from accumulating inbound data into a single buffer with no buffer copying required, simply by reissuing a read on a connection with the buffer that was passed to you when the previous read completed. In this way it's easy to accumulate 'complete messages' and process them without needing to copy data.

Since the buffers are reference counted you can easily pass them off to other threads for processing, or keep them hanging around until you're done with them. The CMultiBufferHandle class allows you to use scatter/gather I/O so that you can transmit common blocks of data without needing to copy the common data and the CBufferHandle class allows you to broadcast data buffers to multiple connections without needing to copy the data for each connection.

Memory allocation during connection processing is minimal and custom buffer allocators can reduce this even further. The Server Framework ships with several different allocators and it's easy to implement your own if you need to and simply plug it in to the framework code. By default the buffer and socket allocators pool memory for reuse which helps reduce contention and improve performance.

Once you're processing your messages several tricks can be employed to optimise your memory allocation use, a favourite of mine is to use custom memory allocators that use scratch space in the buffer that the message was initially read into. This can then be used to provide all of the dynamic memory needed during message processing and avoids the need for traditional memory management and its potential lock contention.  
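
A sketch of the idea, with alignment handling and the fall-back to the heap omitted; this shows the shape of such an allocator rather than the framework's actual one:

   // Carves per-message working memory out of spare space at the end of
   // the buffer that the message arrived in. Nothing is freed individually;
   // the space is reclaimed when the reference counted buffer that owns it
   // is returned to its pool.
   class CScratchAllocatorSketch
   {
      public :

         CScratchAllocatorSketch(char *pScratch, size_t available)
            :  m_pNext(pScratch), m_available(available)
         {
         }

         void *Allocate(size_t size)
         {
            if (size > m_available)
            {
               return 0;   // out of scratch space
            }

            void *pMem = m_pNext;

            m_pNext += size;
            m_available -= size;

            return pMem;
         }

      private :

         char *m_pNext;
         size_t m_available;
   };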

Lock contention within the framework itself is limited. The design of the socket object is such that you can design a server where there's never any need to access a common collection of connection objects simply to obtain per connection data from it. Each connection can have as much user data as you like associated with it and this can all be accessed from the connection without the need for any locks. The locks in the buffer allocators are probably the most likely locks to result in contention but here you can select from several different strategies, including a lock free allocator. 

Of course once you're out of the framework code and into your own code you still have to be careful not to fall foul of the Four Horsemen of Poor Performance, but you can rest assured that The Server Framework is designed very much with these issues in mind. I believe that the only way to make sure that you maintain scalability and performance is to test for it at all stages of development; it's important to set up scalability tests from day 1 and to run them regularly using real world usage scenarios. I'll talk about how to go about setting up these kinds of tests in a later blog post.

Testing complex server code

As I mentioned in the release notes for v6.3 here, I've added some code to prevent potential recursion issues if certain performance improvements are enabled.

In Windows Vista and later it's possible to set the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS flag on a socket using SetFileCompletionNotificationModes(). When this flag is set an overlapped operation can complete "in-line" and the completion operation can be handled on the thread that issued the operation rather than on one of the threads that is servicing the IO completion port that is associated with the socket. This is great as it means that if, for example, data is already available when an overlapped read is issued then we avoid a potentially costly context switch to an I/O thread to handle this data. The downside of this is that the code for handling overlapped completions becomes potentially recursive. If we issue a read and it completes straight away and is handled on the thread that issued it then the code that handles the read completion is likely to issue another read which itself may complete "in-line", etc. With a suitable rate of supply of inbound data this can lead to stack overflows due to unconstrained recursion.
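
The guard is conceptually simple: track the completion handling depth on each thread and break the recursion by dispatching via the IOCP once a limit is reached. The sketch below uses invented helper names and none of the framework's configuration switches; it shows the idea rather than the actual implementation:

   void PostCompletionToIOCP();          // hypothetical helpers, for the
   void ProcessDataAndIssueNextRead();   // purposes of this sketch only

   static const int MAX_IO_RECURSION_LIMIT = 10;

   __declspec(thread) int t_recursionDepth = 0;

   void OnReadCompleted()
   {
      if (++t_recursionDepth > MAX_IO_RECURSION_LIMIT)
      {
         // too deep; hand the completion to the IOCP so that processing
         // continues on an I/O thread with a fresh stack
         PostCompletionToIOCP();
      }
      else
      {
         // process the data and issue the next read, which may complete
         // in-line and recurse back into OnReadCompleted()
         ProcessDataAndIssueNextRead();
      }

      --t_recursionDepth;
   }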

Some thoughts on that two thread pool server design

I'm currently re-reading "High Performance Server Architecture" by Jeff Darcy and he has a lot of sensible stuff to say about avoiding context switches and about how my multiple thread pool design, whilst conceptually good, is practically not so good. In general I agree with him, but often the design provides good enough performance and it's easy to compose from the various classes in The Server Framework.

Explicitly managing the number of threads that can run at once, using a semaphore that allows no more threads than you have cores to do work at any one time, is a nice idea, but one that adds complexity to the workflow, as you need to explicitly acquire and release the semaphore around your blocking operations. This approach, coupled with a single thread pool with more threads than you have processors, would likely result in fewer context switches and higher performance.

I'm currently accumulating ideas for the performance work that I have scheduled for the 6.4 release; I expect a single pool design with a running threads limiter will feature...

Using OpenSSL with Asynchronous Sockets

OpenSSL is an open source implementation of the SSL and TLS protocols. Unfortunately it doesn't play well with Windows-style asynchronous sockets. This article - previously published in Windows Developer Magazine and now available on the Dr. Dobb's site - provides a simple connector that enables you to use OpenSSL asynchronously.

Integrating OpenSSL with asynchronous sockets is similar to integrating it with overlapped I/O and I/O Completion Port based designs, and so the ideas behind the code discussed in the article were used as part of the original design for The Server Framework's OpenSSL Option Pack.

Changes to the CLR Hosting Tools library in 6.3

One of my clients has recently required .Net 4.0 hosting support and so most of the changes in the CLR Hosting Tools library in 6.3 have been driven by them.

Changes to the Service Tools library in 6.3

The development of WASP has been acting as a bit of an internal driver for new feature development in the 6.3 release of The Server Framework. Sitting down to develop a service that was easy to use for a mass market exposed some small holes in the 6.2 release; nothing too serious, but pretty soon after putting together the first service shell of the WASP application I had a list of nice-to-have additions for the Service Tools Library.

Follow us on Twitter: @ServerFramework

About this Archive

This page is an archive of recent entries in the Development category.


I usually write about the development of The Server Framework, a super scalable, high performance, C++, I/O Completion Port based framework for writing servers and clients on Windows platforms.
