New features Archives

I hinted at the end of the last post that the 6.7 release might increase performance a little. Well, whilst the bulk of the changes in 6.7 are purely code cleaning and the removal of legacy support, there is a fairly major functional change as well.

In most situations, references or pointers to I/O buffers have been replaced with smart pointers. This change may cause some issues during an upgrade, as you need to change some function signatures from IBuffer references to CSmartBuffers. The advantage is that in many servers there will no longer be any need for buffer reference counting during normal I/O operations.

The Server Framework relies on reference counting to keep the objects that are used during the asynchronous operations alive until those operations complete. So we increment a counter on the socket object and also on the buffer object when we initiate an operation and then decrement the counters when the operation completes. I'm sure there are other ways to manage the object lifetime but this has worked well for us.
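
In outline the pattern looks something like this. This is a minimal sketch with purely illustrative names rather than the framework's actual classes; the point is simply that every issued operation takes a reference on the socket and the buffer and every completion releases them:

```cpp
#include <atomic>

// Illustrative only: take a reference on the socket and buffer when the
// operation is issued, release them when the completion is processed.
struct Socket { std::atomic<long> refs{1}; void AddRef() { ++refs; } void Release() { --refs; } };
struct Buffer { std::atomic<long> refs{1}; void AddRef() { ++refs; } void Release() { --refs; } };

void IssueRead(Socket &socket, Buffer &buffer)
{
   socket.AddRef();              // keep the socket alive until the I/O completes
   buffer.AddRef();              // likewise for the buffer
   // ... start the overlapped read here ...
}

void OnReadCompleted(Socket &socket, Buffer &buffer)
{
   // ... process the data in the buffer here ...
   buffer.Release();             // each increment is matched by a decrement;
   socket.Release();             // these interlocked operations are the cost in question
}
```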

The problem is that these increments, although they look like cheap operations, can be quite expensive, especially on NUMA hardware.

Whilst there's not much we can do about the reference count on the socket object, the buffer doesn't really need to be reference counted most of the time. Or, more to the point, the initial reference can (and should) be passed along rather than each stage taking and releasing its own reference. You generally only want to access a buffer from one thread at a time, so you allocate it, issue an operation and pass the reference you hold off to that operation. When the operation completes the completion code captures the reference and takes ownership of it, and then the buffer can be processed. If you're lucky you can then use the same buffer for a response to the operation and pass it back to the framework again.

This requires a few changes to your code but it's fairly straightforward. Your OnReadCompleted() handler will give you a CSmartBuffer and if you want to retain ownership of it after the handler returns then you simply need to detach the buffer from the CSmartBuffer you were given.
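
In outline, that looks something like the sketch below. This is only a rough illustration based on the description above; the exact handler signature, the name of the detach call and the helpers ProcessData() and SendResponse() are placeholders, not the framework's documented API:

```cpp
// A rough sketch only; exact signatures depend on your server code.
void CMyConnection::OnReadCompleted(
   IStreamSocket &socket,
   CSmartBuffer &buffer)
{
   // Whilst the smart pointer owns the buffer we can process it as before...
   ProcessData(socket, buffer);

   // ...and if we want to keep the buffer alive after this handler returns
   // (to reuse it for the response, say) we detach it and take over the
   // reference, rather than taking and releasing another one.
   IBuffer *pBuffer = buffer.Detach();

   SendResponse(socket, pBuffer);
}
```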

This is only "potentially faster" as it really depends on the structure of your server and how you deal with our I/O buffers, but the framework is no longer standing in the way of this kind of optimisation, and we've removed a couple of reference adjustments in the normal operation flow.

Another release is coming...

We've only just shipped Release 6.6.5 of The Server Framework but we already have another release that's just about to ship. This isn't because some horrible bug has slipped through our testing, it's because we've been planning to produce a 'clean up' release for some time. 6.7 is that release.

Let's be straight here: 6.7 is a release for us more than for you. The aim is to simplify our build/test and release process and remove dead code, whilst introducing no new bugs and removing no functionality that you rely on.

So what does 6.7 achieve? Well, for a start we drop support for Visual Studio 2005 and 2008 and also for Windows XP. Removing support for these legacy compilers and operating systems means that we can remove all the code that was required just to support them. This massively simplifies our code base without removing anything that the code actually needs in order to run on modern operating systems.

Windows Vista introduced massively important changes to asynchronous I/O and we have supported these changes for a long time (over 8 years!). The hoops we had to jump through to make code running on Windows XP behave correctly were complex. For example, Windows XP would cancel outstanding I/O requests if the thread that issued them exited before the I/O request completed. We had a marshalling system in place to ensure that I/O operations were only ever executed on threads that we controlled, so that you'd never be faced with unexpectedly cancelled operations. All of that can go now.

Removing XP also means we no longer need to maintain an XP machine in our build farm. It's one less configuration that needs to be built and tested before a release.

Dropping support for VS2005 and 2008 removes 4 complete sets of builds (x86 and x64 for each compiler) plus all of the conditional code that was required to support the older compilers. At last we can start moving towards a slightly more modern C++, perhaps.

Some old code has been removed; there's no need, on modern operating systems, to share locks. This worked really well back in the day, but, well, we were running on Windows NT at the time and resources were much more limited than they are now. All of the "Shared Critical Section" code is now gone. This has knock-on effects in the Socket Tools library, where all of the shared lock socket code has been removed. Nobody should be using that in 2016 anyway! You can no longer set a critical section's spin count in the socket allocator; it never really worked anyway, as the lock was used for too many different things.

Some experimental code has also been removed: the TLS and Low Contention buffer allocators are gone. The horrible "dispatch to all threads" kludge has been removed from the I/O pools (it was only there to support pre-Vista CancelIo() calls, which are no longer needed now that we have CancelIoEx()).

The original callback timer queue that was based on GetTickCount() and which spawned Len's "Practical testing" series of blog posts (back in 2004!) has gone. There's no need for the complexity when all supported operating systems have GetTickCount64().
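
The difference is simply that GetTickCount() returns a 32-bit millisecond count that wraps after roughly 49.7 days, so all of the expiry arithmetic had to cope with wrap-around, whereas with GetTickCount64() a timer queue can compute expiry times directly. A minimal illustration:

```cpp
#include <windows.h>

// With a 64-bit tick count there is no wrap-around to worry about, so expiry
// calculations become trivial. With the 32-bit GetTickCount() both of these
// functions would need careful handling of counter wrap.
ULONGLONG CalculateExpiryTime(DWORD timeoutMillis)
{
   return ::GetTickCount64() + timeoutMillis;
}

bool HasExpired(ULONGLONG expiryTime)
{
   return ::GetTickCount64() >= expiryTime;
}
```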

Finally, we've slimmed down our set of example servers, removing servers which didn't add much value or which duplicated other examples. Again, this speeds up our release process by shortening the build and test stage, as there are fewer servers to build and fewer tests to run.

So, what's in it for you? Well, a faster build/test/release cycle, so new functionality and bug fixes can be released more quickly, and potentially faster code in some circumstances. There's no great rush to upgrade if you don't want to, but we'll be focusing on the 6.7 code base going forwards.

New option pack: Streaming Media

We have a new Option Pack, The Streaming Media Option Pack. This allows you to easily add streaming of H.264 and MPEG audio and video to your clients and servers using RTSP, RTP and RTCP.

With more and more Internet of Things devices supporting rich media streaming for remote monitoring, it's becoming essential to be able to manage these media streams within your device management servers and clients. Whether it's recording device streams for later analysis or arbitrating between multiple clients and devices, manipulating streaming media is becoming more and more important.

As always, this Option Pack integrates seamlessly with The Server Framework's Core Framework and other options and allows you to quickly and easily add rich media support.

UDP flow control and asynchronous writes

I don't believe that UDP should require any flow control in the sending application. After all, it's unreliable and it should be quite OK for any stage of the route from one peer to another to decide to drop a datagram for any reason. However, it seems that, on Windows at least, no datagrams will be dropped between the application and the network interface card (NIC) driver, no matter how heavily you load the system.

Unfortunately most NIC drivers also prefer not to drop datagrams, even if they're overloaded (see here for details of how UDP checksum offloading can cause a NIC to make non-paged pool usage considerably higher). This can lead to situations where a user mode application can bring a box down due to non-paged pool exhaustion simply by sending as many datagrams as it can as fast as it can. It's likely that poorly implemented device drivers are actually at fault here, by failing to handle gracefully the situations where non-paged pool allocations fail, but it's the application that is putting these drivers into a situation where they could fail in such a catastrophic manner.

Since the NIC driver and the operating system will not drop datagrams it's down to the application itself to do so if it senses that it's overloading the NIC. I've recently added code to The Server Framework to allow you to configure behaviour like this so that an application can prevent itself from exhausting non-paged pool due to pending datagram writes.
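
The details of the framework's configuration hooks aside, the underlying idea is simple enough to sketch: cap the number of datagram sends that can be pending at once and drop (or defer) new sends once the cap is reached. The class below is purely illustrative, not the framework's actual code:

```cpp
#include <atomic>

// A minimal sketch of sender-side UDP flow control: the write completion
// handler calls Release(), so under sustained overload the application sheds
// datagrams itself rather than piling them up in non-paged pool.
class CPendingWriteLimiter
{
public:
   explicit CPendingWriteLimiter(long maxPendingWrites)
      : m_maxPendingWrites(maxPendingWrites), m_pendingWrites(0) {}

   // Call before issuing an asynchronous send; returns false if the send
   // should be dropped (or deferred) because too many writes are pending.
   bool TryAcquire()
   {
      if (m_pendingWrites.fetch_add(1) >= m_maxPendingWrites)
      {
         m_pendingWrites.fetch_sub(1);
         return false;
      }
      return true;
   }

   // Call from the write completion handler.
   void Release() { m_pendingWrites.fetch_sub(1); }

private:
   const long m_maxPendingWrites;
   std::atomic<long> m_pendingWrites;
};
```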

Release 6.6 of The Server Framework includes some breaking changes to the IService, IServiceCallbacks and IShutdownService interfaces. Many functions now return a ServiceTools::ExitCode, either directly or via a reference parameter, which allows you to fine-tune the exit code returned from your service under failure conditions. This exit code is reported to the Service Control Manager (SCM) when your service shuts down and is also returned from the exe if you run the service as a normal exe. These changes allow finer control of your service but can easily be ignored completely if you want things to stay the way they were.

Other functions take slightly different parameters.

Slightly more efficient locking

Another performance improvement in the forthcoming 6.6 release is due to a change in our default choice of locking primitive on most platforms. Note that the performance improvement is small and, according to our testing, it doesn't materialise on all hardware (though no performance degradation is seen either).

The change is to switch from using CRITICAL_SECTION objects to using Slim Reader/Writer Locks in exclusive (write) mode. You can read about the differences between these two locks in Kenny Kerr's MSDN article here. This change can't be applied to all uses of our CCriticalSection class as SRW locks are not recursive, so we have a whole new locking class hierarchy: the new CLockableObject locks use SRW locks on platforms that support them and drop back to using a CRITICAL_SECTION on XP and earlier. Then there's a CReentrantLockableObject which is, basically, a CCriticalSection with a new name and a slightly new interface. There are also classes for locks which track the thread that owns them (just so that they can tell you if you currently have them locked), as we use that functionality in a couple of places in The Server Framework to optimise code paths.
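
For reference, the raw Win32 primitive behind the change looks like this. This is just an illustration of the underlying API in exclusive mode, not the framework's CLockableObject class itself:

```cpp
#include <windows.h>

// A Slim Reader/Writer lock used purely in exclusive mode behaves like a
// lighter-weight, non-recursive critical section.
class CExclusiveSrwLock
{
public:
   CExclusiveSrwLock() { ::InitializeSRWLock(&m_lock); }

   CExclusiveSrwLock(const CExclusiveSrwLock &) = delete;
   CExclusiveSrwLock &operator=(const CExclusiveSrwLock &) = delete;

   void Lock()   { ::AcquireSRWLockExclusive(&m_lock); }
   void Unlock() { ::ReleaseSRWLockExclusive(&m_lock); }

private:
   SRWLOCK m_lock;   // no kernel object, no spin count and, importantly, NOT recursive
};
```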

The new locks give a slight improvement in the time taken to acquire them and use fewer resources. They haven't been fully integrated into all libraries yet (in particular they are not used throughout the Socket Tools library) and so the full effect of these changes cannot yet be appreciated.

I've been working on a "big" new release for some time, too long actually. It has steadily been accumulating new features for over a year, but the arrival of my second son in July last year and masses of client work have meant that it has repeatedly been pushed onto the back burner. Well, no more: Release 6.6 is now in the final stages of development and testing (so I won't be adding more new features) and hopefully will see a release in Q2.

I'm planning a "what's new in 6.6" blog posting which will detail all of the major changes, but first I'd like to show you the results of some performance tuning that I've been doing. Most people are familiar with the quote from Donald Knuth, "premature optimization is the root of all evil", and it's often used as a stick to beat people with when they want to tweak low level code to "make things faster". Yet the full quote is more interesting: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." I like to think that much of what I've done in 6.6 is in that 3%, but even if it isn't it's still optimisation that comes for free to users of the framework. What's more, the first change also removes complexity and makes it much easier to write correct code using the framework.

Since explaining the changes is pretty heavy going, let's jump to the pictures of the results first.

Performance of a 6.5.9 OpenSSL Server

This is the "before" graph: a v6.5.9 OpenSSL server. [Graph: 6.5.9-OpenSSLServerPerf.png]

Performance of a 6.6 OpenSSL Server

And this is the "after" graph: a v6.6 server. [Graph: 6.6-OpenSSLServerPerf.png]

The important things are the red and purple lines (bytes processed/sec), where higher values are better. The next most important are the faint dotted lines (thread context switches), where lower values are better.

What the graphs above show is that with the new changes in 6.6 a server can process more data in less time with fewer thread context switches.

These tests were run on a dual quad-core Xeon E5320 and pushed the box to around 80% CPU usage, with the 6.6 test using slightly more CPU but for much less time. The results have been far less dramatic, but still positive, on our Core i7-3930K (6 cores, 12 threads), but it's hard to push that box above 30% CPU utilisation.

Some stream protocols have the concept of 'out of band' (OOB) data. This is a separate logical communication channel between the peers which enables data that is unrelated to the current data in the stream to be sent alongside the normal data stream. This is often a way for some data to jump ahead of the normal stream and arrive faster than if it had been delivered in the normal data stream.

Winsock supports out of band data in a protocol-independent way (see here), but accessing it from networking code that uses overlapped I/O rather than the old-fashioned BSD API is somewhat under-documented. By default, out of band data does not appear in the normal data stream; you have to read it explicitly by setting MSG_OOB in the flags of a call to WSARecv(). For non-overlapped I/O designs you can use WSAAsyncSelect() or select() to explicitly check for the presence of out of band data. With overlapped I/O your options appear limited; it seems that you should be able to use an overlapped WSAIoctl() call with SIOCATMARK, but this returns immediately whether OOB data is present or not, rather than waiting for OOB data to become available.

The solution is to post a separate, out of band, overlapped WSARecv() passing the MSG_OOB flag. This will only return on socket closure or when out of band data arrives. By using a distinct indicator in your 'per operation data' you can distinguish this read from normal reads and deal with it accordingly. Once you have processed the special out of band data read you can then post another to read subsequent out of band data.
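
A minimal sketch of issuing such a read is shown below. The per-operation data layout is simplified, error handling is minimal and the completion handling via your I/O completion port is omitted:

```cpp
#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

// The OVERLAPPED structure is embedded in a 'per operation data' block so that
// the completion handler can tell an out-of-band read apart from a normal one.
struct PerOperationData
{
   OVERLAPPED overlapped;
   bool isOutOfBandRead;
};

bool IssueOutOfBandRead(SOCKET s, char *pBuffer, ULONG bufferSize, PerOperationData *pData)
{
   ZeroMemory(&pData->overlapped, sizeof(pData->overlapped));
   pData->isOutOfBandRead = true;

   WSABUF buf;
   buf.buf = pBuffer;
   buf.len = bufferSize;

   DWORD flags = MSG_OOB;     // this read completes only when OOB data arrives
                              // (or when the socket is closed)
   const int result = ::WSARecv(s, &buf, 1, 0, &flags, &pData->overlapped, 0);

   // 0 means it completed immediately; SOCKET_ERROR with WSA_IO_PENDING means
   // the read is pending, which is the expected case.
   return (result == 0 || ::WSAGetLastError() == WSA_IO_PENDING);
}
```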

The WebSocket protocol - Draft, HyBi 09

Due to client demand we're working on the WebSocket protocol again. Things have moved on since the work we did in December and this time the resulting option pack really will make it into the next release rather than simply being something that we tweak for each client that asks for it.

Back in December one of our gaming clients wanted WebSocket functionality in their game server so we did some work on the two versions of the spec that they wanted, the Hixie 76 draft and the HyBi 03 draft. The protocol has since gone through some changes and is now looking to be stabilising once again with the HyBi 09 draft.

The new version of the protocol still has a few outstanding questions though, such as whether the existing design of the deflate-stream extension is actually worth having in the presence of frame data masking (see here). So we expect that there will be several more revisions of our implementation as the standard solidifies and our client's usage patterns emerge.

I'm in the process of completing a custom server development project for a client. The server deals with connections from thousands of embedded devices and allows them to download new firmware or configuration data and upload data that they've accumulated in their local memory. To enable maximum scalability the server uses asynchronous reads and writes to the file system as well as the network.

One feature of the server is the ability to configure it to create per session log files. This helps in debugging device communication issues by creating a separate log file for each device every time it connects. You can easily see the interaction between a specific device and the server, including full dumps of all data sent and received if desired. Again, the log system uses asynchronous file writes to allow us to scale well.

With some misconfiguration of the server and a heavy load (8000+ connected devices all doing data uploads and file downloads with full logging for all sessions) I managed to put the system into a mode whereby it was using non-paged pool in an uncontrolled manner. Each data transmission generated several log writes, all log write completions went through a single thread and we were writing to over 16,000 files at once. Each asynchronous write used a small amount of non-paged pool, but I was issuing writes too fast, and so I watched the non-paged pool usage grow to over 2GB on my Windows 7 development box and then watched the box fall over as various drivers failed due to lack of non-paged pool. Not a good design.

Back in 2009 I wrote about how I had added the ability to restrict the number of pending writes on an instance of the CAsyncFileWriter class. The idea is that in some logging situations you can generate log lines faster than you can write them, and if you have a potentially unlimited number of overlapped writes pending then you can run out of non-paged pool memory, which is a very bad situation to get into. The problem with that is that it's a limit per file writer. With this new server's per-session logs we have thousands of file writers active at any one time; limiting the number of writes that each writer can have pending isn't really enough, we need an overall limit for all of the writers to share.

Such a limiter was pretty easy to add: you simply pass an instance of the limiter into each file writer and they all share a single limit. Profiling can show how large you can make the limit for a given box, and a general purpose "good enough for most machines" limit value is pretty easy to come up with.
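
A minimal sketch of the idea, with hypothetical names rather than the framework's actual classes, might look like this; each file writer is constructed with a reference to the same limiter instance so that the cap applies across all of them:

```cpp
#include <condition_variable>
#include <mutex>

// When the cap is reached, AcquireSlot() blocks until a completion releases a
// slot, which is what makes the asynchronous logging effectively synchronous
// under load rather than letting pending writes pile up in non-paged pool.
class CSharedWriteLimiter
{
public:
   explicit CSharedWriteLimiter(size_t maxPendingWrites)
      : m_maxPendingWrites(maxPendingWrites), m_pendingWrites(0) {}

   void AcquireSlot()                        // called before issuing a write
   {
      std::unique_lock<std::mutex> lock(m_mutex);
      m_slotAvailable.wait(lock, [this]{ return m_pendingWrites < m_maxPendingWrites; });
      ++m_pendingWrites;
   }

   void ReleaseSlot()                        // called from the write completion
   {
      {
         std::lock_guard<std::mutex> lock(m_mutex);
         --m_pendingWrites;
      }
      m_slotAvailable.notify_one();
   }

private:
   const size_t m_maxPendingWrites;
   size_t m_pendingWrites;
   std::mutex m_mutex;
   std::condition_variable m_slotAvailable;
};
```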

Running the badly configured server again with the new limiter showed that everything was working nicely: the server simply slowed down as the limit was reached and the asynchronous logging became, effectively, synchronous. The non-paged pool memory usage stayed reasonable and the server serviced all of the clients without problems.

The changes to the CAsyncFileWriter and the new limiter will be available in the next release, 6.4, which currently doesn't have a release date.

Automatic crash dump creation

The next release of The Server Framework, 6.4, includes code which allows a server to create a crash dump whilst it is running. Typically you might want to do this if some strange and possibly fatal exception gets thrown. Crash dumps are a great way to debug a server that is failing in production: the server generates the dump when something goes wrong, you load up the dump on your development machine and your debugger will be sitting on the line that caused the problem. You can read all about crash dumps here.

Being able to enable automatic crash dump creation in a server means that you don't have to rely on external tools to generate the dumps for you. By simply adjusting your Config.h you can generate a crash dump whenever an SEH exception is caught. Your server may be able to continue to run, but you're generating diagnostic data that will help you prevent the exceptions from being generated in the first place. Likewise you can enable crash dump generation from purecalls: simply install the purecall handler and adjust your Config.h to enable the dump creation.
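
Installing a purecall handler is standard CRT functionality; a minimal sketch, with the actual dump generation step left as whatever your Config.h enables, looks like this:

```cpp
#include <stdlib.h>

// The handler that the CRT calls when a pure virtual function is called.
// What you do inside it (generate a dump, log, etc.) is up to you; the dump
// generation step is represented here by a comment.
void __cdecl MyPurecallHandler()
{
   // Generate a crash dump here, then terminate.
   abort();
}

void InstallPurecallHandler()
{
   _set_purecall_handler(MyPurecallHandler);
}
```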

Of course you can also use the crash dump generation code in other areas of your code if you need to. You can tune how much data is included in the dump, and hence the size of the resulting file, and you can determine where the files are generated. And, of course, the source is available so that you can tweak the crash dump generation if you want to.
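
Under the covers this kind of dump generation is built on the DbgHelp MiniDumpWriteDump() API. The sketch below shows the raw call rather than the framework's wrapper; the MINIDUMP_TYPE flags are what control how much data, and so how large a file, you get:

```cpp
#include <windows.h>
#include <dbghelp.h>

#pragma comment(lib, "dbghelp.lib")

// Write a dump of the current process; pass the EXCEPTION_POINTERS from an
// SEH filter to have the dump record the faulting context, or null otherwise.
bool WriteCrashDump(const wchar_t *pDumpFileName, MINIDUMP_TYPE dumpType, EXCEPTION_POINTERS *pExceptionPointers)
{
   HANDLE hFile = ::CreateFileW(pDumpFileName, GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);

   if (hFile == INVALID_HANDLE_VALUE)
   {
      return false;
   }

   MINIDUMP_EXCEPTION_INFORMATION exceptionInfo;
   exceptionInfo.ThreadId = ::GetCurrentThreadId();
   exceptionInfo.ExceptionPointers = pExceptionPointers;
   exceptionInfo.ClientPointers = FALSE;

   const BOOL ok = ::MiniDumpWriteDump(
      ::GetCurrentProcess(),
      ::GetCurrentProcessId(),
      hFile,
      dumpType,                                    // e.g. MiniDumpNormal or MiniDumpWithFullMemory
      pExceptionPointers ? &exceptionInfo : 0,
      0,
      0);

   ::CloseHandle(hFile);

   return ok != FALSE;
}
```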

Once you are building code that can generate crash dumps you will probably want to set up a symbol server to store the exe and pdb files for every release that you create so that the dumps that your clients send to you can be used to debug properly. A good introduction to setting up and using a symbol server can be found here.

Changes to the CLR Hosting Tools library in 6.3

One of my clients has recently required .Net 4.0 hosting support and so most of the changes in the CLR Hosting Tools library in 6.3 have been driven by them.

Changes to the Service Tools library in 6.3

The development of WASP has been acting as a bit of an internal driver for new feature development in the 6.3 release of The Server Framework. Sitting down to develop a service that was easy to use for a mass market exposed some small holes in the 6.2 release; nothing too serious, but pretty soon after putting together the first service shell of the WASP application I had a list of nice-to-have additions for the Service Tools Library.
