September 2012 Archives

Latest release of The Server Framework: 6.5.8

Version 6.5.8 of The Server Framework was released today.

This release contains one fix for a bug which has been present in The Server Framework since at least release 5.0, and one change to work around a bug in Windows 8 and Server 2012.

You need this release if you plan to use AcceptEx() on Windows 8 or Server 2012, or if you have long-running connections which use sequenced sockets and issue more than 2,147,483,647 writes on a single socket.

This release includes the following changes; see the release notes, here, for full details.

  • Changes to how CStreamSocketServerEx issues AcceptEx() calls to ensure that these calls are always made from an I/O thread. This is to help reduce the likely impact of the bug in Windows 8/Server 2012 with regard to the delay of AcceptEx() completions when the issuing thread is blocked in a synchronous ReadFile().
  • Bug fix to sequenced writes, see here for more details. Due to a type size mismatch between IBuffer::SequenceNumber and the sequence numbers generated by CSequencedStreamSocket::SequenceData, a write stall could occur. The sequence number being generated (a signed long) wrapped, whereas the "next" sequence number in CInOrderBufferList was an IBuffer::SequenceNumber, a size_t, which wrapped far later. The buffer list would then wait for buffers with sequence numbers which could never be generated.

I've just found and fixed a bug which has been present in The Server Framework from the very beginning. The problem affects connections which are "sequenced" and which have had more than 2,147,483,647 writes performed on them. The observant amongst you will be thinking that it's a counter wrap bug and you'd be correct. The annoying thing for me is that the code in question has unit tests which explicitly test for correct operation when the sequence number wraps; the tests pass but the bug is still there.

The reason that the bug survived the unit tests is that the tests were testing the obvious place for the problem: the CInOrderBufferList class, which ensures that if a series of buffers with sequence numbers in them are added to the collection then they are removed in sequence order. The wrap test code checks that the collection doesn't get confused when it has buffers with very high sequence numbers in it at the same time as buffers with low sequence numbers are added. The test works fine and tests the correct aspects of the class under test. Unfortunately the code that generates the sequence numbers that go into the buffers in the real code was out of sync with the code that processes these sequence numbers. The CInOrderBufferList class used the IBuffer::SequenceNumber type, a size_t, throughout, whereas the code that generated the numbers did so using a long. The type mismatch meant that the sequence numbers generated ran from 0 to 2,147,483,647 and then back to 0. The consuming code did not wrap from 2,147,483,647 to 0, and so it expected a buffer with a sequence number of 2,147,483,648 to come along next, and it never would.
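The mismatch can be sketched in a few lines of portable C++. This is an illustration, not the framework's code: ProducerCounter and ConsumerSequence are hypothetical stand-ins for the long used by the generator and the size_t used by CInOrderBufferList on x64, and the wrap back to 0 is modelled explicitly as described above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the two mismatched types:
// a 32-bit signed counter on the producing side (like a Windows long)
// and a 64-bit unsigned sequence number on the consuming side
// (like size_t in an x64 build).
using ProducerCounter = std::int32_t;
using ConsumerSequence = std::uint64_t;

// Producer: counts 0 .. 2,147,483,647 and then starts again at 0.
ProducerCounter NextProduced(ProducerCounter current)
{
    const ProducerCounter last = std::numeric_limits<ProducerCounter>::max();
    return current == last ? 0 : static_cast<ProducerCounter>(current + 1);
}

// Consumer: simply expects the previous sequence number plus one,
// with no wrap until the 64-bit type itself overflows.
ConsumerSequence NextExpected(ConsumerSequence current)
{
    return current + 1;
}

// Returns true if, straight after the producer wraps, the number it
// generates differs from the one the consumer is waiting for.
bool ConsumerStallsAfterProducerWrap()
{
    const ProducerCounter lastProduced = std::numeric_limits<ProducerCounter>::max();
    const ConsumerSequence lastConsumed = static_cast<ConsumerSequence>(lastProduced);

    const ProducerCounter produced = NextProduced(lastProduced);     // wraps to 0
    const ConsumerSequence expected = NextExpected(lastConsumed);    // 2,147,483,648

    return static_cast<ConsumerSequence>(produced) != expected;
}
```

In this model the two sides agree for every value up to 2,147,483,647 and disagree on the very next one, which is why a test that exercises only the consumer's own wrap handling never sees the problem.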

The symptoms of the bug are that data stops being written to the network and memory usage increases.

I expect the code was originally OK for 32-bit platforms (where size_t and unsigned long are the same size), as I'm sure the code used to use ::InterlockedIncrement() to increment the sequence numbers in the producer and this would have treated the long as an unsigned long for increment and wrap purposes. Every x64 build has always been broken, though, and the x86 build was broken as far back as release 5.0.

I've added a test which is run once inside the sequence number generator and which ensures that the numbers being generated have the same range as the IBuffer::SequenceNumber type. This prevents the broken code from compiling (signed/unsigned mismatch) and throws an exception at run time if the code can compile but the data types are incorrect. The library's unit tests then fail due to the exception.
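A check of this kind might look something like the sketch below. It is not the framework's actual test: SequenceNumber is a stand-in for IBuffer::SequenceNumber, and HasSameRange and CheckGeneratorRange are hypothetical names. It compares signedness and the number of value bits rather than comparing minimum and maximum values directly, which avoids signed/unsigned comparison pitfalls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Stand-in for IBuffer::SequenceNumber (a size_t in the post).
using SequenceNumber = std::size_t;

// Two integer types cover the same range when they agree on
// signedness and on the number of value bits.
template <typename Generated, typename Expected>
constexpr bool HasSameRange()
{
    return std::numeric_limits<Generated>::is_signed == std::numeric_limits<Expected>::is_signed
        && std::numeric_limits<Generated>::digits == std::numeric_limits<Expected>::digits;
}

// Hypothetical run-time guard, intended to be called once from the
// sequence number generator. Throws if the generator's counter type
// cannot produce exactly the values the consumer expects.
template <typename Generated>
void CheckGeneratorRange()
{
    if (!HasSameRange<Generated, SequenceNumber>())
    {
        throw std::logic_error(
            "sequence number generator range does not match SequenceNumber");
    }
}
```

Because HasSameRange is constexpr, the same condition could also be used in a static_assert to reject a mismatched counter type at compile time; the run-time throw is kept here because the post describes the unit tests failing on an exception.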

This fix will be included in release 6.5.8, which also includes the changes for the Windows 8/Server 2012 AcceptEx() bug. This will be released this week.

Be aware that there is a known bug in Windows 8 and all Server 2012 variants which causes AcceptEx() completions to be delayed in some situations. This was confirmed by a Microsoft representative on Microsoft Connect, see the error report ticket here. An example of how to demonstrate this bug, its likely effects and the currently known causes can be found here in this Stack Overflow question.

I'm a little disappointed with the official response to this bug report. I wasn't expecting a fix to be included in the imminent release of Server 2012 but it would be nice to get more information back from Microsoft about what the actual cause of the problem is. Right now the Stack Overflow question is the only source of information about what may cause this problem and which API calls it is caused by and affects. It appears to be a bug in how IOCP completions are dispatched for threads that issued I/O requests but are now in some kind of blocking state (possibly something to do with APC dispatch and the fact that the blocking thread is not currently in an alertable wait state), but this is all guesswork.

Note that it is unlikely that code built with The Server Framework will suffer any problems from this bug. We always suggest that you never do blocking calls on I/O threads and all of our AcceptEx() calls (except the first ones which are usually issued when the server starts up) are made from an I/O thread. As long as you are following our advice and are using a separate thread pool for potentially blocking operations then your code should run just fine even on operating systems where this bug is present.

I will be issuing a bug fix release shortly which will marshal the initial AcceptEx() calls that are made when you call StartAcceptingConnections() off of the calling thread and onto one of the I/O pool threads. This is an easy change which will require no changes to non-library code and will remove any chance that these initial AcceptEx() calls could suffer from delayed completion if you happen to call an inappropriate Windows API on that thread (currently just a synchronous ReadFile() or anything that itself calls down to a synchronous ReadFile()).
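The idea of marshalling work off the calling thread can be illustrated with a portable sketch. This is not how The Server Framework does it: IoThread and IssueInitialAcceptsOnIoThread are hypothetical names, a std::thread with a task queue stands in for the I/O pool, and the point where a real server would issue AcceptEx() is only marked with a comment. On Windows, a real implementation would more likely hand the work to a thread that services the I/O completion port.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-threaded stand-in for an I/O thread pool.
class IoThread
{
public:
    IoThread() : worker_([this] { Run(); }) {}

    ~IoThread()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    // Queue a task to be run on the I/O thread.
    void Post(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                {
                    return;     // stopping, and nothing left to run
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;        // declared last so it starts after the other members exist
};

// Hypothetical equivalent of the start-up step: rather than issuing the
// initial accepts on the caller's thread, post them to the I/O thread.
// Returns the id of the thread that actually ran the work.
std::thread::id IssueInitialAcceptsOnIoThread(IoThread& io)
{
    auto issuedOn = std::make_shared<std::promise<std::thread::id>>();
    std::future<std::thread::id> result = issuedOn->get_future();

    io.Post([issuedOn] {
        // A real server would issue its initial AcceptEx() calls here.
        issuedOn->set_value(std::this_thread::get_id());
    });

    return result.get();
}
```

Whatever the calling thread goes on to do afterwards, including blocking in a synchronous read, the accepts in this model were issued by a thread that never blocks in that way.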

Note that this bug only affects AcceptEx() completions and so only affects servers that use the CStreamSocketServerEx class.