How to support 10,000 or more concurrent TCP connections - Part 2 - Perf tests from Day 0

2010-10-27

As I mentioned last time, supporting a large number of concurrent connections on a modern Windows operating system is reasonably straightforward if you get your initial design right: use an I/O Completion Port based design, minimise context switches, data copies and memory allocation, and avoid lock contention… The Server Framework gives you this as a starting point and you can often use one of the many complete, fully functional, real-world example servers to provide you with a whole server shell, complete with easy performance monitoring and SSL security, where you simply have to fill in your business logic.

Unfortunately it can be all too easy to squander the scalability and performance that you have at the start with poor design choices along the way. These design choices are often quite sensible when viewed from a single-threaded or reasonably complex desktop development viewpoint but can be performance killers when writing scalable servers. It’s also easy to over-engineer your solution because of irrational performance fears; the over-engineering takes time, delays your delivery, and can often add complexity which then causes maintenance issues for years afterwards. There’s a fine line to walk between the two and I firmly believe that the only way to walk this line is to establish realistic performance tests from day 0.

Step one on the road to a high performance server is obviously to select The Server Framework to do your network I/O ;)
Step two is to get a shell of an application up and running so that you can measure its performance.
Step three is, of course, to measure.

There is no excuse for getting to your acceptance testing phase only to find that your code doesn’t perform adequately under the desired load. What’s more, at that point it’s often too late to do anything about the problem without refactoring reams of code. Even if you have decent unit test coverage, refactoring towards performance is often a painful process. The various inappropriate design decisions that you can make tend to build on each other and the result can be difficult to unpick. The performance of the whole is likely to continue to suffer even as you replace individual poorly performing components.

So the first thing you should do once you have your basic server operating, even if all it does is echo data back, is to build a client that can stress it and which you can develop in tandem with the server to ensure that real-world usage scenarios can scale. The Server Framework comes with an example client that provides the basic shell of a high performance multiple client simulator. This allows you to set up tests to prove that your server can handle the loads that you need it to handle. Starting with a basic echo server you can first baseline the number of connections that it can handle on a given hardware platform. Then, as you add functionality to the server, you can performance test real-world scenarios by sending and receiving appropriate sequences of messages. As your server grows in complexity you can ensure that your design decisions don’t adversely affect performance to the point where you no longer meet your performance targets. For example, you might find that adding a collection of data which connections need to access on every message causes an unnecessary synchronisation point across all connections, and that this reduces the maximum number of active connections that you can handle from vastly above your performance target to somewhere very close to it… Knowing this as soon as the offending code is added to the code base means that the redesign (if deemed required) is less painful. Tracking this performance issue down later on and then fixing it might be considerably harder once the whole server workflow has come to depend on it.
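To make the idea of a multiple client simulator concrete, here is a minimal sketch in Python. It is not The Server Framework’s example client, just an illustration of the shape such a tool can take. The function names (`one_client`, `run_test`, `demo`), the 64 byte payload and the stand-in `echo_handler` server are all assumptions made for the sketch; in a real test you would point `run_test` at your own server.

```python
import asyncio
import time


async def echo_handler(reader, writer):
    """Stand-in echo server, included only so the sketch runs locally."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def one_client(host, port, messages, payload):
    """Open one connection, send `messages` payloads and check each echo."""
    reader, writer = await asyncio.open_connection(host, port)
    try:
        for _ in range(messages):
            writer.write(payload)
            await writer.drain()
            echoed = await reader.readexactly(len(payload))
            if echoed != payload:
                return False
        return True
    finally:
        writer.close()
        await writer.wait_closed()


async def run_test(host, port, connections, messages, payload=b"x" * 64):
    """Run `connections` concurrent clients; return (succeeded, elapsed_seconds)."""
    started = time.perf_counter()
    results = await asyncio.gather(
        *(one_client(host, port, messages, payload) for _ in range(connections)),
        return_exceptions=True,  # a failed connection counts as a failure, not a crash
    )
    elapsed = time.perf_counter() - started
    return sum(1 for r in results if r is True), elapsed


async def demo(connections=50, messages=10):
    """Start the stand-in server on an ephemeral port and stress it."""
    server = await asyncio.start_server(echo_handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await run_test("127.0.0.1", port, connections, messages)
```

Calling `asyncio.run(demo())` gives you the number of clients that completed every echo and the elapsed time. Recording those two numbers for each build is the kind of baseline that lets you spot a design decision that costs you connections as soon as it lands.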

I’m a big fan of unit testing and agile development and so I don’t find this kind of ‘incremental’ acceptance testing to be anything but sensible and, in the world of high performance servers, essential.

You can download a compiled copy of the Echo Server test program from here, where I talk about using it to test servers developed using WASP.

Of course the key to this kind of testing is using realistic scenarios. When your test tools are as scalable and performant as the server under test it’s easy to create unrealistic test scenarios. One of the first problems that clients have when using the echo server test program to evaluate the performance of example servers is that of simulating too many concurrent connection attempts. Whilst it’s easy to generate 5000 simultaneous connection attempts and watch most servers fail to deal with them effectively, it’s not usually especially realistic. A far more realistic version of this scenario might be to handle a peak of 1000 connections per second for 5 seconds, perhaps whilst the server is already dealing with 25,000 concurrent connections that had arrived at a more modest connection rate. Likewise it’s easy to send messages as fast as possible but that’s often not how the server will actually be used. The Echo Server test program can be configured to establish connections and send data at predetermined rates, which helps you build more realistic tests.
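The scenario above can be written down as a connection schedule: a list of the times at which each connection attempt should start. The sketch below is one hypothetical way to express it in Python. The `connection_schedule` function is invented for illustration, and the 250 connections per second used for the ramp is my assumption for a "more modest" rate; only the 25,000 connections and the 1000 per second for 5 seconds come from the scenario.

```python
def connection_schedule(phases):
    """Return the start time, in seconds from test start, of each connection attempt.

    `phases` is a list of (connections_per_second, duration_seconds) pairs
    that run back to back.
    """
    times = []
    phase_start = 0.0
    for rate, duration in phases:
        count = int(rate * duration)
        for i in range(count):
            # Spread the attempts evenly across the phase.
            times.append(phase_start + i / rate)
        phase_start += duration
    return times


# Ramp: 25,000 connections at an assumed 250 per second (100 seconds),
# then the peak: 1000 per second for 5 seconds.
schedule = connection_schedule([(250, 100), (1000, 5)])

# 30,000 attempts in total; the peak starts at t = 100 seconds.
print(len(schedule), schedule[25000])
```

A test client can then sleep until each scheduled time before connecting, rather than opening every connection at once.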

You should also be careful to make sure that you’re not, in fact, simply testing the capabilities of the machines being used to run the test clients, or the network bandwidth between them and the server. With the easy availability of cloud computing resources such as Amazon’s EC2 it’s pretty easy to put together a network of machines to use to load test your server.

Once you have a suitable set of clients, each running a reasonable number of connections, you can begin to stress your server with repeatable, preferably scripted, tests. You can then automate the gathering of results using perfmon and your server’s performance counters mixed in with the standard system counters.

Personally I tend to do two kinds of load tests. The first is to prove that we can achieve the client’s target performance for the desired number of connections on given hardware. The second is to see what happens when we drive the server to destruction. These destruction tests are useful to know what kind of gap there is between target performance and server meltdown and also to ensure that server meltdown is protected against, either by actively limiting the number of connections that a server is willing to accept or by ensuring that performance degrades gracefully rather than spectacularly.
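Actively limiting the number of connections can be as simple as a counter that is checked when a connection arrives. The following Python sketch shows the idea only; `ConnectionLimiter` is a hypothetical class and not part of The Server Framework, and a real server would probably use an atomic counter rather than a lock so that the limiter does not itself become a point of contention.

```python
import threading


class ConnectionLimiter:
    """Refuse new connections once `max_connections` are active."""

    def __init__(self, max_connections):
        self._max = max_connections
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Call when a connection arrives; False means it should be refused."""
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self):
        """Call when an accepted connection closes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self):
        """Number of connections currently counted as active."""
        with self._lock:
            return self._active
```

The accept path calls `try_acquire()` and closes the new socket straight away if it returns `False`. A destruction test will then show the server refusing connections at the limit rather than melting down beyond it.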

Knowledge is power, and when aiming to build super scalable, high performance code you need to gather as much knowledge as you can by measuring and performance testing your whole system from the very start.