Avoiding non-paged pool exhaustion when using asynchronous file writers

I’m in the process of completing a custom server development project for a client. The server deals with connections from thousands of embedded devices and allows them to download new firmware or configuration data and upload data that they’ve accumulated in their local memory. To enable maximum scalability, the server uses asynchronous reads and writes for both the file system and the network.

One feature of the server is the ability to configure it to create per-session log files. This helps in debugging device communication issues by creating a separate log file for each device every time it connects. You can easily see the interaction between a specific device and the server, including full dumps of all data sent and received if desired. Again, the log system uses asynchronous file writes so that it scales well.

With some misconfiguration of the server and a heavy load (8000+ connected devices all doing data uploads and file downloads with full logging for all sessions) I managed to put the system into a mode whereby it was using non-paged pool in an uncontrolled manner. Each data transmission generated several log writes, all log write completions went through a single thread, and we were writing to over 16000 files at once. Each asynchronous write used a small amount of non-paged pool, but I was issuing writes faster than they could complete, so I watched the non-paged pool usage grow to over 2GB on my Windows 7 development box and then watched the box fall over as various drivers failed due to lack of non-paged pool. Not a good design.

Back in 2009 I wrote about how I had added the ability to restrict the number of pending writes on an instance of the CAsyncFileWriter class. The idea was that in some logging situations you can generate log lines faster than you can write them, and if you have a potentially unlimited number of overlapped writes pending then you can run out of non-paged pool memory, which is a very bad situation to get into. The problem with that approach is that it’s a limit per file writer. With this new server’s per-session logs we have thousands of file writers active at any one time, so limiting the number of writes that each writer can have pending isn’t really enough; we need an overall limit for all of the writers to share.

Such a limiter was pretty easy to add: you simply pass an instance of the limiter into each file writer and they all share a single limit. Profiling can show how large you can make the limit for a given box, and a general purpose “good enough for most machines” limit value is pretty easy to come up with.
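As a rough illustration, here’s a minimal sketch of what such a shared limiter could look like: essentially a counting semaphore over the total number of pending writes, acquired before a write is issued and released when the write completes. The class and method names here are hypothetical and are not the framework’s actual API.

```cpp
// Hypothetical sketch of a shared pending-write limiter; not the real
// framework API, just an illustration of the idea.

#include <condition_variable>
#include <cstddef>
#include <mutex>

class PendingWriteLimiter
{
public:
    explicit PendingWriteLimiter(const size_t maxPendingWrites)
        : m_maxPendingWrites(maxPendingWrites), m_pendingWrites(0)
    {
    }

    // Called before issuing an overlapped write; blocks the caller while the
    // shared limit is reached, which is what makes the asynchronous logging
    // effectively synchronous under extreme load.
    void AcquireWriteSlot()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_slotAvailable.wait(lock, [this] { return m_pendingWrites < m_maxPendingWrites; });
        ++m_pendingWrites;
    }

    // Called from the write completion handling to return the slot.
    void ReleaseWriteSlot()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            --m_pendingWrites;
        }
        m_slotAvailable.notify_one();
    }

private:
    const size_t m_maxPendingWrites;
    size_t m_pendingWrites;
    std::mutex m_mutex;
    std::condition_variable m_slotAvailable;
};
```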

Running the badly configured server again with the new limiter showed that everything was working nicely; the server simply slowed down as the limit was reached and the asynchronous logging became, effectively, synchronous. The non-paged pool memory usage stayed reasonable and the server serviced all of the clients without problems.
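For completeness, here’s a hypothetical sketch of how per-session log writers might share a single limiter, building on the PendingWriteLimiter sketch above. The writer class, its methods and the example limit value are illustrative assumptions only, not the real CAsyncFileWriter interface.

```cpp
// Illustrative usage: every per-session log writer shares the same limiter,
// so the total number of overlapped writes in flight across all sessions is
// bounded. Assumes the PendingWriteLimiter sketch shown earlier.

#include <memory>
#include <string>
#include <utility>

class SessionLogWriter
{
public:
    explicit SessionLogWriter(std::shared_ptr<PendingWriteLimiter> limiter)
        : m_limiter(std::move(limiter))
    {
    }

    void WriteLogLine(const std::string &line)
    {
        m_limiter->AcquireWriteSlot();  // may block once the global limit is hit

        IssueOverlappedWrite(line);     // placeholder for the real async write
    }

    // Invoked by the I/O completion handling when the write finishes.
    void OnWriteCompleted()
    {
        m_limiter->ReleaseWriteSlot();
    }

private:
    void IssueOverlappedWrite(const std::string & /*line*/)
    {
        // A real implementation would issue an overlapped WriteFile() here.
    }

    std::shared_ptr<PendingWriteLimiter> m_limiter;
};

// One limiter shared by thousands of writers (the limit value is made up):
// auto limiter = std::make_shared<PendingWriteLimiter>(10000);
// SessionLogWriter writerA(limiter);
// SessionLogWriter writerB(limiter);
```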

The changes to the CAsyncFileWriter and the new limiter will be available in the next release, 6.4, which currently doesn’t have a release date.