Saturday, May 12, 2007

The underlying connection was closed: The request was canceled.

The Problem:
Couple of weeks back I was investigating a problem with an ASP.NET web application. During the course of normal day and under normal load the server would suddenly stop responding, shooting up the response time and ultimately causing timeouts for the end users. ASP.NET performance counters showed that at this time all the requests were getting queued up and the CPU usage dropped to zero!

Now request queuing is not a bad thing if the CPU is being utilized properly - it just means that the server has reached it's capacity and we need to add more servers to scale the application. But in my case the CPU was not being used at all. This suggested that there was some kind of "thread locking". All the threads were waiting on something to get released. This symptom lasted for 20-30 seconds and then the server recovered with request Queue reducing to 0 and CPU utulization going back to it's original 30%-40%.

Investigation:
A little background on our web application: It consists of a page made up of 4-5 sections which displays content from a variety of external sources (all web services). So on every request it was making about 4-5 XML/HTTP calls to external web services and then applied XSLT to generate HTML snippets and finally assemble the page. As it involved multiple remote calls, the developer used asynchronous delegates to execute these calls in parallel and then waited for these calls to return using the "WaitHandle.WaitAll()". So this piece of code was my prime suspect for the following reasons: Asynchronous delegates use thread from the worker thread pool to make the async call and in case of our application it was making 4-5 async calls, which means each request has the potential for using 4-5 more threads from the thread pool. And if the web services were slow in responding, as more requests come in more worker threads will be used from the queue and ultimately lead to request queuing. (BTW Fritz Onion has written an excellent article about asynchronous delegates and it's pitfalls)

To test this theory I modified the code and made all the calls synchronous in an attempt to reduce the usage of threads from the worker thread pool per request. But when I perf tested this code I saw the same behavior!! The CPU dropped again!

I did another perf testing and this time I was watched the socket connections opened from our app to the external systems using the "netstat -ap TCP" command. And I saw this interesting behavior; at the beginning of the perf test, as I was slowly ramping up the virtual users (total 30 vusers, ramping 1 per sec), even with 10 vusers there were only 3-4 outgoing socket connections. This means that the web services were responding so fast that 3 socket connections were sufficient to serve 10 users. The CPU was being properly utilized at this time (10%-15%). When the ramp reached 20 vusers everything just froze!! The CPU suddenly dropped, requests started queuing and the number of socket connections were still 3! Then the next instant the number of socket connections suddenly jumped to 20, the CPU went back to normal 20% utilization and there were no request queuing!

The Cause:
My theory based on the above observation is that when the application first starts there are very few threads available in the pool to service remote requests. As load increases the few threads cannot handle the requests and so ASP.NET runtime tries to create more threads and when this happens it freezes all other threads (don't know why) and you see the CPU drop.

The Solution : "minFreeThreads"!
I searched a lot on the internet without much success until at last I found this Microsoft Knowledge Base Article 821268 which confirmed my theory. The solution is to add a new attribute called "minWorkerThreads" to the section of machine.config
 // minWorkerThreads = maxWorkerThreads/2

ASP.NET by default has only 1 thread allocated for remote calls! So even if you configured your machine.config with the recommended settings of "maxConnections=24" and "maxWorkerThreads=100" (see my post below for more info) you are just setting the maximum value! It doesn't mean the pool starts with 24 threads ready for making outgoing calls! And the most surprising thing is that by default this entry is missing from the machine.config!

So if you are facing with a similar problem with your ASP.NET app that makes remote calls try adding this entry. This simple fix may just solve your problem!

P.S : There are different kinds of "The underlying connection was closed: " errors. Here's a link to a Microsoft blog (http://blogs.msdn.com/engelsr/articles/497902.aspx) that explains most of these errors and how to fix them. Unfortunately it doesn't cover the error message that I encountered ("the request was canceled").