Lessons from a Live Server Issue in .NET
(Updated Below)
A couple of months back my company was asked to upgrade the MCMS based public site of a client. We performed the upgrade for a subsite from .NET 1.1 to .NET 2.0. When we rolled out the upgraded code, we also changed the .NET runtime in their main rootsite and other virtual directories to .NET 2.0 runtime.
The load balanced site ran for 4 hours on both servers and then started giving "server very busy" (HTTP 503) errors. We were alerted and then worked off and on for 3 weeks and found the problem.
The problem had to do with HttpWebRequest-Response objects in .NET 1.1 and 2.0.
The 1.1 and 2.0 documentation for HttpWebResponse clearly states that not closing the response stream will cause a server to run out of connections. The behavior thereafter is not defined, but we found the behavior alright - subsequent threads that try to open the response stream just HANG. They block the entire request execution and the server starts throwing HTTP 503 errors.
Here is the offending code:
This code does NOT close the response object. It should call “resp.Close();”.
Yet, this code worked fine in the client's live environment for 2 years. Why?
We found that .NET 1.1 actually recovers the response stream even though the documentation says otherwise. That is, there is a loophole in .NET 1.1 for people who do not close the response stream.
In .NET 2.0, that loophole is closed. Therefore, the above codebase hangs the server if upgraded to .NET 2.0.
Lessons we learnt
This was a contentious engagement with accusations and defenses about what went wrong. But we learnt a few lessons that I am documenting here:
1. When we upgrade or make a change in an environment and the server starts misbehaving ALWAYS suspect the upgrade or the change. With automatic windows updating and external system dependencies, it is easy to dismiss a problem as coincidental to an upgrade. Most of the time that is not the case. Most of the time it is a change that was consciously made.
In our case, the code that we actually upgraded to 2.0 was working fine. It was the legacy 1.1 code that was running in the 2.0 runtime that showed issues. Therefore, from the first, we thought the upgrade was NOT the cause. Of course, it turned out it was.
2. Always suspect code. If threads hang or we get weird errors in an environment with load, it is natural to suspect that the code is not the problem. We tend to suspect external system configurations, any change to configuration, or hardware issues. In my experience, things keep coming back to the code. Suspect the code first and do not be afraid of debugging in the server with trace statements, file logs - anything.
3. Always check connections to external systems. More specifically, suspect resources such as database connection pools, socket connections, web services. The above point (point 2) should be used in conjunction with this - instead of randomly checking the code or code reviews, check wherever external connections such as sockets or database readers are opened.
4. One thing that helps immensely is a simple stress test with a tool such as Web Application Stress Tool (WAST) from Microsoft. It is free and there is no reason not to record a session and run a test with 50 threads for 30 minutes every time you make major changes. I also suggest running such a test in every environment - not just staging or live. Run it in your own computer, if possible.
5. Do not hesitate to put trace statements or Response.Writes or writes to a file in code in the server - if time is of essence. This is an advantage with ASP.NET 2.0's dynamic compilation or even legacy ASP and PHP. I found the threads hanging by using System.IO.File.WriteAllText method. It takes a path+filename and a string and writes the string to that file. By putting debug statements in different parts of the code, I could find the last place where the code was stuck.
Update
The performance counters that were most useful during this issue were the ASP.NET 2.0 (or ASP.NET 1.1) object with the counters Requests Queued and Requests Current.
During stress testing the criteria that was most useful in WAST was Socket Timeouts. There were many timeouts and after fixing the issue there were none.
A couple of months back my company was asked to upgrade the MCMS based public site of a client. We performed the upgrade for a subsite from .NET 1.1 to .NET 2.0. When we rolled out the upgraded code, we also changed the .NET runtime in their main rootsite and other virtual directories to .NET 2.0 runtime.
The load balanced site ran for 4 hours on both servers and then started giving "server very busy" (HTTP 503) errors. We were alerted and then worked off and on for 3 weeks and found the problem.
The problem had to do with HttpWebRequest-Response objects in .NET 1.1 and 2.0.
The 1.1 and 2.0 documentation for HttpWebResponse clearly states that not closing the response stream will cause a server to run out of connections. The behavior thereafter is not defined, but we found the behavior alright - subsequent threads that try to open the response stream just HANG. They block the entire request execution and the server starts throwing HTTP 503 errors.
Here is the offending code:
string sURL = "http://localhost/NR/...";
HttpWebRequest wReq = (HttpWebRequest)WebRequest.Create(sURL);
wReq.Method = "HEAD";
HttpWebResponse resp = (HttpWebResponse)(wReq).GetResponse();
string cntLen = resp.Headers["Content-length"];
This code does NOT close the response object. It should call “resp.Close();”.
Yet, this code worked fine in the client's live environment for 2 years. Why?
We found that .NET 1.1 actually recovers the response stream even though the documentation says otherwise. That is, there is a loophole in .NET 1.1 for people who do not close the response stream.
In .NET 2.0, that loophole is closed. Therefore, the above codebase hangs the server if upgraded to .NET 2.0.
Lessons we learnt
This was a contentious engagement with accusations and defenses about what went wrong. But we learnt a few lessons that I am documenting here:
1. When we upgrade or make a change in an environment and the server starts misbehaving ALWAYS suspect the upgrade or the change. With automatic windows updating and external system dependencies, it is easy to dismiss a problem as coincidental to an upgrade. Most of the time that is not the case. Most of the time it is a change that was consciously made.
In our case, the code that we actually upgraded to 2.0 was working fine. It was the legacy 1.1 code that was running in the 2.0 runtime that showed issues. Therefore, from the first, we thought the upgrade was NOT the cause. Of course, it turned out it was.
2. Always suspect code. If threads hang or we get weird errors in an environment with load, it is natural to suspect that the code is not the problem. We tend to suspect external system configurations, any change to configuration, or hardware issues. In my experience, things keep coming back to the code. Suspect the code first and do not be afraid of debugging in the server with trace statements, file logs - anything.
3. Always check connections to external systems. More specifically, suspect resources such as database connection pools, socket connections, web services. The above point (point 2) should be used in conjunction with this - instead of randomly checking the code or code reviews, check wherever external connections such as sockets or database readers are opened.
4. One thing that helps immensely is a simple stress test with a tool such as Web Application Stress Tool (WAST) from Microsoft. It is free and there is no reason not to record a session and run a test with 50 threads for 30 minutes every time you make major changes. I also suggest running such a test in every environment - not just staging or live. Run it in your own computer, if possible.
5. Do not hesitate to put trace statements or Response.Writes or writes to a file in code in the server - if time is of essence. This is an advantage with ASP.NET 2.0's dynamic compilation or even legacy ASP and PHP. I found the threads hanging by using System.IO.File.WriteAllText method. It takes a path+filename and a string and writes the string to that file. By putting debug statements in different parts of the code, I could find the last place where the code was stuck.
Update
The performance counters that were most useful during this issue were the ASP.NET 2.0 (or ASP.NET 1.1) object with the counters Requests Queued and Requests Current.
During stress testing the criteria that was most useful in WAST was Socket Timeouts. There were many timeouts and after fixing the issue there were none.
Labels: .NET 1.1, .NET 2.0, HttpWebResponse, Stress Testing
1 Comments:
Very nice situational analysis. The key take away's should be structured more like best practices though. That way, it provides a much quicker snapshot.
By mukund, at 6:10 PM
Post a Comment
<< Home