Home > Alagad > Troubleshooting ColdFusion Performance: The Analysis

Troubleshooting ColdFusion Performance: The Analysis

September 21st, 2009

OK, time for a follow up to my 1st post in this troubleshooting series for our Project X, and I am going to talk about some steps we can take to better identify and isolate the issues today. I had some great participation on the last post, thanks to all who contributed (and you are entered for the backpack!). If you have not read the first post outlining the hypothetical scenario, do that before reading the rest here.
So the first step I usually take in any project is as simple as observation. Looking at task manager, and enabling the ’show kernel times’, will give us an instant pointer. Lets take a peek at one of the web servers while it is under heavy load, and we see this:

This graph tells us a few things. First of all, notice that the server is nowhere near using 100% of its possible CPU utilization, and notice the red line (which is the kernel or system usage). Kernel usage that is extremely high, or seems to follow the CPU usage like this one does, is often indicative of an I/O issue.  An I/O issue can be slow or overloaded disk resources, waiting on a network thread to complete, or waiting on database requests. As noted in the prior post, we are seeing Out of Memory exceptions after a time as well, but lets chase down issues one at a time here.
So, reading the CPU graph on the web server makes us suspect an I/O issue, lets try and confirm this.  One of my favorite tools for this is the use of hprof, a java code profiler that is built into the JVM.  I went over using a few parts of hprof in this post, and the part we will use to isolate this issue is the hprof=cpu=samples. Note, that making this change will require stopping and starting a production instance, but in this case the site is already going up or down.  I look at this as triage - errors are already occurring, and taking steps on a production server to more quickly isolate the issue is worth generating a few more errors on an already unstable system… I digress =)  Anyway, lets open up the jvm.config, find the line that starts with "java.args=", and add to it the following:
-agentlib:hprof=cpu=samples
After saving the jvm.config, you need to find the name of one of your coldfusion instances.  By default the first multi server instance is called ‘cfusion’.  Open up a command line, and navigate to your coldfusion root, then into the bin folder.  Enter the command:  jrun -stop cfusion.  After it is stopped, enter:  jrun -start cfusion.  You should see a lot of command line information spit out to the screen while it is starting.  Once it is done starting up, do whatever is necessary to get some production traffic going to the instance.  After it runs for a while (it will run a bit slower with the profiler running), press <control> C to end the instance.  Once the instance has shut down, you will find a file in the /{coldfusion}/bin folder called java.hprof.txt which has our method report in it.  In this hypothetical instances, we see our top 95% of method time (the CPU time spent in a given method) spent within java.net.SocketInputStream.socketRead0.  This is telling us that, for 95% of the time the CPU was being utilized, we were waiting on a socket to return data from a remote system.  In this case, the only system Java/ColdFusion is talking to is the database.  Now we have a 2nd confirmation (first the CPU graph, and now the hprof report) that is isolating an issue to the database, so lets move our investigation there.
Upon remoting into the database system, we open up task manager, and we see that the system is running at nearly 100% CPU usage. This can mean that is simply has too much traffic to keep up, or the queries that are being run are non-optimal.  As indicated by Geoff’s comment on my last post, at this point I would use SQL Profiler to capture some live traffic going to your database server, then run it through the Database Tuning Advisor to get index recommendations. I have seen production servers that have dropped, while under load, from 99% to less than 40% CPU usage, simply by applying the recommendations outlined by the SQL Profiler. In this hypothetical situation, lets assume we came back with something like that (estimated 70% improvement), and that bad SQL indexes were the root cause of ColdFusion being hung up in the database waiting for table scans. Re-running the hprof report after this shows a marked decrease in the time spent waiting on a socket response, so we can pat ourselves on the back for getting one point of contention resolved.
As several users noted in the last post, having a UNC path aliased from each web server to the shared SQL box is non-optimal. In this instance it was not the cause of our major issues, so we are ignoring it for now (yet noting it for later resolution).
There was some really great feedback on my last post, and next week (after CFUnited) I will post part 3 here, how to isolate and resolve the out of memory issues and use java debugging tools to trace them back. I had intended to go over the memory bits in today’s post, but its getting a bit longer than I wanted already =)  I share these and many other debugging techniques in my ‘Performance Tuning ColdFusion: Before the JVM’ presentation at CFUnited, which is Friday afternoon.I look forward to seeing some of you there!

Source:Troubleshooting ColdFusion Performance: The Analysis

admin Alagad