I’ll Be At CFUnited After All!
It looks like the stars have aligned for me after all, and I will be attending this year’s (and the last ever) CFUnited. I managed to schedule a couple of client meetings at the same time, and that really helps …
Just wanted to put a quick post out there. I host about 50 Trac environments on an EC2 server, including several rather large ones that are public and pretty high traffic. Yesterday the Trac server started having issues, and most of the logged errors had a seemingly random Python traceback, followed by:
IOError: failed to write data
I dug around online for a bit but did not see much in the way of help for this issue. Then I noticed that nearly all the errors came from clients whose addresses started with 65.55.X.X. Flipping through other logs, it appeared to be an MSN bot agent hitting Trac pages, and in some cases even hitting an infinite recursion (somehow linking to /report2/report2/report2/report2/……). The quick and dirty fix was to block the entire subnet from accessing my vhosts, which was done with this little gem inside each vhost’s <Directory> block in Apache:
Order Allow,Deny
Allow from all
Deny from 65.55.0.0/16
Granted, this is not a perfect solution, but keeping the site up for real users was more important at the moment than allowing a rampant recursive spider to keep slamming the site as soon as I brought it back up (for nearly 2 days straight now).
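Before blocking a whole /16, it can help to confirm how much of the traffic really comes from that range. Here is a minimal Python sketch of that check. It assumes the client address is the first field of each access log line (as in Apache's common and combined log formats); the sample lines and the function name count_hits are made up for illustration and are not from my actual logs.

```python
import ipaddress
from collections import Counter

# Subnet taken from the Deny rule above.
BLOCKED = ipaddress.ip_network("65.55.0.0/16")

def count_hits(log_lines, network=BLOCKED):
    """Count requests per client IP for clients inside `network`.

    Assumes the client address is the first whitespace-separated field.
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            client = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # hostname or malformed entry; skip it
        if client in network:
            hits[str(client)] += 1
    return hits

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '65.55.1.10 - - [01/Jan/2009:10:00:00 +0000] "GET /report2/report2 HTTP/1.1" 500 0',
    '65.55.1.10 - - [01/Jan/2009:10:00:01 +0000] "GET /report2/report2/report2 HTTP/1.1" 500 0',
    '192.0.2.7 - - [01/Jan/2009:10:00:02 +0000] "GET /wiki HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    for ip, n in count_hits(SAMPLE_LOG).most_common():
        print(ip, n)
```

In practice you would pass an open log file instead of SAMPLE_LOG, since the function only needs an iterable of lines.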
Happy Friday!
Source: MSN bot was smashing Trac server, here is how I blocked it
OK, time for a follow-up to my 1st post in this troubleshooting series for our Project X, and I am going to talk about some steps we can take to better identify and isolate the issues today. I had some great participation on the last post, thanks to all who contributed (and you are entered for the backpack!). If you have not read the first post outlining the hypothetical scenario, do that before reading the rest here.
So the first step I usually take in any project is as simple as observation. Looking at Task Manager with ‘Show Kernel Times’ enabled will give us an instant pointer. Let’s take a peek at one of the web servers while it is under heavy load, and we see this:
This graph tells us a few things. First of all, notice that the server is nowhere near using 100% of its possible CPU utilization, and notice the red line (which is the kernel or system usage). Kernel usage that is extremely high, or seems to follow the CPU usage like this one does, is often indicative of an I/O issue. An I/O issue can be slow or overloaded disk resources, waiting on a network thread to complete, or waiting on database requests. As noted in the prior post, we are seeing Out of Memory exceptions after a time as well, but let’s chase down issues one at a time here.
So, reading the CPU graph on the web server makes us suspect an I/O issue; let’s try to confirm this. One of my favorite tools for this is hprof, a Java code profiler that is built into the JVM. I went over using a few parts of hprof in this post, and the part we will use to isolate this issue is hprof=cpu=samples. Note that making this change will require stopping and starting a production instance, but in this case the site is already going up and down. I look at this as triage: errors are already occurring, and taking steps on a production server to more quickly isolate the issue is worth generating a few more errors on an already unstable system… I digress =) Anyway, let’s open up the jvm.config, find the line that starts with "java.args=", and add the following to it:
-agentlib:hprof=cpu=samples
After saving the jvm.config, you need to find the name of one of your ColdFusion instances. By default the first multi-server instance is called ‘cfusion’. Open up a command line and navigate to your ColdFusion root, then into the bin folder. Enter the command: jrun -stop cfusion. After it is stopped, enter: jrun -start cfusion. You should see a lot of command line information spit out to the screen while it is starting. Once it is done starting up, do whatever is necessary to get some production traffic going to the instance. After it runs for a while (it will run a bit slower with the profiler running), press Ctrl+C to end the instance. Once the instance has shut down, you will find a file in the /{coldfusion}/bin folder called java.hprof.txt which has our method report in it. In this hypothetical instance, we see 95% of method time (the CPU time spent in a given method) spent within java.net.SocketInputStream.socketRead0. This tells us that for 95% of the sampled time, we were waiting on a socket to return data from a remote system. In this case, the only system Java/ColdFusion is talking to is the database. Now we have a 2nd confirmation (first the CPU graph, and now the hprof report) that isolates the issue to the database, so let’s move our investigation there.
Upon remoting into the database system, we open up Task Manager, and we see that the system is running at nearly 100% CPU usage. This can mean that it simply has too much traffic to keep up with, or that the queries being run are non-optimal. As indicated by Geoff’s comment on my last post, at this point I would use SQL Profiler to capture some live traffic going to your database server, then run it through the Database Tuning Advisor to get index recommendations. I have seen production servers drop, while under load, from 99% to less than 40% CPU usage simply by applying the recommendations from the Tuning Advisor. In this hypothetical situation, let’s assume we came back with something like that (estimated 70% improvement), and that bad SQL indexes were the root cause of ColdFusion being hung up in the database waiting for table scans. Re-running the hprof report after this shows a marked decrease in the time spent waiting on a socket response, so we can pat ourselves on the back for getting one point of contention resolved.
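If the java.hprof.txt file is large, the ranked method table can be pulled out with a few lines of script. The Python sketch below assumes the usual hprof layout, where the table sits between "CPU SAMPLES BEGIN" and "CPU SAMPLES END" with rank, self, accum, count, trace, and method columns. The sample report text is made up to match the hypothetical numbers in this post, and the function name top_methods is my own.

```python
import re

# Made-up excerpt in the layout hprof uses for its CPU SAMPLES table.
SAMPLE_REPORT = """\
CPU SAMPLES BEGIN (total = 200) Thu Jan  1 00:00:00 2009
rank   self  accum   count trace method
   1 95.00% 95.00%     190 300101 java.net.SocketInputStream.socketRead0
   2  3.00% 98.00%       6 300102 java.lang.String.intern
   3  2.00% 100.00%       4 300103 java.util.HashMap.get
CPU SAMPLES END
"""

# rank, self %, accum %, sample count, trace id, method name
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)%\s+([\d.]+)%\s+(\d+)\s+(\d+)\s+(\S+)\s*$")

def top_methods(report_text, limit=5):
    """Return (method, self_percent, trace_id) for the top-ranked rows."""
    rows = []
    in_table = False
    for line in report_text.splitlines():
        if line.startswith("CPU SAMPLES BEGIN"):
            in_table = True
            continue
        if line.startswith("CPU SAMPLES END"):
            break
        if in_table:
            m = ROW.match(line)
            if m:  # the column header line does not match and is skipped
                rows.append((m.group(6), float(m.group(2)), int(m.group(5))))
    return rows[:limit]

if __name__ == "__main__":
    for method, pct, trace in top_methods(SAMPLE_REPORT):
        print(f"{pct:6.2f}%  trace {trace}  {method}")
```

The trace id in each row points at a TRACE entry higher up in the same file, which shows the call path that led to the hot method.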
As several users noted in the last post, having a UNC path aliased from each web server to the shared SQL box is non-optimal. In this instance it was not the cause of our major issues, so we are ignoring it for now (yet noting it for later resolution).
There was some really great feedback on my last post, and next week (after CFUnited) I will post part 3 here: how to isolate and resolve the out of memory issues and use Java debugging tools to trace them back. I had intended to go over the memory bits in today’s post, but it’s getting a bit longer than I wanted already =) I share these and many other debugging techniques in my ‘Performance Tuning ColdFusion: Before the JVM’ presentation at CFUnited, which is Friday afternoon. I look forward to seeing some of you there!
This is part 3 in my Troubleshooting ColdFusion Performance series. It was briefly interrupted by a great trip to CFUnited 2009, then catching up, but here it finally is! Several folks approached me at CFUnited with encouragement and enthusiasm for this series, and I cannot tell you how much I appreciate that! I’m glad I can finally give something back to a community that has given me so much over the years!
As this is part 3, you will most likely want to take a quick read over the 1st 2 posts in this series.
Troubleshooting Coldfusion Performance: The Problem
Troubleshooting Coldfusion Performance: The Analysis
So at this point, we have isolated and resolved a SQL performance issue using some of the tools that come with MS SQL. While this does not mean that our SQL configuration is now completely optimal, it has resolved the largest issue we are observing here, so we move on. Now it’s time to tackle our out of memory issues.
When your system is getting an Out of Memory exception, it can actually be easier to debug than errors that only happen sporadically. There is a JVM argument that you can add to your jvm.config file that will cause a heap dump to be generated every time your server runs out of memory. Add this argument to your jvm.config file:
-XX:+HeapDumpOnOutOfMemoryError
Once you have this argument added, restart ColdFusion. Then you have the fun part of waiting for (or causing) an out of memory condition to occur. For Project X, we have 4 servers in our cluster, so we will add this argument to 1 server and bring it back into the cluster. Once it runs out of memory, the server will take a few minutes to shut down after it crashes, because it writes a heap dump file to your {coldfusion}/bin folder with the contents of your memory at the time that the server crashed.
Now you have a file, but if you tried to open it in Notepad, you may cause yourself more issues (the files are most often hundreds of megs, if not gigs, in size). There are several tools you can use to open your heap dump and identify possible memory leaks. Here are 3 that I find myself using.
NetBeans - NetBeans is a full Java IDE with tons of plugins, and includes a memory profiler and tools to view and explore a heap dump.
VisualVM - VisualVM is a tool dedicated to profiling, monitoring, and analyzing Java application performance.
Memory Analyzer (MAT) - MAT can be run standalone or as a plugin integrated to Eclipse. MAT has a slick feature that will automatically try to analyze a heap dump to find possible memory leaks. While it has been hit or miss for me with ColdFusion code thus far, it has some great reporting and can make the heap dump a bit less intimidating.
So, this being a hypothetical project and all, I wrote a quick test case to show a simple memory leak. You can download it here and check it out yourself to verify that these methods work for tracking down objects that are staying in heap too long. After one request of the index.cfm file in my sample, capture a heap dump from within VisualVM. Load it up, and switch to ‘classes’ view. On the bottom of the classes view, you will see a filter. Type in ‘cfl’ then click the green check-mark. All of the CFCs you create that go into memory are prefixed by ‘cf’, followed by your filename, then some compiler hash looking string of letters and numbers. In this simple test case, we are looking for ‘leaky.cfc’, so filtering on ‘cfl’ will find them. After 1 request, we will see something like the following.
Notice that the cfleaky2ecfc482229893 object shows 101 instances in memory. The 1 instance is our ‘application’ scope variable, and the 100 instances are a result of us running ‘doSomething()’ against it. When you run the index.cfm file another few times, then capture another heap dump, you see something like this:
So let’s say we are observing this on Project X, where we have a class whose instance count keeps growing like this on every request. This can obviously start chewing up memory, and even after 20 minutes this memory did not get reclaimed in my local tests. Now, in Project X, our codebase will obviously be much larger than the 3 files in my test, and isolating the code that creates this object would be much harder than looking at 10 lines of code. To help us isolate this, we will use another technique I blogged about here, using hprof with ‘sites’ to show objects that are in heap along with the stack trace associated with their creation. For those who have not used a stack trace, it’s basically a road map of how the object in question was created. Think of a request, where one method calls another, which in turn calls another, and so on. Each method call will be a line item in your stack trace, which will eventually show you the file from which your object was created. After you stop your ColdFusion instance, you will get a file that is named {coldFusion root}/bin/java.hprof.txt. On my test, after making 20 requests or so, the file generated was about 32 megs in size. When you open it up, you will see a ton of stack trace objects at the top of the file. Scroll ALL the way down till you get to a list of classes and their count of instances in memory, and you will see:
I highlighted 2 items here: the rightmost one is the class name (the name of the compiled template called leaky.cfc), and the number to the left of it is the stack trace index. Copy that number 368230 into the clipboard, then search for it in this file (above the list we are looking at now). You will come to the stack trace that created this object, which looks like:
This technique can help you quickly locate the source of leaky objects. Sometimes you will have to alter the depth of the stack trace to properly locate your object (this was run with a depth of 6, so it shows only the most recent 6 method calls). In this very simple test case, changing line 5 of index.cfm will resolve the extra objects in memory.
Before: <cfset request.leaky = application.leaky>
After: <cfset request.leaky = duplicate( application.leaky )>
This simple example is just an illustration of how you can get yourself into trouble when pointing a reference to objects from a shared scope like application or server, into a local scope like request. When we make a link from the request scope into the application scope, it maintains a reference such that the request variable would never go away until the referencing application variable gets garbage collected (which would wait for the application timeout). By duplicating the object into the request scope instead, this ensures there is no reference to the application scope variable, which allows it to be cleaned up after the request completes.
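The difference between pointing at a shared object and duplicating it is easy to see outside of ColdFusion too. Here is a small Python analogy, not ColdFusion code: the Leaky class, the dict-based application and request scopes, and handle_request are all hypothetical stand-ins, and copy.deepcopy plays the role that duplicate() plays above.

```python
import copy

class Leaky:
    """Hypothetical stand-in for the leaky.cfc component."""
    def __init__(self):
        self.items = []

def handle_request(application, use_copy):
    """Simulate one request that touches the shared component."""
    request = {}  # short-lived, per-request scope
    if use_copy:
        # Like duplicate(): the request gets its own independent object.
        request["leaky"] = copy.deepcopy(application["leaky"])
    else:
        # Like the 'Before' line: the request points at the shared object.
        request["leaky"] = application["leaky"]
    request["leaky"].items.append("work done during the request")
    return request

if __name__ == "__main__":
    application = {"leaky": Leaky()}  # long-lived, shared scope
    for _ in range(3):
        handle_request(application, use_copy=False)
    print("after aliased requests:", len(application["leaky"].items))
    for _ in range(3):
        handle_request(application, use_copy=True)
    print("after copied requests:", len(application["leaky"].items))
```

With the alias, everything a request attaches to the object stays reachable from the application scope after the request ends. With the copy, the per-request state has no link back to the shared scope and can be collected.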
This technique is great for identifying poor code practices, or errors in your objects. With ProjectX, this technique will let you find and squash any memory leaks that are present. Running this through several times until you no longer observe a constantly growing heap size will eliminate your memory leak, and let us proceed to other issues!
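The manual lookup described earlier (find the class in the list at the bottom of java.hprof.txt, copy its stack trace index, search upward for the matching trace) can also be scripted. This Python sketch assumes the standard layout of an hprof heap=sites report: TRACE blocks with tab-indented frames, followed by a SITES BEGIN table whose last two columns are the trace index and the class name. The class name and trace index come from my test above, but the byte counts, object counts, and stack frames in the sample are placeholders I made up, and traces_for_class is my own helper name.

```python
import re

# Made-up excerpt; only the class name and trace index come from the post.
SAMPLE = (
    "TRACE 368230:\n"
    "\tplaceholder.FrameOne.call(Unknown Source)\n"
    "\tplaceholder.FrameTwo.call(Unknown Source)\n"
    "TRACE 300001:\n"
    "\tplaceholder.Other.call(Unknown Source)\n"
    "SITES BEGIN (ordered by live bytes) Thu Jan  1 00:00:00 2009\n"
    "          percent          live          alloc'ed  stack class\n"
    " rank   self  accum     bytes objs     bytes  objs trace name\n"
    "    1 60.00% 60.00%     48000 2000     48000  2000 368230 cfleaky2ecfc482229893\n"
    "    2 40.00% 100.00%     32000 1000     32000  1000 300001 java.lang.String\n"
    "SITES END\n"
)

# rank, self %, accum %, live bytes, live objs, alloc bytes, alloc objs, trace, name
SITE_ROW = re.compile(
    r"^\s*\d+\s+[\d.]+%\s+[\d.]+%\s+\d+\s+(\d+)\s+\d+\s+\d+\s+(\d+)\s+(\S+)\s*$"
)

def parse_traces(text):
    """Map each trace index to its list of stack frames."""
    traces, current = {}, None
    for line in text.splitlines():
        m = re.match(r"^TRACE (\d+):", line)
        if m:
            current = int(m.group(1))
            traces[current] = []
        elif current is not None and line[:1] in ("\t", " ") and line.strip():
            traces[current].append(line.strip())
        else:
            current = None  # any other line ends the current trace block
    return traces

def traces_for_class(text, name_fragment):
    """Return (class_name, live_objects, trace_index, frames) per matching site."""
    traces = parse_traces(text)
    results = []
    for line in text.splitlines():
        m = SITE_ROW.match(line)
        if m and name_fragment in m.group(3):
            trace_id = int(m.group(2))
            results.append(
                (m.group(3), int(m.group(1)), trace_id, traces.get(trace_id, []))
            )
    return results

if __name__ == "__main__":
    for name, live, trace_id, frames in traces_for_class(SAMPLE, "cfleaky"):
        print(f"{name}: {live} live objects, trace {trace_id}")
        for frame in frames:
            print("   ", frame)
```

On a real report you would read the file contents into a string and pass the same ‘cfl’ style fragment you used in the VisualVM filter.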
Next post I will talk about the iterative process of tuning your JVM configuration, using JMeter to simulate load while you observe some of your system metrics.
Remember, anyone who comments on any post in this series will be entered into the drawing for an Alagad backpack, which, as anyone who has one can attest, is totally sweet. Please feel free to contribute here. Any comments are welcome to clarify anything I left hanging, and I will try to answer any questions that are posted!
Source: Troubleshooting ColdFusion Performance: Analysis Part II
As promised, I have posted the presentation files - including all the code - for my CFUnited presentation.
You can get the files here.
Source: 'ColdFusion and jQuery: Perfect Together' Presentation Files
I’m just throwing that question out there. Please don’t expect to find the answer to this or really any quandary from reading one of my blog posts.
Up until this past weekend I hadn’t really played around with the Papervision open source 3D engine in over a year. I remember Papervision being the new hotness back then. Seems like every other blog post was some sort of cool new 3D texturing globe component or amazing 3D navigation using it. Then Flash 10 came out with its own somewhat native 3D functionality and the Papervision buzz wore off.
So what became of this popular 3D engine?
Well, from what I can gather, Papervision is continuing to be used on various 3D projects across a wide variety of Flash applications online. Check out the New York Times 3D rendering of the 15th hole at the US Open Golf Tournament here. Or check out the radical 3D navigation in this UFC promotional site here.
So why is it still being used, isn’t it native in Flash 10?
A. Flash 10’s 3D capabilities fall short of the expectations for your client’s particular project. The new 3D capabilities of Flash 10 are more of a distortion of a bitmap with perspective and not truly a 3D engine. In fact, the developers of Papervision are actively looking at how to better take advantage of the new player to speed up and increase the performance of their 3D engine.
B. The penetration numbers for Flash 10 aren’t there yet. While the Flash browser plug-in is pretty ubiquitous among internet-enabled desktops, the latest version with 3D support is only being reported on about 87% of devices compared with Flash 9 at 99%.
Is the project still evolving?
Papervision’s blog seems to be pretty active and the source code continues to be actively updated. However the actual project download (swc or zip) is dated at March of 0, there hasn’t been an issue reported in a while and the project wiki hasn’t been updated for about half of a year. So it would appear that some of the hype has at least slowed.
My $0.02?
I’m a fan of using some subtle 3D techniques to enhance an application. Flash Player 10 provides me with enough resources to create some interesting transitions and navigation elements without the overhead of adding an additional, truly 3D engine. And as for penetration, that would differ from client to client. So maybe Flash Player 10’s 3D hacks are going to cover me for 90% of my needs. That being said, I’d still turn to Papervision for any project that actually had any involved 3D elements. I’m thinking of applications that would take advantage of spatial dimensions, such as a cargo loading application or maybe a 3D product design application like a build-your-own wine bottle website.
What about you guys? Where do you draw the line? Do you think Flash 10 killed the Papervision 3D engine, or maybe its momentum?
I just wanted to remind folks of my upcoming presentation at CFUnited, entitled Performance Tuning ColdFusion: Before the JVM. I will go over tooling and techniques you can use to identify difficult to find source code issues, as well as the methodologies I use when diagnosing issues on a customer’s server that is having performance difficulties. I will show you how to obtain and analyze a heap dump using VisualVM, how to review method usage and trace method calls using HPJmeter, how to isolate and resolve MS SQL performance issues using the Database Engine Tuning Advisor, and generally how to isolate problematic code issues and apply logical thinking to arrive at conclusions which will allow you to resolve performance issues.
I had the opportunity to give this presentation last night at the Mid Michigan CFUG in Lansing, and I believe it was very well received. Much note taking was going on, and I think everyone walked away with a few new ideas they could apply to their servers right away. Hope to see you at CFUnited 2009!
Source: Upcoming CFUnited Presentation - Performance Tuning: Before the JVM
Before we dig into Hibernate, I wanted to take a quick textbook look at what ORM is since Hibernate is an example of an ORM framework. ORM spelled out is object/relational mapping. One text I read said that the slash between object and relational is supposed to emphasize the mismatch problem that occurs when the object oriented world meets the relational world.
The way the book Java Persistence with Hibernate defines ORM is:
.. the automated (and transparent) persistence of objects in a Java application to the tables in a relational database, using metadata that describes the mapping between the objects and the database.
This sounds pretty simple – so basically an ORM knows enough about the object model in your application to know how to take a Person object and its related Address etc. instances and transform that data into your relational database structure and vice versa.
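To make that definition a little more concrete, here is a toy Python sketch of the idea: a small piece of mapping metadata tells generic save and load functions how to move an object to and from a table. This is not Hibernate’s API and not how Hibernate is implemented; the Person fields, the TABLES and SQL_TYPES dictionaries, and the function names are all assumptions of mine, and related objects such as Address are left out to keep it short.

```python
import sqlite3
from dataclasses import dataclass, fields

@dataclass
class Person:
    id: int
    name: str
    city: str

# Mapping metadata: which table a class maps to, and Python type -> SQL type.
TABLES = {Person: "person"}
SQL_TYPES = {int: "INTEGER", str: "TEXT"}

def create_table(conn, cls):
    """Derive the table definition from the class's fields."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    conn.execute(f"CREATE TABLE {TABLES[cls]} ({cols})")

def save(conn, obj):
    """Persist one object as one row, using only the mapping metadata."""
    cls = type(obj)
    names = [f.name for f in fields(cls)]
    placeholders = ", ".join("?" for _ in names)
    conn.execute(
        f"INSERT INTO {TABLES[cls]} ({', '.join(names)}) VALUES ({placeholders})",
        [getattr(obj, n) for n in names],
    )

def load(conn, cls, obj_id):
    """Rebuild an object from its row, or return None if it is missing."""
    names = [f.name for f in fields(cls)]
    row = conn.execute(
        f"SELECT {', '.join(names)} FROM {TABLES[cls]} WHERE id = ?", (obj_id,)
    ).fetchone()
    return cls(*row) if row else None

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn, Person)
    save(conn, Person(1, "Ada", "London"))
    print(load(conn, Person, 1))
```

Notice that no SQL is written per class. Adding another mapped class would only mean adding an entry to the metadata, which is the productivity argument in miniature.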
Again, according to Java Persistence with Hibernate, an ORM solution is made up of the following 4 components:
An API for performing basic CRUD expressions – think SQL
A language or API for creating queries that refer to objects and their properties instead of tables and columns
Capabilities for describing mappings between objects
Capabilities for functionality including dirty checking, lazy fetching, and other optimization
There were a lot of buzz words in that last point, but the optimization reference is one I am interested in. ORM solutions in the past have always seemed to just add an extra layer of overhead to an application and thus suffer a performance penalty. If through Hibernate we can optimize the queries that are run and the way data is returned, that could make a big difference. On the flip side, there are quite a few benefits to a properly implemented ORM solution like Hibernate.
Productivity
As much as the DBAs out there love to write and optimize SQL queries, like any other coding, it still takes time to write and time to debug. In addition, you may have to deal with cross platform issues in dealing with different relational database systems. An ORM framework such as Hibernate can drop right into place and with a few tweaks to the object model, handle all of this persistence layer for us. I expect there will be some optimization work and a few queries that still need to be written, but for the most part, all of that redundant CRUD work is taken care of.
Maintenance
Maintenance goes along with productivity – fewer lines of code to be written in the first place means fewer lines of code to maintain down the road. In the case of Hibernate, you are also dealing with an open source, battle tested application rather than starting from scratch with your own persistence layer – thus hopefully reducing the bugs you have to deal with.
Performance
This one is where I have the most questions. With hand coded SQL queries, you can always tweak the query to get just the data you need. You can cache those query results and more to get every little bit of performance you can. The book argues that with Hibernate, these optimizations are even easier to achieve and faster to implement. There is also the argument that with an open source project like Hibernate, many more minds have looked at the code, with more time to investigate optimizations, than you have on any given project. This argument I can see and agree with.
Vendor Independence
This benefit is debatable, since with most custom applications I have ever worked on, you very rarely see a database platform change. However, if you were developing software that was sold commercially and had to support multiple database platforms, I could see this being a big benefit. In order to support multiple platforms without an ORM, you would either be limited to very generic SQL or you would have to build multiple persistence layer implementations, which would just add to the maintenance overhead. With a vendor independent solution such as Hibernate, this dependency is taken care of for you.
That is enough for today. Next up, the requisite Hello World project – what book would be complete without one? The book obviously sets this up in Java, but my plan is to duplicate it on ColdFusion 9 using the new Hibernate integration.
In this post series, I would like to put forward a hypothetical situation involving poor ColdFusion application performance, the investigative steps to take to isolate the issues, and the remedial steps to perform in order to solve those issues. I would really like some feedback from readers as well, to hypothesize on possible issues and possible resolutions, and to supply other tools or methods which may identify or solve the issues we discuss. I hope that this post series will not only help you identify and deal with ColdFusion issues, but also help to identify database, network, or hardware issues as they arise.
Note: This hypothetical situation, while pulled from my experiences, is not a direct parallel to any of my previous customers, and is instead a combination of factors from several different projects. Let’s call it Project X.
The Environment
Project X is set up to run across (4) ColdFusion 8 Enterprise edition servers, in a load balanced cluster behind a hardware load balancer. Sticky sessions are configured, so once a user makes a request to a given server, their subsequent requests should continue on the same server.
Project X has a single MS SQL 2005 database on a 32 bit Windows 2003 server, which has 4 GB of RAM. This server has (3) 15K 75 gig SCSI hard drives in RAID 5, upon which the operating system and the MS SQL binaries are installed. There is an iSCSI connected device which has (8) 10K 147 gig SCSI hard drives. The iSCSI device contains the MS SQL data and log files.
Each Project X web server is a 32 bit Windows 2003 server with 4 GB of RAM, and (3) 15K 75 gig SCSI hard drives in RAID 5. Each web server is running 2 instances of ColdFusion in a local cluster (using round robin to split requests between instances), and each ColdFusion instance is using the default JVM configuration that ships with ColdFusion 8.
There is a shared folder on the MS SQL server which contains all shared page assets (files uploaded by the users, PDF documentation, and images). Each web server is running Apache, and has an alias pointing to the shared folder on the SQL box (using a UNC path). All servers are connected to a 24 port gigabit switch, and are hosted on an OC3 line.
Project X is a web based file sharing application which allows users to upload and share files of many types (images, PDFs, office documents, and more). It makes use of Application.cfm to load site variables, and uses several CFC objects to encapsulate database queries and user information.
The Problem
For several years this configuration has worked fine for the customer, with stable servers and acceptable response times. Project X has recently run an ad campaign in the national media, which has increased their site traffic by a factor of 2. Since the campaign, users have been complaining of slow web response times, as well as error messages. Investigating the server logs also shows that the ColdFusion instances have been crashing with out of memory errors.
You are tasked with uncovering the issues that are causing the slow page rendering and the out of memory server issues. In my next post I will share both the techniques I use to identify these issues and a selection of ideas from comment responses to this post. Thoughts?
Bonus
Now, to encourage participation here, anyone who contributes a relevant comment to this chain of blog posts will be entered into a drawing to win an Alagad backpack (trust me, these are the best backpacks ever, I use mine for travel, school, everything). I will do a drawing for the backpack in a Connect room after this blog series is completed, so let’s bring on the ideas!
I have a few quick things I think people might be interested in, but rather than writing a bunch of small blog entries, I figured I’d wrap them up into one shot.
First off, Alagad has released an all-new alagad.com website. This new website separates the developer-focused content from the corporate marketing information. This should probably make both audiences happier. I personally love the new design and hope you do too. We’ll be adding more content including project descriptions, client information, event information, press releases and more. All things in good time!
This week five of the Alagad team members (Scott Stroz, Brian Kotek, Chris Peterson, Vicky Ryder and myself, Doug Hughes) are at the CFUnited conference. We’re what I’d call a roaming sponsor, which means we didn’t spring for a booth. However, we’re not skimping on our conference giveaway. If you’re at CFUnited and you find an Alagad employee you can get entered into a raffle to win one of ten Alagad bags stuffed with a Mac Mini, a Wii, an iPod Touch, or an Amazon gift card. All you need to do is answer a short survey!
Speaking of conferences, I’ve decided that we’re going to be sponsoring the Adobe Max conference this year. I’m excited since we’ve never done that before. I’m not sure we’ll do the big giveaway, but I’ll try to do something exciting. (Any ideas?)
Finally, we have nine of the very nice Alagad backpacks left over from our conference giveaways. So we’re planning on holding a series of Blog-based contests with the winner getting one of the bags. Chris Peterson is running the first one right now. Keep your eyes on the Alagad blog for more information.
Source: Alagad Quickies