We've been using AWStats at work to generate traffic reports from our Apache server logs. I like AWStats and think it does a nice job of capturing trends and helping with capacity planning. However, there's one area where it provides zero information: peak traffic. Because AWStats aggregates web traffic into monthly/daily/hourly blocks, there is no way to know how that traffic was distributed within a given hour, down to the minute or second.
This generally isn't a big deal. However, sometimes you need to answer the following question: "What is the maximum number of requests per second we handled in the last (day|month|year)?" With the aggregation provided by log analysis tools, the best you can do is average the requests for the peak hour, assuming they were spread evenly across it. For example, if you had 600 requests in your peak hour, you would have to assume 600 requests / 60 minutes = 10 requests per minute, or 1 request every six seconds. In that example, though, it's entirely possible (but unlikely) that those requests all happened in a three-second period, meaning your servers sit idle much of the time and then get clobbered by intermittent bursts of traffic.
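To make the gap between the two numbers concrete, here is a small sketch using made-up data: 600 requests in one hour, all packed into a three-second burst. The timestamps and the burst offset are assumptions for illustration only, not figures from our logs.

```python
from collections import Counter

# Hypothetical data: 600 requests in one hour, all inside a 3-second burst.
burst_start = 1800  # assumed offset of the burst, in seconds into the hour
timestamps = [burst_start + (i % 3) for i in range(600)]

# What an hourly total lets you compute: requests spread evenly over 3600 s.
average_per_second = len(timestamps) / 3600

# What per-second buckets let you compute: the busiest single second.
per_second = Counter(timestamps)
peak_per_second = max(per_second.values())

print(f"average: {average_per_second:.3f} req/s, peak: {peak_per_second} req/s")
```

The hourly average comes out to roughly one request every six seconds, while the busiest second in this contrived case carries 200 requests.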
We recently needed to answer the peak traffic question, so I set about writing a script to figure it out myself. The result is a Python script, available here, which does the following:
- Recursively locates all web server access logs within a given directory structure
- Opens each log file and parses it for requests matching a regular expression (see why below)
- Sums the request counts into per-second buckets
- Stores the per-second totals in a SQLite database
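The steps above can be sketched roughly as follows. This is not the actual script, just a minimal illustration under some assumptions: log files are plain text with names starting with "access", lines use Apache's common/combined log format, and the totals go into a hypothetical table named hits. The function names are mine.

```python
import os
import re
import sqlite3
from collections import Counter

# Timestamp field of Apache's common/combined log format, e.g.
# [10/Oct/2000:13:55:36 -0700]; it already has one-second resolution.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

def find_logs(root, prefix="access"):
    """Yield paths under root whose file names start with prefix (assumed naming)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.startswith(prefix):
                yield os.path.join(dirpath, name)

def count_per_second(lines, pattern, counts=None):
    """Count lines matching pattern, bucketed by their timestamp string."""
    counts = Counter() if counts is None else counts
    for line in lines:
        if pattern.search(line):
            stamp = TIMESTAMP.search(line)
            if stamp:
                counts[stamp.group(1)] += 1
    return counts

def store(conn, counts):
    """Write per-second totals into the hypothetical 'hits' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (second TEXT PRIMARY KEY, requests INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?)", counts.items())
    conn.commit()

def build(root, pattern, db_path):
    """Walk root, total matching requests per second, and save them to db_path."""
    counts = Counter()
    for path in find_logs(root):
        with open(path, errors="replace") as handle:
            count_per_second(handle, pattern, counts)
    conn = sqlite3.connect(db_path)
    store(conn, counts)
    return conn
```

Totals are accumulated in memory across all files and written once, which keeps the database code to a plain INSERT. For very large log sets, writing in batches may be a better fit.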
In my case, the web server log files were gzipped, so I had to decompress them before parsing. Also, I wasn't interested in total hits but in hits that involved our application servers (since the question was about application server sizing, not Apache sizing). This was the reason for the regular expression during parsing: it identified requests whose URIs had been JkMounted to JBoss.
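As an illustration of those two details, the sketch below reads a gzipped log and keeps only application requests. It uses Python's gzip module to decompress on the fly, which is one way to handle compressed logs and not necessarily how my script does it. The /app/ prefix in the pattern is a made-up stand-in for whatever paths are JkMounted in your configuration.

```python
import gzip
import re

# Hypothetical filter: assumes the application is JkMounted under /app/.
APP_REQUEST = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) /app/')

def app_lines(path):
    """Yield lines from a gzipped access log that hit the application server."""
    # Mode "rt" decompresses while reading and decodes bytes to text,
    # so no temporary uncompressed copy is written to disk.
    with gzip.open(path, "rt", errors="replace") as handle:
        for line in handle:
            if APP_REQUEST.search(line):
                yield line
```

Requests for static content served directly by Apache fall through the filter, so they never reach the per-second counts.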
After some tweaks for performance, the script was able to parse and report on about 1.5 GB of compressed log files in about an hour on my laptop. The generated SQLite database contains the request totals and was easy to use for generating peak traffic reports. I think the next step will be to add some type of visualization to the reporting to show the traffic patterns graphically.
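With per-second totals in a database, a peak traffic report comes down to a single query. The sketch below assumes a hypothetical table named hits with columns second and requests, and fills it with made-up numbers for the demo. The real schema may differ.

```python
import sqlite3

def peak_seconds(conn, limit=5):
    """Return the busiest seconds as (second, requests) rows, highest first."""
    return conn.execute(
        "SELECT second, requests FROM hits "
        "ORDER BY requests DESC, second LIMIT ?",
        (limit,),
    ).fetchall()

# Small in-memory demo with made-up totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (second TEXT PRIMARY KEY, requests INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("2000-10-10 13:55:36", 12),
        ("2000-10-10 13:55:37", 48),
        ("2000-10-10 13:55:38", 7),
    ],
)
print(peak_seconds(conn, limit=1))
```

Restricting the report to the last day, month, or year would just mean adding a WHERE clause on the second column, provided the timestamps are stored in a sortable format.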