Analyzing Server Logs With HTTP-Analyze

Find out how popular your site really is.

Looking Closer

Things sure change fast on the Internet.

A few years ago, it didn't matter how wacky your idea was, or how unlikely it was to succeed. So long as you had a cool Web site, a lot of attitude and a business plan (even one which didn't necessarily add up), people were lining up to throw money at you. Advertisers particularly - all you could hear was talk of "eyeballs" and "impressions per day", with little attention paid to the details.

All that's changed now. Today, before advertisers will even consider your site as a medium for their message, they want a presentation on the numbers. How many hits you get per day/per hour/per week, which are the most popular sections of your site, who's linking to you and sending you traffic, how many of the impressions you get are unique and result in a "clickthrough" - these are just some of the factors an advertiser will consider before condescending to pay you for billboard space. And if you don't have the numbers to back up your promises, you might as well kiss the contract goodbye.

Clicks, hits and unique impressions are the staples of this new landscape - and so, before you can pitch the advantages of your site to a potential advertiser, you need to have concrete data to present. To this end, you need a tool that will track traffic on your site, and give you detailed reports and numbers which you can use in your sales presentations.

And that's where http-analyze comes in.

The Number Game

According to its official Web site at http://www.http-analyze.org/, http-analyze is "a logfile analyzer for Web servers...it creates a comprehensive statistics report in HTML...of all accesses sorted by various criteria in tabular and in graphical form". Developed and sold under a commercial license, http-analyze is one of the most popular tools for calculating the popularity of your Web site. You must buy a license for the software if you intend to use it for commercial purposes; however, non-commercial users and educational institutions may use it for free, with restrictions.

http-analyze works by parsing the access log file generated by your Web server and converting the data within it into a detailed HTML report which you can read in your browser. Its greatest strength is its report generation module - http-analyze can produce reports showing you the number of hits and page views per day, week and month; the ten most requested items on your site; the number of files cached; detailed referrer information and client signatures (browser type, requesting IP and so on); information on total bytes sent; and monthly report summaries for comparison purposes.
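To make the idea of "parsing the access log" a little more concrete, here is a minimal Python sketch of the same kind of processing. It is not http-analyze's own code: the pattern, the function name "hits_per_day" and the sample entries are all made up for illustration, and it assumes log entries in the Common Logfile Format.

```python
import re
from collections import Counter

# Common Logfile Format: host ident user [timestamp] "request" status bytes
# (hypothetical pattern for illustration, not taken from http-analyze)
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def hits_per_day(lines):
    """Count the log entries that parse as CLF, grouped by calendar day."""
    counts = Counter()
    for line in lines:
        match = CLF_PATTERN.match(line)
        if match:  # lines that are not valid CLF are skipped
            counts[match.group("day")] += 1
    return counts


# Made-up sample entries
sample = [
    '127.0.0.1 - - [18/Aug/2003:10:15:32 +0530] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [18/Aug/2003:10:15:40 +0530] "GET /logo.png HTTP/1.0" 200 1024',
    '10.0.0.5 - - [19/Aug/2003:08:01:02 +0530] "GET /about.html HTTP/1.0" 304 -',
]

for day, hits in hits_per_day(sample).items():
    print(day, hits)
```

A real analyzer does a great deal more with each entry (page views, sessions, referrers, bytes sent), but every one of those numbers starts from a per-line match like this one.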

If the thought of wading through raw numbers scares you, http-analyze also has the ability to draw a picture for you. Since the software is closely linked with the open-source GD library, it breaks up the raw data with easy-to-understand graphs and pie charts so as to offer a Webmaster a visual picture of traffic and site usage. Reports are fully configurable - you can alter the layout of the reports, and even suppress information in them - and filter out irrelevant URLs or referrers for greater clarity. Finally, a built-in 3D module even supports report generation in VRML, so if you have a VRML plug-in for your browser, you can navigate your traffic statistics in 3D!

The software is stable, easy to customize and well supported by its developers. It's also standard with most Web hosting accounts, and if you have your own Internet server, you can easily install it yourself. Read on, and let me show you how.

Back To Basics

The first order of business is to install http-analyze on the Linux box you plan to use as a Web server. Drop by the official http-analyze Web site at http://www.http-analyze.org/ and get yourself the latest stable release of the software (this tutorial uses http-analyze 2.4). Note that you will need a C compiler and some Web server logs in order to use the software (this tutorial uses GCC 3.2 and Apache 1.3.26), as well as an installed copy of the GD, PNG and ZLIB libraries.

Once you've downloaded the source code archive to your Linux server, log in as "root"

$ su -
Password: ****

and extract the source to a temporary directory.

$ cd /tmp
$ tar -xzvf /home/me/http-analyze2.4-src.tar.gz

The next step is to configure the package. Unlike most packages, which come with a "configure" script, http-analyze is configured by editing its "Makefile". Most of the time, all you need to do is set the location where the binary will be installed. Pop open this file and alter the HA_* variables so that they look like this (use different paths if you like, of course):

# Local conventions
# -----------------
#
# HA_BINDIR is the directory where the executable is installed.
# HA_LIBDIR is the directory where the required files are installed.
# HA_MANDIR defines the directory for the manpage in source or
#       pre-formatted format.
#
HA_BINDIR = /usr/local/http-analyze/bin
HA_LIBDIR = /usr/local/http-analyze/lib
HA_MANDIR = /usr/local/http-analyze/man/cat1

Next, compile and install it.

$ make
$ make install

http-analyze should now have been installed to the directory "/usr/local/http-analyze".

You can verify this with a quick recursive listing of that directory - here's what you should see.

$ ls -lR /usr/local/http-analyze/
total 12
drwxr-xr-x    2 root     root         4096 Mar 29 20:30 bin/
drwxr-xr-x    8 root     root         4096 Mar 29 20:29 lib/
drwxr-xr-x    3 root     root         4096 Mar 29 20:29 man/

./bin:
total 284
-rwxr-xr-x    1 bin      bin        282656 Mar 29 20:29 http-analyze*

./lib:
total 196
-rw-r--r--    1 root     root        53335 Mar 29 20:29 3Dlogo.wrl.gz
-rw-r--r--    1 root     root        70649 Mar 29 20:29 3Dprolog.wrl
-rw-r--r--    1 root     root        24059 Mar 29 20:29 3DshelfMotion.wav
-rw-r--r--    1 root     root         3914 Mar 29 20:29 TLD
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn/
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn-de/
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn-de_BY/
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn-fr/
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn-it/
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 btn-pt_BR/
-rw-r--r--    1 root     root         6606 Mar 29 20:29 bugreport.html

./man:
total 4
drwxr-xr-x    2 root     root         4096 Mar 29 20:29 cat1/

Once you've got http-analyze installed, the next step is to configure it and start building reports. Let's look at that next.

Test Drive

http-analyze is entirely console-based - that is, the tool is invoked from the command-line and provided a set of options that tell it how to process the server's log file. To illustrate, let's do a quick test run using the binary you just installed:

$ /usr/local/http-analyze/bin/http-analyze -vf -o stats/  /usr/local/apache/logs/access_log
http-analyze 2.4pl3 (i586; Linux 2.4.20; XPG4 MNLS; PNG)
Copyright 2004 by RENT-A-GURU(TM)
Generating full statistics in output directory `stats/'
Reading data from `/usr/local/apache/logs/access_log'
Best blocksize for I/O is set to 64 KB
Start new period at 18/Aug/2003
Common Logfile Format (CLF) detected
Creating full statistics for August 2003
... processing URLs
... processing hostnames
Total entries read: 154, processed: 154
Clear almost all counters at 19/Aug/2003
Start new period at 01/Oct/2003
Creating full statistics for October 2003
... processing URLs
... processing hostnames
Total entries read: 474, processed: 474
Clear almost all counters at 23/Oct/2003
Start new period at 01/Nov/2003
Creating full statistics for November 2003
... processing URLs
... processing hostnames
Total entries read: 10, processed: 10
Clear almost all counters at 06/Nov/2003
Start new period at 01/Mar/2004
No more hits since 26/Mar/2004
Creating full statistics for March 2004
... processing URLs
... processing hostnames
[WARNING]: Invalid TLD, add `localhost' to unresolved
Creating short statistics for March 2004
Total entries read: 35, processed: 35
Statistics complete until 30/Mar/2004

In case you're wondering, the "-v" option turns on verbosity, the "-f" option tells http-analyze to generate a frames-based interface for the HTML reports, and the "-o" option specifies the output directory into which the reports must go. The last argument is, of course, the path to the access log file that is to be processed.

http-analyze will now parse the log file and create HTML reports of the data within it. These reports, together with associated images, will be linked together and placed in the output directory. To view them, point your browser to the "index.html" file in the output directory.

Select one of the options from the menu, and you'll be taken to a report summary page displaying hits, pageviews, sessions, files sent, files cached and KB sent per month. The summary information is also available in graphical form.

You can click on any of the months in the list to obtain a more detailed view of the most (and least) popular items for that month, as well as hits by country, by weekday and by month; top ten lists of browsers, referrers and URLs; exhaustive lists of files and URLs accessed; and much more.

For a complete explanation of what each item in the list means, refer to http://www.http-analyze.org/results.html, which does a great job of explaining the report in a non-technical manner.

A Custom Job

Every time http-analyze runs, it processes your log files and creates a set of reports in the named output directory. These reports contain a default set of headings - hits, pageviews, sessions, referrers, URLs - but you can customize them extensively, simply by adding appropriate options when invoking the program.

For example, you can restrict the report to a particular period with the "-I" and "-E" options, which control the duration of the report. The following example generates a report for only November 2003:

$ /usr/local/http-analyze/bin/http-analyze -I 11/2003 -E 12/2003 -o /tmp/stats /usr/local/apache/logs/access_log

The "-f" option creates a frames-based interface for easier navigation:

$ /usr/local/http-analyze/bin/http-analyze -f -o /tmp/stats /usr/local/apache/logs/access_log

Add the "-v" option to tell http-analyze to print status messages as it's working:

$ /usr/local/http-analyze/bin/http-analyze -vf -o /tmp/stats /usr/local/apache/logs/access_log

For more detailed status messages, add a second "-v":

$ /usr/local/http-analyze/bin/http-analyze -vf -v -o /tmp/stats /usr/local/apache/logs/access_log

To tell http-analyze not to group all the images together as one item in the report, add the "-x" option:

$ /usr/local/http-analyze/bin/http-analyze -vfx -o /tmp/stats /usr/local/apache/logs/access_log

If you have multiple domains on the same server (virtual hosts), you can tell http-analyze to restrict its analysis to individual domains by specifying the document root to filter against when generating the reports - just add the "-R" option and follow it with the document root:

$ /usr/local/http-analyze/bin/http-analyze -vf -R /vhosts/site1 -o /tmp/stats /usr/local/apache/logs/access_log

To suppress the display of certain headings in the report, add the "-s" option together with a list of what is to be suppressed (look in the manual for a complete list of which headings can be suppressed). The following example removes the list of user agents, URLs and sites from the report:

$ /usr/local/http-analyze/bin/http-analyze -s Agents,Urls,SiteList  -o /tmp/stats /usr/local/apache/logs/access_log

To control the "ten" in http-analyze's top ten lists and replace it with a custom number, use the "-t" option, followed by a number and alphabetic code indicating which list to manipulate. For example, the following line generates a top ten list of URLs, a top fifty list of referrers and a top five list of browsers:

$ /usr/local/http-analyze/bin/http-analyze -vf -t 10U50R5A -o /tmp/stats/ /usr/local/apache/logs/access_log
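If the "-t" argument looks cryptic, it may help to see it taken apart. The following Python sketch splits such a string into its number/letter pairs. It only illustrates the syntax as described above; the function name "parse_top_spec" is hypothetical, and this is not how http-analyze itself reads the option.

```python
import re


def parse_top_spec(spec):
    """Split a spec such as '10U50R5A' into {'U': 10, 'R': 50, 'A': 5}."""
    pairs = re.findall(r"(\d+)([A-Za-z])", spec)
    # Reassemble the pairs; any difference means the input had stray characters
    if "".join(number + code for number, code in pairs) != spec:
        raise ValueError("expected number/letter pairs, got %r" % spec)
    return {code: int(number) for number, code in pairs}


print(parse_top_spec("10U50R5A"))
```

Each number applies to the letter that follows it, which is why the order of the pairs on the command line doesn't matter.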

Want to see your report in 3D VRML? Simply activate VRML reporting in your command line with the "-3" option, as below, and then point your VRML-capable browser to the statistics directory. Look for the "3D model" link at the bottom of the main summary page, and select it to see a 3D report of traffic and pageviews.

$ /usr/local/http-analyze/bin/http-analyze -v3 -o /tmp/stats /usr/local/apache/logs/access_log
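If you end up calling http-analyze from a script of your own, you could assemble these options programmatically instead of typing them out. Here is a minimal Python sketch that uses only the options covered in this section; the helper names "build_command" and "run_report" are my own invention, and the binary path assumes the installation location used earlier in this tutorial.

```python
import subprocess

# Assumed install location, as used earlier in this tutorial
HTTP_ANALYZE = "/usr/local/http-analyze/bin/http-analyze"


def build_command(logfile, outdir, verbose=False, frames=False, vrml=False,
                  docroot=None, suppress=None, top=None):
    """Build an argument list from the options discussed in this section."""
    cmd = [HTTP_ANALYZE]
    if verbose:
        cmd.append("-v")                     # print status messages
    if frames:
        cmd.append("-f")                     # frames-based interface
    if vrml:
        cmd.append("-3")                     # VRML report
    if docroot:
        cmd += ["-R", docroot]               # restrict to one document root
    if suppress:
        cmd += ["-s", ",".join(suppress)]    # headings to leave out
    if top:
        cmd += ["-t", top]                   # sizes of the top-N lists
    cmd += ["-o", outdir, logfile]
    return cmd


def run_report(cmd):
    """Run http-analyze; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


cmd = build_command("/usr/local/apache/logs/access_log", "/tmp/stats",
                    verbose=True, frames=True, top="10U50R5A")
print(" ".join(cmd))
```

The sketch only prints the command line; call "run_report(cmd)" yourself once you're happy with what it prints.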

Rank And File

Of course, you don't have to keep passing http-analyze options on the command line - it's very unsightly, plus you might make a typo and blow up your Web server (kidding!). Instead, you can simply have the program read them from a configuration file, which is both easier on the eyes and offers greater customization capabilities.

To generate a new configuration file filled with default values, run http-analyze with the "-i" option and the location where the file is to be saved, as below:

$ http-analyze -i /usr/local/http-analyze/ha.config

Take a quick peek inside the file - you'll see something like this:

--snip--

# The name of your server (defaults to the local hostname).
# Must be a full qualified domain name (FQDN) here, not an URL.
ServerName  medusa

# The prefix to use in URLs for hotlinks. Needed only if the
# statistics report is hosted on another server than the pages
# listed in the report. If the protocol specifier is not given,
# 'http' is assumed. Trailing slashes are removed.
#ServerURL  http://www.mycompany.com

# The name of the default logfile to be used if none given
# at invocation of http-analyze.
#LogFile    /usr/ns-home/www.foo.com/logs/access

# The format of the logfile:
#   auto    Automatically determine format (default)
#   clf     Common Logfile Format (CLF)
#   dlf     Combined Logfile Format (CLF + Referrer + UserAgent)
#   elf     Extended Logfile Format (CLF + UserAgent + Referrer)
#LogFormat  auto

# The default mode of operation now is full statistics ("monthly") mode.
#DefaultMode    monthly

# The name of the directory where the output files of the statistics
# report are to be created.
#OutputDir  /usr/www/stats

# Use symlinks for the required files and buttons if missing.
# Requires installation of all files/buttons in HA_LIBDIR.
#BtnSymlink No

--snip--

This file is well-commented and has many options you can use to customize your http-analyze installation. Most of these options are also available on the command line, as you've seen in the previous section - using a configuration file is preferable, as (a) you don't need to remember what "-v3f" stands for and (b) it provides a central place for you to make configuration changes.
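Judging by the excerpt above, each active setting is a directive name followed by whitespace and a value, and a leading "#" turns a line into a comment. If you ever want to check from a script which settings are actually switched on, a minimal Python sketch could look like this. The function name "parse_config" is hypothetical, and the format rules are inferred from the excerpt only, so treat them as assumptions rather than a specification.

```python
def parse_config(text):
    """Return {directive: [values]} for every active (uncommented) line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # directive name, then the rest
        value = parts[1].strip() if len(parts) > 1 else ""
        # Keep a list per directive, on the assumption that some may repeat
        settings.setdefault(parts[0], []).append(value)
    return settings


sample = """\
# The name of your server (defaults to the local hostname).
ServerName  medusa
#ServerURL  http://www.mycompany.com
#LogFormat  auto
"""

print(parse_config(sample))
```

In a freshly generated file, most directives are commented out, so only the ones you have deliberately enabled show up.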

Here's a list of the important variables in this file (for a complete list, see http://www.http-analyze.org/manual2.4/man06.html#ConfigurationFile):

DefaultMode - sets whether statistics should be generated "daily" or "monthly"

DocRoot - sets the document root for which statistics should be generated (useful if you have multiple virtual servers on the same physical machine)

ServerURL - sets the URL prefix to use for hotlinks in the reports (needed only if the reports are hosted on a different server from the pages they list)

IgnURL - adds a URL to the ignore list (useful if you want to suppress report generation for certain sections of your site)

LogFile - sets the path to the server's log file

LogFormat - tells http-analyze what format the log file is in, "clf", "dlf" or "elf". Use "auto" to have http-analyze figure it out automatically.

OutputDir - sets the location for the HTML reports to be saved. To make the reports browseable through a Web browser, set this to a location under your server's document root.

Sessions - sets the time period within which accesses by the same client from the same IP are counted as a single session

NavWinSize, NavigFrame, CustLogoW, CustLogoB, HeadFont, TextFont and ListFont - all these variables can be used to customize the look and feel of the reports
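Of the variables above, "Sessions" is probably the least obvious, so here is a small Python sketch of the idea behind it. It assumes that a new session begins whenever a host has been silent for longer than the configured period; that interpretation, the 30-minute default and the function name "count_sessions" are all my own assumptions, not details taken from http-analyze.

```python
from datetime import datetime, timedelta


def count_sessions(accesses, window=timedelta(minutes=30)):
    """Count sessions in an iterable of (host, datetime) pairs.

    Assumption: a host starts a new session when more than `window`
    has passed since its previous access.
    """
    last_seen = {}
    sessions = 0
    for host, when in sorted(accesses, key=lambda access: access[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > window:
            sessions += 1
        last_seen[host] = when
    return sessions


# Made-up accesses from two hosts
accesses = [
    ("10.0.0.1", datetime(2003, 11, 6, 10, 0)),
    ("10.0.0.1", datetime(2003, 11, 6, 10, 10)),
    ("10.0.0.2", datetime(2003, 11, 6, 10, 5)),
    ("10.0.0.1", datetime(2003, 11, 6, 11, 0)),
]

print(count_sessions(accesses))
```

A shorter period splits the same traffic into more sessions, and a longer one merges it into fewer, so it's worth settling on a value before you start comparing months.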

Once you've made your changes to this file, save it and then run "http-analyze" with the "-c" option followed by the location of the file, so that the program reads its options from there:

$ http-analyze -c /usr/local/http-analyze/ha.config

Time Out

http-analyze's reports are built for the period between the first and last timestamps in the access log, and are organized into separate directories, one for each year. Within the directory for the year, http-analyze stores separate reports, organized by month, for things like hits, pageviews and files; average load; top ten lists; originating country data; and other components of the report. The reports are linked with each other and with the main index file, which is refreshed on every http-analyze run with the latest numbers.

To ensure that your server statistics are always up-to-date, the http-analyze manual recommends creating time windows of 1 month, and running the program once a day during each window to regenerate the reports with the latest numbers. At the end of every month, the access log file should be backed up and cleared in readiness for the next month (this helps to reduce the processing time required for each month's log file and is also neater). http-analyze can then be run once on the previous month's log file to generate a summary report for that month, which then gets added to the output directory and becomes available for viewing at a later date.

If this seems somewhat complicated, don't worry - you can combine free tools provided by the http-analyze distribution with the "cron" utility to automate the entire process. Here is a sample "crontab" for the above:

# update statistics once a day at 6 PM
00 18 * * * /usr/local/http-analyze/bin/http-analyze -f -c /usr/local/http-analyze/ha.config

# rotate logs once a month
0 0 1 * * /usr/local/http-analyze/bin/rotate-httpd

The "rotate-httpd" script ships as part of the http-analyze distribution, and can be found in its "tools/" directory; the "crontab" above assumes you have copied it to "/usr/local/http-analyze/bin/". Look inside it to see what it does, and while you're at it, also look inside the "contrib/" directory, which contains some Perl scripts designed to rotate log files for virtual hosts and generate statistics reports on a regular basis.
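To give you a feel for what a monthly rotation step involves, here is a minimal Python sketch. It is not the "rotate-httpd" script and makes no claim to match what that script does: the function names and the archive naming scheme are assumptions of mine. It simply moves the log aside under the name of the month that has just ended and leaves an empty file in its place.

```python
import os
import tempfile
from datetime import date


def archive_name(logfile, today):
    """Name the archive after the month that has just ended."""
    year, month = today.year, today.month - 1
    if month == 0:  # January: the previous month is December of last year
        year, month = year - 1, 12
    return f"{logfile}.{year:04d}-{month:02d}"


def rotate(logfile, today):
    """Move the log aside and create an empty replacement."""
    target = archive_name(logfile, today)
    os.rename(logfile, target)
    open(logfile, "a").close()  # fresh, empty log for the new month
    return target


# Demonstration on a throwaway file rather than a real server log
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "access_log")
    with open(log, "w") as handle:
        handle.write("sample entry\n")
    print(os.path.basename(rotate(log, date(2004, 4, 1))))
```

Keep in mind that a running Web server holds its log file open, so after a move like this it has to be told to reopen the log (by restarting it, for example) before new entries will land in the fresh file.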

Link Out

And that's about all we have time for. In this article, I introduced you to http-analyze, one of the most popular log analyzers for Web server log files. Compact, powerful and feature-rich, http-analyze comes pre-installed with most hosting accounts...or you can even install it yourself.

After a brief description of how to install and test the package, I moved on to a description of what it can do. I showed you the basic syntax needed to build a set of reports from log files, and also showed you how to customize the reports by suppressing or adding extra information. In order to reduce errors, I also showed you how to store your configuration in a separate file instead of passing options to the program on the command line, and described how to automate the process of daily report generation with "cron" and the "rotate-httpd" script.

Of course, there's a lot more you can do with http-analyze - for more information, consider visiting the following links:

The official http-analyze home page, at http://www.http-analyze.org/

The http-analyze manual, at http://www.http-analyze.org/manual2.4/index.html

A description of how to read an http-analyze report, at http://www.http-analyze.org/results.html

The http-analyze FAQ, at http://www.http-analyze.org/faq/

Until next time...have fun, and drop me a line when you hit 100,000 users a month!

This article was first published on 09 Mar 2004.