IPfind log file scanner

If you run your own website, you want to know what's happening on it. With a small site it's easy to keep a grip on operations, but once the HTML page count exceeds 100, things can get a bit hairy. And if you perform maintenance (with the occasional structural overhaul), a mistake is easily made. All of a sudden the 404 count goes up, 'for no reason in particular' :o).
It helps if you either do your own webhosting on a home-based server or, like I do, lease webspace from a professional webhoster. So I need to check the site statistics periodically. One of the points of interest is the status code overview like the one above. The most important number is the 404 count: 404 is the status code for 'Page not found'. But '500: Internal server error' should also be kept as low as possible.

Webalizer access logs

There are many tools for the active webpage maintainer. My webhost uses the Webalizer traffic manager. It keeps track of all pages served by the Apache webserver and generates nice overviews and graphs, plus a log file per day. At the end of the day, the log file for that day is gzipped and appended to the yearly tar file.

The yearly file for this year is called 'access_log_2007.tar' and it contains daily log files with names like 'access_log_20070927.gz'. In general: 'access_log_YYYYmmDD.gz' in which YYYY is the year, mm is the month and DD is the day. For processing it is convenient to unzip the files:

	jan@beryllium:~/tmp/stats$ gzip -d access_log_200*
   
You end up with quite a lot of log files, but all of them are now easy to process with less, grep or any other text utility. Still, the log entries are mighty long: a single entry can wrap over several rows on a 120 column display. One example:
g193139.upc-g.chello.nl - - [17/May/2007:04:43:55 +0200] "GET /fotos/kermisP.jpg HTTP/1.1" 200 59055
"http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=4041408&MyToken=3e56f124-
8e05-4f50-889a-9f3d7e122a42" "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.8.0.11) Gecko/20070312
 Firefox/1.5.0.11 Creative ZENcast v1.02.08"
   
In fact, this is ONE line of text. For the explanation I will use a shorter logfile entry:
livebot-65-55-209-220.search.live.com - - [17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 "-"
"msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
   
It contains the following 9 datafields:

Field | Value | Meaning
IP | livebot-65-55-209-220.search.live.com | The host name (or, if it could not be resolved, the IP address) of the client that made the request.
Ident | - | The identity of the client as reported by identd. In practice this is almost always "-".
User | - | The user name, if the request used HTTP authentication. Otherwise "-".
Date | [17/May/2007:01:53:42 +0200] | The date and time at which the request was received, with the offset from UTC.
URL | "GET /robots.txt HTTP/1.0" | The request line received by the webserver from the webbrowser (client). In this case, the client was a spider from a search engine and it asks for the file robots.txt, which has directions for spiders.
Status | 404 | The HTTP status code of the response. In this case it is an error code: 404 = Page not Found.
Bytes | 280 | The number of bytes sent in the response, not counting the headers. In this case it was the length of the 404.html file that was returned to the client.
Referrer | "-" | The previous webpage, which contained the link that caused this GET command. In this case, the spider has performed a direct request.
Browser ID | "msnbot/1.0 (+http://search.msn.com/msnbot.htm)" | The ID string sent by the browser or spider requesting the webpage. These strings can be extremely long and contain lots of gobbledygook.
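
To make the field layout concrete, here is a minimal sketch in Python (not Modula-2, and not part of IPfind) that splits one entry into the nine fields with a regular expression. The group names and the function name parse_line are my own labels for this illustration. The sketch assumes that quoted fields never contain an escaped double quote, which real logs can violate.

```python
import re

# One access-log entry in the nine-field layout described above.
# The group names are labels chosen for this sketch only.
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the nine fields as a dict, or None if the line does not fit."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])   # compare as a number later on
    return fields

# The short example entry from the text, joined back into ONE line.
sample = ('livebot-65-55-209-220.search.live.com - - '
          '[17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 '
          '"-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"')

entry = parse_line(sample)
print(entry["status"], entry["request"])   # prints: 404 GET /robots.txt HTTP/1.0
```

The size is left as text on purpose, since a server may log "-" there when no body was sent.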

Processing the access log files

Of course you can go through the files one by one with 'less' and search for a specific error code. But this is very time-consuming and error-prone. Suppose you search for the pattern 404. Then you get all the entries with a 404 status code, but also all the entries where the response happened to be 404 bytes long, or where the time string has 404 in it.
I could have made a utility in Tcl/Tk, but this would cause some problems since the log files use characters that are special to Tcl (square brackets and double quotes) as delimiters. So I switched back to my beloved Mocka Modula-2 compiler and made IPfind. With IPfind you can search for specific occurrences in the access logs.
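
The false matches from a plain text search are easy to demonstrate. The Python sketch below uses two made-up log entries (the host names and paths are hypothetical) and compares a substring search for 404 with a test on the status field only. The helper status_of is a name I chose for this illustration; it assumes the request itself contains no double quote.

```python
# Two hypothetical log entries, invented for this illustration.
lines = [
    # A genuine 404: the status field (right after the quoted request) is 404.
    'host-a.example.net - - [17/May/2007:01:53:42 +0200] '
    '"GET /missing.html HTTP/1.0" 404 280 "-" "bot/1.0"',
    # A successful request whose response happened to be 404 bytes long.
    'host-b.example.net - - [17/May/2007:02:10:05 +0200] '
    '"GET /small.html HTTP/1.0" 200 404 "-" "bot/1.0"',
]

def status_of(line):
    """Return the status code: the first token after the quoted request."""
    after_request = line.split('"')[2]   # the text between request and referrer
    return int(after_request.split()[0])

substring_hits = [line for line in lines if "404" in line]
status_hits = [line for line in lines if status_of(line) == 404]

print(len(substring_hits), len(status_hits))   # prints: 2 1
```

The substring search reports both entries, while only the first one is a real 'Page not found'.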

IPfind.mod

In order to get through the Webalizer files easily, I made the IPfind program. It is meant to be used in a pipeline: it expects all data via stdin and writes all results to stdout. Below is the source code:

MODULE IPfind;

(*  Scanner for finding error codes from Webalizer log files	*)
(*  CopyLeft Jan Verhoeven, Tilburg (NL)	Nov 2007	*)

IMPORT	Arguments, InOut, Strings, NumConv;

TYPE	Identifier			= ARRAY [0..255] OF CHAR;

VAR	option, IP, Date, URL, Ref, Dum		: Identifier;
	Code, value				: CARDINAL;
	Exhausted, ok, ip, url, code		: BOOLEAN;
	buffer					: Arguments.ArgTable;
	count					: SHORTCARD;


(*  Skip one whitespace-delimited item that we do not need.	*)
PROCEDURE SkipItem;

VAR	ch	: CHAR;

BEGIN
   REPEAT  
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF  Exhausted  THEN  RETURN  END
   UNTIL  ch > ' ';
   REPEAT  
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF  Exhausted  THEN  RETURN  END
   UNTIL  ch <= ' '
END SkipItem;


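(*  Read an item enclosed in [...] or between two identical	*)
(*  delimiters such as "..."; the delimiters are dropped.	*)
PROCEDURE ReadName (VAR str : ARRAY OF CHAR);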

VAR	n		: CARDINAL;
	ch, endch	: CHAR;

BEGIN
   REPEAT  
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF  Exhausted  THEN  RETURN  END
   UNTIL  ch > ' ';		(* Eliminate whitespace		*)
   IF  ch = '['  THEN  endch := ']'  ELSE  endch := ch  END;
   n := 0;
   REPEAT
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF  Exhausted  THEN  RETURN  END;
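      (* Keep reading up to the real delimiter, even when str is full.	*)
      (* Stopping early would leave the rest of the field in the input	*)
      (* and put all following fields out of step.			*)
      IF  n <= HIGH (str)  THEN  str [n] := ch  END;
      INC (n)
   UNTIL ch = endch;
   (* Overwrite the stored closing delimiter with the terminator.	*)
   IF  n <= HIGH (str) + 1  THEN  str [n-1] := 0C  END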
END ReadName;


(*  Read one whitespace-delimited item into str.		*)
PROCEDURE ReadString (VAR str : ARRAY OF CHAR);

VAR	n		: CARDINAL;
	ch, endch	: CHAR;

BEGIN
   REPEAT  
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF  Exhausted  THEN  RETURN  END
   UNTIL  ch > ' ';		(* Eliminate whitespace		*)
   n := 0;
   REPEAT
      IF  n <= HIGH (str)  THEN  str [n] := ch  END;
      InOut.Read (ch);
      INC (n)
   UNTIL (ch <= ' ') OR InOut.EOF ();
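   (* n characters are stored in str [0..n-1], so terminate at n.	*)
   (* Writing 0C to str [n-1] would cut off the last character.	*)
   IF  n <= HIGH (str)  THEN  str [n] := 0C  END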
END ReadString;


(*  TRUE if the current log entry matches the chosen option.	*)
PROCEDURE Condition () : BOOLEAN;

BEGIN
   IF  ip AND (Strings.pos (option, IP) <= HIGH (IP))  THEN
      RETURN TRUE
   ELSIF  url AND (Strings.pos (option, URL) <= HIGH (URL))  THEN
      RETURN TRUE
   ELSIF  code AND (value = Code)  THEN
      RETURN TRUE
   END;
   RETURN FALSE
END Condition;


BEGIN
   ip := FALSE;			url := FALSE;		code := FALSE;

   Arguments.GetArgs (count, buffer);
   IF  count < 3  THEN
      InOut.WriteString ("Usage : IPFIND IP xxx | URL xxx | CODE xxx");
      InOut.WriteLn;
      HALT
   END;
   Strings.Assign (option, buffer^[1]^);
   IF  Strings.StrEq (option, 'IP')  THEN
      ip := TRUE;
      Strings.Assign (option, buffer^[2]^)
   ELSIF  Strings.StrEq (option, 'URL')  THEN
      url := TRUE;
      Strings.Assign (option, buffer^[2]^)
   ELSIF  Strings.StrEq (option, 'CODE')  THEN
      code := TRUE;
      Strings.Assign (option, buffer^[2]^);
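      NumConv.Str2Num (value, 10, option, ok);
      IF  NOT ok  THEN		(* reject a non-numeric status code	*)
         InOut.WriteString ('CODE needs a decimal number. Aborting.');
         InOut.WriteLn;
         HALT
      END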
   ELSE
      InOut.WriteString ('Illegal option. Aborting.');
      InOut.WriteLn;
      HALT
   END;
   
   REPEAT
      ReadString (IP);		SkipItem;		SkipItem;
      ReadName (Date);
      ReadName (URL);		InOut.ReadCard (Code);	SkipItem;
      ReadName (Dum);		ReadName (Dum);
      IF  Condition ()  THEN
         InOut.WriteString (Date);			InOut.WriteCard (Code, 12);
	 InOut.WriteLn;		InOut.Write (11C);	InOut.WriteString (URL);
	 InOut.WriteLn;		InOut.Write (11C);	InOut.WriteString (IP);
	 InOut.WriteLn;
	 InOut.WriteLn
      END
   UNTIL  Exhausted;
   InOut.WriteBf
END IPfind.
   
One example of the usage:
	jan@beryllium:~/modula/div$ cat access_log_20071117 | IPfind CODE 404
	
	17/Nov/2007:04:15:24 +0100         404
	        GET /etc/telef.html HTTP/1.0
		speedyspider.entireweb.co

	jan@beryllium:~/modula/div$
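
The way IPfind walks through an entry can also be sketched in Python, for readers who do not have a Modula-2 compiler at hand. This is a simplified illustration and not a translation of the program: the function name read_tokens is my own, and unlike ReadName it only treats '[' and '"' as opening delimiters.

```python
def read_tokens(line):
    """Split one log entry into tokens, roughly the way IPfind's readers do.

    A token that starts with '[' runs to the next ']', a token that starts
    with '"' runs to the next '"', and anything else runs to the next blank.
    The delimiters themselves are dropped.
    """
    closers = {'[': ']', '"': '"'}
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():              # skip whitespace between items
            i += 1
            continue
        if line[i] in closers:             # delimited item, may contain blanks
            end = line.find(closers[line[i]], i + 1)
            if end == -1:                  # unterminated item: take the rest
                end = n
            tokens.append(line[i + 1:end])
            i = end + 1
        else:                              # plain item, ends at the next blank
            start = i
            while i < n and not line[i].isspace():
                i += 1
            tokens.append(line[start:i])
    return tokens

sample = ('livebot-65-55-209-220.search.live.com - - '
          '[17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 '
          '"-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"')

tokens = read_tokens(sample)
print(len(tokens), tokens[5])   # prints: 9 404
```

Because the date and the request contain blanks, they can only be read correctly by looking for their closing delimiter, which is why IPfind uses a separate procedure for them.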
			
   
At the moment IPfind recognizes three search options, which must be typed in upper case:

Option | Action
CODE | Look for one specific status code, as a decimal number.
URL | Look for a text pattern to be present in the URL field of Webalizer. It doesn't matter where in the URL field the text pattern occurs.
IP | Look for a text pattern to be present in the IP field of Webalizer. It doesn't matter where in the IP field the text pattern occurs.

IPfind expects the data to be delivered via a pipe or through STDIN. Output is sent to STDOUT or a pipe, as you wish. In the download section you can download the executable, the source and an example access log file.
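
For comparison, the three options can be sketched in a few lines of Python. This is an illustration of the matching rules only, not a replacement for IPfind: the names matches, split_fields and scan are mine, the field splitting is simplified, and the input comes from an in-memory string instead of a pipe.

```python
import io

def matches(fields, option, pattern):
    """Mirror the three options: IP and URL are substring tests,
    CODE is an exact comparison with a decimal status code."""
    if option == "IP":
        return pattern in fields["host"]
    if option == "URL":
        return pattern in fields["request"]
    if option == "CODE":
        return fields["status"] == int(pattern)
    raise ValueError("option must be IP, URL or CODE")

def split_fields(line):
    """Pick host, request and status out of one log entry (simplified:
    assumes the request contains no double quote)."""
    parts = line.split('"')
    return {
        "host": parts[0].split()[0],
        "request": parts[1],
        "status": int(parts[2].split()[0]),
    }

def scan(stream, option, pattern):
    """Yield the entries from stream that satisfy the option."""
    for line in stream:
        if line.strip():                   # ignore empty lines
            fields = split_fields(line)
            if matches(fields, option, pattern):
                yield fields

# Stand-in for a log file: the short example entry from the text.
log = io.StringIO(
    'livebot-65-55-209-220.search.live.com - - [17/May/2007:01:53:42 +0200] '
    '"GET /robots.txt HTTP/1.0" 404 280 "-" '
    '"msnbot/1.0 (+http://search.msn.com/msnbot.htm)"\n'
)

for hit in scan(log, "CODE", "404"):
    print(hit["status"], hit["request"], hit["host"])
```

To use the sketch in a pipeline like IPfind, pass sys.stdin to scan instead of the in-memory string.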

Page created 28 September 2007,