IPfind log file scanner
If you run your own website, you want to know what's happening. With a small site, it's easy to keep a grip on
operations. But when the HTML page count exceeds 100, things can get a bit hairy. And if you perform
maintenance (with the occasional structural overhaul), an error is easily made. All of a sudden the 404 count
goes up, 'for no reason in particular' :o).
It helps if you either host the site yourself on a home-based server or, like I do, lease webspace from a
professional webhoster. So I need to check the site statistics periodically. One of the
points of interest is the status code overview like the one above. Most important is the 404 count: 404 is the
status code for 'Page not found'. But '500: Internal server error' should also be kept as low as possible.
Webalizer access logs
There are many tools for the active webpage maintainer. My webhost uses the Webalizer traffic manager. It keeps track of all pages served by the Apache webserver and generates nice overviews and graphs, plus a log file per day. At the end of the day, the log file for that day is gzipped and appended to the yearly tar file.
The yearly file for this year is called 'access_log_2007.tar' and it contains daily log files with names like 'access_log_20070927.gz'. In general: 'access_log_YYYYmmDD.gz' in which YYYY is the year, mm is the month and DD is the day. For processing it is convenient to unzip the files:
jan@beryllium:~/tmp/stats$ gzip -d access_log_200*
You end up with quite a lot of log files, but all of them are easy to process now with less, grep or any other text utility. Still, the entries are mighty long and wrap over several rows on a 120 column display. One example:
g193139.upc-g.chello.nl - - [17/May/2007:04:43:55 +0200] "GET /fotos/kermisP.jpg HTTP/1.1" 200 59055 "http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=4041408&MyToken=3e56f124-8e05-4f50-889a-9f3d7e122a42" "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11 Creative ZENcast v1.02.08"
In fact, this is ONE line of text. For the explanation I will use a shorter logfile entry:
livebot-65-55-209-220.search.live.com - - [17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
It contains the following 9 data fields:
| Field | Value | Meaning |
|---|---|---|
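| IP | livebot-65-55-209-220.search.live.com | The IP address or host name of the client that made the request. |
| Ident | - | The client identity as reported by identd. A dash means it is not available, which is nearly always the case. |
| User | - | The user name from HTTP authentication. A dash means the request was not authenticated. |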
| Date | [17/May/2007:01:53:42 +0200] | The date and time of the request, followed by the offset from UTC. |
| URL | "GET /robots.txt HTTP/1.0" | This is the command received by the webserver from the web browser (client). In this case, the client was a spider from a search engine and it asks for the file robots.txt, which has directions for spiders. |
| Status | 404 | This is the status code of the response. In this case the status is an error code: 404 = Page not Found. |
| Bytes | 280 | The number of bytes sent. In this case it was the length of the 404.html file that was returned to the client. |
| Referrer | "-" | The page that contained the link which led to this GET command. A dash means there was none: in this case, the spider has performed a direct request. |
| Browser ID | "msnbot/1.0 (+http://search.msn.com/msnbot.htm)" | This is the ID string returned by the browser or spider requesting the webpage. These strings can be extremely long and contain lots of gobbledygook. |
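To make the field layout concrete, here is a minimal sketch that splits the shorter log entry into its nine fields. It is written in Python rather than Modula-2, purely for illustration, and it is not how IPfind works. The pattern `LOG_LINE`, the function `parse_line` and the group names are this sketch's own inventions.

```python
import re

# One regular expression for a whole log line.
# The group names are labels chosen for this sketch only.
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the nine fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])   # status as a number, the rest as text
    return fields


sample = ('livebot-65-55-209-220.search.live.com - - '
          '[17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 '
          '"-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"')

entry = parse_line(sample)
print(entry["host"], entry["status"], entry["request"])
# prints: livebot-65-55-209-220.search.live.com 404 GET /robots.txt HTTP/1.0
```

The sketch assumes that quoted fields never contain an escaped double quote, which is good enough for a quick look but not for every log line in the wild.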
Processing the access log files
Of course you can go through the files one by one with 'less' and do a search for a specific error code. But
this is very time consuming and error prone. Suppose you are looking for the pattern 404. Then you get all the
lines with a 404 status code, but also all the requests that happened to return 404 bytes, and every line that
has 404 somewhere else, for instance in the URL or in the host name.
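The difference between searching the whole line and checking only the status field can be shown with a few lines of Python. This is a sketch for illustration only: the three log lines are made up for the occasion, and `status_of` is a helper invented here.

```python
# Three invented log lines: only the first one has status 404.
sample_lines = [
    'a.example.org - - [17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 "-" "bot/1.0"',
    'b.example.org - - [17/May/2007:02:10:07 +0200] "GET /small.html HTTP/1.0" 200 404 "-" "bot/1.0"',
    'c.example.org - - [17/May/2007:03:15:30 +0200] "GET /fotos/img404.jpg HTTP/1.0" 200 59055 "-" "bot/1.0"',
]


def status_of(line):
    """Return the status code: the first number after the quoted request."""
    after_request = line.split('"')[2]   # the text between request and referrer
    return int(after_request.split()[0])


# Plain text search: matches the byte count and the file name as well.
naive = [line for line in sample_lines if "404" in line]

# Field-aware search: matches the status code only.
exact = [line for line in sample_lines if status_of(line) == 404]

print(len(naive), "lines contain the text 404")      # prints: 3 lines contain the text 404
print(len(exact), "line(s) really have status 404")  # prints: 1 line(s) really have status 404
```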
I could have made a utility in Tcl/Tk, but this would cause some problems since the log files use characters
that are special to Tcl, such as square brackets and double quotes, as delimiters. So I switched back to my
beloved Mocka Modula-2 compiler and made IPfind. With IPfind you can search for specific occurrences in the
access logs.
IPfind.mod
In order to get through the Webalizer files easily, I made the IPfind program. It is meant to be used in a pipeline: it expects all data via stdin and sends all results to stdout. Below is the source code:
MODULE IPfind;

(* Scanner for finding error codes from Webalizer log files *)
(* CopyLeft Jan Verhoeven, Tilburg (NL) Nov 2007 *)

IMPORT Arguments, InOut, Strings, NumConv;

TYPE  Identifier = ARRAY [0..255] OF CHAR;

VAR   option, IP, Date, URL, Ref, Dum : Identifier;
      Code, value                     : CARDINAL;
      Exhausted, ok, ip, url, code    : BOOLEAN;
      buffer                          : Arguments.ArgTable;
      count                           : SHORTCARD;

PROCEDURE SkipItem;
(* Skip leading whitespace, then skip one whitespace delimited field *)

VAR   ch : CHAR;

BEGIN
  REPEAT
    InOut.Read (ch);
    Exhausted := InOut.EOF ();
    IF Exhausted THEN RETURN END
  UNTIL ch > ' ';
  REPEAT
    InOut.Read (ch);
    Exhausted := InOut.EOF ();
    IF Exhausted THEN RETURN END
  UNTIL ch <= ' '
END SkipItem;

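PROCEDURE ReadName (VAR str : ARRAY OF CHAR);
(* Read a delimited field: [ ... ] for the date, otherwise the *)
(* opening character (a double quote) also closes the field.   *)

VAR   n         : CARDINAL;
      ch, endch : CHAR;

BEGIN
  REPEAT
    InOut.Read (ch);
    Exhausted := InOut.EOF ();
    IF Exhausted THEN RETURN END
  UNTIL ch > ' ';                 (* Eliminate whitespace *)
  IF ch = '[' THEN endch := ']' ELSE endch := ch END;
  n := 0;
  REPEAT
    InOut.Read (ch);
    Exhausted := InOut.EOF ();
    IF Exhausted THEN RETURN END;
    (* When the buffer is full, keep reading without storing, so  *)
    (* that the rest of the field is consumed and not mistaken    *)
    (* for the start of the next field.                           *)
    IF n <= HIGH (str) THEN str [n] := ch END;
    INC (n)
  UNTIL ch = endch;
  (* Replace the closing delimiter by a string terminator *)
  IF n - 1 <= HIGH (str) THEN str [n-1] := 0C END
END ReadName;
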
PROCEDURE ReadString (VAR str : ARRAY OF CHAR);
(* Read one whitespace delimited field into str *)

VAR   n         : CARDINAL;
      ch, endch : CHAR;

BEGIN
  REPEAT
    InOut.Read (ch);
    Exhausted := InOut.EOF ();
    IF Exhausted THEN RETURN END
  UNTIL ch > ' ';                 (* Eliminate whitespace *)
  n := 0;
  REPEAT
    IF n <= HIGH (str) THEN str [n] := ch END;
    InOut.Read (ch);
    INC (n)
  UNTIL (ch <= ' ') OR InOut.EOF ();
  (* str [0..n-1] now holds the field, so the terminator goes at  *)
  (* index n. Writing it at n-1 would chop off the last character. *)
  IF n <= HIGH (str) THEN str [n] := 0C END
END ReadString;

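PROCEDURE Condition () : BOOLEAN;
(* TRUE if the current log entry matches the search option. For  *)
(* IP and URL, a position inside the array bounds means that the *)
(* pattern was found.                                            *)

BEGIN
  IF ip AND (Strings.pos (option, IP) <= HIGH (IP)) THEN
    RETURN TRUE
  ELSIF url AND (Strings.pos (option, URL) <= HIGH (URL)) THEN
    RETURN TRUE
  ELSIF code AND (value = Code) THEN
    RETURN TRUE
  END;
  RETURN FALSE
END Condition;
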
BEGIN
  ip := FALSE;   url := FALSE;   code := FALSE;
  Arguments.GetArgs (count, buffer);
  IF count < 3 THEN
    InOut.WriteString ("Usage : IPfind IP xxx | URL xxx | CODE xxx");
    InOut.WriteLn;
    HALT
  END;
  Strings.Assign (option, buffer^[1]^);
  IF Strings.StrEq (option, 'IP') THEN
    ip := TRUE;
    Strings.Assign (option, buffer^[2]^)
  ELSIF Strings.StrEq (option, 'URL') THEN
    url := TRUE;
    Strings.Assign (option, buffer^[2]^)
  ELSIF Strings.StrEq (option, 'CODE') THEN
    code := TRUE;
    Strings.Assign (option, buffer^[2]^);
    NumConv.Str2Num (value, 10, option, ok);
    IF NOT ok THEN                (* CODE needs a decimal number *)
      InOut.WriteString ('Illegal status code. Aborting.');
      InOut.WriteLn;
      HALT
    END
  ELSE
    InOut.WriteString ('Illegal option. Aborting.');
    InOut.WriteLn;
    HALT
  END;
  REPEAT
    (* One log entry: host, two skipped fields, date, request,   *)
    (* status, skipped byte count, referrer and browser ID.      *)
    ReadString (IP);   SkipItem;   SkipItem;
    ReadName (Date);
    ReadName (URL);    InOut.ReadCard (Code);   SkipItem;
    ReadName (Dum);    ReadName (Dum);
    IF Condition () THEN
      InOut.WriteString (Date);   InOut.WriteCard (Code, 12);
      InOut.WriteLn;   InOut.Write (11C);   InOut.WriteString (URL);   (* 11C = TAB *)
      InOut.WriteLn;   InOut.Write (11C);   InOut.WriteString (IP);
      InOut.WriteLn;
      InOut.WriteLn
    END
  UNTIL Exhausted;
  InOut.WriteBf
END IPfind.
One example of the usage:
jan@beryllium:~/modula/div$ cat access_log_20071117 | IPfind CODE 404
17/Nov/2007:04:15:24 +0100         404
        GET /etc/telef.html HTTP/1.0
        speedyspider.entireweb.co

jan@beryllium:~/modula/div$
At the moment IPfind recognizes three search options, which must be typed in upper case:
| option | action |
|---|---|
| CODE | Look for one specific status code, as a decimal number. |
| URL | Look for a text pattern to be present in the URL field of Webalizer. It doesn't matter where in the URL field the text pattern occurs. |
| IP | Look for a text pattern to be present in the IP field of Webalizer. It doesn't matter where in the IP field the text pattern occurs. |
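The three options boil down to one small decision, which can be sketched in Python as follows. This is an illustration of the matching rules in the table, not a translation of the Modula-2 source. The function `matches` and its parameters are invented for this sketch.

```python
def matches(option, pattern, host, request, status):
    """Apply one of the three search options to the fields of a log entry."""
    if option == "IP":
        return pattern in host            # pattern may occur anywhere in the IP field
    if option == "URL":
        return pattern in request         # pattern may occur anywhere in the URL field
    if option == "CODE":
        return status == int(pattern)     # exact match on the decimal status code
    raise ValueError("Illegal option: " + option)


# Fields taken from the short log entry shown earlier on this page.
host = "livebot-65-55-209-220.search.live.com"
request = "GET /robots.txt HTTP/1.0"

print(matches("CODE", "404", host, request, 404))    # prints: True
print(matches("URL", "robots", host, request, 404))  # prints: True
print(matches("IP", "chello", host, request, 404))   # prints: False
```

Just like IPfind, the sketch is case sensitive: `matches("code", ...)` is rejected as an illegal option.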
IPfind expects the data to be delivered via a pipe or through STDIN. Output is sent to STDOUT or a pipe, as you wish. In the download section you can download the executable, the source and an example access log file.
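Since IPfind reads from a pipe, the daily files do not have to be unpacked to disk first. Below is a minimal Python sketch that reads the gzipped daily logs straight from the yearly tar file and writes their lines to stdout. It assumes the tar file contains plain '.gz' members as described above. The function `iter_log_lines` and the script name in the comment are hypothetical.

```python
import gzip
import sys
import tarfile


def iter_log_lines(tar_path):
    """Yield every log line (as bytes) from the gzipped files inside the tar."""
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Only regular files with a .gz name are daily logs.
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            raw = archive.extractfile(member)
            with gzip.open(raw, "rb") as daily_log:
                for line in daily_log:
                    yield line


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python3 unpack_logs.py access_log_2007.tar | IPfind CODE 404
    for line in iter_log_lines(sys.argv[1]):
        sys.stdout.buffer.write(line)
```

The lines are passed on as raw bytes, so IPfind sees exactly what is in the log files.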
Page created 28 September 2007,