IPfind log file scanner
If you run your own website, you want to know what's happening. With a small site it's easy to keep a grip on operations, but when the HTML page count exceeds 100, things can get a bit hairy. And if you perform maintenance (with the occasional structural overhaul), an error is easily made. All of a sudden the 404 count goes up, 'for no reason in particular' :o).
It helps if you do your own webhosting on a home-based server, or if you have leased webspace from a professional webhoster, like I do. Either way, you need to periodically check the site statistics. One of the points of interest is the status code overview, like the one above. Most important is the 404 count: 404 is the error code for 'Page not found'. But the number of '500: Internal server error' responses should also be kept as low as possible.
Webalizer access logs
There are many tools for the active webpage maintainer. My webhost uses the Webalizer traffic analyzer. It keeps track of all pages served by the Apache webserver and generates nice overviews and graphs, plus a log file per day. At the end of the day, the log file for that day is gzipped and appended to the yearly tar file.
The yearly file for this year is called 'access_log_2007.tar' and it contains daily log files with names like 'access_log_20070927.gz'. In general: 'access_log_YYYYmmDD.gz' in which YYYY is the year, mm is the month and DD is the day. For processing it is convenient to unzip the files:
```shell
jan@beryllium:~/tmp/stats$ gzip -d access_log_200*
```

You end up with quite a lot of log files, but all of them are now easy to process with less, grep or any other text utility. Still, the entries are mighty long: a single one spans several rows on a 120-column display. One example:
```
g193139.upc-g.chello.nl - - [17/May/2007:04:43:55 +0200] "GET /fotos/kermisP.jpg HTTP/1.1" 200 59055 "http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=4041408&MyToken=3e56f124- 8e05-4f50-889a-9f3d7e122a42" "Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:18.104.22.168) Gecko/20070312 Firefox/22.214.171.124 Creative ZENcast v1.02.08"
```

In fact, this is ONE line of text. For the explanation I will use a shorter log file entry:
```
livebot-65-55-209-220.search.live.com - - [17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" 404 280 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
```

It contains the following nine data fields. The two '-' fields after the host name are unused here; the other seven are:
| Field | Example | Meaning |
|-------|---------|---------|
| IP | livebot-65-55-209-220.search.live.com | The IP address, or the host name it resolves to, of the client that made the request. |
| Date | [17/May/2007:01:53:42 +0200] | The date and time at which the data was requested. |
| URL | "GET /robots.txt HTTP/1.0" | The command received by the webserver from the client. In this case the client was a search-engine spider asking for robots.txt, the file that holds directions for spiders. |
| Status | 404 | The status information, also known as the error code. In this case it really is an error: 404 = Page not Found. |
| Bytes | 280 | The number of bytes sent; here, the length of the 404.html file that was returned to the client. |
| Referrer | "-" | The previous webpage, containing the link that caused this GET command. In this case the spider performed a direct request. |
| Browser ID | "msnbot/1.0 (+http://search.msn.com/msnbot.htm)" | The ID string sent by the browser or spider requesting the page. These strings can be extremely long and contain lots of gobbledygook. |
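As a cross-check of this layout, such a line can be split into named fields with a regular expression. A minimal Python sketch (the group names are mine; the field order follows the table above):

```python
import re

# Split one access-log line into named fields.
# Field order follows the table: host, two unused '-' fields,
# date, request, status, bytes, referrer, browser ID.
LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"$'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None

line = ('livebot-65-55-209-220.search.live.com - - '
        '[17/May/2007:01:53:42 +0200] "GET /robots.txt HTTP/1.0" '
        '404 280 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"')
fields = parse_line(line)
```

For the msnbot line above this yields a status of '404', a byte count of '280' and a referrer of '-'.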
Processing the access log files
Of course you can go through the files one by one with 'less' and search for a specific error code. But this is very time consuming and error prone. Suppose you are looking for the pattern 404. You then get all the 404 status codes, but also all the entries that happened to be 404 bytes long, or whose time stamp happened to contain 404.
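To make the false-positive risk concrete, here is a small Python illustration with a made-up log line (host and URL are invented) whose byte count is 404 while its status is 200:

```python
# Made-up entry: the status is 200, but the response happens to be 404 bytes.
line = ('example.host.nl - - [17/May/2007:12:30:00 +0200] '
        '"GET /index.html HTTP/1.1" 200 404 "-" "Mozilla/5.0"')

naive_hit = '404' in line                          # plain substring search
status = line.split('" ', 1)[1].split(' ', 1)[0]   # the actual status field

# The substring search flags this line; a field-based check does not.
```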
I could have made a utility in Tcl/Tk, but that would cause problems, since the log files use Tcl-critical tokens (square brackets and double quotes) as delimiters. So I switched back to my beloved Mocka Modula-2 compiler and made IPfind, with which you can search for specific occurrences in the access logs. It is meant to be used in a pipeline: it expects all data via stdin and writes all results to stdout. Below is the source code:
```modula2
MODULE IPfind;

(* Scanner for finding error codes from Webalizer log files *)
(* CopyLeft Jan Verhoeven, Tilburg (NL) Nov 2007 *)

IMPORT Arguments, InOut, Strings, NumConv;

TYPE
   Identifier = ARRAY [0..255] OF CHAR;

VAR
   option, IP, Date, URL, Ref, Dum : Identifier;
   Code, value                     : CARDINAL;
   Exhausted, ok, ip, url, code    : BOOLEAN;
   buffer                          : Arguments.ArgTable;
   count                           : SHORTCARD;

PROCEDURE SkipItem;
VAR ch : CHAR;
BEGIN
   REPEAT
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF Exhausted THEN RETURN END
   UNTIL ch > ' ';
   REPEAT
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF Exhausted THEN RETURN END
   UNTIL ch <= ' '
END SkipItem;

PROCEDURE ReadName (VAR str : ARRAY OF CHAR);
VAR
   n         : CARDINAL;
   ch, endch : CHAR;
BEGIN
   REPEAT                              (* Eliminate whitespace *)
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF Exhausted THEN RETURN END
   UNTIL ch > ' ';
   IF ch = '[' THEN endch := ']' ELSE endch := ch END;
   n := 0;
   REPEAT
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF Exhausted THEN RETURN END;
      IF n <= HIGH (str) THEN str [n] := ch ELSE ch := endch END;
      INC (n)
   UNTIL ch = endch;
   IF n <= HIGH (str) THEN str [n-1] := 0C END  (* overwrite the closing delimiter *)
END ReadName;

PROCEDURE ReadString (VAR str : ARRAY OF CHAR);
VAR
   n         : CARDINAL;
   ch, endch : CHAR;
BEGIN
   REPEAT                              (* Eliminate whitespace *)
      InOut.Read (ch);
      Exhausted := InOut.EOF ();
      IF Exhausted THEN RETURN END
   UNTIL ch > ' ';
   n := 0;
   REPEAT
      IF n <= HIGH (str) THEN str [n] := ch END;
      InOut.Read (ch);
      INC (n)
   UNTIL (ch <= ' ') OR InOut.EOF ();
   IF n <= HIGH (str) THEN str [n] := 0C END    (* terminate after the last character *)
END ReadString;

PROCEDURE Condition () : BOOLEAN;
BEGIN
   IF    ip   AND (Strings.pos (option, IP)  <= HIGH (IP))  THEN RETURN TRUE
   ELSIF url  AND (Strings.pos (option, URL) <= HIGH (URL)) THEN RETURN TRUE
   ELSIF code AND (value = Code)                            THEN RETURN TRUE
   END;
   RETURN FALSE
END Condition;

BEGIN
   ip := FALSE;  url := FALSE;  code := FALSE;
   Arguments.GetArgs (count, buffer);
   IF count < 3 THEN
      InOut.WriteString ("Usage : IPFIND IP xxx | URL xxx | CODE xxx");
      InOut.WriteLn;
      HALT
   END;
   Strings.Assign (option, buffer^^);
   IF Strings.StrEq (option, 'IP') THEN
      ip := TRUE;
      Strings.Assign (option, buffer^^)
   ELSIF Strings.StrEq (option, 'URL') THEN
      url := TRUE;
      Strings.Assign (option, buffer^^)
   ELSIF Strings.StrEq (option, 'CODE') THEN
      code := TRUE;
      Strings.Assign (option, buffer^^);
      NumConv.Str2Num (value, 10, option, ok)
   ELSE
      InOut.WriteString ('Illegal option. Aborting.');
      InOut.WriteLn;
      HALT
   END;
   REPEAT
      ReadString (IP);
      SkipItem;                        (* skip the two '-' fields *)
      SkipItem;
      ReadName (Date);
      ReadName (URL);
      InOut.ReadCard (Code);
      SkipItem;                        (* skip the byte count *)
      ReadName (Dum);                  (* referrer *)
      ReadName (Dum);                  (* browser ID *)
      IF Condition () THEN
         InOut.WriteString (Date);
         InOut.WriteCard (Code, 12);
         InOut.WriteLn;
         InOut.Write (11C);            (* tab *)
         InOut.WriteString (URL);
         InOut.WriteLn;
         InOut.Write (11C);
         InOut.WriteString (IP);
         InOut.WriteLn;
         InOut.WriteLn
      END
   UNTIL Exhausted;
   InOut.WriteBf
END IPfind.
```

One example of the usage:
```shell
jan@beryllium:~/modula/div$ cat access_log_20071117 | IPfind CODE 404
17/Nov/2007:04:15:24 +0100         404
	GET /etc/telef.html HTTP/1.0
	speedyspider.entireweb.co
jan@beryllium:~/modula/div$
```

At the moment IPfind recognizes three search options, to be supplied in upper case:
| Option | Meaning |
|--------|---------|
| CODE | Look for one specific status code, given as a decimal number. |
| URL | Look for a text pattern in the URL field of Webalizer; it may occur anywhere in that field. |
| IP | Look for a text pattern in the IP field of Webalizer; it may occur anywhere in that field. |
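For readers without a Modula-2 compiler, the same filtering logic can be sketched in a few lines of Python. This is not the author's program, just an approximation that assumes the field layout described earlier:

```python
import sys

def split_fields(line):
    """Pull host, date, request and status out of one log line."""
    try:
        host = line.split(' ', 1)[0]
        date = line.split('[', 1)[1].split(']', 1)[0]
        request = line.split('"', 2)[1]
        status = line.split('" ', 1)[1].split(' ', 1)[0]
    except IndexError:
        return None
    return host, date, request, status

def ipfind(lines, option, pattern):
    """Collect IPfind-style hits: date and code, then request, then host."""
    hits = []
    for line in lines:
        fields = split_fields(line)
        if fields is None:
            continue
        host, date, request, status = fields
        if ((option == 'IP' and pattern in host) or
                (option == 'URL' and pattern in request) or
                (option == 'CODE' and status == pattern)):
            hits.append(f'{date} {status}\n\t{request}\n\t{host}')
    return hits

if __name__ == '__main__' and len(sys.argv) == 3:
    # e.g.:  cat access_log_20071117 | python ipfind.py CODE 404
    for hit in ipfind(sys.stdin, sys.argv[1], sys.argv[2]):
        print(hit, end='\n\n')
```

Like the original, it reads log lines from stdin and prints date, status, request and host for every match.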
IPfind expects its data via a pipe on stdin and sends its output to stdout, so it can sit anywhere in a pipeline. In the download section you can download the executable, the source and an example access log file.
Page created 28 September 2007,