The CGI 404 handler

I've had my share of 404 handlers on my website. In hindsight I must say: the most spartan 404 handler is the best. If you make a 404 page in HTML you end up being silly. Joking about a 404 isn't very nice. Or funny. A human made an error and gets punished for it immediately. By a piece of silicon. Instructed by a smartass on clogs and a tulip between his cheeks.

For a while I tried to please my 404 clients by serving them my sitemap file. But that's like forcing people to read a phonebook when they're just asking for your business card. So the sitemap file didn't last long either.

This topic is about making a smart 404 handler. It will be smart since it will serve a dedicated page and it will inform you about your error. It will be a mix of Modula-2 (for interrogating the webserver and adding 'speed' to the subject) and JavaScript (for the user interface).

Preparations.

It is my intention to have this 404 handler installed on the server of my webhost. If I make a sloppy or clumsy program, I may well cripple the webserver on which my webhost hosts my site. But I'm not the only user on that server. So in the worst case, I will bring down the full server, thereby making a lot of websites inaccessible. Please visit the section 'CGI loop' to get an indication of what could happen. So we must be careful from the very first moment.
If this isn't your first visit to Fruttenboel, you will know that I run Linux wherever possible. So I run a webserver as standard on all machines in this house. On Beryllium, I run apache-perl. You need to customize one line in /etc/apache-perl/httpd.conf. This is how it should look:

# This controls which options the .htaccess files in directories can
# override. Can also be "All", or any combination of "Options", "FileInfo",
# "AuthConfig", and "Limit"
#
    AllowOverride All
   
From now on, you will be able to use '.htaccess' files for instructing the webserver how to handle ErrorDocuments. The webserver looks for a '.htaccess' file in the requested directory and in every directory above it. The directives it finds are merged, and those in the deepest directory win.

In my case, I have a custom '.htaccess' file in /fruttenboel/cgi and it reads:

ErrorDocument 400 /errors/400.html
ErrorDocument 401 /errors/401.html
ErrorDocument 402 /errors/402.html
ErrorDocument 403 /errors/403.html
ErrorDocument 404 /cgi-bin/testCGI
   
All file references are relative to the document root of Apache (which in my case is '/var/www'). The first four error handlers are more or less silly HTML documents. The 404 handler at this moment is a file we know from history: testCGI (it all started with this one).
This '.htaccess' file shows that you can put a lot of different things in your '.htaccess' file. In this case, 400 through 403 are handled by an HTML file. The 404 is handled by a CGI executable. But if you had written
ErrorDocument 404 'You made a typo, Asssmart!'
   
that specific text would end up in a blank webpage! Apache is a very powerful webserver and you just need to read some books about it. Websearches don't reveal everything. This webpage not taken into account of course.

At this moment, I have my test environment set up. I only need a controlled way to trigger a 404. To do so, I added the following link to the navigator frame (on the right):

    o <a href="404handler.html"		Target="main">404 handler</a>	<br>
   ....
    o <a href="farm3.html"		Target="main">Create a 404</a>	<br>
   
Make the modifications and give it a try. You will be pleasantly surprised....

404 handler: first tests

Our first test is with the 'testCGI' executable in the right place and with the right privileges. It traps all CGI environment variables and shows them on screen in a formatted and controlled way. The most striking part is the following:

Apart from the well known CGI environment variables (which we discovered in the previous experiments and projects in this section), there are now four or five more variables:

Variable name              Purpose or payload

REDIRECT_ERROR_NOTES       File does not exist: /var/www/net/fruttenboel/cgi/farm3.html
                           An error message followed by the full, absolute server path of the requested file.
REDIRECT_REQUEST_METHOD    GET
REDIRECT_STATUS            404
                           This is the status code. In this case, a 404. But this same handler could be extended to handle an arbitrary number of other status codes. Just inspect this variable.
REDIRECT_URL               /net/fruttenboel/cgi/farm3.html
                           This is the relative path (from /var/www down) to the file that triggered this status code handler.
REQUEST_URI                /net/fruttenboel/cgi/farm3.html
                           In this case, URI is URL.
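
The remark about inspecting REDIRECT_STATUS can be sketched in a few lines. This is only an illustration in JavaScript, not part of the Modula-2 handler. The function name 'describeStatus' and the message texts are made up for this example, and the environment is passed in as a plain object so the sketch does not depend on a real webserver.

```javascript
// Hypothetical helper: pick a message based on REDIRECT_STATUS.
// 'env' stands in for the CGI environment; names and texts are assumptions.
function describeStatus(env) {
  const status = env.REDIRECT_STATUS;           // e.g. "404"
  const url = env.REDIRECT_URL || "(unknown)";  // path that triggered the handler
  switch (status) {
    case "403":
      return "403: access to " + url + " is forbidden";
    case "404":
      return "404: " + url + " was not found";
    case undefined:
      return "not called as an error handler";
    default:
      return status + ": unexpected status for " + url;
  }
}

console.log(describeStatus({
  REDIRECT_STATUS: "404",
  REDIRECT_URL: "/net/fruttenboel/cgi/farm3.html"
}));
```

One handler can serve several ErrorDocument lines this way, as long as the webserver really passes the REDIRECT_ variables along.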

The first thing that comes to mind is: damned! I need to extend the CGI module! It lacks some CGI types which we're going to need for this project.

Changing the CGI module

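We need the following CGItypes which are not in the present CGI module: RedirectStatus (for REDIRECT_STATUS), RedirectUrl (for REDIRECT_URL) and RedirectNotes (for REDIRECT_ERROR_NOTES).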

So I am going to incorporate them into the CGI module sources. At this very moment I am tempted to switch to sloppy programmer mode. Why take on the burden of CGItypes, when it would be a lot easier to just pass the CGI environment name along to get the payload back?
Well, that would be against the nature of Modula-2. Modula-2 is a classy and object based programming language but it doesn't use these terms. It uses modules and types instead. And it will compile them so they will be fixed in time.

Here's the new CGI.DEF file:

DEFINITION MODULE cgi;

FROM  Strings     IMPORT  String;

TYPE   ServerDataType = (Text, Html, Gif, Jpeg, PS, Mpeg);
       CGItype        = (ContentLength, GatewayInterface, HttpHost,
       		         HttpReferer,	QueryString,	  RemoteAddress,
		  	 RemoteHost,	RemotePort,	  RequestMethod,
		  	 ScriptName,	ServerAddress,	  ServerName,
		  	 ServerPort,	ServerProtocol,	  ServerSignature,
		  	 DocumentRoot,	RedirectStatus,	  RedirectUrl,
			 RedirectNotes,	none);


PROCEDURE InformServer (dataType : ServerDataType);

PROCEDURE CheckType (str : String) : CGItype;

PROCEDURE GetEnvVar (kind : CGItype; VAR  res : String) : BOOLEAN;

END cgi.
   
The implementation module of CGI does not change much. At the end of 'CheckType' I added three more checks:
IMPLEMENTATION MODULE cgi;

...

PROCEDURE CheckType (str : String) : CGItype;

BEGIN
   CAPS  (str);					(*  Convert entire string to capitals.	 *)
   IF  pos ('CONTENT_LENGTH', str) = 0  THEN  
      RETURN  ContentLength
   ...
   ELSIF  pos ('REDIRECT_STATUS', str) = 0  THEN  
      RETURN  RedirectStatus
   ELSIF  pos ('REDIRECT_URL', str) = 0  THEN  
      RETURN  RedirectUrl
   ELSIF  pos ('REDIRECT_ERROR_NOTES', str) = 0  THEN  
      RETURN  RedirectNotes
   ELSE
      RETURN none
   END
END CheckType;
   
That's all! The environment value extractor gets a CGItype as a parameter and returns a string. So we need not touch that procedure. We have only added three more classes to the CGI object! Yikes. Where's the dettol? I need to rinse my mouth! Classes and objects....
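
For readers who do not speak Modula-2, the three new checks at the end of 'CheckType' can be sketched in JavaScript. This is a rough equivalent only. It assumes that 'pos' returns 0 when the name sits at the very start of the string, the lower-case name 'checkType' and the returned strings are illustrative, and the checks that are elided in the listing above are not reproduced.

```javascript
// Sketch only: mirrors the three new checks at the end of CheckType.
// The earlier checks are elided in the text and are not reproduced here.
const REDIRECT_TYPES = [
  ["REDIRECT_STATUS", "RedirectStatus"],
  ["REDIRECT_URL", "RedirectUrl"],
  ["REDIRECT_ERROR_NOTES", "RedirectNotes"]
];

function checkType(str) {
  const upper = str.toUpperCase();        // same role as CAPS (str)
  for (const [name, kind] of REDIRECT_TYPES) {
    if (upper.indexOf(name) === 0) {      // same role as pos (name, str) = 0
      return kind;
    }
  }
  return "none";
}

console.log(checkType("redirect_url=/net/fruttenboel/cgi/farm3.html"));
```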

By the way: compile the enhanced modules as usual:

jan@beryllium:~/modula/cgi$ mocka
Mocka 0608m
>> d cgi
>> i cgi
>> c cgi
.. Compiling Definition of cgi
.. Compiling Implementation of cgi I/0004 II/0004
>>
   
Done! I just love Modula-2 and the Mocka compiler (as adapted by Dr Maurer).

404 handler : a first attempt

On a normal webpage, you would be presented with the final solution. But not here. Fruttenboel is not about solutions to problems. Fruttenboel is about the ROAD that was taken to come to a solution. Including all the detours and dead ends. So I'm going to be fair right now:

The most important clue was: Apache serves extra CGI variables when it reports a status! At first I wasn't sure how to access these variables. But then I got my bright moment of the week (on a Monday, not very reassuring for the rest of the week): install testCGI as the error handler!

So that's in a nutshell how we got here. In the meantime I already made some kind of 404 handler. It works nicely, but it is based on analysis of the wrong CGI environment variable: the referrer! Which isn't present in the first place, when the 404 generating phrase was typed in the URL bar!
Still, it's a nice program and here it is:

MODULE S404;

(*  Attempt to make a smart 404 handler				January 2008	*)

IMPORT	cgi, InOut, Strings;

TYPE	Target	    	    		= (fam, frutt);

VAR	path, Title, content		: Strings.String;
	Frame				: ARRAY [0..2] OF Strings.String;
	target				: Target;


PROCEDURE CreateHTML;

BEGIN
   InOut.WriteString ("<html><head>");
   InOut.WriteString ("</head>");				InOut.WriteLn;
   InOut.WriteString ("<body><center><h1>");
   InOut.WriteString ("404<p>");
   InOut.WriteString ("You will be redirected to the most probable main section.<p>");
   InOut.WriteString ("You wanted to access the following page:<p>");
   InOut.WriteString (content);
   InOut.WriteString ("</h1></center></body>");
   InOut.WriteString ('<script language="JavaScript">');	InOut.WriteLn;
   InOut.WriteString ("<!--");					InOut.WriteLn;
   InOut.WriteString ("alert ('Page not found!');");
   InOut.WriteString ("parent.location.href = ");
   IF  target = frutt  THEN
      InOut.WriteString ("'/fruttenboel/index.html'")
   ELSE
      InOut.WriteString ("'/net/verhoeven272/index.html'")
   END;
   InOut.WriteLn;
   InOut.WriteString ("// -->");				InOut.WriteLn;
   InOut.WriteString ("</script>");
   InOut.WriteString ("</html>");  				InOut.WriteLn;
   InOut.WriteBf
END CreateHTML;


BEGIN
   cgi.InformServer (cgi.Html);
   IF  cgi.GetEnvVar (cgi.HttpReferer, content)   = FALSE  THEN
      content := 'fruttenboel'
   END;
   IF  Strings.pos ('fruttenboel', content) > HIGH (content)  THEN
      target := fam
   ELSE
      target := frutt
   END;
   CreateHTML
END  S404.
   
Just compile the program as usual and copy the executable (as root) to the correct place in the directory tree. In my case
beryllium:/home/jan/# cp /home/jan/modula/cgi/S404 /usr/lib/cgi-bin/
   
Change the '.htaccess' file to
ErrorDocument 400 /errors/400.html
ErrorDocument 401 /errors/401.html
ErrorDocument 402 /errors/402.html
ErrorDocument 403 /errors/403.html
ErrorDocument 404 /cgi-bin/S404
   
This is enough to test the new 404 handler. It produces the following screen:

The page will remain on screen until you press the 'OK' button. Then the second section of the dynamically generated JavaScript starts redirecting:

   InOut.WriteString ("parent.location.href = ");
   IF  target = frutt  THEN
      InOut.WriteString ("'/fruttenboel/index.html'")
   ELSE
      InOut.WriteString ("'/net/verhoeven272/index.html'")
   END;
   
So, depending on the origin of the error, the erring visitor will be redirected to the family site or to the technical site. Not very smart yet, but this is just a start.

The biggest problem at the moment is the way I try to determine the intended target based on a CGI environment variable which isn't always present.
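
The decision S404 takes, including the case where the referrer is missing, boils down to the following. A sketch in JavaScript, with 'chooseTarget' as a made-up name; the two paths are the ones used in S404.

```javascript
// Sketch of the S404 decision. 'referer' is the HTTP_REFERER payload,
// or undefined when the visitor typed the address into the URL bar.
function chooseTarget(referer) {
  if (referer === undefined || referer === "") {
    referer = "fruttenboel";                 // same default as in S404
  }
  if (referer.indexOf("fruttenboel") < 0) {
    return "/net/verhoeven272/index.html";   // family site
  }
  return "/fruttenboel/index.html";          // technical site
}

console.log(chooseTarget(undefined));
console.log(chooseTarget("http://localhost/net/verhoeven272/index.html"));
```

The default makes every visitor without a referrer end up on the technical site, whatever they were looking for. That is exactly the weak spot.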

404 handler : improved version

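Below is an improved, not to say: superior, version of the 404 handler. Main differences: it reads REDIRECT_URL and REDIRECT_STATUS instead of the referrer, it shows the status code on the page, and it stops with an error message when one of these variables is missing.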

MODULE S4041;

(*  Attempt to make a smart 404 handler				15 January 2008	*)
(*  It works, but it is based on the wrong CGI environment variables		*)
(*  The CGI module has been changed. This version is based on the modifications	*)
(*  This version uses the REDIRECT related variables		15 January 2008	*)

IMPORT	cgi, InOut, Strings;

TYPE	Target	    	    		= (fam, frutt);

VAR	path, Title, content, status	: Strings.String;
	Frame				: ARRAY [0..2] OF Strings.String;
	target				: Target;


PROCEDURE CreateHTML;

BEGIN
   InOut.WriteString ("<html><head>");
   InOut.WriteString ("</head>");				InOut.WriteLn;
   InOut.WriteString ("<body><center><h2>");
   InOut.WriteString (status);
   InOut.WriteString ("<p>You will be redirected to the most probable main section.<p>");
   InOut.WriteString ("You wanted to access the following page:<p>");
   InOut.WriteString (content);
   InOut.WriteString ("</h2></center></body>");
   InOut.WriteString ('<script language="JavaScript">');	InOut.WriteLn;
   InOut.WriteString ("<!--");					InOut.WriteLn;
   InOut.WriteString ("alert ('Page not found : redirecting');");
   InOut.WriteString ("parent.location.href = ");
   IF  target = frutt  THEN
      InOut.WriteString ("'/fruttenboel/index.html'")
   ELSE
      InOut.WriteString ("'/net/verhoeven272/index.html'")
   END;
   InOut.WriteLn;
   InOut.WriteString ("// -->");				InOut.WriteLn;
   InOut.WriteString ("</script>");
   InOut.WriteString ("</html>");  				InOut.WriteLn;
   InOut.WriteBf
END CreateHTML;


PROCEDURE ShowContent (str  : ARRAY OF CHAR);

BEGIN
   InOut.WriteString ("<h2>");
   InOut.WriteString (str);
   InOut.WriteString (" = ");
   InOut.WriteString (content);
   InOut.WriteString ("</h2>");
   InOut.WriteLn
END ShowContent;


BEGIN
   cgi.InformServer (cgi.Html);
   IF  cgi.GetEnvVar (cgi.RedirectUrl, content)   = FALSE  THEN
      content := "RedirectUrl not found";
      ShowContent ("ERROR");
      HALT
   END;
   IF  cgi.GetEnvVar (cgi.RedirectStatus, status) = FALSE  THEN
      content := "RedirectStatus not found";
      ShowContent ("ERROR");
      HALT
   END;
   IF  Strings.pos ('fruttenboel', content) > HIGH (content)  THEN
      target := fam
   ELSE
      target := frutt
   END;
   CreateHTML
END  S4041.
   
Compile it. Let 'root' copy it to the cgi-bin. Then create the error. Of course, the new content of '.htaccess' now is:
ErrorDocument 400 /errors/400.html
ErrorDocument 401 /errors/401.html
ErrorDocument 402 /errors/402.html
ErrorDocument 403 /errors/403.html
ErrorDocument 404 /cgi-bin/S4041
   

404 handler : extensions

The current version works. Locally. I want to do some more testing before I upload it to the cgi-bin of my webhost but I can't wait to do so. This is the end of lost visitors due to 404's. From now (then) on visitors are glued to my site. With Loctite.

Things to do:

S4042 and beyond

S4041 was the local version. S4042 was the version intended to be placed in my rented webspace. S4042 was uploaded and the '.htaccess' file was adapted. Still, it did not run. So I sent a mail to my webhost's support team. Their answer: compiled executables are not acceptable error handlers for our servers.

Scripts, written in Perl and such, were acceptable. But I know no Perl or Ruby. So I had the second bright moment of this week and I rewrote the 404 handler in JavaScript only. This is what it looks like. It is called 'H404.html'.

<html>
 <head>
  <title>Smart 404 handler in JavaScript</title>
 </head>
 <body>
  <center>
   <h2>
    Error 404
    <p>
    You will be redirected to the most probable main section.
    <p>
    You wanted to access the following page:
    <p>
    <script language="JavaScript">
     <!--
      document.write (document.URL);
      document.write ("<p>");
      document.write ("You will be redirected to ");
      
      loca = new String;			target = new String;
      loca = document.URL;			pos = loca.indexOf ("fruttenboel");

      if  ( pos < 0)
        target = "/net/verhoeven272/index.html"
      else
        target = "/fruttenboel/index.html"
      ;
      document.write (target);
      document.write ("<p>");
      alert ('Resistance is futile!');
      parent.location.href = target;
     -->
    </script>
   </h2>
  </center>
 </body>
</html>
   
This is also some kind of script. I uploaded it and changed '.htaccess' once more. This is the result:

Cookies my foot. This would be the third redirector in a row. Apparently this has been disabled by my webhost. So I will have to learn to live with it. Still, at home, on the personal Apache server, it runs very well.

In the meantime I found out that .htaccess files at De Heeg need to be of the form:

ErrorDocument 401 /errors/401.html
ErrorDocument 402 https://www.verhoeven272.nl/errors/402.html
ErrorDocument 403 https://www.verhoeven272.nl/errors/403.html
ErrorDocument 404 https://www.verhoeven272.nl/errors/H404.html
   
So, all the error documents need a fully qualified URL, except the handler for the 401.

Wait!

At this moment I see what might have caused the error:
      if  ( pos < 0)
        target = "/net/verhoeven272/index.html"
      else
        target = "/fruttenboel/index.html"
      ;
   
This is a relative path, and it may well be that fully qualified URLs are needed there as well. So I changed these lines to
      if  (pos < 0)
        target = "https://www.verhoeven272.nl/index.html";
      else
        target = "https://fruttenboel.verhoeven272.nl/index.html";
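
How a browser turns such a path into a full address can be checked with the standard URL class. A small sketch; the base address below is only an assumed example of the page on which the script might run, and 'resolveTarget' is a made-up name.

```javascript
// Resolve a redirect target against the address of the page running the script.
function resolveTarget(target, base) {
  return new URL(target, base).href;
}

// Assumed example address of the error page on the hosted site.
const base = "https://www.verhoeven272.nl/errors/H404.html";

console.log(resolveTarget("/fruttenboel/index.html", base));
console.log(resolveTarget("https://fruttenboel.verhoeven272.nl/index.html", base));
```

A path starting with '/' keeps the host name of the current page, so it only works when the directory layout on that host matches. A fully qualified URL is taken as it is.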
   
I changed the .htaccess file, went to this CGI section and clicked 'Create a 404'. I didn't get the redirection error message anymore. Yet, this is what happened:

This however is very reassuring. It means that the JavaScript method works. The only problem is: this is a redirected 404 file. So the H404 file does not get the data that caused the 404 to happen. In a nutshell:

Still, the relative path method is also what I applied in S4042. So I changed the related section in S4042.mod to be:
   IF  target = frutt  THEN
      InOut.WriteString ("'https://fruttenboel.verhoeven272.nl/index.html'")
   ELSE
      InOut.WriteString ("'https://www.verhoeven272.nl/index.html'")
   END;
   
Recompiled, uploaded to the cgi-bin directory and changed the .htaccess file. Crossed my fingers and went to the CGI section and clicked once more on 'Create a 404'. There was no chain reaction. The atmosphere did not ignite. Instead one silly line appeared on screen:

ERROR = RedirectUrl not found

This shows that the technical guy at my webhost isn't very familiar with compiled executables and their interactions with Apache. It ran. It ran error-free! The small message in the top left corner is the error message from the source of S4042:
MODULE S4042;

(*  Attempt to make a smart 404 handler				15 January 2008	*)
(*  It works, but it is based on the wrong CGI environment variables		*)
(*  The CGI module has been changed. This version is based on the modifications	*)
(*  This version uses the REDIRECT related variables		15 January 2008	*)
(*  S4042 is the version that runs on my hosted website		15 January 2008 *)

IMPORT	cgi, InOut, Strings;

TYPE	Target	    	    		= (fam, frutt);

VAR	content, status			: Strings.String;
	target				: Target;


PROCEDURE CreateHTML;

BEGIN
   InOut.WriteString ("<html><head>");
   InOut.WriteString ("</head>");				InOut.WriteLn;
   InOut.WriteString ("<body><center><h2>");
   InOut.WriteString (status);
   InOut.WriteString ("<p>You will be redirected to the most probable main section.<p>");
   InOut.WriteString ("You wanted to access the following page:<p>");
   InOut.WriteString (content);
   InOut.WriteString ("</h2></center>");
   InOut.WriteString ('<script language="JavaScript">');	InOut.WriteLn;
   InOut.WriteString ("<!--");					InOut.WriteLn;
   InOut.WriteString ("alert ('Page not found : redirecting');");
   InOut.WriteString ("parent.location.href = ");
   IF  target = frutt  THEN
      InOut.WriteString ("'https://fruttenboel.verhoeven272.nl/index.html'")
   ELSE
      InOut.WriteString ("'https://www.verhoeven272.nl/index.html'")
   END;
   InOut.WriteLn;
   InOut.WriteString ("// -->");				InOut.WriteLn;
   InOut.WriteString ("</script>");
   InOut.WriteString ("</body></html>");  			InOut.WriteLn;
   InOut.WriteBf
END CreateHTML;


PROCEDURE ShowContent (str  : ARRAY OF CHAR);

BEGIN
   InOut.WriteString ("<h2>");
   InOut.WriteString (str);
   InOut.WriteString (" = ");
   InOut.WriteString (content);
   InOut.WriteString ("</h2>");
   InOut.WriteLn
END ShowContent;


BEGIN
   cgi.InformServer (cgi.Html);
   IF  cgi.GetEnvVar (cgi.RedirectUrl, content)   = FALSE  THEN
      content := "RedirectUrl not found";
      ShowContent ("ERROR");
      HALT
   END;
   IF  cgi.GetEnvVar (cgi.RedirectStatus, status) = FALSE  THEN
      content := "RedirectStatus not found";
      ShowContent ("ERROR");
      HALT
   END;
   IF  Strings.pos ('fruttenboel', content) > HIGH (content)  THEN
      target := fam
   ELSE
      target := frutt
   END;
   CreateHTML
END  S4042.
   
The required CGI Environment Variable is only present after the first redirection. After the second redirection, the usual CGI variables are present and the cause of the error has been flushed. Down the toilet. Gone forever.
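
The Apache documentation for ErrorDocument describes this behaviour. A target with a method such as https:// in front makes the server send a redirect to the browser, even when the document lives on the same server, and the data about the original error does not travel along. A local path starting with '/' is handled inside the server and keeps the REDIRECT_ variables. Below is a small sketch that sorts '.htaccess' lines accordingly. 'classifyErrorDocument' is a made-up helper, not an Apache tool, and treating everything else as message text is a simplification.

```javascript
// Hypothetical helper: tell how Apache would treat an ErrorDocument line.
function classifyErrorDocument(line) {
  const m = /^\s*ErrorDocument\s+(\d{3})\s+(.+?)\s*$/.exec(line);
  if (!m) {
    return null;                               // not an ErrorDocument line
  }
  const status = m[1];
  const target = m[2];
  let kind;
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(target)) {
    kind = "external redirect";                // full URL: REDIRECT_ data is lost
  } else if (target.startsWith("/")) {
    kind = "internal redirect";                // local path: REDIRECT_ data is kept
  } else {
    kind = "literal text";                     // simplification for this sketch
  }
  return { status: status, kind: kind };
}

console.log(classifyErrorDocument("ErrorDocument 401 /errors/401.html"));
console.log(classifyErrorDocument("ErrorDocument 404 https://www.verhoeven272.nl/cgi-bin/S4042"));
```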

The .htaccess file at THIS moment looked like:

ErrorDocument 401 /errors/401.html
ErrorDocument 402 https://www.verhoeven272.nl/errors/402.html
ErrorDocument 403 https://www.verhoeven272.nl/errors/403.html
ErrorDocument 404 https://www.verhoeven272.nl/cgi-bin/S4042
   
I think it's time to run 'testCGI' as error handler, to see what we've got.

Tests run on the remote webserver

I changed the .htaccess file to:

ErrorDocument 401 /errors/401.html
ErrorDocument 402 https://www.verhoeven272.nl/errors/402.html
ErrorDocument 403 https://www.verhoeven272.nl/errors/403.html
ErrorDocument 404 https://www.verhoeven272.nl/cgi-bin/testCGI
   
and created the 404 in the (by now) usual way. I got the familiar table on screen and one line was reassuring:

The CGI referrer was still pointing to https://fruttenboel.verhoeven272.nl/cgi/cgicontent.html and that's the directory in which the 404 was forced. This gives me one more idea: what will testCGI produce when I force a 404 from the URL bar? I have NO reason not to test it with testCGI. 'testCGI' is reliable, tested and rugged. It never hangs. Why should I not use it? Perl and Python scripts are much more dangerous. So I am going to reinstall the .htaccess file we saw above and see what happens. Hang in there just one more second.
Of course it ran error-free. The output contained a lot of information, but not a single clue as to what the user might have typed into the URL bar. One thing of interest was that the HTTP_REFERER was absent. We can use that. But how? Time for a night's rest....

S4043

So I took the S404 handler and changed it into this:

MODULE S4043;

(*  Attempt to make a smart 404 handler				January 2008	*)
(*  After S4041 and S4042, it became clear that S404 wasn't 
    that bad after all						January 2008	*)

IMPORT	cgi, InOut, Strings;

TYPE	Target		= (fam, frutt);

VAR	content		: Strings.String;
	target		: Target;


PROCEDURE CreateHTML;

BEGIN
   InOut.WriteString ("<html><head></head>");			InOut.WriteLn;
   InOut.WriteString ("<body>S4043<center><h2>");
   InOut.WriteString ("404<p>");
   InOut.WriteString ("You will be redirected to the most probable main section.<p>");
   InOut.WriteString ("You wanted to access the following page:<p>");
   InOut.WriteString (content);
   InOut.WriteString ("</h2></center></body>");
   InOut.WriteString ('<script language="JavaScript">');	InOut.WriteLn;
   InOut.WriteString ("<!--");					InOut.WriteLn;
   InOut.WriteString ("alert ('Resistance is futile!');");
   InOut.WriteString ("parent.location.href = ");
   IF  target = frutt  THEN
      InOut.WriteString ("'https://fruttenboel.verhoeven272.nl/index.html'")
   ELSE
      InOut.WriteString ("'https://www.verhoeven272.nl/index.html'")
   END;
   InOut.WriteLn;
   InOut.WriteString ("// -->");				InOut.WriteLn;
   InOut.WriteString ("</script>");
   InOut.WriteString ("</html>");  				InOut.WriteLn;
   InOut.WriteBf
END CreateHTML;


BEGIN
   content := "URL typed into URL bar";
   cgi.InformServer (cgi.Html);
   IF  cgi.GetEnvVar (cgi.HttpReferer, content) = FALSE  THEN
      target := frutt
   ELSE
      IF  Strings.pos ('fruttenboel', content) > HIGH (content)  THEN
	 target := fam
      ELSE
	 target := frutt
      END
   END;
   CreateHTML
END S4043.
   
Compiled it, uploaded it to the cgi-bin and changed the .htaccess file. First I tried it on my own system. It ran error-free. Then I forced the 404 on the hosted system. I got an internal server error. So that concludes this topic for the time being. The .htaccess file has been erased again. I prefer a spartan 404 handler over a silly one. I'd like to have one of the handlers that run on my private server, running on my leased webspace. But that is not possible at this time. Perhaps later....

Below is what S4043 looks like when run on my private Apache system. When I click the button I'm redirected to the right URL. But when I upload the same executable to my webhost, things go crazy.

Page created on 15 January 2008 and