Why gzipped HTML?

I have been intrigued by how web browsers work 'under the hood'. As a spin-off, I have been making CGI executables with the Mocka compiler (qv), and then it struck me that just about any browser or webserver was compatible with gzipped formats.
Well, that could have been a meaningless remark, but it could also mean that this kind of software decompresses gzipped files on the fly.
At that moment it came back to me that Netscape does not always simply download files that are gzipped data. On many occasions it doesn't start a download operation, but just unzips the file and presents the contents on-screen (Mr Sulu). Apparently Netscape (which is gzip compatible as well) filters gzipped data through a gzip pipe and then processes what comes out at the other end of the pipe.
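
To make the 'gzip pipe' idea a bit more tangible, here is a minimal Python sketch of decompressing on the fly. It is only an illustration of the principle, not how Netscape does it internally; the function name gunzip_stream, the sample page and the chunk size of 16 bytes are all made up for the example.

```python
import gzip
import zlib

def gunzip_stream(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decoder.decompress(chunk)
        if out:
            yield out
    tail = decoder.flush()
    if tail:
        yield tail

# Made-up page content, compressed up front to play the role of the stored file.
page = b"<html><body><h1>Hello</h1></body></html>"
stored = gzip.compress(page)

# Feed the compressed file in small pieces, as if it arrived over a slow line.
pieces = [stored[i:i + 16] for i in range(0, len(stored), 16)]
shown = b"".join(gunzip_stream(pieces))
```

The receiving side never needs the whole file at once: every piece that comes in is pushed through the decoder and whatever comes out can be processed right away.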

Now, that would open up possibilities. Suppose I could just gzip big HTML files and store them as such on the webhost? This would mean that I could park at least twice as many files in the same webspace as I do now... Hmm. It's getting interesting.


Field trials with gzipped HTML files.

The GNU abbreviation makers will like this test. As you read this, you are reading a gzipped HTML file (there are more of them on this website; try to locate them). It's a cyclic reference...
On the other hand, if you have a Windows system with WinZip or WinRar incorrectly installed, you will not see this text, since your unzipper processed the file and did not send its contents back to the web browser. So let's consider this a Linux-only party... I like it more and more....

Below you see how I processed the HTML file:

      
  bash-2.05$ ls gz* -l 
  -rw-r--r--    1 jan      users        4291 Jan  8 20:35   gzipped.html 
  bash-2.05$ gzip gzipped.html 
  bash-2.05$ ls gz* -l 
  -rw-r--r--    1 jan      users        1664 Jan  8 20:35   gzipped.html.gz
  bash-2.05$
   
This is not yet the full file. But you see the gain in disk space: about 1700 instead of 4300 bytes. That means less disk space and faster loading. The slower your internet connection, the more time you gain when accessing this file. Decompression is fast and happens on the local machine.
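
For those who like numbers, the gain can be worked out from the two file sizes in the ls listing above. The little Python sketch below does just that; the function names are invented for the occasion, and the result only says something about this one file, not about HTML files in general.

```python
def saving_percent(original_size, compressed_size):
    """Return the share of bytes saved by compression, in percent."""
    return 100.0 * (1 - compressed_size / original_size)

def files_per_space(original_size, compressed_size):
    """Return how many compressed files fit in the space of one original."""
    return original_size / compressed_size

# Sizes taken from the ls listing above.
original, compressed = 4291, 1664
print(f"saved: {saving_percent(original, compressed):.1f}%")
print(f"ratio: {files_per_space(original, compressed):.2f}x")
```

For this file that comes down to a saving of roughly 61 percent, or about two and a half gzipped files in the space of one plain file.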

Below you see an HTML anchor for a normal file and one for a gzipped HTML file. As you see, you must tell the web browser which file to retrieve from the server, which isn't particularly illogical.
      
  <br>
   o <a href="testform.html"         Target = "main">Testform</a>
  <br>
   o <a href="test.html.gz"          Target = "main">Hello.gz</a>
  <br>
   
That's all there is to it. A child can do the laundry (Dutch proverb to indicate how easy things can be).


Google and gzipped HTML files.

Google does a great job of indexing data on the internet. Consider the internet as a giant encyclopedia. The search engines create an index on the fly, each and every time we click on their icons. The internet would be inaccessible without these systems.

But now, what will happen if Google tries to index a gzipped HTML file?

If Google were compatible with all browsers and webservers, it would just send the data through a gzip pipe and process the results. But is it this simple?
Questions, questions. And although I do have the answer, I won't give it away immediately. Here are the results of a Google search for a gzipped test file on the Fruttenboel domain:

See?

Google (and probably most other web search engines) just opens the file for read access and tries to 'read' it like a human being would. And since it sees a lot of 'difficult' letters, it assumes there are oriental characters in the message instead of Western letters.

So the answer is: Google will not index gzipped HTML files. Now we could sit back and send this idea to the garbage can, but we could also try to convince Google to pipe all gzipped data through a filter and then start the indexing.
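
Such a filter would not have to be complicated. Below is a minimal Python sketch of what I mean, under the assumption that the crawler has the raw bytes of the file in hand. This is in no way Google's actual code; the names readable_bytes and contains_phrase and the little test page are hypothetical. The only hard fact used is that every gzip file starts with the two bytes 1f 8b.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file

def readable_bytes(payload):
    """Return payload unchanged, or decompressed if it looks like gzip data."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload

def contains_phrase(payload, phrase):
    """Report whether phrase occurs in the payload after the gzip filter."""
    text = readable_bytes(payload).decode("utf-8", errors="replace")
    return phrase.lower() in text.lower()

# A made-up test page holding the test sentence used on this page, stored gzipped.
html = "<html><body><p>Pineukeltje rulez! En de frutselkip legt een ei.</p></body></html>"
stored = gzip.compress(html.encode("utf-8"))

naive = b"frutselkip" in stored          # reading the raw file, no filter
filtered = contains_phrase(stored, "frutselkip legt een ei")
```

Reading the raw file, the word is nowhere to be found; after the filter it is there in plain sight. Plain HTML files pass through the filter untouched, so the same routine could be used for every file.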

On the other hand, if we don't tell Google, this is a fine way of hiding data from the search engines. They will still find all references to these files in raw HTML files, but the crawlers will not see the contents of the files anymore.
For the time being, I'm not gonna tell Google about this breach. I will give the paranoids a chance to publish their stuff on the web, using this trick, thereby saving precious disk space on their webhosts. Let's see how long we can have our secret pages on the web with this tool.


Will search engines index gzipped HTML files?

The following text is Dutch slang, so don't try to translate it if this is not your native language. The English equivalent of the text would be something like Eenyweeny rulez. The essence was to use a sentence that is not available elsewhere on the internet. And I found one:

Pineukeltje rulez! En de frutselkip legt een ei.

The point of this ridiculous test is to see if search engines will really index gzipped HTML files. Google reported they do, but I seriously doubt it. Until proven otherwise. Check out this option by visiting your favorite search engine and entering the text "pineukeltje rulez" or "frutselkip legt een ei". Neither frutselkip nor Pineukeltje rulez occurs on the current internet.
You should end up with this same page if search engines really index gzipped HTML files. Time will tell.


The results of a month of waiting...

It took a month or so until Google re-indexed my site. They found this particular file, as you can see in the screenshot.... Only they didn't read the contents.
The proof, you say? Well, here's the proof. Eat it!
The proof is in following the lead. It will present this same page... Enough proof for me. I'm convinced: Google does not process gzipped files, no matter what they tell users.
To the right you see some more proof. Remember the lines above with pineukeltje and frutselkip? You won't find them as such on the internet. What more proof do you need? Gondoleesa Rice telling you it really is so?
It's clear. Google will find gzipped HTML files if they are listed in another webpage, but the contents are secure. Which also means that Mrs Gondoleesa Rice will never be able to find this webpage just by entering her name in a search engine's search window.

Hence: this is a secret webpage!


You can run, but you cannot hide!

You can run indeed, but hiding is very difficult. As of March 2006, at least Google searches and indexes gzipped HTML files, as can be seen in the snapshot at the left. Google found the silly sentence.
If hiding is no longer possible with the html.gz trick, then the only purpose of gzipping HTML files is space reduction on the webserver, plus faster transmission, since it is the browser that unzips the file and not the webserver.

That's all folks! End of topic and project.


Page created on February 25, 2005