Found at: gopher.blog.benjojo.co.uk:70/building-a-search-engine-for-gopher
Building a legacy search engine for a legacy protocol
Most people who use the internet today rely mainly on three
protocols: HTTP, TLS, and DNS.
While these users may not care these days how their pages
are displayed to them, there was once a competing protocol to
the one that we know and love, HTTP.
Located just 10 ports down from HTTP is Gopher, on TCP
port 70: a protocol that looks a lot like a much more basic HTTP.
While a basic HTTP/1.0 request may look like this:
$ nc blog.benjojo.co.uk 80
GET / HTTP/1.0
HTTP/1.1 403 Forbidden
Date: Wed, 03 May 2017 19:11:28 GMT
Content-Type: text/html; charset=UTF-8
A gopher request is even more basic:
$ echo "/archive" | nc gopher.floodgap.com 70
1Floodgap Systems gopher root
iWelcome to the Floodgap gopher files archive. This
imirrors of popular or departed archives that we believe
This is great for basic file transfers, as basic bash
utilities can be used to download files!
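The same exchange can be sketched in Python. This is a minimal illustration, not the crawler used for this post: the function names `build_request` and `fetch` are made up for the example. It sends the selector followed by CRLF, as RFC 1436 describes, then reads until the server closes the connection.

```python
import socket


def build_request(selector: str) -> bytes:
    """A gopher request is just the selector followed by CRLF (RFC 1436)."""
    return selector.encode("utf-8") + b"\r\n"


def fetch(host: str, selector: str = "", port: int = 70,
          timeout: float = 10.0) -> bytes:
    """Send one selector and return everything the server sends back."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(selector))
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("gopher.floodgap.com", "/archive")` should return the same menu as the `nc` example above, provided the host is reachable.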
In the end, however, HTTP won out over gopher and
became the protocol that most of us use to do things on the
internet. The reasons why are interesting, but others have told
that story much better; you can find a good write up here.
Search engines exist for HTTP, and gopher itself has
support for searching in the protocol, but all the search engines
for gopher are in "gopherspace". None really existed over HTTP
(that I could find).
I had a friend run a zmap scan over the internet for
port 70, and then filtered the results for real gopher servers:
[ben@aura tmp]$ pv gopher.raw | grep 'read":"i' | jq .ip |
sort -n | uniq -c | wc -l
2.14GiB 0:00:08 [ 254MiB/s] [================>] 100%
A sad 370 servers are left on the internet that serve
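For readers without `jq` to hand, the same filter can be sketched in Python. It assumes, as the `grep` and `jq` calls above imply, that each line of `gopher.raw` is a JSON object with an `ip` field and a `read` field holding the first bytes the server sent back. The function name is made up for this example. A reply starting with `i` (an informational menu line) is treated as a sign of a real gopher server.

```python
import json
from typing import Iterable


def count_gopher_hosts(lines: Iterable[str]) -> int:
    """Count distinct IPs whose banner starts with 'i' (a gopher info line)."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt scan records
        if not isinstance(record, dict):
            continue
        banner = record.get("read")
        if isinstance(banner, str) and banner.startswith("i") and "ip" in record:
            hosts.add(record["ip"])
    return len(hosts)
```

It could be run as `count_gopher_hosts(open("gopher.raw", encoding="utf-8", errors="replace"))`.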
## Building a crawler
I wrote a simple RFC1436 implementation and crawler, and
slowly (there are very old servers behind some of these hosts)
began crawling all the menus, known as selectors, and text files
I could find.
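The crawl loop can be sketched roughly as follows. This is not the crawler written for this post, only a minimal illustration in Python: `parse_menu` and `crawl` are hypothetical names, the `fetch` callable is supplied by the caller (for example a function that opens a TCP connection, sends the selector and reads the reply), and only item types `0` (text file) and `1` (menu) from RFC 1436 are followed.

```python
from collections import deque
from typing import Callable, Dict, Iterator, Tuple

Item = Tuple[str, str, str, int]  # (type, selector, host, port)


def parse_menu(raw: bytes) -> Iterator[Item]:
    """Yield (type, selector, host, port) for each well-formed menu line."""
    for line in raw.decode("utf-8", errors="replace").splitlines():
        if line == ".":  # a lone dot marks the end of a menu
            break
        if not line:
            continue
        fields = line[1:].split("\t")
        if len(fields) < 4 or not fields[3].strip().isdecimal():
            continue  # malformed line; skip it rather than abort the menu
        yield line[0], fields[1], fields[2], int(fields[3])


def crawl(start: Item, fetch: Callable[[str, int, str], bytes],
          limit: int = 1000) -> Dict[Item, bytes]:
    """Breadth-first crawl from one menu, keeping menus and text files."""
    seen = {start}
    queue = deque([start])
    pages: Dict[Item, bytes] = {}
    while queue and len(pages) < limit:
        item = queue.popleft()
        kind, selector, host, port = item
        try:
            body = fetch(host, port, selector)
        except OSError:
            continue  # dead or slow host; move on
        pages[item] = body
        if kind != "1":
            continue  # only menus contain further links
        for link in parse_menu(body):
            if link[0] in ("0", "1") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the network code separate, so timeouts and rate limits for very old servers can be handled in one place.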
At this point I started to explore gopher space itself,
and I have to say, it's a wonderful place of just pure
content, a far cry from the modern internet where CSS and
adtech are stuffed in every corner.
A ttygif of using gopher
## Indexing the content
Given that gopher is from the 1990s, it feels only right
to use search engine tech from the era. As it happens,
AltaVista once sold a personal/home version of their search engine
for desktop computers. The issue, however, is that it's Win32-only
software. I didn't try running it on Wine; instead I aimed
for a more authentic experience of running the software: using
an already fantastic guide from NeoZeed, I ended up provisioning
my very own Windows 98 search "server".
"I never thought I be doing this again, and yet, here we are! Join me in the adventures of "oh god we are installing windows 98 again"" pic.twitter.com/Ub83PQ4sV7 (Ben Cox (@Benjojo12), May 6, 2017: https://twitter.com/Benjojo12/status/860893238831480833)
altavista being installed
The search engine works:
altavista test search
As NeoZeed found out, the search interface only listens on
loopback (hardcoded), which is very annoying if you want to expose
it to the wider world! To solve this, stunnel
was deployed to listen on * and relay connections back to
the local instance, with added SSL, using a pretty questionable
(but CT logged) default SSL certificate too (A)!
## Providing data to the indexer
Because the indexer and backend run in a Windows 98 QEMU VM,
there has to be a way of giving that VM approximately 5GB of
tiny files for the AltaVista indexer to see and include in the
index. For this, I chose to make a FAT32 file system every 24
hours as a snapshot of the crawled data, and then restart the
crawling VM to see the new files. This worked really quite well in
testing with a small number of files, but a few issues became
apparent. First of all, FAT32 has file limits that need to be paid
attention to. For example, FAT32 is not able to have more than 255
files in the root directory, so you have to be sure to spread
out your files in folder structures.
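One way to spread files out is to derive a nested directory path from each file's numeric ID. The sketch below is an assumption about how this could be done, not the layout actually used: `shard_path`, the `.txt` extension and the limit of 200 entries per directory (chosen to stay under the 255 mentioned above) are all illustrative.

```python
def shard_path(doc_id: int, per_dir: int = 200) -> str:
    """Map a numeric ID to 'top/leaf/id.txt' with at most per_dir entries per folder.

    With per_dir=200 and at most 200 top-level folders this covers
    200 * 200 * 200 = 8,000,000 files.
    """
    if doc_id < 0:
        raise ValueError("doc_id must be non-negative")
    leaf = doc_id // per_dir  # index of the folder that holds this file
    top = leaf // per_dir     # index of the top-level folder holding that folder
    if top >= per_dir:
        raise ValueError("doc_id too large for a two-level layout")
    return f"{top:03d}/{leaf % per_dir:03d}/{doc_id}.txt"
```

Consecutive IDs fill one folder before moving to the next, so no folder ever holds more than `per_dir` files and the root only holds the top-level folders.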
Another issue to keep in mind is that the maximum drive
size for FAT32 is 32GB (approx). This means the amount of
textual content can't grow beyond that (or we would have to
spawn more virtual drives). Fortunately, the size of the crawled
content is far below that, so this is a non-issue.
While crawling a "production size" data set, the system
would reset at a random point during the indexing program
gif of Windows 98 and the crawler randomly crashing
After tweaking the settings involving the disk caches, a
slightly more constructive error was obtained:
A blue screen of death
This is good! I can search for `exception 05 has occurred
at 0028:C2920074`! Right? As it happens, there is very little
information about this kind of crash on the internet (it may have
existed at some point in the past, but has since been removed; after
all, the OS is 20 years old). However, the one piece of
information I could gather from searching is that it was VFAT driver
related. Suspecting a bad combo between high IO load and QEMU's
INT_13 implementation, I went for the only other file system/data
input system available: CD/DVD ROM!
After doing a small 500MB test (a test that FAT32 could
not pass) we had a small index!
The test index built
At this point we had to scale up the solution to the
300k files / 4 GB. I discovered that Windows 98 does
support DVDs, even though the UI can only display the drive
as 2GB, however much larger it really is. Despite
that, all content was accessible on the drive and an initial
index was (slowly) built.
## Sanitise the index interface
One problem with using a 20 year old indexer is
that it's likely a **very** bad idea to expose it directly to the
internet. The other issue is that most of the pages the
interface serves reference local (as in, `file://`) assets, meaning
that a simple reverse proxy would not work.
In addition, local paths are not very useful to people
searching. For this, `alta-sanitise` was written to provide a sane
front end, while still keeping the Windows 98 AltaVista
index as its backend.
To do this, I produce a file system containing all the
files that were downloaded, and name them by their database ID:
A sample search on the unaltered interface
However, in alta-sanitise we use the database we formed
while crawling to rewrite the URLs into something viewable:
A sample search on the production interface
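The rewriting step could look something like the sketch below. It is a guess at the general idea rather than the actual `alta-sanitise` code: the result path format, the `.txt` extension, the in-memory `id_to_url` table (standing in for the crawl database) and the function name are all assumptions made for illustration.

```python
import re
from typing import Mapping, Optional

# Matches a file named purely by its numeric ID at the end of a path,
# e.g. D:\007\099\299999.txt or file:///D:/007/099/299999.txt
_ID_RE = re.compile(r"(?:^|[\\/])(\d+)\.txt$", re.IGNORECASE)


def rewrite_result(local_path: str, id_to_url: Mapping[int, str]) -> Optional[str]:
    """Turn a local file path from the index into the URL it was crawled from.

    Returns None when the path carries no ID or the ID is unknown, so the
    caller can drop the result instead of leaking a local path.
    """
    match = _ID_RE.search(local_path)
    if match is None:
        return None
    return id_to_url.get(int(match.group(1)))
```

Returning `None` for anything unrecognised means only results that map back to a crawled document are ever shown.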
To ensure the VM could be used for more than one project,
lighttpd was put in front as a reverse proxy of `alta-sanitise`,
and Cloudflare used for cache. This leaves the final flow
looking like this:
## Monitoring Windows 98
Most of my servers are monitored using collectd.
Unfortunately there is no Windows 98 client for collectd (!?), so I
decided to make one.
A simple Visual Basic 6 application will poll every 10
seconds and output collectd command strings over the serial port
(where it can be passed on to collectd on the hypervisor):
VB6 and the serial output
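The real client is written in Visual Basic 6; the sketch below uses Python instead, purely to show what such command strings look like. It follows collectd's plain-text protocol (`PUTVAL "host/plugin/type" interval=N time:value`). The host name `win98`, the metric names and the function names are placeholders, and how collectd reads the lines on the hypervisor side is left out.

```python
import time
from typing import Callable, Dict, TextIO


def format_putval(host: str, plugin: str, type_: str, value: float,
                  interval: int = 10) -> str:
    """Build one collectd plain-text PUTVAL line; 'N' means 'now'."""
    return f'PUTVAL "{host}/{plugin}/{type_}" interval={interval} N:{value}\n'


def poll_forever(port: TextIO, readers: Dict[str, Callable[[], float]],
                 host: str = "win98", interval: int = 10) -> None:
    """Every `interval` seconds, write one PUTVAL line per metric to the port.

    `readers` maps "plugin/type" to a function returning the current value.
    """
    while True:
        for name, read in readers.items():
            plugin, type_ = name.split("/", 1)
            port.write(format_putval(host, plugin, type_, read(), interval))
        port.flush()
        time.sleep(interval)
```

Here `port` can be any writable text stream, such as an opened serial device.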
This code can be found separately at:
(Some troubled soul may find this useful outside of a
## Giving back to the community
Now that I have a sizeable index of the gopher space, I
feel like I should give back to gopher space:
my blog on gopher
You can now find my blog on gopher at
`gopher.blog.benjojo.co.uk` (** only accessible on gopher, Lynx supports gopher if that
The search engine can be used at:
And the code can be found at: