
Found at: gopher.blog.benjojo.co.uk:70/building-a-search-engine-for-gopher

 Building a legacy search engine for a legacy protocol
 Most people who use the internet today rely mostly on three
protocols: HTTP, TLS, and DNS.
 While these users may not care these days how their pages
are displayed to them, there was once a protocol competing with
the one that we know and love, HTTP.


 Located just 10 ports down from HTTP is Gopher, on TCP
port 70, a protocol that looks a lot like a much more basic
HTTP.
 While a basic HTTP/1.0 request may look like this:
 $ nc blog.benjojo.co.uk 80
 GET / HTTP/1.0
 HTTP/1.1 403 Forbidden
 Date: Wed, 03 May 2017 19:11:28 GMT
 Content-Type: text/html; charset=UTF-8
 Connection: close
 A gopher request is even more basic:
 $ echo "/archive" | nc gopher.floodgap.com 70
 1Floodgap Systems gopher root
 iWelcome to the Floodgap gopher files archive. This
 imirrors of popular or departed archives that we believe
 This is great for basic file transfers, as basic bash
utilities can be used to download files!
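
 The exchange above can also be sketched in Python. This is only
an illustration, not the code used for this project: the function
names are made up, and it sends the CRLF terminator that RFC 1436
specifies (the `echo` example gets away with a bare newline).

```python
import socket


def build_request(selector: str) -> bytes:
    """A Gopher request is just the selector followed by CRLF (RFC 1436)."""
    return selector.encode("utf-8") + b"\r\n"


def fetch(host: str, selector: str = "", port: int = 70,
          timeout: float = 10.0) -> bytes:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(selector))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server signals the end by closing the socket
                break
            chunks.append(data)
    return b"".join(chunks)
```

 Calling `fetch("gopher.floodgap.com", "/archive")` would mirror the
`nc` example above.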

 In the end, however, HTTP won out over gopher and became
the protocol that most of us use to do things on the internet.
The reasons why are interesting, but others have told that story
much better; you can find a good write up here.
 Search engines exist for HTTP, and gopher itself has
support for searching in the protocol, but all the search engines
for gopher are in "gopherspace". None really existed in HTTP
(that I could find).



 I had a friend run a zmap scan over the internet for
port 70, and then filtered the results for real gopher servers:
 [ben@aura tmp]$ pv gopher.raw | grep 'read":"i' | jq .ip |
sort -n | uniq -c | wc -l
 2.14GiB 0:00:08 [ 254MiB/s] [================>] 100%
 A sad 370 servers are left on the internet that serve
gopher.
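
 For readers who prefer it to the shell pipeline, the same filter
can be sketched in Python. It assumes what the `grep` and `jq`
calls above imply: one JSON object per line, with an `ip` field and
a `read` field holding the banner. The function name is made up.

```python
import json


def count_gopher_hosts(lines):
    """Count unique IPs whose banner starts with a gopher 'i' (info) line."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict):
            continue
        banner = record.get("read")
        ip = record.get("ip")
        if isinstance(banner, str) and banner.startswith("i") and ip:
            hosts.add(ip)
    return len(hosts)
```

 With the scan results on disk this could be run as
`count_gopher_hosts(open("gopher.raw"))`.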
 ## Building a crawler
 I wrote a simple RFC 1436 implementation and crawler, and
slowly (there are very old servers behind some of these hosts)
began crawling all the menus (lists of items, each addressed by
a selector) and text files I could find.
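
 To give an idea of what such a crawler has to parse, here is a
minimal sketch of reading an RFC 1436 menu. It is not the crawler
used here; the names `MenuItem`, `parse_menu` and `crawlable` are
made up for illustration.

```python
from typing import List, NamedTuple


class MenuItem(NamedTuple):
    item_type: str  # first character of the line: "0" text file, "1" menu
    display: str
    selector: str
    host: str
    port: int


def parse_menu(raw: bytes) -> List[MenuItem]:
    """Parse menu lines of the form: type+display TAB selector TAB host TAB port."""
    items = []
    for raw_line in raw.split(b"\n"):
        line = raw_line.decode("utf-8", errors="replace").rstrip("\r")
        if line == ".":  # a lone dot terminates the menu
            break
        if not line:
            continue
        fields = line[1:].split("\t")
        if len(fields) < 4:
            continue  # malformed line, skip it
        display, selector, host, port = fields[:4]
        try:
            port_number = int(port)
        except ValueError:
            continue
        items.append(MenuItem(line[0], display, selector, host, port_number))
    return items


def crawlable(items: List[MenuItem]) -> List[MenuItem]:
    """Keep only text files ("0") and menus ("1"), the two types worth following."""
    return [item for item in items if item.item_type in ("0", "1")]
```

 A crawler would fetch each "1" item recursively and store each "0"
item as a document.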
 At this point I started to explore gopher space itself,
and I have to say, it's a wonderful place of just pure
content, a far cry from the modern internet where CSS and
adtech are stuffed in every corner.

 A ttygif of using gopher
 ## Indexing the content


 Given that gopher is from the 1990s, it feels only right
to use search engine tech from the era. As it happens,
AltaVista once sold a personal/home version of their search engine
for desktop computers. The issue, however, is that it's win32-only
software. I didn't try running it on wine; instead I aimed
for a more authentic experience of running the software. Using
an already fantastic guide from NeoZeed, I ended up provisioning
my very own Windows 98 search "server".

 altavista being installed
 The search engine works:

 altavista test search


 As NeoZeed found out, the search interface only listens on
loopback (hardcoded), which is very annoying if you want to expose
it to the wider world! To solve this, stunnel was deployed to
listen on * and relay connections back to the local instance, with
added SSL! Using a pretty questionable (but CT logged) default SSL
certificate too (A)!
 ## Providing data to the indexer
 Because the indexer and backend is a Windows 98 QEMU VM,
there has to be a way of giving that VM approximately 5GB of
tiny files for the AltaVista indexer to see and include in the
index. For this, I chose to make a FAT32 file system every 24
hours as a snapshot of the crawled data, and then restart the
VM to see the new files. This worked really quite well in
testing with a small number of files, but a few issues became
apparent. First of all, FAT32 has file limits that need to be paid
attention to. For example, FAT32 is not able to have more than 255
files in the root directory, so you have to be sure to spread
out your files in folder structures.
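
 One way to spread files out is to derive a nested directory from
each document's ID. The sketch below is an assumption about how
that could look, not the layout actually used: the `shard_path`
name, the 200-entry cap and the two directory levels are all made
up for illustration.

```python
from pathlib import PurePosixPath


def shard_path(doc_id: int, per_dir: int = 200, depth: int = 2) -> PurePosixPath:
    """Map a numeric ID to a nested path so no directory exceeds per_dir entries.

    This holds for IDs below per_dir ** (depth + 1); larger IDs wrap around
    and start sharing leaf directories.
    """
    if doc_id < 0:
        raise ValueError("doc_id must be non-negative")
    parts = []
    bucket = doc_id // per_dir  # which group of per_dir files this ID is in
    for _ in range(depth):
        parts.append(f"{bucket % per_dir:03d}")
        bucket //= per_dir
    # Most significant directory first, file name last.
    return PurePosixPath(*reversed(parts), f"{doc_id}.txt")
```

 With these defaults, document 299999 would land at
`007/099/299999.txt`, and both the directory and file names stay
short enough for 8.3 file names.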
 Another issue to keep in mind is that the maximum drive
size for FAT32 is approximately 32GB. This means the amount of
textual content can't grow bigger than that (or we would have to
spawn more virtual drives). Fortunately, the size of the crawled
content is far below that, so this is a non-issue.
 While crawling a "production size" data set, the system
would reset at a random point during the indexing run:

 gif of Windows 98 and the crawler randomly crashing
 After tweaking the settings involving the disk caches, a
slightly more constructive error was obtained:

 A blue screen of death


 This is good! I can search for `exception 05 has occurred
at 0028:C2920074`! Right? As it happens, there is very little
information about this kind of crash on the internet (it may have
existed at some point in the past, but has since been removed;
after all, the OS is 20 years old). However, the one piece of
information I could gather from searching is that it was VFAT driver
related. Suspecting a bad combo between high IO load and QEMU's
INT_13 implementation, I went for the only other file system/data
input system available: CD/DVD ROM!
 After doing a small 500MB test (a test that FAT32 could
not pass) we had a small index!

 The test index built
 At this point we had to scale up the solution to the full
data set of 300k files (4 GB). I discovered that Windows 98 does
support DVDs, although the UI reports the drive as 2GB even when
it is much larger than that. Despite that, all content was
accessible on the drive and an initial index was (slowly) built.
 ## Sanitise the index interface
 One problem with using a 20 year old indexer is
that it's likely a **very** bad idea to expose it directly to the
internet. The other issue is that most of the pages the
interface serves reference local (as in, `file://`) assets, meaning
that a simple reverse proxy would not work.
 In addition, local paths are not very useful to people
searching. For this, `alta-sanitise` was written to provide a sane
front end to it, while still keeping the Windows 98 AltaVista
index as its backend.
 To do this, I produce a file system containing all the
files that were downloaded, and name them by their database ID:

 A sample search on the unaltered interface
 However, in alta-sanitise we use the database we formed
while crawling to rewrite the URLs into something viewable:

 A sample search on the production interface
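
 The rewriting step can be pictured roughly as follows. This is a
guess at the shape of the idea rather than the actual
`alta-sanitise` code: the `.txt` naming, the `rewrite_result`
function and the dictionary standing in for the crawl database are
all assumptions.

```python
import re
from typing import Mapping, Optional

# Files are named by database ID, so the ID is the number before ".txt".
ID_PATTERN = re.compile(r"(\d+)\.txt$", re.IGNORECASE)


def rewrite_result(local_path: str, db: Mapping[int, str]) -> Optional[str]:
    """Turn a local result path into the gopher URL it was crawled from."""
    match = ID_PATTERN.search(local_path)
    if match is None:
        return None  # not one of the crawled documents
    return db.get(int(match.group(1)))  # None if the ID is unknown
```

 Results that cannot be mapped back to a crawled URL come back as
`None` and can simply be dropped from the page.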
 To ensure the VM could be used for more than one project,
lighttpd was put in front as a reverse proxy for `alta-sanitise`,
and Cloudflare was used as a cache. This leaves the final flow
looking like this:

 Final flow
 # Monitoring Windows 98
 Most of my servers are monitored using collectd.
Unfortunately there is no Windows 98 client for collectd (!?), so I
decided to make one.

 A simple Visual Basic 6 application will poll every 10
seconds and output collectd command strings over the serial port
(where it can be passed on to collectd on the hypervisor):

 VB6 and the serial output
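
 For reference, a collectd command string of this kind can be
built as in the sketch below. It assumes the `PUTVAL` command from
collectd's plain-text protocol; the `putval` helper and the metric
names in the usage note are made up, and this is Python rather
than the VB6 actually running on the VM.

```python
from typing import Optional


def putval(host: str, plugin: str, type_: str, value: float,
           interval: int = 10, type_instance: str = "",
           timestamp: Optional[int] = None) -> str:
    """Format one PUTVAL line for collectd's plain-text protocol."""
    identifier = f"{host}/{plugin}/{type_}"
    if type_instance:
        identifier += f"-{type_instance}"
    # "N" asks collectd to use the current time instead of an epoch value.
    when = "N" if timestamp is None else str(int(timestamp))
    return f'PUTVAL "{identifier}" interval={interval} {when}:{value}'
```

 For example, `putval("win98", "memory", "memory", 1024,
type_instance="free")` produces a line that could be written to the
serial port every 10 seconds.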

 Grafana screenshot
 The code for this collectd client can be found separately at:
 (Some troubled soul may find this useful outside of a
gopher crawler.)
 # Giving back to the community
 Now that I have a sizeable index of the gopher space, I
feel like I should give back to gopher space:

 my blog on gopher
 You can now find my blog on gopher at
`gopher.blog.benjojo.co.uk` (**only accessible on gopher; Lynx
supports gopher if that helps**).
 The search engine can be used at:
 And the code can be found at: