BACKSTRIP

Jun 14

The Internet of 1901. Via Maria Popova


Jun 11

Tombstones

I need a way to read tombstones. Let’s not get bogged down in why (there’s more to come on that shortly, I hope), so just nod along when I say every tombstone in the world should be recorded and analysed.

The biggest hurdle is that OCR technology is rubbish. We can’t even figure out a way to read books, and they have relatively uniform fonts, grammar, structure and, mostly, high contrast black-on-white. Even Google’s massive book scanning monster chokes when it sees a word with a speck of dust on it.

Now let’s look at tombstones. They’re unstructured, they use proper names, they have inconsistent date formats with mixed, irregular fonts - sometimes on an angle - and it all sits on material that’s mottled at best, or cracked and eroded at worst. In fact, you could probably use them to replace Captchas.

Now I’m not smart enough to build a new OCR system, but I can probably build a system that learns how to be better. So after some early morning hackery, here’s what I came up with.

Let’s start with this:

Admittedly this is a fairly clean tombstone, but Tesseract still only managed to pull out HERMAN WHITE and IN LOVIN. Names are great, but dates are better. So how to alter the image to make it more OCR friendly? I have no idea, so I put together a small conversion pipeline that reads the pixel data, separates the colours, OCRs the text, then checks the results against a rough ‘good fit’ algorithm (it’s really just my style guide for tombstones: each result is parsed and scored against some basic grammar rules). If the score isn’t good enough, it makes some tweaks and tries again.
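To make the ‘good fit’ idea concrete, here’s a rough sketch of what the scoring step could look like. This is not the actual code from my pipeline: the rules, the weights and the names `score_text` and `best_candidate` are all made up for illustration, and the `ocr` argument stands in for whatever calls Tesseract. It also just picks the best of a fixed set of conversions rather than tweaking as it goes.

```python
import re

MONTHS = ("JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|"
          "SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER")

# Hypothetical rule: a date like "JUNE 25, 2007" or "MAY 19TH 1929".
DATE_RE = re.compile(
    rf"\b(?:{MONTHS})\s+\d{{1,2}}(?:ST|ND|RD|TH)?,?\s+\d{{4}}\b")

# Hypothetical rule: a line made only of two or more capitalised words.
NAME_RE = re.compile(r"^[A-Z]{2,}(?:\s+[A-Z]{2,})+$")


def score_text(text):
    """Score OCR output line by line; dates are worth more than names."""
    score = 0
    for line in text.upper().splitlines():
        line = line.strip()
        if not line:
            continue
        if DATE_RE.search(line):
            score += 3  # assumed weight: dates are what we really want
        elif NAME_RE.match(line):
            score += 1  # assumed weight: names (and other clean words)
    return score


def best_candidate(candidates, ocr):
    """Run ocr() on each candidate conversion and keep the best scorer.

    `ocr` is any callable that returns text for a candidate; in a real
    pipeline it would wrap the OCR engine.
    """
    best = None
    for candidate in candidates:
        text = ocr(candidate)
        score = score_text(text)
        if best is None or score > best[0]:
            best = (score, candidate, text)
    return best
```

With these made-up weights, the raw result (HERMAN WHITE, IN LOVIN) scores 2, while anything containing a clean date jumps ahead, which is exactly the bias towards dates I’m after.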

After many iterations, here’s the best conversion it came up with:

And here’s the processed result:

HERMAN WHITE

MAY 19h 1929

JUNE 25, 2007

Not perfect, and it missed IN LOVING MEMORY (because my best fit score favours dates), but it’s more useful.

What now? This took about 10 minutes to process, which is far too slow. Next step is to collect data from a wide range of tombstones and then build that into the initial conversion. I have doubts that it’s even possible to build something that could be used on all tombstones, but it’s an inch closer.

Jun 06

“It is a space ship that will take you to the farthest reaches of the Universe.” Isaac Asimov on the opening of a new public library. More at Letters of Note. Via BB.


Jun 04

IiB’s SnakeOil chart has been updated.

Some interesting usage stats:


  • most popular filter (by a long way): sex (followed by cancer, anti-viral and mental health)
  
  • most popular supplements (above the worth-it-line): green tea, fish oil, vitamin D, St John’s wort, probiotics


May 13

Statistical distribution… pillows! (via @5310)


May 11

What I saw when I visited the Sydney Morning Herald: ads on top of crap on top of ads. Garbage.


May 09

Logo trends for 2011. More at Logo Lounge.

Apr 28

Software progress outpaces hardware

From the New York Times:

A report [pdf] by an independent group of science and technology advisers to the White House, published last December, cited research showing that performance gains in doing computing tasks that result from improvements in software algorithms often far outpace the gains attributable to faster processors.

Apr 23

“Why not expand the landmass of Europe by draining part of the Mediterranean?” Via io9


Apr 14

[video]

Apr 08


I’ve been tooling around with street-making algorithms. While street plans (if they are planned) are made according to different kinds of rules (like, say, the length and distribution of contiguous paths), most of the time they end up just looking like a maze.

And so that got me thinking about the history of maze-making algorithms, and in particular, the smallest and most efficient way to build one. And that led me to this:


  10 PRINT CHR$(205.5+RND(1)); : GOTO 10


which, when punched into a Commodore 64 (I used Vice for this), produces a maze like the one above. With a few tweaks, it could be Sydney’s Inner West.
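If you don’t have a C64 emulator handy, here’s a rough Python take on the same idea. The BASIC line works because RND(1) returns a number between 0 and 1, so 205.5+RND(1) truncates to either 205 or 206, the two diagonal line characters in the C64’s character set. In this sketch the function name `maze`, the grid size and the Unicode diagonals are my own stand-ins.

```python
import random


def maze(rows=10, cols=40, seed=None):
    """Build a grid of randomly chosen diagonals, one per cell."""
    rng = random.Random(seed)  # pass a seed to get a repeatable maze
    return "\n".join(
        "".join(rng.choice("╱╲") for _ in range(cols))
        for _ in range(rows)
    )


if __name__ == "__main__":
    print(maze())
```

Unlike the original, this stops after a fixed number of rows instead of scrolling forever.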


Apr 07

Blogs to ebooks

So, I’ve been experimenting with ebooks. Specifically, I’ve been sneaking around other people’s websites, nicking their content, and then sticking it on my Kindle.

For example, I just grabbed all the articles from Tim Rogers’ very excellent Action Button:

Action Button is a good candidate because the images are black and white, the articles are long (some are many thousands of words), and the site and HTML are relatively clean and well-structured.

Note that this process is automated: I’ve built a small application that dives into Action Button’s review archive, creates a table of contents, grabs the HTML, and then parses and formats each article. The articles are then converted into a single epub file (or Mobi, in the case of the Kindle). All up: 15 minutes to turn a website full of high quality articles into a 3MB ebook, ready to be published and sold*.
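For the curious, the parse-and-format step could be sketched along these lines. This is an illustration rather than my actual application: it assumes each article’s HTML is already downloaded, that the page `<title>` is a usable chapter title, and that the body text lives in `<p>` tags. The names `ArticleParser`, `parse_article` and `build_book` are invented here, and the result is one plain HTML file that a separate converter would still have to turn into epub or Mobi.

```python
from html import escape
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None   # the <title> or <p> we are currently inside
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buf).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None


def parse_article(html):
    """Return (title, list of paragraph strings) for one page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.paragraphs


def build_book(pages):
    """Join several pages into one HTML document with a linked contents list."""
    articles = [parse_article(page) for page in pages]
    toc = "\n".join(
        f'<li><a href="#ch{i}">{escape(title)}</a></li>'
        for i, (title, _) in enumerate(articles, 1)
    )
    chapters = "\n".join(
        f'<h2 id="ch{i}">{escape(title)}</h2>\n'
        + "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        for i, (title, paragraphs) in enumerate(articles, 1)
    )
    return (f"<html><body>\n<h1>Contents</h1>\n<ol>\n{toc}\n</ol>\n"
            f"{chapters}\n</body></html>")
```

Real pages are messier than this (sidebars, comments, images), so expect to add site-specific rules on top.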

So if you’re sitting on some quality content, especially if it’s already in a blog, then you’re about 95% of the way there. Go to it!

* Okay, so you might want to check over it, probably edit it. But you know, in theory.

Mar 30

In a flood, spiders climb trees and go apeshit with their webs. Incredible. More pics at Wired UK.

Mar 24

A metahoroscope, compiled from 22,000 individual horoscopes. More analyses here.

(The most unique words for my sign — Libra — are ‘reason’, ‘attention’, ‘learning’, ‘stars’ and ‘almost’. Spooky!)

Vintage Classics’ 3D Call of Cthulhu cover. According to The Bookseller, it comes with free 3D glasses.

Via @YSDC
