Digital Repositories and Preservation

Thoughts on digital repositories, digital preservation, and scholarly communication.

Fantastic solution to a difficult problem

August 15th, 2008 by Shane

While this may be old news to some, it was new to me, so I think it’s worth sharing. Yesterday on All Things Considered there was a story about a Carnegie Mellon project called reCAPTCHA, a security program designed to assist with the OCR’ing of words that computer programs are unable to recognize. In use by 40,000 sites including Ticketmaster, Facebook and Craigslist, reCAPTCHA shows a real security word along with a word from a scanned document that fooled the OCR software. The difficult word is entered by a number of different users, and when they agree, the word is incorporated into the scanned document. In this way, over 1.3 BILLION words have been entered, and they are being used to assist with the digitization and OCR’ing of The New York Times, back to 1851.
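For the curious, here is a rough sketch of how that agreement step might work. This is my own guess at the logic, not reCAPTCHA's actual code: the class and function names are made up, and the number of matching answers required (`min_votes`) is an assumption, since the story only said that "a number of" users have to agree.

```python
from collections import Counter


class UnknownWord:
    """A scanned word the OCR software could not read (hypothetical structure)."""

    def __init__(self, image_id, min_votes=3):
        self.image_id = image_id
        self.min_votes = min_votes  # assumed threshold, not a figure from the story
        self.votes = Counter()

    def add_vote(self, text):
        # Normalize so "Morning" and "morning " count as the same answer.
        self.votes[text.strip().lower()] += 1

    def resolved(self):
        """Return the agreed transcription, or None if there is no agreement yet."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= self.min_votes else None


def submit(control_answer, control_typed, unknown, unknown_typed):
    """Check the known security word; only then trust the guess at the unknown word."""
    passed = control_typed.strip().lower() == control_answer.lower()
    if passed:
        unknown.add_vote(unknown_typed)
    return passed


# Two users who pass the security word and type the same thing settle the word.
word = UnknownWord("scan-1", min_votes=2)
submit("river", "river", word, "morning")
submit("river", "river", word, "Morning")
print(word.resolved())
```

The point of pairing the two words is that the known word vouches for the person, so their answer to the unknown word can be counted as a vote.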

THIS IS FREAKING BRILLIANT. Take all the wasted effort people spend entering security words and convert it into something useful. If only this could be applied to any number of things… time spent reading your feeds on Google Reader or Bloglines (why?!) could run a protein folding Java applet for cancer research; every 20 minutes spent on Flickr could trigger a mandatory photography exercise to assist with computer image recognition; listening to one of the free music stations on Last.fm could require a user to first listen to a music sample and describe its characteristics… The issue I can see with these is that none of them is anywhere near as fast as typing a word - humans have been recognizing strings of characters for a loooong time now - we see letters and words where they don’t even exist.

I was simply blown away by the ingenuity of the idea and the successful execution. Three cheers.
