PcBerg: 2009-08-16

Summary

OCR (Optical Character Recognition) can really come in handy. For example, I previously wrote about how I use TimeSnapper as a black box to recover work that would otherwise be lost. Since most of my work is text-based (C#, SQL, HTML, documentation, communications, etc.), the obvious next step is to grab the code from a screenshot. Of course I can retype it, but OCR would be better.

There are some great commercial OCR packages out there. My company recently used OmniPage Pro in a project which loaded data from hundreds of PowerPoint slides into SQL Server for reporting and analysis¹. OmniPage is great software, but it costs $149 for the basic version, which doesn't really make sense if you're just using it to avoid retyping a little text from a screenshot every now and then.

I looked around for free OCR software and was a little surprised that there wasn't much out there. Here's a rundown of what I found, wrapping up with a program that isn't technically free, but that I already had. There's a good chance you've got it, too.

GOCR

I first tried out GOCR (a.k.a. JOCR). The easiest way to try it out is the GOCR Win Frontend, which installs GOCR as well. My opinion matched Pitor's:

To let things be clear - gocr is not ready, to say the least. Personally I'd even say the effect of trying to OCR a page is so crappy it is not even worth installing the gocr engine (seems like the total rewrite in 0.40 did not help much). And I am talking about an ascii black text on a white page, without other elements. Gocr seems to go all the way down here - error in 98% of recognized characters, randomly added spaces, etc. For example: content is C unrir in gocr, sounds like drunken elvish to me.

Tesseract OCR

Yeah, there's been some chatter in the blogospheres and internets about Tesseract since Google assisted in re-releasing it as an open source project. I have no doubts that the press alone (not to mention Google's involvement) will propel Tesseract towards OCR fame and fortune, but it sounds like it's not usable at this point:

It only is configured to build under MSVC++6 for Windows.
It only accepts uncompressed bitonal tiffs.
It's command-line only.
No GUI.
It performed abysmally on the provided testimage.tif
But it did build. :)
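
Command-line only isn't all bad, since it makes the tool easy to script. Here's a rough Python sketch of that idea. It assumes a `tesseract` executable on the PATH and the `tesseract <image> <outputbase>` calling convention, which writes the recognized text to `<outputbase>.txt`. The function names and the default output name are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argv list for: tesseract <image> <outputbase>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_tiff(image_path, output_base="ocr_output"):
    """Run Tesseract on a TIFF and return the recognized text.

    Returns None when no tesseract executable is found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract appends ".txt" to the output base name itself.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8", errors="replace")
```

Calling `ocr_tiff("testimage.tif")` would hand the file to Tesseract and read back whatever it recognized. The quality of that text is a separate question, as noted above.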

Microsoft Office Document Imaging

By accident, I stumbled across Microsoft Office Document Imaging. It's included with Microsoft Office Tools ("Microsoft Office \ Microsoft Office Tools" folder in the start menu, default installation location is "C:\Program Files\Common Files\Microsoft Shared\MODI\11.0\"). The interface looks like a "My First VB5 Application" reject, but it works great.

It handles scanned documents via TWAIN. The image import's a bit lame - it only handles TIF files. You can convert to TIF in just about any graphics application (e.g. MSPAINT - open the file, Save As TIF file). An easier method is to just copy the image to the clipboard and paste as a new page into MODI.

Here's a quick walkthrough of how I grabbed some text from a PDF².

Step 1. I selected the text I wanted to OCR with Cropper (output set to Clipboard)

Step 2. I opened Microsoft Office Document Imaging and loaded my image with Page / Paste Page

Step 3. I ran the OCR process by clicking on the "funky eye" toolbar button (or in the Tools menu)

Step 4. I clicked the Export to Word toolbar button

Step 5. I copied the text and pasted it where I wanted it

In this case, it was an e-mail. I've done the same thing to grab SQL or C# code which I then paste into the editor and compile (Ctrl-F5 for SQL, Ctrl-Shift-B for C#) to catch the things that didn't make it through the OCR cleanly.
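
The compile-to-check trick works in any language with a parser. Here's a minimal Python sketch of the idea, using Python source as a stand-in for my SQL and C#. The function name and the sample strings are hypothetical.

```python
def find_syntax_error(source, filename="<ocr>"):
    """Return (line, message) for the first syntax error, or None if it parses."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return err.lineno, err.msg
    return None


clean = "total = 0\nfor i in range(10):\n    total += i\n"
# A typical OCR slip: the colon ending the "for" line was read as a semicolon.
garbled = "total = 0\nfor i in range(10);\n    total += i\n"

for label, text in (("clean", clean), ("garbled", garbled)):
    print(label, find_syntax_error(text))
```

This only catches misreads that break the syntax. A slip that still parses (say, `l` for `1` inside a variable name) gets through, so the pasted code still deserves a read.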

I haven't tried it, but apparently you can automate MODI from .NET.
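
Untested against a real MODI install, here's a rough sketch of what that automation might look like, going through COM from Python rather than .NET. It assumes Windows, an Office install that includes MODI, and the pywin32 package. The `MODI.Document` ProgID, the `Create`, `OCR`, and `Close` calls, the `Images` collection with `Layout.Text`, and the value 9 for the English language constant are my reading of MODI's object model, so treat them as assumptions. The `create_document` parameter is a hypothetical hook that lets the function run against a stand-in object.

```python
MI_LANG_ENGLISH = 9  # assumed value of MODI's miLANG_ENGLISH constant


def ocr_with_modi(tiff_path, create_document=None):
    """OCR every page of a TIFF through MODI's COM interface and return the text.

    create_document is a zero-argument factory returning a MODI.Document-like
    object; by default the object is created through pywin32.
    """
    if create_document is None:
        # Needs Windows, Office with MODI installed, and the pywin32 package.
        import win32com.client

        def create_document():
            return win32com.client.Dispatch("MODI.Document")

    doc = create_document()
    doc.Create(tiff_path)
    try:
        # Arguments: language, auto-rotate the image, straighten the image.
        doc.OCR(MI_LANG_ENGLISH, True, True)
        pages = [doc.Images.Item(i).Layout.Text for i in range(doc.Images.Count)]
    finally:
        doc.Close(False)  # False: do not save changes back to the file
    return "\n".join(pages)
```

On a machine with MODI, `ocr_with_modi(r"C:\scans\page.tif")` (a made-up path) would return the recognized text with one block per page.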

¹ Yes, it sounds insane, but it actually worked, and the business value of the data more than justified it.
² Yes, you can select and copy text in a PDF. This is just an example, but in this case the final result of the OCR'd text was a lot cleaner than the oddly mangled and mis-formatted text I got from the PDF select / copy approach.

What is PDF Spam?

First there was email, then came spam - unsolicited commercial email - hawking pharmaceuticals, stock trades, sex, and more. Spam filtering became smarter with keyword and Bayesian filtering, and the spam was minimized for a while. Then image spam began: emails with little more than a link to an image on a server. When the email is opened with an HTML email reader, the spam appears a few seconds after viewing the email. Since there weren't keywords to analyze, most image spam slipped through spam filters with ease. However, spam filtering tools have now added OCR capabilities to "read" an image and search for keywords and phrases just like text emails. So what's next for the spammers to try... PDF Spam.

Spammers have now resorted to attaching PDFs to emails to entice users to open the PDFs and read their ads. This is very annoying, since almost all spam that includes a PDF is much larger than a normal email. I received literally hundreds of these emails a few weeks ago, and at first I wondered whether a virus writer had managed to inject a virus into a PDF file and was infecting computers. Luckily, that does not appear to be the case, although many of the newest viruses hijack computers and send these PDF spams from the resulting drone machines.

What Does PDF Spam Look Like?

The most common PDF spam has very little in the body of the message, just a subject and the PDF file.

Can A PDF File Contain a Virus?

Well, yes and no. Back in 2001, a virus named Peachy was created that spread via PDF. Fortunately, it could not be activated by someone viewing it with Acrobat Reader; only users with the full version of Adobe Acrobat were susceptible to this virus. Peachy exploited the fact that PDF files could contain executable files, in this case a VBScript file, that users of Adobe Acrobat could actually open. Virus scanners were updated and the virus didn't have a huge effect on the internet.

Luckily, up to this point there has not been a way for a virus writer to infect a PDF file so that a person viewing it with Adobe Reader would be harmed. It's still best to scan ANY file, including a PDF file, with an up-to-date virus scanner before attempting to open it.

Can PDF Spam Be Stopped?

Although PDF spam is a huge problem currently, spam filtering programs will catch up and start to filter this garbage email out. Unfortunately, attachment spam will morph into other types of files, and I've already seen Excel files (.xls) being used for spam as well. Using a reliable spam filter from your ISP or business, and being careful not to open ANY attachment you are not sure of, will keep you safest. Although PDF spam may not contain a virus, the best advice is to not open it and just delete it.

What About Greeting Card Spams?

A new round of electronic greeting cards containing viruses is making the rounds as well. These e-cards want you to download a file called msdataaccess.exe to view the card.

PcBerg

Free OCR software? You may already have it...