[Leaplist] Scanner Setup Help?

Steve Litt slitt at troubleshooters.com
Tue Jul 14 11:47:44 EDT 2009


On Tuesday 14 July 2009 11:34:26 am Bruce Metcalf wrote:

> I need to be able to stack up a sheaf of papers in the document feeder
> and load them all in a batch, then OCR them in a batch as well (or
> simultaneously, whatever).
>
> I've been testing things like OCRAD, Clara, and Kooka without
> satisfactory results, but that could just be me. (Their instructions are
> remarkably well hidden, if extant.) So far, the fastest approach is
> retyping, which I find philosophically unacceptable.

Hi Bruce,

In a previous life, I was the main programmer on a project to convert typed 
paper timesheets to electronic form suitable for inclusion in the timekeeping 
and accounting systems. I found out more than I ever wanted to about 
scanning. This was 1988, so it's possible scanning is a quantum leap better 
now, but I doubt it.

If you're working with clean, unfolded paper, and if the typed material is a 
good monospace typewriter font, you stand a ghost of a chance of doing it 
with only a few human interventions per page. As the fonts become more iffy, 
as the paper becomes more dirty or folded or skewed, mistakes start being the 
rule rather than the exception.

Our scanner system didn't do well at all with text in boxes and the like. And 
of course, illustrations really messed it up.

I don't know the condition of your source paperwork, but the more spots, 
shadows, smudges, folds, illustrations, and margin changes, the tougher it's 
going to be. The font will make a tremendous difference. I mean think of 
it -- even as a human, the only way you differentiate between the number 1 
and a lower case l is by context. With fonts that don't put a line through 
the middle of a zero, the number 0 and the letter capital O can't be 
differentiated except by context. If you, with those infinityBytes of 
carbon-based processor, can't differentiate, how could a commodity scanner, computer 
and software? The slightest bit of smudging makes 5 and 6 look identical in 
some fonts.
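
One way to put the "context" idea to work after OCR is to flag tokens that mix letters and digits and contain one of the look-alike characters mentioned above, then check those by eye. The following is only a minimal sketch: the function name `suspicious_tokens` and the heuristic itself are assumptions made for illustration, not something any particular OCR package does. It will not catch a 5 misread as a 6, since both are digits.

```python
import re

# Look-alike characters named in the message: 1/l, 0/O, 5/6.
CONFUSABLE = set("1l0O56")

def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a look-alike character.

    Hypothetical heuristic: a word such as "4O" or "1l" is more likely an
    OCR misread than a real word or number, so it is worth a human look.
    """
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(c.isdigit() for c in token)
        has_alpha = any(c.isalpha() for c in token)
        if has_digit and has_alpha and any(c in CONFUSABLE for c in token):
            flagged.append(token)
    return flagged
```

Real data with legitimate mixed tokens (part numbers, for example) would produce false alarms, so treat the output as a list of places to look, not a list of errors.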

Be sure you set your resolution for the highest optical resolution your 
scanner can do. If your papers have previously been folded, unfold them and 
weigh them down under 50 pounds of books for a few days to get as much of the 
folds out as you can. Clean the glass on the scanner before each run, using 
eyeglass cleaner and eyeglass cloth (both available at Costco).

If it were me, I'd do about 4 runs, and use the diff command to find areas 
where they disagree, and resolve them by hand.
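
The multi-run comparison can be sketched in Python with the standard `difflib` module, as an alternative to running the diff command by hand. This is a rough sketch under assumptions: each OCR pass is assumed to already be available as a text string, the function name `disagreements` is made up for illustration, and the first run is arbitrarily used as the baseline.

```python
import difflib

def disagreements(runs):
    """Compare each OCR run against the first and collect the differing lines.

    `runs` is a list of strings, one per OCR pass over the same page.
    Returns a list of (run_index, tag, base_lines, other_lines) tuples,
    where tag is 'replace', 'delete', or 'insert'.
    """
    base = runs[0].splitlines()
    found = []
    for idx, text in enumerate(runs[1:], start=1):
        other = text.splitlines()
        matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                found.append((idx, tag, base[i1:i2], other[j1:j2]))
    return found

# Example: three passes over the same page, one of which misreads a zero.
runs = [
    "Hours: 10\nName: Bob",
    "Hours: 1O\nName: Bob",
    "Hours: 10\nName: Bob",
]
for run_index, tag, base_lines, other_lines in disagreements(runs):
    print(run_index, tag, base_lines, other_lines)
```

Anything this reports still has to be resolved by hand. An error that every run makes identically will not show up at all.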

I don't know what your budget is, but one company I worked for had two human 
typists transcribe material as fast as possible, and then ran the two 
transcriptions through something like diff to find all mistakes that weren't 
made by both.

SteveT

Steve Litt
Recession Relief Package
http://www.recession-relief.US
Twitter: http://www.twitter.com/stevelitt

