[Leaplist] Scanner Setup Help?
Steve Litt
slitt at troubleshooters.com
Tue Jul 14 11:47:44 EDT 2009
On Tuesday 14 July 2009 11:34:26 am Bruce Metcalf wrote:
> I need to be able to stack up a sheaf of papers in the document feeder
> and load them all in a batch, then OCR them in a batch as well (or
> simultaneously, whatever).
>
> I've been testing things like OCRAD, Clara, and Kooka without
> satisfactory results, but that could just be me. (Their instructions are
> remarkably well hidden, if extant.) So far, the fastest approach is
> retyping, which I find philosophically unacceptable.
Hi Bruce,
In a previous life, I was the main programmer on a project to convert typed
paper timesheets to electronic form suitable for inclusion in the timekeeping
and accounting systems. I found out more than I ever wanted to about
scanning. This was 1988, so it's possible scanning is a quantum leap better
now, but I doubt it.
If you're working with clean, unfolded paper, and if the typed material is a
good monospace typewriter font, you stand a ghost of a chance of doing it
with only a few human interventions per page. As the fonts become more iffy,
as the paper becomes more dirty or folded or skewed, mistakes start being the
rule rather than the exception.
Our scanner system didn't do well at all with text in boxes and the like. And
of course, illustrations really messed it up.
I don't know the condition of your source paperwork, but the more spots,
shadows, smudges, folds, illustrations, and margin changes, the tougher it's
going to be. The font will make a tremendous difference. I mean think of
it -- even as a human, the only way you differentiate between the number 1
and a lower case l is by context. With fonts that don't put a line through
the middle of a zero, the number 0 and the capital letter O can't be
differentiated except by context. If you, with those infinityBytes of
carbon-based processor, can't differentiate, how could a commodity scanner,
computer and software? The slightest bit of smudging makes 5 and 6 look
identical in some fonts.
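One crude way to put "context" to work after the OCR pass is to flag any
token that mixes digits with letters the OCR commonly confuses with digits, so
a human looks at just those spots. This is only a sketch: the
CONFUSABLE_LETTERS set and the suspicious_tokens name are my own assumptions,
not part of any OCR package, and the right set depends on your font.

```python
import re

# Assumption: letters that are easily mistaken for digits in many fonts
# (l/I for 1, O/o for 0, S for 5, B for 8). Adjust for your own documents.
CONFUSABLE_LETTERS = set("lIOoSB")


def suspicious_tokens(text):
    """Return whitespace-separated tokens that contain both a digit and a
    confusable letter, in the order they appear in the text."""
    flagged = []
    for token in re.findall(r"\S+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_confusable = any(ch in CONFUSABLE_LETTERS for ch in token)
        if has_digit and has_confusable:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    for token in suspicious_tokens("Hours: 4O on l0 July, total 40"):
        print("check by hand:", token)
```

It won't catch a 5 read as a 6, since both are digits, so it complements
proofreading rather than replacing it.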
Be sure to set your resolution to the highest optical resolution your
scanner supports. If your papers have previously been folded, unfold them and
weigh them down under 50 pounds of books for a few days to get as much of the
folds out as you can. Clean the glass on the scanner before each run, using
eyeglass cleaner and eyeglass cloth (both available at Costco).
If it were me, I'd do about 4 runs, and use the diff command to find areas
where they disagree, and resolve them by hand.
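If you'd rather script the comparison than eyeball several diff outputs,
something like the following could work. It's a minimal sketch that assumes
each run's OCR output is already in a string and that the runs break lines in
the same places; find_disagreements is a name I made up for illustration.

```python
from collections import Counter
from itertools import zip_longest


def find_disagreements(runs):
    """Compare several OCR runs of the same page line by line.

    runs is a list of strings, one per scan. Returns a list of
    (line_number, variants) for each line where the runs differ, where
    variants is a list of (text, count) pairs, most common first.
    """
    split_runs = [text.splitlines() for text in runs]
    flagged = []
    # zip_longest pads shorter runs with "" so a missing line shows up
    # as a disagreement instead of being silently dropped.
    for number, lines in enumerate(zip_longest(*split_runs, fillvalue=""), start=1):
        counts = Counter(line.strip() for line in lines)
        if len(counts) > 1:
            flagged.append((number, counts.most_common()))
    return flagged


if __name__ == "__main__":
    runs = [
        "Hours: 10\nTotal: 40",
        "Hours: 10\nTotal: 4O",
        "Hours: 10\nTotal: 40",
        "Hours: l0\nTotal: 40",
    ]
    for number, variants in find_disagreements(runs):
        print("line", number, variants)
```

If one run drops or splits a line, everything after it will be flagged, and
that's where plain diff does better because it realigns. Also remember that a
mistake made identically in all the runs won't show up at all.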
I don't know what your budget is, but one company I worked for had two human
typists transcribe material as fast as possible, and then ran the two
transcriptions through something like diff to find all mistakes that weren't
made by both.
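The two-typist comparison is easy to approximate with Python's standard
difflib module if you don't have diff handy. This is a sketch, not what that
company actually ran; transcription_diff and the typist_a/typist_b labels are
hypothetical.

```python
import difflib


def transcription_diff(first, second):
    """Return unified-diff lines between two transcriptions of the same
    document. An empty list means the two typists agree everywhere."""
    a = first.splitlines()
    b = second.splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="typist_a", tofile="typist_b", lineterm="")
    )


if __name__ == "__main__":
    first = "Mon 8.0\nTue 7.5\nWed 8.0"
    second = "Mon 8.0\nTue 7.S\nWed 8.0"
    for line in transcription_diff(first, second):
        print(line)
```

Every line it reports is a place where at least one typist is wrong, so
somebody still has to go back to the paper to decide which.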
SteveT
Steve Litt
Recession Relief Package
http://www.recession-relief.US
Twitter: http://www.twitter.com/stevelitt