[Leaplist] Scanner Setup Help?
Bruce Metcalf
bruce.metcalf at figzu.com
Tue Jul 14 12:33:58 EDT 2009
Steve Litt wrote:
>> I need to be able to stack up a sheaf of papers in the document
>> feeder and load them all in a batch, then OCR them in a batch as
>> well (or simultaneously, whatever).
>>
>> I've been testing things like OCRAD, Clara, and Kooka without
>> satisfactory results.... So far, the fastest approach is retyping,
>> which I find philosophically unacceptable.
>
> If you're working with clean, unfolded paper, and if the typed
> material is a good monospace typewriter font, you stand a ghost of a
> chance of doing it with only a few human interventions per page. As
> the fonts become more iffy, as the paper becomes more dirty or folded
> or skewed, mistakes start being the rule rather than the exception.
The source is a 50-year run of magazines. Early issues were monospace
typewriter, but reproduced on mimeograph, and some later photocopied.
Ugly. As we approach current issues, the fonts went proportional, then
justified, and multi-column formats with illustrations became steadily
more common. Also ugly for OCR.
> Our scanner system didn't do well at all with text in boxes and the
> like. And of course, illustrations really messed it up.
Yes, we have both.
> I don't know the condition of your source paperwork, but the more
> spots, shadows, smudges, folds, illustrations, and margin changes,
> the tougher it's going to be.
Some is butt-ugly, some is like new. I'd be content with a system that
worked for even 10% of the range, provided that 10% ran faster than
retyping.
> Be sure you set your resolution for the highest optical resolution
> your scanner can do. ... Clean the glass on the scanner before each
> run, using eyeglass cleaner and eyeglass cloth....
>
> If it were me, I'd do about 4 runs, and use the diff command to find
> areas where they disagree, and resolve them by hand.
Now there's a clever approach. Not sure it'll prove faster than one run
with a manual edit, since I'll want one of those anyway, but worth a try.
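
For the record, here is a minimal Python sketch of that multi-run idea. It
assumes each OCR run has already been saved as plain text and that the runs
break lines in the same places, which real OCR output may not do. The function
name flag_disagreements and the sample strings are made up for illustration.
It flags every line where the runs differ and offers the most common reading
as a first guess for the hand edit.

```python
from collections import Counter
from itertools import zip_longest


def flag_disagreements(runs):
    """Return (line_number, best_guess, variants) for lines where runs differ.

    Assumes the runs are line-aligned; a run that is short of lines is
    padded with empty strings so the mismatch still shows up.
    """
    split_runs = [text.splitlines() for text in runs]
    flagged = []
    for lineno, variants in enumerate(
        zip_longest(*split_runs, fillvalue=""), start=1
    ):
        counts = Counter(variants)
        if len(counts) > 1:  # at least two runs read this line differently
            best_guess, _ = counts.most_common(1)[0]
            flagged.append((lineno, best_guess, sorted(counts)))
    return flagged


# Hypothetical output of four OCR passes over the same page.
runs = [
    "The quick brown fox\njumps over the lazy dog",
    "The quick brown fox\njumps over the 1azy dog",
    "The quick brown fox\njumps over the lazy dog",
    "The quick brown fox\njumps ovcr the lazy dog",
]

for lineno, guess, variants in flag_disagreements(runs):
    print(f"line {lineno}: best guess {guess!r}, variants {variants}")
```

Only the flagged lines need a human look. Errors that every run makes the
same way will slip through, so this does not replace the final manual edit.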
> I don't know what your budget is, but one company I worked for had
> two human typists transcribe material as fast as possible, and then
> ran the two transcriptions through something like diff to find all
> mistakes that weren't made by both.
The budget is pretty close to zero. I can afford to mail copies to
typists anywhere, but no salary. Fortunately, typists seem to be available.
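
If volunteer typists do end up doing the work, the two-transcription check
could be sketched roughly as below. This is an illustration only, using
Python's standard difflib module rather than the diff command itself. The
function name transcription_conflicts and the sample sentences are
hypothetical. It compares word by word, so differences in line wrapping
between typists do not count as conflicts.

```python
import difflib


def transcription_conflicts(first, second):
    """Return (first_words, second_words) pairs where two transcriptions differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    conflicts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, delete, or insert
            conflicts.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return conflicts


# Made-up sample: the same sentence as keyed by two different typists.
typist_a = "The club met on the first Tuesday of each month"
typist_b = "The club met on teh first Tuesday of every month"

for left, right in transcription_conflicts(typist_a, typist_b):
    print(f"A: {left!r}   B: {right!r}")
```

Each reported pair still has to be settled against the original page. A
mistake both typists make identically stays invisible to this check.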
Thanks for your thoughts.
Bruce