OCR with tesseract in Linux

I have a rather old and battered book which contains an essay by Dorothy Sayers that I rather like. The pages are yellowing, and the print quality is poor, so I decided to try and resurrect my favourite essay into a modern format.

The steps were

  1. Scan each page of the document with xsane. The quality was poor; for example, here is the image from the fourth page

  1. Process that image with tesseract
tesseract image-0004.png image-004

The first argument is the image and the second argument is the base name of the output file; tesseract appends .txt, so we get image-004.txt.
The output looks like this

THE GREAT. MYSTERY.

its'for it-is what God wants for'us:and,.as St:
Paul says, no created thing, whether of. time,
spice. or split, can sepafate us from His love.

Bhit do we truly want it? At bottom—yes,
wé do, for itis the end for-which we were made
and without which We caniot be happy ot com-
plete, “In. every soul-that shall be saved,”
said the Lady Julian of Norwich;.“ there is a
godly’will that never assented to sin, nor ever
shall,” ‘and-it is this will- which ‘has to be set
free so that it may becomeé united to: God,

* * *

That is just the first paragraph. As you can see it is ‘noisy’. It gets letters wrong, puts an accent over ‘e’, and inserts punctuation that is not there. That example is one of the worst paragraphs.

Tesseract deals with the two facing pages in the scanned image OK.
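
With several pages to process, the tesseract step can be repeated in a loop rather than typed once per page. This is only a minimal sketch, assuming the scans sit in one directory and are named image-0001.png, image-0002.png and so on; the function name ocr_pages is made up for illustration. Unlike the single command shown earlier, it keeps the full base name, so image-0004.png yields image-0004.txt.

```shell
#!/bin/sh
# Hypothetical helper: run tesseract on every scanned page in a directory.
# Assumes files named image-*.png; adjust the pattern to your scanner output.
ocr_pages() {
    dir=${1:-.}
    for img in "$dir"/image-*.png; do
        [ -e "$img" ] || continue   # the pattern matched nothing
        base=${img%.png}            # tesseract appends .txt to this base name
        tesseract "$img" "$base"
    done
}

# Example: ocr_pages ~/scans
```

Each page still ends up in its own .txt file, so the hand editing can be done one page at a time.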

  1. To fix that, I had to hand-edit the .txt file that tesseract produced for each page. I used vi. The cleaned-up version of the paragraph above looks like this
it; for it is what God wants for us and, as St.
Paul says, no created thing, whether of time,
space or spirit, can separate us from His love.

But do we truly want it? At bottom — yes,
we do, for it is the end for which we were made
and without which we cannot be happy or com-
plete, “In every soul that shall be saved,”
said the Lady Julian of Norwich, “there is a
godly will that never assented to sin, nor ever
shall,” ‘and it is this will which has to be set
free so that it may become united to God,

       * * *

OK, do that for all five pages, then

  1. Read all five pages into one .md file.
    There are still some fixes needed:
  • set up the heading
  • get rid of the hyphens in ‘xxx-xxx’ words that span line breaks
  • space out those * * * breaks
  • put the italics into .md format
  • add a References section at the end
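
The hyphen clean-up in the list above can be partly automated. The following is a rough sketch, not what was actually done for this essay (that was hand editing): a small sed wrapper, with the made-up name dehyphenate, that joins a line ending in "-" to the line after it. It cannot tell a line-break hyphen from a genuine one, so a compound such as "well-" / "known" would become "wellknown"; the result still needs a read-through. The output file name essay.md in the usage comment is also just a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: join words hyphenated across line breaks.
# Reads text on stdin and writes the result to stdout.
dehyphenate() {
    # If a line ends in "-", append the next line (N), delete the "-" and
    # the newline between the two, then loop in case the joined line
    # also ends in "-".
    sed -e ':a' -e '/-$/{' -e 'N' -e 's/-\n//' -e 'ba' -e '}'
}

# Example: cat image-*.txt | dehyphenate > essay.md
```

A hyphen followed by trailing spaces is left alone, so stripping trailing whitespace first may be needed.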

It took a couple of days of fiddling, but the result is very satisfactory (for me). Here is a sample as seen in the Remarkable .md viewer

There are OCR apps other than tesseract available in Linux, including one that is a GUI for tesseract.
Tesseract is the leading one; it supports many languages.

If anyone is interested in the actual Sayers essay, there is a copy of my sayers.md file here

and the book from which it came is

I believe it may still be copyrighted. You are allowed to take one copy for private use.


Hi Neville, :waving_hand:

congratulations on your achievements. :clap:

So the poor scan quality is to blame for the “noisy” output from tesseract, if I understood you correctly.

On ubuntuusers wiki I found this note:

tesseract-ocr can be “trained”; it is possible to teach completely new languages, possibly to improve existing languages (e.g. if templates are used that contain “unusual” fonts, or are not of high quality). The program Lios provides a graphical interface for this.

(underlined by me)

I was wondering if this Lios thing might be of any help.

sudo apt-get install lios

A Tesseract Trainer GUI is available within lios.

No idea if it would improve things or whether it'd be worth the effort.
Just wanted to let you know.

Many greetings from Rosika :slightly_smiling_face:


Hi Rosika,
I think so. The original book pages are damaged: yellowed, with specks of dirt.
I did not try a higher-resolution scan.
I did not try a noise filter on the scanned images.
It was good enough to get my job done, with some hand editing.
I did not know about training tesseract. I think OCR can be trained to read handwriting; maybe that is where training is needed.

Regards
Neville


Hi Neville, :waving_hand:

Yes, I can imagine that would give tesseract a hard time dealing with them correctly.

Right. That is the main thing.
Well done, Neville :+1:

Cheers from Rosika :slightly_smiling_face:


Thank you. I downloaded it from your github docs.

Sheila


I may fiddle with this at some point too.


I believe you can do just that as long as it's not for sale or profit. But I am not sure about a whole book; it used to be a set percentage of the book, though that may vary between countries.

Looks like 10%, but I am not sure whether that is of the whole book or just the text pages.

Thanks, that would cover it.
There is no copyright notice in the book, but I would like to play it safe.
Also, what I have made public is not a copy; it is a modified digital representation. I modified the presentation, not the words. There is an acknowledgement of the source.


I feel one could improve it with some image processing before using tesseract, e.g. filtering out any speckle.


I tried a denoise filter in GIMP before using tesseract.
It improved the .txt output, but it is still not perfect.

raw image → tesseract

THE GREAT. MYSTERY.

its'for it-is what God wants for'us:and,.as St:
Paul says, no created thing, whether of. time,
spice. or split, can sepafate us from His love.

Bhit do we truly want it? At bottom—yes,
wé do, for itis the end for-which we were made
and without which We caniot be happy ot com-

raw image → noise filtered image → tesseract

THE GREAT MYSTERY.

its'for it is what God wants for us and, as St.
Paul says, no created thing, whether of. time,
space or spirit, can separate us from His love.

But do we truly want it? At bottom—yes,
wwe do, for itis the end for which we were made
and without which we cannot be happy ot com-

I am sure it could be better with more fiddling.
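
For anyone who wants to script the filter-then-OCR step instead of using GIMP by hand, here is a rough sketch. It swaps in ImageMagick's despeckle filter, which is not the GIMP filter used for the comparison above, so the results may well differ. The function name ocr_denoised and the "-clean" suffix are made up for illustration, and on ImageMagick 7 the command may be magick rather than convert.

```shell
#!/bin/sh
# Hypothetical helper: despeckle a scan with ImageMagick, then OCR it.
# Usage: ocr_denoised image-0004.png   (writes image-0004-clean.txt)
ocr_denoised() {
    img=$1
    base=${img%.png}
    clean="$base-clean.png"
    # Convert to greyscale and apply ImageMagick's speckle-reduction filter.
    convert "$img" -colorspace Gray -despeckle "$clean" || return 1
    # tesseract appends .txt to the output base name.
    tesseract "$clean" "$base-clean"
}
```

Keeping the filtered image as a separate file makes it easy to compare the raw and filtered OCR output side by side, as in the comparison above.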
