OCR with tesseract in Linux

nevj · June 17, 2025, 11:33am

I have a rather old and battered book which contains an essay by Dorothy Sayers that I rather like. The pages are yellowing, and the print quality is poor, so I decided to try and resurrect my favourite essay into a modern format.

The steps were

Scan each page of the document with xsane. The quality was poor; for example, here is the image from the fourth page:

Process that image with tesseract

tesseract image-0004.png image-004

The first argument is the image; the second is the output base name. Tesseract appends .txt, so we get image-004.txt.
The output looks like this:

THE GREAT. MYSTERY.

its'for it-is what God wants for'us:and,.as St:
Paul says, no created thing, whether of. time,
spice. or split, can sepafate us from His love.

Bhit do we truly want it? At bottom—yes,
wé do, for itis the end for-which we were made
and without which We caniot be happy ot com-
plete, “In. every soul-that shall be saved,”
said the Lady Julian of Norwich;.“ there is a
godly’will that never assented to sin, nor ever
shall,” ‘and-it is this will- which ‘has to be set
free so that it may becomeé united to: God,

* * *

That is just the first paragraph. As you can see it is ‘noisy’. It gets letters wrong, puts an accent over ‘e’, and inserts punctuation that is not there. That example is one of the worst paragraphs.

Tesseract deals with the two facing pages in the scanned image OK.

So to fix that I had to hand-edit the .txt file that tesseract produced for each page. I used vi. The cleaned-up version of the paragraph above looks like this:

it; for it is what God wants for us and, as St.
Paul says, no created thing, whether of time,
space or spirit, can separate us from His love.

But do we truly want it? At bottom — yes,
we do, for it is the end for which we were made
and without which we cannot be happy or com-
plete, “In every soul that shall be saved,”
said the Lady Julian of Norwich, “there is a
godly will that never assented to sin, nor ever
shall,” ‘and it is this will which has to be set
free so that it may become united to God,

       * * *

OK, do that for all five pages, then read all five pages into one .md file.
There are still some fixes needed:

set up the heading
get rid of the hyphens in ‘xxx-xxx’ words which span line breaks
space out those * * * breaks
put the italics into .md format
add a References section at the end
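
Some of those fixes could probably be scripted rather than done by hand. The following is only a minimal Python sketch, not what was actually done for the essay: the function names are made up for illustration, and it assumes that a hyphen at the end of a line always marks a split word.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words split across a line break, e.g. 'com-' + 'plete' -> 'complete'."""
    # Assumption: letter, hyphen, newline, lowercase letter means one split word.
    return re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)


def space_section_breaks(text: str) -> str:
    """Put exactly one blank line above and below each '* * *' break line."""
    # Match a line holding only '* * *' (possibly indented), together with
    # any newlines directly around it, and normalise the spacing.
    pattern = r"\n*^[ \t]*\* \* \*[ \t]*$\n*"
    return re.sub(pattern, "\n\n* * *\n\n", text, flags=re.MULTILINE)


def fix_page_text(text: str) -> str:
    """Apply both clean-ups to the text of one page."""
    return space_section_breaks(join_hyphenated(text))
```

A genuine compound such as ‘space-time’ that happens to break at the hyphen would be joined incorrectly, so the result still needs proofreading.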

It took a couple of days of fiddling, but the result is very satisfactory (for me). Here is a sample as seen in the remarkable .md viewer

There are OCR apps other than tesseract available in Linux, including one that is a GUI for tesseract.
Tesseract is the leading one… it supports many languages.
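
Because tesseract is a command-line tool, running it over every scanned page with an explicit language can be scripted. A minimal Python sketch follows; the helper names and the image-*.png file pattern are assumptions for illustration, and it presumes the tesseract binary and the chosen language data (here eng) are installed.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the command: tesseract <image> <output base> -l <lang>."""
    # Tesseract appends .txt itself, so pass the name without an extension.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(folder: str, pattern: str = "image-*.png", lang: str = "eng") -> None:
    """Run tesseract once for every image in the folder that matches the pattern."""
    for image in sorted(Path(folder).glob(pattern)):
        # check=True stops the loop if tesseract reports an error.
        subprocess.run(tesseract_command(image, lang), check=True)
```

Calling ocr_pages(".") in the scan directory would leave one .txt file next to each image, ready for hand editing.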

If anyone is interested in the actual Sayers essay, there is a copy of my sayers.md file here:

github.com/nevillejackson/Documents, file sayers/sayers.md (main branch)


### CHRISTIAN BELIEF ABOUT HEAVEN AND HELL ###

by
Dorothy Sayers

 If we are to understand the Christian doctrine
about what happens at death, we must first
rid our minds of every concept of time and space
as we know them. Our time and space have
no independent reality: they belong to the
universe and were created with it. Take down
any novel you like from the shelf. The story
it tells. may cover the events of a few hours or
of many years; it may range over a few acres
or the whole globe, But all that space-time is
contained within the covers of the book, and
has no contact at any point with the space-time
in which you are living: It, and the whole universe of
 action which goes on inside the book,

and the book from which it came is

I believe it may still be copyrighted. You are allowed to take one copy for private use.

Rosika · June 17, 2025, 1:13pm

Hi Neville,

congratulations on your achievements.

So the poor scan quality is to blame for the “noisy” output provided by tesseract, if I understood you correctly.

On ubuntuusers wiki I found this note:

tesseract-ocr can be “trained”; it is possible to teach completely new languages, possibly to improve existing languages (e.g. if templates are used that contain “unusual” fonts, or are not of high quality). The program Lios provides a graphical interface for this.

(underlined by me)

I was wondering if this Lios thing might be of any help.

sudo apt-get install lios

A Tesseract Trainer GUI is available within lios.

No idea if it would improve things or whether it'd be worth the effort.
Just wanted to let you know.

Many greetings from Rosika

nevj · June 17, 2025, 1:21pm

Hi Rosika,
I think so. The original book pages are damaged… yellowed, with specks of dirt.
I did not try a higher resolution scan.
I did not try to use a noise filter on the scanned images.
It was good enough to get my job done… with some hand editing.
I did not know about training tesseract… I think OCR can be trained to read handwriting… maybe that is where training is needed.

Regards
Neville

Rosika · June 17, 2025, 1:24pm

Hi Neville,

Yes, I can imagine that would give tesseract a hard time dealing with them correctly.

Right. That is the main thing.
Well done, Neville

Cheers from Rosika

Sheila_Flanagan · June 17, 2025, 4:07pm

Thank you. I downloaded it from your github docs.

Sheila

pdecker · June 17, 2025, 4:55pm

I may fiddle with this at some point too.

callpaul.eu · June 17, 2025, 7:29pm

I believe you can do just that as long as it's not for sale or profit. But I'm not sure about a whole book; it used to be a set percentage of the book, but that may vary between countries.

Looks like 10%, but I would not be sure if that is of the whole book or just the text pages.

nevj · June 17, 2025, 11:19pm

Thanks, that would cover it.
There is no copyright notice in the book, but I would like to play it safe.
Also, what I have made public is not a copy… it is a modified digital representation. I modified the presentation, not the words. There is an acknowledgement of the source.

nevj · June 18, 2025, 12:36am

I feel one could improve it with some image processing before using tesseract,
e.g. filtering out any speckle.
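
One common way to remove speckle is a median filter, which replaces each pixel with the median of its neighbourhood so that isolated dark dots vanish. The sketch below is a plain-Python illustration of that idea on a grayscale image held as rows of 0-255 values; the function name is hypothetical, and it is an assumption that this particular filter suits these scans.

```python
from statistics import median


def median_filter(pixels: list[list[int]]) -> list[list[int]]:
    """Apply a 3x3 median filter to a grayscale image (rows of 0-255 values)."""
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the pixel and its neighbours, clipped at the image border.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


# A white 5x5 patch with one black speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter(page)
```

The same filter can also erase real detail such as full stops or thin strokes when the scan resolution is low, so it is worth checking the filtered image by eye before running OCR on it.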

nevj · June 27, 2025, 11:14am

I tried a denoise filter in GIMP before using tesseract.
It improved the txt output, but it is still not perfect.

raw image → tesseract

THE GREAT. MYSTERY.

its'for it-is what God wants for'us:and,.as St:
Paul says, no created thing, whether of. time,
spice. or split, can sepafate us from His love.

Bhit do we truly want it? At bottom—yes,
wé do, for itis the end for-which we were made
and without which We caniot be happy ot com-

raw image → noise filtered image → tesseract

THE GREAT MYSTERY.

its'for it is what God wants for us and, as St.
Paul says, no created thing, whether of. time,
space or spirit, can separate us from His love.

But do we truly want it? At bottom—yes,
wwe do, for itis the end for which we were made
and without which we cannot be happy ot com-

I am sure it could be better with more fiddling.
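
To judge whether more fiddling actually helps, the OCR output could be scored against a hand-corrected line. This is a rough sketch using Python's standard difflib module; the similarity function is a made-up helper, and the ratio it returns is only a loose indicator, not a proper character error rate.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and a corrected reference."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    return SequenceMatcher(None, ocr_text, reference, autojunk=False).ratio()


# One line from the examples above: hand-corrected, raw OCR, and filtered OCR.
reference = "space or spirit, can separate us from His love."
raw = "spice. or split, can sepafate us from His love."
filtered = "space or spirit, can separate us from His love."

print(f"raw:      {similarity(raw, reference):.3f}")
print(f"filtered: {similarity(filtered, reference):.3f}")
```

Running the same comparison over whole pages would show whether a change of filter or scan resolution moves the score in the right direction.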