Recode utility - CLI

So - I had an un-DRM'd copy of an ebook I own…

But it was annoying - the author, an English language author, writing in English, chose to put accents on vowels and consonants in character names - I find this nearly as annoying as science fiction / fantasy authors using apostrophes in character or place names to make them sound more “alien”…

Not just umlauts either - things like a circumflex (caret) above vowels… lots of it - too much… I suppose I’m disrespecting the author - but I dunno… But I wouldn’t continue turning pages if I had to suffer that annoyance. AFAIK names are often just made-up gobbledy-gook phonetic syllables smashed together to sound foreign or alien… I don’t like it… Names are actually words and have meanings…

I guess I could even - if bothered enough - use sed to replace every gobbledy-gook name with Frederick or Bartholomew - but that would be going a tad too far I reckon :smiley:

So - thought - how do I get rid of these accents?

DRM-free epub files are just zip files containing HTML.

HTML is just text.

So - unzip them into a folder - then use the “recode” utility (sudo apt install recode) :

recode -f utf8..flat < input.html > output.html
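
Before looping over a whole book, a quick sanity check on one file can be handy. The sketch below counts lines that still contain non-ASCII bytes, so running it on input.html and then output.html shows whether anything slipped through. `count_non_ascii_lines` is a made-up helper name, and it assumes a grep that supports POSIX character classes.

```shell
# Sketch: count the lines in a file that contain any byte outside printable
# ASCII and whitespace. LC_ALL=C makes grep work byte by byte, so every byte
# of a multi-byte UTF-8 character counts as non-ASCII.
# count_non_ascii_lines is a hypothetical helper name.
count_non_ascii_lines() {
    # grep -c prints the count; it exits 1 when the count is 0, which is
    # not an error here, so only other exit codes are passed on as failure.
    LC_ALL=C grep -c '[^[:print:][:space:]]' "$1" || [ $? -eq 1 ]
}
```

Something like `count_non_ascii_lines input.html` followed by `count_non_ascii_lines output.html` should show the number dropping after the recode.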

So - create a loop

for F in *.html ; do recode -f utf8..flat < "$F" > "$F.recoded" ; done
rm *.html
rename 's/\.recoded$//' *.recoded
zip -r ../BookTitle.epub *

That “rename” above is perl rename, sometimes called “prename” to avoid confusion with the rename utility that RedHat based distros default to.
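
A rough sketch of the same steps without either flavour of rename, for anyone who would rather not depend on it. The names `filter_html_tree`, `pack_epub` and `ACCENT_FILTER` are made up for illustration. It assumes the usual epub layout: chapters may sit in subfolders and end in .xhtml, and there is a `mimetype` file at the top level, which the EPUB container format expects to be the first entry in the zip and stored uncompressed.

```shell
# ACCENT_FILTER is a hypothetical variable name; it defaults to the recode
# command used above and must read stdin and write stdout.
ACCENT_FILTER=${ACCENT_FILTER:-recode -f utf8..flat}

# Run the filter over every .html/.xhtml file under a directory, replacing
# each file via a temporary file. (File names containing newlines are not
# handled.)
filter_html_tree() {
    dir=$1
    find "$dir" -type f \( -name '*.html' -o -name '*.xhtml' \) |
    while IFS= read -r f; do
        # Only overwrite the original if the filter succeeded.
        if $ACCENT_FILTER < "$f" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            echo "filter failed on $f, left unchanged" >&2
            rm -f "$f.tmp"
        fi
    done
}

# Repack an unzipped epub directory. $2 must be an absolute path because
# the function changes into the source directory first.
pack_epub() {
    src=$1
    out=$2
    (
        cd "$src" || exit 1
        rm -f "$out"
        # mimetype goes in first, uncompressed (-0), without extra fields (-X).
        zip -q -X0 "$out" mimetype || exit 1
        # Everything else is added recursively, compressed, skipping mimetype.
        for entry in *; do
            [ "$entry" = mimetype ] && continue
            zip -q -Xr9D "$out" "$entry" || exit 1
        done
    )
}
```

Usage would be something like `filter_html_tree ./book && pack_epub ./book "$PWD/BookTitle.epub"`, where ./book is the folder the epub was unzipped into.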

Love the shell! :heart:

Note : if it was a historical novel and was using the correct spelling with accents, of e.g. French people, or Germans - it wouldn’t bother me. Heck - even Tolkien did it sometimes and that annoyed me - but - he was a linguist or language scholar…

P.S. I learned about “recode” here : https://unix.stackexchange.com/questions/631652/remove-accents-from-characters

5 Likes

Great idea and good work!
I’m German and we have a lot of special characters (umlaute) and weird grammar rules and don’t forget the: ß :sweat_smile:

But like you, I don’t like strange character names. I’m not a Klingon :stuck_out_tongue:

2 Likes

I like my books with alien or weird sounding names. Then again, I’m Dutch and we have lots of words with these symbols, so I don’t mind. :slight_smile:

3 Likes

I get utf8, but what does flat mean?

1 Like

The correct term for accents is Diacritics … there is no accent used in its own name

2 Likes

Good question - I’ve absolutely no idea… It just works…

Tried googling it: zip. There’s no offering in any man pages either…

3 Likes

The end of that summary from Wikipedia : “Some diacritics, such as the acute ⟨ó⟩, grave ⟨ò⟩, and circumflex ⟨ô⟩ (all shown above an ‘o’), are often called accents…” :smiley:

I did three years of French in high school and the teachers always called them accents - never mentioned diacritics…

I didn’t do German, or any other language that uses the umlaut - so - maybe in German language classes they called them diacritics? :smiley:

Just to update - that “utf8..flat” also removed apostrophes from words like “who’d” (which is generally frowned upon in a narrative - i.e. in a narrative it should be “who had” and e.g. “it’s” should be “it is” - but generally okay to use in dialog).
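
On the vanished apostrophes: a likely cause, assuming the book uses the typographic apostrophe ’ (U+2019) rather than the plain ASCII one, is that it is not an ASCII character, so it probably gets dropped along with the accents. A possible workaround is to turn curly quotes into straight ones before recoding. This is only a sketch and `straighten_quotes` is a made-up name.

```shell
# Sketch: replace typographic (curly) quotes with their plain ASCII
# equivalents, reading stdin and writing stdout.
# straighten_quotes is a hypothetical helper name.
straighten_quotes() {
    sed -e "s/’/'/g" -e "s/‘/'/g" \
        -e 's/“/"/g' -e 's/”/"/g'
}
```

It would slot in front of the recode step, e.g. `straighten_quotes < input.html | recode -f utf8..flat > output.html`.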

3 Likes

No, German only has umlaut and they called it umlaut… no need for a general term when there is only one.

I think diacritics is a scholarly or a typesetting term. Normal people call them accents.

2 Likes

Yes it is, going back to the days of upper and lower case and things like that. You could write books on the subject, it’s so detailed.

Typography used to be taught as a subject at uni to design students, but I’m no longer sure if it’s covered in any detail now. It’s a very long time back.

In French it’s called police as originally the typeface was guarded by a policeman between the making and its use.

3 Likes