I have a large PDF file (83 MB) that I want to translate. I tried splitting the file with pdftk and translating each part with https://www.onlinedoctranslator.com/en/ (which does not accept files larger than 10 MB), but that is very labor-intensive. My preferred translator would be the Google engine. If there is no easy way to automate this task, a tip on how to split a PDF into parts of even size (in MB) would also help.

terdon
Przemek
  • 83 MB is large? :-P Have a look at translate-shell. We use it to translate from the command line, and it uses Google's services. I would assume you could then just use a loop and sed to replace the contents in the PDF. – Rinzwind Mar 07 '19 at 09:14
  • Links: https://www.2daygeek.com/translate-shell-a-tool-to-use-google-translate-from-command-line-in-linux/# Project page: https://github.com/soimort/translate-shell – Rinzwind Mar 07 '19 at 09:15
  • Addendum to @Rinzwind's comment: translate-shell is available through the multiverse repository in xenial, bionic, cosmic, and disco. Link: https://packages.ubuntu.com/search?keywords=translate-shell&searchon=names&suite=all&section=all – Videonauth Mar 07 '19 at 09:25
  • @dessert Well, a good answer should contain an example of how to translate a PDF. – Przemek Mar 07 '19 at 21:57

2 Answers

Have a look at translate-shell

This installs a command called trans, which you can ask to translate text. The following examples are from the project page.

Translate Shell (formerly Google Translate CLI) is a command-line translator powered by Google Translate (default), Bing Translator, Yandex.Translate, and Apertium. It gives you easy access to one of these translation engines in your terminal:

$ trans 'Saluton, Mondo!'
Saluton, Mondo!

Hello, World!

Translations of Saluton, Mondo!
[ Esperanto -> English ]
Saluton ,
    Hello,
Mondo !
    World!

By default, translations with detailed explanations are shown. You can also translate the text briefly: (only the most relevant translation will be shown)

$ trans -brief 'Saluton, Mondo!'
Hello, World!

You can also specify the target language:

trans :fr word

And there is even more:

trans -browser firefox :fr http://www.w3.org/

will open firefox with a French translation of www.w3.org.

There is no way to translate a PDF directly, but you can translate a text file:

trans :fr file://input.txt

Now in relation to a PDF:

sudo apt install poppler-utils

That package provides pdftotext, which extracts the text of the PDF into a text file:

pdftotext your.pdf your.txt
  • add -layout to preserve layout
  • add -opw {password} if there is a password

Then you can feed the text file to trans:

trans file://your.txt
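
For a file as large as the one in the question, it may be safer to send the text in pieces rather than all at once. The following is only a sketch of that idea: translate_in_chunks, TRANS_CMD and CHUNK_DELAY are names made up for this example, the chunk size of 200 lines is arbitrary, and it assumes trans reads the text to translate from standard input when no text is given on the command line.

```shell
#!/bin/sh
# Sketch: translate a large text file piece by piece.
# translate_in_chunks is a hypothetical helper, not part of translate-shell.
# TRANS_CMD (assumed name): command that reads text on stdin and prints the
#   translation on stdout; defaults to "trans -b :en".
# CHUNK_DELAY (assumed name): seconds to wait between requests, default 0.

translate_in_chunks() {
    local input="$1" output="$2" lines_per_chunk="${3:-200}"
    local tmpdir chunk

    tmpdir=$(mktemp -d) || return 1

    # split names the pieces chunk.aaaa, chunk.aaab, ... so that a glob
    # returns them in the original order.
    split -a 4 -l "$lines_per_chunk" "$input" "$tmpdir/chunk." || return 1

    : > "$output"
    for chunk in "$tmpdir"/chunk.*; do
        [ -e "$chunk" ] || continue    # nothing to do for an empty input
        # TRANS_CMD is left unquoted on purpose so it splits into words.
        if ! ${TRANS_CMD:-trans -b :en} < "$chunk" >> "$output"; then
            rm -rf "$tmpdir"
            return 1
        fi
        sleep "${CHUNK_DELAY:-0}"
    done

    rm -rf "$tmpdir"
}

# Example call (commented out so that sourcing this file has no side effects):
# translate_in_chunks your.txt translated.txt 200
```

Because the chunks are cut on line boundaries, a sentence that spans two chunks is translated in two halves, so a larger chunk size usually reads better.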

Next step: back to PDF

sudo apt-get install enscript ghostscript

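trans prints the translation to the terminal, so save it to a file first, then convert that file to PostScript and then to PDF:

trans -b file://your.txt > translated.txt
enscript -p output.ps translated.txt
ps2pdf output.ps your2.pdf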

I got this working on a PDF with some words in it. No guarantee it works on a large file so please comment below if this worked.
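
The steps of this answer can also be chained into one function. This is only a sketch under a few assumptions: require_tools and translate_pdf are names invented here, the default target language en is arbitrary, and it has the same limits as the manual steps (images and layout are lost, and enscript may not render non-Latin text).

```shell
#!/bin/sh
# Sketch: PDF -> text -> translated text -> PostScript -> PDF.
# require_tools and translate_pdf are hypothetical helper names.

# Succeed only if every named command is available on PATH.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# translate_pdf INPUT.pdf OUTPUT.pdf [TARGET_LANGUAGE]
translate_pdf() {
    local pdf="$1" out="$2" lang="${3:-en}"
    local work status

    require_tools pdftotext trans enscript ps2pdf || return 1
    work=$(mktemp -d) || return 1

    # Each step runs only if the previous one succeeded.
    pdftotext -layout "$pdf" "$work/source.txt" &&
        trans -b ":$lang" "file://$work/source.txt" > "$work/translated.txt" &&
        enscript -p "$work/translated.ps" "$work/translated.txt" &&
        ps2pdf "$work/translated.ps" "$out"
    status=$?

    rm -rf "$work"
    return "$status"
}

# Example call (commented out so that sourcing this file has no side effects):
# translate_pdf your.pdf your2.pdf fr
```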

Rinzwind
  • Well, for large files it prints an unformatted version. Also you forgot to mention encodings: cat <your.txt> | iconv -f latin1 -t iso-8859-1 | enscript -X 88591 -o - | ps2pdf - output.pdf – z3nth10n Aug 28 '22 at 22:01

While the other answer works if your pdf is only text, you can preserve images and text location by modifying the process ever so slightly. We will still use the same trans tool, but instead of pdftotext which gets rid of a ton of information, we will use pdftohtml from the same package.

I ran pdftohtml -c -s sample.pdf after looking at the man page: -c generates 'complex' output and -s writes a single HTML file that includes all pages, and both looked like good flags to use. It fills your current working directory with all the background images, so it is best to run it from an empty directory to keep yourself organized.

The fun part is next: I use GNU sed, and actually make use of the GNU extensions (the execute flag on the substitute command in this case), so if you're on BSD or macOS or using some other sed, install gsed first. If you want to just copy and run on a bash command line, here's the full command:

sed -ri.bak '/<p/{h;s:<[^>]*>::g;s/'\''/'\''\'\'\''/g;s/.*/trans -b '\''&'\''/e;G;s:([^\n]*)\n([^/]*>)[^<]*(.*):\2\1\3:}' sample-html.html

The -r turns extended regex on, and -i.bak changes the file in place but creates a backup in case something goes wrong. The command itself goes as follows:

/<p/{                                 # run the following only on lines which are in a <p> tag 
h                                     # save the line in the hold space
s/<[^>]*>//g                          # delete all html tags
s/'\''/'\''\'\'\''/g                  # escape any single quotes
s/.*/trans -b '\''&'\''/e             # put `trans -b '` at the start and `'` at the end
                                      #     and run the line in the default shell
G                                     # append the line from the hold space after a newline
s:([^\n]*)\n([^/]*>)[^<]*(.*):\2\1\3: # splits into (translated line)\n(opening tags)original text(closing tags)
}                                     # and replaces it with (opening tags)(translated line)(closing tags)
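
The quote-escaping step is the part most likely to go wrong, so it can help to try it in isolation. Here is a minimal sketch of the same idea outside the one-liner: shell_quote is a name made up for this example, and it wraps a string in single quotes after turning every ' into '\'' so that a shell reads it back as one literal argument.

```shell
#!/bin/sh
# Sketch: make an arbitrary string safe to paste into a shell command line.
# shell_quote is a hypothetical helper name.

shell_quote() {
    local escaped
    # Inside double quotes, \\\\ reaches sed as \\, which sed writes as one
    # backslash, so each ' in the input becomes '\'' in the output.
    escaped=$(printf '%s' "$1" | sed "s/'/'\\\\''/g")
    printf "'%s'" "$escaped"
}

# Example (commented out so that sourcing this file has no side effects):
# eval "trans -b $(shell_quote "it's a test")"
```

Without this step, a line containing an apostrophe would end the quoted argument early and the rest of the line would be interpreted by the shell.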

Since it makes a separate request for each line, it can take a long time, so it helps to see what's happening. If you add echo '\''translating &'\'' >/dev/tty; just before the trans command, you can see that it is actually doing something rather than just sitting there.

This command makes several assumptions that may not hold for your file.
Some cases I didn't bother with:

  • what if a <p> tag contains a newline? (only the first line gets translated, the rest are untouched)
  • what if some tag doesn't go across the full <p>? (only innermost text gets replaced, eg <p>some <b>strong</b> text</p> becomes <p>some <b>some strong text</b> text</p>)
  • what if there's a self closing tag? (same as before: <p>multiline<br />text</p> becomes <p>multilinetext<br />text</p>)
  • what if text is in something other than a <p>? (nothing happens)
  • html entities are ignored (maybe add a s/&#160;/ /g;?)

Now that you have translated everything, you just need to convert it back to a PDF. What I found easiest was opening sample-html.html in a browser and printing to PDF with Ctrl+P.

This is what the result looks like for the PDF I was translating:
[image: page of PDF with Japanese text in manga-style panels]

Using Google's translate:
[image: page of PDF with English text in manga-style panels]

Using Firefox's translate:
[image: page of PDF with English text in manga-style panels]

It's obviously still not great (for example, Linuxのコマンド -> Linux commands is split onto separate lines, Linux co and mands), but it's still more helpful than the text file the previous answer produced.

guest4308