How do I produce a multi-page sandwich pdf with hocr2pdf?

Question

I used tesseract to produce the hOCR HTML file that hocr2pdf takes as input, starting from a multi-page TIFF.

I then tried using hocr2pdf to produce a "sandwich PDF" (image + hidden text layer).

However, hocr2pdf produces a one-page PDF with all the pages superimposed.

Is there a way to solve this problem or an alternative solution?

score 3 · Accepted Answer · answered Mar 27 '13 at 21:04

I found a workaround for this issue. hocr2pdf has trouble producing multi-page PDFs, so I split the input into single-page TIFFs, ran tesseract and then hocr2pdf on each page, and combined the resulting single-page PDFs with pdftk using the following script:

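# Expects one single-page TIFF per page in the current directory.
for f in ./*.tif; do
   # OCR the page (French here); this writes the hOCR to "$f.html".
   # Newer tesseract releases may write "$f.hocr" instead; adjust the names below if so.
   tesseract "$f" "$f" -l fra hocr
   hocr2pdf -i "$f" -s -o "$f.pdf" < "$f.html"
done
# Merge the per-page PDFs in glob (lexical) order, then delete the intermediates.
pdftk ./*.tif.pdf cat output "output.pdf" && rm ./*.tif.pdf && rm ./*.tif.html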
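
The script assumes the multi-page TIFF has already been split into single-page files, a step the answer does not show. A minimal sketch of that step, assuming libtiff's tiffsplit tool is installed; the helper name split_tiff, the file name scan.tif and the prefix page_ are placeholders, not part of the original answer:

```shell
#!/bin/sh
# Split a multi-page TIFF into single-page TIFFs.
# Assumption: libtiff's `tiffsplit` is installed.
# split_tiff is a hypothetical helper name used only for this sketch.
split_tiff() {
    input=$1
    prefix=${2:-page_}
    if ! command -v tiffsplit >/dev/null 2>&1; then
        echo "tiffsplit not found (it ships with the libtiff tools)" >&2
        return 1
    fi
    # tiffsplit writes <prefix>aaa.tif, <prefix>aab.tif, ... so a shell
    # glob such as ./*.tif lists the pages in their original order.
    tiffsplit "$input" "$prefix"
}

# Usage (scan.tif is a placeholder file name):
#   split_tiff scan.tif page_
```

It is probably safest to run the OCR loop in a directory that holds only the split pages, since ./*.tif would otherwise also match the original multi-page file.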

answered by To Do

Thanks, this seems to be the closest way, in Ubuntu, to convert an image-based pdf into one that has some text capabilities. Unfortunately the small set of default fonts and relatively rough positioning makes the output text line up rather poorly with the image. I wish there was a more complete solution; this is almost it! – waldyrious Jun 11 '13 at 19:11
I'm actually using PDF X-Change viewer through Wine. The result is much better. – To Do Jun 18 '13 at 15:01
Just tried the trial version. Indeed it places the text pretty much exactly, apparently by placing each word individually rather than blocks of text.
The only thing missing is the ability to tweak the OCR for the inevitable errors... is there any way to open the already OCR'd pdfs in OCRfeeder? I found its interface quite easy to use but I can't seem to get it to detect the existing text layer... I also tried PDFedit but its interface is way too advanced for mortals like me!
– waldyrious Jun 19 '13 at 05:11
@ToDo: I like your reply. I actually have a PDF file and an HTML file in hOCR format. Can I merge the hOCR file into the PDF file, to make the PDF searchable, without converting the PDF to single-page image files? See http://unix.stackexchange.com/questions/170133/merge-and-export-ocred-text-into-and-from-a-pdf-file – Tim Nov 27 '14 at 13:59
Why not pdfsandwich with "-enforcehocr2pdf" parameter? – Pablo Bianchi Mar 08 '17 at 15:18
