86

I'm using pdftotext (part of poppler-utils) to convert PDF documents to text. It works, for the most part, but one thing I wish it did was to insert blank lines between separate paragraphs instead of mashing them together.

Is there a way to get pdftotext to do this? And if not, is there another PDF-to-text utility that can do this?

dan
  • 3,133
  • 8
    In the title you say "pdftotext" (which is part of poppler-utils) and in the body you say "pdt2text" (which I don't know). Which one are you referring to? – enzotib Jul 06 '11 at 17:53
  • 1
    Similar question: PDF to audio software for academic papers? https://softwarerecs.stackexchange.com/questions/10640/pdf-to-audio-software-for-academic-papers – JinSnow Sep 16 '18 at 07:47

6 Answers

139

If you are using pdftotext you can use the -layout flag to preserve the layout of the text on the pages in your input pdf file:

pdftotext -layout input.pdf output.txt
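
To apply the same flag to a whole directory, something like the following should work. It is only a sketch: it assumes pdftotext is on your PATH, and pdf_dir_to_text is a made-up helper name, not something poppler-utils ships.

```shell
# Convert every PDF in a directory to text, preserving the page layout.
# pdf_dir_to_text is a hypothetical helper name (an assumption for this sketch).
pdf_dir_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory contains no PDFs.
        [ -e "$pdf" ] || continue
        # Write foo.txt next to foo.pdf.
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Call it as pdf_dir_to_text followed by the directory, or with no argument to use the current directory.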
Noah
  • 1,491
  • 7
    There is also -table for table layouts specifically, works great. – P.Windridge Jun 04 '15 at 11:35
  • Hi, what are the units of the floating-point numbers in <page id="1" bbox="0.000,0.000,432.000,648.000" rotate="0">? That is, what is the unit of bbox? – Vivek Sable Mar 10 '16 at 10:07
  • I've frequently had problems converting PDF files containing tables. There's something weird about some PDF creation software, where the "reading order" of text jumps around the cells in the table, so the only way I've been able to extract data in any sensible order is to copy and paste, then fix it up in a text editor. This takes hours. The -layout option seems to fix this, thanks. – Huw Walters Jul 13 '16 at 15:28
  • 3
    @P.Windridge, where is this table option? I can't find it on version 0.48.0 from poppler-utils in Ubuntu 17.04 – gozzilli May 15 '17 at 13:11
  • 2
    @gozzilli That's way old. The latest pdftotext is v4.00, available in the Xpdf tools tarball here. – Adrian Apr 10 '18 at 02:46
  • 2
    @gozzilli Versions starting with 0. indicate that it's Poppler's branch of Xpdf's original code. They started their version numbers over when they branched the code. Both groups now appear to maintain separate versions of these PDF tools. – Andrew Jun 11 '18 at 00:18
  • 1
    @VivekSable those are points (pixels) at the specified -r (resolution, default 72 dpi) – vstepaniuk Jul 14 '19 at 10:17
  • Note this solution does not improve pdftotext for anyone using it on scanned PDFs, on which you'll still get a blank output. For that you'll have better luck with an OCR-based conversion tool. – Dennis Aug 04 '23 at 11:04
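
On the bbox question in the comments: if the values are points at 72 per inch, as the reply above suggests, the page size follows from simple division. A small sketch of that arithmetic; pt_to_in_mm is a hypothetical helper name, and the 72-per-inch scale is an assumption taken from that reply.

```shell
# Convert a length in points to inches and millimetres.
# Assumes 72 points per inch; pt_to_in_mm is a hypothetical helper name.
pt_to_in_mm() {
    awk -v pt="$1" 'BEGIN { printf "%.2f in, %.1f mm\n", pt / 72, pt / 72 * 25.4 }'
}

pt_to_in_mm 432   # prints: 6.00 in, 152.4 mm
pt_to_in_mm 648   # prints: 9.00 in, 228.6 mm
```

So under that assumption the bbox in the comment describes a 6 in by 9 in page.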
28

You could try ebook-convert from Calibre.

If anything, I'd say it errs in the other direction: too many line breaks.

Another thing I'd definitely consider, though, is converting to HTML using pdfreflow and then converting the HTML to TXT.

frabjous
  • 6,611
  • 1
    Note: ebook-convert cannot convert multi-column layout, it merges the columns into one column. For multi-column layout pdftotext produces much better output. Further limitations are described at https://manual.calibre-ebook.com/conversion.html#convert-pdf-documents . – asmaier May 09 '19 at 12:03
18

As a fan of open source (and automation) I hate to say this, but the best results I just got (on quite a large, complex PDF) were to open it in Adobe Reader, then choose File|Save As Text.

(I am pre-processing for text analysis experiments, not as a reader, but I think my first and second choice would be the same.)

I've been comparing the output side-by-side. My second choice is ebook-convert.

Adobe: left in FF for page breaks, left in page numbers, hasn't converted headings/paragraphs to single lines, but it has fixed hyphens. Junk that was hidden in the PDF did not get output. Correctly got the big capitals at start of sections, e.g. "The", not "T he" or even "T he".

ebook-convert: Left in page numbers, and some hidden junk in header/footer (but no FFs). Converts most paragraphs to be single lines. The ones it missed are double-spaced though! Bullets don't always line up with the text. Correctly got "The" at the start of the chapter.

pdftotext (without -layout): Not bad, bullets line up, but header/footer noise. FFs are in there. Hyphens removed. Worst for start of chapter big letters: "T\n\nhe".

pdftotext (with -layout): Similar, but more indents. "T he" for start of chapter.

pdftohtml >> pdfreflow >> htmltotext: It removed page numbers, but still junk in header/footer. "T he" for start of chapter. Hyphens removed. (It uses multiple lines per paragraph, yet they are not the same line breaks as in the other versions!)
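
If the form feeds (FF) mentioned above get in the way, they are easy to drop. A minimal sketch, assuming the text arrives on stdin; strip_form_feeds is a hypothetical helper name. pdftotext also has a -nopgbrk flag that leaves the page-break characters out in the first place.

```shell
# Delete the form feed (FF, \f) characters that mark page breaks.
# Reads stdin, writes stdout. strip_form_feeds is a hypothetical helper name.
strip_form_feeds() {
    tr -d '\f'
}

# Example (assumes pdftotext is installed; "-" sends the text to stdout):
#   pdftotext input.pdf - | strip_form_feeds > output.txt
# Alternatively, ask pdftotext not to emit page breaks at all:
#   pdftotext -nopgbrk input.pdf output.txt
```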

JinSnow
  • 105
  • 1
    Acrobat reader 9 on linux generated squashed words in my case. ebook-convert worked fine. – ov7a Apr 22 '18 at 10:31
  • We really need an AI app for that; it seems perfect for this kind of task. Does anyone know of one? – JinSnow Sep 16 '18 at 08:14
  • 1
    Adobe Reader is free, but only for reading PDFs. For other things you need to pay (monthly subscription). (PDF to text is limited to a few pages.) pdftotext (or xpdf on Windows) is perfect for my needs. – JinSnow Sep 16 '18 at 08:58
  • For tabular data, it's now best to use the -table switch "pdftotext -table file_name.pdf output_name.txt" – Thom Ives Sep 03 '19 at 15:54
  • 2
    @ThomIves I don't see a -table option for the pdftotext command on Ubuntu (version 20.09.0 by The Poppler Developers). However, the -layout option was useful. – Flimm May 24 '21 at 16:00
7

If you have a Google account, you can use Google Drive to upload the PDF and transform it into editable text via 'Open with > Google Docs'.

Dennis
  • 413
xangua
  • 7,297
1

I also tried pypdf and compared it against pdftotext on two documents. It produced more line breaks and split some section names (REFERENCES became R E F E R E N C E S).

pdf2txt output complete garbage.

I often use PDFBox (Java) if pdftotext screws up the output. You might give it a try.

Max
  • 191
0

ebook-convert vs pdftotext concrete minimal example

ebook-convert was previously mentioned by frabjous, and I would like to illustrate it with a minimal example.

The problem with pdftotext from poppler-utils 22.12.0 is that it adds newlines within paragraphs when the paragraph is longer than the PDF page width, e.g. something like:

1:1 In the beginning God created the heaven and
the earth.
1:2 And the earth was without form, and void; and
darkness was upon the face of the deep. And the
Spirit of God moved upon the face of the waters.
1:3 And God said, Let there be light: and there
was light.
1:4 And God saw the light, that it was good: and
God divided the light from the darkness.

These extra newlines make the txt files really bad to read on a device like a Kindle.

ebook-convert however overcomes this very well, and produces something like:

1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light from the darkness.

which maintains paragraphs in single lines, regardless of how long the paragraph is, and adds a double newline between paragraphs, and behaves much better on a Kindle.
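
If you are stuck with pdftotext output like the first sample, the lines can be re-joined afterwards. A rough sketch for this particular text only: it assumes every paragraph begins with a chapter:verse marker such as 1:2, which will not hold for general documents. unwrap_verses is a hypothetical helper name.

```shell
# Re-join hard-wrapped lines into one line per paragraph, with a blank line
# between paragraphs. Reads stdin, writes stdout.
# Assumption: each paragraph starts with a "chapter:verse" marker such as "1:2".
# unwrap_verses is a hypothetical helper name.
unwrap_verses() {
    awk '
        /^[0-9]+:[0-9]+ / {          # marker line: a new paragraph starts here
            if (para != "") print para "\n"
            para = $0
            next
        }
        /^[ \t]*$/ { next }          # ignore blank lines in the input
        { para = (para == "" ? $0 : para " " $0) }   # continuation line
        END { if (para != "") print para }
    '
}

# Example (assumes pdftotext is installed; "-" sends the text to stdout):
#   pdftotext input.pdf - | unwrap_verses > output.txt
```

A general-purpose version would need a different paragraph test, since most documents have no such marker.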

I'm going to test methods mentioned in other answers with this test PDF generated from this LibreOffice .odt file:

pdftotext output:

Title of my file
Table of Contents
H1 1......................................................................................................................................................1
H2 1 1...............................................................................................................................................1
H2 1 2...............................................................................................................................................1
H1 2......................................................................................................................................................1
H2 2 1...............................................................................................................................................1
H2 2 2...............................................................................................................................................1

H1 1 H2 1 1 H2 1 2 First very important paragraph. And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines. Reference to H1 1 on page: 1 https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2 H2 2 1 H2 2 2

ebook-convert output:

Title of my file

Table of Contents

H1 1......................................................................................................................................................1

H2 1 1...............................................................................................................................................1

H2 1 2...............................................................................................................................................1

H1 2......................................................................................................................................................1

H2 2 1...............................................................................................................................................1

H2 2 2...............................................................................................................................................1

H1 1

H2 1 1

H2 1 2

First very important paragraph.

And now a very very very very very very very very very very very very very very very very very very very very very very very very long paragraph that gets split across two lines.

Reference to H1 1 on page: 1

https://commons.wikimedia.org/wiki/File:Fractal_Broccoli.jpg

H1 2

H2 2 1

H2 2 2

Document Outline

H1 1 H2 1 1

H2 1 2

H1 2 H2 2 1

H2 2 2

The line break aspect was also asked more specifically at: https://unix.stackexchange.com/questions/691579/how-to-convert-pdf-file-to-text-without-breaking-lines

Tested on Ubuntu 23.04, poppler-utils 22.12.0, calibre 6.11.0.