[MacPorts] #17540: poppler conflicts with xpdf
MacPorts
noreply at macports.org
Fri Oct 22 17:33:33 PDT 2010
#17540: poppler conflicts with xpdf
-----------------------------+----------------------------------------------
Reporter: gale@… | Owner: ricci@…
Type: defect | Status: reopened
Priority: Normal | Milestone:
Component: ports | Version: 1.6.0
Resolution: | Keywords: conflict
Port: poppler xpdf |
-----------------------------+----------------------------------------------
Comment(by ricci@…):
Replying to [comment:23 ricci@…]:
> Did anyone do any testing to verify that the poppler command line
> utilities perform in the same way the xpdf command line utilities do? If
> not, (and there are differences) then I agree that this is a "dangerous"
> sort of change - people who know what the xpdf package contains (not just
> the xpdf X11 pdf viewer) won't get what they are expecting.
I was able to run a test of 'pdftotext' on ~48k PDFs. Here's the
summary:
Command used: pdftotext -layout -enc ASCII7 -nopgbrk INPUT.pdf OUTPUT.txt
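A batch run of this kind could be scripted roughly as follows. This is only a minimal sketch, not the script used for the test described here: the function names, the directory layout, and the idea of passing in the path to each pdftotext build are assumptions for illustration. Only the flags come from the command quoted above.

```python
import subprocess
from pathlib import Path

# Flags taken from the command quoted in the comment.
FLAGS = ["-layout", "-enc", "ASCII7", "-nopgbrk"]


def build_command(binary, pdf, out):
    """Return the argv list for one pdftotext run (hypothetical helper)."""
    return [str(binary), *FLAGS, str(pdf), str(out)]


def convert_all(binary, pdf_dir, out_dir):
    """Run one pdftotext build over every *.pdf file in pdf_dir.

    Writes one .txt file per PDF into out_dir and returns the list of
    PDFs for which the tool exited with a non-zero status.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out = out_dir / (pdf.stem + ".txt")
        result = subprocess.run(build_command(binary, pdf, out),
                                capture_output=True)
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling convert_all once per build (for example, once with the path to xpdf's pdftotext and once with poppler's, each with its own output directory) would leave two sets of text files that can then be compared pairwise.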
Both xpdf's pdftotext and poppler's pdftotext fail on certain types of
data. For example, poppler's tends to fail on 'ff' characters (as in the
word "offer"), possibly due to ligatures. However, there were a few
instances where poppler handled an 'ff' combination and xpdf didn't (go
figure). Overall, xpdf seemed to do a better job here.
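One rough way to surface cases like these is to list the words containing "ff" that show up in one tool's output but not in the other's. The sketch below assumes both outputs have already been read into strings. The function name is made up for illustration, and this is not necessarily how the differences above were found.

```python
def ff_words_missing(reference_text, other_text):
    """Words containing 'ff' that occur in reference_text but not in other_text."""
    reference_words = set(reference_text.split())
    other_words = set(other_text.split())
    # Set difference keeps only words the other output never produced.
    return sorted(w for w in reference_words - other_words if "ff" in w)
```

Running it in both directions (xpdf output as reference, then poppler output as reference) would show which tool lost the 'ff' in each file.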
The ordering of text elements can differ, which I noticed in tabular data
(I suspect that each cell of the tables is its own text element in the
PDF). Where I could check against the original document, xpdf appeared to
do a better job of ordering, though neither was perfect. While there were
a fair number of instances where the ordering differed, I was only able to
find a few files where I could be sure I'd identified the ordering in the
original document.
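To separate pure ordering differences from cases where text was actually lost or altered, one option is to compare the two outputs as bags of words. The following is a sketch under that assumption. The function name and the three labels are invented for illustration, and whitespace is ignored because -layout output pads columns differently.

```python
from collections import Counter


def classify_difference(text_a, text_b):
    """Classify two extracted texts as 'same', 'reordered', or 'content differs'."""
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    if tokens_a == tokens_b:
        return "same"
    # Same words with the same counts, but in a different sequence.
    if Counter(tokens_a) == Counter(tokens_b):
        return "reordered"
    return "content differs"
```

A "reordered" result only says that the two tools disagree on order. Deciding which order is correct still needs a look at the original document.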
Poppler did a better job of spacing words in some of the data, with fewer
instances of run-together words.
There's also a significant difference in the text that the two versions
create when the PDF has a crop box: xpdf's pdftotext appears to pull text
from inside the crop box only, while poppler's pulls all of the text from
the document. Which of these is "better" is probably subjective. To me,
the user expects to get the text they can see when they open the PDF,
which would make the xpdf version "better".
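The two behaviours can be pictured with a toy model. This sketch does not reflect how either tool is implemented: the Fragment and Box types, the coordinates, and the function names are all hypothetical, chosen only to show the difference between "everything on the page" and "only what falls inside the crop box".

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text at a position on the page (hypothetical model)."""
    x: float
    y: float
    text: str


class Box(NamedTuple):
    """An axis-aligned rectangle: lower-left (x0, y0), upper-right (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def all_text(fragments):
    """Every fragment, whether or not it would be visible."""
    return [f.text for f in fragments]


def text_in_box(fragments, box):
    """Only the fragments whose position falls inside the box."""
    return [f.text for f in fragments
            if box.x0 <= f.x <= box.x1 and box.y0 <= f.y <= box.y1]
```

With a page holding Fragment(100, 700, "Visible heading") and Fragment(10, 10, "Printer mark") and a crop box of Box(50, 50, 550, 750), text_in_box keeps only the heading while all_text returns both.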
Note that this is only one test on a single set of data, and with only
one application. More testing would be a good thing.
Based on the above, I do think that overwriting the xpdf command line
utilities with poppler is not the right answer; we need to give people a
choice here. As far as a 'default' goes (presuming we need one), if we
don't get more test data then I'd vote for the xpdf utilities, as they
provided the text that the user would see when opening the PDF.
--
Ticket URL: <https://trac.macports.org/ticket/17540#comment:24>
MacPorts <http://www.macports.org/>
Ports system for Mac OS