[MacPorts] #17540: poppler conflicts with xpdf

MacPorts noreply at macports.org
Fri Oct 22 17:33:33 PDT 2010


#17540: poppler conflicts with xpdf
-----------------------------+----------------------------------------------
  Reporter:  gale@…          |       Owner:  ricci@…           
      Type:  defect          |      Status:  reopened          
  Priority:  Normal          |   Milestone:                    
 Component:  ports           |     Version:  1.6.0             
Resolution:                  |    Keywords:  conflict          
      Port:  poppler xpdf    |  
-----------------------------+----------------------------------------------

Comment(by ricci@…):

 Replying to [comment:23 ricci@…]:
 > Did anyone do any testing to verify that the poppler command line
 > utilities perform in the same way the xpdf command line utilities do?  If
 > not, (and there are differences) then I agree that this is a "dangerous"
 > sort of change - people who know what the xpdf package contains (not just
 > the xpdf X11 pdf viewer) won't get what they are expecting.

   I was able to run a test of 'pdftotext' on ~48k PDFs; here's the
 summary:

 Command used:

   pdftotext -layout -enc ASCII7 -nopgbrk INPUT.pdf OUTPUT.txt
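
 To repeat this over a corpus with both builds, the invocation can be
 scripted.  What follows is only a minimal sketch in Python: the two binary
 paths are hypothetical placeholders (not locations MacPorts actually
 uses), and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical install locations; adjust to wherever each build actually lives.
XPDF_BIN = "/opt/xpdf/bin/pdftotext"
POPPLER_BIN = "/opt/poppler/bin/pdftotext"

# Flags taken from the command used in the test above.
FLAGS = ["-layout", "-enc", "ASCII7", "-nopgbrk"]


def build_command(binary, pdf, out):
    """Return the argv list for one pdftotext run."""
    return [binary, *FLAGS, str(pdf), str(out)]


def convert(binary, pdf, out_dir):
    """Run one build on one PDF.

    Returns the output path, or None if the tool exits with a nonzero status.
    """
    out = Path(out_dir) / (Path(pdf).stem + ".txt")
    result = subprocess.run(build_command(binary, pdf, out))
    return out if result.returncode == 0 else None
```

 Calling convert() once per build with separate output directories gives
 two text files per PDF that can then be compared.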

 Both xpdf's pdftotext and poppler's pdftotext fail on certain types of
 data.  For example, poppler's tends to fail on 'ff' character pairs (as in
 the word "offer"), possibly due to ligatures.  However, there were a few
 instances where poppler handled an 'ff' combination and xpdf didn't (go
 figure).  Overall, xpdf seemed to do a better job here.
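
 If ligatures are the cause (that is only a guess), the underlying issue
 would be that a PDF can carry "ff" as a single ligature character, which
 has no 7-bit ASCII equivalent unless something expands it.  The Python
 sketch below shows only that general point; it says nothing about what
 either pdftotext does internally, and the helper names are illustrative.

```python
import unicodedata

LIGATURE_FF = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF


def is_ascii7(text):
    """True if every character fits in 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)


def expand_ligatures(text):
    """Apply NFKC normalization, which decomposes compatibility ligatures."""
    return unicodedata.normalize("NFKC", text)


word = "o" + LIGATURE_FF + "er"  # "offer" stored with a single ff character
expanded = expand_ligatures(word)

print(len(word), is_ascii7(word))   # 4 False
print(expanded, is_ascii7(expanded))  # offer True
```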

 The ordering of text elements can differ between the two; I noticed this
 mostly in tabular data (I suspect that each cell of a table is its own
 text element in the PDF).  Where I could find the data, xpdf appeared to
 do a better job of ordering, though neither was perfect.  While there were
 a fair number of instances where the ordering differed, I was only able to
 find a few files where I could be sure I'd identified the ordering in the
 original document.
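
 One way to separate "same words, different order" from real content
 differences when comparing two extractions is to compare word counts as
 well as word sequences.  This is a rough sketch in Python, not the method
 used for the test above; classify_difference is a hypothetical helper.

```python
from collections import Counter


def classify_difference(text_a, text_b):
    """Classify two extractions as 'identical', 'reordered', or 'different'.

    Whitespace is ignored.  'reordered' means both texts contain the same
    words the same number of times, but not in the same sequence.
    """
    words_a, words_b = text_a.split(), text_b.split()
    if words_a == words_b:
        return "identical"
    if Counter(words_a) == Counter(words_b):
        return "reordered"
    return "different"


# Two readings of the same small table, cell order swapped.
print(classify_difference("Name Qty Apple 3", "Name Apple Qty 3"))  # reordered
```

 Run-together words would show up as "different" here, since "fairnumber"
 and "fair number" do not split into the same words.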

 Poppler did a better job of spacing words in some of the data, with fewer
 instances of run-together words.

 There's also a significant difference in the text that the two versions
 produce when the PDF has a crop box - xpdf's pdftotext appears to pull text
 from inside the crop box only, while poppler's pulls all of the text from
 the document.  Which of these is "better" is probably subjective, but to me
 the user expects to get the text they can see when they open the PDF,
 which would make the xpdf version "better".
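
 To make the difference concrete, here is a toy model in Python of the two
 behaviours as I observed them.  It is not how either tool is implemented;
 the Box and TextElement types, the coordinates, and the sample text are
 all invented for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


class TextElement(NamedTuple):
    text: str
    x: float  # position of the element's origin, in page units
    y: float


def inside(box, element):
    """True if the element's origin lies within the box (edges inclusive)."""
    return box.x0 <= element.x <= box.x1 and box.y0 <= element.y <= box.y1


def visible_text(elements, crop_box):
    """Keep only elements inside the crop box (what xpdf appeared to do)."""
    return [e.text for e in elements if inside(crop_box, e)]


def all_text(elements):
    """Keep every element regardless of the crop box (what poppler appeared to do)."""
    return [e.text for e in elements]


crop = Box(0, 0, 100, 100)
page = [TextElement("Title", 10, 90), TextElement("printer marks", 150, 10)]

print(visible_text(page, crop))  # ['Title']
print(all_text(page))            # ['Title', 'printer marks']
```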


   Note that this is only one test on a single set of data, and with only
 one application.  More testing would be a good thing.

   Based on the above, I do think that overwriting the xpdf command line
 utilities with poppler's is not the right answer; we need to give people a
 choice here.  As far as a 'default' goes (presuming we need one), if we
 don't get more test data then I'd vote for the xpdf utilities, as they
 provided the text that the user would see when opening the PDF.

-- 
Ticket URL: <https://trac.macports.org/ticket/17540#comment:24>
MacPorts <http://www.macports.org/>
Ports system for Mac OS