[MacPorts] #47574: port request: 'tabula' and 'tabula-extractor'

MacPorts noreply at macports.org
Sun Apr 26 05:43:14 PDT 2015


#47574: port request: 'tabula' and 'tabula-extractor'
-----------------------------------------------+---------------------------
 Reporter:  kurt.pfeifle@…                     |      Owner:  macports-
     Type:  request                            |  tickets@…
 Priority:  Normal                             |     Status:  new
Component:  ports                              |  Milestone:
 Keywords:  PDF, table, csv, tsv, spreadsheet  |    Version:
                                               |       Port:
-----------------------------------------------+---------------------------
 The self-description of the Tabula project is quite telling and appropriate:

 > ''"Tabula is a tool for liberating data tables trapped inside PDF
 files."''

 Here is the link to the sources:

 * https://github.com/tabulapdf/

 ----

 Extracting tables from PDF pages into a usable spreadsheet format is
 extremely difficult.
 Here is some background information:

 * http://stackoverflow.com/a/26110587/359307

 Given the scope of this task, Tabula works extremely well.

 The Tabula family of tools is written in Ruby. In the background they use
 PDFBox (which is written in Java) and a few other third-party libraries.
 Running the command-line tool `tabula`, hosted in the Tabula-Extractor
 repository, requires JRuby 1.7. My JRuby is the MacPorts version.

 I have successfully run `tabula` directly from the cloned Git
 repository:
 {{{
     mkdir ~/svn-stuff
     cd ~/svn-stuff
     git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
 }}}
 The Git clone already includes the required libraries, so there is no
 need to install PDFBox separately.
 The command-line tool is in the `bin/` subdirectory of the clone.

 Exploring the command line options:
 {{{
     ~/svn-stuff/git.tabula-extractor/bin/tabula -h

     Tabula helps you extract tables from PDFs

     Usage:
            tabula [options] <pdf_file>
     where [options] are:
              --pages, -p <s>:   Comma separated list of ranges, or all. Examples:
                                 --pages 1-3,5-7, --pages 3 or --pages all. Default
                                 is --pages 1 (default: 1)
               --area, -a <s>:   Portion of the page to analyze
                                 (top,left,bottom,right). Example: --area
                                 269.875,12.75,790.5,561. Default is entire page
            --columns, -c <s>:   X coordinates of column boundaries. Example
                                 --columns 10.1,20.2,30.3
           --password, -s <s>:   Password to decrypt document. Default is empty
                                 (default: )
                  --guess, -g:   Guess the portion of the page to analyze per page.
                  --debug, -d:   Print detected table areas instead of processing.
             --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
            --outfile, -o <s>:   Write output to <file> instead of STDOUT (default:
                                 -)
            --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                                 extraction (if there are ruling lines separating
                                 each cell, as in a PDF of an Excel spreadsheet)
         --no-spreadsheet, -n:   Force PDF not to be extracted using
                                 spreadsheet-style extraction (if there are ruling
                                 lines separating each cell, as in a PDF of an Excel
                                 spreadsheet)
                 --silent, -i:   Suppress all stderr output.
       --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                                 spreadsheet mode.)
                --version, -v:   Print version and exit
                   --help, -h:   Show this message
 }}}
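
 Based on the options listed above, a typical invocation might look like the
 following sketch. It uses only flags shown in the help output
 (`--pages`, `--format`, `--outfile`). The helper name `extract_tables` and
 the file names are hypothetical, and the default path assumes the clone
 location used earlier in this ticket.

```shell
#!/bin/sh
# Path to the tabula script inside the clone (assumed location from the
# clone step above); override by setting TABULA in the environment.
TABULA="${TABULA:-$HOME/svn-stuff/git.tabula-extractor/bin/tabula}"

# extract_tables is a hypothetical convenience wrapper, not part of Tabula.
# Usage: extract_tables <pdf_file> <out_file> [pages]
# [pages] takes the same syntax as --pages (e.g. 1-3,5-7 or all);
# this wrapper defaults to "all" when it is omitted.
extract_tables() {
    pdf=$1
    out=$2
    pages=${3:-all}
    "$TABULA" --pages "$pages" --format CSV --outfile "$out" "$pdf"
}
```

 For example, `extract_tables report.pdf report.csv 1-3` would ask `tabula`
 to write the tables found on pages 1 to 3 of `report.pdf` as CSV into
 `report.csv`.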

-- 
Ticket URL: <https://trac.macports.org/ticket/47574>
MacPorts <https://www.macports.org/>
Ports system for OS X

