[MacPorts] #47574: port request: 'tabula' and 'tabula-extractor'
MacPorts
noreply at macports.org
Sun Apr 26 05:43:14 PDT 2015
#47574: port request: 'tabula' and 'tabula-extractor'
-----------------------------------------------+---------------------------
  Reporter:  kurt.pfeifle@…                     |      Owner:  macports-tickets@…
      Type:  request                            |     Status:  new
  Priority:  Normal                             |  Milestone:
 Component:  ports                              |    Version:
  Keywords:  PDF, table, csv, tsv, spreadsheet  |       Port:
-----------------------------------------------+---------------------------
The self-description of the Tabula project is quite telling and appropriate:
> ''"Tabula is a tool for liberating data tables trapped inside PDF
files."''
Here is the link to the sources:
* https://github.com/tabulapdf/
----
Extracting tables from PDF pages into a usable spreadsheet format is
extremely difficult.
Here is some background information:
* http://stackoverflow.com/a/26110587/359307
Given the scope of this task, Tabula works extremely well.
The Tabula family of tools is written in Ruby. In the background they make use
of PDFBox (which is written in Java) and a few other third-party libraries.
Running the command line tool `tabula`, hosted in the Tabula-Extractor
repository, requires JRuby 1.7 to be installed. My JRuby is the MacPorts version.
I have successfully run `tabula` directly from the cloned Git
repository:
{{{
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
}}}
The Git clone already includes the required libraries, so there is no
need to install PDFBox separately.
The command line tool is in the `bin/` subdirectory of the clone.
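Since `tabula` depends on JRuby, it may help to confirm that a `jruby` executable is on the PATH before going further. The following is only a minimal sketch: `have_cmd` is a hypothetical helper name, not part of Tabula or MacPorts.

```shell
#!/bin/sh
# have_cmd <name>: succeeds if <name> resolves to a command on the PATH.
# (Hypothetical helper name, not part of Tabula or MacPorts.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd jruby; then
    # Print the installed version; this ticket was written against JRuby 1.7.
    jruby --version
else
    echo "jruby not found on PATH; install it first" >&2
fi
```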
Exploring the command line options:
{{{
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
       tabula [options] <pdf_file>
where [options] are:
  --pages, -p <s>:          Comma separated list of ranges, or all. Examples:
                            --pages 1-3,5-7, --pages 3 or --pages all. Default
                            is --pages 1 (default: 1)
  --area, -a <s>:           Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire page
  --columns, -c <s>:        X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
  --password, -s <s>:       Password to decrypt document. Default is empty
                            (default: )
  --guess, -g:              Guess the portion of the page to analyze per page.
  --debug, -d:              Print detected table areas instead of processing.
  --format, -f <s>:         Output format (CSV,TSV,HTML,JSON) (default: CSV)
  --outfile, -o <s>:        Write output to <file> instead of STDOUT
                            (default: -)
  --spreadsheet, -r:        Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
  --no-spreadsheet, -n:     Force PDF not to be extracted using
                            spreadsheet-style extraction (if there are ruling
                            lines separating each cell, as in a PDF of an Excel
                            spreadsheet)
  --silent, -i:             Suppress all stderr output.
  --use-line-returns, -u:   Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
  --version, -v:            Print version and exit
  --help, -h:               Show this message
}}}
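As an illustration of how the options above combine, here is a minimal sketch that extracts the tables from every page of a PDF into a CSV file. It uses only flags listed in the help text. The helper name `extract_tables` and the file names `report.pdf` and `tables.csv` are placeholders of mine, and the default path assumes the clone location used earlier in this ticket.

```shell
#!/bin/sh
# Path to the tabula script inside the Git clone made earlier in this ticket.
# Set TABULA in the environment if the clone lives elsewhere.
: "${TABULA:=$HOME/svn-stuff/git.tabula-extractor/bin/tabula}"

# extract_tables <pdf> <csv>
# Hypothetical helper: extracts tables from every page of <pdf> and writes
# them as CSV to <csv>, using only flags shown in the help text.
extract_tables() {
    "$TABULA" --pages all --format CSV --outfile "$2" "$1"
}

# Example call; report.pdf and tables.csv are placeholder names.
# extract_tables report.pdf tables.csv
```

For a PDF with ruling lines around each cell, adding `--spreadsheet` to the command may give better results, according to the help text.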
--
Ticket URL: <https://trac.macports.org/ticket/47574>
MacPorts <https://www.macports.org/>
Ports system for OS X