
PHP PDFBox: Extract text from PDF documents using PDFBox tool

Last Updated: 2015-06-23 (3 years ago)
Ratings: Not yet rated by the users
Unique User Downloads: Total: 379, This week: 1
Download Rankings: All time: 6,549, This week: 363
Version: pdfbox 1.0.1
License: BSD License
PHP version: 5.3
Categories: PHP 5, Files and Folders, Text proces...
Description

This package can extract text from PDF documents using the PDFBox tool.

It can read a PDF document from a file or an open stream, and it calls the PDFBox Java tool to extract the text from the PDF document.

The extracted text can be returned as plain text, HTML, or a DOM object. The output can also be saved to a given file.

Author

Name: Fabian Schmengler
Classes: 6 packages
Country: Germany
Innovation award nominee: 4x

Details

PdfBox

A PHP interface for the PdfBox ExtractText utility, useful for unit-testing the contents of generated PDFs.

Requirements

  • Java Runtime Environment
  • PdfBox JAR file, available from http://pdfbox.apache.org/downloads.html (tested with 1.6.0, 1.7.0 and 1.8.6)
  • PHP must be permitted to execute shell commands
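
The shell-execution requirement can be checked up front. The following is a minimal sketch, not part of the package: the helper name isFunctionDisabled() is hypothetical, and because this page does not say which PHP function the package uses to start Java, the list of functions checked is an assumption.

```php
<?php
/**
 * Hypothetical helper: returns true if $functionName appears in a
 * comma-separated disable_functions value such as "exec, shell_exec".
 */
function isFunctionDisabled($functionName, $disableFunctionsIni)
{
    $disabled = array_map('trim', explode(',', strtolower($disableFunctionsIni)));
    return in_array(strtolower($functionName), $disabled, true);
}

// Report whether some common shell-execution functions are usable.
// Which of these the package actually relies on is an assumption.
$iniValue = (string) ini_get('disable_functions');
foreach (array('exec', 'shell_exec', 'proc_open') as $name) {
    $usable = function_exists($name) && !isFunctionDisabled($name, $iniValue);
    echo $name, ': ', $usable ? 'available' : 'not available', PHP_EOL;
}
```

If every function is reported as not available, the hosting configuration probably needs to change before the converter can run.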

Install

To install with composer:

composer require sgh/pdfbox

Basic Usage

// The class is defined in src/SGH/PdfBox/PdfBox.php.
use SGH\PdfBox\PdfBox;

//$pdf = GENERATED_PDF;
$converter = new PdfBox;
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf); // plain text
$html = $converter->htmlFromPdfStream($pdf); // HTML
$dom  = $converter->domFromPdfStream($pdf);  // DOM

If the source PDF is a file, use xxxFromPdfFile() instead of xxxFromPdfStream(), passing the source path as the parameter.

To save the converted output to a file, pass the destination path as the second parameter of the xxxFromPdfxxx() methods.
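
As an illustration of the file-based variants, here is a small sketch. It assumes the method names textFromPdfFile() and htmlFromPdfFile(), which follow from the xxxFromPdfFile() pattern described above. The wrapper function convertPdfFile() and the paths in the usage note are hypothetical, and $converter is assumed to be a configured PdfBox instance.

```php
<?php
/**
 * Hypothetical wrapper, for illustration only.
 * $converter is assumed to be a PdfBox instance with its JAR path already set.
 */
function convertPdfFile($converter, $sourcePath, $htmlTargetPath)
{
    // One argument: convert the file and return the result.
    $text = $converter->textFromPdfFile($sourcePath);

    // Two arguments: also write the converted output to $htmlTargetPath.
    $converter->htmlFromPdfFile($sourcePath, $htmlTargetPath);

    return $text;
}
```

A call might look like convertPdfFile($converter, '/tmp/invoice.pdf', '/tmp/invoice.html'), where both paths are placeholders.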

Advanced Usage

Convert a range of pages instead of the full document:

$converter->getOptions()
    ->setStartPage(2)
    ->setEndPage(5);

Ignore corrupt objects in the PDF:

$converter->getOptions()
    ->setForce(true);

Sort text:

$converter->getOptions()
    ->setSort(true);
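
For orientation, the options above correspond to flags that the PDFBox 1.x ExtractText command-line tool documents: -startPage, -endPage, -force and -sort. The sketch below shows how such a command line could be assembled. It is not the package's own code: the function buildExtractTextCommand() and its option keys are hypothetical, and how the package builds its command internally is not shown on this page.

```php
<?php
/**
 * Hypothetical sketch: build an ExtractText command line for PDFBox 1.x.
 * Option keys (startPage, endPage, force, sort) are illustrative names.
 */
function buildExtractTextCommand($jarPath, $inputPdf, array $options = array())
{
    // -console makes ExtractText print the text instead of writing a file.
    $parts = array('java', '-jar', escapeshellarg($jarPath), 'ExtractText', '-console');

    if (isset($options['startPage'])) {
        $parts[] = '-startPage';
        $parts[] = (int) $options['startPage'];
    }
    if (isset($options['endPage'])) {
        $parts[] = '-endPage';
        $parts[] = (int) $options['endPage'];
    }
    if (!empty($options['force'])) {
        $parts[] = '-force'; // ignore corrupt objects
    }
    if (!empty($options['sort'])) {
        $parts[] = '-sort';  // sort the text before output
    }

    // Quote the input path so spaces and shell metacharacters are safe.
    $parts[] = escapeshellarg($inputPdf);

    return implode(' ', $parts);
}

echo buildExtractTextCommand(
    '/usr/bin/pdfbox-app-1.7.0.jar',
    'report.pdf',
    array('startPage' => 2, 'endPage' => 5, 'sort' => true)
), PHP_EOL;
```

Quoting every path with escapeshellarg() matters here because the command is handed to a shell.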

PHPUnit tests

To run the unit tests, set the environment variable PDFBOX_JAR to the full path of your PdfBox JAR file. See phpunit.xml.dist.
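
A test bootstrap could read that variable roughly as follows. This is a sketch under assumptions: the function name pdfBoxJarFromEnvironment() is hypothetical, and the package's own bootstrap.php may read the setting differently.

```php
<?php
/**
 * Hypothetical helper: return the JAR path from PDFBOX_JAR,
 * or fail early with a clear message if it is missing.
 */
function pdfBoxJarFromEnvironment()
{
    $jar = getenv('PDFBOX_JAR');
    if ($jar === false || $jar === '') {
        throw new RuntimeException('PDFBOX_JAR is not set; point it at your PdfBox JAR file.');
    }
    return $jar;
}
```

Failing early keeps a missing setting from surfacing later as an obscure Java or shell error.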

Files

File                                Role    Description
composer.json                       Data    Auxiliary data
LICENSE.txt                         Lic.    Documentation
phpunit.xml.dist                    Data    Auxiliary data
README.md                           Doc.    Auxiliary data
src/SGH/PdfBox/Command.php          Class   Class source
src/SGH/PdfBox/Options.php          Class   Class source
src/SGH/PdfBox/PdfBox.php           Class   Class source
src/SGH/PdfBox/PdfConverter.php     Class   Class source
test/bootstrap.php                  Aux.    Unit test script
test/SGH/PdfBox/CommandTest.php     Test    Unit test script
test/SGH/PdfBox/PdfBoxTest.php      Test    Unit test script
