Author: Christian Vigh
Posted on: 2016-10-19
Package: PHP PDF to Text
This article is the first of a series about the challenges of processing the PDF file format and how the PdfToText class can be used to extract text and images from it.
Introduction
Extracting text from PDF files can be a tedious task for a developer. If you have ever opened a PDF file in a text editor such as Notepad++ just to search for some text you know for sure to be present in it, chances are that you found nothing but binary data!
This is due to the open nature of the PDF file format: the basic elements of a PDF file are objects, usually identified by a unique object number and a revision id.
Objects can contain anything: font definitions, character substitution tables and, of course, text data. Most of these objects are compressed with the same deflate algorithm that gzip uses, and possibly encrypted. You can also expect even more complicated things under the hood.
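To make this concrete, here is a minimal sketch that builds one hand-made object and shows why a plain text search finds nothing. The object number, the sample content and the variable names are made up for illustration, and the sketch assumes the PHP zlib extension is enabled. It is not how the PdfToText class works internally, which has to deal with many more cases.

```php
<?php
// Sample page content written with PostScript-like text operators
// (made-up data, for illustration only).
$content = "BT /F1 12 Tf 72 720 Td (Hello, PDF world) Tj ET\n"
         . "BT /F1 12 Tf 72 700 Td (Hello again, PDF world) Tj ET\n";

// PDF "Flate" streams use zlib framing, which is what gzcompress() produces.
$compressed = gzcompress($content);

// A hand-made object: object number 5, revision id 0.
$object = "5 0 obj\n"
        . "<< /Length " . strlen($compressed) . " /Filter /FlateDecode >>\n"
        . "stream\n" . $compressed . "\nendstream\nendobj\n";

// Searching the raw bytes fails because the text is compressed.
$found_in_raw = strpos($object, 'Hello') !== false;

// Read the object header and the /Length entry, then inflate the stream body.
// The pattern is anchored at the start, so the match length is the body offset.
preg_match('/^(\d+) (\d+) obj\b.*?\/Length (\d+).*?stream\r?\n/s', $object, $m);
$body    = substr($object, strlen($m[0]), (int) $m[3]);
$decoded = gzuncompress($body);

echo "Object {$m[1]}, revision {$m[2]}\n";
echo 'Found in raw bytes: ' . ($found_in_raw ? 'yes' : 'no') . "\n";
echo $decoded;
```

Using the /Length entry to cut out the stream body, rather than searching for the endstream keyword, avoids being fooled by binary data that happens to contain line breaks.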
This article explains how the PHP PDF To Text class can help you to extract text from almost any PDF file.
It will be followed by a series of articles explaining various parts of the PDF file format that are of interest during the text extraction process.
Installation
Talking about an installation process would be a little bit pretentious: just extract the PdfToText.phpclass file from the .zip archive to your preferred includes directory.
You may also install it using the composer tool from the PHP Classes composer repository.
A future version may include additional and completely optional satellite data files, but that's another story which will be the subject of another article...
Generating PDF files
Before you start working with the PdfToText class, you will of course need a few sample PDF files. If you do not have any at hand, a few are provided in the PdfToText .zip package, under the examples directory.
If you are using the Windows operating system, the following virtual printer drivers can be of some help to generate PDF files (the following list is not exhaustive) :
- Microsoft Print to PDF: the native solution from Microsoft. If not installed on your system, you can have a look here. Note that it may sometimes generate weird results.
- PdfCreator: a free virtual printer. The free version contains some ads.
- PrimoPdf: another free virtual PDF printer.
- Pdf Architect 4: another product from PdfForge, which is not free. However, it includes a free virtual PDF printer driver very similar to PdfCreator (if not identical, except for the name).
- Pdf Pro 10: a paid solution for editing PDF files. It includes a free virtual printer driver that has many interesting features, such as an elaborate printer spooler for managing files printed on servers.
- PdFill Image Writer: a free virtual printer. You can also purchase a PDF editor for less than $20.
- And, of course, Adobe Acrobat DC.
Getting started
Although the PDF file format is really versatile, the PdfToText class has been designed to hide the complexity of the underlying data from you and to provide a simple interface.
The simplest PHP script that processes a PDF file given as a command-line argument and echoes its text contents to the standard output looks like this:
<?php
require( 'path/to/PdfToText.phpclass' );
$pdf = new PdfToText( $argv[1] );
echo $pdf->Text;
Once you have loaded a PDF file, its text contents are accessible through the Text property. The filename supplied to the class constructor is optional: you can omit it and later use the Load() method to extract the file contents.
This allows you to specify additional options or set special properties before loading the actual PDF contents. The following example will extract images from your PDF file by setting the Options property before calling the Load() method:
<?php
require( 'path/to/PdfToText.phpclass' );
$pdf = new PdfToText( );
$pdf->Options = PdfToText::PDFOPT_DECODE_IMAGE_DATA;
$pdf->Load( $argv[1] );
echo $pdf->Text;
Note that this second approach allows you to reuse the same object (with the same options) for processing different PDF files.
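Reusing one object for several files could look like the sketch below. The looks_like_pdf() helper is hypothetical and not part of the PdfToText class: it only checks for the "%PDF-" signature that a conforming PDF file starts with, so that obviously wrong files are skipped before loading. The class file path is a placeholder, and the loop only runs if that file actually exists.

```php
<?php
// Hypothetical helper, not part of the PdfToText class: returns true when
// the file starts with the "%PDF-" signature found at the top of a PDF file.
function looks_like_pdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $signature = fread($handle, 5);
    fclose($handle);
    return $signature === '%PDF-';
}

// Placeholder path: adjust it to wherever you extracted the class file.
$class_file = 'path/to/PdfToText.phpclass';

if (is_file($class_file)) {
    require($class_file);

    // One object, one set of options, several files.
    $pdf = new PdfToText();
    $pdf->Options = PdfToText::PDFOPT_DECODE_IMAGE_DATA;

    foreach (array_slice($argv, 1) as $path) {
        if (!looks_like_pdf($path)) {
            fwrite(STDERR, "Skipping $path: not a PDF file\n");
            continue;
        }
        $pdf->Load($path);
        echo $pdf->Text;
    }
}
```

The signature check is deliberately cheap. It does not validate the file, it only filters out inputs that cannot be PDF files at all.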
Retrieving page contents
You can retrieve individual page contents by using the Pages array property which is available, like the Text property, once the PDF file contents have been loaded.
The Pages property is an associative array whose keys are page numbers, and values, page contents.
A sample script that displays individual page contents from a PDF file looks like this:
<?php
require( 'path/to/PdfToText.phpclass' );
$pdf = new PdfToText( $argv[1] );
foreach ( $pdf->Pages as $page_number => $page_contents )
    echo "Contents of page #$page_number :\n$page_contents\n";
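Because the Pages property is a plain associative array, ordinary array code works on it. As a small sketch, the hypothetical helper below (not part of the class) returns the numbers of the pages that contain a given word. It runs on stand-in data shaped like the Pages property, so no PDF file is needed to try it.

```php
<?php
// Hypothetical helper: returns the page numbers whose text contains $needle,
// ignoring case. $pages maps page numbers to page contents, which is the
// shape described for the Pages property.
function find_pages_containing(array $pages, string $needle): array
{
    $matches = [];
    foreach ($pages as $page_number => $page_contents) {
        if (stripos($page_contents, $needle) !== false) {
            $matches[] = $page_number;
        }
    }
    return $matches;
}

// Stand-in data so that the sketch runs without a PDF file.
$pages = [
    1 => "Introduction\nInvoice total: 120 EUR",
    2 => "Terms and conditions",
    3 => "Invoice annex",
];

print_r(find_pages_containing($pages, 'invoice'));
```

With a loaded object, the call would be find_pages_containing( $pdf->Pages, 'invoice' ).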
Retrieving image data
The PDF file format supports several types of image contents. In its current version (1.2.46), the PdfToText class is only able to process images encoded in the JPEG format.
Retrieving image contents is as simple as specifying a special option as the second parameter of the class constructor:
<?php
require( 'path/to/PdfToText.phpclass' );
$pdf = new PdfToText( $argv[1], PdfToText::PDFOPT_DECODE_IMAGE_DATA );
Or, if you prefer deferred loading:
<?php
require( 'path/to/PdfToText.phpclass' );
$pdf = new PdfToText( );
$pdf->Options = PdfToText::PDFOPT_DECODE_IMAGE_DATA;
$pdf->Load( $argv[1] );
Once loaded, image contents will be available through the Images array property, which is an array of image resources that have been created for each JPEG image encountered in your PDF file.
There is another option, PdfToText :: PDFOPT_GET_IMAGE_DATA, which simply loads raw image data into the ImageData array property. This way, you may have more elements in the ImageData property than in Images, since the PdfToText class currently supports only JPEG images.
Note that specifying the PDFOPT_DECODE_IMAGE_DATA flag automatically sets the PDFOPT_GET_IMAGE_DATA one.
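Once the Images property is filled, the images can be written to disk. The sketch below assumes that the entries are GD images, which is how "image resources" is understood here, and that the GD extension is installed. The save_images() helper and the file naming scheme are hypothetical and not part of the class.

```php
<?php
// Hypothetical helper: writes each GD image to "<directory>/image-NNN.jpg"
// and returns the list of files that were written successfully.
function save_images(array $images, string $directory, int $quality = 90): array
{
    $written = [];
    $counter = 0;
    foreach ($images as $image) {
        $counter++;
        $path = rtrim($directory, '/\\') . DIRECTORY_SEPARATOR
              . sprintf('image-%03d.jpg', $counter);
        // imagejpeg() returns false on failure, in which case the file is skipped.
        if (imagejpeg($image, $path, $quality)) {
            $written[] = $path;
        }
    }
    return $written;
}
```

With an object loaded using the PDFOPT_DECODE_IMAGE_DATA option, the call would be save_images( $pdf->Images, 'output' ), where the output directory must already exist.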
Documentation
The complete documentation of the format is available at the Adobe PDF Reference version 1.7 page.
If you are enthusiastic enough to read the 1300 pages of this document, keep in mind that Adobe also provided a generous set of technical notes addressing various specific topics not completely covered by these specifications. Some of these technical notes are more than 200 pages long.
How to contribute to the development of the PdfToText class
There are so many ways to write the same page contents using the Adobe Postscript-like language that you may sometimes get strange results. Should this be the case, please feel free to contact me on this package's support forum.
You can also have a look at my Github repository, and even issue pull requests. I also have a Web site dedicated to this class.
However, if you have any issue while processing one of your PDF files, and really don't want to go through the code to try to understand what's happening, you can reach me directly by email at christian.vigh@wuthering-bytes.com. Just send me the faulty PDF file as an attachment together with a little description about the issue, and I will be happy to try to solve your problem.
Known issues
The following is a list of known issues. I'm still working on them and they should be addressed in future versions:
- RTL languages, such as Arabic, Hebrew or Syriac, are not correctly processed: they are extracted from left to right
- Only JPEG images are currently supported
- There is currently no support for password-protected files (note that I'm not intending to develop a password cracker, just a feature that allows you to extract text contents from a password-encrypted PDF file, if you supply the correct password)
- Digitally signed files are not currently supported
- Text contents may sometimes show badly translated characters. The reason why will be explained in the next series of articles
- The extracted text contents may not exactly reflect text positioning on the page. This is especially true regarding PDF files that contain data in tabular format. Again, this issue will be fixed in a future release and explained in one of the future articles about this class.
- CID fonts (Adobe internal fonts, mainly used by eastern languages and developed before the Unicode effort took place) are not yet supported. This will be the subject of another article.
Conclusion
This article explained the basic usage of the PdfToText class. It presented a few features of the class, gave some basic examples on how to use it, and listed its current development state.
More articles will follow, diving into the internals of the PDF file format and explaining how the PdfToText class tries to handle them. The next article will lead you into a general overview of a PDF file layout (at least, the parts of it that are of interest to us when dealing with text extraction).
If you liked this article, please feel free to share it with other developers. If you have questions, post a comment here.