Extract PDF to text and XML: I need to parse a PDF file and convert the whole text into XML

Extract PDF to text and XML #pdf to xml

by Anand Lagad - 10 years ago (2015-06-14)

I need to parse a PDF file and convert the whole text into XML

I need PHP code to parse any PDF file and convert it into XML format.

I think we cannot examine HTML tags in a PDF, so I think we should first parse the whole PDF, then convert it into XML.

What I want is this: if the PDF document contains a table, I want the table fields as XML tags and the table data as their values.
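
To make the requested output concrete, here is a minimal sketch of that mapping, written in Python rather than PHP purely for illustration. It assumes the table has already been extracted from the PDF as a header row plus data rows, which is the hard part and is not shown. The names `to_tag` and `table_to_xml`, the `table` and `row` element names, and the sample data are all hypothetical. Headers are sanitized because XML tag names cannot contain spaces or start with a digit.

```python
import re
import xml.etree.ElementTree as ET


def to_tag(name):
    """Turn a column header into a safe XML tag name (hypothetical helper)."""
    tag = re.sub(r"[^A-Za-z0-9_.-]+", "_", name.strip()).strip("_")
    # XML names must start with a letter or underscore.
    if not tag or not (tag[0].isalpha() or tag[0] == "_"):
        tag = "_" + tag
    return tag


def table_to_xml(header, rows):
    """Build <table><row><Field>value</Field>...</row>...</table>."""
    root = ET.Element("table")
    tags = [to_tag(h) for h in header]
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for tag, value in zip(tags, row):
            ET.SubElement(row_el, tag).text = value
    return ET.tostring(root, encoding="unicode")


# Sample data standing in for a table already extracted from a PDF.
xml_text = table_to_xml(
    ["Item Name", "Unit Price"],
    [["Widget", "9.50"], ["Gadget", "12.00"]],
)
print(xml_text)
```

Each table row becomes a `row` element whose children are named after the column headers, so "Item Name" ends up as an `Item_Name` tag.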

1 Clarification request
1. by Manuel Lemos - 10 years ago (2015-06-15)
I do not think that right now there is a class here that can convert an arbitrary PDF document to XML, HTML or any format that preserves the document structure.

There are classes for converting PDF to images of the pages, but I am not sure if that would address your needs.

There are solutions that require using external Web services or external programs like xpdf or Ghostscript. If that would do for you, maybe somebody can submit a class that wraps around those Web services or programs.
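
As a rough idea of what such a wrapper might look like, here is a sketch in Python (used only for illustration; a PHP class could follow the same shape). It assumes the `pdftotext` program from xpdf or Poppler is installed and on the PATH. The function names and the `document`/`page` element names are hypothetical. Note that this only wraps page text in XML; it does not recover tables or other structure.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    return cmd + [pdf_path, "-"]


def pages_to_xml(text):
    """Wrap extracted text in XML, one <page> per form-feed-separated page."""
    root = ET.Element("document")
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # pdftotext ends each page with a form feed
    for number, page in enumerate(pages, start=1):
        page_el = ET.SubElement(root, "page", number=str(number))
        page_el.text = page
    return ET.tostring(root, encoding="unicode")


def pdf_to_xml(pdf_path):
    """Run pdftotext (must be installed) and return the XML string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return pages_to_xml(result.stdout.decode("utf-8", errors="replace"))
```

Calling `pdf_to_xml("input.pdf")` (a placeholder file name) would return one `page` element per PDF page, with the `-layout` option keeping table columns roughly aligned in the text.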

2 Recommendations

Sweeper: Clean HTML to remove unwanted tags and attributes

This package can clean HTML to remove unwanted tags and attributes.

It is based on Mihai Sucan's ReTidy package and it uses regular expressions, DOM and XPath to find and remove the unwanted HTML code.

The package can also reformat HTML tables to improve accessibility, automatically generate a table of contents, and restructure contents.

by Jill Lingoff package author 30 - 6 years ago (2018-12-07)

Here are two methods.

One: Use a custom mapping table when doing File > Save As in Adobe Acrobat (http://flaurora-sonora.000webhostapp.com/Clean%20HTML%20V2.3.zip). Installing this file makes "Clean HTML v2.3" show up in the "Save as type" select box.

Two: Use Adobe Acrobat's "Save As... HTML (.html,.htm)" option, then use the "clean_PDF" Sweeper profile.

Neither is likely to convert the structure of the PDF content to HTML perfectly. This is due to the difference between the PDF and HTML formats themselves: PDF positions content on a page, while HTML holds content in a nested structure. Funnily, a PDF is made accessible exactly by applying HTML-like tags to its content.

So, in short, PDFs often do not contain the sort of content structure desired. Achieving that structure involves converting from PDF as cleanly as possible, then using manual or automated methods (like Sweeper) to create it.
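
The positional point can be illustrated with a small sketch (in Python, for illustration only). Assume an extractor has already produced text fragments with page coordinates; the `Fragment` tuple, the tolerance value, and the convention that `y` grows down the page are all assumptions here, since real extractors report positions in their own formats. Table rows are recovered by clustering fragments whose vertical positions are close, then ordering each row left to right.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    x: float  # horizontal position on the page
    y: float  # vertical position, assumed to grow down the page
    text: str


def group_into_rows(fragments: List[Fragment], tolerance: float = 2.0) -> List[List[str]]:
    """Cluster fragments into rows by y, then sort each row by x."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Join the current row if the fragment is vertically close to
        # that row's first fragment; otherwise start a new row.
        if rows and abs(frag.y - rows[-1][0].y) <= tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


fragments = [
    Fragment(200, 100.4, "Price"),
    Fragment(50, 100.0, "Item"),
    Fragment(50, 120.0, "Widget"),
    Fragment(200, 119.6, "9.50"),
]
print(group_into_rows(fragments))
```

Even in this toy case the grouping depends on a tolerance guess, which is why real documents with merged cells or wrapped text usually need the manual or automated cleanup described above.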

PHP DOC DOCX PDF to Text Converter: Convert DOCX, DOC, PDF to plain text

This class can convert DOCX, DOC, PDF files to plain text.

It can read files in Microsoft Word DOCX and DOC formats or in PDF, and parse them to extract the text they contain.

The text extracted from the documents is returned as a plain text string.

+1	by Dave Smith 7625 - 10 years ago (2015-06-14)

The innovation nomination description indicates that this class will extract document elements in addition to text, which is what you will need to extract tables.

8 Comments
1. by Manuel Lemos - 10 years ago (2015-06-15)
I think the original poster wants a solution that preserves the original document structure. So, just extracting text may not be enough for him.
2. by adam berger - 10 years ago (2015-06-15)
An interesting project. I would be happy to try the same class to convert PDF to XML. I am waiting for results :)
3. by adam berger - 10 years ago (2015-06-15) in reply to comment 2 by adam berger
I suggest you first convert to HTML in a cache and then convert that to XML. This can be done on the fly with the cache.
4. by Manuel Lemos - 10 years ago (2015-06-15) in reply to comment 3 by adam berger
Well, XHTML is still HTML and XML.
5. by Dave Smith - 10 years ago (2015-06-15) in reply to comment 1 by Manuel Lemos
If the comments for the innovation nomination of this class are correct, and I am not misreading them, the class should be able to get the document elements, not just text. That is the basis of my recommendation.
6. by Manuel Lemos - 10 years ago (2015-06-16) in reply to comment 5 by Dave Smith
What the nomination comments say is that extracting document elements is not a trivial task. That class just extracts text using a simple approach.
7. by Dave Smith - 10 years ago (2015-06-16) in reply to comment 6 by Manuel Lemos
Okay, looks like I was confused. Better to have tried and failed than to not have tried at all :)

Looks like adam berger will attempt the non-trivial task.
8. by Manuel Lemos - 10 years ago (2015-06-16) in reply to comment 7 by Dave Smith
That is OK, maybe my wording was not ideal either.
