Extract PDF to text and XML
by Anand Lagad - 8 months ago (2015-06-14) pdf to xml
I need PHP code to parse any PDF file and convert it to XML. Since a PDF does not contain HTML tags that could be inspected, I think we should first parse the whole PDF and then convert the result to XML. What I want is this: if the PDF document contains a table, I want the table fields as XML tags and the table data as their values.
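To make the requested mapping concrete, here is a minimal sketch of the last step only, assuming the table has already been extracted from the PDF into a header array and row arrays. The function name `tableToXml` and the `<table>`/`<row>` wrapper elements are hypothetical, chosen for illustration; extracting the table from the PDF is the hard part and is not covered here.

```php
<?php
// Hypothetical helper: turns one already-extracted table into XML.
// $header holds the column names, $rows holds the cell values per row.
function tableToXml(array $header, array $rows): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $table = $doc->appendChild($doc->createElement('table'));

    // Column names may contain spaces or start with a digit, which XML
    // element names do not allow, so normalise them first.
    $tags = array_map(function (string $name): string {
        $tag = preg_replace('/[^A-Za-z0-9_]+/', '_', trim($name));
        if ($tag === '' || !preg_match('/^[A-Za-z_]/', $tag)) {
            $tag = '_' . $tag;
        }
        return $tag;
    }, $header);

    foreach ($rows as $row) {
        $rowNode = $table->appendChild($doc->createElement('row'));
        foreach ($tags as $i => $tag) {
            $cell = $doc->createElement($tag);
            // createTextNode() escapes characters such as & and <.
            $cell->appendChild($doc->createTextNode((string) ($row[$i] ?? '')));
            $rowNode->appendChild($cell);
        }
    }
    return $doc->saveXML();
}

// Usage: each row becomes a <row> element with one child per column.
echo tableToXml(['Name', 'Unit Price'], [['Pen', '1.50'], ['Ink & Co', '3']]);
```

A column named "Unit Price" ends up as the tag `Unit_Price`, because a space is not valid in an XML element name.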
1. by Manuel Lemos - 8 months ago (2015-06-15)
I do not think that right now there is a class here that can convert an arbitrary PDF document to XML, HTML or any format that preserves the document structure.
There are classes for converting PDF to images of the pages, but I am not sure if that would address your needs.
There are solutions that require using external Web services or external programs like xpdf or Ghostscript. If that would do for you, maybe somebody can submit a class that wraps around those Web services or programs.
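As a rough idea of what such a wrapper might look like, here is a minimal sketch that shells out to the `pdftotext` program from xpdf. It assumes `pdftotext` is installed and on the PATH; the function names `buildPdfToTextCommand` and `pdfToText` are hypothetical. Note that this only yields layout-preserving plain text, not XML, so tables would still have to be recognised in a later step.

```php
<?php
// Builds the shell command. Kept separate so it can be inspected
// without actually running the external program.
function buildPdfToTextCommand(string $pdfPath, string $binary = 'pdftotext'): string
{
    // -layout keeps the physical layout of the page, which helps keep
    // table columns aligned. The trailing "-" sends the text to stdout.
    return sprintf(
        '%s -layout %s -',
        escapeshellcmd($binary),
        escapeshellarg($pdfPath)
    );
}

// Runs pdftotext and returns the extracted text as one string.
function pdfToText(string $pdfPath, string $binary = 'pdftotext'): string
{
    if (!is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read $pdfPath");
    }
    exec(buildPdfToTextCommand($pdfPath, $binary), $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdftotext exited with status $status");
    }
    return implode("\n", $lines);
}
```

Calling `pdfToText('/path/to/file.pdf')` would then return the page text, or throw if the file is unreadable or the program fails.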
by Dave Smith - 8 months ago (2015-06-14)
The innovation nomination description indicates that this class will extract document elements in addition to text, which is what you will need to extract tables.
1. by Manuel Lemos - 8 months ago (2015-06-15)
I think the original poster wants a solution that preserves the original document structure. So, just extracting text may not be enough for him.
2. by adam berger - 8 months ago (2015-06-15)
An interesting project. I would be happy to try the same class to convert PDF to XML. I am waiting for results :)
3. by adam berger - 8 months ago (2015-06-15) in reply to comment 2 by adam berger
I suggest you first convert the PDF to HTML, keep that in a cache, and then convert it to XML. This can be done on the fly with a cache.
4. by Manuel Lemos - 8 months ago (2015-06-15) in reply to comment 3 by adam berger
Well, XHTML is still HTML and XML.
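To illustrate that point, here is a small sketch showing that well-formed XHTML can be read directly with a strict XML parser, so no separate HTML-to-XML step is needed. The sample markup and the function name `xhtmlTableCells` are hypothetical and only stand in for whatever a PDF-to-HTML step would produce.

```php
<?php
// Reads every table row from an XHTML document using the XML parser.
function xhtmlTableCells(string $xhtml): array
{
    $doc = new DOMDocument();
    // loadXML() is the strict XML parser, not the lenient loadHTML().
    if (!$doc->loadXML($xhtml)) {
        throw new RuntimeException('Input is not well-formed XML');
    }
    $xpath = new DOMXPath($doc);
    // XHTML elements live in a namespace, so XPath needs a prefix for it.
    $xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

    $rows = [];
    foreach ($xpath->query('//x:tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('x:th|x:td', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Hypothetical XHTML fragment used as sample input.
$xhtml = <<<XML
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Pen</td><td>1.50</td></tr>
    </table>
  </body>
</html>
XML;

print_r(xhtmlTableCells($xhtml));
```

The first row returned holds the header cells and the following rows hold the data, which is the shape a table-to-XML step would need.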
5. by Dave Smith - 8 months ago (2015-06-15) in reply to comment 1 by Manuel Lemos
If the comments for the innovation nomination of this class are correct, and I am not misreading them, the class should be able to get the document elements, not just text. That is the basis of my recommendation.
6. by Manuel Lemos - 8 months ago (2015-06-16) in reply to comment 5 by Dave Smith
What the nomination comments say is that extracting document elements is not a trivial task. That class just extracts text using a simple approach.
7. by Dave Smith - 8 months ago (2015-06-16) in reply to comment 6 by Manuel Lemos
Okay, looks like I was confused. Better to have tried and failed than to not have tried at all :)
Looks like adam berger will attempt the non-trivial task.
8. by Manuel Lemos - 8 months ago (2015-06-16) in reply to comment 7 by Dave Smith
That is OK, maybe my wording was not ideal either.