PHP Classes

Fix on excluded text


Subject:Fix on excluded text
Summary:A fix on a bug I encountered
Messages:8
Author:John Thomas
Date:2010-10-12 16:08:32
Update:2013-11-22 03:17:42
 

  1. Fix on excluded text
John Thomas - 2010-10-12 16:08:32
I was trying to extract text from a PDF when I noticed that large blocks of it were missing. After fiddling around with your code (very nice, by the way; it saved me the grand annoyance of learning the PDF format's internals), I realized the issue was that you rely on newlines around the "obj" tokens in the PDF, and those newlines aren't guaranteed to be there.
To be more exact, I changed this code:
preg_match_all("#obj[\n|\r](.*)endobj[\n|\r]#ismU", $infile, $objects);
$objects = @$objects[1];

To:
preg_match_all("#obj(.*)endobj#ismU", $infile, $objects);
$objects = @$objects[1];
// array_map() returns a new array, so the result has to be assigned back
$objects = array_map('ltrim', $objects);

The latter captures more objects, but still removes the excess spacing. I'm not sure if this was some weirdness due to my particular PDFs, but if this code is useful, feel free to use it.
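
To illustrate the difference between the two patterns, here is a minimal sketch run on a made-up three-object string. The sample data and the helper name extractObjects are hypothetical and are not part of the PDF Text Extractor class; only the two regular expressions come from the code quoted above.

```php
<?php
// Hypothetical helper for illustration; it is not part of the PDF Text Extractor class.
function extractObjects(string $infile, bool $requireNewlines): array
{
    $pattern = $requireNewlines
        ? "#obj[\n|\r](.*)endobj[\n|\r]#ismU"  // original: needs a newline (or a literal |) after each token
        : "#obj(.*)endobj#ismU";               // relaxed: no newline needed
    preg_match_all($pattern, $infile, $matches);
    // Strip the leading whitespace that the relaxed pattern now captures.
    return array_map('ltrim', $matches[1]);
}

// Made-up sample: the second object sits on one line, with spaces instead of newlines.
$sample = "1 0 obj\n<</A 1>>\nendobj\n"
        . "2 0 obj <</B 2>> endobj "
        . "3 0 obj\n<</C 3>>\nendobj\n";

echo count(extractObjects($sample, true)), "\n";   // 2: the one-line object is skipped
echo count(extractObjects($sample, false)), "\n";  // 3: all objects are captured
```

Real PDFs may separate tokens in other ways as well, so this only shows why the newline requirement can drop objects, not that the relaxed pattern is complete.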

  2. Re: Fix on excluded text
joeri - 2010-12-18 07:34:25 - In reply to message 1 from John Thomas
Thomas,

Thanks. I will test it on some PDFs and, if it works, include it with the original.


  3. Re: Fix on excluded text
Rene Hart - 2010-12-22 13:37:26 - In reply to message 2 from joeri
Some of the text in the PDF I want to convert is still missing. Any idea what can be done to solve this?

  4. Re: Fix on excluded text
Chris Li - 2011-06-21 14:27:21 - In reply to message 1 from John Thomas
I checked and tested the PDF2TEXT code. It works, but it does not output line breaks: all the text is run together. Is there any way to keep the line breaks in the output file? What code do I need to add?

Thanks in advance.

  5. Re: Fix on excluded text
Chris Li - 2011-06-23 16:14:59 - In reply to message 1 from John Thomas
Do you have any new improvements to this project?
I have a similar project and see some text that is not captured by the current
source code.

Chris

  6. Re: Fix on excluded text
Tony Wilson - 2011-06-26 19:35:10 - In reply to message 5 from Chris Li
At first, this seemed to solve my problem of extracting text (to load into a database for searching purposes). However, I cannot release it as part of my project because I cannot always extract the text reliably (chunks are missing).
This is a real shame as it seems to be almost what I needed. Are there any updates scheduled?

  7. Re: Fix on excluded text
arron wall - 2013-11-22 03:17:42 - In reply to message 1 from John Thomas
I once tried to extract text from PDF files with the help of the following code:
using YiiGo.Imaging.Basic;
using YiiGo.Imaging.Basic.Core;
using YiiGo.Imaging.Basic.Codec;
using YiiGo.Imaging.PDF;

YiiGoImaging PDF = new YiiGoImaging();

public void PdfProcessorExtractTextPage()
{
    PDFInputFile = (@"C:/1.pdf");
    PDFPageNumberStart = "0";
    PDFPageNumberStop = "4";
    PDFOutputFile = OutputFormat.txt;
    PDFOutputFile = (@"C:/extract.txt");
}
PDF.PdfProcessorExtractText(@"C:/1.pdf", "0", "4", @"C:/extract.txt");
You can check its tutorial page here:
yiigo.com/guides/csharp/how-to-extr ...
I hope it helps. Good luck.



Best regards,
Arron


  8. Re: Fix on excluded text
lee charles - 2016-02-20 05:20:22 - In reply to message 7 from arron wall
Hi, Arron.
Thanks for sharing this code. But I wonder whether I need a 3rd-party PDF text extraction toolkit (like http://www.pqscan.com/extract-text/ ) to help me extract text from PDF files. If so, it would be better if it offered a free trial package for users to check. I will try it later and send you feedback.



Best regards,
Pan