PHP PDF to Text | Formatting Question
Jack Webb - 2017-03-06 17:54:14
I noticed that none of the opening script tags: "<?php" have closing "?>" tags, either in the "PdfToText.phpclass" or in any of the "example.php" scripts. I've never seen anyone do this before. Could this be the reason for some of the errors indicated?
Christian Vigh - 2017-03-06 18:21:50 - In reply to message 1 from Jack Webb
The answer is definitely no! The lack of a PHP closing tag has never been an issue.
In fact, PHP stops scanning for PHP instructions when it encounters one of the following two cases while parsing an input file:
- A closing tag has been found. In this case, PHP simply outputs the data after the closing tag until it finds another opening tag.
- The end of the file has been reached (just as if a final closing tag had been found).
In fact, the closing tag is completely optional. I never use it in my classes, since most of them are used in command-line scripts, which output their own information (using echo). In such situations, if I put a closing tag and forget to check that there are no spaces, tabs or newlines after it, then those spaces, tabs or newlines will appear in the program's standard output.
Not terminating my source classes with a closing tag is a really safe way to ensure that none of the scripts using these classes will ever have spurious output.
As a habit, however, I always put a closing tag when I write a page that mixes PHP and HTML content. But this is just a habit...
Hope I answered your question!
Christian.
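The effect described in this reply can be reproduced with a small sketch. It writes two throwaway files, one ending with a closing tag plus stray whitespace and one without a closing tag, then captures what each prints when included. The helper name captureIncludeOutput and the file contents are assumptions made up for illustration; they are not part of the PdfToText package. Note that PHP swallows a single newline placed directly after a closing tag, so the test file deliberately contains more than that.

```php
<?php
// Hypothetical helper (not part of PdfToText): writes $source to a temporary
// file, includes it, and returns whatever the include sent to standard output.
function captureIncludeOutput(string $source): string
{
    $path = tempnam(sys_get_temp_dir(), 'tag');
    file_put_contents($path, $source);

    ob_start();              // capture anything the included file prints
    include $path;
    $output = ob_get_clean();

    unlink($path);
    return $output;
}

// A source file that ends with a closing tag followed by stray whitespace.
$withClosingTag = "<?php\n\$loaded = true;\n?>\n  \n";

// The same source file with no closing tag at all.
$withoutClosingTag = "<?php\n\$loaded = true;\n";

echo 'Bytes leaked with closing tag:    ',
     strlen(captureIncludeOutput($withClosingTag)), "\n";
echo 'Bytes leaked without closing tag: ',
     strlen(captureIncludeOutput($withoutClosingTag)), "\n";
```

The first include should leak the whitespace that follows the closing tag, while the second should produce no output at all, which is why omitting the tag is the safer choice for files that only define classes.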
Jack Webb - 2017-03-06 20:22:26 - In reply to message 2 from Christian Vigh
Yes, Christian... it did answer the question. Understand that I was NOT saying it was incorrect, only that I have never done it, nor did I realize that "the closing tag was optional". I primarily use PHP to create HTML.
I tried this with several PDFs, some created as a PDF from a text-based "source document" and others created from scans of book pages and then run through an OCR utility for indexing. As expected, the output of those that were indexed by means of OCR was marginal, at best. I did get some fatal errors when the documents were quite large. Do you know what the size limit might be? Basically, I find it to be quite useful... thanks! Jack
Christian Vigh - 2017-03-06 21:19:34 - In reply to message 3 from Jack Webb
Hi Jack,
Don't worry: I guessed that you were talking about a PHP feature you were not aware of!
Documents processed through OCR can give strange results. It took me a few hours to understand the first time I received such a document: at first I didn't understand why Acrobat Reader was showing me a scanned image, and why I was still able to select some text from it!
Regarding the errors you may encounter with large files (or with "special" files), I currently see a few settings in php.ini that might help:
- memory_limit: normally, the default value of 128M (on non-shared servers) should be enough (I even tried my class on some PDF files with a memory limit of 5M). When memory is exhausted, you should have an error message in your PHP log along the lines of "Fatal error: Allowed memory size of x bytes exhausted (tried to allocate y bytes)". It also gives you the source file and source line where the error occurred.
- pcre.backtrack_limit: the PdfToText class relies heavily on regular expressions. In some cases, it may have to use one very loose regex, which requires this limit to be raised. The default value is 1000000; setting it to 3000000 should be safe (I've never met a PDF file so far that would require a higher value, but who knows?). This should only happen on Windows platforms.
- pcre.recursion_limit: in some cases, you may need to raise this setting as well.
In fact, the size limit depends on a lot of things, and not necessarily on the size of the PDF file itself! Extracting text from the 1300-page Adobe PDF specification works fine with a memory_limit of 128M (the file size is 16M). Extracting text from a file generated with Quark XPress could take more memory, because it generates extremely precise output about character positioning.
If you're using the class in a web form, then you may have to take care of additional settings:
- upload_max_filesize: the maximum size of a file you may upload.
- post_max_size: the maximum size of the data in an HTTP POST request. This includes the size of the submitted form fields, plus the size of any file(s) uploaded when submitting the form, so it should be a little greater than upload_max_filesize.
Anyway, if you have issues when processing PDF files, feel free to send me an email at the following address: christian.vigh@wuthering-bytes.com
Attach the failing PDF file, and I will happily take a look at what went wrong!
With kind regards,
Christian.
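For readers who cannot edit php.ini, the settings discussed in this reply can be sketched at runtime as follows. This is only an illustration under stated assumptions: the helper names iniSizeToBytes and describeLastPcreError are hypothetical, and the values 256M and 200000 are arbitrary examples (only 3000000 comes from the thread). The memory and PCRE limits can be changed with ini_set(), whereas the upload directives (spelled upload_max_filesize and post_max_size in php.ini) cannot be changed from a running script, so the sketch only reads them.

```php
<?php
// Raise the limits for the current script only. The values are illustrative
// assumptions; 3000000 is the backtrack limit suggested in the thread.
ini_set('memory_limit', '256M');
ini_set('pcre.backtrack_limit', '3000000');
ini_set('pcre.recursion_limit', '200000');

// Hypothetical helper: converts php.ini shorthand such as "8M" into bytes.
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '') {
        return 0;
    }
    $number = (int) $value;                    // leading digits, e.g. 8 for "8M"
    switch (strtolower(substr($value, -1))) {  // optional unit suffix
        case 'g': return $number * 1024 * 1024 * 1024;
        case 'm': return $number * 1024 * 1024;
        case 'k': return $number * 1024;
        default:  return $number;
    }
}

// Hypothetical helper: tells whether the last preg_* call hit a PCRE limit.
function describeLastPcreError(): string
{
    switch (preg_last_error()) {
        case PREG_NO_ERROR:              return 'no error';
        case PREG_BACKTRACK_LIMIT_ERROR: return 'pcre.backtrack_limit exceeded';
        case PREG_RECURSION_LIMIT_ERROR: return 'pcre.recursion_limit exceeded';
        default:                         return 'other PCRE error';
    }
}

// The upload limits can only be read here, not changed.
$uploadLimit = iniSizeToBytes((string) ini_get('upload_max_filesize'));
$postLimit   = iniSizeToBytes((string) ini_get('post_max_size'));

// A post_max_size of 0 means "no limit", so only warn for positive values.
if ($postLimit > 0 && $postLimit <= $uploadLimit) {
    echo "Warning: post_max_size should be larger than upload_max_filesize\n";
}

preg_match('/PDF/', 'A small PDF file');
echo 'Last PCRE status: ', describeLastPcreError(), "\n";
```

Calling describeLastPcreError() right after a preg_* call that unexpectedly returns false or null is one way to tell a regex limit problem apart from a memory problem, which would show up as a fatal error in the PHP log instead.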