Aryan Schmitz - 2016-10-25 23:13:46
Hello Christian,
Thank you very much for your PDF to text converter. Most of the time the output is already nearly perfect, but I have now found a file where the output is not even remotely useful. When I open this PDF file in LibreOffice, the text is also corrupted in a similar way.
This "horror" file (a manual) can be downloaded directly from here: http://www.biltema.se/BiltemaDocuments/Manuals/40-209_man.pdf
(2,5 MB)
I hope this can help you to improve the converter even more.
with kind regards,
Aryan
Christian Vigh - 2016-10-26 03:02:40 - In reply to message 1 from Aryan Schmitz
Hello Aryan,
I tried to open this PDF file on my Windows system using LibreOffice. It took more than 5 minutes to open on my Intel i7 quad-core system with 16 GB of RAM!
However, once opened, it looks similar to the original.
I also ran my class on it; on page 2, you may notice that the last paragraph, "INTRODUKTION", comes just after the initial paragraph listing the standards & regulations with which the product complies.
This is because my class does not handle (x,y) positions accurately enough to provide a better result (the text objects are processed in the order in which they appear in the input file). This issue has been on my todo list for a long time and I hope to address it one day!
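To illustrate the ordering issue, here is a minimal Python sketch of position-aware ordering. It is not the class's actual code: `Fragment`, `reading_order` and `line_tolerance` are hypothetical names, and it assumes that each text object carries the (x, y) origin of its baseline, with y growing upward as in PDF user space.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A hypothetical text object with the (x, y) origin of its baseline."""
    x: float
    y: float
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Return the text line by line, top to bottom and left to right.

    Assumes y grows upward, so a larger y means higher on the page.
    Fragments whose y differs from the first fragment of the current
    line by at most line_tolerance are treated as the same line.
    """
    # Highest fragments first; ties broken by x position.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within a line, read from left to right.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments listed in "file order", with the last paragraph stored early.
sample = [
    Fragment(72, 700, "Standards"),
    Fragment(72, 100, "INTRODUKTION"),
    Fragment(150, 700.5, "and regulations"),
    Fragment(72, 400, "Body text"),
]
print(reading_order(sample))
```

A simple sort like this ignores multi-column layouts, rotated text and tables, so it only shows the general idea of using positions instead of file order.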
I also noticed a few garbage characters. This is due to the same issue we discussed a while ago about font reference substitutions.
Please let me know if I missed something in the output! (I can send you the text file resulting from text extraction.)
Regards,
Christian.
Aryan Schmitz - 2016-10-26 09:38:58 - In reply to message 2 from Christian Vigh
Hi Christian,
It seems that you get much better results than I do?
Aha, now I have discovered why! It seems that opening the PDF and saving it again with Preview (Apple's built-in OS X PDF reader) corrupts this file in a way that makes it unreadable for PDF-to-Text. When I tried again, downloading the file directly to disk and converting it with PDF-to-Text, the output was totally different and almost perfect.
I'm sorry for the false alarm. Normally I do not have this problem when I save PDF files with Preview, but apparently some files can produce horrible results.
In a PDF reader the re-saved file still looks identical, but I have now also noticed that cut and paste from the PDF body text does not work correctly either. Do you want me to email you this "horror" PDF file that I have?
with kind regards
Aryan
Christian Vigh - 2016-10-26 09:49:06 - In reply to message 3 from Aryan Schmitz
Hi Aryan,
By "horror file", do you mean the copy you made by saving it with Preview? Yes, I'm interested!
In fact, there must be a problem with the original: I wonder how it could take LibreOffice more than 5 minutes to load on my superfast computer.
Regards,
Christian.
Aryan Schmitz - 2016-10-26 12:45:23 - In reply to message 4 from Christian Vigh
Hi Christian,
I sent you what became my "horror" PDF file when I saved it with Preview.
I also see now that the file size has increased to 2,5 MB, whereas the original file http://www.biltema.se/BiltemaDocuments/Manuals/40-209_man.pdf was only 1,5 MB.
When I open the original, unaffected PDF file with Acrobat Reader instead of Preview and save it as a new file from within Acrobat Reader, it does not seem to get horribly re-encoded!
Best regards
Aryan
Christian Vigh - 2016-10-26 19:08:04 - In reply to message 5 from Aryan Schmitz
Hi Aryan,
Thanks for the information and for the "horror" file. I am as aware as you are that this may be completely outside the scope of my class. However, I am keeping it as an open issue (with low priority), because I have the feeling that this "horror file" may contain interesting information that could help me understand some weird PDF constructs!
With kind regards,
Christian.