PHP Classes

Works! But wrong encoding...

Subject:Works! But wrong encoding...
Messages:3
Author:xegohu
Date:2017-01-30 18:30:08
 

  1. Works! But wrong encoding...
xegohu - 2017-01-30 18:30:08
Hi! Here is the sample file: http://wikisend.com/download/915688/__.pdf
I get all the text, but with the wrong encoding.

  2. Re: Works! But wrong encoding...
Christian Vigh - 2017-01-30 19:46:55 - In reply to message 1 from xegohu
Hello,

Thanks for your feedback.

To tell the truth, this is not a decoding problem: your PDF file uses CID (character ID) fonts.

CID fonts were introduced by Adobe before Unicode was widely adopted. They simply map a character ID to its corresponding glyph, which is essentially a set of instructions describing how to draw the letter. Even Adobe's own software does not know which real character lies behind a glyph!
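
As a rough illustration of the mapping described above, here is a minimal PHP sketch. It is not taken from the class: the `$cidToUnicode` table, the `decodeCids()` helper and the sample byte string are all invented for the example, and it assumes 2-byte character IDs and the mbstring extension. A table of this kind is what a font's optional ToUnicode CMap can supply; when no usable table is available, an extractor has only the numeric IDs to work with.

```php
<?php
// Hypothetical CID-to-Unicode table; the IDs and values are made up
// for illustration and do not come from any real font.
$cidToUnicode = [
    0x0001 => 0x0042, // B
    0x0002 => 0x004F, // O
    0x0003 => 0x0054, // T
    0x0004 => 0x0048, // H
];

/**
 * Decode a string of 2-byte big-endian character IDs into UTF-8.
 * IDs missing from the table become U+FFFD (replacement character).
 */
function decodeCids(string $bytes, array $map): string
{
    $out = '';
    // 'n*' reads every pair of bytes as an unsigned 16-bit big-endian integer.
    foreach (unpack('n*', $bytes) as $cid) {
        $codepoint = $map[$cid] ?? 0xFFFD;
        $out .= mb_chr($codepoint, 'UTF-8');
    }
    return $out;
}

$raw = "\x00\x01\x00\x02\x00\x03\x00\x04";

echo decodeCids($raw, $cidToUnicode), "\n"; // BOTH
echo decodeCids($raw, []), "\n";            // four U+FFFD characters: no table, no text
```

The second call shows why searching or copying text fails when the mapping is missing: the glyphs can still be drawn, but the IDs cannot be turned back into characters.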

To convince yourself, you can try the following:
- Open your PDF file in Acrobat Reader and search for some text ("BOTH" for example, which appears in the title).
- Copy and paste text from your PDF file into a simple text editor such as Notepad. You will notice that you get roughly the same results as with my class.

I'm currently trying to implement CID font mapping, which is a tough task because there is so little documentation about it! My first (partial and experimental) version worked more or less with text written in Scandinavian languages.

This is why I thank you: so far I had no other simple example PDF files that could help me go further with this task. Handling a one-page document written in Cyrillic will help me reverse-engineer a little further what Adobe had in mind...

In conclusion: please be a little patient. Contributions like yours will help me further enhance my class, but it will take a little time...

With kind regards,
Christian.

  3. Re: Works! But wrong encoding...
Christian Vigh - 2017-03-05 00:55:48 - In reply to message 1 from xegohu
Hi,

In the end, this turned out not to be due to CID fonts. It was due to the use of the ISO-8859-5 (Cyrillic) encoding.

Even Acrobat Reader interprets the text incorrectly: if you copy and paste the selection into a text editor such as Notepad++, you will get the same results as my class did.

I have put in place an experimental implementation of ISO code page handling, so your sample should be OK now.
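
For readers who want to see what such a code page conversion amounts to, here is a minimal sketch. It does not show how the class handles it internally; it assumes the text bytes have already been extracted from the PDF and that the mbstring extension is available, and the sample bytes are invented for the example.

```php
<?php
// Bytes invented for this example: the word "Привет" in ISO-8859-5,
// where each Cyrillic letter occupies a single byte.
$raw = "\xBF\xE0\xD8\xD2\xD5\xE2";

// Correct: declare the source encoding explicitly.
$asCyrillic = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-5');

// Incorrect: the same bytes read as ISO-8859-1 (Latin-1).
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

echo $asCyrillic, "\n"; // Привет
echo $asLatin1, "\n";   // ¿àØÒÕâ
```

The bytes are identical in both calls; only the assumed code page differs, which is why the extracted text looked complete but unreadable.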

Please feel free to contact me if you have any issues or questions.

Christian.