Christian Vigh - 2017-01-30 19:46:55 -
In reply to message 1 from xegohu
Hello,
thanks for your feedback.
To tell the truth, this is not a decoding problem: your PDF file uses CID (character ID) fonts.
CID fonts were introduced by Adobe long before the Unicode standard became widely adopted. They simply map a character ID to its corresponding glyph, which is essentially a set of instructions describing how to draw the letter. Without an extra mapping table, even Adobe's own software does not know which real character is behind a glyph!
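To make this more concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the byte string, the `TO_UNICODE` table and the function names do not come from any real font or library): the text stored in the PDF is only a sequence of two-byte character IDs, and a reader that has no CID-to-character table can only guess.

```python
# Hypothetical example data: a string as it might be stored in a PDF
# that uses a CID font, with two bytes per character ID.
RAW = bytes.fromhex("002500320037002B")

# Hypothetical lookup table (CID -> real character). A CID font does
# not have to provide this information at all.
TO_UNICODE = {37: "B", 50: "O", 55: "T", 43: "H"}


def split_cids(data):
    """Split the raw bytes into 2-byte big-endian character IDs."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode(cids, to_unicode=None):
    """Turn CIDs into text.

    Without a table, the IDs are naively treated as character codes,
    which yields unrelated characters. With a table, each ID is looked
    up, and unknown IDs become U+FFFD (the replacement character).
    """
    if to_unicode is None:
        return "".join(chr(cid) for cid in cids)
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)


cids = split_cids(RAW)
print(decode(cids))              # naive guess, no table available
print(decode(cids, TO_UNICODE))  # lookup through the hypothetical table
```

The first call shows what a text extractor is left with when the mapping is missing; the second shows that the same IDs only become readable once some table links them to real characters.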
To convince yourself, you can try the following:
- Open your PDF file in Acrobat Reader and search for some text ("BOTH" for example, which appears in the title)
- Copy and paste text from your PDF file into a simple text editor such as Notepad. You will notice that you get roughly the same results as with my class.
I'm currently trying to implement CID font mapping, which is a tough task because there is so little documentation about it! My first (partial and experimental) version worked more or less with text written in Scandinavian languages.
This is why I thank you: until now I had no other simple example PDF files that could help me go further in this task. Handling a one-page document written in Cyrillic will help me reverse-engineer a little further what Adobe had in mind...
In conclusion: please be a little patient. Contributions such as yours will help me further enhance my class, but it will take a little time...
With kind regards,
Christian.