Classes of Christian Vigh | PHP RTF Tools | help/README.parser.md | Download |
# RtfParser class

The RtfParser class is an abstract class that allows you to parse Rtf files, using either the RtfStringParser or RtfFileParser derived classes.

The method from this class you will probably use the most is NextToken(), which returns an object derived from the RtfToken abstract class, representing the next Rtf token in your Rtf data.

Using the RtfStringParser or RtfFileParser is pretty straightforward: instantiate an object providing your Rtf data, then use a while loop to retrieve the tokens:
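The original listing is not reproduced on this page, so what follows is only a minimal sketch of such a loop. It assumes that NextToken() returns false once the data is exhausted, and that ToText() is available on every token class; neither point is stated in this text. The helper name `collect_text` is hypothetical.

```php
<?php
// Minimal sketch: drain a parser and concatenate the text of each token.
// $parser is expected to be an RtfStringParser (or RtfFileParser) instance.
// ASSUMPTION: NextToken() returns false at the end of the data.
// ASSUMPTION: every token class implements ToText().
function collect_text($parser): string
{
    $text = '';

    while (($token = $parser->NextToken()) !== false) {
        // Each token is an object derived from RtfToken.
        $text .= $token->ToText();
    }

    return $text;
}
```

With the library loaded, this would be called as `collect_text(new RtfStringParser($rtfdata))`, where `$rtfdata` holds the Rtf contents as a string.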
The above example works on Rtf data supplied by an in-memory string, but you can also parse files that are too big to fit into memory using the RtfFileParser class :
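The listing for this case is also missing here. As a hedged sketch, the helper below (hypothetical name `count_token_classes`) reads tokens one at a time and counts them by class, so nothing beyond the current token is kept by the caller. It again assumes that NextToken() returns false at the end of the data.

```php
<?php
// Sketch: count how many tokens of each class a parser produces.
// Tokens are consumed one by one, so the caller never holds the document.
// ASSUMPTION: NextToken() returns false at the end of the data.
function count_token_classes($parser): array
{
    $counts = [];

    while (($token = $parser->NextToken()) !== false) {
        $class = get_class($token);     // a class derived from RtfToken
        $counts[$class] = ($counts[$class] ?? 0) + 1;
    }

    ksort($counts);                     // alphabetical report
    return $counts;
}
```

A call might look like `count_token_classes(new RtfFileParser('large.rtf'))`; the file name is made up, and passing a path to the constructor is an assumption, not something this text confirms.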
The value returned by the NextToken() method is an object inheriting from the RtfToken abstract class, corresponding to the token that has been read; see the RtfToken classes section below for an explanation of the various token types (and corresponding classes) you may encounter.

The RtfParser class helps you read Rtf data and extract tokens from it, but it is definitely not designed to be an interpreter that could render a document or provide a level of abstraction like the Document Object Model (DOM) in Javascript. If you need to interpret your Rtf data, you will still have to do the job yourself by developing a class inheriting from RtfParser.

## The Rtf syntax

The Rtf syntax is really simple and can even be easily parsed using routines written in assembly language. An Rtf file basically includes the following syntactic elements:
A control word can be followed by an optional space which is to be considered as being part of the control word itself.
The \b control word tells the Rtf interpreter (maybe Word, WordPad or OpenOffice Writer) that the data that follows within this set of curly braces has to be rendered in bold face; the bold attribute reverts to normal font weight when the closing brace of that group is encountered. Also note the double space between the closing brace and "world": the first one belongs to the closing brace element and is not to be interpreted as textual data, while the second one is real textual content, i.e. a space to be put between the words "beautiful" and "world". Curly braces can be nested.
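To make these elements concrete, here is a standalone sketch of a tiny lexer that recognizes braces, control words (with their optional parameter and delimiter space), control symbols and plain text. It is not the RtfParser implementation: the function name and the token array layout are made up, and it ignores \'xy escapes and binary data.

```php
<?php
// Standalone sketch of a tiny Rtf lexer. This is NOT the RtfParser
// implementation; the token array layout is made up for the example.
function rtf_tokenize(string $rtf): array
{
    $pattern = '/\\\\([a-zA-Z]+)(-?[0-9]+)? ?'  // control word, optional parameter, optional delimiter space
             . '|([{}])'                         // group opening or closing brace
             . '|\\\\([^a-zA-Z])'                // control symbol or escaped character
             . '|([^\\\\{}\r\n]+)/';             // plain text (line breaks are dropped)

    preg_match_all($pattern, $rtf, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        if (($m[5] ?? '') !== '') {
            $tokens[] = ['type' => 'text', 'value' => $m[5]];
        } elseif (($m[4] ?? '') !== '') {
            $tokens[] = ['type' => 'symbol', 'value' => $m[4]];
        } elseif (($m[3] ?? '') !== '') {
            $tokens[] = ['type' => 'brace', 'value' => $m[3]];
        } else {
            // The delimiter space was consumed by the pattern and is not kept.
            $param = ($m[2] ?? '') !== '' ? (int) $m[2] : null;
            $tokens[] = ['type' => 'word', 'value' => $m[1], 'param' => $param];
        }
    }

    return $tokens;
}

$tokens = rtf_tokenize('{\b beautiful} world');
echo implode(' ', array_column($tokens, 'type')), "\n";
```

For the sample string, the token types come out as `brace word text brace text`: the space after \b is absorbed by the control word, so the text token is exactly "beautiful".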
Note that some control words can be preceded by \* (such as \*\blipuid); the Rtf specification states that: "This control symbol identifies destinations whose related text should be ignored if the RTF reader does not recognize the destination control word."

Other kinds of data can of course appear within Rtf contents; three categories of data are defined:
(the ellipses are provided here as a shortcut to represent the actual Rtf data, which may really be longer than that).
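Binary data needs the most care, because its bytes may themselves look like Rtf syntax. In the Rtf specification it is introduced by the \binN control word, where N is the number of bytes that follow. The sketch below is standalone (the function name `read_bin_payload` is made up and nothing here is taken from the RtfParser source); it reads the payload by length instead of scanning for delimiters.

```php
<?php
// Standalone sketch: extract the payload of a \binN control word located
// at $offset. Not the library's code; read_bin_payload is a made-up name.
function read_bin_payload(string $rtf, int $offset = 0): ?string
{
    // \bin, a byte count, then the optional delimiter space.
    if (!preg_match('/\\\\bin([0-9]+) ?/A', $rtf, $m, 0, $offset)) {
        return null;
    }

    $start  = $offset + strlen($m[0]);
    $length = (int) $m[1];

    // Take exactly $length bytes: braces and backslashes inside are data.
    return substr($rtf, $start, $length);
}

var_dump(read_bin_payload('\bin6 12{}\6 more text') === '12{}\6');
```

Under that reading, a sequence such as `\bin6 12{}\6`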
represents binary data of 6 characters: "12{}\6".

The RtfParser class handles all these cases and will provide you with the appropriate token class inheriting from RtfToken.

# RtfToken classes

The Rtf*Token classes map tokens that are returned by the NextToken() method of the RtfParser class and provide the appropriate behavior for each token type. They all inherit from RtfToken, which provides a basic set of properties and methods common to all token types. Each derived class provides its own set of specific properties and methods.

The following paragraphs list the various object types that can be returned by the NextToken() method.

## RtfToken class

The RtfToken class is the abstract base class for all other Rtf token classes; it offers the following properties:
The following methods are available :
## RtfLeftBraceToken and RtfRightBraceToken classes

Implement an opening or closing brace in the Rtf flow. These classes do not provide additional properties or methods to the RtfToken class.

## RtfNewlineToken class

Implements an end of line found in the Rtf stream. This class is provided only for parsers that need to handle line-changing situations. It does not provide additional properties or methods to the RtfToken class.

## RtfControlSymbolToken class

Implements a control symbol such as \~, \- or \_. The ToText() method returns a space for \~ and a hyphen for \- and \_. For all other symbols, it returns the character as is.

## RtfControlWordToken class

Implements a control word. The following properties are available:
## RtfEscapedExpressionToken class

Implements an escaped special character (\\, \{ or \}). The ToText() method returns the escaped character itself, without the leading backslash.

## RtfEscapedCharacterToken class

Implements a character specified through its code in the current codepage, in the form \'xy. The following properties are available:
For example, with the character specification \'41, Char will be equal to "A", and Ord to the integer 65.

The ToText() method returns the escaped character itself, without the leading backslash.

## RtfPCDataToken class

Implements free-form text data. The ToText() method returns the textual data without any line breaks. This is required since some Rtf generators can arbitrarily break the text onto multiple lines; in such cases, line breaks in the Rtf flow are meaningless.

## RtfSData class

Implements a hexadecimal data string, such as those that can be found in \pict constructs.

## RtfBData class

Implements a binary data string.

# RtfParser class reference

## Properties

The following properties are available:
## Methods

### GetControlWordValue ( $word, $default = '' )

Retrieves the current parameter value of a control word that has been tracked using the TrackControlWord() method.

The best example I can give to explain the usefulness of this method comes from the RtfTexter class and concerns Unicode characters, which are specified by the \u tag as in the following example:
However, Unicode characters are followed by character symbols (using the "\'" tag) that give the closest match for the preceding Unicode character in the current code page:
The number of character symbols that follow a Unicode character specification is given by the \uc tag ; in the above example, it should be written like this :
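As an illustration of how the \uc count is applied, here is a standalone sketch that decodes \u characters and skips the fallback characters that follow them, keeping one \uc value per curly brace nesting level (default 1). It is not the RtfTexter or RtfParser code: the function name and the sample string are made up, code page conversion is left out, and surrogate pairs are not handled.

```php
<?php
// Standalone sketch, not the library's code: decode \uN characters and skip
// the fallback characters that follow, using one \uc value per brace level.
function rtf_unicode_text(string $rtf): string
{
    $pattern = '/\\\\uc([0-9]+) ?'            // 1: \ucN
             . '|\\\\u(-?[0-9]+) ?'           // 2: \uN
             . '|\\\\\'([0-9a-fA-F]{2})'      // 3: \'xy character symbol
             . '|([{}])'                      // 4: group delimiters
             . '|\\\\[a-zA-Z]+-?[0-9]* ?'     // any other control word (ignored)
             . '|\\\\.'                       // any other control symbol (ignored)
             . '|([^\\\\{}\r\n])/';           // 5: one plain character

    preg_match_all($pattern, $rtf, $matches, PREG_SET_ORDER);

    $ucStack = [1];   // the default \uc value is 1
    $skip    = 0;     // fallback characters still to be skipped
    $out     = '';

    foreach ($matches as $m) {
        if (($m[1] ?? '') !== '') {
            $ucStack[count($ucStack) - 1] = (int) $m[1];
        } elseif (($m[2] ?? '') !== '') {
            $code = (int) $m[2];
            if ($code < 0) {
                $code += 65536;               // \u takes a signed 16-bit value
            }
            $out .= html_entity_decode("&#$code;", ENT_NOQUOTES, 'UTF-8');
            $skip = end($ucStack);
        } elseif (($m[3] ?? '') !== '') {
            if ($skip > 0) {
                $skip--;                      // fallback for the previous \u
            } else {
                $out .= chr(hexdec($m[3]));   // raw byte; code page conversion is out of scope
            }
        } elseif (($m[4] ?? '') === '{') {
            $ucStack[] = end($ucStack);       // inner group inherits the value
        } elseif (($m[4] ?? '') === '}') {
            if (count($ucStack) > 1) {
                array_pop($ucStack);          // restore the outer value
            }
        } elseif (($m[5] ?? '') !== '') {
            if ($skip > 0) {
                $skip--;
            } else {
                $out .= $m[5];
            }
        }
    }

    return $out;
}

echo rtf_unicode_text('\uc0{\uc1\u233?}\u233!'), "\n";
```

In the sample string, the `?` is dropped inside the group, where \uc is 1, while the `!` is kept outside it, where \uc is 0, so the call prints `éé!`.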
However, the specification states that this number (the parameter of the \uc tag) should be tracked and that a stack of applicable values should be maintained, one per curly brace nesting level (the \uc tag may be present elsewhere in the document, not necessarily right before Unicode character specifications, and its default value is 1).

In this case, GetControlWordValue ( "uc" ) returns the parameter of the \uc control word that is applicable to the current nesting level.

Note that the TrackControlWord() method must be called to track the control word "uc" before any parsing occurs for this method to work.

If no value is applicable to the current nesting level, the one supplied by the $default parameter is returned.

### IgnoreCompounds ( $list )

Ignores the "compound" control words given in the array specified by the $list parameter.

Although not explicitly stated in the Rtf specification, some control words, such as \pict, need to be specified in a group enclosed within curly braces, and accept PCDATA or SDATA.

Depending on your parsing needs, you may not be interested in retrieving the contents of control words such as \stylesheet or \fonttbl (this is the case, for example, of the RtfTexter class, which ignores several compound words that make no sense in the context of its activity).

In this case, simply call this method before the first call to NextToken() to make it ignore the specified words; for example:
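The call itself is presumably a one-liner such as `$parser->IgnoreCompounds(array('stylesheet', 'fonttbl'))`; passing the names without the leading backslash is an assumption, made by analogy with the "uc" examples above. To show what ignoring a compound involves, the standalone sketch below drops whole groups that start with one of the given control words. It is not the library's code; both function names are made up.

```php
<?php
// Standalone sketch, not the library's code: drop whole groups that start
// with one of the given control words. \binN payloads are not handled.

// Return the offset just past the group that opens at $start.
function skip_group(string $rtf, int $start): int
{
    $depth  = 0;
    $length = strlen($rtf);

    for ($i = $start; $i < $length; $i++) {
        $ch = $rtf[$i];
        if ($ch === '\\') {
            $i++;                       // skip the next character: covers \{ \} and \\
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}' && --$depth === 0) {
            return $i + 1;
        }
    }

    return $length;                     // unbalanced input: consume everything
}

function strip_compounds(string $rtf, array $words): string
{
    $names   = implode('|', array_map('preg_quote', $words));
    // "{", an optional "\*" prefix, then "\word" not followed by another letter.
    $pattern = '/\{(?:\\\\\*)?\\\\(?:' . $names . ')(?![a-zA-Z])/A';

    $out    = '';
    $i      = 0;
    $length = strlen($rtf);

    while ($i < $length) {
        if ($rtf[$i] === '{' && preg_match($pattern, $rtf, $m, 0, $i)) {
            $i = skip_group($rtf, $i);  // ignore the whole compound
        } elseif ($rtf[$i] === '\\' && $i + 1 < $length) {
            $out .= $rtf[$i] . $rtf[$i + 1];   // copy escapes as a unit
            $i += 2;
        } else {
            $out .= $rtf[$i++];
        }
    }

    return $out;
}

$rtf = '{\rtf1{\fonttbl{\f0 Arial;}}{\stylesheet{\s0 Normal;}}Hello}';
echo strip_compounds($rtf, ['fonttbl', 'stylesheet']), "\n";
```

For the sample document, this prints `{\rtf1Hello}`: both compounds disappear together with their nested groups, and the rest is left untouched.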
### NextToken ( )

Returns the next token available in the underlying Rtf stream, in the form of an object deriving from the RtfToken abstract class.

Note that the method silently ignores all control words that have been specified to the IgnoreCompounds() method.

### Reset ( )

Resets the parser and puts it in a state where the NextToken() method can be called again to start parsing at the beginning of the Rtf flow.

### SkipCompound ( )

Initially meant to skip compound control words, this method actually processes incoming characters until the current nesting level is closed by a right curly brace (}).

### TrackControlWord ( $word, $stackable, $default_value = false )

Tracks a control word specification in the current Rtf document. This makes it possible, for example, to associate raw data with a control word, such as for the "\pict" tags. The parameters are the following: