Classes of Christian Vigh | PHP Search Large Files | README.md | Download |
# INTRODUCTION

The SearchableFile class is designed to let you handle text files which __do not fit into memory__, and will never fit into memory, whatever your php.ini settings are.

With this class, you can operate on file contents and analyze them the way any lexer or parser would. It has been designed to minimize, as much as possible, the overhead of performing file IO to get the real data instead of working directly in memory.

The initial motivation for this class was to be able to handle big RTF contents while preserving performance. However, as you will see, it is completely independent of the underlying file format.

# Using the SearchableFile class

The SearchableFile class can be seen as a kind of wrapper around a text file opened in read-only mode. It won't allow you to perform in-place modifications, since it is aimed at reading text streams, analyzing their contents and optionally performing some modifications on the fly, then finally writing them to some output stream.

Creating a SearchableFile object is pretty simple :
You can also specify a block size for IO operations (the default is 16 KB), as well as a cache size, expressed as a number of records :
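To give an idea of what these two settings control, here is a minimal sketch of a block reader with a small LRU cache. It is not the SearchableFile implementation: the `CachedBlockReader` class, its `block()` method and its `diskReads` counter are hypothetical names used only for illustration.

```php
<?php
// Hypothetical helper, NOT part of the SearchableFile class.
final class CachedBlockReader
{
    private $handle;
    private $blockSize;
    private $cacheSize;
    private $cache = [];      // block number => contents, least recently used first
    public $diskReads = 0;    // number of blocks actually read from the file

    public function __construct(string $path, int $blockSize = 16384, int $cacheSize = 8)
    {
        $this->handle = fopen($path, 'rb');
        if ($this->handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $this->blockSize = $blockSize;
        $this->cacheSize = $cacheSize;
    }

    public function block(int $number): string
    {
        if (isset($this->cache[$number])) {
            // Cache hit: move the entry to the end (most recently used).
            $data = $this->cache[$number];
            unset($this->cache[$number]);
            $this->cache[$number] = $data;
            return $data;
        }
        // Cache miss: read the block from the file.
        fseek($this->handle, $number * $this->blockSize);
        $data = (string) fread($this->handle, $this->blockSize);
        $this->diskReads++;
        $this->cache[$number] = $data;
        if (count($this->cache) > $this->cacheSize) {
            // Evict the least recently used entry, which is the first key.
            reset($this->cache);
            unset($this->cache[key($this->cache)]);
        }
        return $data;
    }
}
```

A larger block size means fewer reads per megabyte scanned, while a larger cache helps when the same regions of the file are revisited.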
Then you will have to open a file :
Once created, you can use any of the search functions that the class has to offer ; the following example will look for the first character in the set [\\{}] :
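To illustrate what such a search involves, here is a minimal, self-contained sketch that does not use the SearchableFile class: it scans a file block by block and returns the offset of the first character belonging to a set. The function name `find_first_of` and its parameters are assumptions made for this example.

```php
<?php
/**
 * Returns the byte offset of the first character of $cset found at or after
 * $offset in the file, or false if there is none.
 * Hypothetical helper, NOT the SearchableFile implementation.
 */
function find_first_of(string $path, string $cset, int $offset = 0, int $blockSize = 16384)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    fseek($handle, $offset);
    $position = $offset;   // file offset of the first byte of the current block
    $found = false;
    while (!feof($handle)) {
        $block = fread($handle, $blockSize);
        if ($block === false || $block === '') {
            break;
        }
        // strcspn() gives the length of the leading run containing no character of $cset.
        $skip = strcspn($block, $cset);
        if ($skip < strlen($block)) {
            $found = $position + $skip;
            break;
        }
        $position += strlen($block);
    }
    fclose($handle);
    return $found;
}

// Possible usage, with the file name used by the examples of this package:
// $pos = find_first_of('verybigfile.rtf', '\\{}');
```

Since the search looks for single characters, a match can never straddle two blocks, so no overlap between consecutive reads is needed.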
Or you can just extract a substring from your file :
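As a rough idea of the underlying operation, the following sketch extracts a substring directly from a file with plain PHP stream functions. It is a simplification: the `file_substr` name is hypothetical, and negative `$start` or `$length` values, which the builtin substr() accepts, are simply rejected here.

```php
<?php
/**
 * Reads $length bytes starting at byte offset $start of the file.
 * Hypothetical helper, NOT the SearchableFile implementation.
 */
function file_substr(string $path, int $start, int $length)
{
    if ($start < 0 || $length < 0) {
        return false;   // negative values are not handled by this sketch
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $size = fstat($handle)['size'];
    if ($start >= $size) {
        fclose($handle);
        return false;   // start offset is beyond the end of the file
    }
    $data = '';
    if ($length > 0) {
        fseek($handle, $start);
        // fread() may return fewer bytes than requested, so loop until done.
        while (strlen($data) < $length && !feof($handle)) {
            $chunk = fread($handle, $length - strlen($data));
            if ($chunk === false || $chunk === '') {
                break;
            }
            $data .= $chunk;
        }
    }
    fclose($handle);
    return $data;
}
```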
You can also use the object as an array, to access individual characters :
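Array-style access to a file can be obtained in PHP by implementing the ArrayAccess interface. The sketch below shows the general idea with a hypothetical `FileBytes` class; it is not the SearchableFile code, and it performs one seek and one read per character, without any caching.

```php
<?php
// Hypothetical read-only wrapper, NOT the SearchableFile class.
final class FileBytes implements ArrayAccess
{
    private $handle;
    private $size;

    public function __construct(string $path)
    {
        $this->handle = fopen($path, 'rb');
        if ($this->handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $this->size = fstat($this->handle)['size'];
    }

    public function offsetExists($offset): bool
    {
        return is_int($offset) && $offset >= 0 && $offset < $this->size;
    }

    public function offsetGet($offset): string
    {
        if (!$this->offsetExists($offset)) {
            throw new OutOfRangeException('Invalid file offset');
        }
        fseek($this->handle, $offset);
        return (string) fread($this->handle, 1);
    }

    public function offsetSet($offset, $value): void
    {
        throw new LogicException('The file is read-only');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('The file is read-only');
    }
}

// Possible usage:
// $bytes = new FileBytes('verybigfile.rtf');
// echo $bytes[0];
```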
or cycle through file contents using an iterator :
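Character-by-character iteration over a large file can be sketched with a generator that reads one block at a time, so that only a single block is ever held in memory. The `file_characters` function below is a hypothetical illustration, not the iterator provided by the class.

```php
<?php
/**
 * Yields every byte of the file, keyed by its offset.
 * Hypothetical helper, NOT the SearchableFile iterator.
 */
function file_characters(string $path, int $blockSize = 16384): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $offset = 0;   // file offset of the first byte of the current block
        while (!feof($handle)) {
            $block = fread($handle, $blockSize);
            if ($block === false || $block === '') {
                break;
            }
            $length = strlen($block);
            for ($i = 0; $i < $length; $i++) {
                yield $offset + $i => $block[$i];
            }
            $offset += $length;
        }
    } finally {
        fclose($handle);
    }
}

// Possible usage:
// foreach (file_characters('verybigfile.rtf') as $offset => $ch) { /* ... */ }
```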
However, please note that such constructs should be used with care, since PHP will be terribly slow at doing that.

You can also use the equivalent of the preg\_match() and preg\_match\_all() builtin PHP functions ; the following example tries to find the first occurrence of the "\pict" or "\bin" strings :
while the following will try to find all the occurrences of those strings :
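The sketch below shows how match offsets could be collected from the pcregrep command, on which these functions rely. It assumes that pcregrep is available on the PATH and that its --file-offsets option prints one `offset,length` pair per line; the `pcregrep_offsets` and `parse_file_offsets` functions are hypothetical names, and the way the class itself invokes the command may differ.

```php
<?php
/**
 * Converts "offset,length" lines into an array of matches.
 * Lines that do not have this shape are ignored.
 */
function parse_file_offsets(array $lines): array
{
    $matches = [];
    foreach ($lines as $line) {
        if (preg_match('/^(\d+),(\d+)$/', trim($line), $parts)) {
            $matches[] = ['offset' => (int) $parts[1], 'length' => (int) $parts[2]];
        }
    }
    return $matches;
}

/**
 * Runs pcregrep on the file and returns the list of matches,
 * or false if the command failed (assumption: pcregrep is installed).
 */
function pcregrep_offsets(string $pattern, string $path)
{
    $command = 'pcregrep --file-offsets -e ' . escapeshellarg($pattern)
             . ' ' . escapeshellarg($path);
    exec($command, $lines, $status);
    // Exit status 0 means a match was found and 1 means no match;
    // anything else is treated as an error.
    if ($status > 1) {
        return false;
    }
    return parse_file_offsets($lines);
}

// Possible usage: all occurrences of "\pict" or "\bin"
// $matches = pcregrep_offsets('\\\\(pict|bin)', 'verybigfile.rtf');
```

Once an offset and a length are known, the matched text itself can be read from the file, which avoids loading the whole contents in memory.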
Please have a look at the "Making the pcre functions work" section later in this document, because there are some restrictions on using them.

## Making the examples work

All the examples provided with this package use a file named "verybigfile.rtf" and assume that this is an RTF file which contains embedded pictures and drawing objects. They should be run as command-line scripts.

I won't pollute this repository by providing a useless data file of almost 1 GB, but you can recreate it very easily :
Under Unix systems, you can do it this way :
The same, using MS-DOS commands on Windows systems :
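A portable alternative is to build the file from PHP itself. The sketch below assumes that repeating a smaller sample file is good enough for test purposes; the `build_big_file` function and the "sample.rtf" file name are hypothetical and chosen only for this example.

```php
<?php
/**
 * Writes $copies consecutive copies of $source into $target and
 * returns the number of bytes written.
 * Hypothetical helper, not part of the package.
 */
function build_big_file(string $source, string $target, int $copies): int
{
    $out = fopen($target, 'wb');
    if ($out === false) {
        throw new RuntimeException("Cannot create $target");
    }
    $written = 0;
    for ($i = 0; $i < $copies; $i++) {
        $in = fopen($source, 'rb');
        if ($in === false) {
            fclose($out);
            throw new RuntimeException("Cannot open $source");
        }
        // Streams the source into the target without loading it in memory.
        $copied = stream_copy_to_stream($in, $out);
        fclose($in);
        if ($copied === false) {
            fclose($out);
            throw new RuntimeException("Copy failed");
        }
        $written += $copied;
    }
    fclose($out);
    return $written;
}

// Possible usage: repeat a 10 MB sample a hundred times
// build_big_file('sample.rtf', 'verybigfile.rtf', 100);
```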
Most of the examples test the SearchableFile functions and try to compare their timing with the same method using in-memory data (load the file contents using file\_get\_contents(), then use PHP builtin functions to achieve the same goal). For this reason, you should ensure that :
## Making the pcre functions work

The PCRE functions provided by the SearchableFile class are pcre\_match() and pcre\_match\_all(). There is no magic in them: they simply rely on an external command, pcregrep, which is not included in standard Linux distributions. To install it :
If none of the above conditions are met, then you will not be able to use the pcre functions.

# SearchableFile API

The following sections describe the SearchableFile methods and properties.

## Methods

### Constructor
Creates a searchable file object.

Since the class uses direct IO to access chunks of data, an optional block size can be specified (the default is 16 KB). Note that, at least on Windows systems, ideal block sizes range between 16 and 64 KB. Below or above that, performance seems to degrade.

The $cache_size parameter indicates how many file records should be kept in cache for later retrieval. This is a naive LRU cache.

### Destructor

The destructor of the SearchableFile class closes the file, if it is still open.

### Close
Closes the searchable file, if it was opened. No exception is thrown if the file was not open.

### Open
Opens the specified file. Throws an exception if the file was already opened or could not be opened.

### multistrpos, multistripos
These functions behave like the PHP standard strpos() and stripos() functions, but can be used to find the first occurrence of any string from a set of searched strings. The parameters are the following :
Returns either the byte offset of the first occurrence found of one of the strings in the $searched\_strings array, or false if none of them was found in the file.

### pcre\_match
pcre_match() tries to behave like the builtin preg_match() function, but operates on a file rather than in memory. To achieve that, it uses the pcregrep Linux command with the --file-offsets option to extract match offsets. The meaning of the parameters is the following :
The function returns false if some error occurred (the starting offset is beyond the end of the file, or the search pattern is incorrect) ; otherwise the number of matches is returned (0 or 1).

Notes :

### pcre\_match\_all
pcre\_match\_all() tries to behave like preg\_match\_all(), but operates on a file rather than in memory. To achieve that, it uses the pcregrep Linux command with the --file-offsets option to extract match offsets.

It returns false if some error occurred (the starting offset is beyond the end of the file, or the search pattern is incorrect, or an individual preg\_match() on one of the sub-results failed for some reason) ; otherwise the number of matches is returned.

### strchr
Finds the offset of the first character belonging to $cset. The $offset parameter indicates where the search is to be started.

Returns either the byte offset of the first character found belonging to $cset, or false if no more characters from $cset are present in the file.

Unlike the useless PHP strchr() function, which returns a substring starting with the searched character or string, this method works much more like the C strchr() function, which returns a pointer to the found character: it returns the offset in the file of the searched character(s).

### strpos, stripos
Behave like the PHP standard strpos() and stripos() functions. The parameters are the following :
Returns either the byte offset of a found occurrence of $searched_string, or false if the string was not found in the file.

### substr
Extracts a substring from the searchable file. The $start and $length parameters have the same meaning as for the PHP builtin substr() function. Returns the specified substring, or false if one of the following conditions occurs :
An empty string is returned if $length has been specified and is zero (i.e. 0, false or null).

### Write
When processing large files, you sometimes need to write (copy) unmodified data from the input file to some output file. This is the purpose of the Write method, which takes the following parameters :
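The kind of copy described here can be sketched with plain PHP stream functions. The `copy_range` function below and its parameter list are assumptions made for this example and do not reflect the actual signature of the Write method.

```php
<?php
/**
 * Copies up to $length bytes, starting at byte offset $offset of the input
 * file, to the already opened $output stream. Returns the number of bytes copied.
 * Hypothetical helper, NOT the SearchableFile::Write method.
 */
function copy_range(string $inputPath, $output, int $offset, int $length, int $blockSize = 16384): int
{
    $input = fopen($inputPath, 'rb');
    if ($input === false) {
        throw new RuntimeException("Cannot open $inputPath");
    }
    fseek($input, $offset);
    $copied = 0;
    while ($copied < $length && !feof($input)) {
        // Never read more than what remains to be copied.
        $chunk = fread($input, min($blockSize, $length - $copied));
        if ($chunk === false || $chunk === '') {
            break;
        }
        fwrite($output, $chunk);
        $copied += strlen($chunk);
    }
    fclose($input);
    return $copied;
}
```

Copying block by block keeps memory usage bounded by the block size, whatever the size of the range.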
## Properties

### Filename

Gets the underlying filename.

### RecordSize

Gets/sets the read buffer size. Note that if the record size is modified, it will only take effect on the next call to the Open() method.

### FileSize

Gets the underlying file size.