Example for using FilePatternSplitter
max_und_moritz.txt contains a plain-text version of Max und Moritz, a famous German illustrated story.
For some purposes, it may be useful to have each chapter in its own file.
Luckily, each chapter heading starts with an underscore in this file, so
we issue:
$ php FilePatternSplitter.php split max_und_moritz.txt '/^_/'
./fps00001_max_und_moritz.txt
./fps00002_max_und_moritz.txt
./fps00003_max_und_moritz.txt
./fps00004_max_und_moritz.txt
./fps00005_max_und_moritz.txt
./fps00006_max_und_moritz.txt
./fps00007_max_und_moritz.txt
./fps00008_max_und_moritz.txt
./fps00009_max_und_moritz.txt
./fps00010_max_und_moritz.txt
When we check the first line of each file, we see that every chapter has been put into its own file and separated from the Gutenberg preamble:
$ head -n 1 fps*
==> fps00001_max_und_moritz.txt <==
The Project Gutenberg EBook of Max und Moritz, by Wilhelm Busch
==> fps00002_max_und_moritz.txt <==
_VORWORT._
==> fps00003_max_und_moritz.txt <==
_Erster Streich._
==> fps00004_max_und_moritz.txt <==
_Zweiter Streich._
==> fps00005_max_und_moritz.txt <==
_Dritter Streich._
==> fps00006_max_und_moritz.txt <==
_Vierter Streich._
==> fps00007_max_und_moritz.txt <==
_Fünfter Streich._
==> fps00008_max_und_moritz.txt <==
_Sechster Streich._
==> fps00009_max_und_moritz.txt <==
_Letzter Streich._
==> fps00010_max_und_moritz.txt <==
_SCHLUSS._
However, the last file contains not only the final chapter but also the Gutenberg license. To split that off as well, we simply add a second pattern:
$ php FilePatternSplitter.php split max_und_moritz.txt '/^_/' '/^End of the/'
./fps00001_max_und_moritz.txt
./fps00002_max_und_moritz.txt
./fps00003_max_und_moritz.txt
./fps00004_max_und_moritz.txt
./fps00005_max_und_moritz.txt
./fps00006_max_und_moritz.txt
./fps00007_max_und_moritz.txt
./fps00008_max_und_moritz.txt
./fps00009_max_und_moritz.txt
./fps00010_max_und_moritz.txt
./fps00011_max_und_moritz.txt
$ head -n 1 fps*
==> fps00001_max_und_moritz.txt <==
The Project Gutenberg EBook of Max und Moritz, by Wilhelm Busch
==> fps00002_max_und_moritz.txt <==
_VORWORT._
==> fps00003_max_und_moritz.txt <==
_Erster Streich._
==> fps00004_max_und_moritz.txt <==
_Zweiter Streich._
==> fps00005_max_und_moritz.txt <==
_Dritter Streich._
==> fps00006_max_und_moritz.txt <==
_Vierter Streich._
==> fps00007_max_und_moritz.txt <==
_Fünfter Streich._
==> fps00008_max_und_moritz.txt <==
_Sechster Streich._
==> fps00009_max_und_moritz.txt <==
_Letzter Streich._
==> fps00010_max_und_moritz.txt <==
_SCHLUSS._
==> fps00011_max_und_moritz.txt <==
End of the Project Gutenberg EBook of
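The splitting rule used above (start a new file at every line that matches any of the given patterns) can be sketched in a few lines of Python. This is only an illustration of the idea, not the code inside FilePatternSplitter.php: the function name split_on_patterns and the sample lines are made up, the patterns are written without the PHP delimiters, and the choice not to emit an empty first piece when the very first line matches is an assumption.

```python
import re


def split_on_patterns(lines, patterns):
    """Start a new chunk at every line that matches any of the patterns."""
    compiled = [re.compile(p) for p in patterns]
    chunks = [[]]
    for line in lines:
        # Assumption: a match on the very first line does not create
        # an empty leading chunk.
        if any(rx.search(line) for rx in compiled) and chunks[-1]:
            chunks.append([])
        chunks[-1].append(line)
    return chunks


# Made-up sample input, loosely shaped like the file in the example.
sample = [
    "Preamble\n",
    "_Erster Streich._\n",
    "text\n",
    "_SCHLUSS._\n",
    "more text\n",
    "End of the book\n",
    "license\n",
]

chunks = split_on_patterns(sample, [r"^_", r"^End of the"])
for number, chunk in enumerate(chunks, start=1):
    # Print the first line of each chunk, similar to `head -n 1`.
    print(f"fps{number:05d}: {chunk[0].rstrip()}")
```

With both patterns, the sample yields four chunks: the preamble, two chapters, and the trailing license part. Concatenating the chunks in order gives back the input unchanged.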
To verify that the split lost or altered nothing, we can merge the files again:
$ php FilePatternSplitter.php merge .
merged into max_und_moritz.txt.merged
and then check against the original:
$ diff -s max_und_moritz.txt max_und_moritz.txt.merged
Files max_und_moritz.txt and max_und_moritz.txt.merged are identical
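The same round-trip check can be sketched in Python: concatenate the pieces in name order and compare the result byte for byte with the original. This is a minimal sketch under stated assumptions, not the tool's merge implementation; merge_pieces, the file name sample.txt, and the hand-made pieces are hypothetical, and the only thing taken from the example above is the fps00001_... naming scheme.

```python
import filecmp
import tempfile
from pathlib import Path


def merge_pieces(directory, target):
    """Concatenate all fps*-prefixed files in name order into target."""
    # Zero-padded numbers make the name order equal the numeric order.
    pieces = sorted(Path(directory).glob("fps*"))
    with open(target, "wb") as out:
        for piece in pieces:
            out.write(piece.read_bytes())
    return pieces


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "sample.txt"
    original.write_bytes(b"Preamble\n_Eins._\ntext\n_Zwei._\nmore\n")

    # Hand-made pieces standing in for the splitter's output.
    (tmp / "fps00001_sample.txt").write_bytes(b"Preamble\n")
    (tmp / "fps00002_sample.txt").write_bytes(b"_Eins._\ntext\n")
    (tmp / "fps00003_sample.txt").write_bytes(b"_Zwei._\nmore\n")

    merged = tmp / "sample.txt.merged"
    merge_pieces(tmp, merged)

    # shallow=False forces a comparison of the file contents.
    identical = filecmp.cmp(original, merged, shallow=False)
    print("identical" if identical else "different")
```

Comparing bytes rather than decoded text also catches differences in line endings or encoding, which is what diff -s reports on in the example above.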