Philipp Strazny - 2012-06-17 19:34:44 -
In reply to message 1 from Richard Keizer
Hi Richard,
thank you for the question.
For testing, I ran book-length texts such as the plain-text version of Thomas Mann's Königliche Hoheit, downloaded from gutenberg.org.
The text contains ca. 8500 lines with about 100,000 words and 750,000 characters. The figures below show how long it takes to load all strings into the DB, what size the DB is afterwards, and how long it takes to look up 8 slightly modified items and compute their scores (on a laptop with a 1.8 GHz processor and 2 GB RAM):
CharsHeuristic:
  load time: 112 s
  DB size: 20 MB
  lookup of 8 items: 1.7 s
  average score: 95.25%
WordHeuristic:
  load time: 27 s
  DB size: 6.2 MB
  lookup of 8 items: 0.1 s
  average score: 93%
The other heuristics fall somewhere in between.
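For reference, a measurement like the one above can be sketched as follows. This is only an illustration: `ToyIndex` is a hypothetical stand-in for the library's real index classes (CharsHeuristic, WordHeuristic, etc.), and `difflib.SequenceMatcher` is used as a placeholder scoring function, not the library's actual scorer.

```python
import difflib
import time

class ToyIndex:
    """Hypothetical stand-in index; only sketches how the load-time,
    lookup-time, and average-score figures above were measured."""

    def __init__(self):
        self.lines = []

    def load(self, lines):
        self.lines.extend(lines)

    def lookup(self, query):
        # Best match by difflib similarity ratio (placeholder scorer).
        best = max(self.lines,
                   key=lambda l: difflib.SequenceMatcher(None, query, l).ratio())
        score = difflib.SequenceMatcher(None, query, best).ratio()
        return best, score

corpus = ["the quick brown fox", "jumps over the lazy dog", "hello world"]
idx = ToyIndex()

t0 = time.perf_counter()
idx.load(corpus)
load_time = time.perf_counter() - t0

queries = ["the quick brown fx", "helo world"]  # slightly modified items
t0 = time.perf_counter()
scores = [idx.lookup(q)[1] for q in queries]
lookup_time = time.perf_counter() - t0

print(f"load: {load_time:.4f} s  "
      f"lookup: {lookup_time:.4f} s  "
      f"avg score: {100 * sum(scores) / len(scores):.1f}%")
```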
I have not run tests with very large collections as you suggest, because my guess is that loading and searching an index for 3 million items of 255 characters each would be far too slow to be practical. Depending on the heuristic you use, the number of entries in the database grows dramatically: CharsHeuristic, for example, could store up to 750 entries for a single 255-character string. How this plays out depends on the kind of data you are loading.
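To illustrate where a number on that order can come from, here is a generic overlapping character n-gram count (a sketch only: the exact keys CharsHeuristic stores, and the assumed n-gram lengths 2 through 4, are my guesses, not the library's actual scheme):

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of text: len(text) - n + 1 of them."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

s = "x" * 255  # a single 255-character string

# One index entry per n-gram, assuming n-grams of lengths 2, 3 and 4 are stored:
entries = sum(len(char_ngrams(s, n)) for n in (2, 3, 4))
print(entries)  # 254 + 253 + 252 = 759, on the order of the ~750 mentioned
```

By contrast, a word-based heuristic stores roughly one entry per word, which is why WordHeuristic's database stays much smaller.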