Yakim - 2013-08-09 18:22:43
The tokenizer should preserve hyphenated words, because the individual parts of a compound are often unrelated to (not synonymous with) the compound as a whole.
In tag() and stopwords(), the class does this:
// Replace all non-word chars with comma
$pattern = '/[0-9\W]/';
That pattern strips digits (and all punctuation) outright. Consider these words:
Windows98
IPv4
802.11n
At a minimum, the tokenizer should preserve "words" containing mixed numeric and alphabetic characters. And to handle a technical text corpus, it really should also preserve "words" containing mid-word dot characters.
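Something like the following would do it (a sketch only; tokenize() is a hypothetical name, not code from your class): match word-like runs instead of deleting characters, so mixed alphanumeric, dotted, and hyphenated tokens survive intact.

// Hypothetical sketch: \w runs optionally joined by single '.' or '-' chars.
function tokenize($text) {
    preg_match_all('/\w+(?:[.\-]\w+)*/u', $text, $matches);
    return $matches[0];
}
// tokenize('Windows98 and 802.11n are hyphen-safe')
// => ['Windows98', 'and', '802.11n', 'are', 'hyphen-safe']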
===============
Quoting from github.com/vlogus/gistfy:
Associate words with their grammatical counterparts. (e.g. "city" and "cities")
^--- The code within your class mixes several concepts -- stemming, part-of-speech weighting, and frequency weighting. Before (or in addition to) performing these steps, and certainly before moving on to the next step ("calculate the occurrence"), it really should discard true hardcoded, predefined, language-specific noisewords; for English text, a noiseword list contains up to 450 elements. (See the filtering sketch at the end of this comment.)
Calculate the occurrence of each word in the text.
Assign each word with points depending on their popularity.
Detect which periods represent the end of a sentence.
(e.g "Mr." does not).
Split up the text into individual sentences.
Rank sentences by the sum of their words' points (and LexRank points).
Return X of the most highly ranked sentences in chronological order.
^--- Respectfully, the gist() class method does not seem to fulfill this stated goal. Its delimiting constraint is character count, regardless of whether "X of the most highly ranked sentences" actually fit. Instead, gist() should accept the sentence count as an argument, return that many sentences, and leave any truncation of the returned string to the caller.
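For example, something along these lines (a sketch under my own assumptions about the data shape; the signature and parameter names are hypothetical, not your actual API):

// Assumes $rankedSentences is sorted best-first and keyed by each
// sentence's original position in the source text.
function gist(array $rankedSentences, $count) {
    $top = array_slice($rankedSentences, 0, $count, true); // keep position keys
    ksort($top); // restore chronological (document) order
    return implode(' ', $top);
}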
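And regarding the noiseword point above, a minimal sketch of the kind of filtering I mean (filterNoisewords() and the abbreviated list are hypothetical names, not your class's API):

function filterNoisewords(array $tokens, array $noisewords) {
    $set = array_flip(array_map('strtolower', $noisewords));
    return array_values(array_filter($tokens, function ($token) use ($set) {
        return !isset($set[strtolower($token)]);
    }));
}
// Abbreviated list for illustration; a real English list runs to ~450 entries.
$noisewords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it');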