Yakim - 2013-08-09 18:22:43
The tokenizer should preserve hyphenated words, because the individual parts of a compound are often unrelated to (not synonymous with) the compound as a whole.
In tag() and stopwords(), the class does this:
// Replace all non-word chars with comma
$pattern = '/[0-9\W]/';
That pattern strips digits (and all punctuation) outright. Consider these words:
Windows98
IPv4
802.11n
At a minimum, the tokenizer should preserve "words" containing mixed numeric and alphabetic characters. And to handle a technical text corpus, it really should also preserve "words" containing mid-word dot characters.
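Something like the following would do it (a sketch only; tokenize() is a hypothetical name, not code from your class): match word-like runs instead of deleting characters, so mixed alphanumeric, dotted, and hyphenated tokens survive intact.

// Hypothetical sketch: \w runs optionally joined by single '.' or '-' chars.
function tokenize($text) {
    preg_match_all('/\w+(?:[.\-]\w+)*/u', $text, $matches);
    return $matches[0];
}
// tokenize('Windows98 and 802.11n are hyphen-safe')
// => ['Windows98', 'and', '802.11n', 'are', 'hyphen-safe']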
===============
Quoting from github.com/vlogus/gistfy:
Associate words with their grammatical counterparts. (e.g. "city" and "cities")
^--- The code within your class mixes several concepts -- stemming, part-of-speech weighting, and frequency weighting. Before (or in addition to) performing these steps, and certainly before moving on to the next step ("calculate the occurrence"), it really should discard true hardcoded, predefined, language-specific noisewords; for English text, a noiseword list contains up to 450 elements. (See the filtering sketch at the end of this comment.)
Calculate the occurrence of each word in the text.
Assign each word with points depending on their popularity.
Detect which periods represent the end of a sentence.
(e.g "Mr." does not).
Split up the text into individual sentences.
Rank sentences by the sum of their words' points (and LexRank points).
Return X of the most highly ranked sentences in chronological order.
^--- Respectfully, the gist() class method does not seem to fulfill this stated goal. Its delimiting constraint is character count, regardless of whether "X of the most highly ranked sentences" actually fit. Instead, gist() should accept the sentence count as an argument, return that many sentences, and leave any truncation of the returned string to the caller.
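For example, something along these lines (a sketch under my own assumptions about the data shape; the signature and parameter names are hypothetical, not your actual API):

// Assumes $rankedSentences is sorted best-first and keyed by each
// sentence's original position in the source text.
function gist(array $rankedSentences, $count) {
    $top = array_slice($rankedSentences, 0, $count, true); // keep position keys
    ksort($top); // restore chronological (document) order
    return implode(' ', $top);
}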
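And regarding the noiseword point above, a minimal sketch of the kind of filtering I mean (filterNoisewords() and the abbreviated list are hypothetical names, not your class's API):

function filterNoisewords(array $tokens, array $noisewords) {
    $set = array_flip(array_map('strtolower', $noisewords));
    return array_values(array_filter($tokens, function ($token) use ($set) {
        return !isset($set[strtolower($token)]);
    }));
}
// Abbreviated list for illustration; a real English list runs to ~450 entries.
$noisewords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it');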