PHP Word Frequency Analysis: Extract frequent terms of two or more words from a text

Last Updated:	2014-09-28 (2 years ago)
Ratings:	55%
Unique User Downloads:	Total: 360, This week: 3
Download Rankings:	All time: 6,524, This week: 377

Version:	`frequency-analyzer` 1
License:	GNU General Publi...
PHP version:	5
Categories:	Algorithms, PHP 5, Text processing

Description

This class can extract frequent terms of two or more words from a text.

It can parse a given text and extract terms made of individual words or multiple words.

A given list of words can be considered for exclusion.

Innovation Award

October 2014
Number 8

Prize: One copy of the Zend Studio

Analyzing how frequently words appear in a text is a useful method to determine the most relevant topics that the text covers.

This class provides an interesting solution to compute the frequency of expressions made of one or multiple words in a text document.

Manuel Lemos

Alejandro Mitrou

Name:	Alejandro Mitrou `<contact>`
Classes:	2 packages by Alejandro Mitrou
Country:	Argentina

Level 2

Innovation award

Nominee: 2x

Details

Frequent Terms Analyzer

This simple library discovers terms composed of two or more words which appear a significant number of times within a given text.


About the sample script
The provided sample script discovers compound terms once given an occurrence probability threshold. It also strips grammatical elements such as prepositions, articles, verbs and isolated letters from the beginning and end of each term. Please take into consideration that, to improve its accuracy, the library should be trained with more sample data on as many diverse topics as possible. This would improve the identification of common words that are generally unspecific and not part of a compound noun (e.g. has, been, a, the, this). All training texts were taken from Wikipedia, since its content is freely licensed.
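
The trimming step described above can be sketched as follows. This is only an illustration of the idea, not the library's code: the function names `trimTerm()` and `isNoise()`, the sample exclusion list, and the rule that a one-character word counts as an isolated letter are all assumptions made for this example (PHP 7 or later).

```php
<?php
// Hypothetical helpers, not part of the library.

// Returns true for an excluded word or an isolated letter.
// strlen() counts bytes, which is enough for plain ASCII words.
function isNoise(string $word, array $lookup): bool
{
    return strlen($word) < 2 || isset($lookup[strtolower($word)]);
}

// Strips noise words from the start and end of a term,
// leaving the words in the middle untouched.
function trimTerm(array $words, array $excluded): array
{
    // Flip into a lookup table so each membership test is a key check.
    $lookup = array_flip(array_map('strtolower', $excluded));

    // Drop noise words from the front.
    while ($words && isNoise($words[0], $lookup)) {
        array_shift($words);
    }
    // Drop noise words from the back.
    while ($words && isNoise($words[count($words) - 1], $lookup)) {
        array_pop($words);
    }
    return $words;
}

$excluded = ['the', 'of', 'a', 'has', 'been'];
$term = trimTerm(['the', 'city', 'of', 'New', 'York', 'has'], $excluded);
echo implode(' ', $term), "\n"; // prints "city of New York"
```

Note that "of" survives because it sits inside the term; only the edges are cleaned.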


Constructor and public methods
1. public function __construct(&$termsArray, &$excludedWords = array())
Receives, by reference, the array containing each word of the text as an element. It can also receive a list of words to strip from both ends of each term, much like the trim() function strips characters from both ends of a string.

2. public function getFrequentWords($threshold = 0.01)
Obtains a list of frequent single-word terms. By default, a word must account for at least 1% of the text to be considered frequent. You might need to fine-tune the threshold value according to your available data.

3. public function getCompoundTerms($threshold = 0.001)
Obtains a list of compound terms, each with as many words as the library is able to find while the term still appears in at least 0.1% of the text (the default threshold).
Once again, you might need to fine-tune the threshold value according to your available data.
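
To make the role of the threshold more concrete, here is a minimal sketch of threshold-based term counting. It is not the library's implementation: `tokenize()` and `frequentNgrams()` are hypothetical names, the tokenization rule (lowercase, split on anything that is not an ASCII letter or digit) is an assumption, and "present in X% of the text" is read here as "accounts for at least that share of all word sequences of the same length". It requires PHP 7 or later.

```php
<?php
// Illustrative stand-ins, NOT the library's implementation.

// Splits a text into lowercase words (assumed tokenization rule).
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Counts every run of $n consecutive words and keeps the runs whose
// share of all such runs reaches $threshold (0.01 means 1%).
function frequentNgrams(array $words, int $n, float $threshold): array
{
    $total = count($words) - $n + 1; // number of runs of length $n
    if ($total < 1) {
        return [];
    }
    $counts = [];
    for ($i = 0; $i < $total; $i++) {
        $key = implode(' ', array_slice($words, $i, $n));
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    $frequent = [];
    foreach ($counts as $term => $count) {
        if ($count / $total >= $threshold) {
            $frequent[$term] = $count;
        }
    }
    arsort($frequent); // most frequent first
    return $frequent;
}

$words = tokenize('Social media is big. Social media is new. Media is social.');
print_r(frequentNgrams($words, 1, 0.2));  // single words
print_r(frequentNgrams($words, 2, 0.15)); // two-word terms
```

With a text this short the thresholds have to be far higher than the 1% and 0.1% defaults; on real documents, lower values like those become meaningful because the total number of words is much larger.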

Further development
This first version solves the proposed problem and it's good enough to serve my current needs with an acceptable execution time. That said, there are at least a couple of things to improve:
1. Even though I pass big data by reference, the algorithm uses a huge amount of memory compared to the original data size. This is mainly caused by the use of arrays to hold each word as an element, which in PHP consumes quite a lot of memory. A better solution could be to traverse the source string instead of storing single words in an array.
2. Once two-word common terms have been discovered, the method analysis1() traverses the original data once again in search of three-word common terms, without considering previous results. Initially I planned to use the collected information incrementally during the process, but this will require a future release to be completed.
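
Both improvement ideas can be sketched in a few lines. This is a hypothetical illustration rather than a planned implementation: `streamWords()` and `extendPairs()` are invented names, the word pattern is an assumption, and the pruning relies on the observation that a three-word term cannot occur more often than its leading two-word term. It requires PHP 7.1 or later.

```php
<?php
// Hypothetical sketch; none of these names exist in the library.

// Idea 1: yield words one at a time straight from the source string,
// so an array holding every word is never built.
function streamWords(string $text): Generator
{
    $offset = 0;
    while (preg_match('/[a-z0-9]+/i', $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
        yield strtolower($m[0][0]);
        // Continue searching right after the word just found.
        $offset = $m[0][1] + strlen($m[0][0]);
    }
}

// Idea 2: count three-word terms, but only those whose leading pair
// is already known to be frequent; every other window is skipped.
function extendPairs(iterable $words, array $frequentPairs): array
{
    $counts = [];
    $window = []; // sliding window over the last three words
    foreach ($words as $word) {
        $window[] = $word;
        if (count($window) > 3) {
            array_shift($window);
        }
        if (count($window) === 3 && isset($frequentPairs["$window[0] $window[1]"])) {
            $key = implode(' ', $window);
            $counts[$key] = ($counts[$key] ?? 0) + 1;
        }
    }
    return $counts;
}

$text = 'New York City is large. New York City is busy. New York is old.';
$pairs = ['new york' => 3]; // assumed result of an earlier two-word pass
print_r(extendPairs(streamWords($text), $pairs));
```

Only the three-word window is kept in memory at any time, and candidate terms are limited to extensions of pairs that already passed the threshold.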

Please feel free to improve this small library as much as you like.


License
This development is released under the GPL license model. If it's useful to you, just use it, leaving a link to Alejandro Mitrou [www.WiseTonic.com] in your acknowledgement page and/or within your documentation. This software is provided as is, without warranty of any kind, express or implied.

Files

File	Role	Description
`data` (4 files)
`frequentTermsAnalyzer.php`	Class	Class file
`README.frequentTermsAnalyzer`	Doc.	Brief documentation
`testInTextFile.php`	Example	Test script

data

File	Role	Description
`wikipedia_barbicue.txt`	Data	Data file
`wikipedia_new_york_city.txt`	Data	Data file
`wikipedia_personal_finance.txt`	Data	Data file
`wikipedia_social_media.txt`	Data	Data file

	frequency-analyzer-2014-09-28.zip 83KB
	frequency-analyzer-2014-09-28.tar.gz 82KB


User Ratings

	All time
Utility:	75%
Consistency:	66%
Documentation:	58%
Examples:	58%
Tests:	-
Videos:	-
Overall:	55%
Rank:	1552
