Text Comparison with Basset-ir
Ayuba Dauda - 2018-06-07 15:43:28
I am developing a plagiarism detection system with Laravel, and part of the requirement is to check a text file against other text files to detect whether their contents are similar. I have checked the documentation, but I still can't figure out how to go about this. Please help me out. Thanks.
Jericko Tejido - 2018-06-07 23:28:25 - In reply to message 1 from Ayuba Dauda
Hi Ayuba,
I apologize that the documentation isn't clear about usage; documenting isn't my strong suit. To explain: Basset only compares parsed documents (i.e., after you've converted PDF, HTML, .doc, etc. to plain text). Once you've done that, import these classes:

    use Basset\Collections\CollectionSet;
    use Basset\Documents\TokensDocument;
    use Basset\Documents\Document;
    use Basset\Normalizers\English;
    use Basset\Index\Index;
    use Basset\Tokenizers\WhitespaceTokenizer;
    use Basset\Utils\StopWords;
    use Basset\Utils\TransformationSet;
    use Basset\Ranking\DocumentRanking;
    use Basset\Models\TfIdf;
    use Basset\Similarity\CosineSimilarity;

Then:

    // The location of your documents
    $doc1 = file_get_contents("test.txt");
    $doc2 = file_get_contents("test2.txt"); // second document; the file name is a placeholder

    // Stop words list if you have them
    $stopwords = file_get_contents('stopwords.txt');

    // Tokenizer for words
    $tokenizer = new WhitespaceTokenizer();

    // Normalizing all words
    $english = new English();

    // Tokenize the stopwords
    $filter = new StopWords($tokenizer->tokenize($stopwords));

    // Register all pre-analysis stuff
    $transformations = array(
        $english,
        $filter
    );
    $transform = new TransformationSet();
    $transform->register($transformations);

    // Start collecting documents
    $documents = new CollectionSet(true);
    $documents->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), 'Deadpool');
    $documents->addDocument(new TokensDocument($tokenizer->tokenize($doc2)), 'BigFish');
    // ...and some more documents

    // finally, transform all texts
    $documents->applyTransformation($transform);

    // documents have to be indexed
    $stats = new Index($documents);

    // Your query words have to be in the Document class too.
    // Your query can also be another document, as long as you pass it in here
    // for tokenization (and transform it like the others).
    $query = new Document(new TokensDocument($tokenizer->tokenize('bIg bEaR and fIshy, tHen deadpool')));
    $query->applyTransformation($transform);

    // then compare them
    $search = new DocumentRanking($stats);
    $search->query($query);
    $search->documentModel(new TfIdf);
    $search->queryModel(new TfIdf);
    $search->similarity(new CosineSimilarity);
    $search->search();

Other models are available in the documentation if the Vector Space Model (to which CosineSimilarity belongs) isn't sufficient.

At the moment Basset doesn't keep the precounted index, but I'm working on a DAFSA/B-trie structure for better traversal of an inverted index that can be stored on disk and reused. Until then, if you have lots of documents to compare against, you can keep a serialized version of the index somewhere and unserialize it when you're about to reuse it. This avoids the lengthy indexing process you'd otherwise go through each time (indexing is the step that always takes a long time) and lets you go straight to searching whenever your controller/service asks for it.
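The serialize-and-reuse idea above could look roughly like the sketch below. It uses only PHP's built-in serialize()/unserialize(); the helper name loadOrBuild and the cache file name are made up for illustration, and a plain array stands in for the Basset Index object (whether a real Index serializes cleanly is an assumption taken from the reply above, not something verified here).

```php
<?php
/**
 * Return a cached value from disk, or build it once and store it.
 * $build is any callable returning a serializable value; here it stands in
 * for the expensive indexing step. Hypothetical helper, not part of Basset.
 */
function loadOrBuild(string $cacheFile, callable $build)
{
    if (is_file($cacheFile)) {
        $cached = unserialize(file_get_contents($cacheFile));
        if ($cached !== false) {
            return $cached; // reuse the stored copy and skip rebuilding
        }
    }
    $value = $build();
    file_put_contents($cacheFile, serialize($value), LOCK_EX);
    return $value;
}

// Stand-in for `new Index($documents)`: a plain array of term counts.
$cacheFile = sys_get_temp_dir() . '/basset_index_demo.cache';
@unlink($cacheFile); // start the demo from a clean state

$builds = 0;
$build = function () use (&$builds) {
    $builds++; // count how often the "expensive" step really runs
    return ['love' => 2, 'mangoes' => 1];
};

$first  = loadOrBuild($cacheFile, $build); // builds and writes the cache
$second = loadOrBuild($cacheFile, $build); // read back from disk, no rebuild
```

Only unserialize files your own application wrote, and delete the cache file whenever the document collection changes so that a stale index is not reused.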
Ayuba Dauda - 2018-06-08 02:18:30 - In reply to message 2 from Jericko Tejido
Thanks for your ardent support. I've been trying it out, but I am getting a "Class not found" exception for the classes Index and DocumentRanking. The code is shown below, with a description of what I am trying to achieve:

    public function compare($doc1_name = 'test1.txt', $doc2_name = 'test2.txt'){
        /* $doc1_name and $doc2_name contain the respective names of the two text files i want to compare */
        $path = 'projects/schema/';
        $doc1 = file_get_contents($path.$doc1_name);
        $doc2 = file_get_contents($path.$doc2_name);

        $tokenizer = new WhitespaceTokenizer();
        $english = new English();
        $transformations = array(
            $english
        );
        $transform = new TransformationSet();
        $transform->register($transformations);

        $documents = new CollectionSet(true);
        //please what is the use of the second parameter below
        $documents->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), 'Deadpool');
        $documents->applyTransformation($transform);

        $query = new QueryDocument(new TokensDocument($tokenizer->tokenize($doc2)));
        $query->applyTransformation($transform);

        //Throws a Class Index not found exception
        $stats = new Index($documents);

        //Throws a Class not found exception
        $search = new DocumentRanking($stats);
        $search->query($query);
        $search->documentModel(new TfIdf);
        $search->queryModel(new TfIdf);
        $search->similarity(new CosineSimilarity);
        $search->search();

        /* after the search i want to return true if doc1 and doc2 have similar contents and false otherwise...Example Below:
        $isSimilar = $search->search(); //am assuming search returns Boolean
        if($isSimilar){
            echo "Plagiarism Suspected....files contain similar texts";
            return true;
        }
        else{
            echo "Documents do not match";
            return false;
        }
        */
    }

***Thanks***
Jericko Tejido - 2018-06-08 05:10:29 - In reply to message 3 from Ayuba Dauda
Hi Ayuba,
I didn't realize until now that the Composer package wasn't updated to match what I currently have in my GitHub repo (I noticed after seeing some classes in your sample code that shouldn't be there). I've updated it, so you may have to remove it from your Composer dependencies and vendor folder, or run composer update (make sure you get v1.1 now) to install it again. That should fix your question 1.

For 2:

    //please what is the use of the second parameter below
    $documents->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), 'Deadpool');

This is a label or ID for the document; it ties into the answer to number 3.

    //am assuming search returns Boolean

search() doesn't return a boolean but an array of documents with a similarity score (for CosineSimilarity it's between 0 and 1, 0 being not similar and 1 being fully copied) based on what's in the query. It has the form [DocumentID/Label => score]. Without the label as the second parameter (which is optional, by the way), all documents are identified by their array offset (the first document you pass to addDocument() will be 0, then +1 for each succeeding doc, provided that you set CollectionSet(false)).

The reason is that we can't say whether your query is actually an exact match (a plagiarist may have changed word positions or even reworded some parts), so Basset's search() returns a score for each document, giving you a statistical means of seeing "which document most likely matches my query". So if you're checking whether they copied the entire text, the score has to be 1; otherwise it's going to be between 0 and 1.
You'd have to add a check in your final lines like this:

    $result = $search->search();
    $docID = <whatever your doc label/id here>;
    if(isset($result[$docID]) && ($result[$docID] === 1)){
        echo "Plagiarism Suspected....files contain similar texts"; // not only suspected but a 100% copied document
        return true;
    } else {
        echo "No exact match found";
        return false;
    }

If you have a threshold for how much of a document may be copied before it counts as suspect, you can change the condition above to that value (for example, $result[$docID] > .8 looks suspicious).

P.S. Try having a single document as a sample in the collection and the same document as the query as well; you'll get 1 (provided that they're tokenized, normalized and filtered identically, with the same stopwords list).
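As a rough illustration of the threshold idea, the sketch below classifies scores from a [label => score] array like the one search() is described as returning. The function name classifyScore, the sample scores, the 0.8 cut-off and the small tolerance on the "exact copy" test are all illustrative assumptions, not Basset defaults.

```php
<?php
/**
 * Classify one score from a [label => score] result array.
 * Thresholds are illustrative values, not anything defined by Basset.
 */
function classifyScore(array $result, string $label, float $suspect = 0.8, float $epsilon = 1e-9): string
{
    if (!isset($result[$label])) {
        return 'unknown'; // label is not in the result set
    }
    $score = (float) $result[$label];
    if ($score >= 1.0 - $epsilon) {
        return 'copied'; // effectively a 100% match
    }
    if ($score >= $suspect) {
        return 'suspected';
    }
    return 'no match';
}

// Made-up scores standing in for the output of $search->search().
$result = ['Deadpool' => 0.83, 'BigFish' => 0.12, 'Copy' => 1.0];

arsort($result); // highest score first, labels preserved as keys
foreach ($result as $label => $score) {
    echo $label, ': ', classifyScore($result, $label), PHP_EOL;
}
```

Sorting with arsort() keeps the labels attached to their scores, so the first key is the document that most likely matches the query.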
Ayuba Dauda - 2018-06-08 15:18:47 - In reply to message 4 from Jericko Tejido
Thank you very much, Jericko. This package is a complete solution to my problem, even beyond my expectations. I am currently updating my package and will give feedback soon. **Gratitudes**
Ayuba Dauda - 2018-06-08 16:16:01 - In reply to message 4 from Jericko Tejido
Hi, I've just tried it out. I'm glad it's executing without exceptions this time, except that I always get the same search result (0), even when I use exactly the same text for the test1.txt and test2.txt files. Please bear with me, as I am not very conversant with the similarity types; perhaps that is the problem. Please recommend an appropriate type for me. Here is the code:
    public function compare($doc1_name = 'test1.txt', $doc2_name = 'test2.txt'){
        //the content of both text files is "I love mangoes"
        $path = 'projects/schema/';
        $doc1 = file_get_contents($path.$doc1_name);
        $doc2 = file_get_contents($path.$doc2_name);

        $tokenizer = new WhitespaceTokenizer();
        $english = new English();
        $transformations = array(
            $english
        );
        $transform = new TransformationSet();
        $transform->register($transformations);

        $documents = new CollectionSet(true);
        $documents->addDocument(new TokensDocument($tokenizer->tokenize($doc1)), 'test');
        $documents->applyTransformation($transform);

        $query = new Document(new TokensDocument($tokenizer->tokenize($doc2)));
        $query->applyTransformation($transform);

        $stats = new Index($documents);

        $search = new DocumentRanking($stats);
        $search->query($query);
        $search->documentModel(new TfIdf);
        $search->queryModel(new TfIdf);
        $search->similarity(new CosineSimilarity);
        $result = $search->search();

        $docID = "test";
        echo "Percentage Similarity: ".$result[$docID]; //always 0 regardless of the content of the text files
    }
Jericko Tejido - 2018-06-08 17:25:43 - In reply to message 6 from Ayuba Dauda
Hi Ayuba,
I've fixed a bug to allow a one-off document for all kinds of IDF. (IDF is the log() of how many documents there are in the collection divided by the number of documents a term shows up in. I never realized it could and should be usable for a one-off similarity like this, where 1 vs 1 document gives log(1/1), which is always 0; I originally intended it for a large corpus spanning thousands of docs during my research.) Please update the package again, and thanks for notifying me.

Also, one of PHP's nuances is its floating-point values (all scores in Basset happen to be floats). So you can't do:

    if($result['test'] == 1) { /* do stuff */ }

You should do:

    if($result['test'] >= 1) { /* do stuff */ }

if you're aiming at 100% similarity; otherwise, you can set a threshold and do ($result['test'] >= $minimumvalue).
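To make the log(1/1) point concrete, here is a small self-contained sketch in plain PHP. The idf() function follows the definition given in the reply above; smoothedIdf() is one commonly used variant, shown only as an assumption about how such a collapse can be avoided, and is not claimed to be the fix applied in Basset. nearlyEqual() is a hypothetical helper for tolerance-based float comparison.

```php
<?php
/** Plain IDF as described above: log(N / df). */
function idf(int $totalDocs, int $docsWithTerm): float
{
    return log($totalDocs / $docsWithTerm);
}

/** A common smoothed variant, log(1 + N / df). Assumption: not necessarily what Basset uses. */
function smoothedIdf(int $totalDocs, int $docsWithTerm): float
{
    return log(1 + $totalDocs / $docsWithTerm);
}

/** Compare two floats with a tolerance instead of strict equality. */
function nearlyEqual(float $a, float $b, float $epsilon = 1e-9): bool
{
    return abs($a - $b) < $epsilon;
}

$single   = idf(1, 1);         // one document containing the term: log(1) = 0
$smoothed = smoothedIdf(1, 1); // log(2), stays positive
$rare     = idf(1000, 10);     // log(100), a rare term in a large corpus

// Every TF-IDF weight is tf * idf, so with $single all weights become 0
// and the cosine similarity between the two vectors is 0 as well.
$weight = 3 * $single;

// Float arithmetic is inexact, which is why a tolerance is safer than ==.
$sum = 0.1 + 0.2;
$looksEqual = nearlyEqual($sum, 0.3);
```

With a single document in the collection every term weight is multiplied by zero, which matches the constant 0 score reported for the one-against-one comparison.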
Ayuba Dauda - 2018-06-08 21:58:31 - In reply to message 7 from Jericko Tejido
Thanks for your timely response, Jericko. I have updated the package but am still getting the same result. I am now using the snippet below, as you suggested:

    $result = $search->search();
    if($result['test'] >= 1){
        echo "This Document has been copied";
    }
    else if($result['test'] >= 0.5){
        echo "Plagiarism Suspected";
    }
    else{
        echo "Documents do not match";
    }

I am getting "Documents do not match" even though the text files are exactly the same.
Ayuba Dauda - 2018-06-09 15:01:34 - In reply to message 7 from Jericko Tejido
I'm glad it's working now. You've done great work, Jericko. I changed the similarity type to Dice similarity and am getting reasonable ranges now.

The only issues now are:

1.) I only get reasonable results if I write the actual text in the parameter list of the new documents, as below:

    $documents->addDocument(new TokensDocument($tokenizer->tokenize("i love oranges")), 'test1');
    $documents->addDocument(new TokensDocument($tokenizer->tokenize("we hate mangoes")), 'test1');
    $documents->applyTransformation($transform);
    $query = new Document(new TokensDocument($tokenizer->tokenize("I love potatoes")), 'test1');

It does not yield reasonable results when I pass the variable holding the fetched text file as the parameter. I presume it can't read the contents of the fetched document.

2.) The one-off comparison still results in a 0 output regardless of the input texts.

****THANKS***
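For reference, the Dice coefficient over two token sets is 2|A ∩ B| / (|A| + |B|). The sketch below computes it in plain PHP on the sample sentences from this post and also checks that text fetched from a file is actually non-empty before comparing. It is not Basset's implementation; tokenSet(), dice() and readDocument() are hypothetical helpers, and lowercasing plus whitespace splitting is a simplification of the tokenizer and normalizer used in the thread.

```php
<?php
/** Lowercase, split on whitespace, and return the distinct tokens. */
function tokenSet(string $text): array
{
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($tokens));
}

/** Dice coefficient of two token sets: 2|A ∩ B| / (|A| + |B|). */
function dice(array $a, array $b): float
{
    if (count($a) + count($b) === 0) {
        return 0.0; // two empty documents: define the score as 0
    }
    return 2 * count(array_intersect($a, $b)) / (count($a) + count($b));
}

/** Read a text file and fail loudly if nothing usable came back. */
function readDocument(string $path): string
{
    $text = is_readable($path) ? file_get_contents($path) : false;
    if ($text === false || trim($text) === '') {
        throw new RuntimeException("Could not read any text from {$path}");
    }
    return $text;
}

// Two shared tokens (i, love) out of 3 + 3 tokens in total.
$score = dice(tokenSet('i love oranges'), tokenSet('I love potatoes'));

// Same comparison with one side coming from a file (temporary demo file).
$path = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($path, "I love mangoes\n");
$fromFile = dice(tokenSet(readDocument($path)), tokenSet('I love mangoes'));
unlink($path);
```

If readDocument() throws for the real files, the problem is the path or the file contents rather than the similarity measure.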
Ayuba Dauda - 2018-09-10 01:14:02 - In reply to message 2 from Jericko Tejido
Hi, I have just updated my copy of Basset-ir to the latest version, but after testing it I realized I am not getting realistic results as I used to with older versions: testing the system with two completely dissimilar texts returns a high similarity score, like 0.68. Please help me out. THANKS