Author: Dave Smith
Posted on: 2016-10-17
Package: PHP Language Detection Library
Introduction
Using the PHP Language Detection Library
Methods to Detect Languages
Conclusion
Introduction
The PHP Language Detection Library uses a web service provided by languageLayer. Send it some text and it returns a list of candidate languages, along with a main match that is most likely the language being used.
In addition to using this simple package, I will also take a brief look at how the detection process works.
Using the PHP Language Detection Library
Using the package to detect the language of any text is very simple. Provide the text and read the result returned from the languageLayer API.
You will first need to set up your own subscription account at https://languagelayer.com/product where you will receive your unique access key. You add this key in the langlayer.class.php file, replacing YOUR_API_KEY_HERE with your unique access key.
private $apiKey = 'YOUR_API_KEY_HERE';
You can now instantiate the class and send any text to the API to be checked.
include('langlayer.class.php');
$lang = new languageLayer();
$text = 'Ich bin mir sicher, dass dies die Sprache Deutsch';
$lang->getResponse($text);
The response will be in the $lang->response object, which we can inspect by dumping it to the screen:
var_dump($lang->response);
Many languages share the same alphabet, Latin for example, so the results can contain more than one candidate language. Each detected language contains:
language_code = 2-letter language code (for example, de)
language_name = The full English name for the language
probability = a numerical weighted probability; the higher the number, the more likely the text is in this specific language
percentage = the percentage between 0% and 100% which represents the API's confidence
reliable_result = true or false depending on whether the API is confident in the main match
The more text provided, the greater the probability that the language will be accurately identified.
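As a rough sketch of how these fields might be read, the example below picks the candidate with the highest probability. It does not call the API: the JSON string is a made-up sample, the values in it are invented, and the assumption that candidates sit in a results array is mine, so check it against your own var_dump output. The bestMatch() function is a hypothetical helper, not part of the package.

```php
<?php
// Made-up sample payload. The "results" array and all values are assumptions
// for illustration, not real API output.
$json = '{"success":true,"results":[
  {"language_code":"de","language_name":"German","probability":30.5,"percentage":100,"reliable_result":true},
  {"language_code":"nl","language_name":"Dutch","probability":12.1,"percentage":39.7,"reliable_result":false}
]}';

/**
 * Hypothetical helper: return the candidate with the highest probability,
 * or null when the response holds no candidates.
 */
function bestMatch(stdClass $response): ?stdClass
{
    $best = null;
    foreach ($response->results ?? [] as $candidate) {
        if ($best === null || $candidate->probability > $best->probability) {
            $best = $candidate;
        }
    }
    return $best;
}

$response = json_decode($json);
$best = bestMatch($response);
if ($best !== null) {
    echo $best->language_name . ' (' . $best->language_code . '), reliable: '
        . ($best->reliable_result ? 'yes' : 'no') . PHP_EOL;
}
```

With the package itself, you would presumably pass $lang->response to the helper instead of the decoded sample, provided it has the same shape.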
Methods to Detect Languages
There are several ways that text can be evaluated to determine the language it is written in. The simplest is to look at the character set. For example, languages written in the Latin and Cyrillic alphabets use different characters. Using this method, we can differentiate between English and Russian. However, it will not be easy to tell the difference between English and Spanish, which both use the Latin alphabet.
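A minimal sketch of the character set check, assuming PHP 7.3 or later with PCRE Unicode support. It only looks at two alphabets, and dominantScript() is a hypothetical name chosen for this example; it is not how languageLayer is known to work internally.

```php
<?php
/**
 * Count the letters belonging to each alphabet and return the dominant one.
 * Only Latin and Cyrillic are checked; this is an illustrative subset.
 */
function dominantScript(string $text): string
{
    // The /u flag makes PCRE treat the pattern and the text as UTF-8.
    $counts = [
        'Latin'    => (int) preg_match_all('/\p{Latin}/u', $text),
        'Cyrillic' => (int) preg_match_all('/\p{Cyrillic}/u', $text),
    ];
    arsort($counts);
    $script = array_key_first($counts);
    return $counts[$script] > 0 ? $script : 'Unknown';
}

echo dominantScript('This is English'), PHP_EOL;            // Latin
echo dominantScript('Это русский текст'), PHP_EOL;          // Cyrillic
echo dominantScript('Este texto está en español'), PHP_EOL; // Latin
```

The English and Spanish samples both come back as Latin, which is exactly the limitation of this method.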
Another method is to look for specific character combinations known as digraphs and trigraphs. A digraph is 2 characters side by side and a trigraph is a set of 3 sequential characters.
Certain character groups appear more often in one language than in another, which allows an algorithm to estimate the likelihood that the text belongs to a specific language. This method provides a better way to distinguish languages within the same character family.
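The trigraph idea can be sketched as follows, assuming the mbstring extension is available. The two profiles are tiny hand-picked lists for illustration only, not real frequency data, and trigraphCounts() and profileScore() are hypothetical helpers. A real detector would build its profiles from large amounts of text and weight them by frequency.

```php
<?php
/** Count every run of 3 consecutive characters in the lower-cased text. */
function trigraphCounts(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    $length = mb_strlen($text, 'UTF-8');
    $counts = [];
    for ($i = 0; $i <= $length - 3; $i++) {
        $tri = mb_substr($text, $i, 3, 'UTF-8');
        $counts[$tri] = ($counts[$tri] ?? 0) + 1;
    }
    return $counts;
}

/** Add up the occurrences of the text's trigraphs that appear in a profile. */
function profileScore(string $text, array $profile): int
{
    $score = 0;
    foreach (trigraphCounts($text) as $tri => $count) {
        // Cast the key: PHP turns numeric string keys such as "123" into integers.
        if (in_array((string) $tri, $profile, true)) {
            $score += $count;
        }
    }
    return $score;
}

// Toy profiles, hand-picked for this example only.
$profiles = [
    'English' => ['the', 'and', 'ing', 'ion'],
    'Spanish' => ['que', 'los', 'del', 'ado'],
];

$text = 'El nombre del perro que vive con los vecinos';
foreach ($profiles as $language => $profile) {
    echo $language . ': ' . profileScore($text, $profile) . PHP_EOL;
}
```

The language whose profile scores highest is the more likely match. The same loop works for digraphs by changing the window size from 3 to 2.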
The more languages that are supported, the more likely it is that some of them will have similar digraph and trigraph sets. To further separate these similar languages, we need to look for specific words that are more common in one particular language. As these words are located, our confidence grows that we have identified the correct language.
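Word matching can be sketched in the same spirit. The word lists below are tiny, hand-picked examples and wordScores() is a hypothetical helper; real word sets would be far larger and would have to deal with words shared between languages.

```php
<?php
/**
 * Count how many words of the text appear in each language's word list
 * and return the scores, highest first.
 */
function wordScores(string $text, array $wordLists): array
{
    // Split on anything that is not a letter; /u enables UTF-8 handling.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $scores = [];
    foreach ($wordLists as $language => $list) {
        // array_intersect keeps repeated words, so every occurrence counts.
        $scores[$language] = count(array_intersect($words, $list));
    }
    arsort($scores);
    return $scores;
}

// Tiny illustrative lists only; a real detector needs far larger ones.
$wordLists = [
    'English' => ['the', 'and', 'is', 'of', 'with'],
    'Spanish' => ['el', 'la', 'y', 'es', 'que', 'con'],
];

print_r(wordScores('La casa es grande y el jardín es pequeño', $wordLists));
```

Each matching word raises the score of its language, which mirrors the growing confidence described above.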
Conclusion
Since languageLayer supports over 170 languages, it most likely has to combine all the methods described in this article. The formula is simple: by comparing character sets, digraphs, trigraphs and unique words, any text can be evaluated to determine the language it is written in.
The hard part in writing your own application is developing accurate digraphs, trigraphs and unique word sets. These 'secret' sets are the power behind accurately detecting a language.
Fortunately for us, we just need to query the web service provided by languageLayer and let them do the heavy lifting in the background.