As I mentioned earlier in my spam attack analysis, I wanted to know which language spams I receive are written in. My first bruteforce-like idea was to take each word one by one, and search in english/french/german/… dictionaries whether the words were in. But with this approach I would miss all the conjugated verbs (until I had a really nice dictionary like the one I have now in firefox plugin). Then I remember that languages could differ in the distribution of their alphabetical letters, but well I had no statistics about that…
That was it for my own brainstorming, I decided to have a look at what google thinks about this problem. I firstly landed on some online language detector… The easy solution would have been to abuse this service which must have some cool algorithms, but well I needed to know what kind of algorithms it could be, and I didn’t want to rely on any thirdparty web service. Finally I read about Evaluation of Language Identification Methods, of which the abstract seemed perfect:
Language identification plays a major role in several Natural Language Processing applications. It is mostly used as an important preprocessing step. Various approaches have been made to master the task. Recognition rates have tremendously increased. Today identification rates of up to 99 % can be reached even for small input. The following paper will give an overview about the approaches, explain how they work, and comment on their accuracy. In the remainder of the paper, three freely available language identification programs are tested and evaluated.
I found the N-gram approach on page 8 (chapter 4) rather interesting. The principle is to cut into defined pieces m long texts written in their respective language (english, french…), that we will call training texts, and count how much time each piece appeared; Do the same on the text you want to identify, and check the training text matching your text the best; This training text is most likely written in the same language as your text.
The pieces are the N-grams, ie for the word GARDEN the bi-grams (N=2) are: G, GA, AR, RD, DE, EN, N.
Now there are various way of finding the best matching text playing with the N-grams, distances, score…
I found an implementation from 1996 in C, here with sources. So I followed same algorithm and implemented it in Ruby. Those C sources reminded me of my C days where you had to implement your lists, hashes. Those sources are optimized for memory usage (10 years ago…)… At the end the Ruby code is a hundred line while the C was four times more, and the Ruby code is easier to read. Don’t take that as a demonstration, it is not!! I admit the C binary is maybe a bit faster (but not that much ;)). I’ll try to commit it on rubyforge when I have some time.
The results are excellent as shown in the paper:
Anyway actually in this story the most interesting was not the implementation but the method: It is funny that you can identify languages (so human population as well) by without requiring linguistic knowledge: ignoring grammar, senses of words (dictionary)… But by only analyzing letters and blocks of letters from Shakespeare or Baudelaire. N-grams can also be used in other areas, for example in music to predict which note is likely to follow.