Recently I took an interest in identifying the language of a snippet of text. Let’s say I get a text message that says “Bienvenue!” If I have seen this word before, I already have a mental map from the word to its language and meaning. The problem is that most people understand only one language, maybe two or three if they’re gifted. If I were to build a computer to do this by lookup, it would need a huge “mental” map of all possible words to their respective languages and meanings. While this is plausible, I decided to apply machine learning to the task instead.
Job to be Done
Let’s start with what the task is. To narrow the scope down to a long-weekend exploration, I decided to focus on just six languages written in the Latin alphabet: English, French, Dutch, Italian, Portuguese, and Spanish. All text is in Unicode. For value proposition’s sake, let’s say I work for Apple and my task is to quickly identify what language a user is typing in and automatically switch to that keyboard. This is a multiclass (multinomial) classification task: the input is a sentence and the output is the language it is written in.
Representation Matters
Next, let’s think about what the representation of a sentence should be in order to best capture its language. The bag of words approach to representing text is always the first thing that comes to mind for NLP tasks. It’s simple, intuitive, and often works really well out of the box. I could use all possible words for all the languages involved and then convert each sentence into a large count dictionary. That feels like too much. Instead, using just the top 200 to 300 words from each language seems like a leaner approach. I’d like to think of this as one data hyperparameter. Let’s not remove stop words, as that could introduce unwanted bias. After all, stop words in one language may not appear in another language, so they could be a linguistic trait in their own right. This could be considered another data hyperparameter. Lastly, I can try limiting the length of the sentences to a minimum of 50 characters and a maximum of 200 characters. This is another design choice, made to speed up processing and standardize the data a bit, so the exact limits are up to the ML engineer.
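To make the bag of words idea concrete, here is a minimal sketch using only the Python standard library. The function names, the tiny toy corpus, and the regex tokenizer are my own assumptions for illustration, not a fixed recipe; a real run would use the top 200 to 300 words per language rather than the top 3.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase and split on runs of Unicode word characters.
    # Stop words are deliberately kept.
    return re.findall(r"\w+", sentence.lower())


def build_vocabulary(corpus_by_language, top_k=200):
    # Union of the top_k most frequent words of each language,
    # kept in first-seen order so vector positions are stable.
    vocab, seen = [], set()
    for sentences in corpus_by_language.values():
        counts = Counter(tok for s in sentences for tok in tokenize(s))
        for word, _ in counts.most_common(top_k):
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def bag_of_words(sentence, vocab):
    # Count vector aligned with vocab; words outside vocab are ignored.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]


def keep_sentence(sentence, min_chars=50, max_chars=200):
    # Length filter: another data hyperparameter.
    return min_chars <= len(sentence) <= max_chars


# Hypothetical toy corpus, just to exercise the functions.
toy_corpus = {
    "english": ["the cat is on the table", "the dog is in the garden"],
    "french": ["le chat est sur la table", "le chien est dans le jardin"],
}
vocab = build_vocabulary(toy_corpus, top_k=3)
vector = bag_of_words("le chat est la", vocab)
```

The resulting vector has one slot per vocabulary word, so any off-the-shelf multiclass classifier could consume it directly.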
A twist on the bag of words approach would be to use a bag of characters approach instead. Same concept, different implementation. Not only do words carry information about what language a sentence is in, so do the characters. I can thus convert each sentence into a bag of character n-grams (n would be yet another data hyperparameter!).
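A bag of character n-grams can be sketched in a few lines. The function names and the choice to lowercase the text are assumptions on my part; whether to keep case, spaces, and punctuation is yet another knob to experiment with.

```python
from collections import Counter


def char_ngrams(sentence, n=2):
    # Slide a window of n characters over the lowercased sentence.
    # Sentences shorter than n yield an empty list.
    text = sentence.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_ngrams(sentence, n=2):
    # Map each character n-gram to the number of times it occurs.
    return Counter(char_ngrams(sentence, n))


bigrams = bag_of_ngrams("Bienvenue!", n=2)
```

For “Bienvenue!” the bigram “en” shows up twice, which is exactly the kind of character-level signal this representation is meant to capture.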
Lastly, another representation technique is quantizing each character into a one-hot vector whose length is the number of characters considered. The sentence is then represented as a sequence of these quantized characters (vectors).
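Here is a rough sketch of character quantization in plain Python. The alphabet below is a hypothetical choice I made for illustration (it is not an exhaustive character set for the six languages), and mapping unknown characters to an all-zero vector is likewise just one possible convention.

```python
# Hypothetical alphabet: lowercase letters, some accented letters,
# a space, and a little punctuation.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáàâãçéèêëíñóõúü '!?.,"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    # Vector of len(ALPHABET) zeros with a single 1 at the character's index.
    # Characters outside the alphabet stay all-zero (an assumption).
    vec = [0] * len(ALPHABET)
    idx = CHAR_TO_INDEX.get(ch)
    if idx is not None:
        vec[idx] = 1
    return vec


def quantize(sentence):
    # A sentence becomes a sequence of one-hot vectors, one per character.
    return [one_hot(ch) for ch in sentence.lower()]


encoded = quantize("Olá!")
```

Unlike the two bag representations, this one preserves character order, which matters if the model downstream is sequence-aware.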