Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and human language. The objective of NLP is to process text and handle the nuances of the language within it. In a sense, we want computers to be capable of “understanding” the contents of documents, including their contextual nuances. Cool, right? But how do we go about it?
A lot of research has been done in the field, and today many streamlined processes exist, as well as very powerful models. However, it is still beneficial to understand the details, not only to use these techniques to the best of their abilities and in the right context, but also so that researchers can build upon them. Building such models involves some common steps, such as tokenization and numericalization.
Tokenization is a common task in natural language processing. Its goal is to split raw text into tokens. Models understand numbers, and languages are made of …well, words. Therefore, in the context of machine learning, we need to find a way to convert the components of language, such as words and sentences, into a machine representation.
Tokenization means dividing up the input text, which to a computer is just one long string of characters, into sub-units called tokens. These tokens are then fed into subsequent natural language processing steps. In the case of many Latin-based languages, we understand the basic building blocks to be letters and words, but as pointed out in the related fastai lesson: what about languages like Japanese and Chinese that don’t use spaces at all, and don’t really have a well-defined idea of a word? Because there is no one correct answer to these questions, there is no one approach to tokenization. Many approaches exist, each catering to particular settings and objectives, as well as to different languages.
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:
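Since the goal here is only to show the idea, the sketch below uses a deliberately naive approach: lowercase the text, keep only letters and apostrophes, and split on everything else. Real tokenizers handle far more cases than this:

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears;"

# A naive tokenizer: lowercase the text, keep only letters and apostrophes,
# and split on everything else (whitespace and punctuation).
tokens = re.findall(r"[a-z']+", text.lower())
print(tokens)
# ['friends', 'romans', 'countrymen', 'lend', 'me', 'your', 'ears']
```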
Some other important terms related to word tokenization are (a short code sketch follows the list):
- Unigrams: tokens consisting of a single word.
- Bigrams: tokens consisting of two consecutive words.
- Trigrams: tokens consisting of three consecutive words.
- N-grams: tokens consisting of N consecutive words.
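As a rough sketch in plain Python (assuming words are already separated by whitespace), n-grams can be produced by sliding a window of size N over the token list:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))  # bigrams
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
print(ngrams(tokens, 3))  # trigrams
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```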
Separating a chunk of continuous text into separate words is fairly trivial in English, since words are usually separated by spaces. However, some written languages like Chinese, Japanese, and Thai do not mark word boundaries in such a fashion; in those languages tokenization is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
Different languages present different issues to attend to. For example, German writes compound nouns without spaces (e.g., Computerlinguistik ‘computational linguistics’; Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’), so retrieval systems for German greatly benefit from a compound-splitter module. French has a variant use of the apostrophe for a reduced definite article before a word beginning with a vowel (e.g., l’ensemble) and some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi ‘give me’).
Subword tokenization is a scheme that deals with a potentially infinite vocabulary via a finite list of known subwords. It also avoids the extra complexity of breaking everything into single characters, since character-level tokenization can lose some of the meaning. Consider words such as “any” and “place”, which combine into “anyplace”, or compound words like “anyhow” or “anybody”: we only need to remember a few subword units and put them together to form the other words.
The simplest subword tokenization strategy is simply character tokenization: taking the text data one character at a time without doing any grouping. Unfortunately, this simplistic strategy has a serious drawback: it puts the burden of grouping characters on the neural net, requiring it to have significantly more parameters.
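As a quick illustration, character tokenization needs no machinery at all in Python, since a string is already a sequence of characters:

```python
text = "anyplace"

# Character-level tokens: no grouping at all; the model itself has to learn
# how characters combine into meaningful units.
char_tokens = list(text)
print(char_tokens)
# ['a', 'n', 'y', 'p', 'l', 'a', 'c', 'e']
```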
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as splitting on spaces, as in GPT-2 and RoBERTa. More advanced pre-tokenization includes rule-based tokenization, e.g. XLM and FlauBERT, which use Moses for most languages, or GPT, which uses spaCy and ftfy. After pre-tokenization, BPE counts the frequency of each word in the training corpus and then iteratively merges the most frequent pairs of symbols into new subword units, until the vocabulary reaches the desired size.
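To see a BPE tokenizer in action, the Hugging Face transformers library exposes GPT-2’s pretrained byte-pair tokenizer. The exact subword splits depend on the merges GPT-2 learned on its training corpus, so the snippet below is just an illustration:

```python
from transformers import GPT2TokenizerFast

# Load GPT-2's pretrained BPE tokenizer (downloads its vocab and merge rules).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Rare or unusual words get split into smaller, known subword pieces;
# the "Ġ" symbol marks a token that was preceded by a space.
print(tokenizer.tokenize("Tokenization of hyperparameters"))
```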
All tokenization algorithms described so far share the same problem: they assume that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language-specific pre-tokenizers (e.g. XLM uses specific Chinese, Japanese, and Thai pre-tokenizers). To solve the problem more generally, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw stream of characters, includes the space in the set of symbols to use, and then applies BPE or the unigram algorithm to build the vocabulary.
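For instance, XLNet ships a SentencePiece-based tokenizer through transformers (the sketch below also assumes the sentencepiece package is installed); the “▁” symbol it emits marks where a space appeared in the original text:

```python
from transformers import XLNetTokenizer

# XLNet's tokenizer is built on SentencePiece: the text is treated as a raw
# stream, and spaces are kept as the "▁" symbol inside the tokens themselves.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize("Don't you love tokenizers?"))
```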
Finally, it is easier for a machine to deal with numbers, so we replace each token with its location in the vocabulary. Numericalization is the process of assigning every token a unique integer. The steps are basically identical to those needed to create a Category variable. First, we make a list of all possible levels of that categorical variable (the vocab), then we replace each level with its index in the vocab. In general, the steps are: tokenize the text, reduce the vocabulary size by removing infrequent words, then numericalize the strings.
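Here is a minimal sketch of those steps in plain Python; the frequency cutoff of two and the "<unk>" token are illustrative choices rather than a fixed convention:

```python
from collections import Counter

corpus = [
    "the movie was great",
    "the movie was boring",
    "great acting",
]

# 1. Tokenize (naive whitespace split) and count token frequencies.
counts = Counter(tok for line in corpus for tok in line.split())

# 2. Build the vocab: keep tokens seen at least twice, plus a special
#    "<unk>" token for anything that falls out of the vocabulary.
vocab = ["<unk>"] + [tok for tok, c in counts.most_common() if c >= 2]
stoi = {tok: i for i, tok in enumerate(vocab)}

# 3. Numericalize: replace every token with its index in the vocab.
def numericalize(text):
    return [stoi.get(tok, stoi["<unk>"]) for tok in text.split()]

print(vocab)                                 # ['<unk>', 'the', 'movie', 'was', 'great']
print(numericalize("the acting was great"))  # [1, 0, 3, 4]
```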
Tokenization and numericalization in NLP-related tasks usually go hand in hand, and are part of the preprocessing steps needed to train models that deal with text data for linguistic objectives.