Modern NLP is about using machine learning and large datasets to give computers the ability not to understand language (a loftier goal) but to ingest a piece of language as input and return something useful, with tasks like text classification, content filtering, sentiment analysis, and translation.

Timeline of models

Dawn of time: Non-deep-learning ML models such as decision trees

2014-2017: RNNs, specifically LSTMs

2017-today: Transformers

Preparing text data

There are many ways to vectorize text, but they all follow the same three-step process.

  1. Standardization: Standardize the text to make it easier to process (all lower case, remove punctuation).
  2. Tokenization: Split the text into units called tokens; a token could be a character, a word, or a group of words.
  3. Vectorization: Convert the tokens into numerical vectors, which usually involves first indexing all tokens present in the data, as in the sketch below.
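
As a concrete illustration, here is a minimal end-to-end sketch of the three steps on two toy sentences. The helper functions and variable names are just illustrative, not from any particular library.

```python
import string

def standardize(text):
    # Step 1: lowercase and strip punctuation.
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Step 2: split into word-level tokens.
    return text.split()

# Step 3: index every token seen in the data, then map tokens to integers.
samples = ["The cat sat on the mat.", "The dog ate my homework!"]
tokens_per_sample = [tokenize(standardize(s)) for s in samples]

vocabulary = {}
for tokens in tokens_per_sample:
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))

vectorized = [[vocabulary[t] for t in tokens] for tokens in tokens_per_sample]
print(vocabulary)
print(vectorized)  # e.g. [[0, 1, 2, 3, 0, 4], [0, 5, 6, 7, 8]]
```

In practice you would rarely hand-roll this; layers like Keras's TextVectorization wrap all three steps behind a single object.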


Text standardization

Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with.

One of the simplest and most widespread standardization schemes is “convert to lowercase and remove punctuation characters.”
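
A minimal version of that scheme (the function name here is just illustrative) might look like the snippet below; note that stripping punctuation also drops hyphens, which is what lets differently written variants collapse to the same string.

```python
import string

def simple_standardize(text):
    # Lowercase, then remove every punctuation character (including hyphens).
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(simple_standardize("Sunset"))   # sunset
print(simple_standardize("sun-set"))  # sunset
```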

A much more advanced standardization pattern, one more rarely used in a machine learning context, is stemming: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation, like turning “caught” and “been catching” into “[catch]” or “cats” into “[cat]”. With stemming, “was staring” and “stared” would both become something like “[stare]”.
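
As a sketch of what this looks like with off-the-shelf tools, here is NLTK (an assumption, since these notes don't name a library). A rule-based stemmer only clips regular suffixes, so irregular forms like “caught” need a dictionary-based lemmatizer instead.

```python
# Assumes nltk is installed and the WordNet data has been fetched with
# nltk.download("wordnet").
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Rule-based stemming clips regular suffixes.
print(stemmer.stem("staring"), stemmer.stem("stared"))  # stare stare
print(stemmer.stem("cats"))                             # cat

# Irregular forms are left untouched by the stemmer...
print(stemmer.stem("caught"))                            # caught

# ...but a dictionary-based lemmatizer maps them back to the base verb.
print(lemmatizer.lemmatize("caught", pos="v"))           # catch
```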

With these standardization techniques, your model will require less training data and will generalize better—it won’t need abundant examples of both “Sunset” and “sun-set” to learn that they mean the same thing, and it will be able to make sense of “Méx-ico” even if it has only seen “mexico” in its training set.

Tokenization

There are three different ways to tokenize a standardized text: