Modern NLP is about using machine learning and large datasets to give computers the ability not to *understand* language (a far loftier goal) but to ingest a piece of language as input and return something useful: text classification, content filtering, sentiment analysis, translation, and so on.
Timeline of models
Dawn of time: non-DL machine learning models, like decision trees
2014-2017: RNNs, specifically LSTMs
2017-today: Transformers
There are many ways to vectorize text, but they all follow the same template: standardize the text, split it into tokens, and convert those tokens into numeric vectors.
Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with.
One of the simplest and most widespread standardization schemes is “convert to lowercase and remove punctuation characters.”
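As a minimal sketch, that scheme is a few lines of plain Python (the function name `standardize` is just for illustration, not from any particular library):

```python
import string

def standardize(text):
    # Lowercase, then drop every ASCII punctuation character.
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

print(standardize("Hello, World!"))  # hello world
```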
A much more advanced standardization pattern, more rarely used in a machine learning context, is stemming: converting variations of a term (such as different conjugated forms of a verb) into a single shared representation, like turning "caught" and "been catching" into "[catch]", "cats" into "[cat]", or "was staring" and "stared" into "[stare]".
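For a rough idea of what this looks like in practice, here is a sketch using NLTK's Porter stemmer (the choice of NLTK is just an assumption for illustration; note that a rule-based stemmer reduces regular inflections like "catching" but leaves irregular forms like "caught" untouched):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()
for word in ["cats", "catching", "staring", "stared"]:
    print(word, "->", stemmer.stem(word))
# expected: cats -> cat, catching -> catch, staring -> stare, stared -> stare
```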
With these standardization techniques, your model will require less training data and will generalize better—it won’t need abundant examples of both “Sunset” and “sun-set” to learn that they mean the same thing, and it will be able to make sense of “Méx-ico” even if it has only seen “mexico” in its training set.
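To make that concrete, a hypothetical standardizer along the lines of the earlier sketch, extended with accent stripping (an assumption, since the text doesn't spell out how "Méx-ico" would be normalized), maps each of those pairs to a single form:

```python
import string
import unicodedata

def standardize(text):
    # Hypothetical: lowercase, strip accents (assumption), remove punctuation.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text if ch not in string.punctuation)

print(standardize("Sunset"), standardize("sun-set"))  # sunset sunset
print(standardize("Méx-ico"), standardize("mexico"))  # mexico mexico
```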
There are three different ways to tokenize a standardized text: