Natural language processing (NLP): an overview
glossing over the ways "things" can learn about words
2025-12-11 23:59
// updated 2025-12-19 09:43
Natural language processing (NLP) refers to the interpretation of human language by computers or computer-like devices.
In this overview we will look at:
- a (very) short history of NLP
- common NLP tasks
- common NLP models
- common NLP libraries
History of NLP
We can summarize the history of NLP (so far) in two main periods:
🤖 1950s-1990s : the symbolic period
"language as a set of rules", i.e. "if this, then certainly that"
- 1954 : Georgetown-IBM experiment
- first attempt at automatic translation from Russian to English
- word-for-word translation between two very different languages (doomed to fail!)
- progress ground to a halt by the 1960s
- 1960s : ELIZA, the first chatbot
- language processing based on pattern matching
- remarkably human output for simple prompts
- fallback to generic responses for complex prompts
- the chatbot had no modern machine learning capabilities
- 1970s-1980s : quantitative evaluation of text
- paved the way for the next "statistical period"
📈 1990s-onwards : the statistical period
"language as a set of probabilities", i.e. "the best word to appear next is..."
- 1990s : computing power, and access to it, increases exponentially
- 2000s : "the internet" provides more than enough training data for various competing models
- 2010s : the rise of neural networks and word embeddings to interpret language more like an "active, flowing brain" than a "dumb, step-wise machine"
- 2017 : the historic academic paper "Attention Is All You Need" by Google researchers brings about the revolutionary transformer
- looking at text as a whole rather than word-for-word
- 2022 : the rise of large language models (e.g. ChatGPT)
- 2020s : rapid development of, and improvement upon, these new language models
Common NLP tasks
Most of these tasks have libraries that specialize in doing most of the work, although their output can always use some human validation:
Text collection
- sourcing
- is the text proprietary, public, scraped from the web, or from somewhere else?
Text cleaning (noise removal)
As a running example, take the text: "He very much praised NLP!" (a code sketch of the full cleaning pipeline follows this list)
- lowercasing all text
- computers treat uppercase and lowercase letters differently, so we could lowercase all text
- after which, we have "he very much praised nlp!"
- however, this can also result in a loss of meaning, as a capital letter might change the meaning of a word (e.g. "The Sun" [the newspaper] versus "the sun" [the star])
- stop word removal
- deleting common words from the dataset that carry little context or meaning, e.g. function words like "he", "very", "much", "for", "in", and "to"
- after which, the text becomes "praised nlp!"
- punctuation removal
- removing punctuation marks, since they attach to words without spaces and would otherwise contaminate the tokens
- after which, the text becomes "praised nlp"
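A minimal sketch of this cleaning pipeline in plain Python; the stop word list here is a tiny hand-picked stand-in, purely for illustration (real libraries ship much larger lists):

```python
import string

# tiny, hand-picked stop word list, purely for this sketch
STOP_WORDS = {"he", "very", "much", "for", "in", "to"}

def clean(text: str) -> str:
    text = text.lower()  # 1. lowercasing
    # 2. stop word removal (punctuation is stripped only for the lookup)
    words = [w for w in text.split()
             if w.strip(string.punctuation) not in STOP_WORDS]
    text = " ".join(words)
    # 3. punctuation removal
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean("He very much praised NLP!"))  # -> "praised nlp"
```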
Text pre-processing
- tokenization
- splitting a text up into smaller parts, roughly corresponding to "words"
- so, the text becomes a list: [ "praised", "nlp" ]
- stemming
- chopping words down to a crude root form, e.g. "happiness" becomes "happi"
- so, we have: [ "prais", "nlp" ]
- lemmatization
- reducing words to a meaningful dictionary form, so that "happiness" becomes "happy" rather than "happi"
- so, we then have: [ "praise", "nlp" ]
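A minimal sketch of these three steps using NLTK (introduced in the libraries section below); it assumes the required data packages have been downloaded first (newer NLTK versions may ask for "punkt_tab" rather than "punkt"):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of tokenizer and WordNet data
nltk.download("punkt")
nltk.download("wordnet")

tokens = nltk.word_tokenize("praised nlp")  # tokenization
print(tokens)  # ['praised', 'nlp']

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # ['prais', 'nlp']

lemmatizer = WordNetLemmatizer()
# pos="v" tells WordNet to treat each token as a verb
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # ['praise', 'nlp']
```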
Text processing
- part-of-speech tagging
- giving each token a linguistic category
- e.g. "praise" = verb, "nlp" = noun
- named entity recognition (NER)
- acknowledging the proper nouns (if any)
- e.g. if someone's or some place's name appeared in the text
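A minimal sketch of both steps using spaCy (introduced below); it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Ada praised NLP in London.")

# part-of-speech tagging: one linguistic category per token
for token in doc:
    print(token.text, token.pos_)  # e.g. "praised VERB"

# named entity recognition: proper nouns and other entities
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "London GPE"
```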
Text analysis
- sentiment analysis
- determining the tone of a text just by analyzing its tokens
- logistic regression
- classifying text (e.g. as positive or negative) using a statistical regression model
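A minimal sketch of sentiment classification with logistic regression via scikit-learn; the four-line training set is a made-up toy, purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy labelled data: 1 = positive, 0 = negative
texts = ["praised nlp", "loved the talk", "hated the demo", "awful results"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()  # bag-of-words features (see the models below)
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

new = vectorizer.transform(["praised the talk"])
print(model.predict(new))  # predicted class, 1 = positive
```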
Common NLP models
Bag of words
- looking at the frequency of tokens (words) to detect a text's meaning
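A minimal sketch using scikit-learn's CountVectorizer (one of several possible tools) to turn two tiny texts into word-count vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["nlp is fun", "nlp is hard and fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['and' 'fun' 'hard' 'is' 'nlp']
print(X.toarray())  # [[0 1 0 1 1]
                    #  [1 1 1 1 1]]
```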
TF/IDF
- looking at the "term (token) frequency" in a text versus its frequency elsewhere in other documents, i.e. "inverse document frequency"
n-grams
- looking at groups of n (a number > 1) tokens to detect a text's meaning
- e.g. 2-grams (or bigrams) to see which pairs of tokens happen most often
- e.g. 3-grams (or trigrams) to eliminate more coincidental token pairs
- note how "bag of words" is just the special case of 1-grams, i.e. "unigrams"!
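A minimal n-gram sketch, again with CountVectorizer; ngram_range=(1, 1) would give the plain bag of words above, while (2, 2) extracts bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat sat on the mat"]

bigrams = CountVectorizer(ngram_range=(2, 2))  # pairs of adjacent tokens
bigrams.fit(text)
print(bigrams.get_feature_names_out())
# ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']
```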
Latent Dirichlet allocation (LDA)
- a form of topic modelling that finds keywords to determine a topic or theme within a text
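A minimal LDA sketch with scikit-learn on a made-up four-document corpus; with so little data the topics will be noisy, but it shows the moving parts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    "the cat chased the mouse",
    "dogs and cats make great pets",
    "the stock market rose sharply",
    "investors traded shares all day",
]

X = CountVectorizer(stop_words="english").fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 assumed topics
lda.fit(X)

# one topic-probability row per document; ideally the animal texts
# cluster in one topic and the finance texts in the other
print(lda.transform(X).round(2))
```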
Latent semantic analysis (LSA)
- looks at relationships between sets of documents and the terms they contain, i.e.:
- words with similar meanings appear frequently together
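A minimal LSA sketch: TF/IDF vectors reduced with truncated SVD (the usual scikit-learn recipe), so that documents about similar topics land close together in a low-dimensional "concept" space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "the cat chased the mouse",
    "dogs and cats make great pets",
    "the stock market rose sharply",
    "investors traded shares all day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
lsa = TruncatedSVD(n_components=2, random_state=0)  # 2 latent "concepts"
print(lsa.fit_transform(X).round(2))  # one 2-d coordinate per document
```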
Common NLP libraries
NLP libraries for Python programs:
NLTK (Natural Language Toolkit)
- for educational usage
- text pre-processing
- first released in 2001
- website @ nltk.org
spaCy
- for production usage
- deep learning workflows
- multi-lingual tokenization and NER models
- first released in 2015
- website @ spacy.io
TextBlob
- builds upon NLTK
- for processing and analyzing text data
- tokenization
- part of speech tagging
- translation
- sentiment analysis
- returns a number between -1 (negative) and 1 (positive)
- language detection
- "in which language is a text written?"
- website @ https://textblob.readthedocs.io/en/dev/
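A minimal TextBlob sketch; it assumes the corpora have been fetched once via `python -m textblob.download_corpora`:

```python
from textblob import TextBlob

blob = TextBlob("He very much praised NLP!")

# polarity runs from -1 (negative) to 1 (positive);
# subjectivity runs from 0 (objective) to 1 (subjective)
print(blob.sentiment.polarity)

# tokenization and part-of-speech tagging come along for free
print(blob.words)
print(blob.tags)
```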
transformers (Hugging Face)
- understands the context of a word based on its surrounding words
- returns a sentiment label along with a confidence score
- website @ huggingface.co
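A minimal sketch using the Hugging Face pipeline API; with no model specified, the library falls back to a default sentiment model and downloads it on first run:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default model, downloaded on first use

print(classifier("He very much praised NLP!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] -- a label plus a confidence score
```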
Common NLP applications
- speech recognition
- next word prediction
- spell and grammar check
- sentiment analysis
- text generation (e.g. chatbots)
A note about other NLPs
We should not confuse the NLP of "natural language processing" with the NLP in:
- a related term in AI known as "natural-language programming"
- the use of a natural language, such as English, to write software
- an unrelated term in mathematics known as "non-linear programming"
- a completely unrelated term known as "neuro-linguistic programming"
- the use of language to "program", i.e. modify, human behaviour