Natural language processing (NLP): an overview
glossing over the ways "things" can learn about words
2025-12-11 23:59
// updated 2025-12-19 09:43
Natural language processing (NLP) refers to the interpretation of human language by computers or computer-like devices.
In this overview we will look at:
- a (very) short history of NLP
- common NLP tasks
- common NLP models
- common NLP libraries
History of NLP
We can summarize the history of NLP (so far) in two main periods:
🤖 1950s-1990s : the symbolic period
"language as a set of rules", i.e. "if this, then certainly that"
- 1954 : Georgetown-IBM experiment
- first attempt at automatic translation from Russian to English
- word-for-word translation between two very different languages (doomed to fail!)
- progress ground to a halt by the 1960s
- 1960s : ELIZA, the first chatbot
- language processing based on pattern matching
- remarkably human output for simple prompts
- fallback to generic responses for complex prompts
- the chatbot had no modern machine learning capabilities
- 1970s-1980s : quantitative evaluation of text
- paved the way for the next "statistical period"
📈 1990s-onwards : the statistical period
"language as a set of probabilities", i.e. "the best word to appear next is..."
- 1990s : computing power, and access to it, increases exponentially
- 2000s : "the internet" provides more than enough training data for various competing models
- 2010s : the rise of neural networks and word embeddings to interpret language more like an "active, flowing brain" than a "dumb, step-wise machine"
- 2017 : the historic academic paper "Attention Is All You Need" by Google researchers brings about the revolutionary transformer
- looking at text as a whole rather than word-for-word
- 2022 : the rise of large language models (e.g. ChatGPT)
- 2020s : rapid development of, and improvement upon, these new language models
Common NLP tasks
Most of these tasks have libraries that specialize in doing most of the work, although their output can always use some human validation:
Text collection
- sourcing
- is the text proprietary, public, scraped from the web, or from somewhere else?
Text cleaning (noise removal)
As a running example, take the text: "He very much praised NLP!" (a code sketch of the full cleaning pipeline follows this list)
- lowercasing all text
- computers treat uppercase and lowercase letters differently, so we could lowercase all text
- after which, we have "he very much praised nlp!"
- however, this can also result in a loss of meaning, as a capital letter might change the meaning of a word (e.g. "The Sun" [the newspaper] versus "the sun" [the star])
- stop word removal
- deleting common words from the dataset that carry little context or meaning, e.g. function words like "he", "very", "much", "for", "in", and "to"
- after which, the text becomes "praised nlp!"
- punctuation removal
- removing punctuation marks, since they attach to words without spaces and would otherwise contaminate the tokens
- after which, the text becomes "praised nlp"
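A minimal sketch of this cleaning pipeline in plain Python; the stop word list here is a tiny hand-picked stand-in, purely for illustration (real libraries ship much larger lists):

```python
import string

# tiny, hand-picked stop word list, purely for this sketch
STOP_WORDS = {"he", "very", "much", "for", "in", "to"}

def clean(text: str) -> str:
    text = text.lower()  # 1. lowercasing
    # 2. stop word removal (punctuation is stripped only for the lookup)
    words = [w for w in text.split()
             if w.strip(string.punctuation) not in STOP_WORDS]
    text = " ".join(words)
    # 3. punctuation removal
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean("He very much praised NLP!"))  # -> "praised nlp"
```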
Text pre-processing
- tokenization
- splitting a text up into smaller parts, roughly corresponding to "words"
- so, the text becomes a list: [ "praised", "nlp" ]
- stemming
- chopping words down to a crude root form, e.g. "happiness" becomes "happi"
- so, we have: [ "prais", "nlp" ]
- lemmatization
- reducing words to a meaningful dictionary form, so that "happiness" becomes "happy" rather than "happi"
- so, we then have: [ "praise", "nlp" ]
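A minimal sketch of these three steps using NLTK (introduced in the libraries section below); it assumes the required data packages have been downloaded first (newer NLTK versions may ask for "punkt_tab" rather than "punkt"):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads of tokenizer and WordNet data
nltk.download("punkt")
nltk.download("wordnet")

tokens = nltk.word_tokenize("praised nlp")  # tokenization
print(tokens)  # ['praised', 'nlp']

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # ['prais', 'nlp']

lemmatizer = WordNetLemmatizer()
# pos="v" tells WordNet to treat each token as a verb
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # ['praise', 'nlp']
```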
Text processing
- part-of-speech tagging
- giving each token a linguistic category
- e.g. "praise" = verb, "nlp" = noun
- named entity recognition (NER)
- acknowledging the proper nouns (if any)
- e.g. if someone's or some place's name appeared in the text
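A minimal sketch of both steps using spaCy (introduced below); it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Ada praised NLP in London.")

# part-of-speech tagging: one linguistic category per token
for token in doc:
    print(token.text, token.pos_)  # e.g. "praised VERB"

# named entity recognition: proper nouns and other entities
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "London GPE"
```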
Text analysis
- sentiment analysis
- determining the tone of a text just by analyzing its tokens
- logistic regression
- classifying text (e.g. as positive or negative) using a statistical regression model
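A minimal sketch of sentiment classification with logistic regression via scikit-learn; the four-line training set is a made-up toy, purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy labelled data: 1 = positive, 0 = negative
texts = ["praised nlp", "loved the talk", "hated the demo", "awful results"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()  # bag-of-words features (see the models below)
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

new = vectorizer.transform(["praised the talk"])
print(model.predict(new))  # predicted class, 1 = positive
```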
Common NLP models
Bag of words
- looking at the frequency of tokens (words) to detect a text's meaning
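A minimal sketch using scikit-learn's CountVectorizer (one of several possible tools) to turn two tiny texts into word-count vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["nlp is fun", "nlp is hard and fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['and' 'fun' 'hard' 'is' 'nlp']
print(X.toarray())  # [[0 1 0 1 1]
                    #  [1 1 1 1 1]]
```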
TF/IDF
- looking at the "term (token) frequency" in a text versus its frequency elsewhere in other documents, i.e. "inverse document frequency"
n-grams
- looking at groups of n (a number > 1) tokens to detect a text's meaning
- e.g. 2-grams (or bigrams) to see which pairs of tokens happen most often
- e.g. 3-grams (or trigrams) to eliminate more coincidental token pairs
- note how "bag of words" is just the special case of 1-grams, i.e. "unigrams"!
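A minimal n-gram sketch, again with CountVectorizer; ngram_range=(1, 1) would give the plain bag of words above, while (2, 2) extracts bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the cat sat on the mat"]

bigrams = CountVectorizer(ngram_range=(2, 2))  # pairs of adjacent tokens
bigrams.fit(text)
print(bigrams.get_feature_names_out())
# ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']
```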
Latent Dirichlet allocation (LDA)
- a form of topic modelling that finds keywords to determine a topic or theme within a text
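A minimal LDA sketch with scikit-learn on a made-up four-document corpus; with so little data the topics will be noisy, but it shows the moving parts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    "the cat chased the mouse",
    "dogs and cats make great pets",
    "the stock market rose sharply",
    "investors traded shares all day",
]

X = CountVectorizer(stop_words="english").fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # 2 assumed topics
lda.fit(X)

# one topic-probability row per document; ideally the animal texts
# cluster in one topic and the finance texts in the other
print(lda.transform(X).round(2))
```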
Latent semantic analysis (LSA)
- looks at relationships between sets of documents and the terms they contain, i.e.:
- words with similar meanings appear frequently together
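A minimal LSA sketch: TF/IDF vectors reduced with truncated SVD (the usual scikit-learn recipe), so that documents about similar topics land close together in a low-dimensional "concept" space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "the cat chased the mouse",
    "dogs and cats make great pets",
    "the stock market rose sharply",
    "investors traded shares all day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
lsa = TruncatedSVD(n_components=2, random_state=0)  # 2 latent "concepts"
print(lsa.fit_transform(X).round(2))  # one 2-d coordinate per document
```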
Common NLP libraries
NLP libraries for Python programs:
NLTK (Natural Language Toolkit)
- for educational usage
- text pre-processing
- first released in 2001
- website @ nltk.org
spaCy
- for production usage
- deep learning workflows
- multi-lingual tokenization and NER models
- first released in 2015
- website @ spacy.io
TextBlob
- builds upon NLTK
- for processing and analyzing text data
- tokenization
- part of speech tagging
- translation
- sentiment analysis
- returns a number between -1 (negative) and 1 (positive)
- language detection
- "in which language is a text written?"
- website @ https://textblob.readthedocs.io/en/dev/
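A minimal TextBlob sketch; it assumes the corpora have been fetched once via `python -m textblob.download_corpora`:

```python
from textblob import TextBlob

blob = TextBlob("He very much praised NLP!")

# polarity runs from -1 (negative) to 1 (positive);
# subjectivity runs from 0 (objective) to 1 (subjective)
print(blob.sentiment.polarity)

# tokenization and part-of-speech tagging come along for free
print(blob.words)
print(blob.tags)
```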
transformers (Hugging Face)
- understands the context of a word based on its surrounding words
- returns a sentiment label along with a confidence score
- website @ huggingface.co
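A minimal sketch using the Hugging Face pipeline API; with no model specified, the library falls back to a default sentiment model and downloads it on first run:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default model, downloaded on first use

print(classifier("He very much praised NLP!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] -- a label plus a confidence score
```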
Common NLP applications
- speech recognition
- next word prediction
- spell and grammar check
- sentiment analysis
- text generation (e.g. chatbots)
A note about other NLPs
We should not confuse the NLP of "natural language processing" with the NLP in:
- a related term in AI known as "natural-language programming"
- the use of a natural language, such as English, to write software
- an unrelated term in mathematics known as "non-linear programming"
- a completely unrelated term known as "neuro-linguistic programming"
- the use of language to "program", i.e. modify, human behaviour