Pre-processing steps for NLP

splitting text up with tokenization + removing "textual noise" + reducing tokens with stemming and lemmatization
2025-12-17 23:59
// updated 2025-12-19 11:23

In natural language processing (NLP), we usually have the goal of analyzing a document (i.e. a piece of text) in order to gather some information about it (and maybe even differentiate it from other documents!)

However, before we perform any such analysis on a document of text, we should do some preparation work to make the words (or more generally, tokens) more meaningful:

  • split text into word-sized tokens
    • tokenization ("divide and conquer")
  • remove "textual noise" from the document
    • lower-casing all characters
    • punctuation removal
    • stop words removal (meaningless words removal)
  • reduce tokens into root forms
    • stemming
      • reduce words to root forms (although they may or may not be actual words in the natural language)
        • e.g. happiness => happi
    • lemmatization
      • reduce words to dictionary root forms
        • e.g. happier => happy

Setup

We will use the Python library NLTK (Natural Language Toolkit) to implement the pre-processing steps:

# app.py

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer models used by word_tokenize
# (newer NLTK versions may need nltk.download('punkt_tab') instead)
nltk.download('stopwords')  # the pre-made stop word lists

(Note that all the code snippets on this page labelled with the comment # app.py belong to one large file!)

We will use the following sample text as an example:

# app.py

text = "NLTK provides powerful tools for tokenization. It includes a word tokenizer and sentence Tokenizer!"

Splitting text into word-sized tokens

Tokenization

Splitting the text up into more manageable word-sized tokens:

# app.py

# split text into tokens
tokens_text = word_tokenize(text)

From that, our output will look like:

['NLTK',
 'provides',
 'powerful',
 'tools',
 'for',
 'tokenization',
 '.',
 'It',
 'includes',
 'a',
 'word',
 'tokenizer',
 'and',
 'sentence',
 'Tokenizer',
 '!']
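Incidentally, the sample text itself mentions a sentence tokenizer: alongside word_tokenize, NLTK also offers sent_tokenize, which splits text into sentence-sized chunks instead of word-sized ones (it relies on the same downloaded tokenizer models). A minimal side-example sketch:

# (side example, not part of app.py)

from nltk.tokenize import sent_tokenize

text = "NLTK provides powerful tools for tokenization. It includes a word tokenizer and sentence Tokenizer!"

# split the text into sentence-sized tokens instead of word-sized ones
sentences = sent_tokenize(text)
# e.g. ['NLTK provides powerful tools for tokenization.',
#       'It includes a word tokenizer and sentence Tokenizer!']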

Removing textual noise

Lower-casing characters

Machines regard upper-cased and lower-cased versions of a letter as completely different characters:

  • in the news headline "Ship runs into another ship"
    • both instances of the word "ship" refer to the same idea of "an aquatic vessel"
    • lower-casing the first instance would therefore allow machines to regard both instances as using the same word

# app.py

# lowercase all text tokens
tokens_text = [word.lower() for word in tokens_text]

which becomes

['nltk',
 'provides',
 'powerful',
 'tools',
 'for',
 'tokenization',
 '.',
 'it',
 'includes',
 'a',
 'word',
 'tokenizer',
 'and',
 'sentence',
 'tokenizer',
 '!']

Of course, lower-casing can sometimes cause problems for proper nouns: the meaning of a word may depend on its capitalization (compare "Bush" the surname with "bush" the plant). In those cases we might need to manually exclude certain tokens from this step, e.g. with regular expressions (beyond the scope of this article, but see the sketch below for the basic idea!)
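For instance, a minimal sketch of the idea (the preserve set below is a made-up example, not part of app.py): skip any token we explicitly want to keep capitalized while lower-casing the rest.

# (side example, not part of app.py)

# tokens whose capitalization we want to keep, e.g. names (made-up list)
preserve = {"NLTK"}

tokens = ['NLTK', 'provides', 'powerful', 'tools']

# lower-case everything except the tokens we explicitly preserve
tokens = [word if word in preserve else word.lower() for word in tokens]
# ['NLTK', 'provides', 'powerful', 'tools']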

Punctuation removal

In the raw text, punctuation is attached to the word before it; after tokenization, though, the punctuation marks sit in their own tokens ('.' and '!'), which we can drop with one line of Python:

# app.py

# remove punctuation from all tokens
tokens_text = [ word for word in tokens_text if any(char.isalnum() for char in word) ]

That line:

  • looks at each token in a list of tokens
  • removes any tokens that have no alphanumeric characters (a-z, 0-9)
    • char.isalnum() provides a handy means to find this out

which results in:

['nltk',
 'provides',
 'powerful',
 'tools',
 'for',
 'tokenization',
 'it',
 'includes',
 'a',
 'word',
 'tokenizer',
 'and',
 'sentence',
 'tokenizer']
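As a side note, another common way to do this (a sketch, not what app.py uses) is to drop tokens made up entirely of punctuation characters, using Python's built-in string module:

# (side example, not part of app.py)

import string

tokens = ['nltk', 'provides', '.', 'tokenizer', '!']

# keep only tokens that are not made up entirely of punctuation characters
tokens = [word for word in tokens if not all(char in string.punctuation for char in word)]
# ['nltk', 'provides', 'tokenizer']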

Stop words removal

For a more interesting list of words, we can remove any words that are too common to carry much meaning (so-called "stop words" like the, of, in, etc.)! In most cases, a pre-made stop word list will do:

# app.py

# the stop word list comes from the NLTK corpus module
# we can replace 'english' with another language if we need to
stop_words = stopwords.words('english')

# remove all tokens (i.e. words) if they appear in stop_words
tokens_text = [ word for word in tokens_text if word not in stop_words ]

which results in:

['nltk',
 'provides',
 'powerful',
 'tools',
 'tokenization',
 'includes',
 'word',
 'tokenizer',
 'sentence',
 'tokenizer']
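If the pre-made list is not enough, we can also extend it with our own domain-specific stop words. A sketch (the extra words here are made up purely for illustration, and it assumes the stop word list was downloaded as in the setup above):

# (side example, not part of app.py)

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# add our own, domain-specific stop words (made-up examples)
stop_words.update({'powerful', 'tokenizer'})

tokens = ['nltk', 'provides', 'powerful', 'tokenizer']
tokens = [word for word in tokens if word not in stop_words]
# ['nltk', 'provides']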

Reducing tokens to root forms

Especially in Indo-European languages like English, words can take multiple forms (e.g. run, ran, running, runs or happy, happier, happiest); in order to do any meaningful text analysis, we should also reduce the word tokens to their root forms!

Stemming

Reduces a word to its stem, which may not be an actual English word, e.g. provides => provid

# app.py

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# stem each token, storing the stems in their own list (tokens_text_s)
# so the cleaned tokens in tokens_text stay available for lemmatization below
tokens_text_s = [ ps.stem(token) for token in tokens_text ]

which would result in:

['nltk',
 'provid',
 'power',
 'tool',
 'token',
 'includ',
 'word',
 'token',
 'sentenc',
 'token']

Notice how many of the tokens no longer correspond to actual English words; stemming only chops each token down to a crude "stem". We can instead use the next method, lemmatization, to reduce a token to its "lemma", its dictionary root form!
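(As a side note before moving on: Porter is not the only stemmer shipped with NLTK; the Snowball stemmer is a common alternative with slightly different rules. A quick sketch for comparing the two on a few words:)

# (side example, not part of app.py)

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# print both stems side by side to compare the two stemmers
for word in ['provides', 'tokenization', 'happiness']:
    print(word, porter.stem(word), snowball.stem(word))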

Lemmatization

Reduces a token to its lemma, or dictionary root form, e.g. happier => happy:

# app.py

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the WordNet data the lemmatizer looks words up in
# (some NLTK versions also need: nltk.download('omw-1.4'))

lm = WordNetLemmatizer()

# lemmatize every token as a noun first...
tokens_text_l = [lm.lemmatize(token, "n") for token in tokens_text]

# ...then take that list and lemmatize each token as a verb
tokens_text_l = [lm.lemmatize(token, "v") for token in tokens_text_l]

which would yield:

['nltk',
 'provide',
 'powerful',
 'tool',
 'tokenization',
 'include',
 'word',
 'tokenizer',
 'sentence',
 'tokenizer']

Notice how all items of the list correspond to actual English words!
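One caveat: WordNetLemmatizer only finds the lemma when it is given the right part of speech (the default is noun), which is why we ran a noun pass and then a verb pass above. A small sketch (assuming the wordnet data from above; the happier example should resolve via WordNet's exception lists, though the exact behaviour depends on the installed data):

# (side example, not part of app.py)

from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()

lm.lemmatize('provides')       # 'provides' -- treated as a noun (the default), so unchanged
lm.lemmatize('provides', 'v')  # 'provide'  -- treated as a verb
lm.lemmatize('happier', 'a')   # 'happy'    -- treated as an adjective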

Combined file

To summarize, all the previous snippets marked with # app.py can be combined into something like this:

# app_combined.py

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # the pre-made stop word lists
nltk.download('wordnet')    # the WordNet data used by the lemmatizer

text = "NLTK provides powerful tools for tokenization. It includes a word tokenizer and sentence Tokenizer!"

# tokenization
tokens_text = word_tokenize(text)

# lowercase
tokens_text = [word.lower() for word in tokens_text]

# punctuation removal
tokens_text = [word for word in tokens_text if any(char.isalnum() for char in word)]

# stop words removal
stop_words = stopwords.words('english')
tokens_text = [ word for word in tokens_text if word not in stop_words ]

# stemming
ps = PorterStemmer()
tokens_text_s = [ ps.stem(token) for token in tokens_text ]

# lemmatization
lm = WordNetLemmatizer()

# lemmatizing by noun
tokens_text_l = [lm.lemmatize(token, "n") for token in tokens_text]

# then lemmatize by verb
tokens_text_l = [lm.lemmatize(token, "v") for token in tokens_text_l]

Kaggle notebook:

Summing up

From the aforementioned steps, we have:

  • split the text up into tokens
  • lowercased all tokens (words)
  • removed tokens with only punctuation
  • removed tokens that are common stop words
  • reduced tokens to stems or lemmas

Next steps

Now we can focus on the "clean document" in order to:

  • find the most common meaningful words (bag of words; see the sketch below for a tiny preview)
  • weigh words by how distinctive they are for a particular document (TF-IDF)
  • detect hidden, unwritten topics (topic modelling)
  • look at a document's tone (sentiment analysis)
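
As that tiny preview: simply counting the cleaned tokens already gives us a minimal bag-of-words representation. A sketch using Python's built-in collections.Counter on the lemmatized tokens from above:

# (side example, not part of app.py)

from collections import Counter

tokens_text_l = ['nltk', 'provide', 'powerful', 'tool', 'tokenization',
                 'include', 'word', 'tokenizer', 'sentence', 'tokenizer']

# count how often each cleaned token appears
bag_of_words = Counter(tokens_text_l)

print(bag_of_words.most_common(3))
# e.g. [('tokenizer', 2), ('nltk', 1), ('provide', 1)]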