Understand Stemming and Lemmatization with Python NLTK Package

Gaurav Karki
3 min read · Jan 31, 2021


Stemming and lemmatization are two ways to reduce a word to its root form. The root forms they produce are called the “stem” and the “lemma”, respectively. The difference is that a stem may or may not be a meaningful word, whereas a lemma is always a meaningful word.

The table gives an idea of how the stem and the lemma of a word may differ or coincide.

Let’s take a closer look at stemming and lemmatization with the Python NLTK package for a better understanding.

Stemming and lemmatization are steps in text normalization (the process of preparing text before it is analyzed) in natural language processing (NLP). NLTK stands for Natural Language Toolkit. It is a set of libraries that lets us build Python programs that work with natural language data.

Before going further, I confirm that NLTK is installed and that the datasets needed for the NLP examples have been downloaded.
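A minimal sketch of such a check is given here. It assumes NLTK was installed with pip, and the helper name ensure_nltk_data and the package list are illustrative choices rather than part of NLTK: "wordnet" backs the WordNet lemmatizer, and "omw-1.4" is included on the assumption that some NLTK releases also expect it when WordNet is loaded.

```python
import nltk

# Packages assumed to be needed for the lemmatization examples.
# This list is an illustrative assumption, not an official requirement.
REQUIRED_PACKAGES = ["wordnet", "omw-1.4"]


def ensure_nltk_data(packages=REQUIRED_PACKAGES):
    """Download each package if it is missing; return {name: success flag}."""
    # nltk.download returns True on success (or if already up to date).
    return {name: nltk.download(name, quiet=True) for name in packages}


if __name__ == "__main__":
    print("NLTK version:", nltk.__version__)
    print(ensure_nltk_data())
```

If any flag comes back False, the download failed (for example, because there is no network connection), and the lemmatizer is likely to raise a LookupError later.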

Python Stemming

Stemming obtains a stem by applying a stemming algorithm that strips affixes from a word. A stemming algorithm is also called a stemmer. NLTK provides several stemmers, covering different languages. PorterStemmer and LancasterStemmer are examples for the English language. We use PorterStemmer to understand stemming in Python.

NOTE: The highlighted stems are not meaningful words.
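The stemming step can be sketched as follows. The sample words are chosen here for illustration and are not taken from the original example; PorterStemmer needs no extra data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative sample words (an assumption, not the article's own list).
words = ["connected", "connecting", "running", "cats", "studies"]

# Map each word to the stem produced by the Porter algorithm.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:12} -> {stem}")
```

Both "connected" and "connecting" reduce to "connect", while "studies" becomes "studi", which is a stem but not a dictionary word.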

Python Lemmatization

Lemmatization obtains a lemma through morphological analysis of a word, using a dictionary that the algorithm consults to link an inflected form back to its lemma. I use the WordNet Lemmatizer provided by NLTK (the WordNetLemmatizer class), which looks up lemmas in the WordNet database. The WordNet lemmatizer strips an affix from a word only if the resulting word is in its database (dictionary).

In the example above, the lemmatizer produced lemmas without being told the part of speech (POS). To obtain the correct lemma, we need to specify the part of speech under which the word should be lemmatized. This is done by passing a value for the “pos” parameter.

Here, pos is the part-of-speech parameter, which defaults to noun (“n”). This means the lemmatizer will try to find the closest noun.

Here, pos is set to verb (“v”). This means the lemmatizer will try to find the closest verb.

Conclusion

In this post, I explored the role of stemming and lemmatization in normalizing text and the difference between them, with examples. I learned that stemming is comparatively simpler and faster than lemmatization, because lemmatization also has to look words up in the WordNet corpus, which takes time. Which one to use depends on the application. For example, the Porter Stemmer is a good choice if I am indexing some texts and want to support search using alternative forms of words, and the WordNet lemmatizer is a good choice if I want to compile the vocabulary of some texts and need a list of valid lemmas.
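The indexing use case can be illustrated with a small sketch. The functions build_index and search and the sample documents are hypothetical, written for this illustration only; the idea is simply that stemming both the indexed text and the query lets alternative word forms match.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents containing a form of the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


# Hypothetical sample documents, invented for this sketch.
documents = [
    "she connected the cables",
    "connecting two cats is hard",
    "he was running late",
]

index = build_index(documents)
print(search(index, "connect"))  # documents 0 and 1
print(search(index, "runs"))     # document 2
```

Whitespace splitting stands in for a real tokenizer here; a production index would also handle punctuation and stop words.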

References

1. http://www.nltk.org/book_1ed/

2. https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

3. https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

4. https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/

5. https://data-flair.training/blogs/nltk-python-tutorial/
