Posted on: Monday, August 29, 2022 • 4 minute read
Keyword Extraction in Python via RAKE
Keyword extraction is a process by which important terms are identified that best represent the material in a document. Keyword extraction is a subset of NLP (natural language processing) and information retrieval systems.
Despite these scary words, keyword extraction is fairly simple to perform in Python thanks to the beauty of open-source software.
This post is also available in video form at https://youtu.be/xdlrQ0-okZc.
Background
There are several different libraries/methods that can perform keyword extraction in Python. In this post, we will mainly focus on RAKE (Rapid Automatic Keyword Extraction).
RAKE is a novel method of automatically extracting keywords from documents created by researchers much smarter than me. Check out their paper here.
Thankfully, the open-source community has taken the approach described in this paper and created a class to extract keywords. We will be using the rake-nltk package in this tutorial. Note that there are other Python RAKE implementations but this one is PIP installable which I like.
Checkout this awesome blog post by Ali Mansour if you want to see a comparison of other Python keyword extraction libraries.
Setup
Start by installing the package
pip install rake-nltk
Debugging
If you see an error regarding missing nltk resources (stopwords or punkt), you may need to download them from nltk via the following commands:
python3 -c "import nltk; nltk.download('stopwords')"
python3 -c "import nltk; nltk.download('punkt')"
Usage
from rake_nltk import Rake
# text snippet from: https://en.wikipedia.org/wiki/COVID-19
text = """
Coronavirus disease 2019 (COVID-19) is a contagious disease caused by a virus,
the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
The first known case was identified in Wuhan, China, in December 2019.
The disease quickly spread worldwide, resulting in the COVID-19 pandemic.
"""
rake = Rake()
rake.extract_keywords_from_text(text.strip())
keywords_with_scores = rake.get_ranked_phrases_with_scores()
for score, keyword in keywords_with_scores:
print(keyword, score)
This will output the following keywords and their associated scores:
severe acute respiratory syndrome coronavirus 2 32.5 disease quickly spread worldwide 15.333333333333334 coronavirus disease 2019 10.333333333333334 contagious disease caused 9.333333333333334 first known case 9.0 2 ). 6.0 december 2019 4.5 19 pandemic 3.5 19 1.5 wuhan 1.0 virus 1.0 sars 1.0 resulting 1.0 identified 1.0 covid 1.0 covid 1.0 cov 1.0 china 1.0
Note that you can use get_ranked_phrases instead to get just the keywords in ranked order without the associated score
Conclusion
Clearly the above output is not perfect but all in all the out of the box performance on a fairly small document is pretty impressive. Some text cleaning and output pruning can be performed to further reduce junk keywords such as 2 ). and 19.