Keyword extraction is a foundational task in Natural Language Processing (NLP), essential for applications such as summarization, Search Engine Optimization (SEO), and text classification. This post discusses the use of regular expressions (regex) for simple keyword extraction, introduces TF-IDF for statistical ranking, and incorporates the Bidirectional Encoder Representations from Transformers (BERT) model to add deep contextual understanding, balancing ease of implementation with effectiveness.

Importance of Keyword Extraction in NLP

Keyword extraction is the process of identifying the most relevant and informative words or phrases within a text. It is a powerful tool in NLP and plays a crucial role in applications such as summarization, search engine optimization, text classification, and information retrieval.

The challenge lies in automating this task across large datasets, and that’s where techniques like regex and TF-IDF, augmented by BERT, become invaluable. Let’s explore how these methods work together for efficient keyword extraction.

Basic Regex Patterns for Identifying Keywords

Regular expressions (regex) are sequences of characters that define a search pattern. Regex is powerful for searching, matching, and manipulating text based on specific patterns, and it's widely used in programming, data cleaning, and text processing for tasks requiring precise pattern matching.

A regex pattern can be as simple as a literal word or as complex as a pattern that matches email addresses or phone numbers.

Common Examples of Regex Patterns

# Match email addresses
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Match US-style phone numbers such as (123) 456-7890 or 123.456.7890
pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

# Match a condition, with or without the optional word "valve"
pattern = r'\bmitral(?: valve)? regurgitation\b'

# Match any term from a fixed list of medications
pattern = r'\b(?:aspirin|ibuprofen|acetaminophen)\b'
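
As a quick sanity check, here is a minimal sketch (with an invented sample sentence) that applies two of these patterns using Python's re module:

import re

text = "Severe mitral valve regurgitation noted. Patient was given aspirin."

condition_pattern = r'\bmitral(?: valve)? regurgitation\b'
medication_pattern = r'\b(?:aspirin|ibuprofen|acetaminophen)\b'

# findall returns every non-overlapping match in the text
print(re.findall(condition_pattern, text, re.IGNORECASE))   # ['mitral valve regurgitation']
print(re.findall(medication_pattern, text, re.IGNORECASE))  # ['aspirin']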

Regex provides an initial filtering step by focusing on predefined patterns. While effective for text with predictable structure, it cannot rank the extracted terms or judge their importance in context. This is where TF-IDF comes in.

TF-IDF: Adding Context and Relevance

TF-IDF (Term Frequency-Inverse Document Frequency) adds context by considering the importance of a term within a document and across a corpus.

How TF-IDF Works

TF-IDF calculates a score that highlights terms that are frequent in a document but rare across other documents, making them key identifiers. The formula combines two metrics:

1. Term Frequency (TF): measures how often a term occurs in a document, typically normalized by the document's length.
2. Inverse Document Frequency (IDF): highlights terms that are rare across the collection, commonly computed as IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t.

The TF-IDF score for a term t in a document d is then: TF-IDF(t, d) = TF(t, d) × IDF(t)
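
As a toy calculation using the formulation above: a term that appears 3 times in a 100-word document has TF = 3/100 = 0.03; if it appears in only 1 of 10 documents, IDF = log(10/1) ≈ 2.30, so TF-IDF ≈ 0.03 × 2.30 ≈ 0.069. (Scikit-learn's implementation adds smoothing and normalization, so its numbers will differ slightly.)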

Example

Here is how to use scikit-learn's TfidfVectorizer in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample patient notes
documents = [
    "Mild mitral regurgitation detected.",
    "Severe mitral valve regurgitation noted.",
    "Moderate mitral regurgitation was diagnosed."
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Vocabulary and scores
print("TF-IDF Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Scores for the First Document:", tfidf_matrix.toarray()[0])

TF-IDF highlights the most distinctive terms in each document, which helps prioritize text for further analysis. However, it treats words as independent tokens and does not capture deeper contextual meaning, which is where BERT comes in.

BERT: Enhancing Contextual Understanding

BERT (Bidirectional Encoder Representations from Transformers) further enhances the keyword extraction process by understanding the context in which words appear, making it highly effective for complex texts where words may have different meanings based on their surroundings.

Why Use BERT?

1. Contextual embeddings: the same word receives a different representation depending on its surrounding words, which helps disambiguate terms with multiple meanings.
2. Bidirectional encoding: BERT reads text in both directions, so each token's representation reflects the entire sentence.
3. Pretrained knowledge: models such as bert-base-uncased are pretrained on large corpora and can be used off the shelf to embed and rank text.

Example

Using the Hugging Face transformers library to generate embeddings:

from transformers import BertTokenizer, BertModel
import torch

# Initialize BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sample sentence
sentence = "Moderate mitral regurgitation was diagnosed."

# Generate BERT embedding
encoded_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    output = model(**encoded_input)
embedding = output.last_hidden_state.mean(dim=1).squeeze()

print("Embedding Shape:", embedding.shape)

BERT generates a numeric representation (an embedding) for the sentence, which captures its meaning and can be used to rank sentences or phrases.
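
As a quick illustration of how such embeddings can rank text, the sketch below (continuing from the snippet above, with a second, invented sentence) compares two sentence embeddings using cosine similarity:

import torch.nn.functional as F

# Embed a second sentence with the same model
other = "Severe mitral valve regurgitation was confirmed."
encoded_other = tokenizer(other, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    other_output = model(**encoded_other)
other_embedding = other_output.last_hidden_state.mean(dim=1).squeeze()

# Cosine similarity between the two embeddings (values closer to 1 mean more similar)
similarity = F.cosine_similarity(embedding.unsqueeze(0), other_embedding.unsqueeze(0)).item()
print("Cosine similarity:", round(similarity, 3))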

Bringing It All Together: A Comprehensive Pipeline

Combining regex, TF-IDF, and BERT creates a robust pipeline for keyword extraction.

In the example below, we demonstrate how to extract the severity of the condition “mitral regurgitation” from unstructured patient notes. The pipeline begins by identifying sentences mentioning the condition, then ranks their relevance, and finally refines the results using contextual understanding.

import re
import json
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel

# Regex patterns
severity_pattern = r"\b(mild|moderate|severe)\b"
condition_pattern = r"\bmitral (?:valve )?regurgitation\b"

# Tokenizer and BERT model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example patient notes with severity
notes = {
    "Patient A": "Patient exhibits signs of mild aortic regurgitation, and mitral regurgitation is moderate. There was severe hypertension.",
    "Patient B": "The echocardiogram confirmed Severe mitral valve Regurgitation.",
    "Patient C": "Mitral regurgitation was detected and was classified as MODERATE severity."
}

# Function to extract sentences using regex
def extract_sentences_with_regex(text, patterns):
    sentences = text.split('.')  # Naive sentence splitting on periods
    relevant_sentences = [s.strip() for s in sentences if any(re.search(p, s, re.IGNORECASE) for p in patterns)]
    return relevant_sentences

# Function to compute TF-IDF and rank sentences
def rank_sentences_tfidf(corpus, query_terms):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([' '.join(query_terms)])
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    ranked_indices = np.argsort(-similarities)  # Descending order of similarity
    return [corpus[i] for i in ranked_indices[:5]]  # Keep the top 5 sentences

# Function to get BERT embedding for a sentence
def get_bert_embedding(sentence):
    encoded = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = model(**encoded)
    # Mean-pool token embeddings and convert to NumPy for scikit-learn's cosine_similarity
    return output.last_hidden_state.mean(dim=1).squeeze().numpy()

# Function to rank sentences with BERT embeddings based on similarity
def rank_sentences_with_bert(sentences, query):
    query_embedding = get_bert_embedding(query)
    sentence_embeddings = [get_bert_embedding(s) for s in sentences]
    similarities = [cosine_similarity([query_embedding], [emb])[0][0] for emb in sentence_embeddings]
    ranked_indices = np.argsort(-np.array(similarities))
    return [sentences[i] for i in ranked_indices[:1]]  # Keep only the top sentence

# Main processing loop
results = {}
for patient, note in notes.items():
    # Extract relevant sentences using regex
    patterns = [severity_pattern, condition_pattern]
    relevant_sentences = extract_sentences_with_regex(note, patterns)

    # If no relevant sentences are found, handle gracefully
    if not relevant_sentences:
        results[patient] = {
            "Condition": "No relevant condition mentioned",
            "Severity": "No severity mentioned"
        }
        continue  # Skip to the next patient

    # Rank sentences using TF-IDF
    top_sentences = rank_sentences_tfidf(relevant_sentences, ["mitral", "regurgitation", "mild", "moderate", "severe"])

    # Refine the ranking with BERT
    final_sentence = rank_sentences_with_bert(top_sentences, "mitral regurgitation")

    # Extract condition and severity from the top-ranked sentence
    condition = None
    severity = None
    for severity_match in re.findall(severity_pattern, final_sentence[0], re.IGNORECASE):
        severity = severity_match.lower()
    for condition_match in re.findall(condition_pattern, final_sentence[0], re.IGNORECASE):
        condition = condition_match.lower()

    # Store results in the desired format
    results[patient] = {
        "Condition": condition.title() if condition else None,
        "Severity": severity.title() if severity else None
    }

# Display results
print(json.dumps(results, indent=4))

Output

For the provided notes, the output is:

{
    "Patient A": {
        "Condition": "Mitral Regurgitation",
        "Severity": "Moderate"
    },
    "Patient B": {
        "Condition": "Mitral Valve Regurgitation",
        "Severity": "Severe"
    },
    "Patient C": {
        "Condition": "Mitral Regurgitation",
        "Severity": "Moderate"
    }
}

Practical Applications

1. Clinical Decision Support Systems (CDSS)

In hospital settings, Clinical Decision Support Systems (CDSS) support doctors by highlighting relevant medical information from a patient's electronic health record (EHR), improving the speed and accuracy of diagnosis and treatment planning. By combining regex, TF-IDF, and BERT, a CDSS can extract key medical terms, rank them by relevance, and refine the results with contextual understanding, making it easier for doctors to focus on the most critical information.

Example: Imagine a patient visits the hospital with symptoms such as chest pain and dizziness. The CDSS uses regex patterns to identify medical terms in the patient’s records, finding phrases like “hypertension,” “shortness of breath,” or “heart disease.” TF-IDF then scores these terms to prioritize the ones most relevant to the patient’s symptoms. Finally, BERT analyzes the context to further refine the ranking, ensuring accurate suggestions.

With these prioritized terms, the CDSS can suggest possible diagnoses, such as heart failure, and recommend specific tests like an ECG or blood tests. By integrating BERT's contextual understanding, the system avoids false positives and ensures that doctors receive actionable insights, improving diagnostic accuracy and supporting more targeted treatment decisions.

2. Pharmacovigilance: Adverse Event Detection from Medical Reports

Regulatory bodies and pharmaceutical companies need to monitor reports of adverse drug reactions (ADRs) to ensure drug safety. Keyword extraction with regex, TF-IDF, and BERT enables the detection of adverse events in unstructured text such as medical reports or social media posts.

Example: Regex patterns are designed to capture adverse reaction terms like “nausea,” “dizziness,” or “rash.” TF-IDF scores these terms based on their frequency within individual reports versus the entire corpus, identifying serious or uncommon side effects that may warrant further investigation. BERT then adds a layer of semantic understanding by analyzing whether the mention of symptoms is causal (e.g., “rash caused by medication”) or incidental (e.g., “rash reported but unrelated to treatment”).

This combination enables timely detection of and response to emerging safety concerns, ensuring regulatory compliance and improved drug safety monitoring.
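
A minimal sketch of the regex stage (the term list here is hypothetical and would be far larger in a production system):

import re

# Hypothetical adverse-reaction vocabulary
adr_pattern = r'\b(?:nausea|dizziness|rash|headache|fatigue)\b'

report = "Patient reported nausea and a mild rash after the second dose."
print(re.findall(adr_pattern, report, re.IGNORECASE))  # ['nausea', 'rash']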

3. Automated Report Classification in Healthcare

In healthcare facilities, managing and categorizing large volumes of medical reports is essential to ensure that each report is routed to the appropriate analysis tool or classifier. Different health systems often use unique report titles and naming conventions, making manual classification slow and error-prone. Using regex, TF-IDF, and BERT provides a solution by analyzing report content directly and assigning each report to the relevant classifier based on its content, rather than relying solely on its title.

Example: Imagine a hospital receives thousands of reports daily from multiple departments, including radiology, pathology, and lab results. Regex identifies key terms specific to each type (e.g., "x-ray," "CT scan," "biopsy"), while TF-IDF ranks these terms by importance within each report. BERT refines the classification by understanding context, differentiating between reports that mention similar terms but carry different implications (for example, a radiology note that recommends a biopsy versus a pathology report that describes one).
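
A minimal sketch of the routing idea (the department keywords and sample report are invented for illustration):

import re

# Hypothetical keyword patterns per department
department_patterns = {
    "radiology": r"\b(?:x-ray|ct scan|mri|ultrasound)\b",
    "pathology": r"\b(?:biopsy|histology|cytology)\b",
    "lab": r"\b(?:hemoglobin|glucose|platelet)\b",
}

def route_report(text):
    # Return every department whose pattern matches; ambiguous reports
    # would then go to the TF-IDF/BERT ranking stage for a final decision
    return [dept for dept, pattern in department_patterns.items()
            if re.search(pattern, text, re.IGNORECASE)]

print(route_report("CT scan of the chest; biopsy recommended."))  # ['radiology', 'pathology']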

By integrating regex, TF-IDF, and BERT, this system efficiently categorizes reports, ensuring that each document is sent to the appropriate classifier for analysis. This reduces manual sorting, improves accuracy, and allows healthcare providers to handle large and varied datasets more effectively.

Conclusion

Combining regex, TF-IDF, and BERT creates a robust pipeline for extracting actionable insights from unstructured text. This approach leverages the precision of regex, the statistical ranking of TF-IDF, and the deep contextual understanding of BERT to address challenges like variability, irrelevant details, and ambiguity. The resulting pipeline is highly versatile and well-suited for applications in healthcare, pharmacovigilance, and automated text classification.

Author: Chetna Gohel
