Importance of Keyword Extraction in NLP
- Summarization: Capturing the essence of a document by identifying core topics.
- Content Categorization: Efficiently tagging, indexing, and organizing text for retrieval and grouping.
The challenge lies in automating this task across large datasets, and that’s where techniques like regex and TF-IDF, augmented by BERT, become invaluable. Let’s explore how these methods work together for efficient keyword extraction.
Basic Regex Patterns for Identifying Keywords
- Pattern Matching: Identify whether a string contains a specific sequence of characters.
- Extraction: Capture specific parts of text, such as dates or emails, based on their format.
- Replacement: Modify a string by replacing certain patterns, like changing phone numbers to a standard format.
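The three operations above map onto three functions in Python's standard `re` module. The following is a minimal sketch; the sample note and the date and phone formats are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text, written only for this illustration.
note = "Follow-up on 2024-03-15. Call 555-123-4567 or 555.987.6543 to reschedule."

# Pattern matching: does the text contain a date-like sequence at all?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", note) is not None

# Extraction: capture every date written as YYYY-MM-DD.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)

# Replacement: rewrite phone numbers into one standard format.
standardized = re.sub(
    r"\b(\d{3})[-.\s](\d{3})[-.\s](\d{4})\b", r"(\1) \2-\3", note
)

print(has_date)
print(dates)
print(standardized)
```

Here `re.search` answers a yes/no question, `re.findall` returns the matched substrings, and `re.sub` uses the numbered groups to reassemble each phone number.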
Common Examples of Regex Patterns
- Emails: To find email addresses, you could use a regex pattern like:
r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
- Phone Numbers: To capture 10-digit, North American-style phone numbers, a regex pattern might look like:
r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
- Identifying Specific Diagnoses: To extract "mitral regurgitation" and its variations (e.g., "mitral valve regurgitation"), a regex pattern might look like:
r'\bmitral(?: valve)? regurgitation\b'
- Capturing Drug Names: To extract a known list of drug names, you could use a regex pattern like:
r'\b(?:aspirin|ibuprofen|acetaminophen)\b'
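As a quick check, the four patterns above can be run against a short piece of text. This is only a sketch: the sample sentence is made up, and `re.IGNORECASE` is assumed so that capitalized mentions such as "Aspirin" still match.

```python
import re

# Hypothetical sample text, written only for this illustration.
text = (
    "Contact dr.lee@example.org or call (555) 123-4567. "
    "Echo shows Mitral Valve Regurgitation; patient takes Aspirin daily."
)

patterns = {
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "diagnosis": r"\bmitral(?: valve)? regurgitation\b",
    "drug": r"\b(?:aspirin|ibuprofen|acetaminophen)\b",
}

# Non-capturing groups (?:...) make findall return the whole match.
matches = {
    name: re.findall(pattern, text, flags=re.IGNORECASE)
    for name, pattern in patterns.items()
}

for name, found in matches.items():
    print(name, found)
```

Keeping the patterns in a dictionary makes it easy to add or adjust one entity type without touching the matching loop.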
TF-IDF: Adding Context and Relevance
How TF-IDF Works
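1. Term Frequency (TF): Measures how often a term occurs in a document.
2. Inverse Document Frequency (IDF): Down-weights terms that appear in many documents of the collection, so terms concentrated in only a few documents score higher.
A term's TF-IDF score is the product of the two.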
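To make the two components concrete, they can be computed by hand with the standard library. The sketch below assumes raw counts for TF and one common smoothed form of IDF, ln((1 + N) / (1 + df)) + 1, which is the default variant documented by scikit-learn; it skips the length normalization that libraries usually apply afterwards, so the numbers will not match a library's output exactly.

```python
import math
from collections import Counter

# Tiny illustrative corpus; whitespace tokenization is a simplification.
docs = [
    "mild mitral regurgitation detected",
    "severe mitral valve regurgitation noted",
    "moderate mitral regurgitation diagnosed",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Raw term count times smoothed IDF for each term in one document."""
    tf = Counter(tokens)
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}

scores = tf_idf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {score:.3f}")
```

In the first document, "mitral" and "regurgitation" occur in every document and get the minimum weight, while "mild" and "detected" occur only here and score higher.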
Example
This is how we use TfidfVectorizer in Python:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample patient notes
documents = [
    "Mild mitral regurgitation detected.",
    "Severe mitral valve regurgitation noted.",
    "Moderate mitral regurgitation was diagnosed."
]
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Vocabulary and scores
print("TF-IDF Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Scores for the First Document:", tfidf_matrix.toarray()[0])
BERT: Enhancing Contextual Understanding
Why Use BERT?
- Context Awareness: BERT considers the entire sequence of words in a text, not just isolated terms.
- Robustness: It effectively differentiates between contexts, which is crucial for accurate keyword extraction.
Example
from transformers import BertTokenizer, BertModel
import torch
# Initialize BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Sample sentence
sentence = "Moderate mitral regurgitation was diagnosed."
# Generate BERT embedding
encoded_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    output = model(**encoded_input)
embedding = output.last_hidden_state.mean(dim=1).squeeze()
print("Embedding Shape:", embedding.shape)
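The `mean(dim=1)` call above averages the per-token vectors into one sentence vector (768 values for `bert-base-uncased`), and sentence vectors are then typically compared with cosine similarity. The sketch below shows both steps with plain Python lists. The 3-dimensional vectors are made-up numbers, not real BERT output, and the helper names are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors column-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "token embeddings": two tokens per sentence, three dimensions each.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_b = [[0.0, 1.0, 0.0], [2.0, 1.0, 2.0]]

vec_a = mean_pool(sentence_a)  # [2.0, 1.0, 1.0]
vec_b = mean_pool(sentence_b)  # [1.0, 1.0, 1.0]
similarity = cosine(vec_a, vec_b)
print(vec_a, vec_b, round(similarity, 4))
```

Mean pooling is one simple way to get a fixed-size sentence vector; it treats all tokens equally, which is a design choice rather than a requirement of BERT.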
Bringing It All Together: A Comprehensive Pipeline
In the example below, we demonstrate how to extract the severity of the condition “mitral regurgitation” from unstructured patient notes. The pipeline begins by identifying sentences mentioning the condition, then ranks their relevance, and finally refines the results using contextual understanding.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Regex patterns
severity_pattern = r"\b(mild|moderate|severe)\b"
condition_pattern = r"\bmitral (?:valve )?regurgitation\b"
# Tokenizer and BERT model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# Example patient notes with severity
notes = {
    "Patient A": "Patient exhibits signs of mild aortic regurgitation, and mitral regurgitation is moderate. There was severe hypertension",
    "Patient B": "The echocardiogram confirmed Severe mitral valve Regurgitation.",
    "Patient C": "Mitral regurgitation was detected and was classified as MODERATE severity."
}
# Function to extract sentences using regex
def extract_sentences_with_regex(text, patterns):
    sentences = text.split('.')  # Simple sentence splitting
    relevant_sentences = [s.strip() for s in sentences if any(re.search(p, s, re.IGNORECASE) for p in patterns)]
    return relevant_sentences
# Function to compute TF-IDF and rank sentences
def rank_sentences_tfidf(corpus, query_terms):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([' '.join(query_terms)])
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    ranked_indices = np.argsort(-similarities)  # Descending order
    return [corpus[i] for i in ranked_indices[:5]]  # Top 5 sentences
# Function to get BERT embedding for a sentence
def get_bert_embedding(sentence):
    encoded = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = model(**encoded)
    return output.last_hidden_state.mean(dim=1).squeeze()
# Function to rank sentences with BERT embeddings based on similarity
def rank_sentences_with_bert(sentences, query):
    query_embedding = get_bert_embedding(query)
    sentence_embeddings = [get_bert_embedding(s) for s in sentences]
    similarities = [cosine_similarity([query_embedding], [emb])[0][0] for emb in sentence_embeddings]
    ranked_indices = np.argsort(-np.array(similarities))
    return [sentences[i] for i in ranked_indices[:1]]  # Top 1 sentence
# Main processing loop
results = {}
for patient, note in notes.items():
    # Extract relevant sentences using regex
    patterns = [severity_pattern, condition_pattern]
    relevant_sentences = extract_sentences_with_regex(note, patterns)
    # If no relevant sentences are found, handle gracefully
    if not relevant_sentences:
        results[patient] = {
            "Condition": "No relevant condition mentioned",
            "Severity": "No severity mentioned"
        }
        continue  # Skip to the next patient
    # Rank sentences using TF-IDF
    top_sentences = rank_sentences_tfidf(relevant_sentences, ["mitral", "regurgitation", "mild", "moderate", "severe"])
    # Refine ranking with BERT
    final_sentence = rank_sentences_with_bert(top_sentences, "mitral regurgitation")
    # Extract condition and severity
    condition = None
    severity = None
    # Check for severity and condition in the final sentence
    # (if several terms match, the last match in the sentence is kept)
    for severity_match in re.findall(severity_pattern, final_sentence[0], re.IGNORECASE):
        severity = severity_match.lower()
    for condition_match in re.findall(condition_pattern, final_sentence[0], re.IGNORECASE):
        condition = condition_match.lower()
    # Store results in the desired format
    results[patient] = {
        "Condition": condition.title() if condition else None,
        "Severity": severity.title() if severity else None
    }
# Display results
import json
print(json.dumps(results, indent=4))
Output
{
    "Patient A": {
        "Condition": "Mitral Regurgitation",
        "Severity": "Moderate"
    },
    "Patient B": {
        "Condition": "Mitral Valve Regurgitation",
        "Severity": "Severe"
    },
    "Patient C": {
        "Condition": "Mitral Regurgitation",
        "Severity": "Moderate"
    }
}
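One caveat about the severity step: the loop over `re.findall` keeps the last severity word found in the chosen sentence. In Patient A's first sentence that happens to be the right one, but a sentence that lists another condition after mitral regurgitation would be mislabeled. A possible refinement, sketched below as an assumption rather than as part of the pipeline above, is to pick the severity word closest to the condition mention. The function name `nearest_severity` is hypothetical, and character distance is only a rough heuristic.

```python
import re

severity_pattern = r"\b(mild|moderate|severe)\b"
condition_pattern = r"\bmitral (?:valve )?regurgitation\b"

def nearest_severity(sentence):
    """Return the severity word closest to the condition mention, or None."""
    condition = re.search(condition_pattern, sentence, re.IGNORECASE)
    if condition is None:
        return None
    best, best_gap = None, None
    for match in re.finditer(severity_pattern, sentence, re.IGNORECASE):
        # Character gap between the two spans (0 if they touch).
        gap = max(match.start() - condition.end(),
                  condition.start() - match.end(), 0)
        if best_gap is None or gap < best_gap:
            best, best_gap = match.group(1).lower(), gap
    return best

example = "Moderate mitral regurgitation, with mild aortic regurgitation."
print(nearest_severity(example))
```

For this example the last-match rule would return "mild", while the proximity rule returns "moderate". Proximity can still fail on longer or negated sentences, which is where a contextual model would be expected to help.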
Practical Applications
1. Clinical Decision Support Systems (CDSS)
In hospital settings, a CDSS supports doctors by highlighting relevant medical information from a patient’s electronic health record (EHR), with the aim of improving the speed and accuracy of diagnosis and treatment planning. By combining regex, TF-IDF and BERT, a CDSS can extract key medical terms, rank them by relevance, and refine results with contextual understanding, making it easier for doctors to focus on the most critical information.
Example: Imagine a patient visits the hospital with symptoms such as chest pain and dizziness. The CDSS uses regex patterns to identify medical terms in the patient’s records, finding phrases like “hypertension,” “shortness of breath,” or “heart disease.” TF-IDF then scores these terms to prioritize the ones most relevant to the patient’s symptoms. Finally, BERT analyzes the context to further refine the ranking, so that suggestions reflect how each term is actually used in the record.
With these prioritized terms, the CDSS can suggest possible diagnoses, such as heart failure, and recommend specific tests like an ECG or blood tests. By integrating BERT’s contextual understanding, the system can reduce false positives and help doctors receive actionable insights, supporting diagnostic accuracy and more targeted treatment decisions.
2. Pharmacovigilance: Adverse Event Detection from Medical Reports
Regulatory bodies and pharmaceutical companies need to monitor reports of adverse drug reactions (ADRs) to ensure drug safety. Keyword extraction with regex, TF-IDF, and BERT enables the detection of adverse events in unstructured text such as medical reports or social media posts.
Example: Regex patterns are designed to capture adverse reaction terms like “nausea,” “dizziness,” or “rash.” TF-IDF scores these terms based on their frequency within individual reports versus the entire corpus, identifying serious or uncommon side effects that may warrant further investigation. BERT then adds a layer of semantic understanding by analyzing whether the mention of symptoms is causal (e.g., “rash caused by medication”) or incidental (e.g., “rash reported but unrelated to treatment”).
This combination supports timely detection of, and response to, emerging safety concerns, which in turn helps with regulatory compliance and drug safety monitoring.
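The regex and weighting steps of this example can be sketched with the standard library alone. The reports and the three-term vocabulary below are hypothetical, and the smoothed IDF formula is an assumption; the sketch does not attempt the causal-versus-incidental judgment attributed to BERT.

```python
import math
import re
from collections import Counter

# Hypothetical adverse-event vocabulary and reports, for illustration only.
adr_pattern = re.compile(r"\b(nausea|dizziness|rash)\b", re.IGNORECASE)
reports = [
    "Patient reported nausea after the first dose.",
    "Mild nausea and dizziness noted.",
    "Nausea resolved; a rash appeared on day three.",
]

# Adverse terms found in each report, lowercased.
found = [[t.lower() for t in adr_pattern.findall(r)] for r in reports]

# Document frequency of each adverse term across the corpus.
df = Counter(t for terms in found for t in set(terms))
n_reports = len(reports)

# Smoothed IDF: terms mentioned in fewer reports receive a higher weight.
idf = {t: math.log((1 + n_reports) / (1 + df[t])) + 1 for t in df}

for term, weight in sorted(idf.items(), key=lambda kv: (-kv[1], kv[0])):
    print(term, round(weight, 3))
```

In this toy corpus "nausea" appears in every report and gets the lowest weight, while "dizziness" and "rash" appear once each and rise to the top of the list for review.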
3. Automated Report Classification in Healthcare
In healthcare facilities, managing and categorizing large volumes of medical reports is essential to ensure that each report is routed to the appropriate analysis tool or classifier. Different health systems often use unique report titles and naming conventions, making manual classification slow and error-prone. Using regex, TF-IDF, and BERT provides a solution by analyzing report content directly and assigning each report to the relevant classifier based on its content, rather than relying solely on its title.
Example: Imagine a hospital receives thousands of reports daily from multiple departments, including radiology, pathology, and lab results. Regex identifies key terms specific to each type (e.g., “x-ray,” “CT scan,” “biopsy”), while TF-IDF ranks these terms by importance within each report. BERT refines the classification by understanding context, differentiating between reports mentioning similar terms but with different implications:
- Reports containing terms like “tumor,” “biopsy,” and “malignant” are automatically routed to the pathology classifier.
- Reports mentioning “imaging,” “thoracic,” or “pulmonary” are assigned to the radiology classifier.
- Reports with terms like “hemoglobin,” “WBC,” or “platelets” go to the lab result analyzer.
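The routing rules in the list above can be approximated with a simple keyword count, as in the sketch below. It uses regex matching only, with no TF-IDF weighting or BERT step. The `ROUTES` table and the `route_report` function are hypothetical names, and the keyword lists are just the examples mentioned in this section.

```python
import re

# Hypothetical keyword lists based on the examples in this section; a real
# system would need much larger, curated vocabularies.
ROUTES = {
    "pathology": ["tumor", "biopsy", "malignant"],
    "radiology": ["imaging", "thoracic", "pulmonary", "x-ray", "ct scan"],
    "lab": ["hemoglobin", "wbc", "platelets"],
}

def route_report(text):
    """Return the route whose keywords occur most often, or None if none match."""
    lowered = text.lower()
    counts = {
        route: sum(
            len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
            for keyword in keywords
        )
        for route, keywords in ROUTES.items()
    }
    # On a tie, max() keeps the route listed first in ROUTES.
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

print(route_report("Biopsy of the thoracic mass shows a malignant tumor."))
```

The sample report mentions "thoracic" once but three pathology terms, so it is routed to pathology. Counting alone cannot tell a confirmed finding from a ruled-out one, which is the gap the contextual step is meant to close.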
Conclusion
Combining regex, TF-IDF, and BERT creates a robust pipeline for extracting actionable insights from unstructured text. This approach leverages the precision of regex, the statistical relevance ranking of TF-IDF, and the contextual understanding of BERT to address challenges like variability, irrelevant details, and ambiguity. The resulting pipeline is versatile and well-suited for applications in healthcare, pharmacovigilance, and automated text classification.
Author: Chetna Gohel