Client Background
Our Client Drives India's Economic Growth and Social Development
The client, a prominent government policy think tank in India, plays a critical role in policy formulation, socio-economic development, and data-driven insights. It collaborates with states and key stakeholders to promote cooperative federalism and align grassroots initiatives with national priorities, ensuring sustainable and inclusive growth.
Challenge
Difficulty in Efficient Information Search
The objective of the project was to design and implement a similarity search powered by retrieval-augmented generation (RAG) and tailored for the government sector. The system would enable users to query a large database of government policies and documents with precision, improving data accessibility and streamlining policy analysis. A key requirement was high accuracy and transparency in search results to support informed decision-making.
Healthark’s role
Developed & Tested RAG-Powered Similarity Search
Healthark was responsible for end-to-end development of the RAG-Powered Similarity Search. This involved designing an intuitive user interface, integrating AI-driven search algorithms, and building a robust backend capable of processing a large repository of government documents. The solution leveraged advanced natural language processing (NLP) techniques to ensure precise retrieval of information.
- Needs Assessment: Collaborated with key stakeholders to assess the complexity, volume, and diversity of the data. Identified the need for a “RAG-Powered Similarity Search” that supports multiple languages and handles various data formats, including web pages, PDFs, spreadsheets, images, and databases. The solution also required the ability to accurately extract information from non-editable sources using OCR tools, while ensuring scalability to effectively manage future data growth.
- Development of Similarity Search Functionality: Developed an AI-powered search system that interprets user queries and retrieves relevant content from a corpus of over 300 government policy documents with high precision, handling multiple formats such as text, tables, and images.
  - Parsing and Preprocessing: Utilized tools like PyMuPDF, Tabula, and BeautifulSoup for parsing, along with NLTK and spaCy for text cleaning and standardization to ensure structured data.
  - Chunking and Metadata Extraction: Split large documents into smaller chunks for efficient indexing and retrieval, with metadata extraction (titles, dates, authors) supporting filtering.
  - Embedding Generation: Employed Sentence Transformers to create high-dimensional embeddings, stored in Pinecone DB for fast similarity searches.
  - Model Selection: Used both pre-trained and fine-tuned models from OpenAI and Hugging Face for accurate query matching and document relevance.
  - Hybrid Retrieval System: Combined cosine similarity with BM25 relevance scoring and metadata-based filters to enhance search precision.
  - Integration of Multilingual Support: Implemented OpenAI’s language translation model to handle multilingual queries, enabling users to input questions in various Indian languages with real-time translation, retrieval, and direct links to relevant policies for easy verification.
- Results, Testing and Feedback: The system utilized transformer-based models to aggregate and summarize relevant results, providing concise insights with direct reference links to the original documents for easy navigation. Extensive testing and user feedback were gathered to enhance response accuracy, answer quality, and overall system performance. Continuous improvements refined the model’s understanding, ensuring a user-friendly experience that is both accurate and efficient. This iterative process made information retrieval more intuitive, reliable, and aligned with user needs.
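As a rough illustration of the chunking and hybrid retrieval steps described above, the following minimal Python sketch splits text into overlapping chunks, filters candidates by metadata, and blends cosine similarity with a normalized BM25 score. It is not Healthark's implementation. To stay self-contained it substitutes simple word-count vectors for Sentence Transformers embeddings and an in-memory list for Pinecone, and all function names, parameters (such as `alpha`), and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(query, chunks, alpha=0.5, filters=None, top_k=3):
    """Filter by metadata, then rank by alpha*cosine + (1-alpha)*BM25."""
    filters = filters or {}
    candidates = [
        c for c in chunks
        if all(c["meta"].get(k) == v for k, v in filters.items())
    ]
    if not candidates:
        return []
    q_tokens = tokenize(query)
    docs = [tokenize(c["text"]) for c in candidates]
    bm25 = bm25_scores(q_tokens, docs)
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches
    q_vec = Counter(q_tokens)  # stand-in for a real query embedding
    ranked = []
    for chunk, tokens, s in zip(candidates, docs, bm25):
        score = alpha * cosine(q_vec, Counter(tokens)) + (1 - alpha) * s / top
        ranked.append((score, chunk))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]


# Hypothetical sample chunks with extracted metadata.
CHUNKS = [
    {"text": "Scheme guidelines for rural drinking water supply and sanitation",
     "meta": {"title": "Water Policy", "year": 2021}},
    {"text": "Framework for digital health records and telemedicine services",
     "meta": {"title": "Digital Health", "year": 2022}},
    {"text": "Rural road connectivity programme funding norms",
     "meta": {"title": "Rural Roads", "year": 2021}},
]

if __name__ == "__main__":
    for score, chunk in hybrid_search("rural drinking water", CHUNKS):
        print(round(score, 3), chunk["meta"]["title"])
```

In a production setup of the kind the case study describes, the word-count vectors would be replaced by dense embeddings and the in-memory list by a vector database query. The weighting between the two scores (`alpha` here) is a tuning choice that the source does not specify.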
Empowering Tomorrow's Healthcare
This case study highlights how a vector search-based Q&A system enhances stakeholder access to insights from policy documents. Integration of AI-driven search algorithms and advanced NLP enables efficient querying of over 300 policies, ensuring accurate, multilingual results that enhance transparency, data accessibility, and informed decision-making.