Document Translation, Data Extraction, and Literature Search


The client is a leading service provider to the life sciences industry. The company provides analytics, clinical research services, and technology solutions, with operations in over 100 countries. It focuses on making intelligent connections that deliver insights and execution capabilities to biotech, medical device, pharma, medical research, government, payers, and other healthcare stakeholders. These in turn provide a deeper understanding that helps in developing and implementing scientific advances towards prevention and cure.

Project brief:

The project aimed to facilitate the client's operations, which involve collecting and understanding various types of data and documents in different languages and consolidating the insights for their own clients. The initial focus was on Adverse Events (AEs). Our client receives AE forms in various languages; these need to be translated and the required data extracted, including non-language data such as checkboxes. The client also receives published literature (case studies, systematic reviews, etc.) from which they have to extract the AEs, the drugs possibly responsible for them, and the causality between the two.


The solution is a generative AI-based platform with OCR capabilities, where the client can upload the required documents and either provide task instructions in natural language (for data extraction and translation) or converse with the document to obtain the necessary information.

Healthark’s Role

We developed the platform for the client, initially trialling various foundation models and their APIs to assess their capabilities for our solution. The results of our sprints were benchmarked against standards to find the most accurate solution. Subsequently, the platform was built, enabling the client to upload the requisite documents and issue task instructions in natural language.


With the platform we built, the client was able to do the following:

  1. Translate documents from various languages
  2. Extract language and non-language data (such as checkboxes) from complex forms with dynamic formats, with output in JSON
  3. Identify entities and infer causal relationships between them
  4. Summarize text
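
To make the JSON output in item 2 more concrete, here is a minimal Python sketch of how one extracted AE form might be represented and serialized. This is illustrative only and is not the platform's actual schema: the class names, field names, causality label, and placeholder values are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExtractedField:
    name: str      # hypothetical label of a field on the form
    kind: str      # "text" for language data, "checkbox" for non-language data
    value: object  # translated text, or True/False for a checkbox


@dataclass
class CausalLink:
    adverse_event: str
    suspect_drug: str
    causality: str  # hypothetical label, e.g. "possible"


@dataclass
class ExtractionResult:
    source_language: str
    fields: list = field(default_factory=list)
    causal_links: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses into plain dicts
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


# Placeholder values only; no real form or patient data is shown.
result = ExtractionResult(
    source_language="de",
    fields=[
        ExtractedField("event_description", "text", "placeholder translated text"),
        ExtractedField("hospitalization_required", "checkbox", True),
    ],
    causal_links=[CausalLink("placeholder event", "placeholder drug", "possible")],
)
print(result.to_json())
```

Keeping checkbox values as booleans next to translated text, in a single structure, is one way to let downstream systems consume both data types from the same JSON document.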