Data4Cure AI CuratorAI
AI Solution · Metadata harmonization

Make every dataset speak the same language.

CuratorAI uses LLMs to harmonize the messy metadata that describes biomedical datasets — mapping inconsistent sample and study annotations to standardized, ontology-linked terms. It’s the cleanup layer that makes data comparable and analysis-ready before any model touches it.

Overview

CuratorAI

Most biomedical datasets arrive with inconsistent, free-text metadata: the same tissue, disease, or treatment described a dozen different ways. CuratorAI is an LLM-based agentic workflow that reads this metadata and maps entities such as disease, drug, tissue, and cell type to standardized, ontology-linked terms. Working from the metadata table and a short study description, it also suggests the analysis contrasts the data actually supports.

So a dataset arrives messy and leaves analysis-ready: harmonized columns plus proposed contrasts like treatment vs. control or responder vs. non-responder. It’s the clean foundation every downstream model and knowledge-graph query depends on.
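To make "proposed contrasts" concrete, here is a minimal Python sketch of one way contrasts could be enumerated once a column has been harmonized. It is an illustration only, not CuratorAI's implementation: the function name `propose_contrasts`, the `condition` column, and the `min_per_group` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def propose_contrasts(rows, column, min_per_group=2):
    """Return pairs of levels in `column` that each have enough samples.

    Hypothetical helper: a level only qualifies for a contrast if at least
    `min_per_group` samples carry it.
    """
    counts = Counter(row[column] for row in rows if row.get(column))
    eligible = sorted(level for level, n in counts.items() if n >= min_per_group)
    return list(combinations(eligible, 2))


# Toy harmonized metadata table (made-up sample IDs and values).
samples = [
    {"sample": "S1", "condition": "treatment"},
    {"sample": "S2", "condition": "treatment"},
    {"sample": "S3", "condition": "control"},
    {"sample": "S4", "condition": "control"},
    {"sample": "S5", "condition": "unknown"},
]

contrasts = propose_contrasts(samples, "condition")
print(contrasts)
```

In this toy table only "treatment" and "control" have two samples each, so they form the single proposed contrast; "unknown" is left out because it has one sample.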

In the Data Import Studio

Messy metadata in, analysis-ready data out.

CuratorAI reads each dataset’s metadata table and a short description of the study, then suggests harmonized columns mapped to the knowledge graph (KG), along with the analysis contrasts the data actually supports.

Raw metadata
tissue: breast tumor
agent: dithranol
group: disease (incl. treated samples)

Curated by CuratorAI
tissue → Breast · KG-mapped
drug → Anthralin · dithranol = synonym
contrast → disease vs. control · treated excluded

Catches the mislabeled treated samples that would have quietly corrupted a disease-vs-control comparison.
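The before-and-after above can be mimicked with a small Python sketch. This is a simplified stand-in, not how CuratorAI works internally: a hand-written lookup table replaces the LLM-based mapping, and the names `SYNONYMS`, `RENAME`, `harmonize`, `disease_vs_control`, and the `treated` flag are assumptions made for illustration.

```python
# Hypothetical lookup tables standing in for LLM-driven, ontology-linked mapping.
SYNONYMS = {
    "tissue": {"breast tumor": "Breast"},
    "drug": {"dithranol": "Anthralin", "anthralin": "Anthralin"},
}
RENAME = {"agent": "drug"}  # raw column name -> harmonized column name


def harmonize(record):
    """Rename columns and map raw values to standardized terms where known."""
    out = {}
    for key, value in record.items():
        column = RENAME.get(key, key)
        table = SYNONYMS.get(column, {})
        # Unknown values pass through unchanged rather than being guessed.
        out[column] = table.get(value.strip().lower(), value)
    return out


def disease_vs_control(samples):
    """Split samples into disease and control arms, dropping treated samples."""
    arms = {"disease": [], "control": []}
    for s in samples:
        if s.get("treated"):
            continue  # treated samples would confound disease vs. control
        if s["group"] in arms:
            arms[s["group"]].append(s["id"])
    return arms


harmonized = harmonize({"tissue": "breast tumor", "agent": "dithranol"})
arms = disease_vs_control([
    {"id": "S1", "group": "disease", "treated": False},
    {"id": "S2", "group": "disease", "treated": True},
    {"id": "S3", "group": "control", "treated": False},
])
print(harmonized)
print(arms)
```

Here the treated sample S2 is labeled "disease" in the raw data but is kept out of the disease arm, which mirrors the "treated excluded" note in the curated contrast.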

The same agentic curation harmonizes metadata across thousands of datasets — powering the Oncology Sample Universe (~150K samples across ~7K datasets, with treatment and response annotations).

Get in touch

See CuratorAI in action.

Walk through this with an applications scientist — focused on the questions that matter to you.