CuratorAI

Make every dataset speak the same language.

CuratorAI uses LLMs to harmonize the messy metadata that describes biomedical datasets — mapping inconsistent sample and study annotations to standardized, ontology-linked terms. It’s the cleanup layer that makes data comparable and analysis-ready before any model touches it.

Overview

Most biomedical datasets arrive with inconsistent, free-text metadata: the same tissue, disease, or treatment described a dozen different ways. CuratorAI is an LLM-based agentic workflow that reads this metadata and maps entities such as disease, drug, tissue, and cell type to standardized, ontology-linked terms. Working from the metadata table and a short study description, it also suggests the analysis contrasts the data actually supports.

So a dataset arrives messy and leaves analysis-ready: harmonized columns plus proposed contrasts like treatment vs. control or responder vs. non-responder. It’s the clean foundation every downstream model and knowledge-graph query depends on.

Messy metadata in, analysis-ready data out.

CuratorAI reads each dataset’s metadata table and a short description of the study, then suggests harmonized columns mapped to the knowledge graph (KG), along with the analysis contrasts the data actually supports.

Raw metadata

tissue: breast tumor

agent: dithranol

group: disease (incl. treated samples)

Curated by CuratorAI

tissue → Breast · KG-mapped

drug → Anthralin · dithranol = synonym

contrast → disease vs. control · treated excluded

Catches the mislabeled treated samples that would have quietly corrupted a disease-vs-control comparison.
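
The example above can be sketched in a few lines of Python. This is only an illustration of the input and output shapes, not how CuratorAI works internally: CuratorAI uses LLMs and ontology-linked terms, whereas this sketch swaps in a hard-coded lookup table. The names SYNONYMS, harmonize, Sample, and propose_contrast are hypothetical, and so are the sample IDs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table (assumption). A real system would resolve
# terms against an ontology or knowledge graph, not a hard-coded dict.
SYNONYMS = {
    "tissue": {"breast tumor": "Breast"},
    "drug": {"dithranol": "Anthralin", "anthralin": "Anthralin"},
}


def harmonize(field: str, raw: str) -> str:
    """Map a raw free-text value to its standardized term, if one is known."""
    table = SYNONYMS.get(field, {})
    # Unknown values are returned unchanged rather than guessed.
    return table.get(raw.strip().lower(), raw)


@dataclass
class Sample:
    sample_id: str
    group: str                   # raw group label, e.g. "disease" or "control"
    agent: Optional[str] = None  # treatment agent, if the sample was treated


def propose_contrast(samples):
    """Build a disease-vs-control contrast, setting treated samples aside."""
    case = [s.sample_id for s in samples
            if s.group == "disease" and s.agent is None]
    control = [s.sample_id for s in samples if s.group == "control"]
    # Treated samples labeled "disease" would confound the comparison.
    excluded = [s.sample_id for s in samples
                if s.group == "disease" and s.agent is not None]
    return {"name": "disease vs. control", "case": case,
            "control": control, "excluded": excluded}


samples = [
    Sample("S1", "disease"),
    Sample("S2", "disease", agent="dithranol"),
    Sample("S3", "control"),
]
contrast = propose_contrast(samples)
```

Here the treated sample S2 carries the "disease" label but lands in the excluded list, which mirrors the "treated excluded" note in the curated output above.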

The same agentic curation harmonizes metadata across thousands of datasets — powering the Oncology Sample Universe (~150K samples across ~7K datasets, with treatment and response annotations).

The Data4Cure AI suite.
