AI-Powered Curation

Scaling biomarker data curation with intelligent multi-agent systems

With over 1.5 million biomedical articles published annually, traditional manual curation cannot keep pace with the exponential growth of biomarker data. We are developing an AI-powered multi-agent system that automates literature discovery, data extraction, and quality assurance while maintaining scientific rigor through human-in-the-loop validation.

Coming Soon: The first batch of AI-extracted datasets will be available in the coming weeks. Stay tuned for expanded coverage of pathogen shedding, immunological responses, and human challenge studies.

Multi-Agent Architecture

Three specialized AI agents work in concert across the entire curation workflow

Discovery Agent

Implements PRISMA-guided systematic discovery to identify relevant biomarker studies across multiple databases including PubMed, Embase, Web of Science, and clinical trial registries.

  • Systematic review integration
  • Comprehensive database searching
  • Metadata-based screening with confidence scoring
  • Full-text retrieval and processing
  • Eligibility assessment
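
As a rough illustration of the search step, the sketch below queries PubMed through the public NCBI E-utilities endpoints and collects the metadata used for screening. The query string is hypothetical, and the full agent extends this across the other databases and the remaining PRISMA stages.

```python
# Minimal sketch of the Discovery Agent's PubMed search step (illustrative only).
# Uses the public NCBI E-utilities endpoint; the query terms are hypothetical.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, retmax: int = 100) -> list[str]:
    """Return PMIDs matching a PRISMA-style search string."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def fetch_summaries(pmids: list[str]) -> dict:
    """Fetch the title/journal/date metadata used for metadata-based screening."""
    resp = requests.get(
        f"{EUTILS}/esummary.fcgi",
        params={"db": "pubmed", "id": ",".join(pmids), "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["result"]

if __name__ == "__main__":
    # Hypothetical query; the real agent builds search strings from the review protocol.
    pmids = search_pubmed('("viral shedding"[Title/Abstract]) AND "longitudinal"[Title/Abstract]')
    print(f"{len(pmids)} candidate records retrieved for screening")
```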

Extraction Agent

Processes heterogeneous document formats with structured schema validation, extracting biomarker data from tables, text, and figures.

  • Multi-format parsing (XML, HTML, PDF)
  • Structured output with schema validation
  • Supplementary materials processing
  • Controlled vocabulary integration
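
A minimal sketch of schema-validated extraction output using Pydantic (v2); the field names are illustrative, not the project's actual schema.

```python
# Sketch of schema-validated extraction output (field names are illustrative).
from typing import Optional
from pydantic import BaseModel, Field

class BiomarkerMeasurement(BaseModel):
    specimen_type: str = Field(..., description="e.g. nasal swab, serum, stool")
    analyte: str = Field(..., description="e.g. viral RNA, IgG titer")
    value: float
    unit: str
    day_post_exposure: Optional[int] = None
    source_location: str = Field(..., description="Table, figure, or paragraph the value came from")

class ExtractionRecord(BaseModel):
    pmid: str
    confidence: int = Field(..., ge=1, le=5)
    measurements: list[BiomarkerMeasurement]

# Raw LLM output is validated before entering the pipeline; malformed records
# raise a ValidationError and are routed back for re-extraction.
record = ExtractionRecord.model_validate({
    "pmid": "00000000",
    "confidence": 4,
    "measurements": [{
        "specimen_type": "nasal swab",
        "analyte": "viral RNA",
        "value": 5.2,
        "unit": "log10 copies/mL",
        "day_post_exposure": 3,
        "source_location": "Table 2",
    }],
})
```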

Review Agent

Provides cross-model quality assurance using different LLM APIs to reduce systematic biases inherent in single-model systems.

  • Independent validation of extractions
  • Cross-model verification
  • Source verification to detect hallucinated values
  • Confidence score calibration
  • Human expert triage for uncertain cases
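
A minimal sketch of the cross-model check, with `ask_model_a` and `ask_model_b` standing in for calls to two different LLM providers; the prompt and return structure are illustrative.

```python
# Sketch of cross-model verification: a second, independent model re-extracts
# the same fields from the source passage, and field-level disagreements are
# flagged for triage.
from typing import Callable

def verify_extraction(
    passage: str,
    extracted: dict[str, str],
    ask_model_a: Callable[[str], dict[str, str]],
    ask_model_b: Callable[[str], dict[str, str]],
) -> dict[str, dict[str, str | None]]:
    """Return fields where the original extraction and the two checkers disagree."""
    prompt = (
        "Re-extract these fields strictly from the passage; answer 'not stated' "
        f"if a field is absent.\nFields: {sorted(extracted)}\nPassage: {passage}"
    )
    answers = {"model_a": ask_model_a(prompt), "model_b": ask_model_b(prompt)}
    disagreements: dict[str, dict[str, str | None]] = {}
    for field, original in extracted.items():
        values = {name: ans.get(field) for name, ans in answers.items()}
        if any(v != original for v in values.values()):
            disagreements[field] = {"original": original, **values}
    return disagreements
```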

Two-Stage Quality Control

Rather than a binary accept/reject decision, our framework assigns interpretable, prompt-based confidence scores that drive triage of extraction outputs.

Confidence Scoring (1-5 Scale)

  • 4-5 (high confidence) → Automated processing
  • 2-3 (medium confidence) → Review agent validation
  • 1 (low confidence) → Human expert review

Human-in-the-Loop Validation

Expert review is strategically triggered for:

  • Low confidence extractions
  • Review agent disagreements
  • Random sampling (5% of high-confidence cases)
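
A compact sketch of how the two stages can combine into a single routing rule, assuming the 1-5 scale above and a 5% random audit of high-confidence extractions; the function and argument names are illustrative.

```python
# Sketch of the triage rule combining confidence scores with the
# human-in-the-loop triggers described above.
import random

def route(confidence: int, reviewer_disagrees: bool, audit_rate: float = 0.05) -> str:
    """Return 'auto', 'review_agent', or 'human' for one extraction."""
    if reviewer_disagrees or confidence <= 1:
        return "human"
    if confidence <= 3:
        return "review_agent"
    # High-confidence extractions are auto-processed, with a random audit sample.
    return "human" if random.random() < audit_rate else "auto"
```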

Target Data Domains

Comprehensive coverage across critical biomarker research areas

Pathogen Shedding

Longitudinal quantitative measurements of viral load, bacterial counts, and antigen concentrations across specimen types, with the temporal resolution needed to characterize shedding kinetics.

Immunological Responses

Quantitative measurements of antibody titers, cellular responses, and cytokine levels with temporal data enabling kinetic analysis.

Human Challenge Studies

Controlled pathogen exposure with systematic biomarker monitoring, including infection outcomes and correlations between biomarker levels and clinical endpoints.

FAIR Data Infrastructure (Coming Soon)

All AI-curated data follows FAIR principles (Findable, Accessible, Interoperable, Reusable) to maximize research impact and enable seamless computational integration.

Controlled Vocabularies

SKOS-based hierarchical taxonomies with mappings to established ontologies (OBI, SNOMED CT), ensuring standardized terminology during curation.
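
The sketch below shows, using rdflib, how a vocabulary term can carry a preferred label, a broader-term relation, and a mapping to an external ontology; the namespace, labels, and OBI identifier are placeholders.

```python
# Sketch of a SKOS concept with a broader-term relation and an external
# ontology mapping, built with rdflib (URIs and labels are illustrative).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://example.org/vocab/")  # placeholder namespace

g = Graph()
g.bind("skos", SKOS)

specimen = EX["specimen"]
nasal_swab = EX["nasal-swab"]

g.add((specimen, RDF.type, SKOS.Concept))
g.add((specimen, SKOS.prefLabel, Literal("Specimen", lang="en")))

g.add((nasal_swab, RDF.type, SKOS.Concept))
g.add((nasal_swab, SKOS.prefLabel, Literal("Nasal swab", lang="en")))
g.add((nasal_swab, SKOS.broader, specimen))

# Cross-reference to an established ontology term (identifier is a placeholder).
g.add((nasal_swab, SKOS.exactMatch, URIRef("http://purl.obolibrary.org/obo/OBI_0000000")))

print(g.serialize(format="turtle"))
```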

Semantic Metadata

JSON-LD metadata using Dublin Core, DCAT, and SKOS standards enables semantic web interoperability and automated data discovery.
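
As a rough illustration, a dataset-level JSON-LD record built from Dublin Core and DCAT terms might look like the following; all titles, dates, and identifiers are placeholders, not a published record.

```python
# Sketch of dataset-level JSON-LD metadata using Dublin Core terms and DCAT
# (values are placeholders).
import json

metadata = {
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "dcat": "http://www.w3.org/ns/dcat#",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example pathogen shedding dataset",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dct:issued": "2025-01-01",
    "dcat:keyword": ["viral shedding", "biomarker kinetics"],
    "dcat:distribution": [
        {"@type": "dcat:Distribution", "dcat:mediaType": "text/csv"},
        {"@type": "dcat:Distribution", "dcat:mediaType": "application/ld+json"},
    ],
}

print(json.dumps(metadata, indent=2))
```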

Multi-Format Outputs

Datasets available in YAML (human-readable), JSON-LD (semantic web), and CSV (statistical analysis) with comprehensive data dictionaries.
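
For illustration, the same records can be serialized to the three distribution formats roughly as follows; the column names are placeholders, and PyYAML is assumed for the YAML output.

```python
# Sketch of writing one curated record set to YAML, JSON, and CSV.
import csv
import json
import yaml  # PyYAML

records = [
    {"pmid": "00000000", "analyte": "viral RNA", "day": 3, "value": 5.2, "unit": "log10 copies/mL"},
]

with open("dataset.yaml", "w") as f:
    yaml.safe_dump(records, f, sort_keys=False)

with open("dataset.json", "w") as f:
    json.dump(records, f, indent=2)

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```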

API Access

FastAPI-based REST infrastructure with OpenAPI specifications for programmatic access, flexible querying, and seamless integration with analysis pipelines.
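
A minimal FastAPI sketch of what such an endpoint could look like; the route, query parameters, and in-memory data are illustrative, not the final API. The interactive OpenAPI documentation is served automatically at /docs.

```python
# Minimal sketch of the planned REST interface (illustrative, not the final API).
from fastapi import FastAPI, Query

app = FastAPI(title="Biomarker curation API (sketch)")

# In this sketch the data lives in memory; the production service queries the curated store.
MEASUREMENTS = [
    {"pmid": "00000000", "pathogen": "example virus", "specimen": "nasal swab", "day": 3, "value": 5.2},
]

@app.get("/measurements")
def list_measurements(
    pathogen: str | None = Query(default=None),
    specimen: str | None = Query(default=None),
):
    """Filter measurements by pathogen and specimen type."""
    rows = MEASUREMENTS
    if pathogen:
        rows = [r for r in rows if r["pathogen"] == pathogen]
    if specimen:
        rows = [r for r in rows if r["specimen"] == specimen]
    return rows

# Run with: uvicorn sketch_api:app --reload
```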

Why AI-Powered Curation?

Accelerated Research

Reduce data preparation from months to days, enabling rapid response to emerging health threats.

Scalable Quality

Process hundreds of articles per hour while maintaining quality standards equivalent to expert curation.

Continuous Updates

Automated pipelines enable real-time evidence synthesis as new studies are published.

Public Health Impact

Support diagnostic testing strategies, surveillance systems, and disease control measures.

Modeling Support

Standardized datasets enable scenario planning, power calculations, and sensitivity analyses.

Open Science

All tools, datasets, and workflows publicly available under open-source licenses.

Current Progress

Building on our manually curated benchmark datasets, we are actively developing and validating the AI-powered curation system.

  • Manually Curated Benchmarks (Completed): 39 studies, 27,328 participants, 60,617 measurements
  • Controlled Vocabulary (Implemented): 60 controlled vocabulary terms, 132 semantic relationships
  • Literature Discovery Agent (Prototype): PRISMA-guided systematic search across PubMed
  • Data Extraction Agent (Prototype): Multi-format parsing with schema validation
  • Review Agent (In Development): Cross-model validation framework
  • First AI-Extracted Dataset Batch (Coming Soon): Expected in the coming weeks

Get Involved

We welcome collaboration from researchers, public health professionals, and developers interested in advancing AI-powered biomedical data curation.