Overview
CrateGen is a Python library that converts GA4GH Cloud API schemas (TES, WES) to RO-Crate profiles. It makes data sharing and reproducibility easier in scientific research by handling the conversion between genomic and health dataset formats.
The Problem
Genomic research data comes in many formats. The GA4GH (Global Alliance for Genomics and Health) has established cloud API standards such as TES (Task Execution Service) and WES (Workflow Execution Service), but converting these schemas to RO-Crate has been a manual, error-prone process.
The Solution
CrateGen automates this conversion process with:
- Schema Mapping: Intelligent mapping between GA4GH schemas and RO-Crate profiles
- Validation: Built-in validation to ensure FAIR data principles compliance
- Extensibility: Plugin architecture for custom schema extensions
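To make the schema-mapping idea concrete, here is a minimal sketch of how a TES task could be turned into an RO-Crate metadata document. It is not CrateGen's implementation: the function name `tes_task_to_ro_crate`, the example task, and the choice to represent the task as a `CreateAction` are assumptions made for illustration. The TES field names (`id`, `name`, `description`, `inputs`, `outputs`, `url`) and the RO-Crate 1.1 context and descriptor follow the published specifications.

```python
import json


def tes_task_to_ro_crate(task: dict) -> dict:
    """Hypothetical helper: map a TES task dict to a minimal RO-Crate 1.1 document."""
    action_id = f"#{task.get('id', 'task')}"
    name = task.get("name", "Untitled TES task")

    # TES inputs and outputs carry a storage URL; reuse it as the entity identifier.
    inputs = [{"@id": i["url"]} for i in task.get("inputs", []) if "url" in i]
    outputs = [{"@id": o["url"]} for o in task.get("outputs", []) if "url" in o]

    graph = [
        # Metadata descriptor required by RO-Crate 1.1.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset describing the crate as a whole.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": task.get("description", ""),
            "hasPart": inputs + outputs,
            "mentions": [{"@id": action_id}],
        },
        # Assumption: the task execution is modelled as a CreateAction.
        {
            "@id": action_id,
            "@type": "CreateAction",
            "name": name,
            "object": inputs,
            "result": outputs,
        },
    ]

    # One File entity per distinct URL, preserving order.
    for file_id in dict.fromkeys(ref["@id"] for ref in inputs + outputs):
        graph.append({"@id": file_id, "@type": "File"})

    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Made-up example task, shaped like a TES task.
example_task = {
    "id": "task-001",
    "name": "md5sum",
    "description": "Compute a checksum",
    "executors": [{"image": "ubuntu:22.04", "command": ["md5sum", "/data/in.txt"]}],
    "inputs": [{"url": "s3://bucket/in.txt", "path": "/data/in.txt"}],
    "outputs": [{"url": "s3://bucket/out.txt", "path": "/data/out.txt"}],
}

crate_doc = tes_task_to_ro_crate(example_task)
print(json.dumps(crate_doc, indent=2))
```

A real mapping would also need to cover executors, resources, and logs; this sketch only shows the shape of the translation.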
Key Features
Automatic Schema Detection
from crategen import CrateGenerator
# Automatically detects input schema type
generator = CrateGenerator(input_file="workflow.tes.json")
crate = generator.to_ro_crate()

FAIR Compliance Checking
from crategen import FairValidator
validator = FairValidator(crate)
report = validator.validate()
print(f"FAIR Score: {report.score}/100")
print(f"Issues: {report.issues}")

Batch Processing
from crategen import BatchProcessor
processor = BatchProcessor(input_dir="./workflows/")
results = processor.convert_all(output_format="ro-crate")

Technical Architecture
The library follows a modular architecture:
- Parser Layer: Handles input schema parsing (TES, WES, custom)
- Transformer Layer: Maps fields between schemas
- Generator Layer: Produces valid RO-Crate output
- Validation Layer: Ensures compliance with FAIR principles
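The four layers can be sketched as a chain of small functions. Everything named here (`detect_schema`, `transform_tes`, `transform_wes`, `generate_crate`, `validate`, `convert`) is hypothetical and only illustrates the layering, not CrateGen's actual classes. The detection heuristic assumes that TES tasks carry an `executors` list and WES run requests carry a `workflow_url`, and the validation step is a simple stand-in for real FAIR checks.

```python
def detect_schema(doc: dict) -> str:
    """Parser layer: guess the GA4GH schema from characteristic fields."""
    if "executors" in doc:       # TES tasks carry a list of executors
        return "tes"
    if "workflow_url" in doc:    # WES run requests carry a workflow URL
        return "wes"
    raise ValueError("unrecognised input schema")


def transform_tes(doc: dict) -> dict:
    """Transformer layer (TES): pick the fields the generator needs."""
    return {"name": doc.get("name"), "description": doc.get("description")}


def transform_wes(doc: dict) -> dict:
    """Transformer layer (WES): placeholder mapping, chosen for illustration."""
    wf_type = doc.get("workflow_type")
    return {
        "name": doc.get("workflow_url"),
        "description": f"{wf_type} workflow" if wf_type else None,
    }


TRANSFORMERS = {"tes": transform_tes, "wes": transform_wes}


def generate_crate(fields: dict) -> dict:
    """Generator layer: wrap the mapped fields in an RO-Crate 1.1 skeleton."""
    root = {"@id": "./", "@type": "Dataset"}
    root.update({k: v for k, v in fields.items() if v})
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            root,
        ],
    }


def validate(crate: dict) -> list[str]:
    """Validation layer: a stand-in check that descriptive metadata is present."""
    root = next(e for e in crate["@graph"] if e["@id"] == "./")
    return [
        f"root dataset is missing '{key}'"
        for key in ("name", "description")
        if key not in root
    ]


def convert(doc: dict) -> tuple[dict, list[str]]:
    """Run the four layers in order: parse, transform, generate, validate."""
    schema = detect_schema(doc)
    crate = generate_crate(TRANSFORMERS[schema](doc))
    return crate, validate(crate)


crate, issues = convert({"name": "demo", "executors": [{"image": "alpine", "command": ["true"]}]})
print(issues)
```

Keeping each layer a plain function (or a class with one method) means a custom schema only needs a new detector branch and a new entry in the transformer table.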
Impact
- 40% improvement in data exchange workflows
- Adopted by research institutions globally
- Ensures FAIR data principles compliance
- Enables integration across genomic databases
Lessons Learned
Working on CrateGen taught me a lot about:
- Designing APIs for scientific communities
- Implementing strict validation systems
- Contributing to open-source standards organizations
- CI/CD best practices with Django and DigitalOcean