MiAIRR: Minimum information about an Adaptive Immune Receptor Repertoire Sequencing Experiment
1. Minimum information about an Adaptive Immune Receptor
Repertoire Sequencing Experiment
Presenter: Syed Ahmad Chan Bukhari, PhD
Department of Pathology, Yale School of Medicine
2. Inability to reproduce scientific experiments is a
big challenge.
Lithgow, G. J., Driscoll, M., & Phillips, P. (2017). A long
journey to reproducible results. Nature News, 548(7668), 387.
● A drug-like molecule was reported to extend
roundworm lifespan by as much as 67%.
● Other labs failed to replicate the studies.
● Two cancer labs spent more than a year
trying to understand inconsistencies with
the same tumour biopsy.
● Because of a lack of standards, the two labs
were using different cell isolation
protocols.
3. Inability to reproduce scientific experiments is a
big challenge.
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for
preclinical cancer research. Nature, 483(7391), 531-533.
● Amgen could reproduce the findings
in only 6 of 53 “landmark” papers
in cancer biology
● Bayer could validate only 25% of
67 preclinical studies
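The two figures above are reported in different forms (a count and a percentage), so a little arithmetic puts them on the same scale. A minimal sketch in Python; the variable names are illustrative, and the numbers are the ones quoted on this slide.

```python
# Counts quoted on the slide (Begley & Ellis, 2012).
amgen_reproduced, amgen_total = 6, 53
bayer_rate, bayer_total = 0.25, 67

# Convert the Amgen count to a rate and the Bayer rate to an approximate count.
amgen_rate = amgen_reproduced / amgen_total
bayer_validated = bayer_rate * bayer_total

print(f"Amgen: {amgen_rate:.1%} of landmark papers reproduced")
print(f"Bayer: about {round(bayer_validated)} of {bayer_total} studies validated")
```

Under these numbers, roughly one in nine of the Amgen papers and one in four of the Bayer studies held up.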
4. Inability to reproduce scientific experiments can
have multiple causes.
● Undocumented scientific procedures
● Dataset size and variability
● Problems with statistical techniques
● A procedure that is documented but difficult to follow
5. Standardization is a proven way to make sense
of scientific procedures and outcomes.
Which array platform was used, and how?
6. Experiments in immunology face similar
reproducibility challenges
● High-throughput sequencing (HTS) of B-cell (antibody, immunoglobulin) and T-cell
receptor repertoires has increased dramatically since the technique was
introduced in 2009.
○ Previously relied on low-resolution approaches, such as flow cytometry, spectratyping and
Sanger sequencing
● B cell receptors (BCRs) and T cell
receptors (TCRs) serve as the primary
means for specific detection of foreign
antigens.
7. Adaptive Immune Receptor Repertoire (AIRR) Sequencing
● The collection of BCRs or TCRs in an individual, tissue or cell subset, or during an
immune response, is referred to as the repertoire.
● AIRR-seq studies are associated with complex metadata, such as donor
phenotypes, cell types and nucleic acid material used.
○ Crucial for ensuring reproducibility and facilitating secondary and meta analyses
● AIRR sequencing has enormous promise for understanding the dynamics of the
immune repertoire in vaccinology, infectious disease, autoimmunity, and cancer
biology.
9. Adaptive Immune-Receptor Repertoire (AIRR) Community
Next-generation sequencing of B & T cell receptor repertoires (AIRR-seq)
Developing standard protocols for reporting and sharing AIRR-seq data to
optimize their use in biomedical research and patient care
AIRR Community formed
10. AIRR Community Data Elements
Each of the 6 high-level principles has been expanded into a set of data elements
11. “Accurate specification of the pathophysiological
condition is important for cross-comparison of
multiple studies”
● This set describes the experimental study design, including the title of the
study, laboratory contact information, etc.
● For individual subjects, the species, sex, age, and ancestry are included,
along with information about disease state(s), etc.
● This set describes the metadata about the diagnosis process
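The study, subject and diagnosis elements listed above can be pictured as a nested record. The following is a minimal sketch, assuming a simple Python representation; the class and field names are hypothetical illustrations based on the bullets above, not the official MiAIRR field names, and the values are placeholders.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Diagnosis:
    disease: str  # hypothetical field; ideally an ontology term


@dataclass
class Subject:
    subject_id: str
    species: str
    sex: str
    age: str
    ancestry: str
    diagnoses: List[Diagnosis] = field(default_factory=list)


@dataclass
class Study:
    title: str
    lab_contact: str
    subjects: List[Subject] = field(default_factory=list)


# Placeholder values for illustration only.
study = Study(title="Example AIRR-seq study", lab_contact="lab@example.org")
study.subjects.append(
    Subject("S1", "Homo sapiens", "female", "34 years", "not reported",
            [Diagnosis("example disease")])
)

# asdict() flattens the nested dataclasses into plain dicts and lists,
# which can then be serialized (for example, to JSON) for sharing.
record = asdict(study)
```

Keeping subject and diagnosis nested under the study makes it harder to report a sample without the context needed to compare it across studies.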
12. “Information about the origin and expected
composition of the biological sample(s) is central
for the interpretation of downstream sequencing
results.”
13. “Proper interpretation of experimental results for
future comparative analysis requires information” about:
● How the cells are prepared for processing
● How the sequencing is performed
● The quality of the data produced
“MiAIRR focuses on what information needs to be
shared rather than suggesting the analysis techniques
and tools”
14. Providing raw data enables the most
up-to-date data processing to be performed,
as the analysis tools for AIRR-seq data are
undergoing rapid evolution
● Providing the raw NGS data for each sequencing run (e.g., FASTQ files) permits the
reanalysis, secondary analysis and combination of multiple data sets from different
studies using meta-analysis techniques.
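To illustrate why raw FASTQ files support reanalysis, here is a minimal sketch of reading them in Python. It assumes plain four-line records (no wrapped sequence lines) and Phred+33 quality encoding; the function names and the example reads are hypothetical, and real pipelines would normally use an established parser.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (a blank line also ends parsing)
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average quality score of one read, assuming Phred+33 encoding."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two made-up reads for illustration.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
```

Because the per-base qualities are preserved, a later study can apply its own quality threshold instead of inheriting the one used in the original analysis.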
15. ● A variety of tools are in use for sequencing and processing. MiAIRR does not
provide tool-specific details.
● MiAIRR defines broad categories that cover the essential data processing
steps.
● These include the software tools with version numbers, quality
thresholds, primer match and length cutoffs, etc.
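The processing categories above could be captured as a small record that is checked for completeness before submission. This is a sketch under stated assumptions: the key names, the tool name and all values are hypothetical placeholders, not MiAIRR-defined fields.

```python
# Hypothetical categories a submitter is expected to report.
REQUIRED_FIELDS = (
    "software",
    "quality_threshold",
    "primer_match_cutoff",
    "min_read_length",
)


def missing_fields(record: dict) -> list:
    """Return the required keys that are absent or left empty."""
    return [name for name in REQUIRED_FIELDS
            if record.get(name) in (None, "", [])]


# Placeholder values for illustration only.
processing = {
    "software": [{"name": "example-tool", "version": "1.0.0"}],
    "quality_threshold": 20,
    "primer_match_cutoff": None,  # not yet reported
    "min_read_length": 250,
}

print(missing_fields(processing))
```

The check is deliberately tool-agnostic: it asks whether a category was reported, not which software produced it.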
16. ● This final MiAIRR set will thus comprise the list of processed
sequences, along with sequence-level annotations.
● This should include the V(D)J gene segment and constant region (isotype)
annotation if used in the associated publication, along with the CDR3
sequence.
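A processed sequence with its annotations can be pictured as one row of a table. The sketch below assumes a simple tab-separated layout; the column names and the example row are hypothetical placeholders chosen to mirror the bullets above, not the official AIRR schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AnnotatedSequence:
    sequence_id: str
    sequence: str
    v_gene: str
    d_gene: str
    j_gene: str
    constant_region: str  # isotype, if used in the associated publication
    cdr3: str


def to_tsv(rows) -> str:
    """Serialize annotated sequences to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(AnnotatedSequence)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder row; the sequence and gene labels are made up.
rows = [AnnotatedSequence("seq1", "ACGTACGTACGT", "V-example", "D-example",
                          "J-example", "IgG", "ACGTACG")]
tsv = to_tsv(rows)
```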
20. CAIRR: A pipeline to submit AIRR
data to the NCBI through the
CEDAR-workbench
21. NCBI is an important resource to archive biomedical data
● NCBI hosts a collection of biomedical databases:
○ BioProject, BioSample, SRA, GenBank, GEO etc.
● Provide infrastructure to submit experimental data and associated metadata
● Minimal use of standard terminologies to define the necessary metadata
○ Ontologies recommended for some data elements (Not implemented)
● NCBI metadata are often described using inconsistent terminologies
○ Limit our ability to access, find, interoperate and reuse the data sets
Goal: Leverage CEDAR to improve NCBI metadata submissions
The NCBI BioSample guideline suggests using Disease Ontology terms
22. What are the issues with the current NCBI
submission process?
● Rapid growth
● Lack of metadata standardization
● Error-prone data entry
● Lack of community-specific metadata
(e.g., AIRR)
● Laborious metadata entry
[Figures: NCBI growth; GenBank growth; metadata diversity in NCBI repositories]
23. How are metadata currently submitted to NCBI?
BioProject
BioSample
Sequence Read Archive
Combination of web-based forms
and Excel templates
● No mechanism to enforce standardized
vocabularies or ontology links
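To make the missing enforcement step concrete, here is a minimal sketch of checking a free-text entry against a controlled vocabulary and suggesting close matches. The term lists are small hypothetical examples rather than a real ontology, and the function name is illustrative; it is not how NCBI or CEDAR implement validation.

```python
import difflib

# Hypothetical controlled vocabularies, for illustration only.
CONTROLLED_TERMS = {
    "sex": ["female", "male", "not collected"],
    "tissue": ["blood", "bone marrow", "spleen"],
}


def check_value(field_name: str, value: str):
    """Return (is_valid, terms): the accepted term, or close suggestions."""
    allowed = CONTROLLED_TERMS[field_name]
    normalized = value.strip().lower()
    if normalized in allowed:
        return True, [normalized]
    # Offer up to three similar terms so the submitter can correct the entry.
    return False, difflib.get_close_matches(normalized, allowed, n=3, cutoff=0.6)


print(check_value("tissue", "Blood "))  # valid after normalization
print(check_value("tissue", "blod"))    # invalid, with a suggested correction
```

Web forms and spreadsheet templates accept both spellings unchanged, which is how inconsistent terms end up in the repositories.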
27. CAIRR Metadata Generation
Data Submitter
NCBI CAIRR
Controlled Vocabularies
Predictive Entry
Interactive Metadata Entry
Metadata Findability
Metadata Accessibility
Metadata Interoperability
Metadata Reusability
represents limited features availability
Metadata submissions to NCBI BioProject, BioSample
and SRA are ontologically controlled and relationally
linked, which enables concept-based federated queries
across repositories that would otherwise be silos.
Why CAIRR?
28. Resources
● Download AIRR NCBI templates:
https://github.com/airr-community/airr-standards
● Manual: how to submit AIRR data to NCBI
https://www.overleaf.com/read/tytddwptgkhb
29. Breden et al. “Reproducibility and Reuse of Adaptive
Immune Receptor Repertoire Data” (2017)
Rubelt, F., Busse, C., Bukhari, SAC et al. “Adaptive Immune
Receptor Repertoire (AIRR) Community Recommendations for
Sharing Immune Repertoire Sequencing Data” (2017)
30. Kei-Hoi Cheung, Yale University, Dept. of Medical Informatics
● AIRR Community
Kleinstein Lab, Yale University, Dept. of Pathology