PDB
RCSB
Compressive Structural Bioinformatics:
Large-scale analysis and visualization of the
Protein Data Bank archive
Peter W. Rose, Anthony R. Bradley,
Alexander S. Rose, Yana Valasatava,
Jose M. Duarte, Andreas Prlić
Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
120,000 structures in June 2016
Growing Structure Size and Complexity
Largest asymmetric structure in PDB: HIV-1 capsid, PDB ID 3J3Q, ~2.4M unique atoms
Largest symmetric structure in PDB: Faustovirus major capsid, PDB ID 5J7V, ~40M overall atoms
Growing User Base
Scalability Issues
• Interactive visualization
  • slow network transfer
  • slow parsing
  • slow rendering
• Mobile visualization
  • limited bandwidth
  • limited memory
• Large-scale structural analysis
  • slow repeated I/O
  • slow repeated parsing
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
Macromolecular 3D Structure
Biological macromolecules are polymers constructed by linking monomers through covalent bonds
Biological macromolecules: proteins, nucleic acids
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org)
• repetitive information
• redundant annotations
• inefficient representation
MMTF
• MacroMolecular Transmission Format (mmtf.rcsb.org)
• Compact
  • fast network transfer, less I/O
• Fast to parse
  • binary, no string parsing
• Contains information for structural analysis and visualization
  • covalent bonds and bond orders
  • consistently calculated secondary structure
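The "binary, no string parsing" point can be illustrated with a minimal sketch. It assumes coordinates are stored as fixed-point, big-endian 32-bit integers; the sample values, the precision factor, and all variable names are hypothetical and not taken from the slides or the MMTF libraries.

```python
import struct

# Hypothetical sample: three x-coordinates in angstroms.
coords = [12.345, -7.001, 103.5]

# Text route: every value is written as characters and parsed back token by token.
text_record = " ".join(f"{c:.3f}" for c in coords)
parsed_from_text = [float(token) for token in text_record.split()]

# Binary route (assumed layout): fixed-point integers, value * 1000,
# packed as big-endian signed 32-bit integers. Decoding is one unpack call.
PRECISION = 1000
as_ints = [round(c * PRECISION) for c in coords]
binary_record = struct.pack(f">{len(as_ints)}i", *as_ints)
unpacked = struct.unpack(f">{len(binary_record) // 4}i", binary_record)
parsed_from_binary = [value / PRECISION for value in unpacked]

print(len(text_record), "bytes as text,", len(binary_record), "bytes as binary")
print(parsed_from_binary)
```

The binary record has a fixed width per value, so a decoder can read whole arrays at once instead of scanning for delimiters.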
MMTF Compression Pipeline
• extract structural data
• calculate bonds, SSE
• integer encoding, dictionary encoding, run-length encoding, delta encoding, recursive indexing
• binary, extensible container format of MMTF (MessagePack: "It's like JSON, but fast and small.")
• GZIP
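Four of the encoding steps named in the pipeline can be sketched in a few lines of Python. This is an illustrative sketch only, not the reference MMTF implementation: the function names, the factor of 1000, and the 16-bit limit are assumptions chosen for the example, and dictionary encoding and GZIP are left out.

```python
from itertools import groupby


def integer_encode(values, factor=1000):
    """Turn floats into integers by multiplying with a fixed factor (assumed 1000)."""
    return [round(v * factor) for v in values]


def delta_encode(ints):
    """Keep the first value, then store differences between neighbours."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(ints):
    """Store each run of equal values as a (value, count) pair, flattened."""
    encoded = []
    for value, group in groupby(ints):
        encoded.extend([value, len(list(group))])
    return encoded


def run_length_decode(pairs):
    out = []
    for value, count in zip(pairs[::2], pairs[1::2]):
        out.extend([value] * count)
    return out


def recursive_index_encode(ints, bits=16):
    """Split values that do not fit a signed `bits`-wide integer into several.

    The maximum and minimum representable values act as continuation markers.
    """
    hi, lo = 2 ** (bits - 1) - 1, -(2 ** (bits - 1))
    out = []
    for v in ints:
        while v >= hi:
            out.append(hi)
            v -= hi
        while v <= lo:
            out.append(lo)
            v -= lo
        out.append(v)
    return out


def recursive_index_decode(ints, bits=16):
    hi, lo = 2 ** (bits - 1) - 1, -(2 ** (bits - 1))
    out, acc = [], 0
    for v in ints:
        acc += v
        if v != hi and v != lo:  # a non-marker value closes the current number
            out.append(acc)
            acc = 0
    return out


# Consecutive serial numbers collapse to a single (value, count) pair.
serials = [1, 2, 3, 4, 5, 6]
print(run_length_encode(delta_encode(serials)))

# Neighbouring coordinates become small deltas after integer encoding.
print(delta_encode(integer_encode([10.001, 10.101, 10.251])))

# Large values are split so that every element fits into 16 bits.
print(recursive_index_encode([40000, -5, 100]))
```

Delta and run-length encoding pay off because neighbouring values in a structure (serial numbers, residue numbers, coordinates of bonded atoms) tend to be close to each other or identical.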
Size and Parsing Speed
mmCIF vs. MMTF for 120,000 Structures
• Small: whole PDB archive, GZIP compressed: mmCIF 30 GB, MMTF 7 GB (MMTF reduced/lossy: ~800 MB)
• Fast: parsing time: mmCIF 400 min, MMTF < 2 min (Mac mini with 2.6 GHz Intel Core i5 (4 cores) and 16 GB RAM)
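As a quick check, the archive sizes and parsing times quoted on this slide can be turned into ratios. The sketch below uses only the slide's numbers; the variable names are illustrative, and the parsing figure is a lower bound because the slide gives the MMTF time only as "< 2 min".

```python
# Archive sizes from the slide, in GB (GZIP compressed).
mmcif_gb = 30.0
mmtf_gb = 7.0
mmtf_reduced_gb = 0.8  # ~800 MB reduced/lossy representation

lossless_ratio = mmcif_gb / mmtf_gb
reduced_ratio = mmcif_gb / mmtf_reduced_gb

# Parsing times from the slide, in minutes; MMTF is stated as "< 2 min",
# so this ratio is a lower bound on the speedup.
parse_speedup_lower_bound = 400 / 2

print(f"lossless: ~{lossless_ratio:.1f}x smaller")
print(f"reduced:  ~{reduced_ratio:.1f}x smaller")
print(f"parsing:  more than {parse_speedup_lower_bound:.0f}x faster")
```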
Data Mining using Apache Spark
mmCIF vs. MMTF
Find all C-alpha-C-alpha contacts
• Efficient hashing algorithm: mmCIF 50, MMTF 6
• Inefficient looping algorithm: mmCIF 448, MMTF 404
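The difference between a looping and a hashing contact search can be sketched as follows. This is a generic grid-hashing example in plain Python, not the Apache Spark code behind the benchmark; the function names, the sample coordinates, and the 8.0 cutoff are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def contacts_by_looping(points, cutoff):
    """Compare every pair of points: quadratic in the number of atoms."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                pairs.add((i, j))
    return pairs


def contacts_by_hashing(points, cutoff):
    """Hash points into cubic cells of edge `cutoff`, then compare only
    points in the same or adjacent cells."""
    grid = defaultdict(list)
    for index, point in enumerate(points):
        cell = tuple(math.floor(c / cutoff) for c in point)
        grid[cell].append(index)

    offsets = list(product((-1, 0, 1), repeat=3))
    pairs = set()
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in offsets:
            neighbours = grid.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbours:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


# Hypothetical C-alpha positions in angstroms: three close together, one far away.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 30.0, 30.0)]
print(sorted(contacts_by_hashing(ca_atoms, cutoff=8.0)))
```

Because the cell edge equals the cutoff, any two points within the cutoff must lie in the same or neighbouring cells, so both functions return the same pairs while the hashed version skips most distance calculations on large structures.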
Download + Parsing time
MMTF vs. mmCIF
Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser

Location         MMTF    mmCIF
Bethesda, MD       85     2418
Switzerland      1589     4431
Russia            557   failed
Japan              79     2838
San Diego, CA      36      840

*Note: download times are highly variable and not representative
Community Engagement
• Open source specification
• Open source decoding libraries
  • Java
  • JavaScript
  • Python
  • C/C++ (developed by community members)
• Applications using MMTF
  • 3Dmol.js, JSmol, iCn3D (NCBI), ICM Viewer, PyMOL
  • BioJava, Biopython, MDAnalysis
  • RCSB PDB website
Summary
• MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
  • Compressed, binary, efficient representation of 3D structures
  • Lossless representation (~4x compression)
  • Lossy, reduced representation (~37x compression)
• Compressive Structural Bioinformatics
  • Algorithms, applications, and workflows using MMTF
  • 10 to 100+ fold speedup
Structure Visualization
Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016), Web3D '16, 185-186, DOI: 10.1145/2945292.2945324
Acknowledgements
Funding: NCI/NIH (U01 CA198942)
MMTF Early Adopters
