The document discusses UC Berkeley's efforts in analyzing and storing large genomic datasets. It describes (1) UC Berkeley's work on building fast and accurate genetic analysis pipelines and the Berkeley Data Analysis Stack; (2) their collaboration with other institutions as part of The Cancer Genome Atlas (TCGA) project to sequence and analyze thousands of cancer genomes; and (3) the challenges of storing massive genomic datasets and their use of the Cloud Computing Data Center to store over 2 petabytes of genomic data from TCGA at a relatively low cost.
12. UC Students/Post-Docs
– Ma’ayan Bresler
– Kristal Curtis
– Jesse Liptrap
– Sara Sheehan
– Ameet Talwalkar
– Jonathan Terhorst
– Richard Xia
– Matei Zaharia
– Yuchen Zhang
External
– Bill Bolosky (MS/MSR)
– Mishali Naik (Intel)
– Paolo Narvaez (Intel)
– Ravi Pandya (MS)
– Abirami Prabhakaran (Intel)
– Taylor Sittler (UCSF)
– Gans Srinivasa (Intel)
– Arun Wiita (UCSF)
Faculty
– Armando Fox
– Michael Jordan
– Anthony Joseph
– David Patterson
– Satish Rao
– Scott Shenker
– Yun Song
– Ion Stoica
Expertise
– Computational Biology/Medicine
– Machine Learning
– Systems
13. • 2011-2016
• Berkeley Data Analysis Stack release as Open Source
• Adaptive/Active Machine Learning and Analytics
• Massive and Diverse Data
• CrowdSourcing/Human Computation
• Cloud Computing
19. TCGA Centers (figure: participating institutions, labeled by center type)
Institutions:
– Baylor College of Medicine
– BC Cancer Research Center
– Boise State University
– Brigham & Women’s Hospital and Harvard Medical School
– Broad Institute
– Complete Genomics Inc.
– Fred Hutchinson Cancer Research Center
– Institute for Systems Biology
– International Genomics Consortium
– Johns Hopkins University
– Memorial Sloan-Kettering Cancer Center
– Nationwide Children’s Hospital
– Oregon Health & Science University
– Pacific NW National Laboratory
– University of California, Santa Cruz
– University of North Carolina
– University of Southern California
– University of Texas, M.D. Anderson Cancer Ctr
– Vanderbilt University
– Washington University Genome Institute
Center types:
– Biospecimen Core Resource
– Genome Characterization Centers (GCCs)
– Genome Sequencing Centers (GSCs)
– Proteome Characterization Centers (PCCs)
– Data Coordination Center (DCC)
– Genome Data Analysis Centers (GDACs)
20. Built at SDSC to store DNA information for The Cancer Genome Atlas
Designed for 50,000 genomes with an average of 100 gigabytes per genome: 5 petabytes
Currently 24,000 files from ~5,500 cases, ~60 gigabytes/case, in total 2 PB of downloads
Total cost ~$100/year/genome at 50K genomes, i.e. $5M/year. The technology cost is about ½ the total
Co-location opportunities in the same data center for groups who want to compute on the data
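
The sizing and cost figures on this slide follow from simple multiplication, which can be checked with a minimal Python sketch. The function and constant names are hypothetical, and decimal units (1 PB = 1,000,000 GB) are an assumption, since the slide does not say which convention it uses. The last figure treats cases times gigabytes per case as an estimate of current holdings, which is also an assumption about how to read the slide.

```python
# Back-of-the-envelope check of the slide's storage and cost figures.
# Assumption: decimal units, 1 PB = 1,000,000 GB (not stated on the slide).
GB_PER_PB = 1_000_000


def capacity_pb(count: int, gb_each: float) -> float:
    """Total storage in petabytes for `count` items of `gb_each` gigabytes."""
    return count * gb_each / GB_PER_PB


def annual_cost_usd(genomes: int, usd_per_genome_year: float) -> float:
    """Total yearly cost for storing `genomes` genomes."""
    return genomes * usd_per_genome_year


# Design target: 50,000 genomes at an average of 100 GB each.
design_pb = capacity_pb(50_000, 100)

# Cost: ~$100 per genome per year at 50K genomes.
total_cost = annual_cost_usd(50_000, 100)

# The slide says technology is about half of the total cost.
technology_cost = total_cost * 0.5

# Assumption: ~5,500 cases at ~60 GB/case approximates current holdings.
current_pb = capacity_pb(5_500, 60)

print(f"Design capacity: {design_pb:.1f} PB")
print(f"Total cost: ${total_cost:,.0f}/year")
print(f"Technology share: ${technology_cost:,.0f}/year")
print(f"Estimated current holdings: {current_pb:.2f} PB")
```

Under these assumptions the design capacity and total cost reproduce the slide's 5 petabytes and $5M/year. The estimated holdings come out well below the 2 PB quoted on the slide, which is consistent with that figure describing cumulative downloads.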