2016 davis-biotech

  1. 1. Biology, Big Data, Precision Medicine, and Other Buzzwords. C. Titus Brown, School of Veterinary Medicine; Genome Center & Data Science Initiative. 1/15/16. #titusbuzz. Slides are on SlideShare.
  2. 2. N.B. This talk is for the students! (I heard they had to attend, and I couldn’t pass up a guaranteed audience!) Note: at end, I would like to take a question or two from grad students first!
  3. 3. My academic path • Undergrad: math major • Grad school: developmental biology/genomics • Postdoc: developmental biology/genomics • Asst Prof: genomics/bioinformatics • Now: bioinformatics/data-intensive biology
  4. 4. My non-academic path: • Open source programming. • Two startups, one real one & one half-academic thing. • Some consulting on software engineering and testing.
  5. 5. Outline 1. Research on how to deal with lots of data. 2. How biology, in particular, is unprepared. 3. My advice for the next generation of researchers.
  6. 6. 1. My research! Some background & then some information.
  7. 7. DNA sequencing rates continue to grow. Stephens et al., 2015 - 10.1371/journal.pbio.1002195
  8. 8. Oxford Nanopore sequencing. Slide via Torsten Seemann
  9. 9. Nanopore technology. Slide via Torsten Seemann
  10. 10. Scaling up --
  11. 11. Scaling up --
  12. 12. Slide via Torsten Seemann
  13. 13.
  14. 14. “Fighting Ebola With a Palm-Sized DNA Sequencer” See: sequencer-dna-minion/405466/
  15. 15. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab. Via Elizabeth Kujawinski. Lots of data other than just sequencing!
  16. 16. Data integration between different data types. Figure 2. Summary of challenges associated with the data integration in the proposed project. Figure via E. Kujawinski
  17. 17. => My research Planning for ~infinite amounts of data, and trying to do something effective with it.
  18. 18. Shotgun sequencing and coverage. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  19. 19. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
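The coverage arithmetic on the two slides above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the human genome size of ~3 Gbp is an assumption, chosen because it is consistent with the slide's "30-300 Gbp for human" at 10-100x.

```python
# Assumption: approximate haploid human genome size in base pairs (~3 Gbp).
HUMAN_GENOME_BP = 3_000_000_000


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base of the genome."""
    return num_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target mean coverage."""
    return target_coverage * genome_size


if __name__ == "__main__":
    # 1,000 reads of 100 bp over a 10 kbp genome gives ~10x coverage.
    print(mean_coverage(1_000, 100, 10_000))
    # Sequencing needed for 10x and 100x coverage of a ~3 Gbp genome, in Gbp.
    for cov in (10, 100):
        print(cov, bases_needed(cov, HUMAN_GENOME_BP) / 1e9)
```

Because shotgun reads are sampled at random, this is only the mean; individual positions will be covered more or less deeply, which is why deep sampling is needed.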
  20. 20. Digital normalization
  21. 21. Digital normalization
  22. 22. Digital normalization
  23. 23. Digital normalization
  24. 24. Digital normalization
  25. 25. Digital normalization
  26. 26. Computational problem now scales with information content rather than data set size. Most samples can be reconstructed via de novo assembly on commodity computers.
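The digital normalization slides above are figures, so the transcript does not spell out the procedure. As a rough sketch of the published idea: each read is kept only if the median abundance of its k-mers, among reads kept so far, is still below a cutoff. The version below is a simplification under stated assumptions: it uses an exact in-memory `Counter` rather than khmer's probabilistic counting structure, ignores reverse complements, and the function names and the tiny `k` and `cutoff` values are illustrative only.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below the cutoff.

    Coverage is estimated as the median count of the read's k-mers
    among the reads kept so far (exact counts; a simplification).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 10 + ["TTTTGGGG"]
    # Redundant copies of the first read are dropped; the novel read is kept.
    print(digital_normalize(reads))
```

The number of reads retained depends on how much distinct sequence is present rather than on how many reads were sequenced, which is the sense in which the problem scales with information content.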
  27. 27. Digital normalization & horse transcriptome. The computational demands for Cufflinks: - Read binning (processing time); - Construction of gene models (processing time & memory utilization), driven by no. of genes, no. of splicing junctions, no. of reads per locus, sequencing errors, and complexity of the locus such as gene overlap and multiple isoforms. Diginorm: - Significant reduction of binning time; - Relative increase of the resources required for gene model construction when merging more samples and tissues; - ? false recombinant isoforms. (Tamer Mansour)
  28. 28. Effect of digital normalization. ** Should be very valuable for detection of ncRNA. (Tamer Mansour)
  29. 29. The khmer software package • Demo implementation of research data structures & algorithms; • 10.5k lines of C++ code, 13.7k lines of Python code; • khmer v2.0 has 87% statement coverage under test; • ~3-4 developers, 50+ contributors, ~1000s of users (?) The khmer software package, Crusoe et al., 2015.
  30. 30. khmer is developed as a true open source package •; • BSD license; • Code review, two-person sign off on changes; • Continuous integration (tests are run on each change request); Crusoe et al., 2015; doi: 10.12688/f1000research.6924.1
  31. 31. Literate graphing & interactive exploration Camille Scott
  32. 32. Research process • Generate new results; encode in Makefile • Summarize in IPython Notebook • Push to github • Discuss, explore
  33. 33. This is the standard process in the lab -- Our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’; • Graphing and data digestion => IPython Notebook (also in github). Zhang et al. doi: 10.1371/journal.pone.0101271
  34. 34. The buoy project - decentralized infrastructure for bioinformatics. Compute server (Galaxy? Arvados?) Web interface + API Data/ Info Raw data sets Public servers "Walled garden" server Private server Graph query layer Upload/submit (NCBI, KBase) Import (MG-RAST, SRA, EBI)
  35. 35. The next questions -- (a) If you had all the data from all the things, what could you do with it? (b) If you could edit any genome you wanted, in any way you wanted, what would you edit?
  36. 36. 2. Big Data, Biology, and how we’re underprepared. (Answers to previous qs: we are not that good at using data to inform our models or our experimental plans...)
  37. 37. My first 7 reasons -- 1. Biology is very complicated. 2. We know very little about function in biology. 3. Very few people are trained in both data analysis and biology. 4. Our publishing system is holding back the sharing of knowledge. 5. We don’t share data. 6. We are too focused on hypothesis-driven research. 7. Most computational research is not reproducible.
  38. 38. Biology is complicated. Sea urchin gene network for early development;
  39. 39. We know very little, and a lot of what we “know” is wrong. One recent story that caught my eye – problems with genetic testing & databases. (See URL below for full story.) • “1/4 of mutations linked to childhood diseases are debatable.” • In a study of 60,000 people, on average each had 53 “pathogenic” variants… is-full-of-costly-mistakes/420693/
  40. 40. Very few people are trained in both data analysis and biology. (More on this later)
  41. 41. Our publishing system has become a real problem. • The journal system costs more than $10bn/yr, with profit margins estimated at 20-30% (see citation, below). • Articles in high impact factor journals have lower statistical power. • High-IF journals have higher rates of retractions (which cannot solely be attributed to “attention paid”) • We publish in PDF form, which is computationally opaque. • Publishing is slow! $10bn/year:
  42. 42. High-impact-factor articles have poor statistical power. Our current system rewards A but not B. Brembs et al., 2013 -
  43. 43. High impact factor => high retraction index. Brembs et al., 2013 -
  44. 44. We just don’t share our data. • Researchers have virtually no short-term incentives to share data in useful ways. • “46% of respondents reported they do not make their data available to others” – study in ecology (Tenopir et al., 2011) • Some “great” stories from the rare disease community – see New Yorker link, below.
  45. 45. We are focused on hypothesis-driven research. • Granting agencies require specific hypotheses, even when little is known. • This focuses research on “known unknowns”, and leaves “unknown unknowns” out in the cold.
  46. 46. The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome". "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS ONE 9, e88889. Via Erich Schwarz
  47. 47. Most computational research is not reproducible. I don’t know of a systematic study, but of papers that I read, approximately 95% fail to include details necessary for replication. It’s very hard to build off of research like this. (There’s a lot more to say about reproducibility and replicability than I can fit in here…)
  48. 48. What am I doing about it? 1. Open science 2. “Culture hacking” to drive open data. 3. Training! (I don’t have any guaranteed solutions. All I can do is think & work.)
  49. 49. Perspectives on training • Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report) • Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing. • Training is systematically undervalued in academia (!?)
  50. 50. UC Davis and training My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.
  51. 51. Summer NGS workshop (2010-2017)
  52. 52. General parameters: • Regular intensive workshops, half-day or longer. • Aimed at research practitioners (grad students & more senior); open to all (including outside community). • Novice (“zero entry”) on up. • Low cost for students. • Leverage global training initiatives.
  53. 53. Thus far & near future ~12 workshops on bioinformatics in 2015. Trying out Q1 & Q2 2016: • Half-day intro workshops (27 planned); • Week-long advanced workshops; • Co-working hours (“data therapy”).
  54. 54. 3. Advice to the next generation (or two generations, if you want me to feel really old.) a. Get involved with a broad group of people and ideas (social media FTW!) b. Learn something about both computing and biology. c. Realize that you have nothing but opportunity, and that there has never been a better time to be in bio research!
  55. 55. Precision Medicine?
  56. 56. Thanks for listening! Please contact me at! Note: I work here! (I’d like to start with a grad student question?)