Genome-scale Big Data Pipelines

Slides from talks at YOW 2017 on work with CSIRO Bioinformatics, covering VariantSpark and GT-Scan2

Published in: Science

Genome-scale Big Data Pipelines

  1. Dr. Denis Bauer & Lynn Langit: Genomic-scale Data Pipelines
  2. Transformational Bioinformatics Team (slide groups: Team, Collaborators, News, Software): Denis Bauer, PhD; Oscar Luo, PhD; Rob Dunne, PhD; Piotr Szul; Aidan O’Brien; Laurence Wilson, PhD; Adrian White; Andy Hindmarch; David Levy; Dan Andrews; Kaitao Lai, PhD; Arash Bayat; John Hildebrandt; Mia Chapman; Ian Blair; Kelly Williams; Jules Damji; Gaetan Burgio; Lynn Langit; Natalie Twine, PhD; Prabha Pillay. Transformational Bioinformatics | Denis C. Bauer | @allPowerde
  3. Big Data in 2025… Petabytes? [Bar chart: Astronomy 1000, Twitter 17, YouTube 2000]
  4. GENOMIC Big Data in 2025 - Exabytes [Bar chart: Astronomy 1, Twitter 0.17, YouTube 2, Genomic 20]
  5. Genome holds Blueprint for Every Cell
  6. Affects Looks, Disease Risk, and Behavior
  7. VCF Data
  8. Genomic Research Workflow (Big Data focus) https://www.projectmine.com/about/
  9. Finding the Disease Gene(s): Spot the letter that is… • common amongst all affected (cases) • absent in all unaffected (controls)* [Figure: cases vs. controls, Gene1 and Gene2] * oversimplified
  10. BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4) Cited 4
  11. Why Apache Spark?
  12. Performance – Faster and More Accurate: VariantSpark is the only method to scale to 100% of the genome [Plot axes: Accuracy (low to high), Speed (low to high)]
  13. Cloud Data Pipeline Pattern • Business Problem • Data Quality • Candidate Technologies • Build/Test MVPs • Assemble Pipeline
  14. Building a Cloud Data Pipeline. Candidate Technologies • Ingest/Clean • Analyze/Predict • Visualize. Build MVPs • Test • Iterate • Learn. Assemble Pipeline • Combine pieces • Validate sections • Test at scale
  15. Building a Cloud Data Pipeline: Spark • IaaS, PaaS, SaaS; Vendors • AWS, Azure, GCP…
  16. Visualizing Machine Learning Results
  17. Solving Important Questions… Cancer genomics?
  18. DEMO: Who is a Bondi Hipster?
  19. Supervised ML: Wide Random Forests
  20. Scaling to 50 M variables and 10 K samples. 100K trees: 5–50 h, AWS: ~$215.50. 100K trees: 200–2000 h, AWS: ~$8620.00. • YARN cluster • 12 workers • 16 x Intel CPUs • Xeon E5-2660 @ 2.20 GHz • 128 GB RAM • Spark 1.6.1 • 128 executors • 6 GB / executor (0.75 TB) • Synthetic dataset [Plot regions: Whole Genome Range, GWAS Range]
  21. Future Directions for VariantSpark RF: Mixed feature types Unordered Categorical Continuous Build Community Python API Non-Genomic Demos Implementation by
  22. Try it out: VariantSpark Notebook https://docs.databricks.com/spark/latest/training/variant-spark.html
  23. Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy. “Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.” (Ma et al. Nature 2017*) Aim: Develop a computational guidance framework to enable edits the first time, every time. * Controversy around the paper – stay tuned
  24. Make Process Parallel and Scalable. SPEED • Each search can be broken down into parallel tasks, each of which takes seconds. SCALE • Researchers might want to search the target for one gene or 100,000. Scalability + Agility =
  25. One of the first Serverless Applications in Research. Featured in
  26. X-Ray Tracing Demo of GT-Scan2 • Find performance bottlenecks • Fix and test [Diagram: Webapp, Resources (S3, DynamoDB), Lambda]
  27. GTScan2 X-Ray Analysis [Chart: runtime (s) per function, by type (base, old); functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, OnTargetScorer, genomeCRISPR]
  28. Results – 4x Faster (80% improvement) 2 min 30 sec
  29. Considering Services for GT-Scan2 • Use AWS Step Functions • Simplify workflow • Simplify task timeouts • Simplify task failures • Must evaluate costs • SNS vs. Step Functions
  30. Cloud Data Pipeline Pattern • Problem: Search (GTScan2) • Data: fastq, bed -> S3, NoSQL • Technologies: Ingest; ETL, Analyze; Viz • MVPs: S3; Lambda; Lambda/API Gateway • Pipeline: Serverless
  31. Serverless Pipeline Pattern [Diagram: Lambda function 1, Lambda function 2, Lambda function 3, buckets with objects, DynamoDB, API Gateway, Users, Step Functions]
  32. Cloud Data Pipeline Pattern • Problem: Analyze (GWAS) • Data: vcf -> S3/Spark • Technologies: Ingest; ETL; Analyze; Viz • MVPs: S3 -> Databricks DBFS; Apache Spark; Variant-Spark ML; Notebook, SQL, R, Python • Pipeline: Spark Server Cluster
  33. Spark Server Cluster Pipeline Pattern [Diagram: Jupyter Notebook]
  34. Cloud Genomic-Scale Data Pipelines • Problem #1 – ML on Large Data • Solution: Spark-server cluster + custom machine learning • Problem #2 – Burstable Search • Solution: Serverless pipeline
  35. Genomic-scale Data Pipelines, Dr. Denis Bauer & Lynn Langit
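
The "spot the letter" rule on the Finding the Disease Gene(s) slide (common amongst all affected, absent in all unaffected) can be illustrated with a few lines of Python. This is only a minimal sketch of that deliberately oversimplified rule, not how VariantSpark works (the slides describe VariantSpark as a random-forest method on Spark). The function name `candidate_variants`, the sample names, and the toy alleles are all hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Toy genotype table: sample name -> {variant or gene label -> observed letter}.
# All names and letters here are made up for illustration.
Genotypes = Dict[str, Dict[str, str]]


def candidate_variants(genotypes: Genotypes,
                       cases: List[str],
                       controls: List[str]) -> List[str]:
    """Return positions where every case shares one letter that no control has."""
    # Collect every position seen in any sample, in a stable order.
    positions = sorted(set().union(*(g.keys() for g in genotypes.values())))
    hits = []
    for pos in positions:
        # Letters observed among the affected samples at this position.
        case_letters = {genotypes[s].get(pos) for s in cases}
        # Require exactly one letter, present in every case.
        if len(case_letters) != 1 or None in case_letters:
            continue
        letter = next(iter(case_letters))
        # Require that no unaffected sample carries the same letter.
        if all(genotypes[s].get(pos) != letter for s in controls):
            hits.append(pos)
    return hits


samples: Genotypes = {
    "case1": {"Gene1": "A", "Gene2": "T"},
    "case2": {"Gene1": "A", "Gene2": "C"},
    "ctrl1": {"Gene1": "G", "Gene2": "T"},
    "ctrl2": {"Gene1": "G", "Gene2": "C"},
}
cases = ["case1", "case2"]
controls = ["ctrl1", "ctrl2"]

result = candidate_variants(samples, cases, controls)
print(result)
```

In this toy table only `Gene1` satisfies the rule: both cases carry `A` and neither control does, while `Gene2` differs between the two cases. Real cohorts rarely separate this cleanly, which is why the slide flags the rule as oversimplified.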
