1. Big Data in Genomics and Personalized Medicine – Challenges and Solutions
Gaurav Kaul
Software Architect, Intel
JAX London 2013
2. Agenda
Global Healthcare Trends
The Rise of Personalized Medicine
Big Data Scenarios in Healthcare
Methods to Manage Big Data
Use Cases
Summary and Next Steps
*Other names and brands may be claimed as the property of others
3. We are at an Inflection Point in Healthcare - TRENDS
[Map: % of population over age 60 in 2050, binned 0-9%, 10-19%, 20-24%, 25-29%, 30+%; WW average age 60+: 21%]
Source: United Nations “Population Aging 2002”
Healthcare costs are RISING - a significant % of GDP
Global AGING - average share of population aged 60+ growing from 10% to 21% by 2050
US Healthcare BIG DATA value: $300 billion in value/year; ~0.7% annual productivity growth
Sources: McKinsey Global Institute Analysis; ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
4. We are at an Inflection Point in Healthcare - TRENDS
Storage Growth
[Chart: total data for healthcare providers (PB), 0-15,000 PB over 2010-2015, by category: Admin, Imaging, EMR, Email, File, Non-Clinical Imaging, Research; medical imaging archive projection is a case from just one healthcare system]
Data explosion projected to reach 35 zettabytes by 2020, a 44-fold increase from 2009
Sources: McKinsey Global Institute Analysis; ESG Research Report 2011 – North American Health Care Provider Market Size and Forecast
7. Vision for Personalized Medicine
8. How can we take Personalized Medicine mainstream by 2020?
9. A “bioinformatics computing system” includes technologies from this entire “stack”
Software Frameworks
Applications
Programming Model (abstraction)
Virtualization
System Software and Resource Management
Computer Hardware, Storage and Networks
10. A “bioinformatics computing system” includes technologies from this entire “stack”
Software Frameworks
Applications
Programming Model (abstraction)
Virtualization
System Software and Resource Management
Computer Hardware, Storage and Networks
Multiple cores - shared memory, multiple threads, OpenMP
Multiple nodes - MPI; GAS, PGAS; Hadoop
Examples: galaxy.psu.edu; “Searching for SNPs with cloud computing,” Langmead, Schatz et al.
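The two parallelism models above differ mainly in where data lives; the multi-node model that Hadoop implements is a map/shuffle/reduce pipeline. The following is a minimal single-machine simulation of that model, counting variant records per chromosome; the tab-separated record format and all names are illustrative, not taken from the cited SNP-search work.

```python
# Minimal sketch of the MapReduce programming model (Hadoop-style),
# simulated on one machine: map -> shuffle (sort by key) -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (chromosome, 1) for each hypothetical variant record."""
    chrom, pos, allele = line.rstrip("\n").split("\t")
    yield chrom, 1

def reducer(key, values):
    """Sum the counts collected for one chromosome."""
    return key, sum(values)

def run_job(lines):
    """Run the whole pipeline locally: map, sort by key, reduce per group."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle phase
    return {k: reducer(k, (v for _, v in grp))[1]
            for k, grp in groupby(mapped, key=itemgetter(0))}

records = ["chr1\t12345\tA", "chr1\t67890\tT", "chr2\t11111\tG"]
print(run_job(records))  # {'chr1': 2, 'chr2': 1}
```

On a real cluster the mapper and reducer run on many nodes and the framework performs the shuffle, but the per-record logic is exactly this shape.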
12. Big Data – A Foundation For Delivering Big Value
Big Data Building Blocks: Compute, Network, Storage, Software & Technologies
Compute: Intel® Xeon® product family E3/E5/E7; Intel® Atom™; Xeon Phi™ - energy efficient, responsive
Network: Intel® Ethernet controllers; Ethernet adapters; Intel® Ethernet switch silicon; Intel® True Scale Fabric
Storage: intelligent storage1 - scale-out storage1 and scale-up storage1; Intel® SSD 710 series, DC S3700 (SATA); Intel® SSD 910 series (PCIe) - choice, high availability, secure
Software & technologies: Intel® Distribution for Apache Hadoop; Intel® Node Manager; Intel® Data Center Manager; Intel® Expressway Service Gateway; Intel® Cache Acceleration Software; Intel’s Lustre; Intel® VT and Intel® TXT; Intel® AES-NI
Intel’s foundational technologies offer advanced solutions for Big Data analytics
1 Xeon-based storage systems are available in a wide range of configuration options from the industry’s leading storage vendors
13. Big Data Compute Platform Optimizations
Intel® Xeon® E5 family:
Up to 8 cores; up to 20 MB cache
Up to 4 channels DDR3 1600 MHz memory
Integrated PCI Express* 3.0, up to 40 lanes per socket
SCALE-OUT with Hadoop and analytic/DW engines
Proof point: E5 analytics 25x improvement; Hadoop on E5
Intel® Xeon® E7 family (e.g. Xeon E7-4800):
Up to 10 cores; up to 30 MB cache
Up to 8 channels DDR3 1066 MHz memory
4 QPI 1.0 links for robust scalability
SCALE-UP in-memory analytic engines and databases: Oracle*, SAS*, SAP HANA*
Proof point: SAP HANA
[Diagram: per-socket block diagrams showing cores 1-10, QPI links 1-4, RAM and cache]
*Other names and brands may be claimed as the property of others
14. Big Data – A Foundation For Delivering Big Value
Intel® Ethernet Reduces Time to Process Large Data Sets
Trends and challenges: big data is hitting the enterprise with unprecedented volume, velocity, variety, complexity - and OPPORTUNITY
Intel® Ethernet solution: up to 20x performance boost over legacy infrastructure with optimizations on Intel® Xeon® processors, Intel® SSD storage, and 10Gb Intel® Ethernet networking1; 10 Gigabit Ethernet allows quicker import and export of large data sets for processing
Moving the data with 10GbE (2 ports 10GbE vs. 10 ports 1GbE, virtualized hosts behind a hypervisor):
Up to 80% reduction in cables & switch ports
Up to 15% reduction in infrastructure costs
Up to 2x improved bandwidth per server
1 http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/10gbe-10gbase-t-hadoop-clusters-paper.pdf
15. Big Data – A Foundation For Delivering Big Value
Intel® CAS with Intel® SSD Solution
Added as a cache layer, it accelerates Big Data workloads: 50x IOPS; 3x TPC-C; 20x TPC-H throughput performance
Performance near equal to replacing all hard drives with SSDs, at significantly lower cost
http://www.intel.com/content/www/us/en/mission-critical/mission-critical-scalability-oracle-intel-brief.html
16. Big Data – A Foundation For Delivering Big Value
Data Methods for the Right Data Structure
Unstructured data -> emerging technologies: MapReduce/Hive
Structured data -> relational database
Analytical paradigms -> EXALYTICS
17. Big Data – A Foundation For Delivering Big Value
Intel® Distribution for Apache Hadoop* & Tools
File-based encryption in HDFS; up to 20x faster decryption with AES-NI*
Role-based access control for Hadoop services
Up to 8.5x faster Hive queries using the HBase co-processor
Optimized for SSD with Cache Acceleration Software
Adaptive replication in HDFS and HBase
Integrated text search with Lucene
Simplified deployment & comprehensive monitoring
Deployment of HBase across multiple datacenters
Automated configuration with Intel® Active Tuner
Detailed profiling of Hadoop jobs
Simplified design of HBase schemas (+ in 2.4)
REST APIs for deployment and management (+ in 2.4)
HiTune (URL): instrumentation, aggregation engine, report engine, HiTune controller
HiBench (URL) workloads:
1 Micro benchmarks: Sort, WordCount, TeraSort
2 Web search: Nutch Indexing, PageRank
3 Machine learning: Bayesian Classification, K-Means Clustering
4 HDFS: Enhanced DFSIO
Result = many Hadoop optimization tips (IDF2012 presentation “Big Data Analytics on a Performance-optimized Hadoop Infrastructure”)
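The WordCount micro benchmark listed for HiBench is the canonical Hadoop job. As a hedged sketch of what its two phases compute, the following uses plain Python generators standing in for Hadoop's streaming interface (Hadoop delivers records to the reducer sorted by key, which `sorted()` emulates here):

```python
# WordCount in Hadoop-streaming style: the mapper emits "word<TAB>1"
# records, Hadoop sorts them by key, and the reducer sums consecutive
# records with the same key in a single pass.
def map_phase(lines):
    """Mapper: one 'word<TAB>1' record per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_phase(sorted_records):
    """Reducer: sum counts for runs of identical keys (input is key-sorted)."""
    current, total = None, 0
    for rec in sorted_records:
        word, n = rec.rsplit("\t", 1)
        if word != current and current is not None:
            yield f"{current}\t{total}"
            total = 0
        current = word
        total += int(n)
    if current is not None:
        yield f"{current}\t{total}"

text = ["big data in genomics", "big data"]
print(list(reduce_phase(sorted(map_phase(text)))))
# ['big\t2', 'data\t2', 'genomics\t1', 'in\t1']
```

Benchmarks like TeraSort stress the same shuffle/sort machinery with much larger keys and values, which is why they expose network and disk bottlenecks.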
18. Life Sciences 2013: Key Industry Challenges and Solutions
Many (most) applications are single-threaded, single address space -> Intel is delivering optimizations, working with the open source community, and developing NGS+HPC curriculum
Some algorithms scale quadratically with the size of the problem; large data sets exceed available memory and storage -> innovations in acceleration, compute, storage, networking, security, and *-as-a-service
International collaboration is an imperative; bioinformatics expertise is scarce -> Intel is working closely with the ecosystem to address enterprise-to-cloud transmission of terabyte payloads
Databases are distributed, data is siloed and will likely stay that way -> need for balanced compute infrastructure
19. Examples of Intel®-powered Servers in Big Data and Analytics
Cisco* UCS Server1 - Intel® Xeon® 5600: Cisco UCS server with EMC Greenplum MR software, an “enterprise-class” Hadoop* distribution that features technology from MapR
Dell* PowerEdge* C Series2 - Intel Xeon 5500/5600: the Dell | Cloudera* solution for Apache* Hadoop, sold pre-configured
Oracle* Sun Fire* server3 - Intel Xeon E7-4800: Oracle Exalytics* In-Memory Machine, featuring the Oracle BI Foundation Suite and Oracle TimesTen In-Memory Database for Exalytics
1 http://gigaom.com/cloud/ciscos-servers-now-tuned-for-hadoop/
2 http://www.businesswire.com/news/home/20110804005376/en/Dell-Cloudera-Collaborate-Enable-Large-Scale-Data
3 http://www.itp.net/mobile/588145-oracle-unveils-exalytics-in-memory-machine
INTEL CONFIDENTIAL
20. Solution 4.0 – NGS Appliances
Configuration 1: 16 cores, 96 GB RAM, 18 TB redundant storage, SSD for OS
Configuration 2: 32 cores, 1.2 TFlops, 18-56 TB RAID
NSS-HA pair; NSS user data; HSS metadata pair; HSS OSS pair; HSS user data; 2U plenum
Actual placement in racks may vary.
Scale through independent solutions, each targeting a different segment & usage model
21. NGS Appliance
Dell Scalable Unit “SANGER”
Infrastructure: Dell PE, PC & F10
Dell NSS (NFS), up to 180 TB: NSS-HA pair, NSS user data
Dell HSS (Lustre), up to 360 TB: HSS metadata pair, HSS OSS pair, HSS user data
M420 (compute), up to 32 nodes; 2U plenum
Challenge: Experiment processing takes 7 days with current infrastructure, delaying treatment for sick patients
Solution: Dell Next Generation Sequencing Appliance
• 9 teraflops of Sandy Bridge processors
• Lustre file storage
• Intel SW tools and engineers
Benefits: RNA-Seq processing reduced to 4 hours
Single-rack solution; actual placement in racks may vary.
Includes everything you need for NGS: compute, storage, software, networking, infrastructure, installation, deployment, training, service & support
23. Use Case: NEXTBIO
Analytics for Genomics Data
• Cost to sequence a genome has fallen by 800x in the last 4 years
• Each genome has ~4 million variants
• Growth in the genomics data in the public and private domain
• Data available in a variety of sources - structured, semi-structured, unstructured
• New aggregated data growing exponentially
Pipeline: Sequencing (3 billion base pairs) -> Data Processing (millions of variants) -> Cloud Storage -> Visualization -> Interpretation & Analytics (millions of variants, millions of patients) -> Commercializing (targeted therapeutics, companion diagnostics, actionable biomarkers)
24. Data-Intensive Discovery: Genomics
Value: enable researchers to discover biomarkers and drug targets by correlating genomic data sets; 90% gain in throughput; 6x data compression
Analytics: provide curated data sets with pre-computed analysis (classification, correlation, biomarkers); provide APIs for applications to combine and analyze public and private data sets
Data management: use Hive and Hadoop for query and search; dynamically partition and scale HBase; Intel Distribution
Infrastructure: 10-node cluster / Intel Xeon E5 processors; 10GbE network
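One common way to realize "dynamically partition and scale HBase" for an ever-growing variant store is to salt row keys so that sequential genomic positions spread evenly across pre-split regions instead of hammering one hot region. This is a generic illustration of the salting technique, not NextBio's actual schema; the key format and bucket count are assumptions.

```python
# Salted HBase-style row keys: a stable hash-derived prefix distributes
# writes for consecutive positions across N_BUCKETS pre-split regions.
import hashlib

N_BUCKETS = 16  # assumed number of pre-split regions

def salted_key(chrom, pos):
    """Build 'bucket|chrom:pos' where bucket is a stable 2-hex-digit prefix."""
    base = f"{chrom}:{pos:010d}"  # zero-pad so keys sort numerically
    bucket = int(hashlib.md5(base.encode()).hexdigest(), 16) % N_BUCKETS
    return f"{bucket:02x}|{base}"

# Consecutive positions land in different buckets, avoiding a hot region:
for p in range(100, 104):
    print(salted_key("chr1", p))
```

The trade-off is that range scans must now fan out over all buckets, which is acceptable for the point-lookup-heavy, write-once workload this slide describes.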
25. Use Case: NEXTBIO
NextBio & Intel Collaboration
Technical challenge:
Immutable data - write once, never change, read many times
Traditional Bloom filters work; Hadoop & HBase well suited
1 genome: 10 million rows
100 genomes: 1 billion rows
1M genomes: 10 trillion rows
100M genomes: 1 quadrillion (1,000,000,000,000,000) rows
App can dynamically partition HBase as data size grows
Intel optimizations for Hadoop:
Optimized Hadoop stack in open source
Stabilized HBase to provide reliable, scalable deployment
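Bloom filters suit the immutable, write-once data described on this slide: insertions are cheap, deletions are never needed, and membership tests never return false negatives, so a row lookup can be skipped whenever the filter says "absent". A minimal sketch follows; the bit-array size, hash count, and variant-key format are illustrative, not production sizing.

```python
# Minimal Bloom filter: k hash positions per item over an m-bit array.
# No false negatives; false-positive rate is tunable via m and k.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        """Derive n_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("chr1:12345:A>T")               # hypothetical variant key
print("chr1:12345:A>T" in bf)          # True: never a false negative
print("chr2:99999:G>C" in bf)          # False (false positives possible but rare)
```

HBase ships its own per-store-file Bloom filters; this sketch only shows why the data structure fits write-once row data so well.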
26. Putting it together ...
Software Frameworks
Applications
Programming Model (abstraction)
Virtualization
System Software and Resource Management
Computer Hardware, Storage and Networks
27. Summary
• Enabling an ecosystem of partners to innovate and make the Personalized Medicine vision a reality
• Delivering hardware-enhanced capabilities and software to deploy Personalized Medicine
• Working with Big Data vendors to onboard an increasing number of life science workloads to Hadoop and other analytics technologies
Our main building blocks consist of:

Server: The Xeon family of processors consists of the E3, E5 and E7 product lines, which offer different combinations of capabilities and price points for different workloads. The upcoming Intel MIC (Many Integrated Core) processor is targeted primarily at the portion of the HPC market that values maximum parallel processing density, such as …. And our Atom line aims at the low-cost, low-power, ultra-dense microserver market, where node density is paramount.

Networking: Intel has the industry’s #1 selling 1GbE and 10GbE adapters and silicon, and also offers a family of industry-leading, low-latency 10GbE/40GbE switch silicon products.

Storage: One of the biggest trends in storage is the increasing use of compute within the storage box to reduce latencies and also provide lower overall cost/GB of storage through more efficient storage. For large data sets and those storage workloads requiring the lowest latencies, Xeon is the industry choice. Xeon provides the compute capability in over 80% of the storage market. And Intel enterprise SSDs are designed for the demanding performance and endurance needs of the datacenter.

Software and other technologies: We are developing strong open-source components such as our Intel Distribution of Hadoop. Intel Data Center Manager enables better power management at the server, rack and datacenter level. Advanced RAS (reliability, availability and serviceability) features ensure high levels of system resiliency and availability. And Intel’s heavy investment in industry enabling ensures these become available in the widest choice of systems. The most popular are general-purpose systems, but many of our partners innovate further to create highly workload-optimized platforms and converged architecture systems. The greater level of bundling and integration in these systems allows for simpler and faster deployments and ongoing maintenance.

Now let’s look at the specific building blocks...
Field note: There are a few hyperlinks on this presentation in the blue boxes. The first link in E5 leads to a solution showing a 25x increase in data analytics running on Intel architecture, which shows the capability of the new Xeon E5 processor family, using AVX technology and a variety of other performance optimizations from IBM. The second link in E5 leads to a solution brief highlighting how Intel® Xeon® E5 processor based servers running Hadoop are at least three times faster than the previous solution. They can load, sort, and perform their data analyses faster, and Intel® Hyper-Threading Technology really helps with Hadoop workloads. The link in the E7 proof point is focused on a scale-up in-memory analytics solution, SAP HANA, running on Intel’s Xeon E7 processor family. All these proof points help the customer understand the power and variability of our processor solutions for Big Data.

Key points: Significant performance gains delivered by features such as new Intel® Advanced Vector Extensions and improved Intel® Turbo Boost Technology 2.0. To improve flexibility and operational efficiency, significant improvements in I/O with new Intel® Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCI Express 3.0.

Story: To meet the growing demands of IT, such as readiness for cloud computing, the growth in users and the ability to tackle the most complex technical problems, Intel has focused on increasing the capabilities of the processor that lies at the heart of a next generation data center. The Intel Xeon processor E5-2600 product family is the next generation Xeon processor that replaces platforms based on the Intel Xeon processor 5600 & 5500 series. Continuing to build on the success of the Xeon 5600, the E5-2600 product family has increased core count and cache size in addition to supporting more efficient instructions with Intel® Advanced Vector Extensions, to deliver up to an average of 80% more performance across a range of workloads. These processors will offer better than ever performance no matter what your constraint is - floor space, power or budget - and on workloads that range from the most complicated scientific exploration to simple, yet crucial, web serving and infrastructure applications. In addition to the raw performance gains, we’ve invested in improved I/O with Intel Integrated I/O, which reduces latency ~30% while adding more lanes and higher bandwidth with support for PCIe 3.0. This helps to reduce network and storage bottlenecks to unleash the performance capabilities of the latest Xeon processor. The Intel® Xeon® processor E5-2600 product family: versatile processors at the heart of today’s data center. Let’s look at just what kind of performance these products are capable of...

Legal info: Configuration for 80% claim: Source: Performance comparison using best submitted/published 2-socket server results on the SPECfp*_rate_base2006 benchmark as of 6 March 2012. Baseline score of 271 published by Itautec on the Servidor Itautec MX203* and Servidor Itautec MX223* platforms based on the prior generation Intel® Xeon® processor X5690. New score of 492 submitted for publication by Dell on the PowerEdge T620 platform and Fujitsu on the PRIMERGY RX300 S7* platform based on the Intel® Xeon® processor E5-2690. For additional details, please visit www.spec.org. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

Configuration for latency reduction: Source: Intel internal measurements of average time for an I/O device read to local system memory under idle conditions comparing the Intel® Xeon® processor E5-2600 product family (230 ns) vs. the Intel® Xeon® processor 5500 series (340 ns). Baseline configuration: Green City system with two Intel® Xeon® processor E5520 (2.26 GHz, 4C), 12 GB memory @ 1333, C-states disabled, Turbo disabled, SMT disabled. New configuration: Meridian system with two Intel® Xeon® processor E5-2665 (2.4 GHz, 8C), 32 GB memory @ 1600 MHz, C-states enabled, Turbo enabled. The measurements were taken with a LeCroy* PCIe* protocol analyzer using Intel internal Rubicon (PCIe* 2.0) and Florin (PCIe* 3.0) test cards running under Windows* 2008 R2 w/SP1.
Field note: There is a link to a proof point on this slide. Intel IT has a whitepaper on the performance benefits of 10GbE on Apache Hadoop. This whitepaper is at our Intel IT Resource Center, which is useful in many ways for your customer. We would recommend pointing the customer to this site for answers to a variety of questions and configurations.

Up to 20x performance boost over legacy infrastructure with optimizations on Intel® Xeon processors, SSD storage, and 10GbE networking. 10 Gigabit Ethernet (10GbE) networks allow you to quickly import large data sets for processing in multiple locations.

Network: 10 Gigabit Ethernet (10GbE) networking demonstrates its value in the form of high levels of network utilization in the Hadoop cluster. The full use of greater bandwidth can reduce time to ingest and to export data by 80 percent. Moreover, the cost per gigabit of bandwidth with 10GbE is now much lower than 1GbE, making it a natural choice for big data. Much of the performance gain from the underlying hardware requires deep optimization in the software as well as careful tuning of Hadoop configuration parameters. The Intel Distribution is optimized with the latest Intel® processor, storage, and networking hardware components to ensure that the platform delivers balanced performance for the widest range of use cases.

The need for a balanced system: Hadoop is designed and optimized for commonly available hardware. The pace of server innovation has continued unabated for many years, and mainstream systems now deliver massive processing power. To keep pace with that capability, it is vital to deploy Hadoop in the environment it was designed for, one that is balanced between compute, storage, and networking. Hadoop* is increasingly popular for processing big data. Dramatic improvements in mainstream compute and storage resources help make Hadoop clusters viable for most organizations. But to provide a balanced system, those building blocks must be complemented by 10 Gigabit Ethernet (10GbE), rather than legacy Gigabit Ethernet (GbE) networking. This study found success by building on a 10GBASE-T foundation that combines Arista switches, Intel® Ethernet 10 Gigabit Converged Network Adapters, and Intel® Xeon® processor based servers. In the area of networking for this balanced system, the performance of Gigabit Ethernet (GbE) implementations for Hadoop has been a major limiting factor to overall performance. Using the large block size means that, for example, when a packet is dropped and retransmitted, the system needs to handle a large piece of data, which strains network bandwidth in a GbE environment. 10 Gigabit Ethernet (10GbE) networking proves its value in Hadoop clusters through high observed levels of network utilization, demonstrating the benefit of the higher bandwidth.

4x increase in write performance: Hadoop* PUT operation completed in 80 percent less time using 10 Gigabit Ethernet, compared to Gigabit Ethernet.
Field note: There is a link to Intel CAS throughput performance data in the backup of this presentation. There is also a link to a proof point for performance of an SSD on Oracle TimesTen using Intel SSDs. This is a useful whitepaper that shows how adding SSDs to a system configuration saves in both hardware acquisition and software license costs that pay many times over for the initial investment.

There are a variety of new opportunities for solid state disk technologies in the enterprise, and this is enhanced by our new Intel CAS software. Intel Solid State drives come in a variety of form factors, and have enterprise-class levels of reliability along with capacities that are near those of fast rotating media. They can be used as a direct replacement for rotating media. For high-performance needs in the datacenter, Intel SSDs are a great solution that will likely pay for themselves in a short time. We have a pointer to an example that uses Oracle TimesTen if you’re interested in further information or examples. For some applications, adding the Intel Cache Acceleration Software (Intel CAS) solution enables an SSD to act as a local buffer for data on rotating media in the server. This enables you to add in a minimum of cost and get performance at near-SSD levels for all your data, which is a good hybrid solution for cost-conscious deployments. We can look at the performance data in backup if you’re interested.
Key message: Whatever the solution, Intel is actively working with partners to optimize solutions for analyzing the huge variety of data, providing new insight models, and delivering real-time or near real-time information services. Intel is at the core of Big Data across provisioning models and in understanding the right data methods for the right data structure. In the last 24 months there has been more innovation in the DB product market than at any time in the last 10 years. While locality and distribution of compute, storage and IO platforms may vary, Intel has been actively working to optimize its technology portfolio within relational databases, emerging technologies and the analytical engines that are commercially available.
While Intel has started doing work in the area of Big Data with a distribution of Apache Hadoop, you should not assume that this will be the only thing we plan to do. It’s useful to look at what we’re doing and understand the type of capability we can bring to your company with our optimized tools. We are currently focusing our IDH efforts on adding key functionality that we can uniquely provide. For instance, we have added AES-NI support to the distribution, which makes encryption of the data set up to 20x faster. In other words, you have the capability to encrypt your data “for free” in terms of performance, making your data secure without penalty. We are also using our Intel CAS software to optimize data acquisition, and we are adding a variety of other features. Many of these features will be checked back into the Apache open source, providing benefit. If you have interest in understanding our Hadoop roadmap, we would be happy to set up a more detailed meeting with our team to give you details.

Note to field: There is an additional slide in the backup on the Intel Lustre file system distribution, another example of where Intel is contributing to Big Data, specifically in the area of open-source file systems for better performance.

Intel tools for Apache Hadoop, for getting under the hood of Hadoop for tuning & insight:
HiTune: monitors key performance metrics on each server in the cluster, then aggregates/correlates these low-level indicators with high-level data flow models, providing insight into performance bottlenecks, hardware problems, application hot spots and more.
HiBench: measure, validate & compare performance of Hadoop clusters across a variety of workloads. Cluster performance can be measured for specific/common tasks such as sorting, word counting, web searching and data analytics.
Distributed Hadoop environments can be challenging to fine-tune because of the way the framework handles data partitioning, load balancing, fault tolerance, and other low-level operations that Hadoop structures automatically. Intel recently introduced two open-source tools, HiBench and HiTune, to help optimize Hadoop clusters for faster analytics.
Many (most) applications are single threaded; many (most) applications are written for a single address space. NGS-size data quickly pushes 1) and 2) beyond the capacity of a single node: we need multiple threads and a large memory footprint.

Some algorithms (SW, i.e. Smith-Waterman, as an example) scale quadratically with the size of the problem, motivating algorithmic substitution or hardware acceleration.

Cloud: Building in house means capital equipment investment, DC operating costs, and fixed capacity for growing workloads. Building in the cloud offers elastic hourly capacity expansion, but brings challenges around management, ease of use, and data movement. How best to leverage cloud resources in an HPC business process?

As a service: Working subsets are growing too large to fit into available memory. Mapping/aligning with BW (Burrows-Wheeler) and assembly with De Bruijn graphs are good examples, motivating algorithmic innovations and novel approaches to large-memory computers. The amount of data barely fits into currently available disk space (and soon might not).

Databases are distributed and will likely stay that way, motivating much talk of “bringing the computing to the data”, of preprocessing for downstream upload, etc.
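The Smith-Waterman example of quadratic scaling fills an (m+1) x (n+1) dynamic-programming matrix, so doubling both sequence lengths quadruples time and memory; that cost is exactly what motivates the substitution and acceleration mentioned in the notes. A minimal score-only sketch, with illustrative scoring parameters:

```python
# Minimal Smith-Waterman local alignment score, showing the O(m*n)
# time and memory behavior. Match/mismatch/gap values are illustrative.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # (m+1) x (n+1) DP matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores are floored at 0 so an alignment
            # can start anywhere.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGTT", "ACG"))  # 6: the "ACG" prefix matches, 3 * 2
```

Production aligners avoid paying this cost on full genomes by seeding with index lookups (e.g. Burrows-Wheeler based) and running the quadratic step only on short candidate regions, or by vectorizing the inner loop in hardware.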
Cisco* UCS Server1, Intel® Xeon® 5600. Dell* PowerEdge* C Series2, Intel Xeon 5500/5600: the Dell | Cloudera* solution for Apache* Hadoop combines Dell servers and networking components with Cloudera’s Distribution Including Apache Hadoop (CDH), as well as management tools, training, technology support and professional services, to give customers a single source to deploy, manage, and scale a comprehensive Apache Hadoop-based stack. Oracle* Sun Fire* server3, Intel Xeon E7-4800: Oracle Exalytics* In-Memory Machine features the Oracle BI Foundation Suite and Oracle TimesTen In-Memory Database for Exalytics, enhanced for an Oracle server designed for in-memory analytics. Contains 1 terabyte of RAM, 40 Gb/s InfiniBand and 10 Gb/s Ethernet connectivity, and Integrated Lights Out Management.
IMS Demo Unit provided to BioTeam, configured with:
3 blades, each with dual 5650 CPUs, 24 GB of RAM & 4 GbE NICs
Dual Ethernet switches
7 x 600 GB Intel 320 Series SSD drives
Turnkey solution: MiniLIMS + local analysis engine
Plan is to link to cloud resources: automatic backup & link to hosted MiniLIMS
Will ship with Ion Torrent initially
Solution for any lab needing LIMS
Cost to sequence the full genome will soon reach $1000.
http://www.youtube.com/watch?v=F27BvqqNcY4