Design Patterns for Big Data
Architecture: Best Strategies for
Streamlined [Simple, Powerful]
Design

Allen Day, PhD
Data Scientist, MapR Technologies
December 2013
©MapR Technologies - Confidential
BIG DATA
Me, Us
• Allen Day, Principal Data Scientist, MapR
R contributor (10 yr), Hadoop developer (6 yr)
Human Genetics (UCLA Medicine), Machine Learning

• MapR
Distributes open source components for Hadoop
Adds major enhancements for performance, high-availability, and
ease-of-use

• See Also
– “allenday” most places (twitter, github, etc.)
– aday@maprtech.com, @mapR
– http://slideshare.net/allenday
Three Business Use Cases
Personalized
Search


Personalized
Medicine

Market
Segmentation
Three Business Use Cases
Personalized
Search

Personalized
Medicine

• Public web index
+ personal
search history
• Custom ranking
of results

• Patient medical
history
• Genomic info.
• Match against
database of
therapies


Market
Segmentation
• Group similar
customers
• Target with
cross-sell / upsell campaign
Three Business Use Cases
Personalized
Search

Personalized
Medicine

• Public web index
+ personal
search history
• Custom ranking
of results

• Patient medical
history
• Genomic info.
• Match against
database of
therapies

Personal data

Personal data


Market
Segmentation
• Group similar
customers
• Target with
cross-sell / upsell campaign

Marketing

Which ones are similar?
Three Business Use Cases
Personalized
Search

Personalized
Medicine

• Public web index
+ personal
search history
• Custom ranking
of results

• Patient medical
history
• Genomic info.
• Match against
database of
therapies

Personal data

Personal data


Market
Segmentation
• Group similar
customers
• Target with
cross-sell / upsell campaign

Marketing

Surprise! How can you tell?
But First…

WHAT IS A DESIGN PATTERN?

But Before That…

SURPRISE!

Design Pattern Idea
• a general reusable solution to a commonly
occurring problem
• not a finished design
• not code
• can be used in many different situations

History of SW Design Patterns

• 1977: Architecture & Civil Engineering
• 1994: OO Software Architecture
• 2012: Parallelization Software
• ?: Application Parallelization
Not Just Software Designs

http://en.wikipedia.org/wiki/A-line
Identifying the Pattern
Pattern Dimensions
1. Volume
2. Variety
3. Velocity
4. Business Intents & Methods
5. SLAs

Choose a Pattern: Volume & Velocity
1. How big is your target data?
• <10 GB
• mid
• >200 GB
2. How big is your query data?
• Single element at a time
• One pass over 100%
• Multiple passes over big chunks
3. How fast do you need a result?
• Throughput > response
• < 100s (human scale)
Patterns: A, B (Big storage), C (Streaming), D (Nearline Analytics), E (Exploratory Analysis)
[Figure: decision tree from the three questions to patterns A to E]
Twitter Zeitgeist as a
Composite of Design Patterns
Live data source
e.g.
Twitter Firehose

B

C

Big storage

Streaming

D

Nearline
Analytics

Downstream applications
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
of recent data

Large-scale Incremental Processing Using Distributed Transactions and Notifications
http://research.google.com/pubs/pub36726.html
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
of recent data

Queued data are unavailable for
action – not percolation
Queue

Real-time
insertion

Delayed
insertion

Data
store
Percolation of a Composite Store
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
Index

Both parts visible

Market Segmentation
• Divide customers into subsets with common
needs
• Design specific strategies for each subset
• Major emphasis on “fresh” data

Market Segmentation
Feature
Extraction
Real-time
transactions
Customer
history

What does this have to do with percolation?

Assign
Segment
(search)
db
Market
Segments

query
Clustering
Percolator 1
Feature
Extraction
Real-time
transactions
Customer
history


Feature extraction is
percolation because it is
triggered by the arrival of a
new record and because it
updates that new record.
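
The trigger-on-arrival behaviour described above can be sketched in a few lines. This is a minimal in-memory illustration, not the MapR or Percolator API: `PercolatingStore`, `extract_features`, and the `amounts` field are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Observer = Callable[[str, Record], None]


class PercolatingStore:
    """Toy in-memory table that runs observers when a new record arrives."""

    def __init__(self) -> None:
        self.rows: Dict[str, Record] = {}
        self.observers: List[Observer] = []

    def insert(self, key: str, record: Record) -> None:
        # Persist first, then notify: the arrival of the record is the trigger.
        self.rows[key] = record
        for observer in self.observers:
            observer(key, self.rows[key])


def extract_features(key: str, record: Record) -> None:
    """Hypothetical feature extractor: it updates only the record that arrived."""
    amounts = record["amounts"]
    record["features"] = {
        "n_purchases": len(amounts),
        "total_spend": sum(amounts),
    }


store = PercolatingStore()
store.observers.append(extract_features)
store.insert("cust-1", {"amounts": [20.0, 5.0]})
store.insert("cust-2", {"amounts": [3.0]})
```

Each insert touches exactly one row, which is the property the slide uses to classify feature extraction as percolation.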
Percolator 2
Real-time
transactions
Customer
history

Market segment assignment
is percolation because it is
triggered by the arrival of a
new record and because
only that record's segment is
updated.
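
As a rough sketch of segment assignment as a search, the snippet below assigns each arriving record to its nearest segment centroid. The Euclidean nearest-centroid rule, the segment names, and the two-element feature vectors are assumptions made for illustration; the slides do not specify how the search is scored.

```python
import math
from typing import Dict, List

# Hypothetical segment centroids, as a scheduled clustering job might publish them.
segments: Dict[str, List[float]] = {
    "bargain": [1.0, 10.0],
    "premium": [8.0, 200.0],
}
customers: Dict[str, dict] = {}


def nearest_segment(features: List[float], centroids: Dict[str, List[float]]) -> str:
    """Return the name of the centroid closest to the feature vector."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))


def on_new_record(customer_id: str, features: List[float]) -> None:
    """Triggered by one arriving record; writes only that record's segment."""
    customers[customer_id] = {
        "features": list(features),
        "segment": nearest_segment(features, segments),
    }


on_new_record("cust-1", [2.0, 25.0])
on_new_record("cust-2", [7.0, 180.0])
```

Assigning a segment reads the segment table but never rewrites it, so other customers' assignments are untouched.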


Assign
Segment
(search)
db
Market
Segments

query

What about the clustering?
Scheduled Update - Not Percolation

Customer
history

Clustering
The clustering loop is not
percolation since it runs at
fixed intervals instead of
incrementally as updates are
received. It also doesn't
update just a single
customer record.
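
For contrast with the percolators, here is a hedged sketch of a scheduled clustering pass. It uses a plain k-means-style loop as a stand-in, since the slides only say "Clustering"; `recluster`, the seed centroids, and the toy history are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def recluster(
    history: Dict[str, Vector],
    seeds: Dict[str, Vector],
    iterations: int = 5,
) -> Tuple[Dict[str, Vector], Dict[str, str]]:
    """Batch pass: reads every customer and rewrites every segment assignment."""
    centroids = {name: list(vec) for name, vec in seeds.items()}
    assignment: Dict[str, str] = {}
    for _ in range(iterations):
        # Assign every customer to the nearest current centroid.
        assignment = {
            cid: min(centroids, key=lambda n: math.dist(vec, centroids[n]))
            for cid, vec in history.items()
        }
        # Move each centroid to the mean of its members.
        for name in centroids:
            members = [history[c] for c, seg in assignment.items() if seg == name]
            if members:
                centroids[name] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


history = {"a": [1.0, 1.0], "b": [1.0, 3.0], "c": [9.0, 9.0], "d": [11.0, 9.0]}
seeds = {"s0": [0.0, 0.0], "s1": [10.0, 10.0]}
centroids, assignment = recluster(history, seeds)
```

A job like this would be run by a scheduler at fixed intervals, and every call scans the full customer history, which is why it falls outside the percolation pattern.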


Market
Segments
Personalized Search
• Observe web users’ activity over an extended
period
• Understand individual user interests
• Customize search results for each user
• …as fast as possible

Personal Search History and Web Index
Search
Persona
Activity

db
query

Persona update
Histories
trigger

query

Search
Web
Crawl

feature
extraction

Doc
Store

db

update

trigger

Doc
Index

Persona
Index
Percolator 1

Expensive feature
extraction does not
block document ingest

Web
Crawl

feature
extraction

Doc
Store
Percolators 2 and 3
Persona
Activity
Persona update
Histories

Web
Crawl
Doc
Store

update

Doc
Index

Persona
Index
Percolator 4
Updates to personas
trigger updates in
related personas

Search
Persona
Activity

db
query

Persona update
Histories


Persona
Index
Percolator 5?

Persona
Index

Persona
Histories
trigger

query

Search
db

trigger

Doc
Index

Persona and doc
index updates trigger a
personalization refresh
Pattern Context
Persona
Activity

Web
Crawl


Encapsulated
Process
Cyclic Dependency Graph

Percolator Thoughts
• M7 tables are great as the first persistence point
in percolation
• In-memory flag column family works great for
triggering updates
– Efficient - eliminates need for queuing
– Fast triggering with row & column Bloom filters

• Percolation is best supported by dedicated
column families
– Percolators' I/O characteristics differ
– M7 works especially well because it supports lots of
column families
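
The flag-column idea above can be illustrated with a toy table in which each row has separate column families for payload, trigger flag, and derived output. This is a plain-Python model, not the M7 or HBase API; the family names `data`, `flag`, and `derived` and the `length` feature are assumptions for the sketch.

```python
from typing import Any, Dict

Row = Dict[str, Dict[str, Any]]  # column family -> {qualifier: value}

table: Dict[str, Row] = {}


def put(key: str, data: Dict[str, Any]) -> None:
    """Write the payload and raise the flag in the same row update."""
    row = table.setdefault(key, {"data": {}, "flag": {}, "derived": {}})
    row["data"].update(data)
    row["flag"]["dirty"] = True


def percolate_once() -> int:
    """Process only flagged rows, then clear their flags. Returns rows handled."""
    handled = 0
    for row in table.values():
        if row["flag"].get("dirty"):
            # Hypothetical derived value, written to its own column family.
            row["derived"]["length"] = len(row["data"].get("text", ""))
            row["flag"].pop("dirty")
            handled += 1
    return handled


put("doc-1", {"text": "hello world"})
put("doc-2", {"text": "percolate"})
first = percolate_once()
second = percolate_once()
```

Because the flag lives in the table itself, no separate queue is needed to know which rows still need work; the second pass finds nothing to do.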

Cyclic Dependency Graph, M7 Schema

Personalized Medicine
5. Interpretation
& Follow-up

4. Reporting

1. Select Tests

2. Draw Biosample

3. Genome Sequencing
& Analysis
Personalized Medicine Applications
• Pre-conception screening
• Clinical research & trials
– Drug re-targeting

• Therapeutics
– Companion diagnostics
– Therapy selection
Personalized Medicine
Patient
history
(EHR)

EHR
archive

Insert
(eventually)

db
Sequence
extraction
Genome
Sample

Patient
health
context

query

Search

Ranked
therapies

Here we do not see real-time data
pushed to a persistence layer and
processed offline. This pattern does

Personalized Medicine
Patient
history
(EHR)

EHR
archive

Insert
(eventually)

db
Sequence
extraction
Genome
Sample

Patient
health
context

query

Search

User-based recommendation pattern

Surprise! It’s the recommender

Ranked
therapies
Recommendation in Classic Form

Queue

History
Archive

db
Recent
history


query

User
Search

Ranked
similar
histories
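
The classic form above ranks archived histories by similarity to a user's recent history at query time. A minimal sketch follows, assuming histories are sets of item identifiers and using Jaccard similarity as the score; both are illustrative choices, and `rank_similar_histories` is a hypothetical helper.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two histories: |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_histories(
    recent: Set[str],
    archive: Dict[str, Set[str]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Online step: score every archived history against the recent one."""
    scored = [(jaccard(recent, hist), uid) for uid, hist in archive.items()]
    # Highest score first; ties broken by user id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(uid, score) for score, uid in scored[:top_n] if score > 0]


archive = {"u1": {"a", "b", "c"}, "u2": {"c", "d"}, "u3": {"x"}}
ranked = rank_similar_histories({"a", "c"}, archive)
```

The comparison runs against the whole archive on every query, which is the cost the user-based form pays for using the freshest history.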
Item-Based Recommendation
in Classic Form
Queue

History
archive

Cooccurrence
analysis

Off-line analysis

Recent
history
query

Item
linkage
db

Search


Interactive recommendation

Ranked
items
Recommendation Thoughts
• Item-based recommendation is for efficiency
– expensive step in computing co-occurrence can be
done offline and cached prior to a user query

• User-based recommendation is for accuracy
– user comparisons are done online to find the current
best recommendation

• MapR is great for recommendation
– M7 tables offer high I/O performance and can eliminate
queues
– Faster archive updates with optimized MapReduce
– High-availability for mission life critical applications
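
The offline/online split behind item-based recommendation can be sketched as follows. Raw co-occurrence counts are used as the item linkage, which is a simplifying assumption; `build_item_linkage`, `recommend`, and the sample histories are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import DefaultDict, Iterable, List


def build_item_linkage(histories: Iterable[List[str]]) -> DefaultDict[str, Counter]:
    """Offline step: count how often each pair of items shares a history."""
    linkage: DefaultDict[str, Counter] = defaultdict(Counter)
    for items in histories:
        for a, b in combinations(sorted(set(items)), 2):
            linkage[a][b] += 1
            linkage[b][a] += 1
    return linkage


def recommend(recent: List[str], linkage: DefaultDict[str, Counter], top_n: int = 3) -> List[str]:
    """Online step: sum the cached counts for the items in the recent history."""
    scores: Counter = Counter()
    for item in recent:
        for other, count in linkage.get(item, {}).items():
            if other not in recent:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


histories = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "jam"],
    ["eggs", "bacon"],
]
linkage = build_item_linkage(histories)
suggestions = recommend(["milk"], linkage)
```

Only `recommend` runs at query time, and it reads a precomputed table, so the expensive pairwise pass never sits in the user's request path.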

Business Use Cases
& Design Patterns
Recommender –
Personalized
Medicine

Pattern X –
Health data

Percolator –
Personalized
Search

Percolator –
Other Industry

Percolator –
Personalized
Medicine

Pattern X –
Other Industry

Summary: Best Practices
• Look at the big picture
– Find recurring patterns

• Design systems at a high-level
– Solve problems once and reuse components
– Increase R&D productivity
– Decrease operational and maintenance overhead

Thank
You!

Allen Day, PhD
Principal Data Scientist, MapR Technologies
aday@maprtech.com, allenday@allenday.com
@allenday, @mapr
Evolution of Data Storage
Scalability
Over decades of progress,
Unix-based systems have set
the standard for compatibility
and functionality
Linux
POSIX

Functionality
Compatibility
Evolution of Data Storage
Scalability
Hadoop

Hadoop achieves much higher
scalability by trading away
essentially all of this compatibility

Linux
POSIX

Functionality
Compatibility
Evolution of Data Storage
Scalability
Hadoop

MapR enhances Apache Hadoop by
restoring the compatibility while
increasing scalability and performance
Linux
POSIX

Functionality
Compatibility
MapR Data Storage: How it’s done
HBase
NoSQL Tables API

POSIX NFS

implements

depends

Apache
HBase

implements

implements
depends
Hadoop
HDFS API

implements
MapR
Filesystem


implements
Apache Hadoop
HDFS
MapR Data Storage: How it’s done
Vertical Integration = High Performance
HBase
NoSQL Tables API

POSIX NFS

implements

depends

Apache
HBase

implements

implements
depends
Hadoop
HDFS API

implements
MapR
Filesystem


implements
Apache Hadoop
HDFS
Hadoop on MapR No Longer Stands
Apart
Legacy code &
applications

New technologies
d3
node.js
Apache Storm

Multiple types of
data sources

New custom applications

MapR cluster

©MapR Technologies - Confidential

Más contenido relacionado

Similar a 2013.12.12 - Sydney - Big Data Analytics

20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design PatternsAllen Day, PhD
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarBuild and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarImpetus Technologies
 
R for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesR for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesRevolution Analytics
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Arun Karthick Manoharan
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big DataInfochimps, a CSC Big Data Business
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareMapR Technologies
 
Starfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticsStarfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticssai Pramoda
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
RapidMiner - From Data Mining To Decision Making In One Platform.pdf
RapidMiner - From Data Mining To Decision Making In One Platform.pdfRapidMiner - From Data Mining To Decision Making In One Platform.pdf
RapidMiner - From Data Mining To Decision Making In One Platform.pdfDataSpace Academy
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendationsRik Van Bruggen
 
Crowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoopCrowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadooplucenerevolution
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use CasesInSemble
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleJulius Remigio, CBIP
 

Similar a 2013.12.12 - Sydney - Big Data Analytics (20)

20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
SciDB
SciDBSciDB
SciDB
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus WebinarBuild and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
Build and Manage Hadoop & Oracle NoSQL DB Solutions- Impetus Webinar
 
R for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two StrategiesR for SAS Users Complement or Replace Two Strategies
R for SAS Users Complement or Replace Two Strategies
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShare
 
Starfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticsStarfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analytics
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
RapidMiner - From Data Mining To Decision Making In One Platform.pdf
RapidMiner - From Data Mining To Decision Making In One Platform.pdfRapidMiner - From Data Mining To Decision Making In One Platform.pdf
RapidMiner - From Data Mining To Decision Making In One Platform.pdf
 
Using graphs for recommendations
Using graphs for recommendationsUsing graphs for recommendations
Using graphs for recommendations
 
Crowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoopCrowd sourced intelligence built into search over hadoop
Crowd sourced intelligence built into search over hadoop
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 

Más de Allen Day, PhD

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...Allen Day, PhD
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...Allen Day, PhD
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser UniversityAllen Day, PhD
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - WageningenAllen Day, PhD
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - AmsterdamAllen Day, PhD
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIAllen Day, PhD
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Allen Day, PhD
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Renaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsRenaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsAllen Day, PhD
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedAllen Day, PhD
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersAllen Day, PhD
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 

Más de Allen Day, PhD (20)

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Renaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsRenaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and Genomics
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, Abbreviated
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data Engineers
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 

2013.12.12 - Sydney - Big Data Analytics

  • 1. Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design Allen Day, PhD Data Scientist, MapR Technologies December 2013 ©MapR Technologies - Confidential
  • 3. Me, Us • Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop developer (6 yr); Human Genetics (UCLA Medicine), Machine Learning • MapR: distributes open source components for Hadoop; adds major enhancements for performance, high-availability, and ease-of-use • See Also – “allenday” most places (twitter, github, etc.) – aday@maprtech.com, @mapR – http://slideshare.net/allenday
  • 4. Three Business Use Cases • Personalized Search • Personalized Medicine • Market Segmentation
  • 5. Three Business Use Cases • Personalized Search: public web index + personal search history; custom ranking of results • Personalized Medicine: patient medical history; genomic info.; match against database of therapies • Market Segmentation: group similar customers; target with cross-sell / upsell campaign
  • 6. Three Business Use Cases • Personalized Search: public web index + personal search history; custom ranking of results • Personalized Medicine: patient medical history; genomic info.; match against database of therapies • Market Segmentation: group similar customers; target with cross-sell / upsell campaign • Labels, in slide order: Personal data; Personal data; Marketing • Which ones are similar?
  • 8. Three Business Use Cases • Personalized Search: public web index + personal search history; custom ranking of results • Personalized Medicine: patient medical history; genomic info.; match against database of therapies • Market Segmentation: group similar customers; target with cross-sell / upsell campaign • Labels, in slide order: Personal data; Personal data; Marketing • Surprise! How can you tell?
  • 9. But First… WHAT IS A DESIGN PATTERN?
  • 10. But Before That… SURPRISE!
  • 11. Design Pattern Idea • a general reusable solution to a commonly occurring problem • not a finished design • not code • can be used in many different situations
  • 12. History of SW Design Patterns • 1977: Architecture & Civil Engineering • 1994: OO Software Architecture • 2012: Parallelization Software • ?: Application Parallelization
  • 13. Not Just Software Designs http://en.wikipedia.org/wiki/A-line
  • 14. Identifying the Pattern • Pattern Dimensions: 1. Volume 2. Variety 3. Velocity 4. Business Intents & Methods 5. SLAs
  • 15. Choose a Pattern: Volume & Velocity • 1. How big is your target data? <10 GB; mid ?; >200 GB • 2. How big is your query data? Single element at a time; one pass over 100%; multiple passes over big chunks • 3. How fast do you need a result? Throughput > response; < 100s (human scale) • Patterns: A ?; B Big storage; C Streaming; D Nearline Analytics; E Exploratory Analysis
  • 16. Twitter Zeitgeist as a Composite of Design Patterns • Live data source, e.g. Twitter Firehose • B Big storage • C Streaming • D Nearline Analytics • Downstream applications
  • 17. Percolation in Classic Form • Diagram: real-time data source; real-time insertion; data store; offline percolation of recent data • Reference: Large-scale Incremental Processing Using Distributed Transactions and Notifications, http://research.google.com/pubs/pub36726.html
  • 18. Percolation in Classic Form • Diagram: real-time data source; real-time insertion; data store; offline percolation of recent data • Contrast diagram: queue; real-time insertion; delayed insertion; data store • Queued data are unavailable for action – not percolation
  • 19. Percolation in Classic Form • Diagram: real-time data source; real-time insertion; data store; offline percolation of recent data
  • 20. Percolation of a Composite Store • Diagram: real-time data source; real-time insertion; data store; offline percolation; index • Both parts visible
  • 21. Market Segmentation • Divide customers into subsets with common needs • Design specific strategies for each subset • Major emphasis on “fresh” data
  • 22. Market Segmentation • Diagram: real-time transactions; customer history; feature extraction; clustering; market segments db; assign segment (search); query • What does this have to do with percolation?
  • 23. Percolator 1 • Diagram: real-time transactions; customer history; feature extraction • Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
  • 24. Percolator 2 • Diagram: real-time transactions; customer history; assign segment (search); market segments db; query • Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. • What about the clustering?
  • 25. Scheduled Update - Not Percolation • Diagram: customer history; clustering; market segments • The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
  • 26. Personalized Search • Observe web users’ activity over an extended period • Understand individual user interests • Customize search results for each user • …as fast as possible
  • 27. Personal Search History and Web Index • Diagram nodes: Persona Activity; Persona Histories; Persona Index; Web Crawl; Doc Store; Doc Index; Search db • Edge labels: feature extraction; update; trigger; query
  • 28. Percolator 1 • Diagram: Web Crawl; feature extraction; Doc Store • Expensive feature extraction does not block document ingest
  • 29. Percolators 2 and 3 • Diagram: Persona Activity; Persona Histories; update; Persona Index • Diagram: Web Crawl; Doc Store; update; Doc Index
  • 30. Percolator 4 • Diagram: Persona Activity; Persona Histories; update; Persona Index; Search db; query • Updates to personas trigger updates in related personas
  • 31. Percolator 5? • Diagram: Persona Histories; Persona Index; Doc Index; Search db; trigger; query • Persona and doc index updates trigger a personalization refresh
  • 33. Cyclic Dependency Graph
  • 34. Percolator Thoughts • M7 tables are great as the first persistence point in percolation • In-memory flag column family works great for triggering updates – Efficient - eliminates need for queuing – Fast triggering with row & column Bloom filters • Percolation is best supported by dedicated column families – Percolators' I/O characteristics differ – M7 works especially well because it supports lots of column families
  • 35. Cyclic Dependency Graph, M7 Schema
  • 36. Personalized Medicine • 1. Select Tests • 2. Draw Biosample • 3. Genome Sequencing & Analysis • 4. Reporting • 5. Interpretation & Follow-up
  • 37. Personalized Medicine Applications • Pre-conception screening • Clinical research & trials – Drug re-targeting • Therapeutics – Companion diagnostics – Therapy selection
  • 38. Personalized Medicine • Diagram: patient history (EHR); EHR archive; insert (eventually); db; genome sample; sequence extraction; patient health context; query; search; ranked therapies • Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does
  • 40. Recommendation in Classic Form • Diagram: queue; history archive; db; recent history; query; user; search; ranked similar histories
  • 41. Item-Based Recommendation in Classic Form • Diagram: queue; history archive; cooccurrence analysis; item linkage db; recent history; query; search; ranked items • Region labels: Off-line analysis; Interactive recommendation
  • 42. Recommendation Thoughts • Item-based recommendation is for efficiency – expensive step in computing co-occurrence can be done offline and cached prior to a user query • User-based recommendation is for accuracy – user comparisons are done online to find the current best recommendation • MapR is great for recommendation – M7 tables are high I/O performance, can eliminate queues – Faster archive updates with optimized MapReduce – High-availability for mission life critical applications
  • 43. Business Use Cases & Design Patterns • Recommender – Personalized Medicine • Pattern X – Health data • Percolator – Personalized Search • Percolator – Other Industry • Percolator – Personalized Medicine • Pattern X – Other Industry
  • 44. Summary: Best Practices • Look at the big picture – Find recurring patterns • Design systems at a high-level – Solve problems once and reuse components – Increase R&D productivity – Decrease operational and maintenance overhead
  • 45. Thank You! • Allen Day, PhD, Principal Data Scientist, MapR Technologies • aday@maprtech.com, allenday@allenday.com • @allenday, @mapr
  • 46. Evolution of Data Storage • Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX • Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
  • 47. Evolution of Data Storage • Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop • Hadoop achieves much higher scalability by trading away essentially all of this compatibility
  • 48. Evolution of Data Storage • Diagram labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop • MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
  • 49. MapR Data Storage: How it’s done • APIs: HBase NoSQL Tables API; POSIX NFS; Hadoop HDFS API • Implementations: Apache HBase; Apache Hadoop HDFS; MapR Filesystem • Edge labels: implements; depends
  • 50. MapR Data Storage: How it’s done • Vertical Integration = High Performance • APIs: HBase NoSQL Tables API; POSIX NFS; Hadoop HDFS API • Implementations: Apache HBase; Apache Hadoop HDFS; MapR Filesystem • Edge labels: implements; depends
  • 51. Hadoop on MapR No Longer Stands Apart • Legacy code & applications • New technologies: d3, node.js, Apache Storm • Multiple types of data sources • New custom applications • MapR cluster
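
The percolation idea from slides 17 to 25 and the flag-column trigger from slide 34 can be illustrated with a small in-memory sketch. This is only a toy under stated assumptions, not MapR M7 or HBase code: `Table`, `dirty`, `extract_features`, and `percolate` are hypothetical names, and a Python set stands in for the in-memory flag column family.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a table whose rows carry a 'dirty' flag column."""
    rows: dict = field(default_factory=dict)   # row key -> record (a dict)
    dirty: set = field(default_factory=set)    # row keys awaiting percolation

    def insert(self, key, record):
        # Real-time insertion: store the record and raise its flag. No queue.
        self.rows[key] = dict(record)
        self.dirty.add(key)


def extract_features(record):
    # Hypothetical feature: total spend across a customer's transactions.
    return {"total_spend": sum(record["transactions"])}


def percolate(table):
    """Process only flagged rows, update each in place, then clear the flags."""
    processed = []
    for key in sorted(table.dirty):
        table.rows[key]["features"] = extract_features(table.rows[key])
        processed.append(key)
    table.dirty.clear()
    return processed


table = Table()
table.insert("cust-1", {"transactions": [10, 15]})
table.insert("cust-2", {"transactions": [7]})
first_pass = percolate(table)      # both new rows are processed
table.insert("cust-2", {"transactions": [7, 3]})
second_pass = percolate(table)     # only the re-inserted row is processed
```

The second pass touches only `cust-2`, which is the property the slides use to separate percolation from the scheduled clustering loop: work is triggered by, and limited to, newly arrived records.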

Editor's notes

  1-5. Shapes too big; they overwhelm. I would describe the three projects by short name, then add three distinct shapes, making two hearts since both are healthcare; start with all line drawings; too distracting to be in color.
  6-9. Talk track: Both genotyping and market segmentation solutions have a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies data. The modified data could go back to the same data store or…. Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
  10. Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a market campaign, for retention, for specific product offerings, etc. What makes “good” segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find this is to capture customer histories and do a clustering step for discovery and definition of the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction may also be a useful step from the customer history persistence layer.
  11. Talk track: the feature extraction step could be triggered by real-time data insertion…
  12. Talk track: a second percolator processes new customer histories relative to the market segments.
  13. Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, the genotyping?
  14. Here, we trigger updates to the persona index based on EITHER updates to persona history OR updates to the document index. The idea is that if enough docs have changed or personas are finding “unusual” stuff, the persona is stale and we should recompute it.
  15. Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point or, even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) Less risk of delays, I/O storms, etc. that can happen with HBase. This is VERY important when pushing real-time data to a data store. b) Strategic advantage of using in-memory flags on column families: very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  16. Best practice: use one column family per percolator to manage their independent I/O characteristics. Prevent I/O storms.
  17. Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare to known patient histories in order to select the best option for a customized therapy.
  18. Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation. Here a user-based pattern is best. Let’s look at the generalized form…
  19. Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step in computing co-occurrence can be done offline prior to a user query.
  20. Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point or, even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) Less risk of delays, I/O storms, etc. that can happen with HBase. This is VERY important when pushing real-time data to a data store. b) Strategic advantage of using in-memory flags on column families: very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  21. Gives up random access read on files; gives up strong authentication / authorization model; gives up random access write / append on files.
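
The item-based pattern described in note 19 and on slides 41 and 42 splits the work into an expensive offline co-occurrence step and a cheap lookup at query time. The sketch below is a minimal illustration under assumptions, not the deck's implementation: the function names, the toy `archive`, and scoring by raw co-occurrence counts are all hypothetical choices.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_item_linkage(histories):
    """Offline step: count how often each pair of items shares a history."""
    linkage = defaultdict(Counter)
    for items in histories:
        for a, b in permutations(set(items), 2):
            linkage[a][b] += 1
    return linkage


def recommend(linkage, recent, top_n=3):
    """Online step: sum cached counts for items linked to the recent history."""
    scores = Counter()
    for item in recent:
        for other, count in linkage.get(item, {}).items():
            if other not in recent:
                scores[other] += count
    # Sort by score (descending), then by name, so ties are deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered][:top_n]


# Toy history archive: each inner list is one user's item history.
archive = [["a", "b", "c"], ["a", "b"], ["b", "d"]]
linkage = build_item_linkage(archive)       # cached ahead of any user query
ranked = recommend(linkage, recent=["a"])   # cheap lookup at query time
```

Because `build_item_linkage` runs ahead of time, the query path only reads cached counts, which is the efficiency argument the slides make for item-based over user-based recommendation.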