predictive-analytics-san-diego-2013-02-21
2. My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
©MapR Technologies - Confidential 2
3. MapR Technologies
Silicon Valley Startup
– Top investors
– Top technical and management team
• Google, Microsoft, EMC, NetApp, Oracle
Enterprise quality distribution for Hadoop
Many extensions to basic Hadoop function
Strong supporter of Apache Drill
5. The study of the past
(what came before now)
6. What is the future?
(it comes after now)
10. But the future also has a past!
17. Some things turned out as expected
18. Guys wearing Fedoras
19. Many things are different!
20. Hadoop has a history
21. Hadoop also has a future
22. The Old Future of Hadoop
Map-reduce and HDFS
– more and more, but not really different
Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query
Stands apart from other computing
– Required by HDFS and other limitations
23. The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should interface directly
Fast and flexible computation
– Drill logical plan language
24. Example #1
Search Abuse
25. History matrix
One row per user
One column per thing
26. Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
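A minimal sketch of how a history matrix (one row per user, one column per thing) can be turned into an item-item cooccurrence mapping. The user names, item names, and the `cooccurrence` helper are hypothetical, and the sketch uses raw counts only; it does not reproduce whatever scoring Mahout applies.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for each ordered pair of items, how many users have both.

    history: dict mapping user -> set of items (one row per user).
    Returns a dict mapping (item_a, item_b) -> count, symmetric in the pair.
    """
    counts = defaultdict(int)
    for items in history.values():
        # Every unordered pair of items in one user's row cooccurs once.
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)


# Hypothetical history: three users, three things.
history = {
    "u1": {"apple", "bread"},
    "u2": {"apple", "bread", "cheese"},
    "u3": {"bread", "cheese"},
}
cooc = cooccurrence(history)
print(cooc[("apple", "bread")])
```

The result has one row and one column per thing, which is the shape the slide describes for the item-item mapping.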
28. [Diagram. Labels: Complete history; Cooccurrence (Mahout); Item meta-data; SolR Indexer (three instances); Solr indexing; Index shards]
29. [Diagram. Labels: User history; Web tier; Item meta-data; SolR Indexer (three instances); Solr search; Index shards]
30. Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
31. Example #2
Web Technology
32. [Diagram. Labels: Real-time data; Fast analysis (Storm); Raw logs; Analytic output]
33. [Diagram. Labels: Large analysis (map-reduce); Raw logs; Analytic output]
34. [Diagram. Labels: Browser query; Presentation tier (d3 + node.js); Raw logs; Analytic output]
35. Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
36. Example #3
Apache Drill
37. Big Data Processing – Hadoop
                       Batch processing
Query runtime          Minutes to hours
Data volume            TBs to PBs
Programming model      MapReduce
Users                  Developers
Google project         MapReduce
Open source project    Hadoop MapReduce
38. Big Data Processing – Hadoop and Storm
                       Batch processing     Stream processing
Query runtime          Minutes to hours     Never-ending
Data volume            TBs to PBs           Continuous stream
Programming model      MapReduce            DAG (pre-programmed)
Users                  Developers           Developers
Google project         MapReduce
Open source project    Hadoop MapReduce     Storm or Apache S4
39. Big Data Processing – The missing part
                       Batch processing     Interactive analysis      Stream processing
Query runtime          Minutes to hours                               Never-ending
Data volume            TBs to PBs                                     Continuous stream
Programming model      MapReduce                                      DAG (pre-programmed)
Users                  Developers                                     Developers
Google project         MapReduce
Open source project    Hadoop MapReduce                               Storm and S4
40. Big Data Processing – The missing part
                       Batch processing     Interactive analysis      Stream processing
Query runtime          Minutes to hours     Milliseconds to minutes   Never-ending
Data volume            TBs to PBs           GBs to PBs                Continuous stream
Programming model      MapReduce            Queries (ad hoc)          DAG (pre-programmed)
Users                  Developers           Analysts and developers   Developers
Google project         MapReduce
Open source project    Hadoop MapReduce                               Storm and S4
41. Big Data Processing
                       Batch processing     Interactive analysis      Stream processing
Query runtime          Minutes to hours     Milliseconds to minutes   Never-ending
Data volume            TBs to PBs           GBs to PBs                Continuous stream
Programming model      MapReduce            Queries                   DAG
Users                  Developers           Analysts and developers   Developers
Google project         MapReduce            Dremel
Open source project    Hadoop MapReduce                               Storm and S4
42. Big Data Processing
                       Batch processing     Interactive analysis      Stream processing
Query runtime          Minutes to hours     Milliseconds to minutes   Never-ending
Data volume            TBs to PBs           GBs to PBs                Continuous stream
Programming model      MapReduce            Queries                   DAG
Users                  Developers           Analysts and developers   Developers
Google project         MapReduce            Dremel
Open source project    Hadoop MapReduce     Apache Drill              Storm and S4
43. Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
44. Simple Architecture
[Diagram. Interface stages: Query language; Logical Language; Physical Plan. Steps between them: Transform; Optimize; Execute]
45. Standard Interfaces
[Diagram. Interface stages: Query language (SQL 2003); Logical Language (Drill logical syntax); Physical Plan. Steps between them: Transform; Optimize; Execute. Additional label: Scanner API]
46. Logical Plan Syntax:
query:[
  {
    op:"sequence", do:[
      { op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"}
      },
      { op: "transform",
        transforms: [ { ref: "donuts.quantity", expr: "donuts.sales"} ]
      },
      { op: "filter",
        expr: "donuts.ppu < 1.00"
      },
      …
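To make the shape of such a plan concrete, here is a small Python sketch that walks a sequence of scan, transform, and filter steps over in-memory records. It is not Drill code: the `run_sequence` helper, the flat field names, and the sample records are hypothetical, the transform is reduced to a field copy, and the filter expression is supplied as a Python callable rather than parsed from a string.

```python
def run_sequence(ops, sources):
    """Apply a list of plan steps, in order, to rows read from `sources`."""
    rows = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            # Copy the rows so later steps do not mutate the source data.
            rows = [dict(r) for r in sources[op["source"]]]
        elif kind == "transform":
            # Assumption: each transform copies one field into a new field.
            for t in op["transforms"]:
                for r in rows:
                    r[t["ref"]] = r[t["expr"]]
        elif kind == "filter":
            # Assumption: the predicate is a callable, not an expression string.
            rows = [r for r in rows if op["predicate"](r)]
        else:
            raise ValueError(f"unsupported op: {kind}")
    return rows


# Hypothetical records standing in for the "local-logs" source.
sources = {
    "local-logs": [
        {"donuts.sales": 10, "donuts.ppu": 0.50},
        {"donuts.sales": 3, "donuts.ppu": 1.50},
    ]
}
plan = [
    {"op": "scan", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
    {"op": "filter", "predicate": lambda r: r["donuts.ppu"] < 1.00},
]
result = run_sequence(plan, sources)
print(result)
```

Only the first record survives the filter, and it carries the copied field added by the transform step.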
47. Logical Streaming Example
{ @id: <refnum>, op: "window-frame",
  input: <input>,
  keys: [
    <name>,...
  ],
  ref: <name>,
  before: 2,
  after: here
}
[Figure: input records 0 1 2 3 4 yield the window frames 0, 01, 012, 123, 234]
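The window-frame behaviour pictured on this slide can be sketched in Python as follows. The `window_frames` function is hypothetical, it ignores the `keys` grouping, and it assumes that `after: here` means the frame ends at the current record.

```python
def window_frames(records, before, after=0):
    """Return one frame per record: up to `before` earlier records,
    the record itself, and up to `after` later records."""
    frames = []
    for i in range(len(records)):
        lo = max(0, i - before)                 # clip at the start of the input
        hi = min(len(records), i + after + 1)   # clip at the end of the input
        frames.append(records[lo:hi])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

With `before: 2` the frames grow until they hold three records and then slide forward one record at a time, matching the 0, 01, 012, 123, 234 sequence in the figure.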
48. Logical Plan
scan-json "table-1"
filter exp1
flatten
aggregate exp2
49. Execution Plan
node1                  node2                  node3
scan-json "table-1"    scan-json "table-1"    scan-json "table-1"
filter exp1            filter exp1            filter exp1
flatten                flatten                flatten
aggregate exp2 (single step, fed by all three nodes)
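A rough Python sketch of this execution plan: the same scan, filter, and flatten steps run independently on each node's share of the data, and one aggregate step consumes all of their output. The record layout, the `node_pipeline` and `execute` helpers, and the use of `sum` as the aggregate are assumptions made for illustration, not part of Drill.

```python
def node_pipeline(records, keep):
    """Per-node work: scan the local records, filter them, flatten the survivors."""
    out = []
    for rec in records:              # scan
        if keep(rec):                # filter
            out.extend(rec["values"])  # flatten the nested list of values
    return out


def execute(partitions, keep, aggregate):
    """Run the per-node pipeline on every partition, then aggregate once."""
    partials = [node_pipeline(p, keep) for p in partitions]
    merged = [v for part in partials for v in part]
    return aggregate(merged)


# Hypothetical data: one partition of "table-1" per node.
partitions = [
    [{"ok": True, "values": [1, 2]}],    # node1
    [{"ok": False, "values": [10]}],     # node2
    [{"ok": True, "values": [3]}],       # node3
]
total = execute(partitions, keep=lambda r: r["ok"], aggregate=sum)
print(total)  # 6
```

In this sketch the nodes run one after another in a single process; the point is only the split between per-node steps and the single final aggregate.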
50. Representing a DAG
[Figure: the "aggregate exp2" node, labeled with the ids 18 and 19]
{ @id: 19, op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [
    {ref: <name>, expr: <aggexpr> },...
  ]
}
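One way to read this representation: each operator carries its own `@id` and names its upstream operator through `input`, so the list of operators encodes a graph. The Python sketch below evaluates such a list by following `input` references. The `evaluate` helper, the inline `rows` on the scan step, and the sum-per-key meaning given to the aggregation are all assumptions for illustration.

```python
def evaluate(ops, target, impls):
    """Evaluate the operator with id `target`, resolving `input` ids recursively."""
    by_id = {op["@id"]: op for op in ops}
    cache = {}  # each operator is evaluated at most once

    def run(op_id):
        if op_id not in cache:
            op = by_id[op_id]
            upstream = run(op["input"]) if "input" in op else None
            cache[op_id] = impls[op["op"]](op, upstream)
        return cache[op_id]

    return run(target)


def scan(op, _upstream):
    # Assumption: the scan step carries its rows inline for this sketch.
    return op["rows"]


def aggregate(op, rows):
    # Assumption: sum the `expr` field per value of the first key.
    key = op["keys"][0]
    agg = op["aggregations"][0]
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[agg["expr"]]
    return [{key: k, agg["ref"]: v} for k, v in totals.items()]


ops = [
    {"@id": 18, "op": "scan",
     "rows": [{"k": "a", "v": 1}, {"k": "b", "v": 5}, {"k": "a", "v": 2}]},
    {"@id": 19, "op": "aggregate", "input": 18,
     "keys": ["k"], "aggregations": [{"ref": "total", "expr": "v"}]},
]
result = evaluate(ops, 19, {"scan": scan, "aggregate": aggregate})
print(result)
```

Because operators refer to each other by id rather than by nesting, two operators can share the same input, which a purely nested plan cannot express.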
51. Non-SQL queries
[Diagram. Labels: scan-json "table-1" (two instances); streaming k-means; ball k-means; k-means cluster; aggregate exp2; join; features]
52. Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
• Column-based and row-based
• Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
• Google C++ style guide
• Min latency and max throughput (limited only by hardware)
53. The future is not what we thought it would be
55. Get Involved!
Tweet:
#hcj13w
#mapr
@ted_dunning
56. Get Involved!
Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013
Join the Drill project
– drill-dev-subscribe@incubator.apache.org
– #apachedrill
Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– @ted_dunning
Join MapR (in Japan!)
– jobs@mapr.com