Analyzing Real-World Data with Apache Drill
© 2014 MapR Technologies
2. Data is doubling in size every two years
3. IDC estimates that in 2020, there will be 44 zettabytes of data in the world
Year | Data volume
2011 | 1.8 zettabytes
2013 | 4.4 zettabytes
2020 | 44 zettabytes (estimate)
Source: IDC Digital Universe
4. Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980-2020, structured data vs. unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
5. NoSchema Datastores are Capturing this Data
[Timeline: 1980-2020]
| Relational databases | “NoSchema” datastores
Volume | MBs-GBs | TBs-PBs
Database | Fixed schema; DBA controls structure | Dynamic schema (schema-free); application controls structure
Structure | Structured | Structured, semi-structured and unstructured
Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)
6. SQL in the Big Data World
WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
DON’T WANT
• Create and maintain schemas on:
– HDFS (Parquet, JSON, etc.)
– HBase
– MongoDB
• Transform or copy data
We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
7. APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Point-and-query vs. schema-first
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
40+ contributors
150+ years of experience building databases and distributed systems
8. Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration | Not needed | Self-service
Zero-day analytics
10. Drill’s Data Model is Flexible
[Diagram: 2x2 grid of data models, flat vs. complex and fixed schema vs. schema-less, with flexibility increasing along both axes. Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table:
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table:
{
name: {
first: Michael,
last: Smith
},
hobbies: [ski, soccer],
district: Los Altos
}
{
name: {
first: Jennifer,
last: Gates
},
hobbies: [sing],
preschool: CCLC
}
11. Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
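To make "schema on the fly" concrete, here is a minimal Python sketch. It is not Drill code and does not reflect how Drill is implemented; it only illustrates the idea that each self-describing record carries its own structure, so no schema has to be declared before reading. The two records are the example documents from slide 10 (written as valid JSON), and the function name `discover_schema` is hypothetical.

```python
import json

# Two self-describing records with different structures (from slide 10).
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"}, '
    '"hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"}, '
    '"hobbies": ["sing"], "preschool": "CCLC"}',
]


def discover_schema(record):
    """Return {dotted field path: Python type name} for one parsed record."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested maps, extending the dotted path.
            for key, child in value.items():
                walk(child, path + [key])
        else:
            # Scalars and lists are recorded as leaf fields.
            schema[".".join(path)] = type(value).__name__

    walk(record, [])
    return schema


# The schema is derived per record at read time, not declared up front.
schemas = [discover_schema(json.loads(line)) for line in RECORDS]
for schema in schemas:
    print(sorted(schema.items()))
```

The two printed schemas differ (one has `district`, the other `preschool`), which is the situation a fixed, centrally declared schema cannot express without a change to the table definition.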
12. Native JSON
JSON query with Drill:
SELECT po_document.AllowPartialShipment
FROM j_purchaseorder;
JSON query with Oracle:
SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER)
FROM j_purchaseorder;
Relational databases cannot provide true schema-free JSON support.
13. Architecture
14. High Level Architecture
• Cluster of commodity servers
– Daemon (drillbit) on each node
• No dependency on other execution engines (MapReduce, Spark, Tez)
– Better performance and manageability
• ZooKeeper maintains ephemeral cluster membership information
– drillbit uses ZooKeeper to find other drillbits in the cluster
– Client uses ZooKeeper to find drillbits
• Data processing unit is columnar record batches
– Enables schema flexibility with negligible performance impact
15. Drill Maximizes Data Locality
[Diagram: ZooKeeper nodes and a row of servers, each running a drillbit alongside a DataNode/RegionServer/mongod]
Data Source | Best Practice
HDFS or MapR-FS | drillbit on each DataNode
HBase or MapR-DB | drillbit on each RegionServer
MongoDB | drillbit on each mongod node (when using replicas, run it on the replica node)
16. SELECT* Query Execution
[Diagram: a client (JDBC, ODBC, REST), ZooKeeper nodes, and drillbits]
1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
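Step 2 (submitting a query to a drillbit) can be sketched for the REST client path. The Python below only builds the HTTP request; it does not send it. It assumes the `POST /query.json` endpoint with a `queryType`/`query` JSON body, as documented for Drill 1.x; whether the 0.7.0 build used in this deck exposes the same endpoint is not confirmed by the slides. The helper name `build_query_request` is hypothetical, and the host and port are the Web UI defaults shown later in the deck.

```python
import json
import urllib.request


def build_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) an HTTP request that submits SQL to a drillbit.

    Assumption: the drillbit exposes POST /query.json (Drill 1.x REST API).
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("SELECT * FROM mongo.yelp.users LIMIT 1")
print(request.get_method(), request.full_url)

# To run it against a live drillbit (step 5 returns the results as JSON):
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Unlike a JDBC client, this sketch skips step 1: it talks to one known drillbit directly instead of discovering drillbits through ZooKeeper.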
17. Core Modules within drillbit
[Diagram: RPC Endpoint, SQL Parser, Optimizer, Logical Plan, Physical Plan, Execution, Distributed Cache, and Storage Plugins (DFS, Hive, HBase, MongoDB)]
19. Demo Plan
1. Run Drill
2. Configure DFS and MongoDB storage plugins
3. Explore the data
– Basics
– Complex data
– Views
20. Run Drill
21. Run Drill in Embedded Mode (sqlline)
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
$ bin/sqlline -u jdbc:drill:zk=local
> SELECT *
FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json`
LIMIT 1;
+---------------+------------+--------------+------------+------------+
| yelping_since | votes | review_count | name | user_id |
+---------------+------------+--------------+------------+------------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee | qtrmBGNqCvupHMHL_bKFgQ |
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode (hence zk=local)
• Can’t use BI clients (JDBC/ODBC) in embedded mode
You can now access the Web UI: http://localhost:8047
22. Or Run Drill in Distributed Mode…
• Make sure ZooKeeper (zkServer) is running:
$ zkServer start
• Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
• Start drillbit:
$ bin/drillbit.sh start
• Access the Web UI: http://localhost:8047
• Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
• Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
• If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired
cluster in the JDBC connection string:
jdbc:drill:zk=localhost:2181/drill/<clustername>
• Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
23. Configure Storage Plugins
24. Enable MongoDB Storage Plugin
26. Explore the Data: Basics
27. Inventory: DFS Files
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
28. Inventory: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin (empty)
local 0.078GB
yelp 0.453GB
> use yelp
> db.users.findOne()
{
"_id" : ObjectId("54566cdf3237149de181a92a"),
"yelping_since" : "2012-02",
"votes" : {
"funny" : 1,
"useful" : 5,
"cool" : 0
},
"review_count" : 6,
"name" : "Lee",
"user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
"friends" : [ ]
}
29. Let’s Go!
> SELECT *
FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
WHERE stars = 1
LIMIT 1;
+------------+------------+------------+------------+------------+------------+------------+-------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+------------+------------+------------+------------+------------+------------+------------+-------------+
| {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+------------+------------+------------+------------+------------+------------+------------+-------------+
30. Using Storage Plugins and Workspaces
[Callouts on the first query: storage plugin (dfs), workspace (root), path relative to workspace (the backquoted path)]
> SELECT * FROM
dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
Storage Plugin | Workspace | Table
dfs | Path | Path relative to workspace
mongo | Database | Collection
hive | Database | Table
hbase | Namespace | Table
31. Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
FROM mongo.yelp.users
GROUP BY name
ORDER BY users DESC LIMIT 10;
+------------+------------+
| name | users |
+------------+------------+
| David | 2453 |
| John | 2378 |
| Michael | 2322 |
| Chris | 2202 |
| Mike | 2037 |
| Jennifer | 1867 |
| Jessica | 1463 |
| Jason | 1457 |
| Michelle | 1439 |
| Brian | 1436 |
+------------+------------+
32. Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
FROM dfs.demo.`/yelp/business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+
33. Explore the Data: Complex Data
34. business.json (1)
{
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,
35. business.json (2)
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar",
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false},
}
}
36. Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------+------------+
| name | hours |
+------------+------------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
+------------+------------+
37. It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, stars, b.hours.Friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+------------+------------+------------+------------+
| name | stars | EXPR$2 | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
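The WHERE clause above combines nested-field access, string comparison on times, and a membership test on a repeated field. The Python sketch below mirrors that logic on a few in-memory records to show what each predicate does; it is not how Drill evaluates the query. The first two records are trimmed from data shown in the slides, the third is hypothetical, and `repeated_contains` here is a simplified exact-match stand-in for Drill's REPEATED_CONTAINS.

```python
# Trimmed-down records in the shape of business.json.
BUSINESSES = [
    {"name": "Olives", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "11:00", "close": "22:30"}},
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Mon Ami Gabi", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"open": "07:00", "close": "00:00"}},
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
    # Hypothetical record: a Mediterranean place that closes at midnight.
    {"name": "Hypothetical Hummus Bar", "stars": 4.5, "city": "Las Vegas",
     "hours": {"Friday": {"open": "17:00", "close": "00:00"}},
     "categories": ["Mediterranean"]},
]


def open_at(business, day, time):
    """Mirror `hours.<day>.open < time AND hours.<day>.close > time`."""
    hours = business.get("hours", {}).get(day)
    if hours is None:
        return False
    # Zero-padded "HH:MM" strings compare correctly as plain strings.
    return hours["open"] < time and hours["close"] > time


def repeated_contains(values, keyword):
    """Simplified stand-in: exact-match membership in a repeated field."""
    return keyword in (values or [])


matches = sorted(
    (b for b in BUSINESSES
     if open_at(b, "Friday", "22:00")
     and repeated_contains(b.get("categories"), "Mediterranean")
     and b["city"] == "Las Vegas"),
    key=lambda b: b["stars"],
    reverse=True,
)
print([b["name"] for b in matches])
```

One consequence of comparing times as strings: a closing time of "00:00" is not greater than "22:00", so the hypothetical midnight-closing business is filtered out even though it would be open at 22:00. The same applies to the SQL predicate as written.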
38. Flatten Repeated Values
> SELECT name, categories
FROM dfs.demo.`yelp/business.json` LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.demo.`yelp/business.json` LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+
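The effect of FLATTEN on a repeated field can be sketched in a few lines of Python: every input row is emitted once per element of the array, with the array replaced by that element. This only illustrates the semantics seen in the output above, using the same three rows; the generator name `flatten` is hypothetical and this is not Drill's implementation.

```python
# The three rows returned by the first query on this slide.
ROWS = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(rows, field):
    """Yield one copy of each row per element of the list in row[field]."""
    for row in rows:
        for element in row[field]:
            # Copy the row, replacing the array with a single element.
            yield {**row, field: element}


flattened = list(flatten(ROWS, "categories"))
for row in flattened:
    print(row["name"], "|", row["categories"])
```

Three input rows become five output rows, matching the second result table on this slide.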
39. Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+------------+------------+
| category | businesses |
+------------+------------+
| Restaurants | 14303 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+------------+------------+
715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.demo.`yelp/business.json`
WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+------------+------------+
| name | categories |
+------------+------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+------------+------------+
40. Explore the Data: Views
41. Create a View for Name-Gender Mapping
names.csv (callouts on the slide mark columns[0] and columns[4]):
> CREATE VIEW dfs.tmp.`names` AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name | gender |
+------------+------------+
| John | Male |
+------------+------------+
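The view gives names to positional CSV fields: a header-less CSV row is exposed as an array called `columns`, and the view maps index 0 to `name` and index 4 to `gender`. A minimal Python sketch of that mapping follows. The sample CSV text is a hypothetical stand-in, since the slides do not show the contents of names.csv; only the two column positions and the John/Male pairing come from the deck.

```python
import csv
import io

# Hypothetical stand-in for names.csv. Only the positions of the name
# (index 0) and gender (index 4) fields are taken from the slide.
NAMES_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\n"


def names_view(text):
    """Expose each CSV row as a record with named fields, like the view."""
    for columns in csv.reader(io.StringIO(text)):
        yield {"name": columns[0], "gender": columns[4]}


# Equivalent of: SELECT * FROM dfs.tmp.names WHERE name = 'John'
rows = [row for row in names_view(NAMES_CSV) if row["name"] == "John"]
print(rows)
```

Like the SQL view, the generator stores nothing: the mapping is applied each time the data is read.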
42. Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY u.name, n.gender
ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name | gender | number |
+------------+------------+------------+
| David | Male | 2453 |
| John | Male | 2378 |
| Michael | Male | 2322 |
| Chris | Unknown | 2202 |
| Mike | Male | 2037 |
| Jennifer | Female | 1867 |
| Jessica | Female | 1463 |
| Jason | Male | 1457 |
| Michelle | Female | 1439 |
| Brian | Male | 1436 |
+------------+------------+------------+
43. Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY n.gender;
+------------+------------+------------+
| gender | users | stars |
+------------+------------+------------+
| Female | 103684 | 3.77 |
| Male | 97430 | 3.696 |
| Unknown | 18409 | 3.727 |
+------------+------------+------------+
44. Who Writes More – Men or Women?
It takes a 3-way join to find out…
> SELECT n.gender, round(avg(length(r.text))) AS review_length
FROM dfs.demo.`yelp/review.json` r,
mongo.yelp.users u,
dfs.tmp.names n
WHERE u.name = n.name AND r.user_id = u.user_id
GROUP BY n.gender;
+------------+---------------+
| gender | review_length |
+------------+---------------+
| Male | 665 |
| Female | 730 |
| Unknown | 711 |
+------------+---------------+
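The query joins three sources: a JSON file (reviews), a MongoDB collection (users), and a view over a CSV file (names). The Python sketch below reproduces the join-then-aggregate logic on in-memory lists to make the data flow explicit. All records are hypothetical toy data, the function name is made up, and the dictionary lookups assume `user_id` and `name` are unique keys, which a real SQL join does not require.

```python
from collections import defaultdict

# Hypothetical toy data standing in for the three sources.
REVIEWS = [  # dfs.demo.`yelp/review.json`
    {"user_id": "u1", "text": "Great food."},
    {"user_id": "u1", "text": "Slow."},
    {"user_id": "u2", "text": "Would come again!"},
]
USERS = [  # mongo.yelp.users
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Jennifer"},
]
NAMES = [  # dfs.tmp.names
    {"name": "John", "gender": "Male"},
    {"name": "Jennifer", "gender": "Female"},
]


def average_review_length(reviews, users, names):
    """Join reviews -> users -> names, then average text length by gender."""
    gender_by_name = {n["name"]: n["gender"] for n in names}
    name_by_user = {u["user_id"]: u["name"] for u in users}
    lengths = defaultdict(list)
    for review in reviews:
        name = name_by_user.get(review["user_id"])
        gender = gender_by_name.get(name)
        if gender is not None:  # inner join: drop unmatched reviews
            lengths[gender].append(len(review["text"]))
    return {g: round(sum(v) / len(v)) for g, v in lengths.items()}


result = average_review_length(REVIEWS, USERS, NAMES)
print(result)
```

In Drill the three inputs stay where they are; the sketch loads everything into memory only to keep the example self-contained.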
45. Drill Tweets (@ApacheDrill)
46. Thank You
• Learn: incubator.apache.org/drill/
• Download: incubator.apache.org/drill/download/
• Ask questions: drill-user@incubator.apache.org
• Contact me: tshiran@apache.org
47. Thank You
Tomer Shiran, VP Product Management
@mapr maprtech
tshiran@mapr.com
MapRTechnologies
maprtech
mapr-technologies
Editor's notes
Have someone introduce me.
Thank audience (tie to morning activities), sponsors, HP, etc.
We’re here because this is the biggest thing that has happened to Hadoop… Here at the conference we’re talking about data science. But before we can appreciate the changes happening in data science, we must first talk about data. Data is doubling every two years. The fast-growing volume, variety and velocity of data is overwhelming traditional systems and approaches. A revolutionary approach is required to leverage this data. And with this new technology, data science as we know it is undergoing tremendous change.
To give you a sense of the data volumes that we’re talking about, I’ve included this chart that shows why a revolutionary approach is needed. You can see the amount of data growing from 1.8 zettabytes in 2011 to a projected 44 zettabytes in 2020. To put this into perspective, a large data warehouse contains terabytes of data. A zettabyte is 1 billion terabytes.
Numbers in chart are from two IDC reports (sponsored by emc).
http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf
http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
What is the source of this data growth? While structured data growth has been relatively modest, the growth in unstructured data has been exponential.
Source of statistic: http://link.springer.com/chapter/10.1007/978-3-642-39146-0_2
The database/datastore landscape is evolving to meet the new requirements. 2009 was the inflection point. NoSchema systems are those in which applications control structure. Developers are being empowered and they are voting for the agility offered by these systems.
In the early days of this revolution we sacrificed the query language, and we eliminated the ability to leverage the knowledge and tools available to millions of people. We’re changing that with a distributed SQL engine. But when we do that, we have to keep in mind that this transition to a NoSchema world happened for a reason, and we don’t want to reintroduce the centralized, DBA-managed schema.
TODO: Add Impala and Splunk logos
IT-driven = months of delay, unnecessary work (data is no longer relevant, etc.)
The so-what needs to be conveyed. Why does it matter that it’s not needed.
6 months -> 3 months -> 3 months -> day zero
So imagine now what you can get…
Data Agility is needed for Business Agility
>>> Stand still during slide, move in at the punchline (why does this matter to YOU)
Organizations are realizing that they have to move towards self-service.
All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON (with additional types) documents. Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.
If you consider the four data models shown in the 2x2, all models can be represented by the complex, no schema model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be available to queries.
TODO: Add Impala and Splunk logos