Introduction to Apache Drill
Big Data Bellevue Meetup

@tnachen
Timothy Chen
Motivation
Key Facts
Architecture Overview
About me
I ♥ Open Source
Use Case: Marketing Campaign
Jane, a marketing analyst
Determine target segments
Data from different sources
Use Case: Crime Detection
• Online purchases
• Fraud, billing, etc.
• Batch-generated overview
• Modes
  – Explorative
  – Alerts
Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
Google’s Dremel

“Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google. …”

http://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar,
Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases
(2010), pp. 330-339
Google’s Dremel

multi-level execution trees

columnar data layout
Google’s Dremel

nested data + schema

column-striped representation

map nested data to tables
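
To make "column-striped representation" concrete, here is a minimal sketch that collects the leaf values of nested records into one list per field path. It is a simplification under stated assumptions: the function name `stripe` and the sample records are hypothetical, and the sketch omits the repetition and definition levels that Dremel stores with each value, so the original records cannot be rebuilt from its output.

```python
from collections import defaultdict


def stripe(records):
    """Collect the leaf values of nested records into one list per field path."""
    columns = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested fields, extending the dotted path.
            for key, child in value.items():
                walk(child, path + (key,))
        elif isinstance(value, list):
            # Repeated field: every element lands in the same column.
            for child in value:
                walk(child, path)
        else:
            columns[".".join(path)].append(value)

    for record in records:
        walk(record, ())
    return dict(columns)


# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "user": {"name": "ann", "tags": ["a", "b"]}},
    {"id": 2, "user": {"name": "bob", "tags": []}},
]
columns = stripe(records)
```

A query that only touches `user.name` would then read a single column instead of every full record, which is the point of the columnar layout.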
Google’s Dremel
experiments:
datasets & query performance
Apache Drill: key facts
Inspired by Google’s Dremel
Standard SQL 2003 support
Pluggable data sources
Nested data is a first-class citizen
Schema is optional
Community driven, open, hundreds involved
High-level Architecture
Principled Query Execution
Source query: what we want to do (analyst friendly)
Logical Plan: what we want to do (language agnostic, computer friendly)
Physical Plan: how we want to do it (the best way we can tell)
Execution Plan: where we want to do it
Principled Query Execution

[Diagram: Source Query (SQL 2003, DrQL, MongoQL, DSL) → Parser (parser API) → Logical Plan → Optimizer (Topology, CF, etc.) → Physical Plan → Execution (scanner API)]

Logical plan (excerpt):

query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
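
The logical plan excerpt on this slide chains a scan and a filter inside a sequence. The sketch below shows one way such a plan could be interpreted over in-memory rows. It is an illustration only, not Drill's executor: the `SOURCES` table, the `run_sequence` function, and the tiny "field operator integer" condition format are all assumptions made for this example.

```python
import operator

# Hypothetical in-memory data source standing in for a real scanner.
SOURCES = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}]}

# Comparison symbols this sketch understands (an assumption, not Drill's grammar).
COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def run_sequence(plan, sources):
    """Apply the operators of a 'sequence' node one after another."""
    rows = []
    for step in plan["do"]:
        if step["op"] == "scan":
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            # Conditions are assumed to look like "<field> <symbol> <integer>".
            field, symbol, literal = step["condition"].split()
            compare = COMPARATORS[symbol]
            rows = [row for row in rows if compare(row[field], int(literal))]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return rows


plan = {
    "@id": "log",
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}
result = run_sequence(plan, SOURCES)
```

Because the plan is plain data, any front end (SQL, DrQL, a DSL) can produce it, which is what makes the logical plan language agnostic.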
Wire-level Architecture
Each node runs a Drillbit, to maximize data locality
Coordination, query planning, execution, etc. are distributed
Any node can act as the endpoint for a query (the foreman)
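
As a rough illustration of "maximize data locality", the sketch below assigns each data block to a Drillbit on a host that already stores the block, and only falls back to a remote Drillbit when no local one exists. The function `assign_blocks`, the block names, and the host names are hypothetical; Drill's actual scheduling is not described on the slide and may differ.

```python
def assign_blocks(block_locations, drillbits):
    """Map each data block to a Drillbit host, preferring hosts that store the block."""
    load = {host: 0 for host in drillbits}
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        local = [host for host in hosts if host in load]
        # Fall back to every Drillbit (a remote read) when none is local.
        candidates = local or list(load)
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda host: (load[host], host))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical cluster layout: block -> hosts holding a replica.
block_locations = {
    "block-0": ["node1", "node2"],
    "block-1": ["node2"],
    "block-2": ["node1", "node3"],
    "block-3": ["node9"],  # no Drillbit runs on this host
}
drillbits = ["node1", "node2", "node3", "node4"]
assignment = assign_blocks(block_locations, drillbits)
```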

[Diagram: four nodes, each running a Drillbit next to its storage process]
Wire-level Architecture
Curator/ZooKeeper for ephemeral cluster membership info
Distributed cache (Hazelcast) for metadata, locality information, etc.
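
"Ephemeral" membership means an entry exists only while the session of the Drillbit that created it is alive. The sketch below imitates that behaviour with a plain in-memory class. It is not a ZooKeeper or Curator client; the class `MembershipRegistry`, its methods, and the session and host names are assumptions made for illustration.

```python
class MembershipRegistry:
    """In-memory stand-in for ephemeral cluster membership (not a ZooKeeper client)."""

    def __init__(self):
        self._members = {}  # session id -> Drillbit address

    def register(self, session_id, address):
        # A Drillbit announces itself under its own session.
        self._members[session_id] = address

    def expire(self, session_id):
        # An ephemeral entry disappears when its owner's session ends.
        self._members.pop(session_id, None)

    def live_drillbits(self):
        return sorted(self._members.values())


registry = MembershipRegistry()
registry.register("session-1", "node1")
registry.register("session-2", "node2")
registry.register("session-3", "node3")
before = registry.live_drillbits()

registry.expire("session-2")  # e.g. node2 crashed or lost its connection
after = registry.live_drillbits()
```

A client that consults the registry before submitting a query only ever sees Drillbits whose sessions are still alive.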
[Diagram: Curator/Zk alongside four nodes, each running a Drillbit with a distributed cache next to its storage process]
Wire-level Architecture
Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.
Streaming data communication avoiding SerDe

[Diagram: Curator/Zk alongside four nodes, each running a Drillbit with a distributed cache next to its storage process]
Wire-level Architecture
The foreman becomes the root of the
multi-level execution tree; the leaves
activate their storage engine interfaces.
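
The multi-level execution tree can be pictured as follows: leaves scan their local storage and return partial results, and each level above merges what its children return until the root (the foreman) holds the answer. The sketch below does this for a simple filtered count. The `Node` class and the sample rows are hypothetical, and everything runs in one process, whereas a real tree spans machines.

```python
class Node:
    """A node in a hypothetical execution tree: leaves scan, inner nodes merge."""

    def __init__(self, children=None, rows=None):
        self.children = children or []
        self.rows = rows or []  # only leaves hold local data in this sketch

    def count_where(self, predicate):
        if not self.children:
            # Leaf: scan local storage and return a partial count.
            return sum(1 for row in self.rows if predicate(row))
        # Inner node or root: combine the partial counts of the children.
        return sum(child.count_where(predicate) for child in self.children)


# Three leaves with local data, two intermediate nodes, one root (the foreman).
leaves = [Node(rows=[1, 5, 8]), Node(rows=[2, 9]), Node(rows=[4, 4, 7, 3])]
mid_level = [Node(children=leaves[:2]), Node(children=leaves[2:])]
foreman = Node(children=mid_level)

total = foreman.count_where(lambda x: x > 3)
```

Only small partial counts travel up the tree, not the scanned rows themselves.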

On the shoulders of giants …
Jackson for JSON SerDe for metadata
Typesafe HOCON for configuration and module management
Netty4 as core RPC engine, protobuf for communication
Vanilla Java, LArray and Netty ByteBuf for off-heap large data structures
Hazelcast for distributed cache
Netflix Curator on top of ZooKeeper for service registry
Optiq for SQL parsing and cost optimization
Parquet (http://parquet.io) / ORC
Janino for expression compilation
ASM for bytecode manipulation
Yammer Metrics for metrics
Guava extensively
Carrot HPPC for primitive collections
Key features
Full SQL – ANSI SQL 2003
Nested Data as first class citizen
Optional Schema
Extensibility Points …
Extensibility Points
Source query → parser API
Custom operators, UDF → logical plan
Serving tree, CF, topology → physical plan/optimizer
Data sources & formats → scanner API

[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
User Interfaces
API: DrillClient
  – Encapsulates endpoint discovery
  – Supports logical and physical plan submission, query cancellation, query status
  – Supports streaming return results
JDBC driver, converting JDBC into DrillClient communication
REST proxy for DrillClient
User Interfaces
Let’s get our hands
dirty…
Demo

Install
$ wget http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/apache-drill-1.0.0-m1-binary-release.tar.gz
$ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz

Preparation
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home
$ export DRILL_LOG_DIR=$PWD

Usage
$ ./bin/drillbit.sh start
$ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
Useful Resources
Getting Started guide
https://github.com/vrtx/incubator-drill/blob/getting_started/docs/getting_started.rst
Demo HowTo
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
How to build/install Apache Drill on Ubuntu 13.04
http://www.confusedcoders.com/bigdata/apachedrill/how-to-build-apache-drill-on-ubuntu-13-04
Be a part of it!
Status
Heavy development by multiple organizations (MapR, Pentaho, Microsoft, ThoughtWorks, XingCloud, etc.)
Currently more than 100k LOC
Alpha available via http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/
Kudos to …
Julian Hyde, Pentaho
Lisen Mu, XingCloud
Tim Chen, Microsoft
Chris Merrick, RJMetrics
David Alves, UT Austin
Sree Vaadi, SSS
Srihari Srinivasan, ThoughtWorks
Ben Becker, MapR
Jacques Nadeau, MapR
Ted Dunning, MapR
Keys Botzum, MapR
Jason Frantz
Ellen Friedman
Chris Wensel, Concurrent
Gera Shegalov, Oracle
Ryan Rawson, Ohm Data
Alexandre Beche, CERN
Jason Altekruse, MapR

http://incubator.apache.org/drill/team.html
Contributing
Contributions are appreciated, and not only code drops:
Test data & test queries
Use case scenarios (textual/SQL queries)
Documentation
Engage!
Follow @ApacheDrill on Twitter
Sign up for the mailing lists (user | dev)
http://incubator.apache.org/drill/mailing-lists.html

Standing G+ hangouts every Tuesday at 18:00 CET
http://j.mp/apache-drill-hangouts

Keep an eye on http://drill-user.org/
Twitter: @tnachen
Email: tnachen@gmail.com

Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

  • 1. Introduction to Apache Drill Big Data Bellevue Meetup @tnachen Timothy Chen
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Use Case: Marketing Campaign Jane, a marketing analyst Determine target segments Data from different sources
  • 23. Use Case: Crime Detection. Online purchases; fraud, billing, etc.; batch-generated overview; modes: explorative, alerts
  • 24. Requirements: support for different data sources; support for different query interfaces; low-latency/real-time; ad-hoc queries; scalable, reliable
  • 26. Google’s Dremel: “Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …” http://research.google.com/pubs/pub36632.html Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
  • 27. Google’s Dremel multi-level execution trees columnar data layout
  • 28. Google’s Dremel nested data + schema column-striped representation map nested data to tables
  • 30. Apache Drill–key facts Inspired by Google’s Dremel Standard SQL 2003 support Plug-able data sources Nested data is a first-class citizen Schema is optional Community driven, open, 100’s involved
  • 34. Principled Query Execution Source query—what we want to do (analyst friendly) Logical Plan— what we want to do (language agnostic, computer friendly) Physical Plan—how we want to do it (the best way we can tell) Execution Plan—where we want to do it
  • 35. Principled Query Execution. Pipeline: Source Query (SQL 2003, DrQL, MongoQL, DSL) → Parser (parser API) → Logical Plan → Optimizer (topology, CF, etc.) → Physical Plan → Execution (scanner API). Logical plan excerpt: query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" }, …
  • 36. Wire-level Architecture: Each node runs a Drillbit to maximize data locality. Co-ordination, query planning, execution, etc. are distributed. Any node can act as the endpoint for a query (the foreman). [Diagram: four nodes, each running a Drillbit next to a storage process]
  • 37. Wire-level Architecture: Curator/Zookeeper for ephemeral cluster membership info. Distributed cache (Hazelcast) for metadata, locality information, etc. [Diagram: Curator/Zk above four nodes, each with a Drillbit, a distributed cache and a storage process]
  • 38. Wire-level Architecture: The originating Drillbit acts as foreman: it manages query execution, scheduling, locality information, etc. Streaming data communication avoids SerDe. [Diagram: Curator/Zk above four nodes, each with a Drillbit, a distributed cache and a storage process]
  • 39. Wire-level Architecture: The foreman becomes the root of the multi-level execution tree; the leaves activate their storage engine interface. [Diagram: Curator/Zk and nodes arranged as an execution tree]
  • 41. On the shoulders of giants … Jackson for JSON SerDe for metadata; Typesafe HOCON for configuration and module management; Netty4 as core RPC engine, protobuf for communication; Vanilla Java, LArray and Netty ByteBuf for off-heap large data structures; Hazelcast for distributed cache; Netflix Curator on top of Zookeeper for service registry; Optiq for SQL parsing and cost optimization; Parquet (http://parquet.io) / ORC; Janino for expression compilation; ASM for bytecode manipulation; Yammer Metrics for metrics; Guava extensively; Carrot HPPC for primitive collections
  • 42. Key features: Full SQL (ANSI SQL 2003); nested data as first-class citizen; optional schema; extensibility points …
  • 43. Extensibility Points: Source query → parser API; custom operators, UDF → logical plan; serving tree, CF, topology → physical plan/optimizer; data sources & formats → scanner API. [Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
  • 44. User Interfaces. API: DrillClient, which encapsulates endpoint discovery; supports logical and physical plan submission, query cancellation and query status; supports streaming return results. JDBC driver, converting JDBC into DrillClient communication. REST proxy for DrillClient.
  • 46. Let’s get our hands dirty…
  • 47. Demo: Install, Preparation, Usage
      $ wget http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/apache-drill-1.0.0-m1-binary-release.tar.gz
      $ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz
      $ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home
      $ export DRILL_LOG_DIR=$PWD
      $ ./bin/drillbit.sh start
      $ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
  • 48. Useful Resources. Getting Started guide: https://github.com/vrtx/incubatordrill/blob/getting_started/docs/getting_started.rst Demo HowTo: https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo How to build/install Apache Drill on Ubuntu 13.04: http://www.confusedcoders.com/bigdata/apachedrill/how-to-build-apache-drill-on-ubuntu-13-04
  • 49. Be a part of it!
  • 50. Status: Heavy development by multiple organizations (MapR, Pentaho, Microsoft, Thoughtworks, XingCloud, etc.). Currently more than 100k LOC. Alpha available via http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/
  • 51. Kudos to … Julian Hyde, Pentaho; Lisen Mu, XingCloud; Tim Chen, Microsoft; Chris Merrick, RJMetrics; David Alves, UT Austin; Sree Vaadi, SSS; Srihari Srinivasan, ThoughtWorks; Ben Becker, MapR; Jacques Nadeau, MapR; Ted Dunning, MapR; Keys Botzum, MapR; Jason Frantz; Ellen Friedman; Chris Wensel, Concurrent; Gera Shegalov, Oracle; Ryan Rawson, Ohm Data; Alexandre Beche, CERN; Jason Altekruse, MapR. http://incubator.apache.org/drill/team.html
  • 52. Contributing Contributions appreciated—not only code drops … Test data & test queries Use case scenarios (textual/SQL queries) Documentation
  • 53. Engage! Follow @ApacheDrill on Twitter Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html Standing G+ hangouts every Tuesday at 18:00 CET http://j.mp/apache-drill-hangouts Keep an eye on http://drill-user.org/
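
The logical plan fragment on slide 35 can be read as a small dataflow: scan a source, then filter its rows. The following is a minimal Python sketch of that reading, not Drill's execution engine. The `SOURCES` table and its rows, the `run` interpreter and the `parse_condition` helper are all hypothetical, and only the two operators visible on the slide are handled.

```python
import operator

# Toy stand-in for a data source; these rows are invented for illustration.
SOURCES = {"logs": [{"x": 1}, {"x": 4}, {"x": 7}]}

# The two operators visible on the slide, written as plain Python data.
PLAN = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_condition(text):
    """Parse '<column> <comparator> <integer>' into a row predicate (hypothetical helper)."""
    column, symbol, literal = text.split()
    compare = COMPARATORS[symbol]
    threshold = int(literal)
    return lambda row: compare(row[column], threshold)


def run(node, rows=None):
    """Evaluate one plan node, handing rows from each step to the next."""
    if node["op"] == "sequence":
        for step in node["do"]:
            rows = run(step, rows)
        return rows
    if node["op"] == "scan":
        return list(SOURCES[node["source"]])
    if node["op"] == "filter":
        keep = parse_condition(node["condition"])
        return [row for row in rows if keep(row)]
    raise ValueError(f"unsupported operator: {node['op']}")


result = run(PLAN)
print(result)
```

Keeping the plan as plain data is what makes it language agnostic in the sense of slide 34: any front end that can emit this structure could feed the same interpreter.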

Editor's notes

  1. Greet. First time talking in front of a meetup.
  2. - Search data pipeline, real-time ingestion of social data (FB/Twitter)
  3. - Battle nations, Top grossing #3, Matchmaking, Online real-time PvP
  4. - Open Source PaaS, Metrics and Usage data
  5. - Halo new Big Data pipeline, working on data ingestion, with open source like Kafka
  6. Love open source, with enthusiastic people offering their time and energy. Smart people and a great sense of community.
  7. - Also some less well known projects such as MMORPG game engine, etc.
  8. And I found Drill!
  9. Interactive / real-time is HOT. Batch processing doesn’t serve all our needs anymore.
  10. - The original TC article that led me to start contributing
  11. - Drill’s Apache incubator proposal, outlining what Drill is trying to achieve
  12. Pro: Can handle big data; MapReduce abstracts all the distribution and management; flexible code to process data. Con: Slow; Hive query startup requires a long processing time.
  13. Pro: Highly available solutions that handle large volumes of writes/reads. Con: Not easy to do ad-hoc SQL-like queries.
  14. Stream processing is becoming very popular, and new projects are emerging, such as Samza, which is based on Kafka. Walmart Labs has a project called Muppet that they call "fast data". Con: You need to define a topology, and you cannot do ad-hoc querying.
  15. AWS CloudSearch, Lucene: all these technologies perform searches on the basis of an index they maintain, and therefore need preprocessing of data.
  16. What most people do when querying multiple sources is engineer an ETL pipeline into a common DW, and query the DW for any cross-segment data. But ETL implies a delay; we can do better.
  17. Drill is not meant to replace MapReduce, but to supplement it.
  18. Two innovations: handling nested data in column style (column-striped representation) and multi-level execution trees.
  19. Repetition levels (r): at what repeated field in the field’s path the value has repeated. Definition levels (d): how many fields in the path that could be undefined (because they are optional or repeated) are actually present. Only repeated fields increment the repetition level; only non-required fields increment the definition level. Required fields are always defined and do not need a definition level. Non-repeated fields do not need a repetition level. An optional field requires one extra bit to store zero if it is NULL and one if it is defined. NULL values do not need to be stored, as the definition level captures this information.
  20. Source query: a human-written (e.g. DSL) or tool-written (e.g. SQL/ANSI-compliant) query. The source query is parsed and transformed to produce the logical plan. Logical plan: the dataflow of what should logically be done. Typically, the logical plan lives in memory in the form of Java objects, but it also has a textual form. The logical plan is then transformed and optimized into the physical plan. The optimizer introduces parallel computation, taking topology into account, and handles columnar data to improve processing speed. The physical plan represents the actual structure of computation as it is done by the system: how physical and exchange operators should be applied. Finally comes assignment to particular nodes and cores, plus the actual query execution per node.
  21. One Drillbit per node, to maximize data locality. Co-ordination, query planning, optimization, scheduling and execution are distributed. By default, Drillbits hold all roles; modules can optionally be disabled. Any node/Drillbit can act as the endpoint for a particular query.
  22. Zookeeper maintains ephemeral cluster membership information only. A small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
  23. The originating Drillbit acts as foreman and manages all execution for its particular query, scheduling based on priority, queue depth and locality information. Drillbit data communication is streaming and avoids any serialization/deserialization.
  24. Red: the originating Drillbit is the root of the multi-level execution tree, per query/job. Leaves use their storage engine interface to scan their respective data source (DB, file, etc.).
  25. Handing over to Ted
  26. Michael?
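
The repetition and definition levels described in note 19 can be made concrete with a short Python sketch. It is an illustration only, not Drill or Dremel code: the `column_levels` function and the way the schema is passed as a `path` of (field, kind) pairs are assumptions made for this example. The record mirrors the `Name` group of the sample document in the Dremel paper cited on slide 26.

```python
def column_levels(record, path):
    """Return (value, r, d) triples for one column of one record.

    path is a list of (field_name, kind) pairs, where kind is
    'required', 'optional' or 'repeated' (hypothetical schema encoding).
    """
    out = []

    def walk(node, depth, r, d):
        if depth == len(path):
            out.append((node, r, d))  # reached a leaf value
            return
        name, kind = path[depth]
        child = node.get(name)
        if kind == "required":
            # Required fields never change either level.
            walk(child, depth + 1, r, d)
        elif kind == "optional":
            if child is None:
                out.append((None, r, d))  # NULL: d records how far the path got
            else:
                walk(child, depth + 1, r, d + 1)
        else:  # repeated
            if not child:
                out.append((None, r, d))
                return
            # Repetition level of this field = repeated fields seen so far in the path.
            level = sum(1 for _, k in path[: depth + 1] if k == "repeated")
            for i, item in enumerate(child):
                # The first item keeps the inherited r; later items repeat at this level.
                walk(item, depth + 1, r if i == 0 else level, d + 1)

    walk(record, 0, 0, 0)
    return out


record = {
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}]},
        {},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ]
}

code = column_levels(
    record, [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
)
country = column_levels(
    record, [("Name", "repeated"), ("Language", "repeated"), ("Country", "optional")]
)
print(code)
print(country)
```

The second `Name` entry has no `Language`, so both columns record a NULL there with d = 1: only `Name` itself was present on the path, which is how the missing value is captured without storing it.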