Apache’s Answer to Low Latency
Interactive Query for Big Data
February 13, 2013
Agenda
• Apache Drill overview
• Key features
• Status and progress
• Discuss potential use cases and cooperation
Big Data Workloads
• ETL
• Data mining
• Blob store
• Lightweight OLTP on large datasets
• Index and model generation
• Web crawling
• Stream processing
• Clustering, anomaly detection and classification
• Interactive analysis
Interactive Queries and Hadoop
Compile SQL to MapReduce
SQL based analytics
Impala
Real-time interactive queries
Real-time interactive queries
Emerging Technologies
Common Solutions
Export MapReduce results to RDBMS and query the RDBMS
External tables in an MPP database
Example Problem
• Jane works as an analyst at an e-commerce company
• How does she figure out good targeting segments for the next marketing campaign?
• She has some ideas and lots of data
User profiles
Transaction information
Access logs
Solving the Problem with Traditional Systems
• Use an RDBMS
– ETL the data from MongoDB and Hadoop into the RDBMS
• MongoDB data must be flattened, schematized, filtered and aggregated
• Hadoop data must be filtered and aggregated
– Query the data using any SQL-based tool
• Use MapReduce
– ETL the data from Oracle and MongoDB into Hadoop
– Work with the MapReduce team to generate the desired analyses
• Use Hive
– ETL the data from Oracle and MongoDB into Hadoop
• MongoDB data must be flattened and schematized
– But HiveQL is limited, queries take too long and BI tool support is limited
WWGD
Distributed File System | NoSQL | Interactive analysis | Batch processing
GFS | BigTable | Dremel | MapReduce
HDFS | HBase | ??? | Hadoop MapReduce
Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
Apache Drill Overview
• Interactive analysis of Big Data using standard SQL
• Fast
– Low latency queries
– Columnar execution
• Inspired by Google Dremel/BigQuery
– Complement native interfaces and MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
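The "columnar execution" bullet above can be illustrated with a minimal sketch. It only contrasts reading one field row by row with scanning a single column; it is not Drill's execution engine. The sample records and the `to_columns` helper are hypothetical.

```python
# Hypothetical sample data; field names are illustrative only.
rows = [
    {"user": "homer", "amount": 30, "country": "US"},
    {"user": "marge", "amount": 45, "country": "US"},
    {"user": "apu", "amount": 25, "country": "IN"},
]


def to_columns(records):
    """Pivot row dicts into one list per field (a column layout).

    Assumes every record has the same fields, so positions stay aligned.
    """
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row-at-a-time: each whole record is visited just to read one field.
row_total = sum(record["amount"] for record in rows)

# Column-at-a-time: only the "amount" column is scanned.
column_total = sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be touched to compute them.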
Apache Drill: interactive queries, data analyst, reporting (100 ms-20 min)
MapReduce, Hive, Pig: data mining, modeling, large ETL (20 min-20 hr)
How Does It Work?
• Drillbits run on each node, designed to maximize data locality
• Processing is done outside MapReduce paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization, scheduling, and execution are distributed
SELECT * FROM
oracle.transactions,
mongo.users,
hdfs.events
LIMIT 1
Key Features
• Full SQL (ANSI SQL:2003)
• Nested data
• Schema is optional
• Flexible and extensible architecture
Full SQL (ANSI SQL:2003)
• Drill supports standard ANSI SQL:2003
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
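As an illustration of the "correlated subqueries" item above, here is a small sketch that runs one against Python's built-in SQLite as a stand-in. It does not connect to Drill; the `transactions` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_name TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("homer", 10), ("homer", 50), ("marge", 20), ("marge", 40)],
)

# Correlated subquery: the inner query refers to the outer row's user_name,
# so the average is recomputed per user.
query = """
SELECT t.user_name, t.amount
FROM transactions AS t
WHERE t.amount > (
    SELECT AVG(t2.amount)
    FROM transactions AS t2
    WHERE t2.user_name = t.user_name
)
ORDER BY t.user_name
"""
above_average = conn.execute(query).fetchall()
```

The result keeps only the transactions that exceed their own user's average, which "SQL-like" dialects without subquery support cannot express in a single statement.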
Diagram labels: Client (Tableau, MicroStrategy, Excel, SAP Crystal Reports); Drill ODBC Driver; Drillbit (SQL Query Parser, Query Planner); Drillbits (Drill Worker, Drill Worker)
Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document (compound nested type)
– Google Dremel’s innovation was efficient columnar storage and querying of nested data
• Flattening nested data is error-prone and often impossible
– Think about repeated and optional fields at every level…
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
Avro
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
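To make the "flattening is error-prone" point concrete, here is a minimal sketch that flattens the Homer record from this slide into one row per child. The `flatten` helper is hypothetical and is only one of several possible flattening strategies.

```python
# The record from the slide, as a Python dict.
user = {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [{"name": "Bart"}, {"name": "Lisa"}],
}


def flatten(record):
    """Emit one flat row per child; a record without children yields one row."""
    children = record.get("children") or [None]
    return [
        {
            "name": record["name"],
            "followers": record["followers"],
            "child": child["name"] if child else None,
        }
        for child in children
    ]


flat_rows = flatten(user)

# The repeated field duplicates the parent's columns, so a naive aggregate
# over the flat rows counts Homer's followers once per child.
naive_followers = sum(row["followers"] for row in flat_rows)
```

Homer has 100 followers, but the flattened table reports 200 unless every query remembers to de-duplicate. With repeated fields at several levels the duplication compounds.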
Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema
• Sparse and wide rows in HBase and Cassandra, MongoDB
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it automatically
– System of record may already have schema information
• Why manage it in a separate system?
– No need to manage schema evolution
Row Key | CF contents | CF anchor
"com.cnn.www" | contents:html = "<html>…" | anchor:my.look.ca = "CNN.com"; anchor:cnnsi.com = "CNN"
"com.foxnews.www" | contents:html = "<html>…" | anchor:en.wikipedia.org = "Fox News"
… | … | …
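One way to picture "let the system discover it automatically" is to scan records and collect the fields and value types that actually occur. The sketch below does that for rows shaped like the table above. `discover_schema` is a hypothetical helper and makes no claim about how Drill performs discovery.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Sparse, wide rows: each record carries a different set of columns.
records = [
    {"row_key": "com.cnn.www", "anchor:cnnsi.com": "CNN"},
    {"row_key": "com.foxnews.www", "anchor:en.wikipedia.org": "Fox News"},
]

schema = discover_schema(records)

# Fields missing from at least one record are effectively optional.
optional_fields = sorted(
    field for field in schema if any(field not in record for record in records)
)
```

Here `row_key` appears in every record while both anchor columns are optional, which is the kind of per-record variation a fixed, pre-declared schema handles poorly.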
Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new data source or file format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs and UDTFs
• Optimizers
– Drill will have a cost-based optimizer
– Clear surrounding APIs support easy optimizer exploration
• Operators
– Custom operators can be implemented
• Special operators for Mahout (k-means) being designed
– Operator push-down to data source (RDBMS)
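The push-down idea in the last bullet can be sketched with SQLite standing in for the RDBMS. The table, the data, and both helper functions are hypothetical. The point is only how many rows leave the data source in each case.

```python
import sqlite3

# In-memory SQLite database standing in for an external RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 101)],
)


def scan_then_filter(min_amount):
    """Fetch every row, then filter in the query engine."""
    fetched = conn.execute("SELECT id, amount FROM transactions").fetchall()
    kept = [row for row in fetched if row[1] >= min_amount]
    return kept, len(fetched)


def filter_pushed_down(min_amount):
    """Push the filter into the source so only matching rows are fetched."""
    fetched = conn.execute(
        "SELECT id, amount FROM transactions WHERE amount >= ?",
        (min_amount,),
    ).fetchall()
    return fetched, len(fetched)


late_rows, late_transferred = scan_then_filter(950)
pushed_rows, pushed_transferred = filter_pushed_down(950)
```

Both calls return the same six rows, but the first transfers all 100 rows out of the source while the second transfers only the matches.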
How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of Cloudera
• Faster than Hive on some queries
• SQL-like query language
Questions
• Open Source ‘Lite’
• Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase)
• Early row materialization increases footprint and reduces performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• Compound APIs restrict optimizer progression
• SQL-like (not SQL)
Many important features are “coming soon”. Architectural foundation is constrained. No community development.
Status: In Progress
• Heavy active development by multiple organizations
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
– OpenDremel team joined Apache Drill
• Anticipated schedule:
– Prototype: Q1
– Alpha: Q2
– Beta: Q3
Why Apache Drill Will Be Successful
Resources
• Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
Community
• Development done in the open
• Active contributors from multiple companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full Columnar Execution
• Beyond Hadoop
Questions?
• What problems can Drill solve for you?
• Where does it fit in the organization?
• Which data sources and BI tools are important to you?
Let’s Talk!
• tdunning@maprtech.com
• tdunning@apache.org
• @ted_dunning @ApacheDrill
• Slides at http://bit.ly/YxZq8X
• See also http://www.mapr.com/support/community-resources/drill
APPENDIX
Why Not Leverage MapReduce?
• Scheduling Model
– Coarse resource model reduces hardware utilization
– Acquisition of resources typically takes hundreds of milliseconds to seconds
• Barriers
– Map completion required before shuffle/reduce commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one can start
• Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution phases
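The effect of a barrier can be sketched by logging the order of operations in two toy pipelines: one that materializes all map output before reducing, and one that streams each mapped value straight into the aggregate. Both functions are hypothetical illustrations, not MapReduce or Drill code. The sketch also assumes a simple running sum that needs no grouping or shuffle.

```python
def staged(records, log):
    """Map everything first, then reduce: a barrier sits between the phases."""
    mapped = []
    for record in records:
        log.append(f"map {record}")
        mapped.append(record * 2)
    # Barrier: all map output is materialized before any reduce work starts.
    total = 0
    for value in mapped:
        log.append(f"reduce {value}")
        total += value
    return total


def pipelined(records, log):
    """Stream each mapped value into the aggregate as soon as it is produced."""
    def mapper():
        for record in records:
            log.append(f"map {record}")
            yield record * 2

    total = 0
    for value in mapper():
        log.append(f"reduce {value}")
        total += value
    return total


staged_log, pipelined_log = [], []
staged_total = staged([1, 2, 3], staged_log)
pipelined_total = pipelined([1, 2, 3], pipelined_log)
```

The totals match, but in the staged log every `map` entry precedes the first `reduce`, while the pipelined log interleaves them, so the first partial result is available after a single record.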

Apache Drill

  • 1. Apache’s Answer to Low Latency Interactive Query for Big Data February 13, 2013
  • 2. Agenda • Apache Drill overview • Key features • Status and progress • Discuss potential use cases and cooperation
  • 3. Big Data Workloads • ETL • Data mining • Blob store • Lightweight OLTP on large datasets • Index and model generation • Web crawling • Stream processing • Clustering, anomaly detection and classification • Interactive analysis
  • 4. Interactive Queries and Hadoop • Common Solutions – Compile SQL to MapReduce – Export MapReduce results to RDBMS and query the RDBMS – External tables in an MPP database • Emerging Technologies – SQL based analytics (Impala) – Real-time interactive queries
  • 5. Example Problem • Jane works as an analyst at an e-commerce company • How does she figure out good targeting segments for the next marketing campaign? • She has some ideas and lots of data [Figure labels: user profiles, transaction information, access logs]
  • 6. Solving the Problem with Traditional Systems • Use an RDBMS – ETL the data from MongoDB and Hadoop into the RDBMS • MongoDB data must be flattened, schematized, filtered and aggregated • Hadoop data must be filtered and aggregated – Query the data using any SQL-based tool • Use MapReduce – ETL the data from Oracle and MongoDB into Hadoop – Work with the MapReduce team to generate the desired analyses • Use Hive – ETL the data from Oracle and MongoDB into Hadoop • MongoDB data must be flattened and schematized – But HiveQL is limited, queries take too long and BI tool support is limited
  • 7. WWGD (Google → open source) • Distributed file system: GFS → HDFS • NoSQL: BigTable → HBase • Interactive analysis: Dremel → ??? • Batch processing: MapReduce → Hadoop MapReduce • Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 8. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency queries – Columnar execution • Inspired by Google Dremel/BigQuery – Complement native interfaces and MapReduce/Hive/Pig • Open – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL [Figure: Apache Drill covers interactive queries, data analyst and reporting workloads (100 ms-20 min); MapReduce, Hive and Pig cover data mining, modeling and large ETL (20 min-20 hr)]
  • 9. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (but possibly within YARN) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
  • 10. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 11. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers [Figure labels: Tableau, MicroStrategy, Excel, SAP Crystal Reports; Drill ODBC Driver; Client; Driver; Drillbit (SQL Query Parser, Query Planner); Drillbits (Drill Workers)]
  • 12. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, etc. – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel’s innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 • Avro example: enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } • JSON example: { "name": "Homer", "gender": "Male", "followers": 100, "children": [ {"name": "Bart"}, {"name": "Lisa"} ] }
  • 13. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution • Example HBase table (Row Key | CF contents | CF anchor) – "com.cnn.www" | contents:html = "<html>…" | anchor:my.look.ca = "CNN.com", anchor:cnnsi.com = "CNN" – "com.foxnews.www" | contents:html = "<html>…" | anchor:en.wikipedia.org = "Fox News" – … | … | …
  • 14. Flexible and Extensible Architecture • Apache Drill is designed for extensibility • Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 15. How Does Impala Fit In? Impala Strengths • Beta currently available • Easy install and setup on top of Cloudera • Faster than Hive on some queries • SQL-like query language Questions • Open Source ‘Lite’ • Doesn’t support RDBMS or other NoSQLs (beyond Hadoop/HBase) • Early row materialization increases footprint and reduces performance • Limited file format support • Query results must fit in memory! • Rigid schema is required • No support for nested data • Compound APIs restrict optimizer progression • SQL-like (not SQL) Many important features are “coming soon”. Architectural foundation is constrained. No community development.
  • 16. Status: In Progress • Heavy active development by multiple organizations • Available – Logical plan syntax and interpreter – Reference interpreter • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 200 people on the Drill mailing list – Over 200 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 17. Why Apache Drill Will Be Successful Resources • Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho Community • Development done in the open • Active contributors from multiple companies • Rapidly growing Architecture • Full SQL • New data support • Extensible APIs • Full Columnar Execution • Beyond Hadoop
  • 18. Questions? • What problems can Drill solve for you? • Where does it fit in the organization? • Which data sources and BI tools are important to you?
  • 19. Let’s Talk! • tdunning@maprtech.com • tdunning@apache.org • @ted_dunning @ApacheDrill • Slides at http://bit.ly/YxZq8X • See also http://www.mapr.com/support/community-resources/drill
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes hundreds of milliseconds to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phases
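
The barrier point on slide 21 can be illustrated with a toy model. The sketch below is not Drill's execution engine or MapReduce itself; the function names and the operation-counting metric are assumptions made up for illustration, and it processes one record at a time rather than in columnar batches. It compares how much work is done before the first final result exists when every stage must finish before the next one starts, versus when each record flows through all stages immediately.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[int], int]


def run_with_barriers(records: Sequence[int], stages: Sequence[Stage]) -> Tuple[List[int], int]:
    """Each stage materializes its full output before the next stage starts.

    Returns the results and the number of operations performed by the time
    the first final result exists.
    """
    ops = 0
    ops_at_first_result = 0
    data = list(records)
    for i, stage in enumerate(stages):
        out = []
        for r in data:
            ops += 1
            out.append(stage(r))
            if i == len(stages) - 1 and len(out) == 1:
                ops_at_first_result = ops
        data = out  # barrier: the whole intermediate result is held here
    return data, ops_at_first_result


def run_pipelined(records: Sequence[int], stages: Sequence[Stage]) -> Tuple[List[int], int]:
    """Each record passes through every stage before the next record is read."""
    ops = 0
    ops_at_first_result = 0
    results = []
    for r in records:
        for stage in stages:
            ops += 1
            r = stage(r)
        results.append(r)
        if len(results) == 1:
            ops_at_first_result = ops
    return results, ops_at_first_result


# Three hypothetical record-at-a-time stages over 1000 records.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = list(range(1000))

barrier_results, barrier_first = run_with_barriers(records, stages)
pipe_results, pipe_first = run_pipelined(records, stages)

print(barrier_first, pipe_first)
```

Both strategies compute the same results; only the time to the first result differs. Stages that need to see all their input, such as a global sort or aggregation, would still require a barrier in either model, so this sketch only covers the record-at-a-time case.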

Editor's notes

  1. With the recent explosion of everything related to Hadoop, it is no surprise that new projects and implementations in the Hadoop ecosystem keep appearing. There have been quite a few initiatives that provide SQL interfaces into Hadoop. The Apache Drill project is a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel. Drill is not trying to replace existing Big Data batch processing frameworks, such as Hadoop MapReduce, or stream processing frameworks, such as S4 or Storm. Rather, it fills an existing void: real-time interactive processing of large data sets.
     Technical detail: Similar to Dremel, the Drill implementation is based on the processing of nested, tree-like data. In Dremel this data is based on protocol buffers, a nested, schema-based data model. Drill plans to extend this data model with additional schema-based implementations, for example Apache Avro, and with schema-less data models such as JSON and BSON. In addition to queries over a single data structure, Drill also plans to support "baby joins": joins against data structures small enough to be loaded into memory.
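
The "baby join" idea in the note above can be sketched as a hash join in which the small side is loaded entirely into memory and the large side is streamed past it. This is a minimal illustration under that reading of the note, not Drill's implementation; the function name `baby_join`, the field names and the sample rows are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def baby_join(large: Iterable[dict], small: List[dict], key: str) -> Iterator[dict]:
    """Inner-join a streamed large input against a small in-memory table."""
    # Build phase: index the small side by join key (assumed to fit in memory).
    lookup: Dict[object, List[dict]] = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large side one row at a time, never materializing it.
    for row in large:
        for match in lookup.get(row.get(key), []):
            yield {**match, **row}


# Hypothetical data: a small user table and a larger stream of access events.
users = [
    {"user_id": 1, "name": "Homer"},
    {"user_id": 2, "name": "Marge"},
]
events = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 3, "page": "/cart"},
    {"user_id": 2, "page": "/checkout"},
    {"user_id": 1, "page": "/cart"},
]

joined = list(baby_join(events, users, "user_id"))
for row in joined:
    print(row)
```

Because only the small side is held in memory, the large side can be any iterable, including a generator reading from disk. The event for `user_id` 3 has no matching user and is dropped, as expected for an inner join.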