Why is data independence (still) so important?
Julian Hyde @julianhyde

http://github.com/julianhyde/optiq
http://github.com/julianhyde/optiq-splunk

Apache Drill Meeting
2012/9/13
Data independence
This is my opinion about data management systems in general. I don't claim that it is the right answer for Apache Drill.
I claim that a logical/physical separation can make a data management system more widely applicable, therefore more widely adopted, therefore better.
What “data independence” means in today's “big data” world.
About me
Julian Hyde


Database hacker (Oracle, Broadbase, SQLstream, LucidDB)
Open source hacker (Mondrian, olap4j, LucidDB, Optiq)


@julianhyde
http://github.com/julianhyde
Photo credits:
http://www.flickr.com/photos/torkildr/3462606643
http://www.flickr.com/photos/sylvar/31436961/
“Big Data”
Right data, right time
Diverse data sources / Performance / Suitable format
Volume / Velocity / Variety


Volume – solved :)
Velocity – not one of Drill's goals (?)
Variety – ?
Variety
Variety of source formats (csv, avro, json, weblogs)
Variety of storage structures (indexes, projections, sort order, materialized views), now or in future
Variety of query languages (DrQL, SQL)
Combine with other data (join, union)
Embed within other systems, e.g. Hive
Source for other systems, e.g. Drill | Cascading > Teradata
Tools generate SQL
Use case: Optiq* at Splunk
SQL interface on a NoSQL system
“Smart” JDBC driver – pushes processing down to Splunk (see the JDBC sketch below)

* Truth in advertising: I am the author of Optiq.
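For concreteness, here is a sketch of what a client of that “smart” JDBC driver looks like. It is plain JDBC; the connect URL is a placeholder assumed for illustration (check the optiq-splunk project for the real one), and the query reuses the columns from the expression-tree slides that follow.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  /** Sketch of a client of the Optiq JDBC driver. */
  public class SplunkJdbcSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder connect string -- an assumption, not the documented URL.
      try (Connection c = DriverManager.getConnection("jdbc:optiq:");
           Statement s = c.createStatement();
           ResultSet r = s.executeQuery(
               "SELECT \"product_id\", COUNT(*) AS c\n"
               + "FROM \"splunk\".\"splunk\"\n"
               + "WHERE \"action\" = 'purchase'\n"
               + "GROUP BY \"product_id\"\n"
               + "ORDER BY c DESC")) {
        // The driver rewrites the plan so that the filter (and whatever else
        // it can) runs inside Splunk, instead of fetching raw events.
        while (r.next()) {
          System.out.println(r.getString(1) + ": " + r.getLong(2));
        }
      }
    }
  }

The point is that the SQL is completely ordinary; where the work actually runs is decided by the driver, not by the user.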
Expression tree

  SELECT p."product_name", COUNT(*) AS c
  FROM "splunk"."splunk" AS s
    JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
  WHERE s."action" = 'purchase'
  GROUP BY p."product_name"
  ORDER BY c DESC

[Diagram: plan before optimization]
  scan (Splunk, table: splunk), scan (MySQL, table: products)
  → join (key: product_id)
  → filter (condition: action = 'purchase')
  → group (key: product_name, agg: count)
  → sort (key: c DESC)
Expression tree (optimized)

(Same query as on the previous slide.)

[Diagram: plan after optimization – the filter is pushed below the join, so it runs on the Splunk side]
  scan (Splunk, table: splunk) → filter (condition: action = 'purchase')
  scan (MySQL, table: products)
  → join (key: product_id)
  → group (key: product_name, agg: count)
  → sort (key: c DESC)
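To see why the rewritten plan is better, here is a toy, self-contained Java rendering of the optimized tree. This is not Optiq code, the rows are invented, and Java streams merely stand in for the data-flow operators: because the filter runs before the join, only the 'purchase' events ever reach the join, group, and sort.

  import java.util.Comparator;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.stream.Collectors;

  /** Toy illustration only -- not Optiq code.  Runs the optimized plan from
   * the slide by hand: filter -> join -> group -> sort, over made-up rows. */
  public class OptimizedPlanDemo {
    record Event(String action, int productId) {}

    public static void main(String[] args) {
      // "splunk"."splunk": scan
      List<Event> splunk = List.of(
          new Event("view", 1), new Event("purchase", 1),
          new Event("view", 2), new Event("purchase", 2),
          new Event("purchase", 1));
      // "mysql"."products": scan
      Map<Integer, String> products = Map.of(1, "Coffee", 2, "Tea");

      Map<String, Long> counts = splunk.stream()
          .filter(e -> e.action().equals("purchase"))     // filter, pushed below the join
          .map(e -> products.get(e.productId()))          // join on product_id
          .collect(Collectors.groupingBy(name -> name,    // group by product_name, COUNT(*)
              LinkedHashMap::new, Collectors.counting()));

      counts.entrySet().stream()
          .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())) // sort by c DESC
          .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
  }

Optiq's planner arrives at the same shape automatically, by applying transformation rules to the expression tree rather than by asking the user to re-order the steps.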
Conventional DBMS architecture

[Diagram: one monolithic stack]
  JDBC client
  → JDBC server
  → SQL parser / validator
  → query optimizer
  → data-flow operators
  → data (several sources)
  (metadata feeds the parser/validator and the optimizer)
Drill architecture

[Diagram: the same stack, with a gap where the conventional DBMS has its optimizer]
  DrQL client
  → DrQL parser / validator
  → ?   (no query optimizer shown; metadata sits alongside this layer)
  → data-flow operators
  → data (several sources)
Optiq architecture

[Diagram: only the optimizer is core; everything around it is optional or pluggable]
  JDBC client
  → JDBC server (optional)
  → SQL parser / validator (optional)
  → query optimizer (the core), driven by a metadata SPI and pluggable rules
  → third-party operators (pluggable)
  → third-party data
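As a sketch of how the pluggable pieces are wired together from a client's point of view, here is roughly what registering a schema against the core looks like. It uses the present-day Apache Calcite API (the project Optiq grew into), so the package and class names are an assumption relative to the 2012 Optiq codebase, and the empty AbstractSchema merely stands in for a real adapter.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import org.apache.calcite.jdbc.CalciteConnection;
  import org.apache.calcite.schema.SchemaPlus;
  import org.apache.calcite.schema.impl.AbstractSchema;

  /** Sketch: wiring pluggable schemas into the core planner, using the
   * present-day Apache Calcite API (what Optiq became). */
  public class RegisterSchemas {
    public static void main(String[] args) throws Exception {
      Class.forName("org.apache.calcite.jdbc.Driver");
      Connection connection = DriverManager.getConnection("jdbc:calcite:");
      CalciteConnection calcite = connection.unwrap(CalciteConnection.class);
      SchemaPlus root = calcite.getRootSchema();

      // Each add() makes a back end visible to SQL as a schema.  The empty
      // AbstractSchema is only a stand-in for a real adapter (Splunk, MySQL, ...).
      root.add("splunk", new AbstractSchema());

      // Sanity check that the SQL layer is up; real queries could now
      // reference "splunk" and join it to anything else that is registered.
      try (Statement statement = connection.createStatement();
           ResultSet resultSet = statement.executeQuery("VALUES 1 + 1")) {
        resultSet.next();
        System.out.println(resultSet.getInt(1));
      }
      connection.close();
    }
  }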
Analogy: Compiler architecture

[Diagram]
  front end:  C++ | C | Fortran
  middle end: optimizations
  back end:   x86 | ARM | Fortran
Conclusions
Clear logical / physical separation allows a data management system to handle a wider variety of data, query languages, and packaging.
Also provides a clear interface between the sub-teams working on query language and operators.
A query optimizer allows new operators, and alternative algorithms and data structures, to be easily added to the system.
Extra material follows...
Writing an adapter
Driver – if you want a vanity URL like “jdbc:drill:”
Schema – describes what tables exist
Table – describes the columns, and how to get the data
Operators (optional) – non-relational operators, if any
Rules (optional, but recommended) – improve efficiency by changing the question
Parser (optional) – additional source languages
A minimal sketch of a Schema and Table follows below.
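To make the list concrete, here is a minimal read-only adapter: just a Schema and a Table, with no driver, no custom operators, and no rules. It is written against the present-day Apache Calcite API (what Optiq became), so the class names are an assumption relative to the 2012 Optiq codebase, and the in-memory rows are invented for the example.

  import java.util.Map;
  import org.apache.calcite.DataContext;
  import org.apache.calcite.linq4j.Enumerable;
  import org.apache.calcite.linq4j.Linq4j;
  import org.apache.calcite.rel.type.RelDataType;
  import org.apache.calcite.rel.type.RelDataTypeFactory;
  import org.apache.calcite.schema.ScannableTable;
  import org.apache.calcite.schema.Table;
  import org.apache.calcite.schema.impl.AbstractSchema;
  import org.apache.calcite.schema.impl.AbstractTable;
  import org.apache.calcite.sql.type.SqlTypeName;

  /** Minimal adapter sketch: one schema, one table, rows held in memory.
   * The built-in relational operators do all of the work once scan()
   * hands them the rows. */
  public class ToySchema extends AbstractSchema {
    @Override protected Map<String, Table> getTableMap() {
      // Schema: describes what tables exist.
      return Map.<String, Table>of("EVENTS", new EventsTable());
    }

    /** Table: describes the columns, and how to get the data. */
    static class EventsTable extends AbstractTable implements ScannableTable {
      @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        return typeFactory.builder()
            .add("ACTION", SqlTypeName.VARCHAR)
            .add("PRODUCT_ID", SqlTypeName.INTEGER)
            .build();
      }

      @Override public Enumerable<Object[]> scan(DataContext root) {
        return Linq4j.asEnumerable(new Object[][] {
            {"purchase", 1},
            {"view", 2},
            {"purchase", 2}});
      }
    }
  }

Registered under a name as in the earlier sketch (for example root.add("toy", new ToySchema())), the table is immediately queryable – SELECT * FROM "toy"."EVENTS" – with filtering, joining, grouping, and sorting supplied by the built-in operators; rules only become necessary when you want some of that work pushed down to the source.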

Speaker notes

  1. The obligatory “big data” definition slide. What is “big data”? It's not really about “big”. We need to access data from different parts of the organization, when we need it (which often means we don't have time to copy it), and the performance needs to be reasonable. If the data is large, it is often larger than the disks one can fit on one machine. It helps if we can process the data in place, leveraging the CPU and memory of the machines where the data is stored. We'd rather not copy it from one system to another. It needs to be flexible, to deal with diverse systems and formats. That often means that open source is involved. Some systems (e.g. reporting tools) can't easily be changed to accommodate new formats. So it helps if the data can be presented in standard formats, e.g. SQL.
  2. (Applies to both “Expression tree” slides.) It's much more efficient if we push filters and aggregations to Splunk. But the user writing SQL shouldn't have to worry about that. This is not about processing data; it is about processing expressions – reformulating the question. The question is the parse tree of a query, and the parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees) built up of the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.
  3. (Applies to the conventional-DBMS and Drill architecture slides.) A conventional database has an ODBC/JDBC driver, a SQL parser, data sources, an expression tree, expression transformation rules, and an optimizer. For NoSQL databases, the language may not be SQL, and the optimizer may be less sophisticated, but the picture is basically the same. For frameworks, such as Hadoop, there is no planner: you end up writing code (e.g. MapReduce jobs).
  4. In Optiq, the query optimizer (we modestly call it the planner) is central. The JDBC driver/server and SQL parser are optional; skip them if you have another language. Plug-ins provide metadata (the schema), planner rules, and runtime operators. There are built-in relational operators and rules, and there are built-in operators implemented in Java. But to access data, you need to provide at least one operator.