Apache Drill in the toolbox


takezoe

Presented at Tokyo Apache Drill Meetup Vol.3, Mar 22, 2016: http://drill.connpass.com/event/27414/


Apache Drill in the toolbox
Naoki Takezoe
@takezoen
BizReach, Inc

A lot of JSON in the world
● Configuration
● Data
● Log

We want to query or analyze them. How?

We ♥ SQL

What is Apache Drill?
Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
● Storage
○ Classpath, Local file system / HDFS / S3, HBase, Hive, MongoDB, JDBC
● File format
○ JSON, Parquet, CSV / TSV / PSV

Installation
1. Download and expand Drill distribution
2. cd apache-drill-1.6.0/bin
3. ./drill-embedded
4. Open the web console at http://localhost:8047/

Query local JSON files
{"name": "suzuki", "dept": "sales"}
{"name": "yamada", "dept": "development"}
{"name": "sato", "dept": "development"}
...
SELECT * FROM dfs.`/tmp/users.json` T1
WHERE T1.name = 'takezoe'
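
The same query can also be submitted from a script instead of the shell. The following is a minimal sketch, assuming Drill runs in embedded mode on the local machine and exposes its REST query endpoint on the default web port (8047). The helper names build_query_request and run_query are made up for this example; only run_query needs a running Drill.

```python
import json
import urllib.request

# Assumption: embedded Drill on this machine, default web port.
DRILL_URL = "http://localhost:8047/query.json"

# The query from the slide above.
QUERY = "SELECT * FROM dfs.`/tmp/users.json` T1 WHERE T1.name = 'takezoe'"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows. Requires a running Drill."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response).get("rows", [])
```

With Drill running, run_query(QUERY) should return the matching records as a list of dictionaries.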

Access to RDB tables
Configure a JDBC storage plugin in the web console (the join query that follows refers to it as h2):
{
"type": "jdbc",
"driver": "org.h2.Driver",
"url": "jdbc:h2:~/.gitbucket/data",
"username": "sa",
"password": "sa",
"enabled": true
}

Join JSON and RDB
SELECT
T1.`user`.name AS name,
T2.MAIL_ADDRESS AS mail
FROM dfs.`/tmp/users.json` T1
INNER JOIN h2.DATA.PUBLIC.ACCOUNT T2
ON T1.`user`.name = T2.USER_NAME
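
To make the result of this join concrete, here is a plain-Python sketch of what the query computes. It is not how Drill executes the join. It assumes users.json holds nested records of the form {"user": {"name": ...}}, as the `user`.name references imply, and the ACCOUNT rows and mail addresses are hypothetical because the slides do not show the table's contents.

```python
import json

# Nested user records, one JSON object per line (assumed layout of users.json).
USERS_JSON = """\
{"user": {"name": "suzuki", "dept": "sales"}}
{"user": {"name": "yamada", "dept": "development"}}
{"user": {"name": "sato", "dept": "development"}}
"""

# Hypothetical ACCOUNT rows; the real table's contents are not in the slides.
ACCOUNT_ROWS = [
    {"USER_NAME": "yamada", "MAIL_ADDRESS": "yamada@example.com"},
    {"USER_NAME": "sato", "MAIL_ADDRESS": "sato@example.com"},
]


def inner_join(users_text, accounts):
    """Return name/mail pairs for users that have a matching account row."""
    # Index the account rows by the join key.
    by_user_name = {}
    for account in accounts:
        by_user_name.setdefault(account["USER_NAME"], []).append(account)

    result = []
    for line in users_text.splitlines():
        if not line.strip():
            continue
        name = json.loads(line)["user"]["name"]
        # INNER JOIN: users without a matching account produce no row.
        for account in by_user_name.get(name, []):
            result.append({"name": name, "mail": account["MAIL_ADDRESS"]})
    return result


joined = inner_join(USERS_JSON, ACCOUNT_ROWS)
```

In this sketch suzuki has no account row and therefore does not appear in joined.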

Connect to Drill via JDBC
We can use any JDBC front end or BI tool with Drill.
Requires ZooKeeper.

Connect to Drill via JDBC
Set up ZooKeeper:
$ tar xvzf zookeeper-3.4.8.tar.gz
$ cd zookeeper-3.4.8
$ mv conf/zoo_sample.cfg conf/zoo.cfg
$ cd bin
$ ./zkServer.sh start
Run a drillbit:
$ cd apache-drill-1.6.0/bin
$ ./drillbit.sh start

Connect to Drill via JDBC
● JDBC Driver
○ DRILL_HOME/jars/jdbc-driver/drill-jdbc-all-1.6.0.jar
● Class
○ org.apache.drill.jdbc.Driver
● URL
○ jdbc:drill:drillbit=localhost

Handling nested JSON

Query nested JSON
{"user": {"name": "suzuki", "dept": "sales"}}
{"user": {"name": "yamada", "dept": "development"}}
{"user": {"name": "sato", "dept": "development"}}
...
SELECT
T.`user`.name AS name,
T.`user`.dept AS dept
FROM dfs.`/tmp/users.json` T
WHERE T.`user`.name = 'yamada';
Extract JSON properties as columns.

Expand nested JSON property to records
{"user": {
"name": "yamada",
"experience": [ {"lang": "Java"}, {"lang": "Scala"} ]
}}
SELECT
T2.name AS name,
T2.experience.lang AS lang
FROM (
SELECT
T1.`user`.name AS name,
FLATTEN(T1.`user`.experience) AS experience
FROM dfs.`/tmp/users.json` T1
) T2
FLATTEN expands the nested array into individual records.
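
The effect of FLATTEN on this record can be sketched in plain Python: each element of the experience array becomes its own output row, and the user's name is repeated on every row. This only illustrates the shape of the result under that reading of the query. The function name flatten_experience is made up for this example.

```python
import json

# The sample record from the slide.
RECORD = json.loads(
    '{"user": {"name": "yamada",'
    ' "experience": [{"lang": "Java"}, {"lang": "Scala"}]}}'
)


def flatten_experience(records):
    """Yield one name/lang row per element of each user's experience array."""
    for record in records:
        user = record["user"]
        for experience in user.get("experience", []):
            yield {"name": user["name"], "lang": experience["lang"]}


rows = list(flatten_experience([RECORD]))
# rows == [{"name": "yamada", "lang": "Java"},
#          {"name": "yamada", "lang": "Scala"}]
```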

In the case of jq
$ cat users.json | jq '.user | select(.name == "yamada")'
Nested JSON in Drill brings complexity.
Maybe jq is better for simple queries?

Use cases

Action log
● Store action logs in a local file as JSON
● Query them with Drill when necessary
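
One way an application might write such a log is one JSON object per line, the same layout as the users.json samples earlier in the deck. The sketch below is an assumption about how this could look, not the speaker's implementation. The field names (time, user, action) and the file name actions.json are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_action(path, user, action):
    """Append one action to the log as a single-line JSON object."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
    }
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")


# Demo: write two entries to a temporary file and read them back.
log_path = os.path.join(tempfile.mkdtemp(), "actions.json")
append_action(log_path, "yamada", "login")
append_action(log_path, "sato", "create_repository")

with open(log_path, encoding="utf-8") as log_file:
    entries = [json.loads(line) for line in log_file]
```

A file written this way can then be referenced from Drill through the dfs plugin by its path, in the same way as /tmp/users.json above.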

Data warehouse
● Aggregate various data sources in Drill
● No data synchronization is needed

e.g. Access Elasticsearch through Hive
● elasticsearch-hadoop supports Hive
● Drill supports Hive
http://takezoe.hatenablog.com/entry/20150524/p1
Can we access Elasticsearch from Drill?

Conclusion
Apache Drill is
● a good tool for querying various datasets
● easy to set up and user-friendly
● usable without upfront investment
● useful for small data, not only big data
Put Apache Drill into your toolbox!

