8. Installation
1. Download and extract the Drill distribution
2. cd apache-drill-1.6.0/bin
3. ./drill-embedded
Web console: http://localhost:8047/
9. Query local JSON files
{"name": "suzuki", "dept": "sales"}
{"name": "yamada", "dept": "development"}
{"name": "sato", "dept": "development"}
...
SELECT * FROM dfs.`/tmp/users.json` T1
WHERE T1.name = 'takezoe'
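The same query can also be submitted from a script. The following is a minimal sketch, assuming Drill's REST endpoint `/query.json` is served on the web console port shown on the installation slide; the helper names `build_query_request` and `run_query` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: the web console address from the installation slide.
DRILL_URL = "http://localhost:8047"


def build_query_request(sql, base_url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url=DRILL_URL):
    """Send the query and return the result rows (needs a running Drill)."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.load(resp).get("rows", [])


# The query from this slide, unchanged.
SQL = "SELECT * FROM dfs.`/tmp/users.json` T1 WHERE T1.name = 'takezoe'"
request = build_query_request(SQL)
```

With Drill running in embedded mode, `run_query(SQL)` should return the matching records as a list of dictionaries; nothing is sent until that function is called.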
10. Access to RDB tables
Configure a JDBC storage plugin in the web console:
{
"type": "jdbc",
"driver": "org.h2.Driver",
"url": "jdbc:h2:~/.gitbucket/data",
"username": "sa",
"password": "sa",
"enabled": true
}
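As an alternative to pasting the configuration into the web console, it can probably be registered through Drill's REST API. This is a sketch under assumptions: the plugin name `h2` is inferred from the `h2.` prefix used in the join query, the endpoint is taken to be `POST /storage/{name}.json` on the web console port, and the helper names are hypothetical. The H2 driver JAR also has to be available to Drill for the plugin to work.

```python
import json
import urllib.request

# Assumption: the web console address from the installation slide.
DRILL_URL = "http://localhost:8047"

# Plugin configuration copied from the slide above.
H2_CONFIG = {
    "type": "jdbc",
    "driver": "org.h2.Driver",
    "url": "jdbc:h2:~/.gitbucket/data",
    "username": "sa",
    "password": "sa",
    "enabled": True,
}


def build_plugin_request(name, config, base_url=DRILL_URL):
    """Build (but do not send) the request that creates or updates a plugin."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/storage/{name}.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_plugin(name, config, base_url=DRILL_URL):
    """Send the plugin configuration to a running Drill and return its reply."""
    with urllib.request.urlopen(build_plugin_request(name, config, base_url)) as resp:
        return json.load(resp)


# "h2" is an assumed plugin name; it becomes the schema prefix in queries.
plugin_request = build_plugin_request("h2", H2_CONFIG)
```

Calling `register_plugin("h2", H2_CONFIG)` against a running Drill would store the configuration under the name `h2`.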
11. Join JSON and RDB
SELECT
T1.`user`.name AS name,
T2.MAIL_ADDRESS AS mail
FROM dfs.`/tmp/users.json` T1
INNER JOIN h2.DATA.PUBLIC.ACCOUNT T2
ON T1.`user`.name = T2.USER_NAME
12. Connect to Drill via JDBC
We can use any JDBC frontend or BI tool with Drill through JDBC.
This setup requires ZooKeeper.
13. Connect to Drill via JDBC
Set up ZooKeeper
$ tar xvzf zookeeper-3.4.8.tar.gz
$ cd zookeeper-3.4.8
$ mv conf/zoo_sample.cfg conf/zoo.cfg
$ cd bin
$ ./zkServer.sh start
Run drillbit
$ cd apache-drill-1.6.0/bin
$ ./drillbit.sh start
14. Connect to Drill via JDBC
● JDBC Driver
○ DRILL_HOME/jars/jdbc-driver/drill-jdbc-all-1.6.0.jar
● Class
○ org.apache.drill.jdbc.Driver
● URL
○ jdbc:drill:drillbit=localhost
16. Query nested JSON
{"user": {"name": "suzuki", "dept": "sales"}}
{"user": {"name": "yamada", "dept": "development"}}
{"user": {"name": "sato", "dept": "development"}}
...
SELECT
T.`user`.name AS name,
T.`user`.dept AS dept
FROM dfs.`/tmp/users.json` T
WHERE T.`user`.name = 'yamada';
Extract a JSON property as a column
17. Expand nested JSON property to records
{"user": {
"name": "yamada",
"experience": [ {"lang": "Java"}, {"lang": "Scala"} ]
}}
SELECT
T2.name AS name,
T2.experience.lang AS lang
FROM (
SELECT
T1.`user`.name AS name,
FLATTEN(T1.`user`.experience) AS experience
FROM dfs.`/tmp/users.json` T1
) T2
Expand a nested array into individual records
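To make the effect of FLATTEN concrete, here is a small plain-Python sketch of the same transformation on the sample record. It only illustrates the row model (one output row per array element) and is not how Drill executes the query; the function name `flatten_experience` is hypothetical.

```python
import json

# Sample record copied from the slide above.
RECORD = json.loads("""
{"user": {
  "name": "yamada",
  "experience": [ {"lang": "Java"}, {"lang": "Scala"} ]
}}
""")


def flatten_experience(records):
    """Yield one {name, lang} row per element of user.experience."""
    for record in records:
        user = record["user"]
        for item in user.get("experience", []):
            yield {"name": user["name"], "lang": item["lang"]}


rows = list(flatten_experience([RECORD]))
print(rows)
# [{'name': 'yamada', 'lang': 'Java'}, {'name': 'yamada', 'lang': 'Scala'}]
```

The single input record becomes two rows, one per entry in `experience`, which is the shape the outer SELECT in the query reads from `T2`.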
18. In the case of jq
$ cat users.json | jq '.user | select(.name == "yamada")'
Nested JSON in Drill brings complexity.
Maybe jq is better for simple queries?
22. e.g. Access Elasticsearch through Hive
● elasticsearch-hadoop supports Hive
● Drill supports Hive
http://takezoe.hatenablog.com/entry/20150524/p1
Can we access Elasticsearch from Drill?
24. Conclusion
Apache Drill is
● a good tool for querying various datasets
● easy to set up and user-friendly
● usable without upfront investment
● useful for small data, not only big data
Put Apache Drill into your toolbox!