© 2018 Arm Limited
Improve data engineering work with Digdag and Presto UDF
• Kentaro Yoshida
• 2018/10/17 at Plazma TD TechTalk 2018 Fall
About me
• @yoshi_ken
• Leads the DATA Team
• Supports data-driven work at TD
• Has published books on DWH platforms
Familiar Products
What is the DATA Team?
• Manages the internal data ETL & analysis platform on TreasureData
• For historical reasons, uses Luigi, Airflow (with Embulk), and Digdag
• Manages data visualization and reporting workflows for the business
• Not only for engineers but also for sales, marketing, and operations
• Distills simple insights from a complex ocean of data
• A kind of data science (analysis) solution
• A rare team that uses TreasureData internally on a daily basis
• We can give feedback from a user's point of view for new improvements
Technical Challenges of the DATA Team
• Build scalable & robust data pipelines
• e.g., one query generates numerous metrics logs from each component
• Improve fact data to support data-driven business/engineering
• e.g., enrich/pre-process data beforehand so it is easier to use
• Seek performance-tuning insights for Presto/Hive on the platform side
• e.g., the root cause of table fragmentation
• Move from daily jobs to semi-real-time data processing
• e.g., fresh, quickly available stats give good insights to engineers/support
Introducing nice improvements
for Presto UDF and Digdag
Nice improvements introduced in Digdag and Presto
• New features of Digdag
1. Added ${td.last_job.num_records}
• Holds the number of records in the job result
2. Added “_else_do” for the if> operator since digdag v0.9.31
3. Added param_set> and param_get>
• For sharing parameters between workflows (not available in TD workflow)
• New features of Presto
1. Added TD_TIME_STRING() UDF
• Makes it easier to format date strings in the SELECT clause
2. Added TD_INTERVAL() UDF
• Makes it easier to specify a time range in the WHERE clause
New Feature of Digdag
Situation of zero result error in workflow
• For some reason, the final result unexpectedly contains zero rows.
• You then need to investigate the number of result rows step by step.
• I wish Digdag could check the number of result rows at each step…
• I wish Digdag could output results together with the job_id…
Oops!
Situation of zero result error in workflow
• The newly added ${td.last_job.num_records} holds the number of records in the job result
$ cat num_records.dig
+query:
  td>:
    data: SELECT DISTINCT symbol FROM nasdaq
  database: sample_datasets

+fail_if_zero:
  if>: ${td.last_job.num_records < 1}
  _do:
    fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
Situation of zero result error in workflow
• “_else_do” is available for the if> operator since digdag v0.9.31
$ cat num_records.dig
+query:
  td>:
    data: SELECT DISTINCT symbol FROM nasdaq
  database: sample_datasets

+fail_if_zero:
  if>: ${td.last_job.num_records < 1}
  _do:
    fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
  _else_do:
    sh>: td export:result ${td.last_job_id} ${result_path} # enqueue job
    _export:
      result_path: td://@/workflow_logs/jobid_${td.last_job_id}
New Feature of Presto
TD_TIME_STRING() UDF
Efficient way to format date string in SELECT
• Writing date format conversions used to be a burden.
• This type of query generally uses a GROUP BY clause.
• So I used to register a custom dictionary entry under “td” in my IME.
Efficient way to format date string in SELECT
• TD_TIME_STRING() is an awesome UDF
• An easier way to truncate and format a timestamp

format string  format                 example
y              yyyy-MM-dd HH:mm:ssZ   2018-01-01 00:00:00+0700
q              yyyy-MM-dd HH:mm:ssZ   2018-04-01 00:00:00+0700
M              yyyy-MM-dd HH:mm:ssZ   2018-09-01 00:00:00+0700
w              yyyy-MM-dd HH:mm:ssZ   2018-09-09 00:00:00+0700
d              yyyy-MM-dd HH:mm:ssZ   2018-09-13 00:00:00+0700
h              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:00:00+0700
m              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:45:00+0700
s              yyyy-MM-dd HH:mm:ssZ   2018-09-13 16:45:34+0700
y!             yyyy                   2018
q!             yyyy-MM                2018-04
M!             yyyy-MM                2018-09
w!             yyyy-MM-dd             2018-09-09
d!             yyyy-MM-dd             2018-09-13
h!             yyyy-MM-dd HH          2018-09-13 16
m!             yyyy-MM-dd HH:mm       2018-09-13 16:45
s!             yyyy-MM-dd HH:mm:ss    2018-09-13 16:45:34

-- Before
TD_TIME_FORMAT(
  TD_DATE_TRUNC('day', time),
  'yyyy-MM-dd')
-- After
TD_TIME_STRING(time, 'd!') day,
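To make the "!" rows of the table concrete, here is a minimal Python sketch that maps each short code to an equivalent strftime pattern. It only imitates the output shapes listed in the table and is not how the UDF is implemented. The names `SHORT_FORMATS` and `time_string_sketch` are hypothetical, and 'q!' and 'w!' are left out because they would need extra quarter/week truncation logic.

```python
from datetime import datetime

# Hypothetical mapping from the "!" format codes in the table to strftime
# patterns. 'q!' and 'w!' are omitted: they need quarter/week truncation.
SHORT_FORMATS = {
    "y!": "%Y",
    "M!": "%Y-%m",
    "d!": "%Y-%m-%d",
    "h!": "%Y-%m-%d %H",
    "m!": "%Y-%m-%d %H:%M",
    "s!": "%Y-%m-%d %H:%M:%S",
}


def time_string_sketch(ts: datetime, code: str) -> str:
    """Format ts in the shape the table lists for a '!' code (illustration only)."""
    return ts.strftime(SHORT_FORMATS[code])


if __name__ == "__main__":
    ts = datetime(2018, 9, 13, 16, 45, 34)
    for code in SHORT_FORMATS:
        print(code, time_string_sketch(ts, code))
```

The point of the short codes is that truncation and formatting collapse into one argument, which is what replaces the nested TD_TIME_FORMAT(TD_DATE_TRUNC(...)) call.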
New Feature of Presto
TD_INTERVAL() UDF
Efficient way to specify range of date in WHERE
• Gathering a specific range used to require complicated techniques
-- cover 6 months of the data until today. 156=31*5+1
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
-- cover the beginning of day until now
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
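The 156-day trick in the first query can be worked through with plain date arithmetic. The Python sketch below mirrors the two TD_DATE_TRUNC calls for one scheduled time, the 2018-08-14 01:23:45 used in the later examples. The function name `six_month_range` is hypothetical, and the code only illustrates the arithmetic, not how Presto evaluates the UDFs.

```python
from datetime import datetime, timedelta


def six_month_range(scheduled: datetime):
    """Mirror TD_DATE_TRUNC('month', scheduled - 156d) and TD_DATE_TRUNC('day', scheduled)."""
    shifted = scheduled - timedelta(days=156)  # 156 = 31 * 5 + 1
    # Truncate the shifted time to the first day of its month.
    start = shifted.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Truncate the scheduled time to the start of its day.
    end = scheduled.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, end


if __name__ == "__main__":
    start, end = six_month_range(datetime(2018, 8, 14, 1, 23, 45))
    print(start, end)
```

For this scheduled time the shifted date is 2018-03-11, so the range starts at 2018-03-01 and ends at 2018-08-14 00:00:00.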
Efficient way to specify range of date in WHERE
• The TD_INTERVAL() UDF makes this easier
-- BEFORE
-- cover 6 months of the data until today. 156=31*5+1
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
)
-- AFTER
-- it can be specified with a short UDF
TD_INTERVAL(time, '-6M/0d')
Efficient way to specify range of date in WHERE
• The TD_INTERVAL() UDF makes this easier
-- BEFORE
-- cover the beginning of day until now
TD_TIME_RANGE(time,
  TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()), TD_SCHEDULED_TIME()
)
-- AFTER
-- it can be specified with a short UDF
TD_INTERVAL(time, '-1d')
Efficient way to specify range of date in WHERE
-- Example: the query start time is 2018-08-14 01:23:45 (Tue, UTC)
-- The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h')
-- From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45)
SELECT ... WHERE TD_INTERVAL(time, '-1h/now')
-- The last hour before the beginning of today [2018-08-13 23:00:00, 2018-08-14 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-1h/0d')
• The value after the slash specifies the boundary at which the range ends.
Efficient way to specify range of date in WHERE
-- Example: the query start time is 2018-08-14 01:23:45 (Tue, UTC)
-- The last 7 days before 2015-12-25 [2015-12-18 00:00:00, 2015-12-25 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25')
-- The last 10 days before the beginning of the last month [2018-06-21 00:00:00, 2018-07-01 00:00:00)
SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M')
• The value after the slash specifies the boundary at which the range ends.
• Handy tip: ${session_date} also works there when using Digdag.
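The ranges listed in the examples above can be reproduced with a small Python sketch. It covers only hour and day durations with no offset, or with `now`, `0d`, or a `YYYY-MM-DD` date as the offset; month durations and offsets such as `-1M` are not handled. The names `interval_sketch`, `trunc`, and `UNITS` are hypothetical, and the rules are inferred from the listed ranges only, so this is an illustration and not the UDF's actual implementation.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table: only hours and days are supported in this sketch.
UNITS = {"h": timedelta(hours=1), "d": timedelta(days=1)}


def trunc(ts: datetime, unit: str) -> datetime:
    """Truncate ts to the start of its hour ('h') or day ('d')."""
    if unit == "h":
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def interval_sketch(spec: str, now: datetime):
    """Return (start, end) for specs like '-1h', '-1h/now', '-1h/0d', '-7d/2015-12-25'."""
    duration, _, offset = spec.partition("/")
    match = re.fullmatch(r"-(\d+)([hd])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    count, unit = int(match.group(1)), match.group(2)

    # Pick the reference time named after the slash (default: query start time).
    if offset in ("", "now"):
        ref = now
    elif offset == "0d":
        ref = trunc(now, "d")
    else:
        ref = datetime.strptime(offset, "%Y-%m-%d")

    base = trunc(ref, unit)             # align to the unit boundary
    start = base - count * UNITS[unit]  # count units back from the boundary
    end = now if offset == "now" else base
    return start, end


if __name__ == "__main__":
    now = datetime(2018, 8, 14, 1, 23, 45)
    for spec in ("-1h", "-1h/now", "-1h/0d", "-7d/2015-12-25"):
        print(spec, interval_sketch(spec, now))
```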
Tips about handling time range
-- testing with a time_series table like this is recommended
CREATE TABLE time_series AS
SELECT
  time,
  TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date
FROM (
  SELECT times
  FROM (
    VALUES
      SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60)
  ) AS x (times)
) t1
CROSS JOIN UNNEST(times) AS t (time)
ORDER BY time
https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
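To see what the time_series table should contain, the same hourly series can be generated locally. This is a minimal Python sketch with a hypothetical function name, `hourly_series`. It assumes the two dates parse to midnight UTC and that the sequence includes its end value.

```python
from datetime import datetime, timezone


def hourly_series(start_day: str, end_day: str):
    """Build (epoch_seconds, formatted_date) rows, one per hour, end inclusive."""

    def to_epoch(day: str) -> int:
        parsed = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(parsed.timestamp())

    start, end = to_epoch(start_day), to_epoch(end_day)
    rows = []
    for t in range(start, end + 1, 60 * 60):
        stamp = datetime.fromtimestamp(t, tz=timezone.utc)
        # '%z' renders the UTC offset as +0000, matching the 'Z' pattern above.
        rows.append((t, stamp.strftime("%Y-%m-%d %H:%M:%S%z")))
    return rows


if __name__ == "__main__":
    rows = hourly_series("2018-01-01", "2018-12-31")
    print(len(rows), rows[0], rows[-1])
```

With these inputs the series has 364 * 24 + 1 = 8737 rows, one per hour from 2018-01-01 00:00:00 to 2018-12-31 00:00:00 UTC.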
Let’s enjoy data engineering work with digdag!
And also feel free to talk to me
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
תודה

Más contenido relacionado

La actualidad más candente

Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKVPresentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKVKevin Xu
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @ShanghaiLuke Han
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...InfluxData
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analyticsDataWorks Summit
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020Taro L. Saito
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Sid Anand
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...DataWorks Summit/Hadoop Summit
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginInfluxData
 
Distributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache PulsarDistributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache PulsarStreamlio
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateCloudera, Inc.
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 

La actualidad más candente (20)

Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKVPresentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
Presentation at SF Kubernetes Meetup (10/30/18), Introducing TiDB/TiKV
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 
ElastiCache and Redis
ElastiCache and RedisElastiCache and Redis
ElastiCache and Redis
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analytics
 
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
 
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
Leveraging smart meter data for electric utilities: Comparison of Spark SQL w...
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginFinding OOMS in Legacy Systems with the Syslog Telegraf Plugin
Finding OOMS in Legacy Systems with the Syslog Telegraf Plugin
 
Distributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache PulsarDistributed Crypto-Currency Trading with Apache Pulsar
Distributed Crypto-Currency Trading with Apache Pulsar
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 

Similar a Improve data engineering work with Digdag and Presto UDF

Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...InfluxData
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxData
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcaseKeisuke Suzuki
 
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxDataOptimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxDataInfluxData
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKInfluxData
 
Hash join use memory optimization
Hash join use memory optimizationHash join use memory optimization
Hash join use memory optimizationICTeam S.p.A.
 
Optimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real WorldOptimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real WorldDevOps.com
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Toru Takahashi
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesAltinity Ltd
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Taro L. Saito
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...ScyllaDB
 
Performance tuning ColumnStore
Performance tuning ColumnStorePerformance tuning ColumnStore
Performance tuning ColumnStoreMariaDB plc
 
Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0oysteing
 
Using TICK Stack For System and App Metrics
Using TICK Stack For System and App MetricsUsing TICK Stack For System and App Metrics
Using TICK Stack For System and App MetricsAayush Tuladhar
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKInfluxData
 
Workload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud MarketplacesWorkload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud MarketplacesGravitant, Inc.
 
Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Dieter Plaetinck
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchRim Moussa
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 

Similar a Improve data engineering work with Digdag and Presto UDF (20)

Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...Sam Dillard [InfluxData] | Performance Optimization in InfluxDB  | InfluxDays...
Sam Dillard [InfluxData] | Performance Optimization in InfluxDB | InfluxDays...
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
 
201809 DB tech showcase
201809 DB tech showcase201809 DB tech showcase
201809 DB tech showcase
 
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxDataOptimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
Optimizing InfluxDB Performance in the Real World | Sam Dillard | InfluxData
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Hash join use memory optimization
Hash join use memory optimizationHash join use memory optimization
Hash join use memory optimization
 
Optimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real WorldOptimizing Time Series Performance in the Real World
Optimizing Time Series Performance in the Real World
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
 
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME...
 
Performance tuning ColumnStore
Performance tuning ColumnStorePerformance tuning ColumnStore
Performance tuning ColumnStore
 
Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0Histogram Support in MySQL 8.0
Histogram Support in MySQL 8.0
 
Using TICK Stack For System and App Metrics
Using TICK Stack For System and App MetricsUsing TICK Stack For System and App Metrics
Using TICK Stack For System and App Metrics
 
Lecture 2a
Lecture 2aLecture 2a
Lecture 2a
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Workload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud MarketplacesWorkload Partitioning in Cloud Marketplaces
Workload Partitioning in Cloud Marketplaces
 
Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014Metrics 2.0 @ Monitorama PDX 2014
Metrics 2.0 @ Monitorama PDX 2014
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 

Más de Kentaro Yoshida

TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方Kentaro Yoshida
 
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方Kentaro Yoshida
 
トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編Kentaro Yoshida
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスKentaro Yoshida
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話Kentaro Yoshida
 
Fluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターンFluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターンKentaro Yoshida
 
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"Kentaro Yoshida
 
MySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearchMySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearchKentaro Yoshida
 
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasualFluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasualKentaro Yoshida
 
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記Kentaro Yoshida
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウKentaro Yoshida
 
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記Kentaro Yoshida
 
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」Kentaro Yoshida
 

Más de Kentaro Yoshida (13)

TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
TREASUREDATAのエコシステムで作るロバストなETLデータ処理基盤の作り方
 
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
Fluentd, Digdag, Embulkを用いたデータ分析基盤の始め方
 
トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編トレジャーデータ 導入体験記 リブセンス編
トレジャーデータ 導入体験記 リブセンス編
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
Fluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターンFluentdのお勧めシステム構成パターン
Fluentdのお勧めシステム構成パターン
 
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
MySQLと組み合わせて始める全文検索プロダクト"elasticsearch"
 
MySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearchMySQLユーザ視点での小さく始めるElasticsearch
MySQLユーザ視点での小さく始めるElasticsearch
 
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasualFluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
Fluentdベースのミドルウェア"Yamabiko"でMySQLのテーブルをElasticsearchへレプリケートする話 #fluentdcasual
 
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
MySQL 5.6への完全移行を実現したTritonnからMroongaへの移行体験記
 
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
ElasticSearch+Kibanaでログデータの検索と視覚化を実現するテクニックと運用ノウハウ
 
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
Tritonn (MySQL5.0.87+Senna)からの mroonga (MySQL5.6) 移行体験記
 
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
MySQL Casual Talks Vol.4 「MySQL-5.6で始める全文検索 〜InnoDB FTS編〜」
 

Último

Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Double rodded leveling 1 pdf activity 01


Improve data engineering work with Digdag and Presto UDF

  • 1. © 2018 Arm Limited • Kentaro Yoshida Improve data engineering work with Digdag and Presto UDF • 2018/10/17 at Plazma TD TechTalk 2018 Fall
  • 2. About me • @yoshi_ken • Leads the DATA Team • Supports data-driven work at TD • Has published books on DWH platforms
  • 3. What is the DATA Team? • Manages the internal data ETL and analysis platform on TreasureData • For historical reasons, uses Luigi, Airflow (with Embulk), and Digdag • Manages data visualization and reporting workflows for the business • Serves not only engineers but also sales, marketing, and operations • Distills simple insights from a complex ocean of data • A kind of data science (analysis) solution • One of the few teams that uses TreasureData internally on a daily basis • Can give feedback on new improvements from a user's point of view
  • 4. Technical challenges of the DATA Team • Build a scalable and robust data pipeline • ex) one query generates numerous metrics logs from each component • Improve fact data to support data-driven business and engineering • ex) enrich and pre-process data beforehand so it is easier to use • Seek performance tuning insights for Presto/Hive on the platform side • ex) find the root cause of table fragmentation • Move from daily jobs to semi-realtime data processing • ex) fresh, quickly available stats give engineers and support good insight
  • 5. Introducing nice improvements for Presto UDF and Digdag
  • 6. Nice improvements introduced in Digdag and Presto
      • New features of Digdag
          1. Added ${td.last_job.num_records}
              • Holds the number of records in the job result
          2. Added “_else_do” for the if> operator (since digdag v0.9.31)
          3. Added param_set> and param_get>
              • For sharing parameters between workflows (not available in TD Workflow)
      • New features of Presto
          1. Added the TD_TIME_STRING() UDF
              • Makes it easier to format date strings in the SELECT clause
          2. Added the TD_INTERVAL() UDF
              • Makes it easier to specify a time range in the WHERE clause
  • 7. New Feature of Digdag
  • 8. Situation of zero result error in workflow • For some reason, the final result of a workflow is sometimes unexpectedly empty. • Finding the cause requires checking the number of result rows step by step. • I wish Digdag could check the number of result rows at each step... • I wish Digdag could output results together with the job_id... Oops!
  • 9. Situation of zero result error in workflow
      • The newly introduced ${td.last_job.num_records} holds the number of records in the job result.

      $ cat num_records.dig
      +query:
        td>:
          data: SELECT DISTINCT symbol FROM nasdaq
        database: sample_datasets

      +fail_if_zero:
        if>: ${td.last_job.num_records < 1}
        _do:
          fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
  • 10. Situation of zero result error in workflow
      • Since digdag v0.9.31, “_else_do” can follow the if> operator.

      $ cat num_records.dig
      +query:
        td>:
          data: SELECT DISTINCT symbol FROM nasdaq
        database: sample_datasets

      +fail_if_zero:
        if>: ${td.last_job.num_records < 1}
        _do:
          fail>: job_id:${td.last_job.id} results ${td.last_job.num_records} rows.
        _else_do:
          sh>: td export:result ${td.last_job_id} ${result_path} # enqueue job
          _export:
            result_path: td://@/workflow_logs/jobid_${td.last_job_id}
  • 11. New Feature of Presto: TD_TIME_STRING() UDF
  • 12. Efficient way to format date string in SELECT • Writing date format conversions used to be a burden. • This kind of query is generally used together with a GROUP BY clause. • So I used to register a preset under “td” in my IME's custom dictionary.
  • 13. Efficient way to format date string in SELECT
      • TD_TIME_STRING() is an awesome UDF
      • An easier way to truncate a timestamp and format it

      format string   format                  example
      y               yyyy-MM-dd HH:mm:ssZ    2018-01-01 00:00:00+0700
      q               yyyy-MM-dd HH:mm:ssZ    2018-04-01 00:00:00+0700
      M               yyyy-MM-dd HH:mm:ssZ    2018-09-01 00:00:00+0700
      w               yyyy-MM-dd HH:mm:ssZ    2018-09-09 00:00:00+0700
      d               yyyy-MM-dd HH:mm:ssZ    2018-09-13 00:00:00+0700
      h               yyyy-MM-dd HH:mm:ssZ    2018-09-13 16:00:00+0700
      m               yyyy-MM-dd HH:mm:ssZ    2018-09-13 16:45:00+0700
      s               yyyy-MM-dd HH:mm:ssZ    2018-09-13 16:45:34+0700
      y!              yyyy                    2018
      q!              yyyy-MM                 2018-04
      M!              yyyy-MM                 2018-09
      w!              yyyy-MM-dd              2018-09-09
      d!              yyyy-MM-dd              2018-09-13
      h!              yyyy-MM-dd HH           2018-09-13 16
      m!              yyyy-MM-dd HH:mm        2018-09-13 16:45
      s!              yyyy-MM-dd HH:mm:ss     2018-09-13 16:45:34

      -- Before
      TD_TIME_FORMAT(TD_DATE_TRUNC('day', time), 'yyyy-MM-dd')
      -- After
      TD_TIME_STRING(time, 'd!') day,
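To make the truncate-then-format behaviour in the table concrete, here is a minimal Python sketch that imitates a subset of the format codes (y, M, d, h, m, s and their "!" short forms). It is an illustration only, not the UDF's implementation: the function name `time_string` and its lookup tables are hypothetical, and the q and w codes are left out because the slide does not pin down where quarters and weeks begin.

```python
from datetime import datetime, timedelta, timezone

# Fields reset to their minimum when truncating to each unit.
_RESET = {
    "y": dict(month=1, day=1, hour=0, minute=0, second=0, microsecond=0),
    "M": dict(day=1, hour=0, minute=0, second=0, microsecond=0),
    "d": dict(hour=0, minute=0, second=0, microsecond=0),
    "h": dict(minute=0, second=0, microsecond=0),
    "m": dict(second=0, microsecond=0),
    "s": dict(microsecond=0),
}

# strftime equivalents of the short ("!") patterns listed on the slide.
_SHORT = {
    "y": "%Y",
    "M": "%Y-%m",
    "d": "%Y-%m-%d",
    "h": "%Y-%m-%d %H",
    "m": "%Y-%m-%d %H:%M",
    "s": "%Y-%m-%d %H:%M:%S",
}

# strftime equivalent of yyyy-MM-dd HH:mm:ssZ.
_FULL = "%Y-%m-%d %H:%M:%S%z"


def time_string(ts, code):
    """Truncate the aware datetime `ts` to the unit in `code`, then format it.

    Hypothetical helper: supports only y, M, d, h, m, s with an optional "!".
    """
    unit = code[:1]
    if unit not in _RESET or code not in (unit, unit + "!"):
        raise ValueError("unsupported format code: %r" % code)
    truncated = ts.replace(**_RESET[unit])
    pattern = _SHORT[unit] if code.endswith("!") else _FULL
    return truncated.strftime(pattern)


if __name__ == "__main__":
    # Same instant as the slide's examples: 2018-09-13 16:45:34 at UTC+7.
    ts = datetime(2018, 9, 13, 16, 45, 34, tzinfo=timezone(timedelta(hours=7)))
    for code in ("y!", "M", "d!", "h!", "s"):
        print(code, "->", time_string(ts, code))
```

The "!" suffix only changes the output pattern; the truncation step is the same for both forms, which is why `d` and `d!` agree on the date part.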
  • 14. New Feature of Presto: TD_INTERVAL() UDF
  • 15. Efficient way to specify range of date in WHERE
      • Extracting a specific time range used to require many complicated techniques.

      -- cover 6 months of the data until today. 156=31*5+1
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
      )

      -- cover the beginning of day until now
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()),
        TD_SCHEDULED_TIME()
      )
  • 16. Efficient way to specify range of date in WHERE
      • The TD_INTERVAL() UDF makes this easier.

      -- BEFORE
      -- cover 6 months of the data until today. 156=31*5+1
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('month', TD_TIME_ADD(TD_SCHEDULED_TIME(), '-156d')),
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME())
      )

      -- AFTER
      -- this can be written with one short UDF call
      TD_INTERVAL(time, '-6M/0d')
  • 17. Efficient way to specify range of date in WHERE
      • The TD_INTERVAL() UDF makes this easier.

      -- BEFORE
      -- cover the beginning of day until now
      TD_TIME_RANGE(time,
        TD_DATE_TRUNC('day', TD_SCHEDULED_TIME()),
        TD_SCHEDULED_TIME()
      )

      -- AFTER
      -- this can be written with one short UDF call
      TD_INTERVAL(time, '-1d')
  • 18. Efficient way to specify range of date in WHERE
  • 19. Efficient way to specify range of date in WHERE

      -- Example: the query start time is 2018-08-14 01:23:45 (Tue, UTC)

      -- The last hour [2018-08-14 00:00:00, 2018-08-14 01:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-1h')

      -- From the last hour to now [2018-08-14 00:00:00, 2018-08-14 01:23:45)
      SELECT ... WHERE TD_INTERVAL(time, '-1h/now')

      -- The last hour since the beginning of today [2018-08-13 23:00:00, 2018-08-14 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-1h/0d')

      • The value after the slash specifies the boundary that the range is measured from.
  • 20. Efficient way to specify range of date in WHERE

      -- Example: the query start time is 2018-08-14 01:23:45 (Tue, UTC)

      -- The last 7 days since 2015-12-25 [2015-12-18 00:00:00, 2015-12-25 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-7d/2015-12-25')

      -- The last 10 days since the beginning of the last month [2018-06-21 00:00:00, 2018-07-01 00:00:00)
      SELECT ... WHERE TD_INTERVAL(time, '-10d/-1M')

      • The value after the slash specifies the boundary that the range is measured from.
      • Conveniently, ${session_date} also works there when using Digdag.
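The half-open ranges in the TD_INTERVAL examples on slides 19 and 20 can be reproduced with a small Python sketch. It is derived only from those examples and is not the UDF's implementation: the function name `interval_range` is hypothetical, and it handles just negative hour and day durations with an empty, 'now', '0d', or yyyy-MM-dd offset. Anything else, such as '-10d/-1M', is rejected rather than guessed at.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": timedelta(hours=1), "d": timedelta(days=1)}


def _floor(ts, unit):
    """Truncate `ts` to the start of its hour ("h") or day ("d")."""
    if unit == "h":
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def interval_range(spec, now):
    """Return (start, end) of the half-open range [start, end) for `spec`.

    Hypothetical helper covering only the patterns shown on the slides.
    """
    duration, _, offset = spec.partition("/")
    match = re.fullmatch(r"-(\d+)([hd])", duration)
    if match is None:
        raise ValueError("unsupported duration: %r" % duration)
    count, unit = int(match.group(1)), match.group(2)
    span = count * _UNITS[unit]

    if offset == "":
        # No offset: the range ends at the start of the current unit.
        end = _floor(now, unit)
        return end - span, end
    if offset == "now":
        # Same start as above, but the range runs up to the query start time.
        return _floor(now, unit) - span, now
    if offset == "0d":
        # The range ends at the beginning of today.
        end = _floor(now, "d")
        return end - span, end
    # Otherwise expect an absolute date; strptime raises ValueError if not.
    end = datetime.strptime(offset, "%Y-%m-%d").replace(tzinfo=now.tzinfo)
    return end - span, end


if __name__ == "__main__":
    now = datetime(2018, 8, 14, 1, 23, 45, tzinfo=timezone.utc)
    for spec in ("-1h", "-1h/now", "-1h/0d", "-7d/2015-12-25"):
        start, end = interval_range(spec, now)
        print("%-16s [%s, %s)" % (spec, start, end))
```

Printing the computed bounds next to a query's WHERE clause is a cheap way to confirm that an interval string covers the range you expect before running it against real data.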
  • 21. Tips about handling time range

      -- recommend to test with such a time_series table
      CREATE TABLE time_series AS
      SELECT
        time,
        TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ssZ', 'UTC') AS date
      FROM (
        SELECT times
        FROM (
          VALUES SEQUENCE(TD_TIME_PARSE('2018-01-01', 'UTC'), TD_TIME_PARSE('2018-12-31', 'UTC'), 60*60)
        ) AS x (times)
      ) t1
      CROSS JOIN UNNEST(times) AS t (time)
      ORDER BY time

      https://qiita.com/reflet/items/151a10e9a0914e0ec3ee
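For checking range boundaries without a database, the same hourly series as the time_series table on slide 21 can be generated locally. This is a rough Python equivalent rather than a translation of the UDFs: the function name `hourly_series` is hypothetical, and it assumes the end of the sequence is inclusive, as it is for Presto's SEQUENCE.

```python
from datetime import datetime, timezone


def _epoch(date_text):
    """Parse a yyyy-MM-dd date as midnight UTC and return Unix seconds."""
    parsed = datetime.strptime(date_text, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def hourly_series(start_date, end_date, step_seconds=60 * 60):
    """Return (time, date) rows from start to end inclusive, like the SQL above."""
    rows = []
    for epoch in range(_epoch(start_date), _epoch(end_date) + 1, step_seconds):
        label = datetime.fromtimestamp(epoch, tz=timezone.utc)
        rows.append((epoch, label.strftime("%Y-%m-%d %H:%M:%S%z")))
    return rows


if __name__ == "__main__":
    rows = hourly_series("2018-01-01", "2018-12-31")
    print(len(rows), "rows")
    print("first:", rows[0])
    print("last: ", rows[-1])
```

Each row pairs the Unix timestamp with its formatted UTC string, so a filter's lower and upper bounds can be read straight off the first and last matching rows.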
  • 22. Let’s enjoy data engineering work with Digdag! And feel free to talk to me.