SlideShare una empresa de Scribd logo
1 de 43
Descargar para leer sin conexión
Internals of Presto Service
Taro L. Saito, Treasure Data
leo@treasure-data.com
March 11-12th, 2015
Treasure Data Tech Talk #1 at Tokyo
Taro L. Saito @taroleo
•  2007 University of Tokyo. Ph.D.
–  XML DBMS, Transaction Processing
•  Relational-Style XML Query [SIGMOD 2008]
•  ~ 2014 Assistant Professor at University of Tokyo
–  Genome Science Research
•  Distributed Computing, Personal Genome Analysis
•  March 2014 ~ Treasure Data
–  Software Engineer, MPP Team Leader
•  Open source projects at GitHub
–  snappy-java, msgpack-java, sqlite-jdbc
–  sbt-pack, sbt-sonatype, larray
–  silk
•  Distributed workflow engine
2
Hive
TD API /
Web Console
batch query
Presto
Treasure Data
PlazmaDB:
MessagePack Columnar Storage
td-presto connector
Interactive query
What is Presto?
•  A distributed SQL Engine developed by Facebook
–  For interactive analysis on peta-scale dataset
•  As a replacement of Hive
–  Nov. 2013: Open sourced at GitHub
•  Presto
–  Written in Java
–  In-memory query layer
–  CPU efficient for ad-hoc analysis
–  Based on ANSI SQL
–  Isolation of query layer and storage access layer
•  A connector provides data access (reading schema and records)
4
Presto: Distributed SQL Engine
5
TD Presto has its own
query retry mechanism
Tailored to throughput CPU-intensive. Faster response time
Fault
Tolerant
Treasure Data: Presto as a Service
6
Presto Public
Release
Topics
•  Challenges in providing Database as a Service
•  TD Presto Connector
–  Optimizing Scan Performance
–  Multi-tenancy Cluster Management
•  Resource allocation
•  Monitoring
•  Query Tuning
7
buffer
Optimizing Scan Performance
•  Fully utilize the network bandwidth from S3
•  TD Presto becomes CPU bottleneck
TableScanOperator	
•  s3 file list
•  table schema
header
request
S3 / RiakCS	
•  release(Buffer)
Buffer size limit
Reuse allocated buffers
Request Queue	
•  priority queue
•  max connections limit
Header	
Column Block 0
(column names)	
Column Block 1	
Column Block i	
Column Block m	
MPC1 file
HeaderReader	
•  callback to HeaderParser
ColumnBlockReader	
header
HeaderParser	
•  parse MPC file header
• column block offsets
• column names
column block request
Column block requests
column block
prepare
MessageUnpacker	
buffer
MessageUnpacker	
MessageUnpacker	
S3 read	
S3 read	
pull records
Retry GET request on
- 500 (internal error)
- 503 (slow down)
- 404 (not found)
- eventual consistency
S3 read	
•  decompression
•  msgpack-java v07
S3 read	
S3 read	
S3 read
MessageBuffer
•  msgpack-java v06 was the bottleneck
–  Inefficient buffer access
•  v07
•  Fast memory access
•  sun.misc.Unsafe
•  Direct access to heap memory
•  extract primitive type value from byte[]
•  cast
•  No boxing
9
Unsafe memory access performance is comparable to C
•  http://frsyuki.hatenablog.com/entry/2014/03/12/155231
10
Why ByteBuffer is slow?
•  Following a good programming manner
–  Define interface, then implement classes
•  ByteBuffer interface has HeapByteBuffer and DirectByteBuffer
implementations
•  In reality: TypeProfile slows down method access
–  JVM generates look-up table of method implementations
–  Simply importing one or more classes generates TypeProfile
•  v07 avoid TypeProfile generation
–  Load an implementation class through Reflection
11
Format Type Detection
•  MessageUnpacker
–  read prefix: 1 byte
–  detect format type
•  switch-case
–  ANTLR generates this
type of codes
12
Format Type Detection
•  Using cache-efficient lookup table: 20000x faster
13
2x performance improvement in v07
14
Database As A Service
15
Claremont Report on Database Research
•  Discussion on future of DBMS
–  Top researchers, vendors and
practitioners.
–  CACM, Vol. 52 No. 6, 2009
•  Predicts emergence of Cloud Data
Service
–  SQL has an important role
•  limited functionality
•  suited for service provider
–  A difficult example: Spark 
•  Need a secure application container
to run arbitrary Scala code.
16
Beckman Report on Database Research
•  2013
–  http://beckman.cs.wisc.edu/beckman-report2013.pdf
–  Topics of Big-Data
•  End-to-end service
–  From data collection to knowledge
•  Cloud Service has become popular
–  IaaS, PaaS, SaaS
–  Challenge is to migrate all of the functionalities of DBMS into Cloud
17
Results Push
Results Push
SQL
Big Data Simplified: The Treasure Data Approach
AppServers
Multi-structured Events!
•  register!
•  login!
•  start_event!
•  purchase!
•  etc!
SQL-based
Ad-hoc Queries
SQL-based Dashboards
DBs & Data Marts
Other Apps
Familiar &
Table-oriented
Infinite & Economical
Cloud Data Store
ü  App log data!
ü  Mobile event data!
ü  Sensor data!
ü  Telemetry!
Mobile SDKs
Web SDK
Multi-structured Events
Multi-structured Events
Treasure Agent
Treasure Agent
Treasure Agent
Treasure Agent Treasure Agent
Treasure Agent
Treasure Agent
Treasure Agent
Embedded SDKs
Server-side Agents
18
Challenges in Database as a Service
•  Tradeoffs
–  Cost and service level objectives (SLOs)
•  Reference
–  Workload Management for Big Data Analytics. A. Aboulnaga
[SIGMOD2013 Tutorial]
19
Run each query set
on an independent
cluster
Run all queries
together on the
smallest possible
cluster
Fast
$$$
Limited performance guarantee
Reasonable price
Shift of Presto Query Usage
•  Initial phase
–  Try and error of queries
•  Many syntax errors, semantic errors
•  Next phase
–  Scheduled query execution
•  Increased Presto query usage
–  Some customers submit more than 1,000 Presto queries / day
–  Establishing typical query patterns
•  hourly, daily reports
•  query templates
•  Advanced phase: More elaborate data analysis
–  Complex queries
•  via data scientists and data analysts
–  High resource usage
20
Usage Shift: Simple to Complex queries
21
Monitoring Presto Usage with Fluentd
22
Hive
Presto
DataDog
•  Monitoring CPU, memory and network usage
•  Query stats
23
Query Collection in TD
•  SQL query logs
–  query, detailed query plan, elapsed time, processed rows, etc.
•  Presto is used for analyzing the query history
24
Daily/Hourly Query Usage
25
Query Running Time
•  More than 90% of queries finishes within 2 min.
expected response time for interactive queries
26
Processed Rows of Queries
27
Performance
•  Processed rows / sec. of a query
28
Collecting Recoverable Error Patterns
•  Presto has no fault tolerance
•  Error types
–  User error
•  Syntax errors
–  SQL syntax, missing function
•  Semantic errors
–  missing tables/columns
–  Insufficient resource
•  Exceeded task memory size
–  Internal failure
•  I/O error
–  S3/Riak CS
•  worker failure
•  etc.
29
TD Presto retries
these queries
Query Retry on Internal Errors
•  More than 99.8% of queries finishes without errors
30
Query Retry on Internal Errors (log scale)
•  Queries succeed eventually
31
Multi-tenancy: Resource Allocation
•  Price-plan based resource allocation
•  Parameters
–  The number of worker nodes to use (min-candidates)
–  The number of hash partitions (initial-hash-partitions)
–  The maximum number of running tasks per account
•  If running queries exceeds allowed number of tasks, the next queries need
to wait (queued)
•  Presto: SqlQueryExecution class
–  Controls query execution state: planning -> running -> finished
•  No resource allocation policy
–  Extended TDSqlQueryExection class monitors running tasks and limits
resource usage
•  Rewriting SqlQueryExecutionFactory at run-time by using ASM library
32
Query Queue
•  Presto 0.97
–  Introduces user-wise query queues
•  Can limit the number of concurrent queries per user
•  Problem
–  Running too many queries delays overall query
performance
33
Customer Feedback
•  A feedback:
–  We don’t care if large queries take long time
–  But interactive queries should run immediately
•  Challenges
–  How do we allocate resources even if preceding queries
occupies customer share of resources?
–  How do we know a submitted query is interactive one?
34
Admission control is necessary
•  Adjust resource utilization
–  Running Drivers (Splits)
–  MPL (Multi-Programming Level)
35
Challenge: Auto Scaling
•  Setting the cluster size based on the peak usage is expensive
•  But predicting customer usage is difficult
36
Typical Query Patterns [Li Juang]
•  Q: What are typical queries of a customer?
–  Customer feels some queries are slow
–  But we don’t know what to compare with, except scheduled queries
•  Approach: Clustering Customer SQLs
•  TF/IDF measure: TF x IDF vector
–  Split SQL statements into tokens
–  Term frequency (TF) = the number of each term in a query
–  Inverse document frequency (IDF) = log (# of queries / # of queries that
have a token)
•  k-means clustering
–  TF/IDF vector
–  Generates clusters of similar queries
•  x-means clustering for deciding number of clusters automatically
–  D. Pelleg [ICML2000]
37
Problematic Queries
•  90% of queries finishes within 2 min.
–  But remaining 10% is still large
•  10% of 10,000 queries is 1,000.
•  Long-running queries
•  Hog queries
38
Long Running Queries
•  Typical bottlenecks
–  Cross joins
–  IN (a, b, c, …)
•  semi-join filtering process is slow
–  Complex scan condition
•  pushing down selection
•  but delays column scan
–  Tuple materialization
•  coordinator generates json data
–  Many aggregation columns
•  group by 1, 2, 3, 4, 5, 6, …
–  Full scan
•  Scanning 100 billion rows…
•  Adding more resources does not always make query faster
•  Storing intermediate data to disks is necessary
39
Result are
buffered
(waiting fetch)
slow process
fast
fast
Hog Query
•  Queries consuming a lot of CPU/memory resources
–  Coined in S. Krompass et al. [EDBT2009]
•  Example:
–  select 1 as day, count(…) from … where time <= current_date - interval 1 day
union all
select 2 as day, count(…) from … where time <= current_date - interval 2 day
union all
–  …
–  (up to 190 days)
•  More than 1000 query stages.
•  Presto tries to run all of the stages at once.
–  High CPU usage at coordinator
40
•  Query rewriting (better)
–  With group by and window functions
–  Not a perfect solution
•  Need to understand the meaning of the query
•  Semantic change is not allowed
–  e.g., We cannot rewrite UNION to UNION ALL
–  UNION includes duplicate elimination
•  Workaround Idea
–  Bushy plan -> Deep plan
–  Introduce stage-wise resource assignment
Query Rewriting? Plan Optimization?
41
Future Work
•  Reducing Queuing/Response Time
–  Introducing shared queue between customers
•  For utilizing remaining cluster resources
–  Fair-Scheduling: C. Gupata [EDBT2009]
–  Self-tuning DBMS. S. Chaudhuri [VLDB2007]
•  Adjusting Running Query Size (hard)
–  Limiting driver resources as small as possible for hog queries
–  Query plan based cost estimation
•  Predicting Query Running Time
–  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011]
42
Summary: Treasures in Treasure Data
•  Treasures for our customers
–  Data collected by fluentd (td-agent)
–  Query analysis platform
–  Query results - values
•  For Treasure Data
–  SQL query logs
•  Stored in treasure data
–  We know how customers use SQL
•  Typical queries and failures
–  We know which part of query can be improved
43

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Trino at linkedIn - 2021
Trino at linkedIn - 2021Trino at linkedIn - 2021
Trino at linkedIn - 2021
 

Destacado

Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
Taro L. Saito
 
トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方
Takahiro Inoue
 

Destacado (20)

Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -Plazma - Treasure Data’s distributed analytical database -
Plazma - Treasure Data’s distributed analytical database -
 
Diary of Support Engineer
Diary of Support EngineerDiary of Support Engineer
Diary of Support Engineer
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Treasure Data and Fluentd
Treasure Data and FluentdTreasure Data and Fluentd
Treasure Data and Fluentd
 
HDP2 and YARN operations point
HDP2 and YARN operations pointHDP2 and YARN operations point
HDP2 and YARN operations point
 
hotdog a TD tool for DD
hotdog a TD tool for DDhotdog a TD tool for DD
hotdog a TD tool for DD
 
Treasure Data Mobile SDK
Treasure Data Mobile SDKTreasure Data Mobile SDK
Treasure Data Mobile SDK
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_case
 
トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方トレジャーデータ流,データ分析の始め方
トレジャーデータ流,データ分析の始め方
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
pagecache-memo
pagecache-memopagecache-memo
pagecache-memo
 
Pentaho CTools 20140902
Pentaho CTools 20140902Pentaho CTools 20140902
Pentaho CTools 20140902
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Building Physical in a Virtual World
Building Physical in a Virtual WorldBuilding Physical in a Virtual World
Building Physical in a Virtual World
 

Similar a Internals of Presto Service

Similar a Internals of Presto Service (20)

Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage Systems
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Introduction to .NET Performance Measurement
Introduction to .NET Performance MeasurementIntroduction to .NET Performance Measurement
Introduction to .NET Performance Measurement
 
ElasticSearch as (only) datastore
ElasticSearch as (only) datastoreElasticSearch as (only) datastore
ElasticSearch as (only) datastore
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Apache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdbApache con 2020 use cases and optimizations of iotdb
Apache con 2020 use cases and optimizations of iotdb
 
Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016Sumo Logic QuickStart Webinar - Jan 2016
Sumo Logic QuickStart Webinar - Jan 2016
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
The Autobahn Has No Speed Limit - Your XPages Shouldn't Either!
 

Más de Treasure Data, Inc.

Más de Treasure Data, Inc. (20)

GDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for MarketersGDPR: A Practical Guide for Marketers
GDPR: A Practical Guide for Marketers
 
AR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and MarketAR and VR by the Numbers: A Data First Approach to the Technology and Market
AR and VR by the Numbers: A Data First Approach to the Technology and Market
 
Introduction to Customer Data Platforms
Introduction to Customer Data PlatformsIntroduction to Customer Data Platforms
Introduction to Customer Data Platforms
 
Hands On: Javascript SDK
Hands On: Javascript SDKHands On: Javascript SDK
Hands On: Javascript SDK
 
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD WorkflowHands-On: Managing Slowly Changing Dimensions Using TD Workflow
Hands-On: Managing Slowly Changing Dimensions Using TD Workflow
 
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and AppsBrand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
Brand Analytics Management: Measuring CLV Across Platforms, Devices and Apps
 
How to Power Your Customer Experience with Data
How to Power Your Customer Experience with DataHow to Power Your Customer Experience with Data
How to Power Your Customer Experience with Data
 
Why Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without DataWhy Your VR Game is Virtually Useless Without Data
Why Your VR Game is Virtually Useless Without Data
 
Connecting the Customer Data Dots
Connecting the Customer Data DotsConnecting the Customer Data Dots
Connecting the Customer Data Dots
 
Harnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company SuccessHarnessing Data for Better Customer Experience and Company Success
Harnessing Data for Better Customer Experience and Company Success
 
Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017Packaging Ecosystems -Monki Gras 2017
Packaging Ecosystems -Monki Gras 2017
 
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
글로벌 사례로 보는 데이터로 돈 버는 법 - 트레저데이터 (Treasure Data)
 
Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14Keynote - Fluentd meetup v14
Keynote - Fluentd meetup v14
 
Introduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of HivemallIntroduction to New features and Use cases of Hivemall
Introduction to New features and Use cases of Hivemall
 
Scalable Hadoop in the cloud
Scalable Hadoop in the cloudScalable Hadoop in the cloud
Scalable Hadoop in the cloud
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...Treasure Data:  Move your data from MySQL to Redshift with (not much more tha...
Treasure Data: Move your data from MySQL to Redshift with (not much more tha...
 
Treasure Data From MySQL to Redshift
Treasure Data  From MySQL to RedshiftTreasure Data  From MySQL to Redshift
Treasure Data From MySQL to Redshift
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 

Último

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 

Último (20)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 

Internals of Presto Service

  • 1. Internals of Presto Service Taro L. Saito, Treasure Data leo@treasure-data.com March 11-12th, 2015 Treasure Data Tech Talk #1 at Tokyo
  • 2. Taro L. Saito @taroleo •  2007 University of Tokyo. Ph.D. –  XML DBMS, Transaction Processing •  Relational-Style XML Query [SIGMOD 2008] •  ~ 2014 Assistant Professor at University of Tokyo –  Genome Science Research •  Distributed Computing, Personal Genome Analysis •  March 2014 ~ Treasure Data –  Software Engineer, MPP Team Leader •  Open source projects at GitHub –  snappy-java, msgpack-java, sqlite-jdbc –  sbt-pack, sbt-sonatype, larray –  silk •  Distributed workflow engine 2
  • 3. Hive TD API / Web Console batch query Presto Treasure Data PlazmaDB: MessagePack Columnar Storage td-presto connector Interactive query
  • 4. What is Presto? •  A distributed SQL Engine developed by Facebook –  For interactive analysis on peta-scale dataset •  As a replacement of Hive –  Nov. 2013: Open sourced at GitHub •  Presto –  Written in Java –  In-memory query layer –  CPU efficient for ad-hoc analysis –  Based on ANSI SQL –  Isolation of query layer and storage access layer •  A connector provides data access (reading schema and records) 4
  • 5. Presto: Distributed SQL Engine 5 TD Presto has its own query retry mechanism Tailored to throughput CPU-intensive. Faster response time Fault Tolerant
  • 6. Treasure Data: Presto as a Service 6 Presto Public Release
  • 7. Topics •  Challenges in providing Database as a Service •  TD Presto Connector –  Optimizing Scan Performance –  Multi-tenancy Cluster Management •  Resource allocation •  Monitoring •  Query Tuning 7
  • 8. buffer Optimizing Scan Performance •  Fully utilize the network bandwidth from S3 •  TD Presto becomes CPU bottleneck TableScanOperator •  s3 file list •  table schema header request S3 / RiakCS •  release(Buffer) Buffer size limit Reuse allocated buffers Request Queue •  priority queue •  max connections limit Header Column Block 0 (column names) Column Block 1 Column Block i Column Block m MPC1 file HeaderReader •  callback to HeaderParser ColumnBlockReader header HeaderParser •  parse MPC file header • column block offsets • column names column block request Column block requests column block prepare MessageUnpacker buffer MessageUnpacker MessageUnpacker S3 read S3 read pull records Retry GET request on - 500 (internal error) - 503 (slow down) - 404 (not found) - eventual consistency S3 read •  decompression •  msgpack-java v07 S3 read S3 read S3 read
  • 9. MessageBuffer •  msgpack-java v06 was the bottleneck –  Inefficient buffer access •  v07 •  Fast memory access •  sun.misc.Unsafe •  Direct access to heap memory •  extract primitive type value from byte[] •  cast •  No boxing 9
  • 10. Unsafe memory access performance is comparable to C •  http://frsyuki.hatenablog.com/entry/2014/03/12/155231 10
  • 11. Why ByteBuffer is slow? •  Following a good programming manner –  Define interface, then implement classes •  ByteBuffer interface has HeapByteBuffer and DirectByteBuffer implementations •  In reality: TypeProfile slows down method access –  JVM generates look-up table of method implementations –  Simply importing one or more classes generates TypeProfile •  v07 avoid TypeProfile generation –  Load an implementation class through Reflection 11
  • 12. Format Type Detection •  MessageUnpacker –  read prefix: 1 byte –  detect format type •  switch-case –  ANTLR generates this type of codes 12
  • 13. Format Type Detection •  Using cache-efficient lookup table: 20000x faster 13
  • 15. Database As A Service 15
  • 16. Claremont Report on Database Research •  Discussion on future of DBMS –  Top researchers, vendors and practitioners. –  CACM, Vol. 52 No. 6, 2009 •  Predicts emergence of Cloud Data Service –  SQL has an important role •  limited functionality •  suited for service provider –  A difficult example: Spark  •  Need a secure application container to run arbitrary Scala code. 16
  • 17. Beckman Report on Database Research •  2013 –  http://beckman.cs.wisc.edu/beckman-report2013.pdf –  Topics of Big-Data •  End-to-end service –  From data collection to knowledge •  Cloud Service has become popular –  IaaS, PaaS, SaaS –  Challenge is to migrate all of the functionalities of DBMS into Cloud 17
  • 18. Results Push Results Push SQL Big Data Simplified: The Treasure Data Approach AppServers Multi-structured Events! •  register! •  login! •  start_event! •  purchase! •  etc! SQL-based Ad-hoc Queries SQL-based Dashboards DBs & Data Marts Other Apps Familiar & Table-oriented Infinite & Economical Cloud Data Store ü  App log data! ü  Mobile event data! ü  Sensor data! ü  Telemetry! Mobile SDKs Web SDK Multi-structured Events Multi-structured Events Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Treasure Agent Embedded SDKs Server-side Agents 18
  • 19. Challenges in Database as a Service •  Tradeoffs –  Cost and service level objectives (SLOs) •  Reference –  Workload Management for Big Data Analytics. A. Aboulnaga [SIGMOD2013 Tutorial] 19 Run each query set on an independent cluster Run all queries together on the smallest possible cluster Fast $$$ Limited performance guarantee Reasonable price
  • 20. Shift of Presto Query Usage •  Initial phase –  Try and error of queries •  Many syntax errors, semantic errors •  Next phase –  Scheduled query execution •  Increased Presto query usage –  Some customers submit more than 1,000 Presto queries / day –  Establishing typical query patterns •  hourly, daily reports •  query templates •  Advanced phase: More elaborate data analysis –  Complex queries •  via data scientists and data analysts –  High resource usage 20
  • 21. Usage Shift: Simple to Complex queries 21
  • 22. Monitoring Presto Usage with Fluentd 22 Hive Presto
  • 23. DataDog •  Monitoring CPU, memory and network usage •  Query stats 23
  • 24. Query Collection in TD •  SQL query logs –  query, detailed query plan, elapsed time, processed rows, etc. •  Presto is used for analyzing the query history 24
  • 26. Query Running Time •  More than 90% of queries finishes within 2 min. expected response time for interactive queries 26
  • 27. Processed Rows of Queries 27
  • 28. Performance •  Processed rows / sec. of a query 28
  • 29. Collecting Recoverable Error Patterns •  Presto has no fault tolerance •  Error types –  User error •  Syntax errors –  SQL syntax, missing function •  Semantic errors –  missing tables/columns –  Insufficient resource •  Exceeded task memory size –  Internal failure •  I/O error –  S3/Riak CS •  worker failure •  etc. 29 TD Presto retries these queries
  • 30. Query Retry on Internal Errors •  More than 99.8% of queries finishes without errors 30
  • 31. Query Retry on Internal Errors (log scale) •  Queries succeed eventually 31
  • 32. Multi-tenancy: Resource Allocation •  Price-plan based resource allocation •  Parameters –  The number of worker nodes to use (min-candidates) –  The number of hash partitions (initial-hash-partitions) –  The maximum number of running tasks per account •  If running queries exceeds allowed number of tasks, the next queries need to wait (queued) •  Presto: SqlQueryExecution class –  Controls query execution state: planning -> running -> finished •  No resource allocation policy –  Extended TDSqlQueryExection class monitors running tasks and limits resource usage •  Rewriting SqlQueryExecutionFactory at run-time by using ASM library 32
  • 33. Query Queue •  Presto 0.97 –  Introduces user-wise query queues •  Can limit the number of concurrent queries per user •  Problem –  Running too many queries delays overall query performance 33
  • 34. Customer Feedback •  A feedback: –  We don’t care if large queries take long time –  But interactive queries should run immediately •  Challenges –  How do we allocate resources even if preceding queries occupies customer share of resources? –  How do we know a submitted query is interactive one? 34
  • 35. Admission control is necessary •  Adjust resource utilization –  Running Drivers (Splits) –  MPL (Multi-Programming Level) 35
  • 36. Challenge: Auto Scaling •  Setting the cluster size based on the peak usage is expensive •  But predicting customer usage is difficult 36
  • 37. Typical Query Patterns [Li Juang] •  Q: What are typical queries of a customer? –  Customer feels some queries are slow –  But we don’t know what to compare with, except scheduled queries •  Approach: Clustering Customer SQLs •  TF/IDF measure: TF x IDF vector –  Split SQL statements into tokens –  Term frequency (TF) = the number of each term in a query –  Inverse document frequency (IDF) = log (# of queries / # of queries that have a token) •  k-means clustering –  TF/IDF vector –  Generates clusters of similar queries •  x-means clustering for deciding number of clusters automatically –  D. Pelleg [ICML2000] 37
  • 38. Problematic Queries •  90% of queries finishes within 2 min. –  But remaining 10% is still large •  10% of 10,000 queries is 1,000. •  Long-running queries •  Hog queries 38
  • 39. Long Running Queries •  Typical bottlenecks –  Cross joins –  IN (a, b, c, …) •  semi-join filtering process is slow –  Complex scan condition •  pushing down selection •  but delays column scan –  Tuple materialization •  coordinator generates json data –  Many aggregation columns •  group by 1, 2, 3, 4, 5, 6, … –  Full scan •  Scanning 100 billion rows… •  Adding more resources does not always make query faster •  Storing intermediate data to disks is necessary 39 Result are buffered (waiting fetch) slow process fast fast
  • 40. Hog Query •  Queries consuming a lot of CPU/memory resources –  Coined in S. Krompass et al. [EDBT2009] •  Example: –  select 1 as day, count(…) from … where time <= current_date - interval 1 day union all select 2 as day, count(…) from … where time <= current_date - interval 2 day union all –  … –  (up to 190 days) •  More than 1000 query stages. •  Presto tries to run all of the stages at once. –  High CPU usage at coordinator 40
  • 41. •  Query rewriting (better) –  With group by and window functions –  Not a perfect solution •  Need to understand the meaning of the query •  Semantic change is not allowed –  e.g., We cannot rewrite UNION to UNION ALL –  UNION includes duplicate elimination •  Workaround Idea –  Bushy plan -> Deep plan –  Introduce stage-wise resource assignment Query Rewriting? Plan Optimization? 41
  • 42. Future Work •  Reducing Queuing/Response Time –  Introducing shared queue between customers •  For utilizing remaining cluster resources –  Fair-Scheduling: C. Gupata [EDBT2009] –  Self-tuning DBMS. S. Chaudhuri [VLDB2007] •  Adjusting Running Query Size (hard) –  Limiting driver resources as small as possible for hog queries –  Query plan based cost estimation •  Predicting Query Running Time –  J. Duggan [SIGMOD2011], A.C. Konig [VLDB2011] 42
  • 43. Summary: Treasures in Treasure Data •  Treasures for our customers –  Data collected by fluentd (td-agent) –  Query analysis platform –  Query results - values •  For Treasure Data –  SQL query logs •  Stored in treasure data –  We know how customers use SQL •  Typical queries and failures –  We know which part of query can be improved 43