© 2017 Dremio Corporation @DremioHQ
Using Apache Arrow, Calcite and Parquet to build a Relational Cache
Halloween 2017
@DataEngConf
Jacques Nadeau
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
Agenda
• Tech Backgrounder
• Caching Techniques
• Relational Caching In Depth
• Definition and Matching
• Dealing with Updates
• Closing Words
Tech Backgrounder
What is Apache Arrow
• Columnar in-memory data processing library
• Designed to work with any programming language
• Support for both relational and complex data as-is
• Used by Pandas, Spark, Dremio
What is Apache Calcite
• SQL parser, Relational Algebra & Optimizer
• Understands Materialized Views and Lattices
• Used by many to add SQL functionality, including Apex, Drill, Hive, Flink, Kylin, Phoenix, Samza, Storm, Cascading & Dremio
What is Apache Parquet
• OSS implementation of the Google Dremel disk format for complex columnar data
• Supports a high level of data-aware columnar compression and vectorized columnar readback
• De facto standard for analytical data on disk in the Big Data ecosystem
Caching Techniques
What does Caching Mean?
• Caching: Reduce the distance to data (DTD).
• Distance: How much time and how many resources does it take to access data?
– How fast is the medium? How near is it?
– Is the data designed for efficient consumption?
– How similar is the data to what you need to answer a question?
• Ways to reduce DTD: Perf & Proximity, Relevance, Consumability
Types of Caching
• In-Memory File Pinning
• Columnar Disk Caching
• In-Memory Block Caching
• Near-CPU Data Caching
• Cube Relational Caching
• Arbitrary Relational Caching
In-Memory File Pinning
• Hold a File in Memory for frequent retrieval
• Pros
– Simple, standard and well-defined interface
– Improves the performance of the medium.
– If your performance is primarily bound by disk IO,
this might be a good option.
• Cons
– File structure not necessarily best in-memory
structure.
– Data manipulation almost always requires a copy of
data to also be held in memory (because the file
format is not directly consumable).
Columnar Disk Caching
• Store the data in an optimized columnar
format.
• Pros
– Better compression reduces IO
– Good structure improves processing
– Benefits selective workloads (those that need only
a subset of all columns)
• Cons
– Requires duplicating data
– Typically manual/semi-automated (e.g.
MapReduce/Spark to ETL persist/update)
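
A minimal sketch of why a columnar layout helps selective workloads and compression. It uses plain Python lists rather than Parquet, and the table contents and the helper name `run_length_encode` are illustrative assumptions, not part of the talk.

```python
# Illustrative only: plain Python structures stand in for a columnar file format.
rows = [
    {"region": "east", "year": 2017, "sales": 10},
    {"region": "east", "year": 2017, "sales": 20},
    {"region": "west", "year": 2017, "sales": 5},
    {"region": "west", "year": 2017, "sales": 7},
]

# Row layout -> column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# A selective query touches only the column it needs.
total_sales = sum(columns["sales"])

# Repetitive columns compress well once their values sit next to each other.
encoded_region = run_length_encode(columns["region"])
```

Here the sum reads one of three columns, and the four `region` values shrink to two runs. Real columnar formats apply the same idea with more elaborate encodings.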
In-Memory Block Caching
• Maintain portions of on-disk data in
Memory (e.g. Linux page cache, HBase
block cache)
• Pros
– Very mature and usually available for free
• Cons
– Not easy to control/influence.
– Very disconnected from workloads.
Near-CPU Data Caching (memory or disk)
• Hold the data directly in a representation that can
be processed without restructuring (e.g. Arrow
format)
• Pros
– Processing can be done without interpretation of
format
– Very efficient to consume
– Possible to consume data by multiple consumers
without duplicating memory
• Cons
– Larger than compressed formats
– Requires applications to agree on format
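
A small sketch of the "multiple consumers without duplicating memory" point. It uses Python's standard `array` and `memoryview` as a stand-in for a shared, directly processable buffer; this is an analogy under that assumption, not the Arrow API.

```python
from array import array

# A contiguous buffer of 64-bit integers stands in for one column held in a
# directly processable in-memory representation (illustrative; not Arrow itself).
column = array("q", [3, 1, 4, 1, 5, 9, 2, 6])

# memoryview exposes the same bytes to several consumers without copying them.
consumer_a = memoryview(column)
consumer_b = memoryview(column)[2:6]  # a zero-copy slice over elements 2..5

total = sum(consumer_a)
window_max = max(consumer_b)

# Both views alias the original buffer: a write through one is seen by the other.
consumer_a[2] = 40
shared = consumer_b[0] == 40
```

The two consumers only agree on what the bytes mean because both know the element type, which mirrors the "applications must agree on format" drawback.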
Cube-Based Relational Caching
• Create several partially aggregated cuboids that can
satisfy a range of aggregation queries
• Pros
– Low-latency performance for common aggregate
query patterns
– Cube storage requirements can be small fraction of
original dataset size
• Cons
– Analysis latency is bi-modal: cube hit is great but a
miss is either unserved or served slowly
– Difficult or impossible to satisfy arbitrary queries
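
The bi-modal behavior can be illustrated with a toy cube. This is a sketch under assumed data: the fact rows, the `DIMENSIONS` tuple, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, amount).
facts = [
    ("east", "apples", 10),
    ("east", "pears", 4),
    ("west", "apples", 7),
    ("west", "apples", 3),
]
DIMENSIONS = ("region", "product")


def build_cuboids(rows):
    """Pre-aggregate sum(amount) for every subset of the dimensions."""
    cuboids = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for region, product, amount in rows:
                record = {"region": region, "product": product}
                totals[tuple(record[d] for d in dims)] += amount
            cuboids[dims] = dict(totals)
    return cuboids


def query_sum(cuboids, group_by):
    """Serve a grouped sum from the cube, or return None on a cube miss.

    group_by must list dimensions in the same order as DIMENSIONS.
    """
    return cuboids.get(tuple(group_by))


cube = build_cuboids(facts)
by_region = query_sum(cube, ["region"])      # cube hit
by_customer = query_sum(cube, ["customer"])  # cube miss: not a cube dimension
```

A hit is a dictionary lookup over a handful of pre-aggregated rows; a miss returns nothing and the query has to be answered some other way.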
Arbitrary Relational Caching
• Create arbitrary data fragments combined
with partitioning and sorting schemes to
speed any query
• Pros
– Base case is easy to understand
– Can improve the performance of any query
• Cons
– Complex to match to arbitrary queries
– Can be large depending on needs
Types of Caching: The combination we found useful
• In-Memory File Pinning
– Too non-specific given memory scarcity
• ✔ Columnar Disk Caching
– Make sure everything is in Parquet (for any non-ephemeral data)
• ✔ In-Memory Block Caching
– Leverage existing page-cache, avoid additional memory cache layers
• ✔ Near-CPU Data Caching
– Used primarily for ephemeral/short-term persistence to avoid overhead
• ✔ Cube Relational Caching
– Useful for aggregation patterns
• ✔ Arbitrary Relational Caching
– Useful for unusual aggregation and non-aggregation needs
Relational Caching In Depth
Relational Algebra Refresher
• Relations: Source of data (a table)
• Operators: Define a set of transformations
– Join, Project, Scan, Filter, Aggregate, Window, etc
• Properties: Defining traits of data at a particular
relation
– Sorted by X, Hash distributed by Y, etc.
• Rules: Defining equality conditions between a
collection of operations
– Project > Filter can be changed to Filter > Project; a scan
doesn’t need to project columns that aren’t used later;
etc.
• Graph/Tree: A collection of operators that define a
particular dataset in a DAG
[Diagram: two operator trees, each built from Project, Filter and Scan, illustrating the Project/Filter reordering rule]
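
To make operators, trees and rules concrete, here is a minimal sketch of an operator tree with one rewrite rule. The classes and `push_project_below_filter` are hypothetical illustrations and not Calcite's API. The rule swaps only when the filter column survives the projection.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    rows: tuple  # tuple of dicts standing in for a table


@dataclass
class Filter:
    child: object
    column: str
    max_value: int  # keeps rows where row[column] < max_value


@dataclass
class Project:
    child: object
    columns: tuple


def execute(node):
    """Evaluate an operator tree bottom-up into a list of row dicts."""
    if isinstance(node, Scan):
        return list(node.rows)
    if isinstance(node, Filter):
        return [r for r in execute(node.child) if r[node.column] < node.max_value]
    if isinstance(node, Project):
        return [{c: r[c] for c in node.columns} for r in execute(node.child)]
    raise TypeError(f"unknown operator: {node!r}")


def push_project_below_filter(node):
    """Rewrite Project(Filter(x)) as Filter(Project(x)) when the filter only
    needs a column that the projection keeps; otherwise return node unchanged."""
    if (isinstance(node, Project) and isinstance(node.child, Filter)
            and node.child.column in node.columns):
        f = node.child
        return Filter(Project(f.child, node.columns), f.column, f.max_value)
    return node


table = Scan(({"a": 1, "b": 5, "c": 9}, {"a": 2, "b": 50, "c": 8}))
original = Project(Filter(table, "b", 10), ("a", "b"))
rewritten = push_project_below_filter(original)
```

Both trees produce the same rows, which is what makes the rule an equality condition. A planner can then pick whichever form is cheaper.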
Relational Caching: Basic Concept
• Store derived data that is
between what you want
and original dataset
• Shortens Distance to
Data (DTD)
• Reduces resource
requirements & latency
[Diagram: Original Data feeds a Persisted Shared Intermediate State, which serves three “What you Want” results. Labels: original DTD, new DTD, cost reduction.]
You Probably Already Do This!
Data Alternatives (Manually Created)
• Sessionized
• Cleansed
• Partitioned by time or region
• Summarized for a particular
purpose
Users Choose Depending on Need
• Analysts trained on using different
tables depending on use case
• Custom datasets built for
reporting
• Summarization and/or extraction
for dashboards
Benefit of Relational Caching over “Copy and Pick”
“Copy and Pick”
• User picks best optimization
• Admin manages maintenance
Relational Caching
• Cache picks best optimization
• Cache maintains representations
[Diagram layers: Source Table, Logical Model, Physical Optimizations (transform, sort, partition, aggregate); one layer is marked “????”]
Key Components of Relational Caching
• How to Express Transformations/States: SQL
• Hold and Match Relational algebra: Calcite
• Persist alternative datasets: Parquet
• A way to process: Arrow + Sabot
• And a lot of code to put it all together…
Our Approach
[Architecture diagram. Components:]
• End User Queries
• UI to Define Cached Patterns
• Query Planner
• Relational Pattern Matching System
• Relational Pattern Database
• Change Detection Database
• Refresh System
• Cache Persistence (Parquet, Arrow)
• Data Processing System (Sabot)
• Source Storage Interface (Arrow): HDFS, S3, Elastic
Definition and Matching
Coming Back to Calcite
• Calcite is a Planner & Optimizer
• Comes with a prebuilt selection of
operators, rules, properties (called
traits) and ways to express relations
• Also has a basic Materialized View
facility (relevant!)
Perfect Foundation for Relational Caching
How We Built Caching: Reflections
• Reflection: A persisted alternative view of data in Parquet
format
– Raw Reflection: Persist all records of underlying dataset, controlling
partitioning and sortedness
– Aggregate Reflection: Persist a partially aggregated dataset based on a
selection of dimensions and measures, still controlling partitioning and
sortedness
• Reflections can be built on either source tables or arbitrarily
defined Virtual Datasets
Cache Matching: Aggregation Rollup
Given a user query, try to create an alternative version of the
query that matches the cached target.
(Each plan is listed top-down.)
User Query: P(a,c) > F(c’ < 10) > A(a, sum(c) as c’) > S(t1)
Reflection Definition: A(a,b, sum(c)) > S(t1), persisted as Target Materialization S(r1)
Alternative Plan: F(c’ < 10) > A(a, sum(c) as c’) > S(r1)
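
The rollup can be checked with a small sketch. The table contents and the `aggregate` helper are hypothetical. The point is that a sum grouped by (a, b) can be re-aggregated to a sum grouped by a, so the query can read the reflection instead of the base table.

```python
from collections import defaultdict

# Hypothetical base table t1 with columns a, b, c.
t1 = [
    {"a": "x", "b": 1, "c": 4},
    {"a": "x", "b": 2, "c": 3},
    {"a": "y", "b": 1, "c": 20},
    {"a": "y", "b": 1, "c": 5},
]


def aggregate(rows, keys, measure):
    """GROUP BY keys with SUM(measure); returns one dict per group."""
    sums = defaultdict(int)
    for row in rows:
        sums[tuple(row[k] for k in keys)] += row[measure]
    return [dict(zip(keys, group), **{measure: total}) for group, total in sums.items()]


def user_query(rows):
    # F(c' < 10) over A(a, sum(c) as c')
    return [r for r in aggregate(rows, ["a"], "c") if r["c"] < 10]


# Reflection definition: A(a, b, sum(c)) over S(t1), persisted as r1.
r1 = aggregate(t1, ["a", "b"], "c")

direct = user_query(t1)          # scans the base table
via_reflection = user_query(r1)  # rolls up the smaller reflection instead
```

This works because SUM can be rolled up from partial sums. An aggregate such as a distinct count could not be substituted this way without storing extra state.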
Cache Matching: Join/Aggregation Transposition
(Each plan is listed top-down.)
User Query: A(a, sum(c) as c’) > Join(t1.id=t2.id) > S(t1), S(t2)
Reflection Definition: A(id, sum(c)) > S(t1), persisted as Target Materialization S(r1)
Alternative Plan: A(a, sum(c) as c’) > Join(r1.id=t2.id) > S(r1), S(t2)
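
A sketch of why the transposed plan returns the same answer. The tables are hypothetical and assume the measure c lives in t1 and the grouping column a lives in t2. The helper names are illustrative.

```python
from collections import defaultdict

# Hypothetical tables: t1(id, c) holds the measure, t2(id, a) the grouping column.
t1 = [{"id": 1, "c": 4}, {"id": 1, "c": 6}, {"id": 2, "c": 5}, {"id": 3, "c": 1}]
t2 = [{"id": 1, "a": "x"}, {"id": 2, "a": "x"}, {"id": 3, "a": "y"}]


def join_on_id(left, right):
    """Inner equi-join on id."""
    return [{**l, **r} for l in left for r in right if l["id"] == r["id"]]


def sum_by(rows, key, measure):
    """GROUP BY key with SUM(measure), returned as {key value: total}."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[measure]
    return dict(sums)


# User query: A(a, sum(c)) over Join(t1.id = t2.id).
direct = sum_by(join_on_id(t1, t2), "a", "c")

# Reflection r1: A(id, sum(c)) over S(t1), computed ahead of time.
r1 = [{"id": i, "c": total} for i, total in sum_by(t1, "id", "c").items()]

# Alternative plan: join the pre-aggregated r1 to t2, then finish the aggregation.
via_reflection = sum_by(join_on_id(r1, t2), "a", "c")
```

The join now sees one row per id instead of every t1 row, which is where the saving comes from when t1 is large.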
Cache Matching: Costing and Partitioning Benefits
User Query: F(a) > S(t1)
Target Materialization: S(t1) persisted as S(r1), Part by a
Target Materialization: S(t1) persisted as S(r1), Part by b
Resulting scan: S(r1) pruned on a
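
A toy version of the costing decision, assuming cost is simply the number of rows read. The data, the `partition_by` layout and the `rows_scanned` estimate are hypothetical simplifications of what a real planner would cost.

```python
from collections import defaultdict

# Hypothetical base rows with columns a and b (3 x 4 = 12 rows).
t1 = [{"a": a, "b": b} for a in ("x", "y", "z") for b in range(4)]


def partition_by(rows, column):
    """Persist rows as {partition value: rows}, like directory partitioning."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return {"column": column, "parts": dict(parts)}


def rows_scanned(reflection, filter_column, value):
    """Cost estimate: rows read to answer filter_column = value."""
    if reflection["column"] == filter_column:
        # Pruning: only the matching partition is read.
        return len(reflection["parts"].get(value, []))
    return sum(len(p) for p in reflection["parts"].values())


candidates = {
    "part by a": partition_by(t1, "a"),
    "part by b": partition_by(t1, "b"),
}
costs = {name: rows_scanned(r, "a", "x") for name, r in candidates.items()}
best = min(costs, key=costs.get)
```

Both materializations can answer the query, but only the one partitioned on the filter column allows pruning, so it costs less.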
Relational Matching, Other Examples
• Physical Property Matching
• Predicate Promotion
• Predicate Inference
• Join Decomposition
• Join Promotion
Dealing with Updates
Refresh Management
Importance of Cache Creation Ordering
• Not all update orderings are equal
• Want to order updates based on “Refresh Graph” and dependencies
• Multiple orders possible; cost them against each other to minimize update cost
Freshness Management
• Underlying data may change
• User should define refresh frequency
• Separately define absolute TTL
[Diagram: Physical dataset (1H refresh, 3H expiration), Raw Reflection, Aggregate Reflection]
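
A sketch of dependency-ordered refresh and a freshness check, using the standard library's `graphlib` (Python 3.9+). The dependency edges are an assumption for illustration (aggregate reflection built from the raw reflection, which is built from the physical dataset). The 1-hour and 3-hour values mirror the diagram labels.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical refresh graph: each reflection maps to what it is built from.
depends_on = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}
# Dependencies come first, so nothing is rebuilt from stale input.
refresh_order = list(TopologicalSorter(depends_on).static_order())

REFRESH_SECONDS = 1 * 3600     # "1H refresh"
EXPIRATION_SECONDS = 3 * 3600  # "3H expiration"


def freshness(age_seconds):
    """Classify a reflection by the age of its last successful refresh."""
    if age_seconds >= EXPIRATION_SECONDS:
        return "expired"      # past the absolute TTL: do not serve
    if age_seconds >= REFRESH_SECONDS:
        return "refresh due"  # still usable, but should be rebuilt
    return "fresh"
```

Keeping the refresh interval and the absolute TTL separate lets a reflection keep serving queries while a late refresh catches up, yet still bounds how stale an answer can be.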
Multiple Update Modes (Depending on Mutation Pattern)
• Full: Always rebuild reflections from scratch (highly mutating)
• Incremental (files): Incrementally builds reflections based on new
files and folders (append-only)
• Incremental (rowstores): Incrementally builds reflections based on
a monotonically increasing field (append-only)
• Partitioned Refresh: Maintains reflections based on source
partitions (e.g. Filesystem directories, Hive partitions). (partially
mutating)
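
The rowstore incremental mode can be sketched as a high-water-mark load. The function name, the `id` key and the in-memory lists are hypothetical. The approach is only valid when the source is append-only and the key never decreases.

```python
def incremental_refresh(reflection, source_rows, high_water_mark, key="id"):
    """Append only source rows whose key exceeds the last value already loaded.

    Valid only for append-only sources with a monotonically increasing key.
    Returns the new high-water mark.
    """
    new_rows = [row for row in source_rows if row[key] > high_water_mark]
    reflection.extend(new_rows)
    return max((row[key] for row in new_rows), default=high_water_mark)


source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
reflection = []
mark = incremental_refresh(reflection, source, high_water_mark=0)

source.append({"id": 3, "v": "c"})  # new row appended upstream
mark = incremental_refresh(reflection, source, mark)
```

An update or delete to an already-loaded row would go unnoticed here, which is why highly mutating sources need the full rebuild mode instead.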
Closing Words
What We’ve Seen Using these Techniques
• Frequent 10x-100x+ performance improvements in multiple
workloads
• Vast reduction in resources required to achieve performance
levels
• In many cases, a reduction in disk space
– Due to avoidance of excessive unused or rarely used physical copies
Find out More and Get Involved
• Drop by my office hours (East Room Lounge - now)
• Drop by the Dremio table behind you
• Join us at @ApacheArrow meetup at @enigma_data Midtown
– Wes McKinney (creator of pandas) and myself: tech deep dive
• Join the Dremio community (Relational Caching)
– github.com/dremio/dremio-oss (Apache Licensed)
– dremio.com
– community.dremio.com
• Find out more about the Building Blocks
– dev@[arrow|calcite|parquet].apache.org
– http://github.com/apache/[arrow|calcite|parquet-mr]
– http://[arrow|calcite|parquet].apache.org
• Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite,
@ApacheParquet

Más contenido relacionado

La actualidad más candente

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 

La actualidad más candente (20)

Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 

Destacado

Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketDremio Corporation
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Oracle対応アプリケーションのDockerize事始め
Oracle対応アプリケーションのDockerize事始めOracle対応アプリケーションのDockerize事始め
Oracle対応アプリケーションのDockerize事始めSatoshi Nagayasu
 
はじめてのDockerパーフェクトガイド(2017年版)
はじめてのDockerパーフェクトガイド(2017年版)はじめてのDockerパーフェクトガイド(2017年版)
はじめてのDockerパーフェクトガイド(2017年版)Hiroshi Hayakawa
 

Destacado (12)

Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Oracle対応アプリケーションのDockerize事始め
Oracle対応アプリケーションのDockerize事始めOracle対応アプリケーションのDockerize事始め
Oracle対応アプリケーションのDockerize事始め
 
はじめてのDockerパーフェクトガイド(2017年版)
はじめてのDockerパーフェクトガイド(2017年版)はじめてのDockerパーフェクトガイド(2017年版)
はじめてのDockerパーフェクトガイド(2017年版)
 

Similar a Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowJulien Le Dem
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowJulien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldCloudera, Inc.
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache ArrowJulien Le Dem
 
Webinar: Is Your Storage Ready for Disaster?
Webinar: Is Your Storage Ready for Disaster?Webinar: Is Your Storage Ready for Disaster?
Webinar: Is Your Storage Ready for Disaster?Storage Switzerland
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015Doug O'Flaherty
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
12 Architectural Requirements for Protecting Business Data in the Cloud
12 Architectural Requirements for Protecting Business Data in the Cloud12 Architectural Requirements for Protecting Business Data in the Cloud
12 Architectural Requirements for Protecting Business Data in the CloudBuurst
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @ScaleDr Hajji Hicham
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...Amazon Web Services
 
Webinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Webinar: Cut Disaster Recovery Expenses – Improve Recovery TimesWebinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Webinar: Cut Disaster Recovery Expenses – Improve Recovery TimesStorage Switzerland
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud SolutionsLower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud SolutionsPerficient, Inc.
 
Webinar: Performance vs. Cost - Solving The HPC Storage Tug-of-War
Webinar: Performance vs. Cost - Solving The HPC Storage Tug-of-WarWebinar: Performance vs. Cost - Solving The HPC Storage Tug-of-War
Webinar: Performance vs. Cost - Solving The HPC Storage Tug-of-WarStorage Switzerland
 

Similar a Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache (20)

Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow



Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

  • 1. © 2017 Dremio Corporation @DremioHQ Using Apache Arrow, Calcite and Parquet to build a Relational Cache Halloween 2017 @DataEngConf Jacques Nadeau
  • 2. Who? Jacques Nadeau (@intjesus)
      • CTO & Co-founder of Dremio
      • Apache member
      • VP Apache Arrow
      • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  • 3. Agenda
      • Tech Backgrounder
      • Caching Techniques
      • Relational Caching In Depth
      • Definition and Matching
      • Dealing with Updates
      • Closing Words
  • 4. Tech Backgrounder
  • 5. What is Apache Arrow
      • Columnar in-memory data processing library
      • Designed to work with any programming language
      • Support for both relational and complex data as-is
      • Used by Pandas, Spark, Dremio
  • 6. What is Apache Calcite
      • SQL parser, relational algebra & optimizer
      • Understands Materialized Views and Lattices
      • Used by many to add SQL functionality, including Apex, Drill, Hive, Flink, Kylin, Phoenix, Samza, Storm, Cascading & Dremio
  • 7. What is Apache Parquet
      • OSS implementation of the Google Dremel disk format for complex columnar data
      • Supports a high level of data-aware columnar compression and vectorized columnar readback
      • De facto standard for analytical data on disk in the Big Data ecosystem
  • 8. Caching Techniques
  • 9. What does Caching Mean?
      • Caching: Reduce the distance to data (DTD).
      • Distance: How much time and how many resources does it take to access the data?
          – How fast is the medium? How near is it?
          – Is the data designed for efficient consumption?
          – How similar is the data to what you need to answer a question?
      • Ways to reduce DTD: Perf & Proximity, Relevance, Consumability
  • 10. Types of Caching
      • In-Memory File Pinning
      • Columnar Disk Caching
      • In-Memory Block Caching
      • Near-CPU Data Caching
      • Cube Relational Caching
      • Arbitrary Relational Caching
  • 11. In-Memory File Pinning
      • Hold a file in memory for frequent retrieval
      • Pros
          – Simple, standard and well-defined interface
          – Improves the performance of the medium.
          – If your performance is primarily bound by disk IO, this might be a good option.
      • Cons
          – File structure is not necessarily the best in-memory structure.
          – Data manipulation almost always requires a copy of the data to also be held in memory (because the file format is not directly consumable).
  • 12. Columnar Disk Caching
      • Store the data in an optimized columnar format.
      • Pros
          – Better compression reduces IO
          – Good structure improves processing
          – Benefits selective workloads (those that need only a subset of all columns)
      • Cons
          – Requires duplicating data
          – Typically manual/semi-automated (e.g. MapReduce/Spark to ETL persist/update)
  • 13. In-Memory Block Caching
      • Maintain portions of on-disk data in memory (e.g. Linux page cache, HBase block cache)
      • Pros
          – Very mature and usually had for free
      • Cons
          – Not easy to control/influence.
          – Very disconnected from workloads.
  • 14. Near-CPU Data Caching (memory or disk)
      • Hold the data directly in a representation that can be processed without restructuring (e.g. Arrow format)
      • Pros
          – Processing can be done without interpretation of format
          – Very efficient to consume
          – Possible for multiple consumers to consume the data without duplicating memory
      • Cons
          – Larger than compressed formats
          – Requires applications to agree on format
  • 15. Cube-Based Relational Caching
      • Create several partially aggregated cuboids that can satisfy a range of aggregation queries
      • Pros
          – Low-latency performance for common aggregate query patterns
          – Cube storage requirements can be a small fraction of the original dataset size
      • Cons
          – Analysis latency is bi-modal: a cube hit is great, but a miss is either unserved or served slowly
          – Difficult or impossible to satisfy arbitrary queries
  • 16. Arbitrary Relational Caching
      • Create arbitrary data fragments combined with partitioning and sorting schemes to speed any query
      • Pros
          – Base case is easy to understand
          – Can improve the performance of any query
      • Cons
          – Complex to match to arbitrary queries
          – Can be large depending on needs
  • 17. Types of Caching: The combination we found useful
      • In-Memory File Pinning
          – Too non-specific given memory scarcity
      • ✔ Columnar Disk Caching
          – Make sure everything is in Parquet (for any non-ephemeral data)
      • ✔ In-Memory Block Caching
          – Leverage existing page-cache, avoid additional memory cache layers
      • ✔ Near-CPU Data Caching
          – Used primarily for ephemeral/short-term persistence to avoid overhead
      • ✔ Cube Relational Caching
          – Useful for aggregation patterns
      • ✔ Arbitrary Relational Caching
          – Useful for unusual aggregation and non-aggregation needs
  • 18. Relational Caching In Depth
  • 19. Relational Algebra Refresher
      • Relations: Source of data (a table)
      • Operators: Define a set of transformations
          – Join, Project, Scan, Filter, Aggregate, Window, etc.
      • Properties: Defining traits of data at a particular relation
          – Sorted by X, hash distributed by Y, etc.
      • Rules: Define equality conditions between collections of operations
          – Project > Filter can be changed to Filter > Project, a scan doesn’t need to project columns that aren’t used later, etc.
      • Graph/Tree: A collection of operators that define a particular dataset in a DAG
      • [Diagram: two equivalent plans built from Project, Filter and Scan]
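
The Project/Filter rule on this slide can be illustrated with a minimal sketch. This is not Calcite code: the helper names (`scan`, `filter_rows`, `project`) and the sample table are hypothetical, and the swap is only valid here because the filter uses a column that survives the projection.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]

def scan(table: List[Row]) -> List[Row]:
    """Scan: read every row of a relation (copying so plans cannot share state)."""
    return [dict(r) for r in table]

def filter_rows(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Filter: keep only the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows: List[Row], cols: List[str]) -> List[Row]:
    """Project: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

# Hypothetical sample relation.
t1 = [
    {"a": 1, "b": 10, "c": 5},
    {"a": 2, "b": 20, "c": 15},
    {"a": 3, "b": 30, "c": 7},
]

def c_below_10(row: Row) -> bool:
    return row["c"] < 10

# Plan 1: Project above Filter above Scan.
plan_filter_first = project(filter_rows(scan(t1), c_below_10), ["a", "c"])
# Plan 2: Filter above Project above Scan (legal because "c" is still projected).
plan_project_first = filter_rows(project(scan(t1), ["a", "c"]), c_below_10)

# A rule asserts that both trees define the same dataset.
assert plan_filter_first == plan_project_first
```

An optimizer is free to pick whichever equivalent tree is cheaper; the same idea is what lets a planner substitute a cached dataset for part of a plan.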
  • 20. Relational Caching: Basic Concept
      • Store derived data that sits between what you want and the original dataset
      • Shortens Distance to Data (DTD)
      • Reduces resource requirements & latency
      • [Diagram: Original Data, a Persisted Shared Intermediate State, and three “What you Want” targets; labels: original DTD, new DTD, cost reduction]
  • 21. You Probably Already Do This!
      • Data Alternatives (Manually Created)
          – Sessionized
          – Cleansed
          – Partitioned by time or region
          – Summarized for a particular purpose
      • Users Choose Depending on Need
          – Analysts trained on using different tables depending on use case
          – Custom datasets built for reporting
          – Summarization and/or extraction for dashboards
  • 22. Benefit of Relational Caching over “Copy and Pick”
      • [Diagram labels: Source Table; Physical Optimizations (transform, sort, partition, aggregate); Logical Model; “????”]
      • “Copy and Pick”: user picks best optimization; admin manages maintenance
      • Relational Caching: cache picks best optimization; cache maintains representations
  • 23. Key Components of Relational Caching
      • How to express transformations/states: SQL
      • Hold and match relational algebra: Calcite
      • Persist alternative datasets: Parquet
      • A way to process: Arrow + Sabot
      • And a lot of code to put it all together…
  • 24. Our Approach
      • [Architecture diagram components: End User Queries; UI to Define Cached Patterns; Query Planner; Relational Pattern Matching System; Relational Pattern Database; Change Detection Database; Refresh System; Data Processing System (Sabot); Source Storage Interface (Arrow) over HDFS, S3, Elastic; Cache Persistence (Parquet, Arrow)]
  • 25. Definition and Matching
  • 26. Coming Back to Calcite
      • Calcite is a Planner & Optimizer
      • Comes with a prebuilt selection of operators, rules, properties (called traits) and ways to express relations
      • Also has a basic Materialized View facility (relevant!)
      • Perfect foundation for Relational Caching
  • 27. How We Built Caching: Reflections
      • Reflection: A persisted alternative view of data in Parquet format
          – Raw Reflection: Persist all records of the underlying dataset, controlling partitioning and sortedness
          – Aggregate Reflection: Persist a partially aggregated dataset based on a selection of dimensions and measures, still controlling partitioning and sortedness
      • Reflections can be built on either source tables or arbitrarily defined Virtual Datasets
  • 28. Cache Matching: Aggregation Rollup
      • Given a user query, try to create an alternative version of the query that matches the cached target.
      • [Plan diagrams, listed root first: S = Scan, P = Project, A = Aggregate, F = Filter]
          – User Query: F(c’ < 10) → A(a, sum(c) as c’) → P(a,c) → S(t1)
          – Reflection Definition: A(a,b, sum(c)) → S(t1)
          – Target Materialization: S(r1)
          – Alternative Plan: F(c’ < 10) → A(a, sum(c) as c’) → S(r1)
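
The rollup on this slide can be checked with a small sketch. It assumes `sum` is the only measure (sums can be re-aggregated from partial sums); the `aggregate` helper, the column name `sum_c` and the sample rows are hypothetical and are not Dremio or Calcite APIs.

```python
from collections import defaultdict

def aggregate(rows, keys, measure, out):
    """Group rows by `keys` and sum `measure` into a column named `out`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r[measure]
    return [dict(zip(keys, k), **{out: v}) for k, v in sorted(totals.items())]

# Hypothetical source table t1(a, b, c).
t1 = [
    {"a": "x", "b": 1, "c": 3},
    {"a": "x", "b": 1, "c": 2},
    {"a": "x", "b": 2, "c": 4},
    {"a": "y", "b": 1, "c": 20},
    {"a": "z", "b": 3, "c": 1},
]

# Reflection definition: A(a, b, sum(c)) over S(t1), persisted as r1.
r1 = aggregate(t1, ["a", "b"], "c", "sum_c")

# User query: F(c' < 10) over A(a, sum(c) as c') over S(t1).
direct = [r for r in aggregate(t1, ["a"], "c", "c_prime") if r["c_prime"] < 10]

# Alternative plan: the same query, but rolling up the partial sums stored in r1.
via_reflection = [
    r for r in aggregate(r1, ["a"], "sum_c", "c_prime") if r["c_prime"] < 10
]

assert direct == via_reflection
```

The alternative plan reads four pre-aggregated rows instead of five source rows here; on real data the reflection is typically far smaller than the table it summarizes.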
  • 29. Cache Matching: Join/Aggregation Transposition
      • [Plan diagrams, listed root first: S = Scan, A = Aggregate]
          – User Query: A(a, sum(c) as c’) → Join(t1.id=t2.id) → S(t1), S(t2)
          – Reflection Definition: A(id, sum(c)) → S(t1)
          – Target Materialization: S(r1)
          – Alternative Plan: A(a, sum(c) as c’) → Join(r1.id=t2.id) → S(r1), S(t2)
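
A minimal sketch of why the transposed plan returns the same answer, assuming `c` comes from t1, `a` comes from t2, and the measure is a plain `sum`. The `aggregate` and `hash_join` helpers and the sample rows are hypothetical illustrations, not the matching code used by Dremio.

```python
from collections import defaultdict

def aggregate(rows, keys, measure, out):
    """Group rows by `keys` and sum `measure` into a column named `out`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r[measure]
    return [dict(zip(keys, k), **{out: v}) for k, v in sorted(totals.items())]

def hash_join(left, right, key):
    """Inner equi-join of two row lists on a shared column name."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical tables: t1 carries the measure c, t2 carries the dimension a.
t1 = [{"id": 1, "c": 5}, {"id": 1, "c": 7}, {"id": 2, "c": 3}, {"id": 3, "c": 10}]
t2 = [{"id": 1, "a": "east"}, {"id": 2, "a": "east"}, {"id": 3, "a": "west"}]

# User query: A(a, sum(c) as c') over Join(t1.id = t2.id).
user_query = aggregate(hash_join(t1, t2, "id"), ["a"], "c", "c_prime")

# Reflection definition: A(id, sum(c)) over S(t1), persisted as r1.
r1 = aggregate(t1, ["id"], "c", "sum_c")

# Alternative plan: join the pre-aggregated r1 to t2, then finish the aggregation.
alternative = aggregate(hash_join(r1, t2, "id"), ["a"], "sum_c", "c_prime")

assert user_query == alternative
```

The aggregate is pushed below the join, so the join sees one row per `id` from the reflection rather than every row of t1.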
  • 30. Cache Matching: Costing and Partitioning Benefits
      • [Plan diagrams: S = Scan, F = Filter]
          – User Query: F(a) → S(t1)
          – Target Materialization: S(t1) as S(r1), partitioned by a
          – Target Materialization: S(t1) as S(r1), partitioned by b
          – Resulting plan: S(r1) pruned on a
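
One way to picture the costing step is a sketch that counts the rows each candidate materialization would have to read for an equality filter on `a`. The cost model (rows read), the helper names and the candidate labels `part_by_a` / `part_by_b` are assumptions made for illustration only.

```python
from collections import defaultdict

def partition_by(rows, col):
    """Split rows into partitions keyed by the value of `col`."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[col]].append(r)
    return dict(parts)

def scan_cost(parts, part_col, filter_col, value):
    """Rows that must be read to answer `filter_col == value`."""
    if part_col == filter_col:
        # Partition pruning: only the matching partition is read.
        return len(parts.get(value, []))
    # The filter column is not the partition column, so every partition is read.
    return sum(len(p) for p in parts.values())

# Hypothetical source table with six rows.
t1 = [{"a": i % 3, "b": "x" if i < 3 else "y", "v": i} for i in range(6)]

# Two materializations of the same data with different partitioning.
candidates = {
    "part_by_a": ("a", partition_by(t1, "a")),
    "part_by_b": ("b", partition_by(t1, "b")),
}

# User query: F(a = 2) over S(t1). Cost each candidate and keep the cheapest.
costs = {
    name: scan_cost(parts, col, "a", 2) for name, (col, parts) in candidates.items()
}
best = min(costs, key=costs.get)
```

With this data the copy partitioned by `a` reads two rows while the copy partitioned by `b` reads all six, so the planner would prefer the former.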
  • 31. Relational Matching, Other Examples
      • Physical Property Matching
      • Predicate Promotion
      • Predicate Inference
      • Join Decomposition
      • Join Promotion
  • 32. Dealing with Updates
  • 33. Refresh Management
      • Importance of Cache Creation Ordering
          – Not all update orderings are equal
          – Want to order updates based on the “Refresh Graph” and dependencies
          – Multiple orders are possible; cost them against each other to minimize update cost
      • Freshness Management
          – Underlying data may change
          – User should define the refresh frequency
          – Separately define an absolute TTL
      • [Diagram: Physical dataset, Raw Reflection, Aggregate Reflection; labels: 1H refresh, 3H expiration]
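
A sketch of dependency-ordered refresh using Python's standard `graphlib`. The dependency edges (the aggregate reflection being built from the raw reflection) and the `freshness` boundaries are assumptions made for illustration; the slide only names the three datasets and the 1H/3H settings.

```python
from graphlib import TopologicalSorter

# Hypothetical refresh graph: each dataset maps to the datasets it is built from.
depends_on = {
    "raw_reflection": {"physical_dataset"},
    "aggregate_reflection": {"raw_reflection"},
}

# A valid refresh order never rebuilds a reflection before its inputs.
refresh_order = list(TopologicalSorter(depends_on).static_order())

REFRESH_EVERY_HOURS = 1   # "1H refresh" from the slide
EXPIRE_AFTER_HOURS = 3    # "3H expiration" from the slide

def freshness(age_hours: float) -> str:
    """Classify a materialization by age (boundary handling is an assumption)."""
    if age_hours >= EXPIRE_AFTER_HOURS:
        return "expired"   # past the absolute TTL, must not be served
    if age_hours >= REFRESH_EVERY_HOURS:
        return "stale"     # still servable, but due for a refresh
    return "fresh"
```

When several valid orders exist, each could be costed and the cheapest chosen; this sketch only produces one valid order.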
  • 34. Multiple Update Modes (Depending on Mutation Pattern)
      • Full: Always rebuild reflections from scratch (highly mutating)
      • Incremental (files): Incrementally builds reflections based on new files and folders (append-only)
      • Incremental (rowstores): Incrementally builds reflections based on a monotonically increasing field (append-only)
      • Partitioned Refresh: Maintains reflections based on source partitions, e.g. filesystem directories, Hive partitions (partially mutating)
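
The rowstore-style incremental mode can be sketched as a high-water-mark scan. This assumes an append-only source with a monotonically increasing `id` column; the function name, the `id` field and the in-memory lists are hypothetical stand-ins for a real source and a persisted reflection.

```python
def incremental_refresh(source, reflection, watermark, key="id"):
    """Append only the rows whose `key` is above the last seen watermark.

    Returns the new watermark and the number of rows appended.
    """
    new_rows = [r for r in source if r[key] > watermark]
    reflection.extend(new_rows)
    new_mark = max((r[key] for r in new_rows), default=watermark)
    return new_mark, len(new_rows)

# Hypothetical append-only source table.
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
reflection = []

# First refresh picks up everything.
mark, first_batch = incremental_refresh(source, reflection, watermark=0)

# New rows arrive; the next refresh reads only those.
source += [{"id": 3, "v": "c"}, {"id": 4, "v": "d"}]
mark, second_batch = incremental_refresh(source, reflection, watermark=mark)

# Nothing new: the refresh is a no-op and the watermark is unchanged.
mark, third_batch = incremental_refresh(source, reflection, watermark=mark)
```

This only stays correct while existing rows are never updated or deleted, which is why the slide pairs the incremental modes with append-only sources and reserves full rebuilds for highly mutating data.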
  • 35. Closing Words
  • 36. What We’ve Seen Using these Techniques
      • Frequent 10x-100x+ performance improvements in multiple workloads
      • Vast reduction in resources required to achieve performance levels
      • In many cases, a reduction in disk space
          – Due to avoidance of excessive unused or rarely used physical copies
  • 37. Find out More and Get Involved
      • Drop by my office hours (East Room Lounge - now)
      • Drop by the Dremio table behind you
      • Join us at the @ApacheArrow meetup at @enigma_data Midtown
          – Wes McKinney (creator of Pandas) and myself, tech deep dive
      • Join the Dremio community (Relational Caching)
          – github.com/dremio/dremio-oss (Apache Licensed)
          – dremio.com
          – community.dremio.com
      • Find out more about the Building Blocks
          – dev@[arrow|calcite|parquet].apache.org
          – http://github.com/apache/[arrow|calcite|parquet-mr]
          – http://[arrow|calcite|parquet].apache.org
      • Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite, @ApacheParquet