FROM FLAT FILES TO
DECONSTRUCTED DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem
@J_
julien.ledem@wework.com
September 2018
Julien Le Dem
@J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Julien
Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s
❖ “MapReduce: A major step backwards”
❖ The deconstructed database
At the beginning there was Hadoop
Hadoop
Storage:
A distributed file system
Execution:
Map Reduce
Based on Google’s GFS and MapReduce papers
Great at looking for a needle in a haystack
… with
snowplows
Great at looking for a needle in a haystack …
Original Hadoop abstractions
Execution
Map/Reduce
Storage
File System
Just flat files
● Any binary
● No schema
● No standard
● The job must know how to
split the data
(Diagram: three map tasks read locally → shuffle → three reduce tasks write locally)
Simple
● Flexible/Composable
● Logic and optimizations
tightly coupled
● Ties execution with
persistence
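The map/reduce contract on this slide can be sketched in a few lines of plain Python (a toy single-process sketch of the idea, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user's map function to every input record.
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key, as the framework would.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped, reducer):
    return [reducer(k, vs) for k, vs in grouped]

# Classic word count: note that the job itself decides how to split
# and parse the data -- there is no schema, just flat input.
lines = ["the quick fox", "the lazy dog"]
mapped = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(mapped), lambda k, vs: (k, sum(vs)))
```

This shows why logic and optimization are tightly coupled here: any pruning or layout decision lives inside the user's mapper and reducer.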
“MapReduce: A major step
backwards”
(2008)
SQL
Databases have been around for a long time
• Originally SEQUEL developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
Relational model
Global Standard
• Universal Data access language
• Widely understood
• First described in 1969 by Edgar F. Codd
Underlying principles of relational databases
Standard
SQL is understood by many
Separation of logic and optimization
Separation of Schema and Application
High level language focusing on logic (SQL)
Indexing
Optimizer
Evolution
Views
Schemas
Integrity
Transactions
Integrity constraints
Referential integrity
Storage
Tables
Abstracted notion of data
● Defines a Schema
● Format/layout decoupled
from queries
● Has statistics/indexing
● Can evolve over time
Execution
SQL
SQL
● Decouples logic of query
from:
○ Optimizations
○ Data representation
○ Indexing
Relational Database abstractions
SELECT a, AVG(b)
FROM FOO GROUP BY a
FOO
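The decoupling can be demonstrated with any relational engine; a minimal sketch using Python's built-in sqlite3 (the table FOO and columns a, b are taken from the slide; the sample values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOO (a TEXT, b REAL)")
conn.executemany("INSERT INTO FOO VALUES (?, ?)",
                 [("x", 1.0), ("x", 3.0), ("y", 10.0)])

# The query states only the logic; layout, indexing and the
# aggregation algorithm are left entirely to the engine.
rows = conn.execute("SELECT a, AVG(b) FROM FOO GROUP BY a").fetchall()
```

Swapping the storage layout or adding an index would not change the query at all.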
Query evaluation
A well integrated system
Storage
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
GROUP BY f.c WHERE f.d = x
Select
Scan
FOO
Scan
BAR
JOIN
GROUP
BY
FILTER
Select
Scan
FOO
Scan
BAR
JOIN
GROUP
BY
FILTER
Syntax Semantic
Optimization
Execution
Table
Metadata
(Schema,
stats,
layout,…)
Columnar
data
Push
downs
Columnar
data
Columnar
data
Push
downs
Push
downs
user
So why? Why Hadoop?
Why Snowplows?
The relational model was constrained
We need the right
Constraints
Need the right abstractions
Traditional SQL implementations:
• Flat schema
• Inflexible schema evolution
• History rewrite required
• No lower level abstractions
• Not scalable
Constraints are
good
They allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
It’s just code
Hadoop is flexible and scalable
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit
No Data shape
constraint
Open source
• You can improve it
• You can expand it
• You can reinvent it
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
You can actually implement SQL
with this
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
GROUP BY f.c WHERE f.d = x
Select
Scan
FOO
Scan
BAR
JOIN
GROUP
BY
FILTER
Parser
FOO
Execution
GROUP BY
JOIN
FILTER
BAR
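One way the query above could be lowered onto map/reduce stages, as a toy in-memory sketch (FOO, BAR and the column names come from the slide; the sample rows and the filter value x are invented):

```python
from collections import defaultdict

FOO = [{"a": 1, "c": "red", "d": 7},
       {"a": 2, "c": "blue", "d": 7},
       {"a": 1, "c": "red", "d": 0}]
BAR = [{"b": 1, "d": 10.0}, {"b": 1, "d": 20.0}, {"b": 2, "d": 30.0}]
x = 7

# Stage 1: filter, then a reduce-side join keyed on the join key.
by_key = defaultdict(lambda: ([], []))
for f in FOO:
    if f["d"] == x:                 # WHERE f.d = x, applied before the join
        by_key[f["a"]][0].append(f)
for b in BAR:
    by_key[b["b"]][1].append(b)

# Stage 2: GROUP BY f.c, averaging b.d over the joined rows.
sums = defaultdict(lambda: [0.0, 0])
for fs, bs in by_key.values():
    for f in fs:
        for b in bs:
            s = sums[f["c"]]
            s[0] += b["d"]
            s[1] += 1

result = {c: s / n for c, (s, n) in sums.items()}
```

Each stage maps naturally onto one map/shuffle/reduce round, which is exactly how the early SQL-on-Hadoop engines worked.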
And they did…
(open-sourced 2009)
10 years later
The deconstructed database
Author: gamene https://www.flickr.com/photos/gamene/4007091102
The deconstructed database
The deconstructed database
Query
model
Machine
learning
Storage
Batch
execution
Data
Exchange
Stream
Processing
We can mix and match individual components
*not exhaustive!
Specialized
Components
Stream processing
Storage
Execution SQL
Stream persistence
Streams
Resource management
Iceberg
Machine learning
We can mix and match individual components
Storage
Row oriented or columnar
Immutable or mutable
Stream storage vs analytics optimized
Query model
SQL
Functional
…
Machine Learning
Training models
Data Exchange
Row oriented
Columnar
Batch Execution
Optimized for high throughput and
historical analysis
Streaming Execution
Optimized for high throughput and low
latency processing
Emergence of standard
components
Emergence of standard components
Columnar Storage
Apache Parquet as columnar
representation at rest.
SQL parsing and
optimization
Apache Calcite as a versatile query
optimizer framework
Schema model
Apache Avro as pipeline friendly schema
for the analytics world.
Columnar Exchange
Apache Arrow as the next-generation
in-memory representation and
no-overhead data exchange
Table abstraction
Netflix’s Iceberg has great potential to
provide snapshot isolation and layout
abstraction on top of distributed file
systems.
The deconstructed database’s optimizer: Calcite
Storage
Execution engine
Schema
plugins
Optimizer
rules
SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
GROUP BY f.c WHERE f.d = x
Select
Scan
FOO
Scan
BAR
JOIN
GROUP
BY
FILTER
Syntax Semantic
Optimization
Execution
…
Select
Scan
FOO
Scan
BAR
JOIN
GROUP
BY
FILTER
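A Calcite rule rewrites one plan fragment into an equivalent, cheaper one. A toy sketch of the idea in Python (Calcite itself is Java; the plan encoding and column sets here are invented for illustration):

```python
def columns(plan):
    # Output columns of a plan node, encoded as nested tuples.
    if plan[0] == "scan":
        return plan[2]                       # ("scan", table, columns)
    if plan[0] == "filter":
        return columns(plan[2])              # ("filter", col, child)
    return columns(plan[1]) | columns(plan[2])   # ("join", left, right)

def push_filter_below_join(plan):
    """If a filter above a join only touches one input's columns,
    move it below the join so less data reaches the join."""
    if plan[0] == "filter" and plan[2][0] == "join":
        col, join = plan[1], plan[2]
        left, right = join[1], join[2]
        if col in columns(left):
            return ("join", ("filter", col, left), right)
        if col in columns(right):
            return ("join", left, ("filter", col, right))
    return plan

plan = ("filter", "d",
        ("join",
         ("scan", "FOO", {"a", "c", "d"}),
         ("scan", "BAR", {"b", "e"})))
optimized = push_filter_below_join(plan)
```

Calcite generalizes this pattern: pluggable schemas describe the inputs, and a library of such rules rewrites plans cost-effectively for whatever engine hosts it.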
Apache Calcite is used in:
Streaming SQL
• Apache Apex
• Apache Flink
• Apache SamzaSQL
• Apache StormSQL
…
Batch SQL
• Apache Hive
• Apache Drill
• Apache Phoenix
• Apache Kylin
…
Columnar Row oriented
Mutable
The deconstructed database’s storage
Optimized for analytics Optimized for serving
Immutable
Query integration
To be performant, a query engine requires deep integration with the storage layer:
implementing push downs and a vectorized reader that produces data in an efficient
representation (for example Apache Arrow).
Storage: Push downs
PROJECTION
Read only what you need
PREDICATE
Filter
AGGREGATION
Avoid materializing
intermediary data
To reduce IO, aggregation can
also be implemented during the
scan to:
• minimize materialization of
intermediary data
Evaluate filters during scan to:
• Leverage storage properties
(min/max stats, partitioning,
sort, etc)
• Avoid decoding skipped data.
• Reduce IO.
Read only the columns that are
needed:
• Columnar storage makes this
efficient.
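The projection and predicate push downs can be sketched against a toy columnar store with per-chunk min/max stats (a pure-Python sketch of the idea; real formats like Parquet organize this into row groups and column chunks):

```python
# A "file" of row groups; each group stores columns separately,
# with min/max stats kept per column.
row_groups = [
    {"a": [1, 2, 3], "c": [1, 1, 2], "stats": {"c": (1, 2)}},
    {"a": [4, 5, 6], "c": [7, 8, 9], "stats": {"c": (7, 9)}},
]

def scan(groups, columns, predicate_col, value):
    """Projection: decode only `columns`. Predicate: skip whole row
    groups whose min/max stats rule out `value` (no IO, no decoding)."""
    for g in groups:
        lo, hi = g["stats"][predicate_col]
        if not (lo <= value <= hi):
            continue                      # row group skipped via stats
        keep = [i for i, v in enumerate(g[predicate_col]) if v == value]
        yield {c: [g[c][i] for i in keep] for c in columns}

result = list(scan(row_groups, ["a"], "c", 1))
```

Here the second row group is never decoded at all: its stats prove no row can match.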
The deconstructed database interchange: Apache
Arrow
(Diagram: three scanners read Parquet files with projection push down (read only a and b), feeding partial aggregations, a shuffle of Arrow batches, final aggregations, and the result)
SELECT SUM(a)
FROM t
WHERE c = 5
GROUP BY b
Projection and
predicate push
downs
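The pipeline in this slide can be sketched end to end in plain Python (a toy in-memory version; in a real engine each hop between stages would carry Arrow record batches, and the sample data is invented):

```python
from collections import defaultdict

# Three "files", each a batch of columns a, b, c.
batches = [
    {"a": [1, 2, 3], "b": ["x", "y", "x"], "c": [5, 5, 0]},
    {"a": [4, 5],    "b": ["y", "x"],      "c": [5, 5]},
    {"a": [6],       "b": ["x"],           "c": [0]},
]

def scan_and_partial_agg(batch):
    # Scanner: read only a and b, filter on c, pre-aggregate per batch.
    partial = defaultdict(int)
    for a, b, c in zip(batch["a"], batch["b"], batch["c"]):
        if c == 5:                       # predicate push down
            partial[b] += a              # partial SUM(a) GROUP BY b
    return partial

# Shuffle: route each partial sum to the final aggregator for its key,
# then merge into the final result.
final = defaultdict(int)
for batch in batches:
    for b, s in scan_and_partial_agg(batch).items():
        final[b] += s
```

Partial aggregation before the shuffle is what keeps the exchanged data small; Arrow's role is to make that exchange zero-copy between processes and languages.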
Big Data infrastructure blueprint
Big Data infrastructure blueprint
Big Data infrastructure blueprint
Data Infra
Stream persistence
Stream processing
Batch processing
Real time
dashboards
Interactive
analysis
Periodic
dashboards
Analyst
Real time
publishing
Batch
publishing
DataAPI
Data API
Schema
registry
Datasets
metadata
Scheduling/
Job deps
Persistence
Legend:
Processing
Metadata
Monitoring /
Observability
Data Storage
(S3, GCS, HDFS)
(Parquet, Avro)
Data API
UI
Eng
The Future
Still Improving
Better
interoperability
• More efficient interoperability:
Continued Apache Arrow adoption
A better data
abstraction
• A better metadata
repository: Marquez
• A better table abstraction:
Netflix/Iceberg
• Common Push down
implementations (filter,
projection, aggregation)
Better data
governance
• Global Data Lineage
• Access control
• Protect privacy
• Record User consent
Some predictions
A common access layer
Distributed access
service
Centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
Table abstraction layer
(Schema, push downs, access control, anonymization)
SQL (Impala, Drill, Presto, Hive, …) Batch (Spark, MR, Beam) ML (TensorFlow, …)
File System (HDFS, S3, GCS) Other storage (HBase, Cassandra, Kudu, …)
A multi-tiered streaming/batch storage system
Batch-Stream
storage integration
Convergence of Kafka and Kudu
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability
Time-based ingestion API
1) In-memory row-oriented store
2) In-memory column-oriented store
3) On-disk column-oriented store
Stream consumer API
Projection, Predicate, Aggregation push downs
Batch consumer API
Projection, Predicate, Aggregation push downs
(Stream consumers mainly read from the in-memory tiers; batch consumers mainly read from the on-disk store)
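The tiering idea can be sketched as a store that ingests rows and periodically pivots them into columnar chunks (a toy sketch of the idea only; the Kafka/Kudu convergence on the slide is a prediction, not an existing system, and the flush threshold here is invented):

```python
class TieredStore:
    """Recent writes stay row oriented (fast ingest, fresh reads);
    older data is pivoted into columnar chunks (fast scans)."""

    def __init__(self, flush_every=3):
        self.rows = []            # tier 1: in-memory, row oriented
        self.columnar = []        # tiers 2/3: column-oriented chunks
        self.flush_every = flush_every

    def ingest(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.flush_every:
            # Pivot the row buffer into one columnar chunk.
            cols = {k: [r[k] for r in self.rows] for k in self.rows[0]}
            self.columnar.append(cols)
            self.rows = []

    def stream_read(self):
        # Stream consumers tail the freshest, row-oriented tier.
        return list(self.rows)

    def batch_read(self, column):
        # Batch consumers scan only the columns they need.
        return [v for chunk in self.columnar for v in chunk[column]]

store = TieredStore(flush_every=2)
for i in range(5):
    store.ingest({"ts": i, "v": i * 10})
```

The same dataset thus serves both consumer APIs, each reading mostly from the tier optimized for it.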
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
September 2018
We’re hiring!
Contact: julien.ledem@wework.com
We’re growing
We’re hiring!
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron
School, Meetup, WeLive, …
WeWork in numbers
Technology jobs
Locations
• San Francisco: Platform teams
• New York
• Tel Aviv
• Singapore
• WeWork has 287 locations, in 77 cities internationally
• Over 268,000 members worldwide
• 23 countries
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
September 2018

Strata NY 2018: The deconstructed database

  • 1. FROM FLAT FILES TO DECONSTRUCTED DATABASE The evolution and future of the Big Data ecosystem. Julien Le Dem @J_ julien.ledem@wework.com September 2018
  • 2. Julien Le Dem @J_ Principal Data Engineer • Author of Parquet • Apache member • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez • Used Hadoop first at Yahoo in 2007 • Formerly Twitter Data platform and Dremio Julien
  • 3. Agenda ❖ At the beginning there was Hadoop (2005) ❖ Actually, SQL was invented in the 70s “MapReduce: A major step backwards” ❖ The deconstructed database
  • 4. At the beginning there was Hadoop
  • 5. Hadoop Storage: A distributed file system Execution: Map Reduce Based on Google’s GFS and MapReduce papers
  • 6. Great at looking for a needle in a haystack
  • 7. … with snowplows Great at looking for a needle in a haystack …
  • 8. Original Hadoop abstractions Execution Map/Reduce Storage File System Just flat files ● Any binary ● No schema ● No standard ● The job must know how to split the data M M M R R R Shuffle Read locally write locally Simple ● Flexible/Composable ● Logic and optimizations tightly coupled ● Ties execution with persistence
  • 9. “MapReduce: A major step backwards” (2008)
  • 10. SQL Databases have been around for a long time • Originally SEQUEL developed in the early 70s at IBM • 1986: first SQL standard • Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016 Relational model Global Standard • Universal Data access language • Widely understood • First described in 1969 by Edgar F. Codd
  • 11. Underlying principles of relational databases Standard SQL is understood by many Separation of logic and optimization Separation of Schema and Application High level language focusing on logic (SQL) Indexing Optimizer Evolution Views Schemas Integrity Transactions Integrity constraints Referential integrity
  • 12. Storage Tables Abstracted notion of data ● Defines a Schema ● Format/layout decoupled from queries ● Has statistics/indexing ● Can evolve over time Execution SQL SQL ● Decouples logic of query from: ○ Optimizations ○ Data representation ○ Indexing Relational Database abstractions SELECT a, AVG(b) FROM FOO GROUP BY a FOO
  • 14. A well integrated system Storage SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b GROUP BY f.c WHERE f.d = x Select Scan FOO Scan BAR JOIN GROUP BY FILTER Select Scan FOO Scan BAR JOIN GROUP BY FILTER Syntax Semantic Optimization Execution Table Metadata (Schema, stats, layout,…) Columnar data Push downs Columnar data Columnar data Push downs Push downs user
  • 15. So why? Why Hadoop? Why Snowplows?
  • 16. The relational model was constrained We need the right Constraints Need the right abstractions Traditional SQL implementations: • Flat schema • Inflexible schema evolution • History rewrite required • No lower level abstractions • Not scalable Constraints are good They allow optimizations • Statistics • Pick the best join algorithm • Change the data layout • Reusable optimization logic
  • 17. It’s just code Hadoop is flexible and scalable • Room to scale algorithms that are not part of the standard • Machine learning • Your imagination is the limit No Data shape constraint Open source • You can improve it • You can expand it • You can reinvent it • Nested data structures • Unstructured text with semantic annotations • Graphs • Non-uniform schemas
  • 18. You can actually implement SQL with this
  • 19. SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b GROUP BY f.c WHERE f.d = x Select Scan FOO Scan BAR JOIN GROUP BY FILTER Parser FOO Execution GROUP BY JOIN FILTER BAR And they did… (open-sourced 2009)
  • 21. The deconstructed database Author: gamene https://www.flickr.com/photos/gamene/4007091102
  • 24. We can mix and match individual components *not exhaustive! Specialized Components Stream processing Storage Execution SQL Stream persistance Streams Resource management Iceberg Machine learning
  • 25. We can mix and match individual components Storage Row oriented or columnar Immutable or mutable Stream storage vs analytics optimized Query model SQL Functional … Machine Learning Training models Data Exchange Row oriented Columnar Batch Execution Optimized for high throughput and historical analysis Streaming Execution Optimized for High Throughput and Low latency processing
  • 27. Emergence of standard components. Columnar storage: Apache Parquet as the columnar representation at rest. SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework. Schema model: Apache Avro as a pipeline friendly schema for the analytics world. Columnar exchange: Apache Arrow as the next generation in-memory representation and no-overhead data exchange. Table abstraction: Netflix’s Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.
  • 28. The deconstructed database’s optimizer: Calcite. The same SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b WHERE f.d = x GROUP BY f.c goes through Syntax, Semantic, Optimization and Execution phases; Calcite lets you plug in schema plugins, optimizer rules, storage and the execution engine.
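A minimal sketch of the kind of rewrite rule a Calcite-style optimizer applies, here pushing a Filter into the Scan beneath it. The dict-based plan nodes are a toy representation, not Calcite's actual API:

```python
# Toy logical plan nodes as dicts; a rewrite rule pushes a Filter
# into the Scan directly beneath it (hypothetical representation).
def push_filter_into_scan(plan):
    if plan["op"] == "Filter" and plan["input"]["op"] == "Scan":
        scan = dict(plan["input"])
        scan["filters"] = scan.get("filters", []) + [plan["pred"]]
        return scan
    if "input" in plan:
        rewritten = dict(plan)
        rewritten["input"] = push_filter_into_scan(plan["input"])
        return rewritten
    return plan

plan = {"op": "Project", "cols": ["c"],
        "input": {"op": "Filter", "pred": "d = x",
                  "input": {"op": "Scan", "table": "FOO"}}}
optimized = push_filter_into_scan(plan)
```

After the rewrite the Filter node is gone and the predicate travels with the Scan, which is what makes storage-level predicate push down possible.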
  • 29. Apache Calcite is used in: Streaming SQL: Apache Apex, Apache Flink, Apache Samza SQL, Apache Storm SQL, … Batch SQL: Apache Hive, Apache Drill, Apache Phoenix, Apache Kylin, …
  • 30. The deconstructed database’s storage. Columnar, immutable, optimized for analytics vs row oriented, mutable, optimized for serving. Query integration: to be performant, a query engine requires deep integration with the storage layer, implementing push downs and a vectorized reader that produces data in an efficient representation (for example Apache Arrow).
  • 31. Storage: push downs. Projection: read only the columns that are needed; columnar storage makes this efficient. Predicate: evaluate filters during the scan to leverage storage properties (min/max stats, partitioning, sort, etc.), avoid decoding skipped data, and reduce IO. Aggregation: to reduce IO, aggregation can also be implemented during the scan to minimize materialization of intermediary data.
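The projection and predicate push downs can be sketched against a toy columnar file made of row groups carrying per-column min/max stats; all names and layouts here are illustrative, not Parquet's actual reader API:

```python
# Hypothetical columnar file: row groups with per-column min/max stats.
row_groups = [
    {"stats": {"d": (0, 4)},   "cols": {"c": ["a", "b"], "d": [1, 3]}},
    {"stats": {"d": (10, 20)}, "cols": {"c": ["c", "e"], "d": [12, 15]}},
]

def scan(groups, project, pred_col, pred_value):
    for g in groups:
        lo, hi = g["stats"][pred_col]
        if not (lo <= pred_value <= hi):
            continue  # predicate push down: min/max stats skip the row group
        keep = [i for i, v in enumerate(g["cols"][pred_col]) if v == pred_value]
        # projection push down: decode only the requested columns
        yield {col: [g["cols"][col][i] for i in keep] for col in project}

out = list(scan(row_groups, ["c"], "d", 12))
```

The first row group is skipped without decoding anything, because 12 falls outside its d range; only the matching rows of the second group are materialized, and only for the projected column.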
  • 32. The deconstructed database interchange: Apache Arrow. SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b: scanners read Parquet files with projection and predicate push downs (read only a and b), partially aggregate, shuffle Arrow batches by key, then a final aggregation produces the result.
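A plain-Python sketch of the two-phase aggregation in this slide, with column dicts standing in for Arrow record batches (toy data, not the Arrow API):

```python
from collections import defaultdict

# Column dicts stand in for Arrow record batches.
batches = [
    {"a": [1, 2, 3], "b": ["x", "y", "x"], "c": [5, 5, 1]},
    {"a": [4, 5],    "b": ["y", "x"],      "c": [5, 5]},
]

def partial_agg(batch):
    # Per-batch partial sums for SELECT SUM(a) ... WHERE c = 5 GROUP BY b.
    sums = defaultdict(int)
    for a, b, c in zip(batch["a"], batch["b"], batch["c"]):
        if c == 5:  # predicate applied during the scan
            sums[b] += a
    return sums

def final_agg(partials):
    # After the shuffle brings partials together, merge them per key.
    total = defaultdict(int)
    for p in partials:
        for k, v in p.items():
            total[k] += v
    return dict(total)

result = final_agg(partial_agg(b) for b in batches)
```

Partial aggregation shrinks each batch to one row per key before the shuffle, which is what keeps the exchange cheap.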
  • 35. Big Data infrastructure blueprint. Data storage (S3, GCS, HDFS; Parquet, Avro) exposed through a Data API; stream persistence and stream processing for real time publishing and real time dashboards; batch processing for batch publishing, periodic dashboards and interactive analysis by analysts and engineers; metadata (schema registry, datasets metadata, scheduling/job dependencies); monitoring/observability throughout. Legend: persistence, processing, metadata.
  • 37. Still Improving Better interoperability • More efficient interoperability: Continued Apache Arrow adoption A better data abstraction • A better metadata repository: Marquez • A better table abstraction: Netflix/Iceberg • Common Push down implementations (filter, projection, aggregation) Better data governance • Global Data Lineage • Access control • Protect privacy • Record User consent
  • 39. A common access layer. A distributed access service centralizes: schema evolution; access control/anonymization; efficient push downs; efficient interoperability. The table abstraction layer (schema, push downs, access control, anonymization) sits between the engines (SQL: Impala, Drill, Presto, Hive, …; batch: Spark, MR, Beam; ML: TensorFlow, …) and the storage (file system: HDFS, S3, GCS; other storage: HBase, Cassandra, Kudu, …).
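A sketch of what such a centralized access layer might look like; the class and method names are hypothetical and only illustrate the access control/anonymization idea, one entry point that every engine goes through:

```python
# Hypothetical table-access layer: a single entry point that applies
# access control and anonymization before any engine sees the rows.
MASKED = "***"

class TableAccessLayer:
    def __init__(self, tables, acl):
        self.tables = tables  # table name -> list of row dicts
        self.acl = acl        # user -> set of columns that user may read

    def scan(self, user, table, columns):
        readable = self.acl.get(user, set())
        for row in self.tables[table]:
            # Columns the user may not read come back anonymized,
            # so no caller can bypass the policy.
            yield {c: (row[c] if c in readable else MASKED) for c in columns}

layer = TableAccessLayer(
    {"members": [{"name": "Ada", "city": "SF"}]},
    {"analyst": {"city"}},
)
rows = list(layer.scan("analyst", "members", ["name", "city"]))
```

Because every reader shares this one layer, schema evolution and policy changes happen in exactly one place.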
  • 40. A multi tiered streaming/batch storage system. Batch-stream storage integration: a convergence of Kafka and Kudu, with schema evolution, access control/anonymization, efficient push downs and efficient interoperability. A time based ingestion API feeds 1) an in-memory row oriented store, 2) an in-memory column oriented store, and 3) an on-disk column oriented store. The stream consumer API (projection, predicate and aggregation push downs) mainly reads from the in-memory tiers; the batch consumer API (same push downs) mainly reads from the on-disk columnar tier.
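The tiered design can be sketched as a toy store that buffers writes row-wise and flushes them to a columnar tier; the class, threshold and single-process design are illustrative only, not how Kafka or Kudu actually work:

```python
# Toy tiered store: fresh writes land in an in-memory row tier and are
# periodically pivoted into a column-oriented tier for batch reads.
class TieredStore:
    def __init__(self, flush_threshold=3):
        self.row_tier = []   # 1) in-memory, row oriented (fresh data)
        self.col_tier = {}   # 2/3) column oriented (historical data)
        self.flush_threshold = flush_threshold

    def append(self, row):
        self.row_tier.append(row)
        if len(self.row_tier) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Pivot the buffered rows into columns.
        for row in self.row_tier:
            for k, v in row.items():
                self.col_tier.setdefault(k, []).append(v)
        self.row_tier = []

    def stream_read(self):
        # Stream consumers mainly read the fresh row tier.
        return list(self.row_tier)

    def batch_read(self, column):
        # Batch consumers mainly read the columnar tier.
        return self.col_tier.get(column, [])

store = TieredStore()
for t, v in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    store.append({"ts": t, "v": v})
```

After four appends with a threshold of three, the first three rows have been pivoted into columns while the fourth is still sitting in the row tier for low latency readers.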
  • 43. We’re growing. We’re hiring! WeWork doubled in size last year and the year before, and we expect to double in size this year as well. Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, … WeWork in numbers: 287 locations in 77 cities internationally; over 268,000 members worldwide; 23 countries. Technology job locations: San Francisco (platform teams), New York, Tel Aviv, Singapore. Contact: julien.ledem@wework.com
  • 44. Questions? Julien Le Dem @J_ julien.ledem@wework.com September 2018

Editor’s notes

  1. Focusing on the analytics use case
  2. Well integrated, efficient