SlideShare una empresa de Scribd logo
1 de 23
1© Cloudera, Inc. All rights reserved.
Presented by:
An Introduction to
&
Big Data Warehousing Meetup
December 9, 2015
2© Cloudera, Inc. All rights reserved.
Introducing Kudu
Jeremy Beard | Senior Solutions Architect
December 2015
3© Cloudera, Inc. All rights reserved.
Presenter
• Jeremy Beard
• Senior Solutions Architect at Cloudera
• Three years in big data
• Six years in data warehousing
• jeremy@cloudera.com
4© Cloudera, Inc. All rights reserved.
Current storage landscape in Hadoop
HDFS excels at:
• Efficiently scanning large amounts
of data
• Accumulating data with high
throughput
HBase excels at:
• Efficiently finding and writing
individual rows
• Making data mutable
Gaps exist when these properties
are needed simultaneously
5© Cloudera, Inc. All rights reserved.
Changing hardware landscape
• Spinning disk -> solid state storage
• NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and
1.5GB/sec write throughput, at a price of less than $3/GB and dropping
• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant:
• 64->128->256GB over last few years
• Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t
designed with CPU efficiency in mind
• Takeaway 2: Column stores are feasible for random access
6© Cloudera, Inc. All rights reserved.
Kudu
Storage for Fast Analytics on Fast Data
• New updating column store for Hadoop
• Simplifies the architecture for building analytic
applications on changing data
• Designed for fast analytic performance
• Natively integrated with Hadoop
• Donated as incubating project at
Apache Software Foundation
• Beta now available
STRUCTURED
Sqoop
UNSTRUCTURED
Kafka, Flume
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT
YARN
SECURITY
Sentry, RecordService
FILESYSTEM
HDFS
RELATIONAL
Kudu
NoSQL
HBase
STORE
INTEGRATE
BATCH
Spark, Hive, Pig
MapReduce
STREAM
Spark
SQL
Impala
SEARCH
Solr
SDK
Kite
7© Cloudera, Inc. All rights reserved.
• High throughput for big scans (columnar
storage and replication)
Goal: Within 2x of Parquet
• Low-latency for short accesses (primary key
indexes and quorum design)
Goal: 1ms read/write on SSD
• Database-like semantics (initially single-row
ACID)
• Relational data model
• SQL query
• “NoSQL” style scan/insert/update (Java client)
Kudu design goals
8© Cloudera, Inc. All rights reserved.
Kudu basic design
• Apache-licensed open source software
• Structured data model
• Basic construct: tables
• Tables broken down into tablets (roughly equivalent to partitions)
• Architecture supports geographically disparate, active/active systems
• Not the initial design goal
9© Cloudera, Inc. All rights reserved.
What Kudu is not
• Not a SQL interface
• Just the storage layer
• “BYO SQL”
• Not a file system
• Data must have tabular structure
• Not an application that runs on HDFS
• An alternative, native Hadoop storage engine
• Not a replacement for HDFS or HBase
• Select the right storage for the right use case
• Cloudera will continue to support and invest in all three
10© Cloudera, Inc. All rights reserved.
Kudu data model
• Tables have a RDBMS-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
• Some subset of columns makes up a primary key
• Fast random reads/writes by primary key
• No secondary indexes (yet)
• Columnar layout on disk
• Lazy materialization
• Encoding and compression options
10
11© Cloudera, Inc. All rights reserved.
Table partitioning
• Hash bucketing
• Distribute records by hash of partition column(s)
• N buckets leads to N tablets
• Range partitioning
• Distribute records by ranges of the partition column(s)
• N split keys leads to N tablets
• Can be a mix for different columns of the primary key
12© Cloudera, Inc. All rights reserved.
Consistency model
• Consistency and replication enforced by Raft consensus (similar to Paxos)
• Replication by operation not data
• Single-row transactions now
• Multi-row transactions later
• Geo-distributed replicas will be possible under strict time synchronization
• Techniques drawn from Google Spanner and others
13© Cloudera, Inc. All rights reserved.
Kudu interfaces
• NoSQL-style APIs
• Insert(), Update(), Delete(), Scan()
• Java and C++ now
• Python soon
• Integrations with MapReduce, Spark, and Impala
• No direct access to underlying Kudu tablet files
• Beta does not have authentication, authorization, encryption
14© Cloudera, Inc. All rights reserved.
Impala integration
• Opens up Kudu to JDBC/ODBC clients
• Intuitive way to get data into Kudu
• INSERT INTO kudu SELECT * FROM csv;
• Additional commands
• UPDATE
• DELETE
• Efficient INSERT VALUES
• Runs on the Kudu C++ client
15© Cloudera, Inc. All rights reserved.
Performance characteristics
• Very CPU efficient
• Written in modern C++, uses specialized CPU instructions, JIT compilation
• Latency mostly driven by storage hardware capabilities
• Expect sub-millisecond response on SSDs and upcoming technologies
• No garbage collection allows very large memory footprint with no pauses
• Bloom filters reduce the need for many disk accesses
16© Cloudera, Inc. All rights reserved.
Operating Kudu
• Easiest through Cloudera Manager integration
• Separate parcel for now
• Kudu is always compacting
• No minor vs. major compaction
• No compaction latency spikes
• Web UI is full of metrics and logs
17© Cloudera, Inc. All rights reserved.
Cluster layout
• One or multiple masters
• Only one in current beta
• Low CPU and memory impact
• One tablet server per worker node
• Can share disks with HDFS
• One SSD per worker node just for Kudu WAL can speed up writes
• No dependencies on other Hadoop ecosystem components
• But interfacing components like Impala or Spark do
18© Cloudera, Inc. All rights reserved.
Real-time analytics in Hadoop today
Merging in new data = storage complexity
Downsides:
● Multiple storage layers
● Latest data is hidden
● Files are messy
● Complex to do updates
without breaking running
queriesNew Partition
Most Recent Partition
Historic Data
HBase
Parquet
File
Have we
accumulated
enough data?
Reorganize
HBase file
into Parquet
• Wait for running operations to complete
• Define new Impala partition referencing
the newly written Parquet file
Incoming Data
(Messaging
System)
Reporting
Request
HDFS + Impala
19© Cloudera, Inc. All rights reserved.
Real-time analytics in Hadoop with Kudu
Improvements:
● One system to operate
● No schedules or
background processes
● Handle late arrivals or data
corrections with ease
● New data available
immediately for analytics
or operations
Historical and Real-time
Data
Incoming Data
(Messaging
System)
Reporting
Request
Kudu + Impala
20© Cloudera, Inc. All rights reserved.
Kudu for data warehousing
• Near real time data visibility
• BI tools can display events that happened seconds earlier
• Excellent for star schemas
• Fast scans of deep fact tables
• Efficient wide fact tables
• Simplified updates of slowly changing dimensions
21© Cloudera, Inc. All rights reserved.
Near real time data warehousing on Kudu
Files
RDBMS
Streams
K
A
F
K
A
K
U
D
U
IMPALA
HUE
BI tools
User
FLUME
SPARK
STREAMING
Simple
Complex
22© Cloudera, Inc. All rights reserved.
Resources
Join the community
http://getkudu.io
kudu-user@googlegroups.com
Download the beta
cloudera.com/downloads
Read the whitepaper
getkudu.io/kudu.pdf
23© Cloudera, Inc. All rights reserved.
Thank You
Dan Alecia
Regional Vice President, Sales
(203) 275-6534
dan.alecia@casertaconcepts.com

Más contenido relacionado

La actualidad más candente

cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data PlatformRakuten Group, Inc.
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Dataconomy Media
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architectureMartinStrycek
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesNacho García Fernández
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 

La actualidad más candente (20)

Kudu demo
Kudu demoKudu demo
Kudu demo
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 

Similar a Kudu: Storage for Fast Analytics on Fast Data

Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopMike Pittaro
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Timothy Spann
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Kathleen Ting
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFSUSE Italy
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 

Similar a Kudu: Storage for Fast Analytics on Fast Data (20)

Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Optimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for HadoopOptimizing Dell PowerEdge Configurations for Hadoop
Optimizing Dell PowerEdge Configurations for Hadoop
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 

Más de Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 

Más de Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Kudu: Storage for Fast Analytics on Fast Data

  • 1. 1© Cloudera, Inc. All rights reserved. Presented by: An Introduction to & Big Data Warehousing Meetup December 9, 2015
  • 2. 2© Cloudera, Inc. All rights reserved. Introducing Kudu Jeremy Beard | Senior Solutions Architect December 2015
  • 3. 3© Cloudera, Inc. All rights reserved. Presenter • Jeremy Beard • Senior Solutions Architect at Cloudera • Three years in big data • Six years in data warehousing • jeremy@cloudera.com
  • 4. 4© Cloudera, Inc. All rights reserved. Current storage landscape in Hadoop HDFS excels at: • Efficiently scanning large amounts of data • Accumulating data with high throughput HBase excels at: • Efficiently finding and writing individual rows • Making data mutable Gaps exist when these properties are needed simultaneously
  • 5. 5© Cloudera, Inc. All rights reserved. Changing hardware landscape • Spinning disk -> solid state storage • NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping • 3D XPoint memory (1000x faster than NAND, cheaper than RAM) • RAM is cheaper and more abundant: • 64->128->256GB over last few years • Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind • Takeaway 2: Column stores are feasible for random access
  • 6. 6© Cloudera, Inc. All rights reserved. Kudu Storage for Fast Analytics on Fast Data • New updating column store for Hadoop • Simplifies the architecture for building analytic applications on changing data • Designed for fast analytic performance • Natively integrated with Hadoop • Donated as incubating project at Apache Software Foundation • Beta now available STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService FILESYSTEM HDFS RELATIONAL Kudu NoSQL HBase STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr SDK Kite
  • 7. 7© Cloudera, Inc. All rights reserved. • High throughput for big scans (columnar storage and replication) Goal: Within 2x of Parquet • Low-latency for short accesses (primary key indexes and quorum design) Goal: 1ms read/write on SSD • Database-like semantics (initially single-row ACID) • Relational data model • SQL query • “NoSQL” style scan/insert/update (Java client) Kudu design goals
  • 8. 8© Cloudera, Inc. All rights reserved. Kudu basic design • Apache-licensed open source software • Structured data model • Basic construct: tables • Tables broken down into tablets (roughly equivalent to partitions) • Architecture supports geographically disparate, active/active systems • Not the initial design goal
  • 9. 9© Cloudera, Inc. All rights reserved. What Kudu is not • Not a SQL interface • Just the storage layer • “BYO SQL” • Not a file system • Data must have tabular structure • Not an application that runs on HDFS • An alternative, native Hadoop storage engine • Not a replacement for HDFS or HBase • Select the right storage for the right use case • Cloudera will continue to support and invest in all three
  • 10. 10© Cloudera, Inc. All rights reserved. Kudu data model • Tables have a RDBMS-like schema • Finite number of columns (unlike HBase/Cassandra) • Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP • Some subset of columns makes up a primary key • Fast random reads/writes by primary key • No secondary indexes (yet) • Columnar layout on disk • Lazy materialization • Encoding and compression options 10
  • 11. 11© Cloudera, Inc. All rights reserved. Table partitioning • Hash bucketing • Distribute records by hash of partition column(s) • N buckets leads to N tablets • Range partitioning • Distribute records by ranges of the partition column(s) • N split keys leads to N tablets • Can be a mix for different columns of the primary key
  • 12. 12© Cloudera, Inc. All rights reserved. Consistency model • Consistency and replication enforced by Raft consensus (similar to Paxos) • Replication by operation not data • Single-row transactions now • Multi-row transactions later • Geo-distributed replicas will be possible under strict time synchronization • Techniques drawn from Google Spanner and others
  • 13. 13© Cloudera, Inc. All rights reserved. Kudu interfaces • NoSQL-style APIs • Insert(), Update(), Delete(), Scan() • Java and C++ now • Python soon • Integrations with MapReduce, Spark, and Impala • No direct access to underlying Kudu tablet files • Beta does not have authentication, authorization, encryption
  • 14. 14© Cloudera, Inc. All rights reserved. Impala integration • Opens up Kudu to JDBC/ODBC clients • Intuitive way to get data into Kudu • INSERT INTO kudu SELECT * FROM csv; • Additional commands • UPDATE • DELETE • Efficient INSERT VALUES • Runs on the Kudu C++ client
  • 15. 15© Cloudera, Inc. All rights reserved. Performance characteristics • Very CPU efficient • Written in modern C++, uses specialized CPU instructions, JIT compilation • Latency mostly driven by storage hardware capabilities • Expect sub-millisecond response on SSDs and upcoming technologies • No garbage collection allows very large memory footprint with no pauses • Bloom filters reduce the need for many disk accesses
  • 16. 16© Cloudera, Inc. All rights reserved. Operating Kudu • Easiest through Cloudera Manager integration • Separate parcel for now • Kudu is always compacting • No minor vs. major compaction • No compaction latency spikes • Web UI is full of metrics and logs
  • 17. 17© Cloudera, Inc. All rights reserved. Cluster layout • One or multiple masters • Only one in current beta • Low CPU and memory impact • One tablet server per worker node • Can share disks with HDFS • One SSD per worker node just for Kudu WAL can speed up writes • No dependencies on other Hadoop ecosystem components • But interfacing components like Impala or Spark do
  • 18. 18© Cloudera, Inc. All rights reserved. Real-time analytics in Hadoop today Merging in new data = storage complexity Downsides: ● Multiple storage layers ● Latest data is hidden ● Files are messy ● Complex to do updates without breaking running queriesNew Partition Most Recent Partition Historic Data HBase Parquet File Have we accumulated enough data? Reorganize HBase file into Parquet • Wait for running operations to complete • Define new Impala partition referencing the newly written Parquet file Incoming Data (Messaging System) Reporting Request HDFS + Impala
  • 19. 19© Cloudera, Inc. All rights reserved. Real-time analytics in Hadoop with Kudu Improvements: ● One system to operate ● No schedules or background processes ● Handle late arrivals or data corrections with ease ● New data available immediately for analytics or operations Historical and Real-time Data Incoming Data (Messaging System) Reporting Request Kudu + Impala
  • 20. 20© Cloudera, Inc. All rights reserved. Kudu for data warehousing • Near real time data visibility • BI tools can display events that happened seconds earlier • Excellent for star schemas • Fast scans of deep fact tables • Efficient wide fact tables • Simplified updates of slowly changing dimensions
  • 21. 21© Cloudera, Inc. All rights reserved. Near real time data warehousing on Kudu Files RDBMS Streams K A F K A K U D U IMPALA HUE BI tools User FLUME SPARK STREAMING Simple Complex
  • 22. 22© Cloudera, Inc. All rights reserved. Resources Join the community http://getkudu.io kudu-user@googlegroups.com Download the beta cloudera.com/downloads Read the whitepaper getkudu.io/kudu.pdf
  • 23. 23© Cloudera, Inc. All rights reserved. Thank You Dan Alecia Regional Vice President, Sales (203) 275-6534 dan.alecia@casertaconcepts.com