Fast Time-Series
Analytics with
HBase and R
Big Data Bootcamp, Austin
February 19, 2015
Copyright © 2015 Accenture All rights reserved.
2
Aaron Benz
@aaronbenz33
Aaron is currently a Data Scientist and Big Data
Modeler at Accenture. He has extensive experience in
R, Cassandra, and HBase, particularly in time-series
analysis, discovery, and modeling.
Aaron graduated from the Templeton Honors College
at Eastern University with a Bachelor's degree in Mathematics.
While there, he played lacrosse and earned
All-America and Academic All-America honors.
About the Speaker
Copyright © 2015 Accenture All rights reserved.
3
Fast Time-Series Analytics with HBase and R
Brief Introduction to NoSQL and HBase
Use Case and Data Model
Other Considered Solutions (in Hadoop)
Working with HBase in R
Live Demo
Copyright © 2015 Accenture All rights reserved.
4
“Use Apache HBase when you need random, realtime read/write access to
your Big Data. This project's goal is the hosting of very large tables -- billions of
rows X millions of columns -- atop clusters of commodity hardware. Apache
HBase is an open-source, distributed, versioned, non-relational database
modeled after Google's Bigtable... Just as Bigtable leverages the distributed data
storage provided by the Google File System, Apache HBase provides Bigtable-like
capabilities on top of Hadoop and HDFS.”
-  From http://hbase.apache.org/
What is HBase?
Copyright © 2015 Accenture All rights reserved.
5
Like I Said, What is HBase?
A Map (at its core)
•  Key-value pairs (think dictionary in Python or hash in Ruby)
Persistent and Consistent
•  Strictly consistent Read & Write protocol
Distributed
•  Uses HDFS for replication across the Hadoop cluster
Sorted
•  Stores rows in sorted lexicographic (byte) order by rowkey
Multi-Dimensional
•  Helpful to think of a map of maps: each key can hold many different
key-value pairs (key-column-value pairs), as in the sketch below
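A toy analogy of that mental model in plain R (nested named lists, no HBase
involved): outer names stand in for rowkeys, inner names for columns.

hbase_like <- list(
  "rowkey 2" = list("column d" = "record 2d", "column e" = "record 2e"),
  "rowkey 1" = list("column a" = "record 1a", "column b" = "record 1b")
)

# "Sorted" means HBase keeps rowkeys in byte order; mimic that here
hbase_like <- hbase_like[order(names(hbase_like))]

# Lookup is rowkey -> column -> value, like a Get in HBase
hbase_like[["rowkey 1"]][["column b"]]
# [1] "record 1b"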
Copyright © 2015 Accenture All rights reserved.
6
Within the Hadoop ecosystem, HBase coordinates and scales both data storage
and processing.
HBase atop HDFS
HDFS
(Distributed File System)
MapReduce
(Data/Batch Processing)
Pig
(Dataflow)
Hive
(SQL)
Zookeeper
(Co-ordination)
HBase
(BigTable)
Hadoop Core
Related Project
YARN
(Distributed Resource Management)
Hadoop Ecosystem
Copyright © 2015 Accenture All rights reserved.
7
HBase: Multi-Dimensional Key-Column-Value Pairs
rowkey 1 column a column b column c
record 1a record 1b record 1c
rowkey 2 column d column e column f
record 2d record 2e record 2f
Timestamp is also available as an additional dimension
Copyright © 2015 Accenture All rights reserved.
8
SQL to NoSQL Transformation
NoSQL Columnar Store (one row per album; rowkey -> column:value pairs):

C2 & the Brothers Reed:Hot Mess            track:1 = mollie | track:2 = the line | track:3 = introducing | year = 2013
C2 & the Brothers Reed:Weigh Station Tour  track:1 = revolution | track:2 = sicko | track:3 = smokin gun | year = 2014

(SQL Store: the same albums and tracks held as normalized relational tables.)
Copyright © 2015 Accenture All rights reserved.
Use Case and Data Model
9
Copyright © 2015 Accenture All rights reserved.
10
Data
•  Hundreds of thousands of machines
•  Hundreds of time-series metrics, per
machine
•  Each metric reports on its own interval
•  Altogether, millions of time series
datasets
Use Case
Requirements
•  Fast reads and writes of datasets on
the fly
•  Calculate new time-series variables
•  Perform batch calculations across all
datasets
-  Use parallelization to improve
performance
•  Quickly create time series plots on
demand
•  Do all of this in R
Copyright © 2015 Accenture All rights reserved.
11
•  Trucks are equipped with sensors,
measuring
•  Gear
•  Engine RPM
•  Vehicle Speed
•  Metrics are reported across time
•  Inconsistent time indices across
different metrics and vehicles
Use Case: Airport Ground Support Vehicles
Copyright © 2015 Accenture All rights reserved.
12
•  The higher levels determine directory structure
•  The lowest level is the dataset level
•  Each dataset file has a unique hierarchical filepath
airport/date/vehicle/metric
Data Hierarchy
»  airport
»  date
»  vehicle
»  metric
timestamp metric
1393681805 996.183209557728
1393681805.2 1117.07736425362
1393681805.4 1537.63856473972
1393681805.5 1547.38554623813
1393681805.7 1499.64107879763
1393681805.9 2016.95743109826
1393681806 2224.40342820991
... …
dataset
Copyright © 2015 Accenture All rights reserved.
13
Data Hierarchy
Copyright © 2015 Accenture All rights reserved.
14
Requirements
•  Fast reads and writes of datasets on the fly
•  A data model that easily translates from the hierarchical nature of the data
•  Cost-effective scalability
•  Ability to access data store directly from R (rhbase)
HBase Fulfills Our Requirements for Data Storage
Copyright © 2015 Accenture All rights reserved.
15
•  The directory names make up the rowkey design
•  The dataset names make up the column names
•  Datasets are stored as blobs at the record level
Our NoSQL Data Model
rowkey (airport::date::vin)    columns (metric)
jfk::20140301::1CKPH77         f:gear | f:rpm | f:speed
lax::20140302::1CWPJ732        f:gear | f:rpm | f:speed

Records are whole datasets. Let your query patterns inform rowkey design!
NoSQL offers less flexibility in how you query your data.
(A sketch for deriving rowkeys from the filepaths follows.)
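A minimal sketch of that derivation (the filepath matches the loading example
later in this deck): split the hierarchical path airport/date/vehicle/metric
into a rowkey and a column name.

path_to_key <- function(filepath) {
  parts <- strsplit(filepath, "/")[[1]]
  list(
    rowkey = paste(parts[1:3], collapse = "::"),  # airport::date::vin
    column = sub("\\.csv$", "", parts[4])         # metric name
  )
}

path_to_key("jfk/20140301/1CKPH7747ZK277944/gear.csv")
# $rowkey
# [1] "jfk::20140301::1CKPH7747ZK277944"
# $column
# [1] "gear"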
Copyright © 2015 Accenture All rights reserved.
16
•  Thrift API to HBase
•  Needed for R connection to HBase
•  Size of records
•  By default, the client keyvalue max size (hbase.client.keyvalue.maxsize) is 10 MB
•  Load Balancer
•  Hotspotting
•  Large amount of traffic is directed at only one node
•  Problems arise when sequential rowkeys are written to HBase
•  Try salting or hashing the rowkey to avoid hotspotting (a sketch follows)
HBase Technical Considerations
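A minimal salting sketch (our own illustration, not an rhbase feature):
prefix each rowkey with a deterministic bucket so sequential writes spread
across region servers instead of hammering one node.

n_buckets <- 8

salt_key <- function(rowkey) {
  bucket <- sum(utf8ToInt(rowkey)) %% n_buckets   # cheap deterministic hash
  paste0(sprintf("%02d", bucket), "::", rowkey)   # e.g. "03::jfk::20140301::..."
}

salt_key("jfk::20140301::1CKPH7747ZK277944")

The tradeoff: a prefix scan now has to issue one scan per bucket, so keep the
bucket count small.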
Copyright © 2015 Accenture All rights reserved.
Other Considered Solutions (in Hadoop)
17
Copyright © 2015 Accenture All rights reserved.
18
Hive is the relational data warehouse of the Hadoop framework
•  A Hive solution: A wide table created from the individual time series datasets
Hive: The Wrong Tool
airport date vehicle timestamp gear rpm speed
jfk 20140301 1CKPH7747ZK277944 1393679153.0 1 1080.477408 0
jfk 20140301 1CKPH7747ZK277944 1393679153.2 NA 1238.601102 8
jfk 20140301 1CKPH7747ZK277944 1393679153.4 NA NA 14
jfk 20140301 1CKPH7747ZK277944 1393679153.5 1 1611.751439 20
jfk 20140301 1CKPH7747ZK277944 1393679153.7 NA 1871.309458 24
… … … … … … …
lax 20140302 1CHMP60486H283129 1393680375.0 1 1816.800001 0
lax 20140302 1CHMP60486H283129 1393680375.2 NA 1940.41125 7
lax 20140302 1CHMP60486H283129 1393680375.4 NA 1877.631061 15
lax 20140302 1CHMP60486H283129 1393680375.5 1 2491.360822 NA
lax 20140302 1CHMP60486H283129 1393680375.7 NA 2200.494979 20
… … … … … … …
Copyright © 2015 Accenture All rights reserved.
19
Hive is the wrong solution for our use case
•  Use Case: Data retrieval, with operations happening in R
•  Simply selecting certain columns and rows requires a MapReduce job
•  Partitioning not feasible (many small files)
•  Impala or Shark would remove the MapReduce overhead, but a relational store is
still not a great fit for pure dataset retrieval
Hive: The Wrong Tool
airport date vehicle timestamp gear rpm speed
jfk 20140301 1CKPH7747ZK277944 1393679153.0 1 1080.477408 0
jfk 20140301 1CKPH7747ZK277944 1393679153.2 NA 1238.601102 8
jfk 20140301 1CKPH7747ZK277944 1393679153.4 NA NA 14
jfk 20140301 1CKPH7747ZK277944 1393679153.5 1 1611.751439 20
jfk 20140301 1CKPH7747ZK277944 1393679153.7 NA 1871.309458 24
… … … … … … …
lax 20140302 1CHMP60486H283129 1393680375.0 1 1816.800001 0
lax 20140302 1CHMP60486H283129 1393680375.2 NA 1940.41125 7
lax 20140302 1CHMP60486H283129 1393680375.4 NA 1877.631061 15
lax 20140302 1CHMP60486H283129 1393680375.5 1 2491.360822 NA
lax 20140302 1CHMP60486H283129 1393680375.7 NA 2200.494979 20
… … … … … … …
SELECT	
  airport,	
  date,	
  	
  
vehicle,	
  timestamp,	
  rpm	
  
FROM	
  ground_support	
  
WHERE	
  date	
  =	
  '20140301'	
  
AND	
  vehicle	
  LIKE	
  '1CKPH77%'	
  
Copyright © 2015 Accenture All rights reserved.
20
What about storing individual dataset files directly on HDFS?
•  Hadoop’s problem with many small files
•  Every file is held as an object in the namenode’s memory (about 150 bytes each)
•  With enough files, that starts to add up
•  HDFS is not designed for fast access to small files
•  Designed for streaming access of large files
•  Possible solutions
•  HAR files
•  Sequence files
•  HBase
•  HBase consistently outperformed HDFS in our comparisons
HDFS Storage: Hadoop’s Small Files Problem
Copyright © 2015 Accenture All rights reserved.
21
•  Add timestamp to the rowkey
•  Singular observations held in the record
•  Scan over rowkey prefix (airport::date::vin) to retrieve whole dataset
Poor Application of HBase
•  Explosion in the number of rowkeys to be retrieved at one time
•  A dataset’s values are no longer collocated in a single record, so the scan becomes more costly
•  HBase is optimized for fast lookups of one or a few rowkeys—not thousands
HBase: A Really Bad Data Model
rowkey: jfk::20140301::1CKPH77::1393679153
columns: gear | rpm       | speed
records: 1    | 1080.4774 | 0
(... one such row for every timestamp)
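To see the scale of the problem, some rough arithmetic (assuming the roughly
5 Hz sampling visible in the sample data, and rounding the fleet to 100,000
vehicles):

obs_per_sec  <- 5
secs_per_day <- 24 * 60 * 60
vehicles     <- 1e5                      # "hundreds of thousands of machines"

obs_per_sec * secs_per_day               # 432,000 rowkeys per vehicle per day
obs_per_sec * secs_per_day * vehicles    # 4.32e10 rowkeys per day overall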
Copyright © 2015 Accenture All rights reserved.
22
Wide Row Data Model
•  Add metric to the rowkey
•  Store timestamps in the column name
•  Singular observations held in the record
Issues
•  Additional time to transpose and create the data.table object in R (see the sketch below)
•  The added ability to filter within a specific time series is not a priority for this use case
HBase: Common Uncompressed Wide Row Model
rowkey: jfk::20140301::1CKPH77::rpm
columns: 1393679153 | 1393679153.2 | 1393679153.4 | ...
records: 1080.4774  | 1238.6011    | 1611.7514    | ...
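The extra reshaping this model forces on the R side, as a sketch (assuming a
scan hands each wide row back as a named vector whose names are timestamps):

library(data.table)

wide_row <- c("1393679153"   = 1080.4774,
              "1393679153.2" = 1238.6011,
              "1393679153.4" = 1611.7514)

# transpose the column-names-as-timestamps back into a two-column dataset
dt <- data.table(timestamp = as.numeric(names(wide_row)),
                 rpm       = unname(wide_row))
setkey(dt, timestamp)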
Copyright © 2015 Accenture All rights reserved.
Working with HBase in R
23
Copyright © 2015 Accenture All rights reserved.
24
•  Part of Revolution Analytics’ free RHadoop suite of packages
•  Uses the Thrift API to connect to HBase
rhbase: Working with HBase in R
HBase Shell Command                            rhbase Function   Description
list                                           hb.list.tables    Shows all tables
scan [table]                                   hb.scan           Scans a table to retrieve all rows over a
                                                                 specified range of rowkeys
create [table], [cf1], ...                     hb.new.table      Creates a new table with some column
                                                                 families
get [table], [key] (, [cf:col])                hb.get            Gets a single row, or specific records when
                                                                 column names are specified
put [table], [key], [cf:col], [val] (, [ts])   hb.insert         Writes a value into the table at a specified
                                                                 key, column family, and column
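A quick sketch of that correspondence (argument names are from the stock
rhbase package, so check them against your installed version; assumes a
running HBase Thrift gateway):

library(rhbase)

hb.init()         # connects via Thrift (localhost:9090 by default)
hb.list.tables()  # shell: list

# shell: get 'ground_support', 'jfk::20140301::1CKPH7747ZK277944'
hb.get("ground_support", "jfk::20140301::1CKPH7747ZK277944")

# shell: scan 'ground_support', {STARTROW => 'jfk', STOPROW => 'jfl'}
iter <- hb.scan("ground_support", startrow = "jfk", end = "jfl")
while (length(row <- iter$get(1)) > 0) print(row)
iter$close()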
Copyright © 2015 Accenture All rights reserved.
25
Custom additions to rhbase were made to promote tidy-data principles and the
use of data.tables
•  hb.put
•  A wrapper around hb.insert for easier inputting
•  By default, stores values as byte arrays using R’s native serializer
•  You can specify a different serializer: raw, character, or a custom function
•  hb.pull
•  A wrapper around hb.scan for easier retrieval
•  By default, unserializes the values to their original R object
•  Presents retrieved data in a much nicer format than the original hb.scan
Custom Additions to rhbase
Copyright © 2015 Accenture All rights reserved.
26
require(rhbase)
require(data.table)

hb.init()
table_name <- "ground_support"
hb.new.table(table_name, "f")

dataset <- fread("jfk/20140301/1CKPH7747ZK277944/gear.csv")

rowkey <- "jfk::20140301::1CKPH7747ZK277944"
column <- "gear"

hb.put(table_name = table_name, column_family = "f",
       rowkey = rowkey, column = column, value = dataset)
Loading Data in HBase
•  Loading R objects into HBase is easy with rhbase and the hb.put wrapper
•  Let the data’s nested directory structure inform the rowkey and column names
Copyright © 2015 Accenture All rights reserved.
27
hb.pull("ground_support", "test", start = "LAX::20140306",
        end = "LAXa", columns = "speed", batchsize = 100)

##                              rowkey column_family column       values
## 1: LAX::20140306::1CHMP60486H283129          test  speed <data.table>
## 2: LAX::20140306::1FAJL35763D392894          test  speed <data.table>
## 3: LAX::20140306::1FJAL1998VS238734          test  speed <data.table>
## 4: LAX::20140306::1FSMZ91563C548451          test  speed <data.table>
## 5: LAX::20140307::1CHMP60486H283129          test  speed <data.table>
## 6: LAX::20140307::1FAJL35763D392894          test  speed <data.table>
## 7: LAX::20140307::1FJAL1998VS238734          test  speed <data.table>
## 8: LAX::20140307::1FSMZ91563C548451          test  speed <data.table>
Retrieving Data from HBase
Perform a scan over LAX vehicles on 2014-03-06, and retrieve the speed datasets
for those vehicles
Copyright © 2015 Accenture All rights reserved.
28
Use these techniques to get even faster results
•  Serialized objects > text/blob representations (round-trip sketch below)
•  On read, there is no overhead to parse a text object back into an R object
•  Just unserialize it
•  Use data.table
•  Much faster aggregation and manipulation of datasets
•  Smarter memory usage
•  Reduce the number of points to plot
•  Not always necessary to plot every point to get the right picture
•  A customized algorithm drops the right points (so the picture stays the same)
•  See timeseriesr package @ github.com/accenture/timeseries
Additional Optimizations in R
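A minimal round-trip sketch of the serialization point: store datasets as
serialized R objects so a read is a single unserialize(), with no text
parsing or type guessing.

library(data.table)

dataset <- data.table(timestamp = c(1393681805, 1393681805.2),
                      rpm       = c(996.1832, 1117.0774))

blob     <- serialize(dataset, connection = NULL)  # raw bytes for an HBase cell
restored <- unserialize(blob)                      # one call back to a data.table
all.equal(restored, dataset)                       # TRUE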
Copyright © 2015 Accenture All rights reserved.
Live Demo
29
Copyright © 2015 Accenture All rights reserved.
30
Batch Calculations
•  Pull in datasets, and either
•  Derive new calculated time series metrics, and store those new datasets in HBase
•  Calculate aggregate functions over datasets, and collect these values in an aggregated table
•  Performing such tasks over millions of datasets will take time
•  Suggested R packages for scale (see the sketch below)
•  parallel
•  rmr2
•  With rmr2, ensure that each map task scans in all the rows it needs to
perform its calculation
Batch Calculations
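A sketch of one batch pattern with the parallel package (hb.pull and the "f"
column family are as used earlier in this deck; we also assume each pulled
value is a data.table with a speed column). Pull first, then fan the
aggregation across cores, since forked workers should not share one Thrift
socket.

library(parallel)
library(rhbase)
library(data.table)

hb.init()
rows <- hb.pull("ground_support", "f", start = "lax::20140302",
                end = "lax::20140302a", columns = "speed", batchsize = 100)

max_speeds <- mclapply(rows$values,
                       function(dt) max(dt$speed, na.rm = TRUE),
                       mc.cores = 4)

data.table(rowkey = rows$rowkey, max_speed = unlist(max_speeds))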
Copyright © 2015 Accenture All rights reserved.
31
If Python is your language of choice…
happybase
•  Similar functionality to the HBase shell and rhbase
•  Uses Thrift to connect to HBase
•  Added capability to perform batch puts (also offered in the HBase shell)
•  Well-developed and helpful documentation
http://happybase.readthedocs.org/en/latest/
Python’s happybase
Copyright © 2015 Accenture All rights reserved.
32
•  HBase is a great tool if you’re looking for fast reads and writes of your data
•  Let your query patterns inform your rowkey design
•  NoSQL sacrifices flexibility for speed
•  Both Python and R have well-developed connectors to HBase
•  Store your data in a format that can be quickly unserialized for faster reads
Conclusion
Copyright © 2015 Accenture All rights reserved.
33
Apache HBase Documentation: http://hbase.apache.org/
Revolution Analytics RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
Accenture rhbase fork: https://github.com/Accenture/rhbase
Accenture timeseries: https://github.com/Accenture/timeseries
RStudio Shiny: http://shiny.rstudio.com/
rhbase Tutorial: https://github.com/aaronbenz/rhbase/tree/master/examples
Shiny Example: https://github.com/aaronbenz/hbaseVhive
References
