Overview of Stinger: Interactive Query for Hive

@ddkaiser
linkedin.com/in/dkaiser
slideshare.net/ddkaiser
dkaiser@cdk.com
dkaiser@hortonworks.com

OC Big Data Meetup #1
May 21, 2014
David Kaiser

Who Am I?
David Kaiser

20+ years experience with Linux
3 years experience with Hadoop

Career experiences:
•  Data Warehousing
•  Geospatial Analytics
•  Open-source Solutions and Architecture

Employed at Hortonworks as a Senior Solutions Engineer

Overview of Stinger: Interactive Query for Hive
• Abstract:
– Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been many new approaches for increased application performance.
– Hive is the most used SQL implementation on Hadoop.
– Hive provides the broadest SQL compatibility on Hadoop.
– But… Hive is Slow.
– Hive WAS Slow.
– This talk will discuss the Stinger initiative, which improved Hive performance by more than 100x.

Stinger Project (announced February 2013)
Batch AND Interactive SQL-in-Hadoop

Stinger Initiative
A broad, community-based effort to drive the next generation of Hive

Goals:
•  Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
•  Scale: The only SQL processing in Hadoop designed for queries that scale from TB to PB
•  SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

Hive 0.11, May 2013:
•  Base Optimizations
•  SQL Analytic Functions
•  ORCFile, Modern File Format

Hive 0.12, October 2013:
•  VARCHAR, DATE Types
•  ORCFile predicate pushdown
•  Advanced Optimizations
•  Performance Boosts via YARN

Hive 0.13, April 2014:
•  Hive on Apache Tez
•  Query Service
•  Buffer Cache
•  Cost Based Optimizer (Optiq)
•  Vectorized Processing

An Open Community at its finest: Apache Hive Contribution
1,672 Jira Tickets Closed
145 Developers
44 Companies
~400,000 Lines Of Code Added…
13 Months

Outcomes from the Stinger Project
Feature | Description | Benefit
Tez Integration | Tez is significantly better engine than MapReduce | Latency
Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
ORC File | Columnar, type aware format with indices | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics including histograms etc. | Latency
Hive as a Service | Leaves engine running between sessions | Latency
Buffer Cache | Leaves most used HDFS file blocks in memory | Latency

Hadoop 2: Moving Past MapReduce
HADOOP 1.0: Single Use System (Batch Apps)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
•  MapReduce (data processing), Others
•  YARN (cluster resource management)
•  HDFS2 (redundant, highly-available & reliable storage)

Apache Tez as the new Primitive
MapReduce as Base (HADOOP 1.0):
•  Pig (data flow), Hive (sql), Others (Cascading)
•  MapReduce (cluster resource management & data processing)
•  HDFS (redundant, reliable storage)

Apache Tez as Base (HADOOP 2.0):
•  Data Flow: Pig
•  SQL: Hive
•  Others (Cascading)
•  Batch: MapReduce
•  Online Data Processing: HBase, Accumulo
•  Real Time Stream Processing: Storm
•  Tez (execution engine)
•  Slider (continuous execution)
•  YARN (cluster resource management)
•  HDFS2 (redundant, reliable storage)

Complete Open Source Stack
•  YARN is the logical extension of Apache Hadoop
–  Complements HDFS, the data reservoir
•  Resource Management for the Enterprise Data Lake
–  Shared, secure, multi-tenant Hadoop
Allows for all processing in Open-Source Hadoop

[Figure: processing engines layered on YARN and HDFS2]
•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  STREAMING (Storm, S4, …)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  ONLINE (HBase)
•  OTHER (Search) (Weave…)
•  YARN (Cluster Resource Management)
•  HDFS2 (Redundant, Reliable Storage)

Hive On Tez - Execution
Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput

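The pre-launch and re-use rows come down to how often container startup cost is paid. A minimal sketch of that accounting, assuming a single sequential worker and made-up costs (2.0 s startup, 0.5 s of work per task); the function names and every number here are hypothetical and are not measurements from Tez:

```python
def time_with_fresh_containers(num_tasks: int, startup_s: float, work_s: float) -> float:
    """Every task launches its own container, so startup is paid per task."""
    return num_tasks * (startup_s + work_s)


def time_with_reuse(num_tasks: int, startup_s: float, work_s: float,
                    prelaunched: bool = False) -> float:
    """One container picks up task after task; startup is paid at most once.

    With prelaunched=True the container is assumed to be hot already,
    so the query itself pays no startup cost at all.
    """
    startup = 0.0 if prelaunched else startup_s
    return startup + num_tasks * work_s


# Illustrative (hypothetical) costs: 10 tasks, 2.0 s startup, 0.5 s work each.
fresh = time_with_fresh_containers(10, 2.0, 0.5)
reused = time_with_reuse(10, 2.0, 0.5)
hot = time_with_reuse(10, 2.0, 0.5, prelaunched=True)
print(f"fresh containers: {fresh} s, re-use: {reused} s, pre-launched + re-use: {hot} s")
```

The model ignores parallelism and scheduling, so it only shows the direction of the effect: the more short tasks a query has, the more the per-task startup cost dominates.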
ORC File Advantages
Sustained Query Times
Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale

Smaller Footprint
Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.

•  Larger Block Sizes
•  Columnar format arranges columns adjacent within the file for compression & fast access

File Size Comparison Across Encoding Methods
Dataset: TPC-DS Scale 500 Dataset
Encoding | File Size | Reduction
Text | 585 GB | (Original Size)
RCFile | 505 GB | 14% Smaller
Parquet (Impala) | 221 GB | 62% Smaller
ORCFile (Hive 12) | 131 GB | 78% Smaller

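The percentages in the file size comparison follow directly from the listed sizes. A small sketch of that arithmetic, assuming the sizes pair with the formats as Text 585 GB, RCFile 505 GB, Parquet 221 GB and ORCFile 131 GB; the names `SIZES_GB` and `percent_smaller` are illustrative only:

```python
# Sizes in GB for the TPC-DS scale 500 dataset; the pairing of size to
# format is an assumption based on the order of the slide's labels.
SIZES_GB = {"Text": 585, "RCFile": 505, "Parquet": 221, "ORCFile": 131}


def percent_smaller(size_gb: float, baseline_gb: float) -> int:
    """Size reduction relative to the baseline, rounded to a whole percent."""
    return round(100 * (1 - size_gb / baseline_gb))


baseline = SIZES_GB["Text"]  # plain text is the "original size"
reductions = {name: percent_smaller(size, baseline) for name, size in SIZES_GB.items()}

for name, pct in reductions.items():
    print(f"{name}: {SIZES_GB[name]} GB ({pct}% smaller than text)")
```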
ORCFile File Format
•  Query-Optimized: Split-able, columnar storage file
•  Efficient Reads: Break into large “stripes” of data for efficient read
•  Fast Filtering: Built in index, min/max, metadata for fast filtering blocks - bloom filters if desired
•  Efficient Compression: Decompose complex row types into primitives: massive compression and efficient comparisons for filtering
•  Precomputation: Built in aggregates per block (min, max, count, sum, etc.)
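The "Fast Filtering" and "Precomputation" points can be illustrated with a toy model: keep min/max/count/sum for each block of rows, then skip any block whose statistics rule out a match. This is a sketch under stated assumptions, not ORC's actual reader or file layout; `Stripe`, `build_stripes` and `scan_greater_than` are hypothetical names, and the 100-row stripe size is chosen only to keep the example small:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for one block of a single integer column plus its statistics."""
    values: List[int]
    minimum: int
    maximum: int
    count: int
    total: int


def build_stripes(column: List[int], stripe_rows: int) -> List[Stripe]:
    """Split a column into fixed-size stripes and precompute per-stripe aggregates."""
    stripes = []
    for start in range(0, len(column), stripe_rows):
        chunk = column[start:start + stripe_rows]
        stripes.append(Stripe(chunk, min(chunk), max(chunk), len(chunk), sum(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.maximum <= threshold:
            continue  # the max statistic proves no row in this stripe can match
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


prices = list(range(1, 1001))  # sorted input gives tight min/max ranges
stripes = build_stripes(prices, stripe_rows=100)
matches, stripes_read = scan_greater_than(stripes, threshold=950)
total_rows = sum(s.count for s in stripes)  # answered from statistics alone
print(f"{len(matches)} matches, {stripes_read} of {len(stripes)} stripes read, {total_rows} rows")
```

Skipping works best when the data is sorted or clustered on the filtered column; on randomly ordered data most stripes would span the threshold and still need to be read.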
  
	
  
A Journey to SQL Compliance
Evolution of SQL Compliance in Hive

SQL Datatypes:
•  INT/TINYINT/SMALLINT/BIGINT
•  FLOAT/DOUBLE
•  BOOLEAN
•  ARRAY, MAP, STRUCT, UNION
•  STRING
•  BINARY
•  TIMESTAMP
•  DECIMAL
•  DATE
•  VARCHAR
•  CHAR
•  Interval Types

SQL Semantics:
•  SELECT, INSERT
•  GROUP BY, ORDER BY, HAVING
•  JOIN on explicit join key
•  Inner, outer, cross and semi joins
•  Sub-queries in the FROM clause
•  ROLLUP and CUBE
•  UNION
•  Standard aggregations (sum, avg, etc.)
•  Custom Java UDFs
•  Windowing functions (OVER, RANK, etc.)
•  Advanced UDFs (ngram, XPath, URL)
•  Sub-queries for IN/NOT IN, HAVING
•  JOINs in WHERE Clause
•  INSERT/UPDATE/DELETE

Legend: Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap

Tez – Execution Performance
•  Performance gains over Map Reduce
–  Eliminate replicated write barrier between successive computations.
–  Eliminate job launch overhead of workflow jobs.
–  Eliminate extra stage of map reads in every workflow job.
–  Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.

[Figure: Pig/Hive on MR compared with Pig/Hive on Tez]

Hive-on-MR vs. Hive-on-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Figure: Hive on MR runs the query as a chain of separate map (M) and reduce (R) jobs that hand intermediate results to each other through HDFS. Hive on Tez runs the same operators (SELECT a.state, c.itemId; SELECT b.id; JOIN(a, c); JOIN(a, b); GROUP BY a.state with COUNT(*) and AVG(c.price)) as a single graph of M and R tasks.]

Tez avoids unneeded writes to HDFS
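The difference between the two execution plans can be reduced to a count of intermediate results written to HDFS. A rough sketch, assuming one MapReduce job per stage and a hypothetical three-stage breakdown of the query; the stage list, the function name and the "zero intermediate writes" model for Tez are simplifications for illustration, not the plan Hive actually generates:

```python
from typing import List

# Hypothetical stage breakdown of the slide's query; the real plan that
# Hive produces may split the work differently.
STAGES: List[str] = ["JOIN(a, b)", "JOIN(a, c)", "GROUP BY a.state"]


def intermediate_hdfs_writes(stages: List[str], engine: str) -> int:
    """Count intermediate results materialized to HDFS between stages."""
    if engine == "mr":
        # One MapReduce job per stage: every job except the last hands
        # its output to the next job through HDFS.
        return max(len(stages) - 1, 0)
    if engine == "tez":
        # One DAG: this model assumes stages pass data to each other
        # directly, so nothing intermediate goes to HDFS.
        return 0
    raise ValueError(f"unknown engine: {engine}")


mr_writes = intermediate_hdfs_writes(STAGES, "mr")
tez_writes = intermediate_hdfs_writes(STAGES, "tez")
print(f"MapReduce: {mr_writes} intermediate HDFS writes, Tez: {tez_writes}")
```

Each avoided write also avoids HDFS replication of that data and the map-side read that the following job would need to load it again.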
  
Vectorization
• Rewrite all operations to operate on blocks of 1K+
records, rather than one record at a time
• Block is array of Java scalars, not Objects (eliminate
Objects – compounding GC gains over time)
• Avoids many function calls, CPU pipeline stalls
•  Size to fit in L1 cache, avoid cache misses
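The shape of the change described above can be sketched as follows: instead of visiting one row object at a time, the operator works on a column of primitive values in fixed-size batches. This is only a structural illustration; Python will not reproduce the cache and pipeline effects the slide refers to, and `BATCH_SIZE`, `sum_row_at_a_time` and `sum_vectorized` are hypothetical names rather than Hive classes:

```python
from array import array
from typing import Dict, List

BATCH_SIZE = 1024  # "blocks of 1K+ records"


def sum_row_at_a_time(rows: List[Dict[str, float]]) -> float:
    """Baseline: every row is its own object, handled one at a time."""
    total = 0.0
    for row in rows:
        total += row["price"]
    return total


def sum_vectorized(prices: array, batch_size: int = BATCH_SIZE) -> float:
    """Process a column of primitive doubles in fixed-size batches."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        batch = prices[start:start + batch_size]  # contiguous primitive values
        total += sum(batch)                        # one tight loop per batch
    return total


rows = [{"price": float(i)} for i in range(5000)]
prices = array("d", (row["price"] for row in rows))  # column of unboxed doubles

row_total = sum_row_at_a_time(rows)
batch_total = sum_vectorized(prices)
print(row_total, batch_total)
```

Both functions compute the same result; the vectorized form simply makes one operator call per batch instead of one per row, which is where the slide's savings in function calls come from.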
Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
Feature | Benefit
Container Pre-Launch | Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
Container Re-Use | Finished Maps and Reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split size tuning
Tez Integration | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput
In-Memory Cache | Hot data kept in RAM for fast access

Quantifying Stinger
Query | Hive 10 | Hive 0.11 (Phase 1) | Hive 0.13 (Phase 3) | Improvement
TPC-DS Query 27 | 1400s | 39s | 7.2s | 190x
TPC-DS Query 82 | 3200s | 65s | 14.9s | 200x

Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
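The improvement factors can be checked from the timings. A short sketch of the arithmetic, assuming the three times listed for each query correspond to Hive 10, Hive 0.11 and Hive 0.13 in that order; `TIMINGS_S` and `speedup` are illustrative names:

```python
# Query times in seconds; the column order is an assumption based on
# the timings decreasing from the oldest release to the newest.
TIMINGS_S = {
    "TPC-DS Query 27": {"Hive 10": 1400.0, "Hive 0.11": 39.0, "Hive 0.13": 7.2},
    "TPC-DS Query 82": {"Hive 10": 3200.0, "Hive 0.11": 65.0, "Hive 0.13": 14.9},
}


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the newer run is than the older one."""
    return before_s / after_s


speedups = {
    query: speedup(times["Hive 10"], times["Hive 0.13"])
    for query, times in TIMINGS_S.items()
}

for query, factor in speedups.items():
    print(f"{query}: {factor:.0f}x faster on Hive 0.13 than on Hive 10")
```

The computed ratios come out slightly above the rounded 190x and 200x figures quoted for the two queries.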
  
Speed: Delivering Interactive Query
Query Time in Seconds
Query | Hive 0.12 | Hive 0.13 (Phase 3)
TPC-DS Query 52 | 41.1s | 4.2s
TPC-DS Query 55 | 39.8s | 4.1s

Query 52: Star Schema Join with group-by, order-by on different keys
Query 55: Star Schema Join with group-by, order-by on different keys

Test Cluster:
•  200 GB Data (ORCFile)
•  20 Nodes, 24GB RAM each, 6x disk each

Speed: Delivering Interactive Query
Query Time in Seconds
Query | Hive 0.12 | Hive 0.13 (Phase 3)
TPC-DS Query 28 | 22s | 9.8s
TPC-DS Query 12 | 31s | 6.7s

Query 28: Four sub-query join (Vectorization)
Query 12: Star Join over range of dates (M-R-R pattern)

Test Cluster:
•  200 GB Data (ORCFile)
•  20 Nodes, 24GB RAM each, 6x disk each

Hortonworks Confidential © 2014
Speed@Scale: Large Scale Implementation
Cisco Engineering Blog Post: http://blogs.cisco.com/datacenter/hdp
Independent assessment by Cisco UCS Team
Benchmark @ 30TB

Speed@Scale: Large Scale Implementation
Facebook Engineering Blog Post: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Ultimate Scale Data Analysis
•  Hortonworks engineering team worked on ORCFile
•  Facebook provided improvements to ORCFile, working with Hortonworks
•  Hive is used for efficient analytics on the largest Hadoop Data Warehouse site

Your Fastest On-ramp to Enterprise Hadoop™!
http://hortonworks.com/products/hortonworks-sandbox/

The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud and no internet connection needed!

The Hortonworks Sandbox is:
•  A free download: http://hortonworks.com/products/hortonworks-sandbox/
•  A complete, self contained virtual machine with Apache Hadoop pre-configured
•  A personal, portable and standalone Hadoop environment
•  A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop

Questions?

David Kaiser
@ddkaiser
linkedin.com/in/dkaiser
slideshare.net/ddkaiser
dkaiser@cdk.com
dkaiser@hortonworks.com

Más contenido relacionado

La actualidad más candente

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
HDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows AzureHDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows AzureLynn Langit
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 

La actualidad más candente (20)

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
HDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows AzureHDInsight Hadoop on Windows Azure
HDInsight Hadoop on Windows Azure
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 

Destacado

Statistics And the Query Optimizer
Statistics And the Query OptimizerStatistics And the Query Optimizer
Statistics And the Query OptimizerGrant Fritchey
 
Vnsispl dbms concepts_ch1
Vnsispl dbms concepts_ch1Vnsispl dbms concepts_ch1
Vnsispl dbms concepts_ch1sriprasoon
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow enginedmoebius
 
Buffer management --database buffering
Buffer management --database buffering Buffer management --database buffering
Buffer management --database buffering julia121214
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleDataWorks Summit
 
Lect 21 components_of_database_management_system
Lect 21 components_of_database_management_systemLect 21 components_of_database_management_system
Lect 21 components_of_database_management_systemnadine016
 
Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)MongoDB
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...DataKitchen
 
L8 components and properties of dbms
L8  components and properties of dbmsL8  components and properties of dbms
L8 components and properties of dbmsRushdi Shams
 
Dbms role advantages
Dbms role advantagesDbms role advantages
Dbms role advantagesjeancly
 
Database management functions
Database management functionsDatabase management functions
Database management functionsyhen06
 
17. Recovery System in DBMS
17. Recovery System in DBMS17. Recovery System in DBMS
17. Recovery System in DBMSkoolkampus
 
16. Concurrency Control in DBMS
16. Concurrency Control in DBMS16. Concurrency Control in DBMS
16. Concurrency Control in DBMSkoolkampus
 

Destacado (19)

Statistics And the Query Optimizer
Statistics And the Query OptimizerStatistics And the Query Optimizer
Statistics And the Query Optimizer
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Vnsispl dbms concepts_ch1
Vnsispl dbms concepts_ch1Vnsispl dbms concepts_ch1
Vnsispl dbms concepts_ch1
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
 
Buffer management --database buffering
Buffer management --database buffering Buffer management --database buffering
Buffer management --database buffering
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Lect 21 components_of_database_management_system
Lect 21 components_of_database_management_systemLect 21 components_of_database_management_system
Lect 21 components_of_database_management_system
 
Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
L8 components and properties of dbms
L8  components and properties of dbmsL8  components and properties of dbms
L8 components and properties of dbms
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Dbms role advantages
Dbms role advantagesDbms role advantages
Dbms role advantages
 
Dml and ddl
Dml and ddlDml and ddl
Dml and ddl
 
Database management functions
Database management functionsDatabase management functions
Database management functions
 
2 tier and 3 tier architecture
2 tier and 3 tier architecture2 tier and 3 tier architecture
2 tier and 3 tier architecture
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
17. Recovery System in DBMS
17. Recovery System in DBMS17. Recovery System in DBMS
17. Recovery System in DBMS
 
16. Concurrency Control in DBMS
16. Concurrency Control in DBMS16. Concurrency Control in DBMS
16. Concurrency Control in DBMS
 
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with ExamplesDML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
DML, DDL, DCL ,DRL/DQL and TCL Statements in SQL with Examples
 

Similar a Overview of stinger interactive query for hive

Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Overview of Stinger: Interactive Query for Hive

  • 1. Overview of Stinger: Interactive Query for Hive
       David Kaiser
       OC Big Data Meetup #1, May 21, 2014
       @ddkaiser, linkedin.com/in/dkaiser, slideshare.net/ddkaiser
       dkaiser@cdk.com, dkaiser@hortonworks.com
  • 2. Who Am I?
       David Kaiser
       20+ years experience with Linux
       3 years experience with Hadoop
       Career experiences:
       • Data Warehousing
       • Geospatial Analytics
       • Open-source Solutions and Architecture
       Employed at Hortonworks as a Senior Solutions Engineer
  • 3. Overview of Stinger: Interactive Query for Hive
       • Abstract:
       – Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been many new approaches for increased application performance.
       – Hive is the most used SQL implementation on Hadoop.
       – Hive provides the most SQL compatibility on Hadoop.
       – But… Hive is Slow.
       – Hive WAS Slow.
       – This talk will discuss the Stinger initiative, which improved Hive performance by more than 100x.
  • 4. Stinger Project (announced February 2013)
       Batch AND Interactive SQL-in-Hadoop
       Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
       Goals:
       • Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
       • Scale: The only SQL processing in Hadoop designed for queries that scale from TB to PB
       • SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
       Hive 0.11, May 2013:
       • Base Optimizations
       • SQL Analytic Functions
       • ORCFile, Modern File Format
       Hive 0.12, October 2013:
       • VARCHAR, DATE Types
       • ORCFile predicate pushdown
       • Advanced Optimizations
       • Performance Boosts via YARN
       Hive 0.13, April 2014:
       • Hive on Apache Tez
       • Query Service
       • Buffer Cache
       • Cost Based Optimizer (Optiq)
       • Vectorized Processing
       An Open Community at its finest: Apache Hive Contribution
       • 1,672 Jira Tickets Closed
       • 145 Developers
       • 44 Companies
       • ~400,000 Lines Of Code Added
       • 13 Months
  • 5. Outcomes from the Stinger Project
       Each entry lists the feature, its description, and its benefit.
       • Tez Integration: Tez is a significantly better engine than MapReduce. (Benefit: Latency)
       • Vectorized Query: Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. (Benefit: Throughput)
       • Query Planner: Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). (Benefit: Latency)
       • ORC File: Columnar, type-aware format with indices. (Benefit: Latency)
       • Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics, including histograms. (Benefit: Latency)
       • Hive as a Service: Leaves engine running between sessions. (Benefit: Latency)
       • Buffer Cache: Leaves most used HDFS file blocks in memory. (Benefit: Latency)
  • 6. Hadoop 2: Moving Past MapReduce
       HADOOP 1.0: Single Use System (Batch Apps)
       • HDFS (redundant, reliable storage)
       • MapReduce (cluster resource management & data processing)
       HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
       • HDFS2 (redundant, highly-available & reliable storage)
       • YARN (cluster resource management)
       • MapReduce (data processing)
       • Others
  • 7. Apache Tez as the new Primitive
       HADOOP 1.0: MapReduce as Base
       • HDFS (redundant, reliable storage)
       • MapReduce (cluster resource management & data processing)
       • Pig (data flow)
       • Hive (sql)
       • Others (Cascading)
       HADOOP 2.0: Apache Tez as Base
       • HDFS2 (redundant, reliable storage)
       • YARN (cluster resource management)
       • Tez (execution engine)
       • Data Flow: Pig
       • SQL: Hive
       • Others (Cascading)
       • Batch: MapReduce
       • Slider (continuous execution)
       • Online Data Processing: HBase, Accumulo
       • Real Time Stream Processing: Storm
  • 8. Complete Open Source Stack
       • YARN is the logical extension of Apache Hadoop
       – Complements HDFS, the data reservoir
       • Resource Management for the Enterprise Data Lake
       – Shared, secure, multi-tenant Hadoop
       Allows for all processing in Open-Source Hadoop
       HDFS2 (Redundant, Reliable Storage)
       YARN (Cluster Resource Management)
       Processing engines on YARN:
       • BATCH (MapReduce)
       • INTERACTIVE (Tez)
       • STREAMING (Storm, S4, …)
       • GRAPH (Giraph)
       • IN-MEMORY (Spark)
       • HPC MPI (OpenMPI)
       • ONLINE (HBase)
       • OTHER (Search) (Weave…)
  • 9. Hive On Tez - Execution
       Each entry lists the feature, its description, and its benefit.
       • Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster. (Benefit: Latency)
       • Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. (Benefit: Latency)
       • Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! (Benefit: Latency)
       • Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics. (Benefit: Throughput)
       • Tez In-Memory Cache: Hot data kept in RAM for fast access. (Benefit: Latency)
       • Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. (Benefit: Throughput)
  • 10. ORC File Advantages
        Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale.
        Smaller Footprint: Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster.
        File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500 Dataset)
        • Encoded with Text: 585 GB (Original Size)
        • Encoded with RCFile: 505 GB (14% Smaller)
        • Encoded with Parquet (Impala): 221 GB (62% Smaller)
        • Encoded with ORCFile (Hive 12): 131 GB (78% Smaller)
        • Larger Block Sizes
        • Columnar format arranges columns adjacent within the file for compression & fast access
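The "% Smaller" labels on this slide can be checked with a line of arithmetic. The sketch below recomputes them from the file sizes shown on the chart, taking the 585 GB figure as the baseline; the function and variable names are made up for illustration.

```python
# Sizes in GB as printed on the slide; 585 GB is labeled "Original Size".
ORIGINAL_GB = 585


def percent_smaller(encoded_gb: float, original_gb: float = ORIGINAL_GB) -> int:
    """Return the size reduction relative to the original, as a whole percent."""
    return round(100 * (1 - encoded_gb / original_gb))


# Recompute the reduction for each encoded size shown on the chart.
reductions = {size: percent_smaller(size) for size in (505, 221, 131)}

for size, pct in reductions.items():
    print(f"{size} GB is {pct}% smaller than {ORIGINAL_GB} GB")
```

The recomputed values agree with the percentages printed on the slide (14%, 62%, 78%).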
  • 11. ORCFile File Format
        • Query-Optimized: Split-able, columnar storage file
        • Efficient Reads: Break into large “stripes” of data for efficient read
        • Fast Filtering: Built in index, min/max, metadata for fast filtering blocks - bloom filters if desired
        • Efficient Compression: Decompose complex row types into primitives: massive compression and efficient comparisons for filtering
        • Precomputation: Built in aggregates per block (min, max, count, sum, etc.)
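The "Fast Filtering" point can be illustrated with a toy model of min/max statistics. This is a minimal sketch of the idea only, not the ORC reader API: the `Stripe` class, `build_stripes`, and `scan_greater_than` are hypothetical names, and real stripes hold far more rows than shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """A block of column values plus the min/max kept alongside it."""
    min_value: int
    max_value: int
    rows: List[int]


def build_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split a column into fixed-size stripes and record min/max per stripe."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(min(chunk), max(chunk), chunk))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `value > threshold`, skipping stripes that cannot match."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no row in this stripe can satisfy the predicate
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


values = list(range(1, 13))            # a sorted column holding 1..12
stripes = build_stripes(values, 4)     # three stripes: 1-4, 5-8, 9-12
matches, stripes_read = scan_greater_than(stripes, 8)
print(matches, stripes_read)
```

With sorted data, two of the three stripes are skipped using the statistics alone. On unsorted data the min/max ranges overlap more, so fewer stripes can be skipped.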
  • 12. A Journey to SQL Compliance
        Evolution of SQL Compliance in Hive
        SQL Datatypes:
        • INT/TINYINT/SMALLINT/BIGINT
        • FLOAT/DOUBLE
        • BOOLEAN
        • ARRAY, MAP, STRUCT, UNION
        • STRING
        • BINARY
        • TIMESTAMP
        • DECIMAL
        • DATE
        • VARCHAR
        • CHAR
        • Interval Types
        SQL Semantics:
        • SELECT, INSERT
        • GROUP BY, ORDER BY, HAVING
        • JOIN on explicit join key
        • Inner, outer, cross and semi joins
        • Sub-queries in the FROM clause
        • ROLLUP and CUBE
        • UNION
        • Standard aggregations (sum, avg, etc.)
        • Custom Java UDFs
        • Windowing functions (OVER, RANK, etc.)
        • Advanced UDFs (ngram, XPath, URL)
        • Sub-queries for IN/NOT IN, HAVING
        • JOINs in WHERE Clause
        • INSERT/UPDATE/DELETE
        Legend (color-coded on the slide): Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap
  • 13. Tez – Execution Performance
        • Performance gains over Map Reduce
        – Eliminate replicated write barrier between successive computations.
        – Eliminate job launch overhead of workflow jobs.
        – Eliminate extra stage of map reads in every workflow job.
        – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
        [Diagram: Pig/Hive - MR compared with Pig/Hive - Tez]
  • 14. Hive-on-MR vs. Hive-on-Tez
        SELECT a.state, COUNT(*), AVG(c.price) FROM a
        JOIN b ON (a.id = b.id)
        JOIN c ON (a.itemId = c.itemId)
        GROUP BY a.state
        [Diagram: Hive – MR plan. Stages labeled SELECT a.state; SELECT c.price; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state, COUNT(*), AVG(c.price). The map (M) and reduce (R) stages are separated by intermediate writes to HDFS.]
        [Diagram: Hive – Tez plan. Stages labeled SELECT a.state, c.itemId; JOIN (a, c); SELECT b.id; JOIN (a, b); GROUP BY a.state, COUNT(*), AVG(c.price). The map (M) and reduce (R) stages are connected directly.]
        Tez avoids unneeded writes to HDFS
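Both plans on this slide compute the same logical result: two joins followed by a grouped count and average. As a rough illustration of those logical steps (not of how Hive, MapReduce, or Tez execute them), the sketch below runs the query over a few in-memory rows. The table contents and the `hash_join` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows for tables a, b and c from the slide's query.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]


def hash_join(left, right, left_key, right_key):
    """Inner join: index the right side by key, then probe with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[left_key], [])
    ]


# JOIN b ON (a.id = b.id), then JOIN c ON (a.itemId = c.itemId)
ab = hash_join(a, b, "id", "id")
abc = hash_join(ab, c, "itemId", "itemId")

# GROUP BY a.state with a row count and average price per group
prices_by_state = defaultdict(list)
for row in abc:
    prices_by_state[row["state"]].append(row["price"])

result = {
    state: (len(prices), sum(prices) / len(prices))
    for state, prices in prices_by_state.items()
}
print(result)
```

In the MapReduce plan each of these steps ends with a write to HDFS before the next job starts; in the Tez plan the intermediate rows flow between stages of a single job.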
  • 15. Vectorization
        • Rewrite all operations to operate on blocks of 1K+ records, rather than one record at a time
        • Block is array of Java scalars, not Objects (eliminate Objects – compounding GC gains over time)
        • Avoids many function calls, CPU pipeline stalls
        • Size to fit in L1 cache, avoid cache misses
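The difference between the two processing styles is mostly one of loop structure and data layout. The sketch below contrasts them on a made-up filter-and-sum over `price` and `qty` columns. It is written in Python for readability and will not show the CPU-level gains the slide describes; it only illustrates the shape of the computation. The batch size of 1024 is an assumption based on the slide's "1K+ records", and all names are hypothetical.

```python
from array import array

BATCH_SIZE = 1024  # assumed block size, per the slide's "1K+ records"


def total_row_at_a_time(rows, min_price):
    """Row-oriented: one object per record, filter and compute per row."""
    total = 0.0
    for row in rows:
        if row["price"] >= min_price:
            total += row["price"] * row["qty"]
    return total


def total_vectorized(price, qty, min_price):
    """Column-oriented: walk arrays of primitives one batch at a time."""
    total = 0.0
    n = len(price)
    for start in range(0, n, BATCH_SIZE):
        end = min(start + BATCH_SIZE, n)
        # Filter step: collect the positions in this batch that qualify.
        selected = [i for i in range(start, end) if price[i] >= min_price]
        # Compute step: a tight loop over primitive values only.
        for i in selected:
            total += price[i] * qty[i]
    return total


# Columns stored as typed arrays of scalars rather than per-row objects.
price = array("d", [float(i % 10) for i in range(3000)])
qty = array("l", [1 + (i % 3) for i in range(3000)])
rows = [{"price": p, "qty": q} for p, q in zip(price, qty)]

row_total = total_row_at_a_time(rows, 5.0)
vector_total = total_vectorized(price, qty, 5.0)
print(row_total, vector_total)
```

Both functions return the same total; the vectorized version separates the filter from the arithmetic so that each inner loop touches only flat arrays of scalars.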
  • 16. Stinger Phase 3: Unlocking Interactive Query
        Stinger Phase 3: Features and Benefits
        • Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
        • Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split size tuning
        • Tez Integration: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput
        • In-Memory Cache: Hot data kept in RAM for fast access
  • 17. Quantifying Stinger
        • TPC-DS Query 27: Hive 10: 1400s; Hive 0.11 (Phase 1): 39s; Hive 0.13 (Phase 3): 7.2s (190x Improvement)
        • TPC-DS Query 82: Hive 10: 3200s; Hive 0.11 (Phase 1): 65s; Hive 0.13 (Phase 3): 14.9s (200x Improvement)
        Query 27: Pricing Analytics using Star Schema Join
        Query 82: Inventory Analytics Joining 2 Large Fact Tables
        All Results at Scale Factor 200 (Approximately 200GB Data)
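The improvement factors on this slide follow directly from the timings. As a quick check, the sketch below divides the Hive 10 time by the Hive 0.13 time for each query; the dictionary layout and names are illustrative only.

```python
# Query times in seconds as printed on the slide (scale factor 200).
timings = {
    "TPC-DS Query 27": {"Hive 10": 1400.0, "Hive 0.11": 39.0, "Hive 0.13": 7.2},
    "TPC-DS Query 82": {"Hive 10": 3200.0, "Hive 0.11": 65.0, "Hive 0.13": 14.9},
}


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the later run is than the earlier one."""
    return before_s / after_s


speedups = {
    query: round(speedup(t["Hive 10"], t["Hive 0.13"]), 1)
    for query, t in timings.items()
}

for query, factor in speedups.items():
    print(f"{query}: about {factor}x faster on Hive 0.13 than on Hive 10")
```

The computed ratios come out slightly above the rounded 190x and 200x figures shown on the slide.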
  • 18. Speed: Delivering Interactive Query
        Query Time in Seconds
        • TPC-DS Query 52: Hive 0.12: 41.1s; Hive 0.13 (Phase 3): 4.2s
        • TPC-DS Query 55: Hive 0.12: 39.8s; Hive 0.13 (Phase 3): 4.1s
        Query 52: Star Schema Join with group-by, order-by on different keys
        Query 55: Star Schema Join with group-by, order-by on different keys
        Test Cluster:
        • 200 GB Data (ORCFile)
        • 20 Nodes, 24GB RAM each, 6x disk each
  • 19. Speed: Delivering Interactive Query
        Query Time in Seconds
        • TPC-DS Query 28: Hive 0.12: 22s; Hive 0.13 (Phase 3): 9.8s
        • TPC-DS Query 12: Hive 0.12: 31s; Hive 0.13 (Phase 3): 6.7s
        Query 28: Four sub-query join (Vectorization)
        Query 12: Star Join over range of dates (M-R-R pattern)
        Test Cluster:
        • 200 GB Data (ORCFile)
        • 20 Nodes, 24GB RAM each, 6x disk each
  • 20. Speed@Scale: Large Scale Implementation
        Cisco Engineering Blog Post: http://blogs.cisco.com/datacenter/hdp
        Independent assessment by Cisco UCS Team
        Benchmark @ 30TB
        Hortonworks Confidential © 2014
  • 21. Speed@Scale: Large Scale Implementation
        Facebook Engineering Blog Post: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
        Ultimate Scale Data Analysis
        • Hortonworks engineering team worked on ORCFile
        • Facebook provided improvements to ORCFile, working with Hortonworks
        • Hive is used for efficient analytics on the largest Hadoop Data Warehouse site
  • 22. Your Fastest On-ramp to Enterprise Hadoop™!
        http://hortonworks.com/products/hortonworks-sandbox/
        The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed!
        The Hortonworks Sandbox is:
        • A free download: http://hortonworks.com/products/hortonworks-sandbox/
        • A complete, self contained virtual machine with Apache Hadoop pre-configured
        • A personal, portable and standalone Hadoop environment
        • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
  • 23. Questions?
        David Kaiser