HBase and Impala
Use Cases for fast SQL queries
  
About Me
•  EMEA Chief Architect @ Cloudera (3+ years)
   •  Consulting on Hadoop projects (everywhere)
•  Apache Committer
   •  HBase and Whirr
•  O’Reilly Author
   •  HBase: The Definitive Guide
   •  Now in Japanese! (日本語版も出ました!)
•  Contact
   •  lars@cloudera.com
   •  @larsgeorge
  
Agenda
•  “Introduction” to HBase
•  Impala Architecture
•  Mapping Schemas
•  Query Considerations
  
Intro To HBase
Slide 4 to 250
  
What is HBase?
This is HBase!
Really though… RTFM! (There are at least two good books about it.)
  
IOPS vs Throughput Mythbusters
It is all physics in the end: you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
  
HBase: Strengths & Weaknesses
Strengths:
•  Random access to small(ish) key-value pairs
•  Rows and columns stored sorted lexicographically
•  Adds table and region concepts to group related KVs
•  Stores and reads data sequentially
•  Parallelizes across all clients
•  Non-blocking I/O throughout
  
Using HBase Strengths
  
HBase “Indexes”
•  Use primary keys, aka the row keys, as sorted index
   •  One sort direction only
   •  Use “secondary index” to get reverse sorting
      •  Lookup table or same table
•  Use secondary keys, aka the column qualifiers, as sorted index within main record
   •  Use prefixes within a column family or separate column families
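Because row keys sort in one direction only, a second key with an inverted component is one common way to get newest-first ordering. The sketch below is plain Python rather than HBase client code; the helper names, the `\x00` separator and the choice of `2**63 - 1` as the inversion constant are assumptions made for illustration.

```python
import struct

# Largest signed 64-bit value; subtracting a timestamp from it inverts the sort order.
MAX_MILLIS = 2**63 - 1


def forward_key(entity: str, ts_millis: int) -> bytes:
    """Row key that sorts oldest-first within one entity."""
    return entity.encode("utf-8") + b"\x00" + struct.pack(">q", ts_millis)


def reverse_key(entity: str, ts_millis: int) -> bytes:
    """Row key that sorts newest-first by storing MAX_MILLIS - timestamp."""
    return entity.encode("utf-8") + b"\x00" + struct.pack(">q", MAX_MILLIS - ts_millis)


events = [("user1234", 1000), ("user1234", 3000), ("user1234", 2000)]

# Sorting by the encoded bytes mimics how a byte-ordered store would lay the rows out.
oldest_first = sorted(events, key=lambda e: forward_key(*e))
newest_first = sorted(events, key=lambda e: reverse_key(*e))
```

Whether the reverse key lives in a separate lookup table or in the same table under a different prefix is the design choice the slide leaves open.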
  	
  
HBase: Strengths & Weaknesses
Weaknesses:
•  Not optimized (yet) for 100% possible throughput of underlying storage layer
   •  And HDFS is not optimized fully either
•  Single writer issue with WALs
•  Single server hot-spotting with non-distributed keys
  
HBase Dilemma
Although HBase can host many applications, they may require completely opposite features.
•  Events
•  Time Series
•  Entities
•  Message Store
  
Opposite Use-Case
•  Entity Store
   •  Regular (random) updates and inserts in existing entity
   •  Causes entity details being spread over many files
   •  Needs to read a lot of data to reconstitute “logical” view
   •  Writing is often nicely distributed (can be hashed)
•  Event Store
   •  One-off inserts of events such as log entries
   •  Access is often a scan over partitions by time
   •  Reads are efficient due to sequential write pattern
   •  Writes need to be taken care of to avoid hotspotting
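One common way to take care of sequential event writes is to prefix each key with a hash-derived bucket, so that consecutive timestamps land on different key ranges. This is a minimal sketch in plain Python, not HBase client code; the bucket count of 16, the two-digit prefix format and the `salted_key` helper are assumptions.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; would normally relate to the number of regions


def salted_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable bucket number derived from its hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{key}"


# Sequential timestamps no longer share a common prefix.
sample = [salted_key(f"2013-10-17T12:00:{second:02d}") for second in range(60)]
buckets_used = {key[:2] for key in sample}
```

The trade-off is on the read side: a scan over a time range now has to be issued once per bucket and the results merged.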
  
Impala Architecture
  
Beyond Batch
For some things MapReduce is just too slow

Apache Hive:
•  MapReduce execution engine
•  High-latency, low throughput
•  High runtime overhead

Google realized this early on
•  Analysts wanted fast, interactive results
  
Dremel
Google paper (2010)
“scalable, interactive ad-hoc query system for analysis of read-only nested data”
•  Columnar storage format
•  Distributed scalable aggregation
“capable of running aggregation queries over trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
  
Impala: Goals
General-purpose SQL query engine for Hadoop
•  For analytical and transactional workloads
•  Support queries that take ms to hours
•  Run directly with Hadoop
   •  Collocated daemons
   •  Same file formats
   •  Same storage managers (NN, metastore)
  
Impala: Goals
•  High performance
   •  C++
   •  runtime code generation (LLVM)
   •  direct access to data (no MapReduce)
•  Retain user experience
   •  easy for Hive users to migrate
•  100% open-source
  
Impala: Architecture
•  impalad
   •  runs on every node
   •  handles client requests (ODBC, Thrift)
   •  handles query planning & execution
•  statestored
   •  provides name service
   •  metadata distribution
   •  used for finding data
  
  
Mapping Schemas
HBase to Typed Schema
  
Binary to Types
•  HBase only has binary keys and values
•  Hive and Impala share the same metastore which adds types to each column
   •  Can use Hive or Impala shell to change metadata
•  The row key of an HBase table is mapped to a column in the metastore, i.e. on the SQL side
   •  Impala prefers “String” type to better support comparisons and sorting
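HBase orders row keys byte by byte, so a string-typed key compares predictably only when its numeric parts have a fixed width. A minimal illustration in plain Python (no HBase involved; the ten-digit padding width is an arbitrary assumption):

```python
values = [9, 10, 200, 1000]

# Plain decimal strings sort lexicographically, not numerically.
plain_order = sorted(str(v).encode("ascii") for v in values)

# Zero-padding to a fixed width makes byte order agree with numeric order.
padded_order = sorted(f"{v:010d}".encode("ascii") for v in values)
```

With the padded form, a range predicate on the key column selects the same rows whether it is read as a number or as a string.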
  
Defining the Schema

-- "hbase.columns.mapping" maps columns to fields:
-- ":key" is the HBase row key, "cf1:val" backs the "value" field.
CREATE TABLE hbase_table_1 (
  key string,
  value string
)
STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler"
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:val"
)
TBLPROPERTIES (
  "hbase.table.name" = "xyz"
);
  
Mapping Options
•  Can create a new table or map to an existing one
   •  CREATE TABLE vs. CREATE EXTERNAL TABLE
•  Creating table through Hive or Impala does not set any table or column family properties
   •  Typically not a good idea to rely on defaults
   •  Better specify compression, TTLs, etc. on HBase side and then map as external table
  
Mapping Options
SERDE Properties to map columns to fields
•  hbase.columns.mapping
   •  Matching count of entries required (on SQL side only)
   •  Spaces are not allowed (as they are valid characters in HBase)
   •  The “:key” mapping is a special one for the HBase row key
   •  Otherwise: column-family-name:[column-name][#(binary|string)]
•  hbase.table.default.storage.type
   •  Can be string (the default) or binary
   •  Defines the default type
   •  Binary means data treated like HBase Bytes class does
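The mapping rules above can be read as a small grammar. The following sketch checks a mapping string against a list of SQL field names; it is a hypothetical helper written for illustration, not Hive's actual parser, and it accepts only the forms shown on the slide.

```python
from typing import Dict, List, Optional


def parse_columns_mapping(
    mapping: str, sql_fields: List[str]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Pair each SQL field with its entry in an hbase.columns.mapping string."""
    if " " in mapping:
        raise ValueError("spaces are not allowed in the mapping")
    entries = mapping.split(",")
    if len(entries) != len(sql_fields):
        raise ValueError(
            "mapping has %d entries but the table has %d fields"
            % (len(entries), len(sql_fields))
        )

    result: Dict[str, Dict[str, Optional[str]]] = {}
    key_seen = False
    for field, entry in zip(sql_fields, entries):
        # Optional "#binary" or "#string" suffix selects the storage type.
        spec, _, storage = entry.partition("#")
        if storage not in ("", "binary", "string"):
            raise ValueError("unknown storage type: %r" % storage)
        if spec == ":key":
            if key_seen:
                raise ValueError('only one ":key" entry is allowed')
            key_seen = True
            result[field] = {"kind": "row_key", "family": None,
                             "qualifier": None, "storage": storage or None}
            continue
        family, sep, qualifier = spec.partition(":")
        if not family or not sep:
            raise ValueError("expected column-family-name:[column-name], got %r" % entry)
        result[field] = {"kind": "column", "family": family,
                         "qualifier": qualifier or None, "storage": storage or None}
    return result


parsed = parse_columns_mapping(":key,cf1:val", ["key", "value"])
```

Here `parsed["value"]` describes column family `cf1` with qualifier `val`, and `parsed["key"]` is marked as the row key.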
  	
  
Mapping Limits
•  Only one (1) “:key” is allowed
   •  But can be inserted in SQL schema at will
•  Access to HBase KV versions is not supported (yet)
   •  Always returns the latest version by default
   •  This is very similar to what a database user expects
•  HBase columns not mapped are not visible on SQL side
•  Since row keys in HBase are unique, results may vary
   •  Inserting duplicate keys updates row while count of rows stays the same
   •  INSERT OVERWRITE does not delete existing rows but rather updates those (HBase is mutable after all!)
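The update-in-place behaviour can be modelled with a dictionary keyed by row key. This is a toy model in plain Python, not the HBase or Impala write path; `insert_overwrite` is a hypothetical name chosen to mirror the SQL statement.

```python
def insert_overwrite(table, rows):
    """Write rows as puts; nothing already in the table is deleted first."""
    for row_key, cells in rows:
        table.setdefault(row_key, {}).update(cells)


table = {}  # row key -> {"family:qualifier": value}
insert_overwrite(table, [("98", {"cf1:val": "val_98"}), ("99", {"cf1:val": "val_99"})])

# A second "overwrite" with a duplicate key updates row 98 and leaves row 99 alone.
insert_overwrite(table, [("98", {"cf1:val": "val_98_v2"})])

row_count = len(table)
latest = table["98"]["cf1:val"]
```

A SQL user expecting INSERT OVERWRITE to replace the table contents would instead find both rows still present.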
  
Query Considerations
  
HBase Table Scan

$ hbase shell
hbase(main):001:0> list
xyz
1 row(s) in 0.0530 seconds

Table was created.

hbase(main):002:0> describe "xyz"
DESCRIPTION                                                           ENABLED
{NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE',   true
VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0220 seconds

hbase(main):003:0> scan "xyz"
ROW                  COLUMN+CELL
0 row(s) in 0.0060 seconds

Table empty.
  
HBase Table Scan
Insert data from existing table into HBase backed one:

INSERT OVERWRITE TABLE hbase_table_1
SELECT * FROM pokes WHERE foo=98;

Verify on HBase side:

hbase(main):009:0> scan "xyz"
ROW                  COLUMN+CELL
98                   column=cf1:val, timestamp=1267737987733, value=val_98
1 row(s) in 0.0110 seconds
  
Pro Tip: http://gethue.com/
  
HBase Scans under the Hood
Impala uses Scan instances under the hood just as the native Java API does. This allows for all scan optimizations, e.g. predicate push-down, like
•  Start and Stop Row
•  Server-side Filters
•  Scanner caching (but not batching yet)
  
Configure HBase Scan Details
In impala-shell:
•  Same as calling setCacheBlocks(true) or setCacheBlocks(false)
   set hbase_cache_blocks=true;
   set hbase_cache_blocks=false;
•  Same as calling setCaching(rows)
   set hbase_caching=1000;
  
HBase Scans under the Hood
Back to Physics: A scan can only perform well if as little data as possible is read.
•  Need to issue queries that are known not to be full table scans
•  This requires careful schema design!
Typical use-cases are
•  OLAP cube: read report data from single row
•  Time series: read fine-grained, time partitioned data
  
OLAP Example
Facebook Insights is using HBase to keep an OLAP cube live, i.e. fully materialized
•  Each row reflects one tracked page and contains all its data points
   •  All dimensions with time bracket prefix plus TTLs
•  During report time only one or very few rows are read
•  Design favors read over write performance
•  Could also think about hybrid system:
   •  CEP + HBase + HDFS (Parquet)
  
Time Series Example
•  OpenTSDB writes the metric events bucketed by metric ID and then timestamp
   •  Helps using all servers in the cluster equally
•  During reporting/dashboarding the data is read for specific metrics within a specific time frame
•  Sorted data translates into effective use of Scan with start and stop rows
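A key that starts with the metric ID followed by the timestamp keeps each metric's data contiguous and time-ordered, so a time window becomes a single start/stop row pair. The sketch below uses a simplified layout (3-byte metric ID plus 4-byte big-endian timestamp) that is an assumption for illustration and not OpenTSDB's exact schema; a sorted Python list stands in for the table.

```python
import struct
from bisect import bisect_left


def tsdb_row_key(metric_id: int, ts_seconds: int) -> bytes:
    """Metric ID first, then a big-endian timestamp so byte order equals time order."""
    return metric_id.to_bytes(3, "big") + struct.pack(">I", ts_seconds)


def scan_range(metric_id: int, start_ts: int, stop_ts: int):
    """Start row (inclusive) and stop row (exclusive) for one metric and time window."""
    return tsdb_row_key(metric_id, start_ts), tsdb_row_key(metric_id, stop_ts)


# A sorted list stands in for the sorted row keys of a table.
rows = sorted(tsdb_row_key(m, t) for m in (1, 2) for t in (50, 100, 150, 200))

start_row, stop_row = scan_range(1, 100, 200)
hits = rows[bisect_left(rows, start_row):bisect_left(rows, stop_row)]
```

Only the two rows for metric 1 at timestamps 100 and 150 fall inside the range; rows for metric 2 are never touched.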
  
Final Notes
Since HBase scan performance is mainly influenced by the number of rows scanned, you need to issue queries that are selective, i.e. scan only certain rows and not the entire table.

This requires WHERE clauses with the HBase row key in them:

SELECT f1, f2, f3 FROM mapped_table
WHERE key >= "user1234" AND key < "user1235";

“Scan all rows for user 1234, i.e. that have a row key starting with user1234” (might be a composite key!)
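The upper bound in such a query is the prefix with its last byte incremented. A minimal sketch of that calculation in plain Python; `prefix_scan_range` is a hypothetical helper, and returning an empty stop row for an all-0xFF prefix relies on the HBase convention that an empty stop row means "until the end of the table".

```python
def prefix_scan_range(prefix: bytes):
    """Return (start_row, stop_row) covering every key that starts with prefix."""
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, b""  # no finite upper bound exists for this prefix
    stop[-1] += 1
    return prefix, bytes(stop)


start_row, stop_row = prefix_scan_range(b"user1234")

keys = [b"user1233-a", b"user1234", b"user1234-2013", b"user12345", b"user1235"]
selected = [k for k in keys if start_row <= k < stop_row]
```

Note that `user12345` is selected as well: with a composite key, a delimiter after the user ID is needed if user 1234 and user 12345 must be told apart.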
  
Example
  
Final Notes
Not using the primary HBase index, aka row key, results in a full table scan and might take much longer (when you have a large table).

SELECT f1, f2, f3 FROM mapped_table
WHERE f1 = "value1" OR f20 < "200";

This will result in a full table scan. Remember: it is all just physics!
  
Final Notes
Impala also uses SingleColumnValueFilter from HBase to reduce transferred data
•  Filters out entire rows by checking a given column value
•  Does not skip rows since no index or Bloom filter is available to help identify the next match

Overall this helps yet cannot do any magic (physics again!)
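The difference between "filtering out" and "skipping" shows up when counting rows examined versus rows returned. This is a toy model in plain Python, not HBase's filter implementation; rows that lack the column are simply dropped here, which is a simplification.

```python
def scan_with_value_filter(rows, column, wanted):
    """Check every row on the 'server' and return only the matching row keys."""
    examined = 0
    matches = []
    for row_key, cells in rows:  # rows arrive in row-key order
        examined += 1            # every row is still read from storage
        if cells.get(column) == wanted:
            matches.append(row_key)
    return matches, examined


# 1000 rows, of which every hundredth carries the value being searched for.
rows = [
    ("row%04d" % i, {"cf1:f1": "value1" if i % 100 == 0 else "other"})
    for i in range(1000)
]
matches, examined = scan_with_value_filter(rows, "cf1:f1", "value1")
```

Less data crosses the network, but the read work on the server is that of a full scan.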
  
Final Notes
Some advice on Tall-narrow vs. flat-wide table layout:

Store data in a tall and narrow table since there is currently no support for scanner batching (i.e. intra row scanning). Mapping, for example, one million HBase columns into SQL is futile.

This is still true for Hive’s Map support, since the entire row has to fit into memory!
  
Outlook
Future work:
•  Composite keys: map multiple SQL fields into a single composite HBase row key
•  Expose KV versions to SQL schema
•  Better predicate pushdown
   •  Advanced filter or indexes?
  
Questions?
@larsgeorge
lars@cloudera.com
  

HBase and Impala Notes - Munich HUG - 20131017

Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
HBase.pptx
HBase.pptxHBase.pptx
HBase.pptx
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
HBase.pptx
HBase.pptxHBase.pptx
HBase.pptx
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & development
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Hbase Nosql
Hbase NosqlHbase Nosql
Hbase Nosql
 
Nyc hadoop meetup introduction to h base
Nyc hadoop meetup   introduction to h baseNyc hadoop meetup   introduction to h base
Nyc hadoop meetup introduction to h base
 
Introduction to Apache HBase
Introduction to Apache HBaseIntroduction to Apache HBase
Introduction to Apache HBase
 
HBase
HBaseHBase
HBase
 
Hbase
HbaseHbase
Hbase
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
HBase lon meetup
HBase lon meetupHBase lon meetup
HBase lon meetup
 
Hadoop - Apache Hbase
Hadoop - Apache HbaseHadoop - Apache Hbase
Hadoop - Apache Hbase
 
Hbasepreso 111116185419-phpapp02
Hbasepreso 111116185419-phpapp02Hbasepreso 111116185419-phpapp02
Hbasepreso 111116185419-phpapp02
 

HBase and Impala Notes - Munich HUG - 20131017

  • 1. HBase and Impala: Use Cases for fast SQL queries
  • 2. About Me
      • EMEA Chief Architect @ Cloudera (3+ years)
          • Consulting on Hadoop projects (everywhere)
      • Apache Committer
          • HBase and Whirr
      • O’Reilly Author
          • HBase: The Definitive Guide
          • Now in Japanese! 日本語版も出ました!
      • Contact
          • lars@cloudera.com
          • @larsgeorge
  • 3. Agenda
      • “Introduction” to HBase
      • Impala Architecture
      • Mapping Schemas
      • Query Consideration
  • 4. Intro To HBase
      Slide 4 to 250
  • 5. What is HBase?
      This is HBase!
  • 6. What is HBase?
      This is HBase!
      Really though… RTFM! (there are at least two good books about it)
  • 7. IOPS vs Throughput Mythbusters
      It is all physics in the end: you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
  • 8. HBase: Strengths & Weaknesses
      Strengths:
      • Random access to small(ish) key-value pairs
      • Rows and columns stored sorted lexicographically
      • Adds table and region concepts to group related KVs
      • Stores and reads data sequentially
      • Parallelizes across all clients
          • Non-blocking I/O throughout
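      The point about lexicographic sorting is easy to trip over with numeric ids. The following is a minimal Python sketch, not HBase code: it only mimics byte-wise key comparison, and the helper names `sorted_keys` and `padded` are made up for illustration.

```python
# Sketch: HBase compares row keys as raw bytes, so numeric suffixes sort
# "alphabetically" unless they are padded to a fixed width.

def sorted_keys(keys):
    """Return keys in byte-wise (lexicographic) order, as a sorted store would hold them."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))

def padded(prefix, number, width=6):
    """Hypothetical helper: build a key with a zero-padded numeric suffix."""
    return f"{prefix}{number:0{width}d}"

naive = [f"row{n}" for n in (1, 2, 10)]
fixed = [padded("row", n) for n in (1, 2, 10)]

print(sorted_keys(naive))  # ['row1', 'row10', 'row2']
print(sorted_keys(fixed))  # ['row000001', 'row000002', 'row000010']
```

      With padding, byte order and numeric order agree, which is what range scans over such keys rely on.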
  • 10. HBase “Indexes”
      • Use primary keys, aka the row keys, as sorted index
          • One sort direction only
          • Use “secondary index” to get reverse sorting
              • Lookup table or same table
      • Use secondary keys, aka the column qualifiers, as sorted index within main record
          • Use prefixes within a column family or separate column families
  • 11. HBase: Strengths & Weaknesses
      Weaknesses:
      • Not optimized (yet) for 100% possible throughput of underlying storage layer
          • And HDFS is not optimized fully either
      • Single writer issue with WALs
      • Single server hot-spotting with non-distributed keys
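      One common way to soften the hot-spotting weakness is to prefix each key with a bucket derived from a hash of the key. The sketch below is an illustration under assumptions, not something the slides prescribe: `NUM_BUCKETS`, `salted_key`, and the choice of CRC32 are all hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # assumed bucket count, e.g. in the range of the number of region servers

def salted_key(key: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a stable one-byte bucket derived from a checksum of the key."""
    raw = key.encode("utf-8")
    bucket = zlib.crc32(raw) % buckets
    return bytes([bucket]) + raw

# Sequential keys (such as timestamps) would otherwise all land next to each other.
keys = [f"20131017-{n:04d}" for n in range(1000)]
buckets_used = {salted_key(k)[0] for k in keys}
print(sorted(buckets_used))
```

      The trade-off is that a range scan over the original key order now has to fan out over every bucket.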
  • 12. HBase Dilemma
      Although HBase can host many applications, they may require completely opposite features.
      Examples: Events, Time Series, Entities, Message Store
  • 13. Opposite Use-Case
      • Entity Store
          • Regular (random) updates and inserts in existing entity
          • Causes entity details being spread over many files
          • Needs to read a lot of data to reconstitute “logical” view
          • Writing is often nicely distributed (can be hashed)
      • Event Store
          • One-off inserts of events such as log entries
          • Access is often a scan over partitions by time
          • Reads are efficient due to sequential write pattern
          • Writes need to be taken care of to avoid hotspotting
  • 15. Beyond Batch
      For some things MapReduce is just too slow
      Apache Hive:
      • MapReduce execution engine
      • High-latency, low throughput
      • High runtime overhead
      Google realized this early on
      • Analysts wanted fast, interactive results
  • 16. Dremel
      Google paper (2010)
      “scalable, interactive ad-hoc query system for analysis of read-only nested data”
      Columnar storage format
      Distributed scalable aggregation
      “capable of running aggregation queries over trillion-row tables in seconds”
      http://research.google.com/pubs/pub36632.html
  • 17. Impala: Goals
      • General-purpose SQL query engine for Hadoop
      • For analytical and transactional workloads
      • Support queries that take ms to hours
      • Run directly with Hadoop
          • Collocated daemons
          • Same file formats
          • Same storage managers (NN, metastore)
  • 18. Impala: Goals
      • High performance
          • C++
          • runtime code generation (LLVM)
          • direct access to data (no MapReduce)
      • Retain user experience
          • easy for Hive users to migrate
      • 100% open-source
  • 19. Impala: Architecture
      • impalad
          • runs on every node
          • handles client requests (ODBC, thrift)
          • handles query planning & execution
      • statestored
          • provides name service
          • metadata distribution
          • used for finding data
  • 24. Mapping Schemas
      HBase to Typed Schema
  • 25. Binary to Types
      • HBase only has binary keys and values
      • Hive and Impala share the same metastore, which adds types to each column
          • Can use Hive or Impala shell to change metadata
      • The row key of an HBase table is mapped to a column in the metastore, i.e. on the SQL side
          • Impala prefers “String” type to better support comparisons and sorting
  • 26. Defining the Schema
      CREATE TABLE hbase_table_1(key string, value string)
      STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler"
      WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,cf1:val")
      TBLPROPERTIES("hbase.table.name" = "xyz");
  • 27. Defining the Schema
      CREATE TABLE hbase_table_1(key string, value string)
      STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler"
      WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,cf1:val")
      TBLPROPERTIES("hbase.table.name" = "xyz");
      Callout on the slide: “Maps columns to fields” (the hbase.columns.mapping entry)
  • 28. Mapping Options
      • Can create a new table or map to an existing one
          • CREATE TABLE vs. CREATE EXTERNAL TABLE
      • Creating a table through Hive or Impala does not set any table or column family properties
          • Typically not a good idea to rely on defaults
          • Better to specify compression, TTLs, etc. on the HBase side and then map as external table
  • 29. Mapping Options
      SERDE properties to map columns to fields
      • hbase.columns.mapping
          • Matching count of entries required (on SQL side only)
          • Spaces are not allowed (as they are valid characters in HBase)
          • The “:key” mapping is a special one for the HBase row key
          • Otherwise: column-family-name:[column-name][#(binary|string)]
      • hbase.table.default.storage.type
          • Can be string (the default) or binary
          • Defines the default type
          • Binary means data is treated the way the HBase Bytes class does
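      The mapping rules on this slide can be checked mechanically. The following Python sketch is a hypothetical validator written for illustration; it is not part of Hive or Impala, and the function name and returned dictionary layout are assumptions.

```python
def parse_columns_mapping(mapping: str, sql_columns: list) -> dict:
    """Pair each SQL column with its mapping entry, enforcing the rules listed on the slide."""
    if " " in mapping:
        raise ValueError("spaces are not allowed in hbase.columns.mapping")
    entries = mapping.split(",")
    if len(entries) != len(sql_columns):
        raise ValueError(
            f"{len(sql_columns)} SQL columns but {len(entries)} mapping entries"
        )
    if sum(e.split("#")[0] == ":key" for e in entries) > 1:
        raise ValueError('only one ":key" entry is allowed')
    result = {}
    for column, entry in zip(sql_columns, entries):
        spec, _, storage = entry.partition("#")   # optional "#binary" / "#string" suffix
        if spec == ":key":
            result[column] = {"row_key": True, "storage": storage or None}
        else:
            family, _, qualifier = spec.partition(":")
            result[column] = {
                "family": family,
                "qualifier": qualifier or None,
                "storage": storage or None,
            }
    return result

print(parse_columns_mapping(":key,cf1:val", ["key", "value"]))
```

      Feeding it the mapping from the CREATE TABLE example pairs `key` with the row key and `value` with `cf1:val`.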
  • 30. Mapping Limits
      • Only one (1) “:key” is allowed
          • But can be inserted in SQL schema at will
      • Access to HBase KV versions is not supported (yet)
          • Always returns the latest version by default
          • This is very similar to what a database user expects
      • HBase columns not mapped are not visible on SQL side
      • Since row keys in HBase are unique, results may vary
          • Inserting duplicate keys updates the row while the count of rows stays the same
          • INSERT OVERWRITE does not delete existing rows but rather updates those (HBase is mutable after all!)
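      The duplicate-key behaviour is easier to see with a toy model. This Python sketch uses a plain dictionary as a stand-in for an HBase table; it is only an analogy for the upsert semantics described on the slide, and `insert_rows` is a made-up name.

```python
# Toy stand-in for an HBase table: a dict keyed by row key.
table = {}

def insert_rows(table: dict, rows) -> int:
    """Write (row_key, value) pairs; an existing key is updated, never duplicated."""
    for row_key, value in rows:
        table[row_key] = value
    return len(table)

insert_rows(table, [("98", "val_98"), ("99", "val_99")])
count = insert_rows(table, [("98", "val_98_new")])  # same key again

print(count, table["98"])  # 2 val_98_new
print("99" in table)       # True: the earlier row was not deleted
```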
  • 32. HBase Table Scan
      $ hbase shell
      hbase(main):001:0> list
      xyz
      1 row(s) in 0.0530 seconds
      (Table was created)
      hbase(main):002:0> describe "xyz"
      DESCRIPTION                                          ENABLED
      {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}   true
      1 row(s) in 0.0220 seconds
      hbase(main):003:0> scan "xyz"
      ROW          COLUMN+CELL
      0 row(s) in 0.0060 seconds
      (Table empty)
  • 33. HBase Table Scan
      Insert data from an existing table into the HBase-backed one:
      INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
      Verify on HBase side:
      hbase(main):009:0> scan "xyz"
      ROW          COLUMN+CELL
      98           column=cf1:val, timestamp=1267737987733, value=val_98
      1 row(s) in 0.0110 seconds
  • 35. HBase Scans under the Hood
      Impala uses Scan instances under the hood just as the native Java API does. This allows for all scan optimizations, e.g. predicate push-down, like
      • Start and Stop Row
      • Server-side Filters
      • Scanner caching (but not batching yet)
  • 36. Configure HBase Scan Details
      In impala-shell:
      • Same as calling setCacheBlocks(true) or setCacheBlocks(false)
          set hbase_cache_blocks=true;
          set hbase_cache_blocks=false;
      • Same as calling setCaching(rows)
          set hbase_caching=1000;
  • 37. HBase Scans under the Hood
      Back to Physics: A scan can only perform well if as little data as possible is read.
      • Need to issue queries that are known not to be full table scans
      • This requires careful schema design!
      Typical use-cases are
      • OLAP cube: read report data from single row
      • Time series: read fine-grained, time partitioned data
  • 38. OLAP Example
      • Facebook Insights is using HBase to keep an OLAP cube live, i.e. fully materialized
      • Each row reflects one tracked page and contains all its data points
          • All dimensions with time bracket prefix plus TTLs
      • During report time only one or very few rows are read
      • Design favors read over write performance
      • Could also think about hybrid system:
          • CEP + HBase + HDFS (Parquet)
  • 39. Time Series Example
      • OpenTSDB writes the metric events bucketed by metric ID and then timestamp
          • Helps using all servers in the cluster equally
      • During reporting/dashboarding the data is read for specific metrics within a specific time frame
      • Sorted data translates into effective use of Scan with start and stop rows
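      A key that leads with the metric ID and ends with the timestamp turns a dashboard query into a single contiguous range. The Python sketch below uses a simplified layout chosen for illustration (4-byte metric ID, 8-byte timestamp); it is not OpenTSDB's actual key format, and both function names are hypothetical.

```python
import struct

def metric_key(metric_id: int, timestamp: int) -> bytes:
    """Big-endian metric id, then big-endian timestamp, so byte order equals numeric order."""
    return struct.pack(">IQ", metric_id, timestamp)

def scan_range(metric_id: int, start_ts: int, end_ts: int):
    """Start row (inclusive) and stop row (exclusive) for one metric and time frame."""
    return metric_key(metric_id, start_ts), metric_key(metric_id, end_ts)

# Simulate the sorted table contents for two metrics.
rows = sorted(metric_key(m, t) for m in (1, 2) for t in (100, 200, 300))

start, stop = scan_range(1, 100, 300)
hits = [r for r in rows if start <= r < stop]
print(len(hits))  # 2
```

      Only the rows for metric 1 at timestamps 100 and 200 fall inside the range; rows for metric 2 are never touched.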
  • 40. Final Notes
      Since HBase scan performance is mainly influenced by the number of rows scanned, you need to issue queries that are selective, i.e. scan only certain rows and not the entire table.
      This requires WHERE clauses with the HBase row key in it:
      SELECT f1, f2, f3 FROM mapped_table WHERE key >= "user1234" AND key < "user1235";
      “Scan all rows for user 1234, i.e. that have a row key starting with user1234” - might be a composite key!
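      The upper bound in the WHERE clause is the prefix with its last byte incremented. A small Python sketch of that calculation follows; `prefix_stop_row` is a hypothetical helper, and returning an empty stop row to mean "scan to the end of the table" is an assumption modelled on how HBase treats an empty stop row.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`."""
    trimmed = prefix.rstrip(b"\xff")   # trailing 0xff bytes cannot be incremented
    if not trimmed:
        return b""                     # assumption: empty stop row = no upper bound
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

start = b"user1234"
stop = prefix_stop_row(start)
print(start, stop)  # b'user1234' b'user1235'
```

      Note that this range also covers keys such as `user12345`, so a composite key usually needs a delimiter or fixed-width fields after the user part.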
  • 42. Final Notes
      Not using the primary HBase index, aka the row key, results in a full table scan and might take much longer (when you have a large table).
      SELECT f1, f2, f3 FROM mapped_table WHERE f1 = "value1" OR f20 < "200";
      This will result in a full table scan. Remember: it is all just physics!
  • 43. Final Notes
      Impala also uses SingleColumnValueFilter from HBase to reduce transferred data
      • Filters out entire rows by checking a given column value
      • Does not skip rows since no index or Bloom filter is available to help identify the next match
      Overall this helps yet cannot do any magic (physics again!)
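      The difference between "filters out rows" and "skips rows" can be shown with a toy scan. This Python sketch only imitates the idea of a server-side column value check; it does not use the HBase API, and the function name and data are invented for illustration.

```python
def scan_with_value_filter(rows, column, expected):
    """Return (matching_row_keys, rows_examined) for a full pass with a column-value check."""
    examined = 0
    matches = []
    for row_key, columns in rows:
        examined += 1                    # every row is still read on the server side
        if columns.get(column) == expected:
            matches.append(row_key)      # only matching rows travel back to the client
    return matches, examined

rows = [
    (f"row{n:03d}", {"f1": "value1" if n % 10 == 0 else "other"})
    for n in range(100)
]
matches, examined = scan_with_value_filter(rows, "f1", "value1")
print(len(matches), examined)  # 10 100
```

      Transfer shrinks to the 10 matching rows, but all 100 rows are still examined, which is the "no magic" point of the slide.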
  • 44. Final Notes
      Some advice on tall-narrow vs. flat-wide table layout:
      Store data in a tall and narrow table since there is currently no support for scanner batching (i.e. intra row scanning). Mapping, for example, one million HBase columns into SQL is futile.
      This is still true for Hive’s Map support, since the entire row has to fit into memory!
  • 45. Outlook
      Future work:
      • Composite keys: map multiple SQL fields into a single composite HBase row key
      • Expose KV versions to SQL schema
      • Better predicate pushdown
          • Advanced filter or indexes?
  • 46. Questions?
      @larsgeorge
      lars@cloudera.com