MICRO-ETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS
Adam Muise – Principal Architect, Hortonworks
  
Who is Hortonworks?

We do Hadoop
  
The leaders of Hadoop’s development
Community driven, Enterprise Focused
Drive Innovation in the platform – We lead the roadmap
100% Open Source – Democratized Access to Data
  
What is Micro-ETL?

I made it up.

It turns out several other people have made it up before so I don’t feel like a megalomaniac.
  
Another term for it is Near-Realtime ETL.
http://www.researchgate.net/publication/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
  
Micro-ETL/Near-Realtime ETL involves:
1. An intra-batch and/or near-realtime ingest of analytical data
2. Processing on small batches of data or event streams
3. A scalable “ETL” processing framework
  	
  
Why Use Micro-ETL:
1. Your analytics require more up-to-date information than a regular batch process provides
2. There is operational risk (i.e., falling behind the data firehose, data loss, data fidelity) in leaving the processing of events until a batch can process them
3. Your data ingest rate is inconsistent and there is value in keeping up-to-date with current events.
  
When not to use Micro-ETL:
1. Your data ingest rates are predictable and analyzing them intra-batch provides little value.
2. Processing in a large batch yields a complete population of the data required to make a decision.
3. You have existing investments in traditional ETL tools that outweigh any benefits in a new tool/framework (as tools evolve, this will become a moot point, however)
  
Micro-ETL involves regular processing tasks, only run on data frameworks that can handle near-realtime event streams
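
As a rough illustration of the idea above (plain Scala, not any particular framework's API), the sketch below cuts an incoming event stream into small batches and runs the same transform-and-aggregate step on each one as it fills. The `Event` type, the `MicroEtl` helpers, and the rule that negative byte counts are malformed are all assumptions made up for this example.

```scala
// Hypothetical event type; the field names are assumptions for illustration.
final case class Event(user: String, bytes: Long)

object MicroEtl {
  // "Transform" step for one batch: drop malformed events, then total bytes per user.
  def transform(batch: Seq[Event]): Map[String, Long] =
    batch
      .filter(_.bytes >= 0)
      .groupBy(_.user)
      .map { case (user, events) => user -> events.map(_.bytes).sum }

  // Cut the stream into fixed-size micro-batches and process each one lazily,
  // so results for early events are available before later events arrive.
  def microBatches(events: Iterator[Event], batchSize: Int): Iterator[Map[String, Long]] =
    events.grouped(batchSize).map(batch => transform(batch))
}
```

A real deployment would more likely cut batches by time window than by count; count-based batching just keeps the sketch short.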
  
Let’s refresh on some core Hadoop concepts…

Refresh on YARN

YARN = Yet Another Resource Negotiator

Resource Manager + Node Managers = YARN
  
[Diagram: YARN architecture. A Resource Manager with a Scheduler; Node Managers hosting AppMasters and Containers for MapReduce, Storm, and Pig.]
  
YARN abstracts resource management so you can run all sorts of distributed applications
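
To make the shared-pool idea concrete, here is a toy model in plain Scala. It is only a sketch: the names (`NodeManager`, `Request`, `YarnSketch.allocate`) and the first-fit placement policy are assumptions for illustration, not YARN's actual scheduler, and it leaves out AppMasters, queues, and memory/CPU sizing.

```scala
// Toy model only: names and policy are assumptions, not YARN's scheduler.
final case class NodeManager(name: String, freeContainers: Int)
final case class Request(app: String, containers: Int)

object YarnSketch {
  // First-fit placement: returns, per application, the node hosting each granted container.
  def allocate(nodes: List[NodeManager], requests: List[Request]): Map[String, List[String]] = {
    // Expand each node into one slot per free container, in node order.
    val slots = nodes.flatMap(n => List.fill(n.freeContainers)(n.name))
    // Hand out slots in request order; an app gets fewer containers if the pool runs dry.
    val (placements, _) =
      requests.foldLeft((Map.empty[String, List[String]], slots)) {
        case ((placed, remaining), req) =>
          val (granted, rest) = remaining.splitAt(req.containers)
          (placed + (req.app -> granted), rest)
      }
    placements
  }
}
```

The point of the sketch is only that MapReduce, Storm, and Pig all draw containers from the same pool rather than each owning its own cluster.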
  
[Diagram: the Hadoop stack. HDFS, YARN, and YARN applications: MapReduce V2, MapReduce V?, Storm, MPI, Giraph, HBase, Tez, Spark, … and more.]
  
The following section outlines frameworks that are emerging as data processing alternatives to batch-driven MapReduce

Introducing Tez
  
Three important facts about Tez:
1. Tez is a YARN application.
2. Tez will eventually replace MapReduce.
3. Tez scales as well as the rest of Hadoop scales (thousands of nodes).
  
Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.

Tez chains tasks together into one job to get jobs like Map – Reduce – Reduce. This is ideal for apps like Hive.
  
[Diagram: a single Tez job chaining TezMap and TezReduce tasks over data; stages labeled Group By, Projections, and Order By.]
  
Tez provides the option for more complicated workflows

Tez allows for more complicated workflow primitives than just Map and Reduce. A Tez task is composed of a programmable Input, Output, and Processor.
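
The Input / Processor / Output split can be sketched in plain Scala as below. These traits are hypothetical stand-ins, not the Tez API; the in-memory `Edge` and the word-count job are assumptions chosen to show a Map – Reduce – Reduce chain running as one job.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical interfaces for illustration; these are NOT the Tez API.
trait Input[A] { def read(): Iterator[A] }
trait Processor[A, B] { def process(records: Iterator[A]): Iterator[B] }
trait Output[B] { def write(records: Iterator[B]): Unit }

// One task = Input -> Processor -> Output.
final class Task[A, B](in: Input[A], proc: Processor[A, B], out: Output[B]) {
  def run(): Unit = out.write(proc.process(in.read()))
}

// In-memory edge: the Output of one task doubles as the Input of the next.
final class Edge[A] extends Input[A] with Output[A] {
  private val buffer = ArrayBuffer.empty[A]
  def write(records: Iterator[A]): Unit = { buffer ++= records; () }
  def read(): Iterator[A] = buffer.iterator
}

object TezSketch {
  // Small helper to build a Processor from a plain function.
  def processor[A, B](f: Iterator[A] => Iterator[B]): Processor[A, B] =
    new Processor[A, B] { def process(records: Iterator[A]): Iterator[B] = f(records) }

  // Map -> Reduce (group by) -> Reduce (order by), chained as one job.
  def wordCountsOrdered(lines: Seq[String]): List[(String, Int)] = {
    val source = new Input[String] { def read(): Iterator[String] = lines.iterator }
    val words  = new Edge[String]
    val counts = new Edge[(String, Int)]
    val sorted = new Edge[(String, Int)]

    val mapTask = new Task[String, String](
      source,
      processor[String, String](in => in.flatMap(line => line.split("\\s+").toSeq).filter(_.nonEmpty)),
      words)
    val groupTask = new Task[String, (String, Int)](
      words,
      processor[String, (String, Int)](ws =>
        ws.toList.groupBy(identity).map { case (w, hits) => (w, hits.size) }.iterator),
      counts)
    val orderTask = new Task[(String, Int), (String, Int)](
      counts,
      processor[(String, Int), (String, Int)](cs =>
        cs.toList.sortBy { case (w, n) => (-n, w) }.iterator),
      sorted)

    // Run the stages in order; each reads what the previous one wrote.
    mapTask.run(); groupTask.run(); orderTask.run()
    sorted.read().toList
  }
}
```

With plain MapReduce the second reduce would need a second job and a round trip through HDFS; here the stages hand data to each other directly, which is the property that suits Hive-style queries.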
  
YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc.

* With the help of Apache Slider:
http://wiki.apache.org/incubator/SliderProposal
  
Yahoo!, the original author of Hadoop and the largest Hadoop user, is betting on Apache Hive/Tez/YARN as their core data architecture.
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
  
Refresh on Storm

Storm is a distributed execution engine that handles streaming data

Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
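
The spout-to-bolts flow can be modeled in a few lines of plain Scala. This is a single-process sketch with hypothetical types (`Fields`, `Spout`, `Bolt`, `Topology`), not Storm's classes, and the "user,bytes" record format and 100-byte threshold are assumptions for the example.

```scala
object TopologySketch {
  // Hypothetical types for illustration; these are NOT Storm's classes.
  type Fields = Map[String, String] // one tuple: named values

  // A spout is a source of tuples.
  trait Spout { def tuples(): Iterator[Fields] }

  // A bolt consumes one tuple and emits zero or more tuples.
  trait Bolt { def execute(tuple: Fields): Seq[Fields] }

  // A topology wires one spout to a series of bolts.
  final case class Topology(spout: Spout, bolts: List[Bolt]) {
    def run(): Iterator[Fields] =
      bolts.foldLeft(spout.tuples()) { (stream, bolt) =>
        stream.flatMap(t => bolt.execute(t))
      }
  }

  // Spout: wrap each raw line in a tuple.
  def lineSpout(lines: Seq[String]): Spout = new Spout {
    def tuples(): Iterator[Fields] = lines.iterator.map(line => Map("line" -> line))
  }

  // Bolt 1: parse "user,bytes" lines and drop malformed ones.
  val parse: Bolt = new Bolt {
    def execute(tuple: Fields): Seq[Fields] = tuple("line").split(",") match {
      case Array(user, bytes) if bytes.nonEmpty && bytes.forall(_.isDigit) =>
        Seq(Map("user" -> user, "bytes" -> bytes))
      case _ => Seq.empty
    }
  }

  // Bolt 2: keep only transfers of at least 100 bytes.
  val largeOnly: Bolt = new Bolt {
    def execute(tuple: Fields): Seq[Fields] =
      if (tuple("bytes").toLong >= 100L) Seq(tuple) else Seq.empty
  }
}
```

Because `run()` returns a lazy iterator, each tuple passes through every bolt as it arrives, which mirrors the event-at-a-time behavior described above. Distribution, acking, and parallelism are out of scope for the sketch.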
  
Refresh on Summingbird

Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. The same job can also run in batch on Scalding (Cascading with Scala).
  
Refresh on Spark

Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.

Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing.
  
RDDs can be created from all sorts of data, like HDFS files:
scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
  
Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark Engine.

Spark Streaming has an expressive API to allow typical transformations on RDDs or DStreams. This is comparable to MapReduce.
  
Since Micro-ETL involves porting traditional data processing tasks to different execution environments, it makes sense that data processing tools would facilitate execution on multiple platforms.

Some options in the field…
  
You can write generic processing libraries (Java) for your data and port them from MapReduce/Tez to Storm. Storm can also run Summingbird (Scala).
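
One way to read "generic processing library" is: keep the record-level logic in a pure function with no engine imports, then wrap it in thin adapters per engine. The sketch below shows that shape in plain Scala (the deck suggests Java; Scala is used here to match the Spark example). `Reading`, `Cleansing`, `BatchAdapter`, and `StreamAdapter` are hypothetical names, and the adapters only stand in for what a map task or a bolt would call.

```scala
import scala.util.Try

// Hypothetical record type for illustration.
final case class Reading(sensor: String, celsius: Double)

// Engine-independent logic: a pure function with no MapReduce, Tez, or Storm imports.
object Cleansing {
  // Parse one raw "sensor,celsius" record; None means the record is rejected.
  def parse(raw: String): Option[Reading] =
    raw.split(",").map(_.trim) match {
      case Array(sensor, value) if sensor.nonEmpty =>
        Try(value.toDouble).toOption.map(celsius => Reading(sensor, celsius))
      case _ => None
    }
}

// Hypothetical batch adapter: what a map task might do with a whole input split.
object BatchAdapter {
  def mapSplit(lines: Seq[String]): Seq[Reading] =
    lines.flatMap(line => Cleansing.parse(line).toList)
}

// Hypothetical streaming adapter: what a bolt might do with one incoming event.
object StreamAdapter {
  def onEvent(line: String, emit: Reading => Unit): Unit =
    Cleansing.parse(line).foreach(emit)
}
```

Porting then means rewriting only the adapter; the cleansing rules and their tests stay the same across engines.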
  	
  
Spark already provides a streaming and processing layer for Micro-ETL. Jobs can be written in Scala, Java, or Python.
  
Cascading has recently announced that it would support Tez and Storm immediately in version 3.0. They also have plans for Spark.
http://www.concurrentinc.com/2014/05/cascading-3-0-adds-multiple-framework-support-concurrent-driven-manages-big-data-apps/
  
Pentaho Kettle currently supports porting their data workflows to Storm in a beta version.
http://wiki.pentaho.com/display/BAD/Kettle+Execution+on+Storm
  
Talend recently raised $40 million in funding to help push further into Big Data. That includes tooling to port Talend ETL workflows to Storm and Tez.
http://techcrunch.com/2013/12/11/talend-raises-40m-to-more-aggressively-extend-into-big-data-market-sets-sights-on-ipo/
  
Discuss.

Thanks THUGs.
  

Similar to May 29, 2014 Toronto Hadoop User Group - Micro ETL

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
TheETLBottleneckinBigDataAnalytics(1)
TheETLBottleneckinBigDataAnalytics(1)TheETLBottleneckinBigDataAnalytics(1)
TheETLBottleneckinBigDataAnalytics(1)ruchabhandiwad
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 

Similar a May 29, 2014 Toronto Hadoop User Group - Micro ETL (20)

hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
TheETLBottleneckinBigDataAnalytics(1)
TheETLBottleneckinBigDataAnalytics(1)TheETLBottleneckinBigDataAnalytics(1)
TheETLBottleneckinBigDataAnalytics(1)
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 

Más de Adam Muise

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadamAdam Muise
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop IntroductionAdam Muise
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopAdam Muise
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_securityAdam Muise
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101Adam Muise
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitectureAdam Muise
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mdaAdam Muise
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - HadoopAdam Muise
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACAdam Muise
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013Adam Muise
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_pointsAdam Muise
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 

Más de Adam Muise (20)

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
2014 sept 4_hadoop_security
May 29, 2014 Toronto Hadoop User Group - Micro ETL

  • 1. MICRO-ETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS. Adam Muise, Principal Architect, Hortonworks
  • 2. Who is Hortonworks?
  • 3. We do Hadoop. The leaders of Hadoop's development. Community driven, enterprise focused. Drive innovation in the platform: we lead the roadmap. 100% open source: democratized access to data.
  • 4. What is Micro-ETL?
  • 5. I made it up.
  • 6. It turns out several other people have made it up before, so I don't feel like a megalomaniac. Another term for it is Near-Realtime ETL. http://www.researchgate.net/publication/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
  • 7. Micro-ETL/Near-Realtime ETL involves: 1. An intra-batch and/or near-realtime ingest of analytical data 2. Processing on small batches of data or event streams 3. A scalable "ETL" processing framework
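
The "small batches" idea in slide 7 can be sketched in plain Scala. This is a minimal illustration rather than anything from the talk: the `Event` type, the `windowMs` parameter, and the per-payload count used as the "transform" step are all assumptions made for the example.

```scala
// Hypothetical event type for illustration; not from the talk.
final case class Event(timestampMs: Long, payload: String)

object MicroBatch {
  // Group events into fixed-width time windows (micro-batches),
  // keyed by window start time and ordered by that key.
  def batches(events: Seq[Event], windowMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / windowMs * windowMs)
      .toSeq
      .sortBy(_._1)

  // A stand-in "transform" step applied to one micro-batch:
  // count how many events carry each payload.
  def transform(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.payload).map { case (k, v) => k -> v.size }
}
```

Each micro-batch can then be processed as soon as its window closes, instead of waiting for a nightly batch to cover the whole day.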
  • 8. Why use Micro-ETL: 1. Your analytics require more up-to-date information than a regular batch process provides. 2. There is operational risk (e.g. falling behind the data firehose, data loss, data fidelity) in leaving the processing of events until batch can process them. 3. Your data ingest rate is inconsistent and there is value in keeping up to date with current events.
  • 9. When not to use Micro-ETL: 1. Your data ingest rates are predictable and analyzing them intra-batch provides little value. 2. Processing in a large batch yields a complete population of the data required to make a decision. 3. You have existing investments in traditional ETL tools that outweigh any benefits of a new tool/framework (as tools evolve, however, this will be a moot point).
  • 10. Micro-ETL involves the regular processing tasks, only run on data frameworks that can handle near-realtime event streams.
  • 11. Let's refresh on some core Hadoop concepts…
  • 13. YARN = Yet Another Resource Negotiator
  • 14. Resource Manager + Node Managers = YARN. [Diagram: a Resource Manager with its Scheduler, and several Node Managers hosting AppMasters and Containers for MapReduce, Storm, and Pig]
  • 15. YARN abstracts resource management so you can run all sorts of distributed applications. [Diagram: HDFS and YARN underneath MapReduce V2, MapReduce V?, Storm, MPI, Giraph, HBase, Tez, Spark, … and more]
  • 16. The following section outlines frameworks that are emerging as data processing alternatives to batch-driven MapReduce.
  • 18. Three important facts about Tez: 1. Tez is a YARN application. 2. Tez will eventually replace MapReduce. 3. Tez scales as well as the rest of Hadoop scales (thousands of nodes).
  • 19. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
  • 20. Tez chains tasks together into one job to get jobs like Map-Reduce-Reduce. This is ideal for apps like Hive. [Diagram: TezMap tasks reading data and feeding successive TezReduce stages labelled Group By, Projections, and Order By]
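
The Map-Reduce-Reduce shape in slide 20 can be illustrated with ordinary Scala collections. This sketch does not use the Tez API; the `ChainedJob` object and the "key,value" input format are assumptions chosen to mirror the Group By and Order By stages named in the diagram.

```scala
object ChainedJob {
  // Map stage: parse "key,value" lines into pairs (a projection).
  def mapStage(lines: Seq[String]): Seq[(String, Int)] =
    lines.map { line =>
      val parts = line.split(",")
      (parts(0).trim, parts(1).trim.toInt)
    }

  // First reduce stage: GROUP BY key, summing the values.
  def groupByStage(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

  // Second reduce stage: ORDER BY total descending, then by key.
  def orderByStage(totals: Seq[(String, Int)]): Seq[(String, Int)] =
    totals.sortBy { case (k, total) => (-total, k) }

  // The three stages composed as a single job rather than two separate ones.
  def run(lines: Seq[String]): Seq[(String, Int)] =
    orderByStage(groupByStage(mapStage(lines)))
}
```

This is roughly the plan a Hive query with both a GROUP BY and an ORDER BY would need, which is why the slide calls the pattern ideal for Hive.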
  • 21. Tez provides the option for more complicated workflows.
  • 22. Tez allows for more complicated workflow primitives than just Map and Reduce. A Tez task is composed of a programmable Input, Output, and Processor.
  • 23. YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc. * With the help of Apache Slider: http://wiki.apache.org/incubator/SliderProposal
  • 24. Yahoo!, the original author of Hadoop and the largest Hadoop user, is betting on Apache Hive/Tez/YARN as their core data architecture. http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
  • 26. Storm is a distributed execution engine that handles streaming data.
  • 27. Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
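
The spout, bolt, and topology vocabulary in slide 27 can be modelled with plain Scala iterators. This is a toy model, not the Storm API: `ToyTopology` and its functions are hypothetical names, and everything runs in one thread rather than across a cluster.

```scala
object ToyTopology {
  // Spout: emits one tuple per incoming event; here it replays fixed lines.
  def spout(lines: Seq[String]): Iterator[String] = lines.iterator

  // Bolt 1: split each line tuple into one tuple per word.
  def splitBolt(in: Iterator[String]): Iterator[String] =
    in.flatMap(_.split("\\s+").toSeq).filter(_.nonEmpty)

  // Bolt 2: keep a running count per word and emit (word, count)
  // after every incoming tuple, as a streaming counter would.
  def countBolt(in: Iterator[String]): Iterator[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int]
    in.map { w =>
      val n = counts.getOrElse(w, 0) + 1
      counts(w) = n
      (w, n)
    }
  }

  // Topology: wire spout -> splitBolt -> countBolt.
  def run(lines: Seq[String]): List[(String, Int)] =
    countBolt(splitBolt(spout(lines))).toList
}
```

Because each stage only consumes an iterator, results are produced tuple by tuple as events arrive, which is the property that separates this style from a batch job.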
  • 29. Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. Technically it's Scalding (Cascading with Scala).
  • 31. Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.
  • 32. Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing. RDDs can be created from all sorts of data, like HDFS files:
        scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
        distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
  • 33. Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark engine.
  • 34. Spark Streaming has an expressive API to allow typical transformations on RDDs or DStreams. This is comparable to MapReduce.
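
The relationship described in slides 33 and 34, a DStream being a sequence of RDDs that share the same transformations, can be mimicked without Spark. In this sketch `Batch` stands in for an RDD and `BatchStream` for a DStream; both are assumptions for illustration, and none of this is the Spark API.

```scala
object ToyDStream {
  // Stand-ins for this sketch: a Batch plays the role of an RDD,
  // and a BatchStream is simply a sequence of batches over time.
  type Batch = Seq[String]
  type BatchStream = Seq[Batch]

  // A transformation written once against a single batch.
  def wordCount(batch: Batch): Map[String, Int] =
    batch
      .flatMap(_.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (w, ws) => w -> ws.size }

  // Applying it to the stream means applying it to every batch in turn.
  def wordCountStream(stream: BatchStream): Seq[Map[String, Int]] =
    stream.map(wordCount)
}
```

The point of the design is that the batch-level logic is written once and reused unchanged for the streaming case.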
  • 35. Since Micro-ETL involves porting traditional data processing tasks to different execution environments, it makes sense that data processing tools would facilitate execution on multiple platforms.
  • 36. Some options in the field…
  • 37. You can write generic processing libraries (Java) for your data and port them from MapReduce/Tez to Storm. Storm can also run Summingbird (Scala).
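
One way to read slide 37 is that the transform logic lives in an engine-agnostic function that both a batch job and a streaming job call. The sketch below shows that shape in Scala rather than the Java the slide mentions; the `Reading` type, the "sensor,value" format, and the two callers are hypothetical stand-ins for a MapReduce/Tez task and a Storm bolt.

```scala
// Hypothetical record type and parsing rule, for illustration only.
final case class Reading(sensor: String, value: Double)

object ReadingEtl {
  // Engine-agnostic transform: parse "sensor,value" and drop malformed lines.
  def parse(line: String): Option[Reading] =
    line.split(",").map(_.trim) match {
      case Array(s, v) if s.nonEmpty =>
        scala.util.Try(Reading(s, v.toDouble)).toOption
      case _ => None
    }

  // Batch-style caller: process a whole file's worth of lines at once.
  def runBatch(lines: Seq[String]): Seq[Reading] =
    lines.flatMap(parse(_))

  // Stream-style caller: process events one at a time as they arrive.
  def runStream(events: Iterator[String]): Iterator[Reading] =
    events.flatMap(parse(_))
}
```

Keeping `parse` free of any framework types is what makes it portable; only the thin callers would change when moving between engines.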
  • 38. Spark already provides a streaming and processing layer for Micro-ETL. Jobs can be written in Scala, Java, or Python.
  • 39. Cascading has recently announced that it will support Tez and Storm immediately in version 3.0. They also have plans for Spark. http://www.concurrentinc.com/2014/05/cascading-3-0-adds-multiple-framework-support-concurrent-driven-manages-big-data-apps/
  • 40. Pentaho Kettle currently supports porting its data workflows to Storm in a beta version. http://wiki.pentaho.com/display/BAD/Kettle+Execution+on+Storm
  • 41. Talend recently raised $40 million in funding to help push further into Big Data. That includes tooling to port Talend ETL workflows to Storm and Tez. http://techcrunch.com/2013/12/11/talend-raises-40m-to-more-aggressively-extend-into-big-data-market-sets-sights-on-ipo/