HUG France - 22 Sept 2014 - @Criteo 
Data Management platform for Hadoop 
Cédric Carbone, Talend CTO 
@carbone 
Jean-Baptiste Onofré, Falcon Committer 
@jbonofre 
This material is made available under the terms of the Creative Commons
Attribution - NonCommercial - NoDerivs 2.0 France license.
http://creativecommons.org/licenses/by-nc-nd/2.0/fr/
© Talend 2014
Overview 
• Falcon is a Data Management solution for Hadoop 
• Falcon has been in production at InMobi since 2012
• InMobi contributed Falcon to the ASF in April 2013
• Falcon is in Apache incubation
• Falcon is embedded by default in HDP
• Falcon leverages many Apache components
- Oozie, Ambari, ActiveMQ, HCat, Sqoop…
• Committers/PPMC/IPMC members:
- 8 from InMobi
- 5 from Hortonworks
- 1 from Talend
Why Falcon? 
What is Falcon? 
• Data Motion: import, export, CDC
• Policy-based Lifecycle Management: retention, replication, archival, anonymization of PII data
• Process Orchestration and Scheduling: late data handling, reprocessing, dependency checking, etc.
• Multi-cluster Management: support for local/global aggregations, rollups, etc.
• Data Governance: lineage, audit, SLA
Falcon - The Solution! 
• Introduces a higher layer of abstraction: the Data Set
- Decouples a data location and its properties from workflows
- Understanding the lifetime of a feed allows implicit validation of the processing rules
• Provides the key services for data processing apps
- Common data services are simple directives; there is no need to define them verbosely in each job
- Allows process owners to keep their processing specific to their application logic
- Sits in the execution path and intercepts to handle OOB data, retries, etc.
• Promotes polyglot programming
- Does not do any heavy lifting but delegates to tools within the Hadoop ecosystem
Falcon Basic Concepts: Data Pipelines
• Cluster: represents the Hadoop cluster
• Feed: defines a “dataset”
• Process: consumes feeds, invokes processing logic, and produces feeds
• Entity Actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
Falcon Entity Relationships 
Cluster Entity 
<?xml version="1.0"?>
<cluster colo="talend-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Needed by distcp for replications -->
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <!-- Writing to HDFS -->
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <!-- Used to submit processes as MR -->
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <!-- Submit Oozie jobs -->
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <!-- Hive metastore to register/deregister partitions and get events on partition availability -->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <!-- Used for alerts -->
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <!-- HDFS directories used by the Falcon server -->
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>
Feed Entity 
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <!-- Feeds can belong to multiple groups -->
  <groups>churnAnalysisFeeds</groups>
  <!-- Feed run frequency in mins/hrs/days/mths -->
  <frequency>hours(1)</frequency>
  <!-- Late arrival cutoff -->
  <late-arrival cut-off="hours(6)"/>
  <!-- One or more source & target clusters for retention & replication -->
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Global location across clusters - HDFS paths or Hive tables -->
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <!-- Access permissions -->
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Process Entity 
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <!-- Which cluster the process should run on, and when -->
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
    </cluster>
  </clusters>
  <!-- How frequently the process runs, how many instances can run in parallel, and in what order -->
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <!-- Input & output feeds for the process -->
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <!-- The processing logic -->
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <!-- Retry policy on failure -->
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <!-- Handling late input feeds -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>
High Level Architecture 
Demo #1: my first feed
• Start Falcon on a single cluster
• Submit one simple feed
• Check the generated job in Oozie
Replication & Retention 
• Sophisticated retention policies expressed in one place 
• Simplifies data retention for audit, compliance, or data re-processing
Demo #2: process notifications
• Motivation: get notified about process activity “outside” the cluster
• Falcon manages the workflow and sends JMS notifications; a Camel route reacts to each notification
• Result: an action (Camel route) is triggered when Falcon sends a notification (eviction, late-arrival, process execution, …)
• Mixes Big Data/Hadoop and ESB technologies
Demo #2: workflow 
Falcon → ActiveMQ → Camel
2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 |
74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly,
BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED,
logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv,
feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320,
workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output,
operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local,
brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
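A route producing a log entry like the one above could look roughly like the following Camel Blueprint XML. This is a minimal sketch, not the route used in the demo. The topic name (FALCON.my-process), the broker URL (tcp://localhost:61616) and the log category (process-listener) come from the log entry. The bean id "jms" and the route id are illustrative assumptions, and the container is assumed to provide camel-blueprint, camel-jms and the ActiveMQ client.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">

  <!-- JMS component bound to the ActiveMQ broker that Falcon publishes to.
       The bean id "jms" is an assumption; it becomes the URI scheme below. -->
  <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
    <property name="connectionFactory">
      <bean class="org.apache.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://localhost:61616"/>
      </bean>
    </property>
  </bean>

  <camelContext xmlns="http://camel.apache.org/schema/blueprint">
    <!-- Hypothetical route id, chosen for illustration only -->
    <route id="falcon-process-listener">
      <!-- Falcon publishes one topic per entity: FALCON.<entity name> -->
      <from uri="jms:topic:FALCON.my-process"/>
      <!-- Log the notification; the message body arrives as a map of fields -->
      <to uri="log:process-listener"/>
    </route>
  </camelContext>

</blueprint>
```

Replacing the log endpoint with any other Camel endpoint (mail, HTTP, another queue) is what lets the notification trigger an action outside the cluster.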
Demo #2: Data Notification
• All notifications are published to ActiveMQ
• Subscribers: Camel routes
• Add some files and see the notification system working!
• http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/
Data Pipeline Tracing 
Topologies 
• STANDALONE 
– Single Data Center 
– Single Falcon Server 
– Hadoop jobs and relevant processing involve only one cluster
• DISTRIBUTED
– Multiple Data Centers
– Falcon Server per DC
– Multiple instances of Hadoop clusters and workflow schedulers
Multi-cluster failover
• Motivation: replicate a subset of the data from one cluster to another, and apply different eviction policies depending on the data subset
• Falcon manages the workflow on the primary cluster and replication to the failover cluster
• Result: supports business continuity without requiring full data reprocessing
Roadmap
• Improved late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)
• Feed implicit processes (no need for a process), providing “native” CDC
• Straightforward MR usage (without Pig, Oozie workflow XML, …)
• More data acquisition
• Monitoring/Management/Designer dashboard
• TLP!!!
Questions? 

Bio Medical Waste Management Guideliness 2023 ppt.pptx
Bio Medical Waste Management Guideliness 2023 ppt.pptxBio Medical Waste Management Guideliness 2023 ppt.pptx
Bio Medical Waste Management Guideliness 2023 ppt.pptxnaveenithkrishnan
 
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASS
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASSLESSON 10/ GROUP 10/ ST. THOMAS AQUINASS
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASSlesteraporado16
 
world Tuberculosis day ppt 25-3-2024.pptx
world Tuberculosis day ppt 25-3-2024.pptxworld Tuberculosis day ppt 25-3-2024.pptx
world Tuberculosis day ppt 25-3-2024.pptxnaveenithkrishnan
 
Check out the Free Landing Page Hosting in 2024
Check out the Free Landing Page Hosting in 2024Check out the Free Landing Page Hosting in 2024
Check out the Free Landing Page Hosting in 2024Shubham Pant
 
Presentation2.pptx - JoyPress Wordpress
Presentation2.pptx -  JoyPress WordpressPresentation2.pptx -  JoyPress Wordpress
Presentation2.pptx - JoyPress Wordpressssuser166378
 

Último (15)

Introduction to ICANN and Fellowship program by Shreedeep Rayamajhi.pdf
Introduction to ICANN and Fellowship program  by Shreedeep Rayamajhi.pdfIntroduction to ICANN and Fellowship program  by Shreedeep Rayamajhi.pdf
Introduction to ICANN and Fellowship program by Shreedeep Rayamajhi.pdf
 
WordPress by the numbers - Jan Loeffler, CTO WebPros, CloudFest 2024
WordPress by the numbers - Jan Loeffler, CTO WebPros, CloudFest 2024WordPress by the numbers - Jan Loeffler, CTO WebPros, CloudFest 2024
WordPress by the numbers - Jan Loeffler, CTO WebPros, CloudFest 2024
 
Computer 10 Lesson 8: Building a Website
Computer 10 Lesson 8: Building a WebsiteComputer 10 Lesson 8: Building a Website
Computer 10 Lesson 8: Building a Website
 
TYPES AND DEFINITION OF ONLINE CRIMES AND HAZARDS
TYPES AND DEFINITION OF ONLINE CRIMES AND HAZARDSTYPES AND DEFINITION OF ONLINE CRIMES AND HAZARDS
TYPES AND DEFINITION OF ONLINE CRIMES AND HAZARDS
 
A_Z-1_0_4T_00A-EN_U-Po_w_erPoint_06.pptx
A_Z-1_0_4T_00A-EN_U-Po_w_erPoint_06.pptxA_Z-1_0_4T_00A-EN_U-Po_w_erPoint_06.pptx
A_Z-1_0_4T_00A-EN_U-Po_w_erPoint_06.pptx
 
LESSON 5 GROUP 10 ST. THOMAS AQUINAS.pdf
LESSON 5 GROUP 10 ST. THOMAS AQUINAS.pdfLESSON 5 GROUP 10 ST. THOMAS AQUINAS.pdf
LESSON 5 GROUP 10 ST. THOMAS AQUINAS.pdf
 
Zero-day Vulnerabilities
Zero-day VulnerabilitiesZero-day Vulnerabilities
Zero-day Vulnerabilities
 
Benefits of doing Internet peering and running an Internet Exchange (IX) pres...
Benefits of doing Internet peering and running an Internet Exchange (IX) pres...Benefits of doing Internet peering and running an Internet Exchange (IX) pres...
Benefits of doing Internet peering and running an Internet Exchange (IX) pres...
 
Vision Forward: Tracing Image Search SEO From Its Roots To AI-Enhanced Horizons
Vision Forward: Tracing Image Search SEO From Its Roots To AI-Enhanced HorizonsVision Forward: Tracing Image Search SEO From Its Roots To AI-Enhanced Horizons
Vision Forward: Tracing Image Search SEO From Its Roots To AI-Enhanced Horizons
 
Niche Domination Prodigy Review Plus Bonus
Niche Domination Prodigy Review Plus BonusNiche Domination Prodigy Review Plus Bonus
Niche Domination Prodigy Review Plus Bonus
 
Bio Medical Waste Management Guideliness 2023 ppt.pptx
Bio Medical Waste Management Guideliness 2023 ppt.pptxBio Medical Waste Management Guideliness 2023 ppt.pptx
Bio Medical Waste Management Guideliness 2023 ppt.pptx
 
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASS
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASSLESSON 10/ GROUP 10/ ST. THOMAS AQUINASS
LESSON 10/ GROUP 10/ ST. THOMAS AQUINASS
 
world Tuberculosis day ppt 25-3-2024.pptx
world Tuberculosis day ppt 25-3-2024.pptxworld Tuberculosis day ppt 25-3-2024.pptx
world Tuberculosis day ppt 25-3-2024.pptx
 
Check out the Free Landing Page Hosting in 2024
Check out the Free Landing Page Hosting in 2024Check out the Free Landing Page Hosting in 2024
Check out the Free Landing Page Hosting in 2024
 
Presentation2.pptx - JoyPress Wordpress
Presentation2.pptx -  JoyPress WordpressPresentation2.pptx -  JoyPress Wordpress
Presentation2.pptx - JoyPress Wordpress
 

Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)

  • 1. HUG France - 22 Sept 2014 - @Criteo
       Data Management platform for Hadoop
       Cédric Carbone, Talend CTO, @carbone
       Jean-Baptiste Onofré, Falcon Committer, @jbonofre
       This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France licence. - http://creativecommons.org/licenses/by-nc-nd/2.0/fr/
       © Talend 2014
  • 2. Overview
       • Falcon is a data management solution for Hadoop
       • Falcon has been in production at InMobi since 2012
       • InMobi donated Falcon to the ASF in April 2013
       • Falcon is in Apache incubation
       • Falcon is embedded by default in HDP
       • Falcon leverages many Apache components
         - Oozie, Ambari, ActiveMQ, HCat, Sqoop…
       • Committer/PPMC/IPMC:
         - #8 InMobi
         - #5 Hortonworks
         - #1 Talend
  • 3. Why Falcon?
  • 4. Why Falcon?
  • 5. What is Falcon?
       • Data Motion: Import, Export, CDC
       • Policy-based Lifecycle Management: Retention, Replication, Archival, Anonymization of PII data
       • Process orchestration and scheduling: Late data handling, reprocessing, dependency checking, etc. Multi-cluster management to support Local/Global Aggregations, Rollups, etc.
       • Data Governance: Lineage, Audit, SLA
  • 6. Falcon - The Solution!
       • Introduces a higher layer of abstraction: the Data Set
         - Decouples a data location and its properties from workflows
         - Understanding the lifetime of a feed allows implicit validation of the processing rules
       • Provides the key services for data processing apps
         - Common data services are simple directives; no need to define them verbosely in each job
         - Allows process owners to keep their processing specific to their application logic
         - Sits in the execution path and intercepts to handle OOB data, retries, etc.
       • Promotes polyglot programming
         - Does not do any heavy lifting but delegates to tools within the Hadoop ecosystem
  • 7. Falcon Basic Concepts: Data Pipelines
       • Cluster: represents the Hadoop cluster
       • Feed: defines a "dataset"
       • Process: consumes feeds, invokes processing logic and produces feeds
       • Entity actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
  • 8. Falcon Entity Relationships
  • 9. Cluster Entity
       <?xml version="1.0"?>
       <cluster colo="talend-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
         <interfaces>
           <!-- Needed by distcp for replications -->
           <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
           <!-- Writing to HDFS -->
           <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
           <!-- Used to submit processes as MR -->
           <interface type="execute" endpoint="rm:8050" version="2.2.0" />
           <!-- Submit Oozie jobs -->
           <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
           <!-- Hive metastore to register/deregister partitions and get events on partition availability -->
           <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
           <!-- Used for alerts -->
           <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
         </interfaces>
         <!-- HDFS directories used by Falcon server -->
         <locations>
           <location name="staging" path="/apps/falcon/prod-cluster/staging" />
           <location name="temp" path="/tmp" />
           <location name="working" path="/apps/falcon/prod-cluster/working" />
         </locations>
       </cluster>
  • 10. Feed Entity
       <?xml version="1.0"?>
       <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
         <!-- Feed run frequency in mins/hrs/days/mths -->
         <frequency>hours(1)</frequency>
         <!-- Late arrival cutoff -->
         <late-arrival cut-off="hours(6)"/>
         <!-- Feeds can belong to multiple groups -->
         <groups>churnAnalysisFeeds</groups>
         <!-- One or more source & target clusters for retention & replication -->
         <clusters>
           <cluster name="cluster-primary" type="source">
             <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
             <retention limit="days(2)" action="delete"/>
           </cluster>
           <cluster name="cluster-secondary" type="target">
             <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
             <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
             <retention limit="days(7)" action="delete"/>
           </cluster>
         </clusters>
         <!-- Global location across clusters - HDFS paths or Hive tables -->
         <locations>
           <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
         </locations>
         <!-- Access permissions -->
         <ACL owner="hdfs" group="users" permission="0755"/>
         <schema location="/none" provider="none"/>
       </feed>
  • 11. Process Entity
       <process name="process-test" xmlns="uri:falcon:process:0.1">
         <!-- Which cluster should the process run on and when -->
         <clusters>
           <cluster name="cluster-primary">
             <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
           </cluster>
         </clusters>
         <!-- How frequently does the process run, how many instances can be run in parallel and in what order -->
         <parallel>1</parallel>
         <order>FIFO</order>
         <frequency>days(1)</frequency>
         <!-- Input & output feeds for process -->
         <inputs>
           <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
         </inputs>
         <outputs>
           <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
         </outputs>
         <!-- The processing logic -->
         <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
         <!-- Retry policy on failure -->
         <retry policy="periodic" delay="minutes(10)" attempts="3"/>
         <!-- Handling late input feeds -->
         <late-process policy="exp-backoff" delay="hours(1)">
           <late-input input="input" workflow-path="/apps/clickstream/late" />
         </late-process>
       </process>
  • 12. High Level Architecture
  • 13. Demo #1: my first feed
       • Start Falcon on a single cluster
       • Submit one simple feed
       • Check the generated job in Oozie
  • 14. Replication & Retention
       • Sophisticated retention policies expressed in one place
       • Simplify data retention for audit, compliance, or for data re-processing
  • 15. Demo #2: process notification
       • Motivation: be notified about process activity "outside" the cluster
       • Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification
       • Result: an action (Camel route) is triggered when Falcon sends a notification (eviction, late arrival, process execution, …)
       • Mix Big Data/Hadoop and ESB technologies
  • 16. Demo #2: workflow
       Falcon → ActiveMQ → Camel
       2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
  • 17. Demo #2: Data Notification
       • All notifications go through ActiveMQ
       • Subscribers: Camel routes
       • Add some files and see the notification system working!
       • http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/
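A subscriber of the kind described in slides 15 to 17 could be sketched as a Camel route in Blueprint XML. This is only an illustration under assumptions, not the route used in the demo: the broker URL and the topic name `FALCON.my-process` are taken from the log on slide 16, the route id `process-listener` is borrowed from the same log, and the bean ids `connectionFactory` and `jms` are hypothetical names chosen here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">

  <!-- ActiveMQ connection factory; the broker URL is the one shown in the slide 16 log -->
  <bean id="connectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
    <property name="brokerURL" value="tcp://localhost:61616"/>
  </bean>

  <!-- Camel JMS component bound to that connection factory (bean id "jms" is an assumption) -->
  <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
    <property name="connectionFactory" ref="connectionFactory"/>
  </bean>

  <camelContext xmlns="http://camel.apache.org/schema/blueprint">
    <!-- Consume Falcon notifications from the per-process topic and log each message -->
    <route id="process-listener">
      <from uri="jms:topic:FALCON.my-process"/>
      <to uri="log:process-listener"/>
    </route>
  </camelContext>

</blueprint>
```

The `log:` endpoint only prints each notification; replacing it with another endpoint would be the place to trigger a real action when Falcon reports an eviction, a late arrival or a process execution.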
  • 18. Data Pipeline Tracing
  • 19. Topologies
       • STANDALONE
         - Single data center
         - Single Falcon server
         - Hadoop jobs and relevant processing involve only one cluster
       • DISTRIBUTED
         - Multiple data centers
         - Falcon server per DC
         - Multiple instances of Hadoop clusters and workflow schedulers
  • 20. Multi-cluster failover
       • Motivation: replicate a subset of data from one cluster to another, and apply different eviction policies depending on the data subset
       • Falcon manages the workflow on the primary cluster and the replication to the failover cluster
       • Result: supports business continuity without requiring full data reprocessing
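The failover setup above can be expressed with the feed syntax from slide 10. The following is a minimal sketch under assumptions: the feed name `criticalLogsFeed`, the cluster name `cluster-failover`, the path and the retention limits are all hypothetical values chosen for illustration, and only this feed (the subset that needs protection) declares a target cluster.

```xml
<?xml version="1.0"?>
<!-- Hypothetical feed: only the data subset that must survive a failover declares a target cluster -->
<feed description="" name="criticalLogsFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- Primary cluster: data is produced here and evicted after a short period -->
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <!-- Failover cluster: receives the replicated data and keeps it longer -->
    <cluster name="cluster-failover" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/critical/logs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Feeds that do not need to survive a failover would simply omit the target cluster, so they are neither replicated nor retained on the second cluster.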
  • 21. Roadmap
       • Improve the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)
       • Feed implicit processes (no need for a process) providing "native" CDC
       • Straightforward MR usage (without Pig, Oozie workflow XML, …)
       • More data acquisition
       • Monitoring/Management/Designer dashboard
       • TLP!!!
  • 22. Questions?
       Data Management platform for Hadoop
       Cédric Carbone, Talend CTO, @carbone
       Jean-Baptiste Onofré, Falcon Committer, @jbonofre
       This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France licence. - http://creativecommons.org/licenses/by-nc-nd/2.0/fr/