Hadoop First ETL on Apache Falcon
Srikanth Sundarrajan, Naresh Agarwal
About Authors
- Srikanth Sundarrajan
  - Principal Architect, InMobi Technology Services
- Naresh Agarwal
  - Director – Engineering, InMobi Technology Services
Agenda
- ETL & Challenges with Big Data
- Apache Falcon – Background
- Pipeline Designer – Overview
- Pipeline Designer – Internals
ETL (Extract, Transform, Load)
[Figure labels: Intelligence, Information, Data, Value]
ETL Use Cases
- Data Warehouse
- Data Migration
- Data Consolidation
- Master Data Management
- Data Synchronization
- Data Archiving
ETL Authoring
- Hand coded
- In-house tools
- Off-the-shelf tools
ETL & Big Data – Challenges
- Volume
- Variety
- Velocity
Big Data ETL
- Mostly hand coded (high cost: implementation + maintenance)
  - MapReduce
  - Hive (i.e. SQL)
  - Pig
  - Crunch / Cascading
  - Spark
- Off-the-shelf tools (scale/performance)
  - Mostly retrofitted
Apache Falcon
- Off the shelf, Falcon provides standard data management functions through declarative constructs
  - Data movement recipes
    - Cross data center replication
    - Cross cluster data synchronization
  - Data retention recipes
    - Eviction
    - Archival
Apache Falcon
- However, ETL-related functions are still largely left to the developer to implement. Falcon today manages only:
  - Orchestration
  - Late data handling / Change data capture
  - Retries
  - Monitoring
Pipeline Designer – Basics
- Feed
  - A data entity that Falcon manages and that is physically present in a cluster.
  - Data in a feed conforms to a schema, and the feed's partitions are registered with HCatalog.
  - Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions.
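To make the idea of a declaratively specified retention rule concrete, here is a minimal sketch in Python. It is not Falcon's API or entity format: `FeedSpec`, `retention_days`, and `partitions_to_evict` are hypothetical names chosen for illustration, and the feed is assumed to be partitioned by date.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class FeedSpec:
    """Hypothetical, simplified stand-in for a feed definition (not Falcon's format)."""
    name: str
    schema: Dict[str, str]   # column name -> type name
    retention_days: int      # declarative eviction limit


def partitions_to_evict(feed: FeedSpec, partitions: List[date], today: date) -> List[date]:
    """Return the date partitions older than the feed's retention limit, oldest first."""
    cutoff = today - timedelta(days=feed.retention_days)
    return sorted(p for p in partitions if p < cutoff)


# The feed owner only declares the limit; the eviction logic is generic.
clicks = FeedSpec(
    name="clicks",
    schema={"user_id": "string", "ts": "long"},
    retention_days=2,
)
parts = [date(2014, 6, 1), date(2014, 6, 2), date(2014, 6, 3), date(2014, 6, 4)]
evicted = partitions_to_evict(clicks, parts, today=date(2014, 6, 4))
print(evicted)
```

The point of the sketch is the split of responsibilities: the definition carries only data (schema, limit), while the platform supplies the behavior.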
Pipeline Designer – Basics
- Process
  - A workflow that defines the actions to be performed, along with their control flow
  - Executes at a specified frequency on one or more clusters
- Pipelines
  - Logical grouping of Falcon processes that are owned and operated together
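The relationship between processes, frequencies, and pipelines can be sketched as follows. This is an illustrative model only: `ProcessSpec`, `instance_times`, and `group_by_pipeline` are hypothetical names, and the process and cluster names are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class ProcessSpec:
    """Hypothetical, simplified view of a process (not Falcon's entity format)."""
    name: str
    pipeline: str          # logical group the process belongs to
    frequency: timedelta   # how often an instance runs
    clusters: List[str]    # clusters the process executes on


def instance_times(proc: ProcessSpec, start: datetime, end: datetime) -> List[datetime]:
    """Nominal run times in the half-open window [start, end)."""
    times = []
    t = start
    while t < end:
        times.append(t)
        t += proc.frequency
    return times


def group_by_pipeline(processes: List[ProcessSpec]) -> Dict[str, List[str]]:
    """Group process names by the pipeline they are tagged with."""
    groups = defaultdict(list)
    for p in processes:
        groups[p.pipeline].append(p.name)
    return dict(groups)


processes = [
    ProcessSpec("clean-clicks", "clickstream", timedelta(hours=1), ["primary"]),
    ProcessSpec("aggregate-clicks", "clickstream", timedelta(days=1), ["primary", "backup"]),
    ProcessSpec("invoice", "billing", timedelta(days=1), ["primary"]),
]
runs = instance_times(processes[0], datetime(2014, 6, 1, 0), datetime(2014, 6, 1, 3))
groups = group_by_pipeline(processes)
print(len(runs), groups)
```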
Pipeline Designer – Basics
- Actions
  - Actions in the designer are the building blocks of process workflows.
  - Actions have access to output variables emitted earlier in the flow and can emit output variables of their own.
  - Actions can transition to other actions
    - Default / Success Transition
    - Failure Transition
    - Conditional Transition
  - A transformation action is a special action that is itself a collection of transforms
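The action model described above (output variables plus success, failure, and conditional transitions) can be sketched as a small interpreter. This is an assumption-laden illustration, not the designer's implementation: `Action`, `run_flow`, and the action names are hypothetical, and conditional transitions are assumed to be checked before the default success transition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, object]  # output variables accumulated along the flow


@dataclass
class Action:
    """Hypothetical action: runs against earlier outputs and emits new ones."""
    name: str
    run: Callable[[Context], Context]
    on_success: Optional[str] = None   # default / success transition
    on_failure: Optional[str] = None   # failure transition
    # Conditional transitions: (predicate over the context, target action name).
    conditional: List[Tuple[Callable[[Context], bool], str]] = field(default_factory=list)


def run_flow(actions: Dict[str, Action], start: str) -> Tuple[List[str], Context]:
    """Execute actions from `start` until no transition applies (assumes an acyclic flow)."""
    ctx: Context = {}
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        action = actions[current]
        visited.append(current)
        try:
            ctx.update(action.run(ctx))  # emit output variables
        except Exception:
            current = action.on_failure
            continue
        # The first matching conditional transition wins; otherwise take the default.
        current = next(
            (target for cond, target in action.conditional if cond(ctx)),
            action.on_success,
        )
    return visited, ctx


flow = {
    "count": Action(
        "count",
        run=lambda ctx: {"rows": 0},
        on_success="transform",
        on_failure="alert",
        conditional=[(lambda ctx: ctx["rows"] == 0, "skip")],
    ),
    "transform": Action("transform", run=lambda ctx: {"status": "done"}),
    "skip": Action("skip", run=lambda ctx: {"status": "skipped"}),
    "alert": Action("alert", run=lambda ctx: {"status": "failed"}),
}
visited, ctx = run_flow(flow, "count")
print(visited, ctx)
```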
Pipeline Designer – Basics
- Transforms
  - A transform is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
  - Multiple transform elements can be stitched together to compose a single transformation action, which can in turn be used to build a flow
- Composite Transformations
  - Transforms that are built by combining multiple primitive transforms
  - New transforms can be added to extend the system
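A minimal sketch of how primitive transforms might be stitched into a composite one. For brevity it assumes a single input and a single output per transform, whereas the slide allows one or more of each; `Transform`, `compose`, `filter_rows`, and `project` are hypothetical names, not the designer's classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Schema = Dict[str, str]  # column name -> type name


@dataclass
class Transform:
    """Hypothetical single-input, single-output transform with a schema rule."""
    name: str
    out_schema: Callable[[Schema], Schema]   # output schema derived from input schema
    apply: Callable[[List[Row]], List[Row]]  # the data manipulation itself


def filter_rows(pred: Callable[[Row], bool]) -> Transform:
    """Primitive: keep rows matching `pred`; the schema is unchanged."""
    return Transform("filter", lambda s: dict(s), lambda rows: [r for r in rows if pred(r)])


def project(cols: List[str]) -> Transform:
    """Primitive: keep only the named columns."""
    return Transform(
        "project",
        lambda s: {c: s[c] for c in cols},
        lambda rows: [{c: r[c] for c in cols} for r in rows],
    )


def compose(name: str, steps: List[Transform]) -> Transform:
    """Composite: chain primitives, threading both schema and rows through each step."""
    def out_schema(schema: Schema) -> Schema:
        for step in steps:
            schema = step.out_schema(schema)
        return schema

    def apply(rows: List[Row]) -> List[Row]:
        for step in steps:
            rows = step.apply(rows)
        return rows

    return Transform(name, out_schema, apply)


valid_users = compose("valid_users", [filter_rows(lambda r: r["clicks"] > 0), project(["user"])])
schema = valid_users.out_schema({"user": "string", "clicks": "int", "ts": "long"})
rows = valid_users.apply([
    {"user": "a", "clicks": 2, "ts": 1},
    {"user": "b", "clicks": 0, "ts": 2},
])
print(schema, rows)
```

Because a composite is itself a `Transform`, it can be reused as a step in a larger composition, which is one way to read "possible to add more transforms and extend the system".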
Pipeline Designer – Basics
- Deployment & Monitoring
  - Once a process and its pipeline are composed, they are deployed in Falcon as a standard process
Pipeline Designer Service
[Architecture diagram; labels: Pipeline Designer, Pipeline Designer Service, REST API, Versioned Storage, Flow / Action / Transforms, Compiler + Optimizer, Falcon Server, HCatalog Service, Designer UI, Falcon Dashboard, Process, Feed, Schema]
Pipeline Designer – Internals
- Transformation actions are compiled into Pig scripts
- Actions and Flows are compiled into Falcon Process definitions
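As a rough illustration of what compiling a transformation action into a Pig script could look like, the sketch below turns a linear list of steps into Pig Latin text. It is not the designer's compiler: `compile_to_pig`, the step dictionary format, the relation names (`r0`, `r1`, ...), and the paths are all assumptions, and only filter and project steps are handled.

```python
from typing import Dict, List


def compile_to_pig(source: str, schema: Dict[str, str], steps: List[dict], target: str) -> str:
    """Emit a Pig Latin script for a linear chain of filter/project steps."""
    columns = ", ".join(f"{name}:{typ}" for name, typ in schema.items())
    lines = [f"r0 = LOAD '{source}' AS ({columns});"]
    for i, step in enumerate(steps, start=1):
        prev, cur = f"r{i - 1}", f"r{i}"
        if step["op"] == "filter":
            condition = step["condition"]
            lines.append(f"{cur} = FILTER {prev} BY {condition};")
        elif step["op"] == "project":
            cols = ", ".join(step["columns"])
            lines.append(f"{cur} = FOREACH {prev} GENERATE {cols};")
        else:
            raise ValueError(f"unsupported step: {step['op']}")
    lines.append(f"STORE r{len(steps)} INTO '{target}';")
    return "\n".join(lines)


script = compile_to_pig(
    source="/data/clicks",
    schema={"user": "chararray", "clicks": "int"},
    steps=[
        {"op": "filter", "condition": "clicks > 0"},
        {"op": "project", "columns": ["user"]},
    ],
    target="/data/valid_users",
)
print(script)
```

For the inputs above, the generated text is a four-statement script: a `LOAD` with the declared schema, a `FILTER`, a `FOREACH ... GENERATE`, and a `STORE`.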
Q & A
Thanks
mailto:sriksun@apache.org
mailto:naresh.agarwal@inmobi.com

Hadoop-Summit-2014-Apache-Falcon-Hadoop-First-ETL-Pipeline-Designer
