Running Apache Airflow Workflows as ETL Processes on Hadoop
By: Robert Sanders
Agenda
• What is Apache Airflow?
• Features
• Architecture
• Terminology
• Operators
• ETL Best Practices
• How they’re supported in Apache Airflow
• Executing Airflow Workflows on Hadoop
• Examples
• Kerberized Cluster
• Use Cases
• Q&A
Robert Sanders
• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ Years of Big Data Experience
• Email: robert.sanders@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/robert-sanders-61446732
• Slide Share: http://www.slideshare.net/RobertSanders49
Clairvoyant
Clairvoyant Services
What’s the problem?
• As a Big Data Engineer you work to create jobs that will
perform various operations
• Ingest data from external data sources
• Transformation of Data
• Run Predictions
• Export data
• etc.
• You need to have some mechanism to schedule and run
these jobs
• Cron
• Oozie
• Existing scheduling services have a number of limitations
that make them difficult to work with and unusable in some
situations
What is Apache Airflow?
• Airflow is an Open Source platform to programmatically
author, schedule and monitor workflows
• Workflows as Code
• Schedules Jobs through Cron Expressions
• Provides monitoring tools like alerts and a web interface
• Written in Python
• User-defined workflows and plugins are written in Python as well
• Was started in the fall of 2014 by Maxime Beauchemin at
Airbnb
• Apache Incubator Project
• Joined the Apache Software Foundation in early 2016
• https://github.com/apache/incubator-airflow/
• Latest Version of Airflow: v1.8.0
Why use Apache Airflow?
• Lightweight Workflow Platform
• Define Workflows as Code
• Makes workflows more maintainable, versionable, and
testable
• More flexible execution and workflow generation
• Lots of Features
• Automatic Retries
• SLA monitoring/alerting
• Complex dependency rules: branching, joining, sub-workflows
• Plugins
• Built-in integration with other services
• Many more…
• Feature Rich Web Interface
• Worker Processes can Scale Horizontally and Vertically
• Can be a cluster or single node setup
What is a DAG?
• Directed Acyclic Graph
• A finite directed graph that doesn’t have any cycles
• A collection of tasks to run, organized in a way that reflects
their relationships and dependencies
• Defines your Workflow
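The idea of a DAG can be sketched without Airflow at all. The example below is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not Airflow's own scheduler logic; the task names and the `run_order` and `has_cycle` helpers are assumptions made up for this sketch.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task2"},
}


def run_order(deps):
    """Return one valid execution order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True when the dependency graph is not acyclic."""
    try:
        run_order(deps)
    except CycleError:
        return True
    return False


order = run_order(dependencies)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

Because the graph has no cycles, every task has a well-defined point at which all of its upstream tasks have finished, which is what lets a scheduler decide what to run next.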
What is an Operator?
• An operator describes a single task in a workflow
• Operators allow for generation of certain types of tasks that
become nodes in the DAG when instantiated
• All operators are derived from airflow.models.BaseOperator
and inherit all its attributes and methods
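The inheritance relationship can be pictured with a toy stand-in. The classes below are hypothetical and are not Airflow's `BaseOperator`; they only assume that an operator carries shared attributes (such as `task_id` and `retries`) and implements an `execute` method that receives a context.

```python
class ToyBaseOperator:
    """Toy stand-in for a base operator (NOT airflow.models.BaseOperator)."""

    def __init__(self, task_id, retries=0):
        self.task_id = task_id
        self.retries = retries

    def execute(self, context):
        raise NotImplementedError("subclasses define what the task does")


class EchoOperator(ToyBaseOperator):
    """Hypothetical operator type: each instance becomes one task."""

    def __init__(self, message, **kwargs):
        super().__init__(**kwargs)  # inherit task_id, retries, ...
        self.message = message

    def execute(self, context):
        return "{} ran for {}: {}".format(
            self.task_id, context["execution_date"], self.message
        )


task = EchoOperator(task_id="say_hello", message="hello", retries=1)
result = task.execute({"execution_date": "2017-03-01"})
print(result)
```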
Workflow Operators (Sensors)
• A type of operator that keeps running until a certain
condition is met or it times out
• Parameterized poke interval and timeout
• Example
• HdfsSensor
• HivePartitionSensor
• NamedHivePartitionSensor
• S3KeySensor
• WebHdfsSensor
• Many More…
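The poke-interval and timeout behaviour can be sketched in plain Python. This is an illustration of the pattern only, not Airflow's sensor implementation; `wait_for` and `file_has_arrived` are hypothetical names, and raising `TimeoutError` is an assumption of this sketch.

```python
import time


def wait_for(condition, poke_interval=60.0, timeout=3600.0):
    """Call condition() every poke_interval seconds until it returns True.

    Raises TimeoutError if the condition is still False after `timeout` seconds.
    """
    started = time.monotonic()
    while not condition():
        if time.monotonic() - started >= timeout:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poke_interval)
    return True


# Simulated condition: the "file" shows up on the third check.
attempts = {"count": 0}


def file_has_arrived():
    attempts["count"] += 1
    return attempts["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval=0.01, timeout=5)
print(arrived, attempts["count"])
```

A short poke interval reacts faster but checks the external system more often; the timeout keeps a sensor from waiting forever on data that never arrives.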
Workflow Operators (Transfer)
• Operator that moves data from one system to another
• Data will be pulled from the source system, staged on the
machine where the executor is running and then transferred
to the target system
• Example:
• HiveToMySqlTransfer
• MySqlToHiveTransfer
• S3ToHiveTransfer
• Many More…
• WARNING: Avoid using these if you’re dealing with large
volumes of data
Defining a DAG
# Library imports
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

# Define global variables and default arguments
START_DATE = datetime.now() - timedelta(minutes=1)
default_args = dict(
    owner='Airflow',
    retries=1,
    retry_delay=timedelta(minutes=5),
)

# Define the DAG (the cron expression runs it daily at midnight)
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *', start_date=START_DATE)

# Define the tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

# Define the task relationships: task1 runs first, then task2, then task3
task1.set_downstream(task2)
task2.set_downstream(task3)

Resulting workflow: task1 -> task2 -> task3
Defining a DAG (Dynamically)
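dag = DAG('dag_id', …)
last_task = None
# range(1, 4) yields 1, 2, 3, so three tasks are created and chained in order
for i in range(1, 4):
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task " + str(i) + "'",
        dag=dag)
    # Every task after the first runs downstream of the previous one
    if last_task is not None:
        last_task.set_downstream(task)
    last_task = task

Resulting workflow: task1 -> task2 -> task3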
ETL Best Practices (Some of Them)
• Load Data Incrementally
• Operators receive an execution_date entry in the context,
which you can use to pull in only the data since that date
• Process Historic Data
• Backfill operations are supported
• Enforce Idempotency (retry safe)
• Execute Conditionally
• Branching, Joining
• Understand SLAs and Alerts
• Alert if there are failures (task failures and SLA misses)
• Sense when to Start a Task
• Sensor Operators
• Build Validation into your Workflows
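Incremental loading and idempotency can be combined by deriving the load window from the execution date alone. The sketch below is plain Python, independent of Airflow; the `load_window` and `incremental_query` helpers, the daily interval, and the `orders` / `updated_at` names are assumptions for illustration. It bounds the window on both sides (rather than reading "until now") so that a retry or a backfill of the same date selects the same rows.

```python
from datetime import datetime, timedelta


def load_window(execution_date, interval=timedelta(days=1)):
    """Half-open window [start, end) covered by one scheduled run."""
    return execution_date, execution_date + interval


def incremental_query(table, ts_column, execution_date):
    """Build a query that selects only the rows belonging to this run's window."""
    start, end = load_window(execution_date)
    return "SELECT * FROM {t} WHERE {c} >= '{s}' AND {c} < '{e}'".format(
        t=table,
        c=ts_column,
        s=start.isoformat(sep=" "),
        e=end.isoformat(sep=" "),
    )


query = incremental_query("orders", "updated_at", datetime(2017, 3, 1))
print(query)
```

Since the query depends only on the execution date, running it again for a historic date produces the same result, which is what makes backfills and automatic retries safe.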
Executing Airflow Workflows on Hadoop
• Airflow Workers should be installed on edge/gateway nodes
• Allows Airflow to interact with Hadoop related commands
• Utilize airflow.operators.bash_operator.BashOperator to run
command-line tools and interact with Hadoop
services
• Put all necessary scripts and Jars in HDFS and pull the files
down from HDFS during the execution of the script
• Avoids requiring you to keep copies of the scripts on
every machine where the executors are running
Executing Airflow Workflows on Hadoop – Example 1
# sqoop delta import
sqoop_extract_delta = BashOperator(
    task_id='sqoop_extract_delta',
    bash_command="sqoop job --exec <JOB_ID>",
    dag=dag)

# sqoop full table refresh
# (each trailing backslash joins the next line, so sqoop receives a single command)
sqoop_extract_full_refresh = BashOperator(
    task_id='sqoop_extract_full_refresh',
    bash_command="""
    sqoop import \
        --driver com.mysql.jdbc.Driver \
        --connect jdbc:mysql://<DB_HOST>/<DB_SCHEMA> \
        --username <USERNAME> \
        --password-file <PATH_TO_PWD_FILE> \
        --table <TABLE_TO_IMPORT> \
        --hive-import \
        --hive-overwrite \
        --hive-database <TARGET_DB> \
        --hive-table <TARGET_TABLE>
    """,
    dag=dag)
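When many tables are imported, the command string is usually built from parameters rather than typed by hand. The following is a minimal sketch of such a helper; `build_sqoop_import` and the sample values are hypothetical, and it uses only the sqoop flags shown in the example above. The resulting string could be passed as a `bash_command`.

```python
import shlex


def build_sqoop_import(db_host, db_schema, username, password_file,
                       table, target_db, target_table):
    """Return a single-line `sqoop import` command with shell-quoted arguments."""
    args = [
        "sqoop", "import",
        "--driver", "com.mysql.jdbc.Driver",
        "--connect", "jdbc:mysql://{}/{}".format(db_host, db_schema),
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--hive-import",
        "--hive-overwrite",
        "--hive-database", target_db,
        "--hive-table", target_table,
    ]
    # shlex.quote protects values that contain spaces or shell metacharacters.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_sqoop_import(
    db_host="db.example.com",        # hypothetical sample values
    db_schema="sales",
    username="etl_user",
    password_file="/user/etl/.pwd",
    table="my table",
    target_db="staging",
    target_table="my_table",
)
print(command)
```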
Executing Airflow Workflows on Hadoop – Example 2
# hive transform with file: fetch the script from HDFS, then run it with beeline
hive_transform_file = BashOperator(
    task_id='hive_transform_file',
    bash_command="""
    hadoop fs -get hdfs:///path/to/hive.hql .
    if [ -f "hive.hql" ]
    then
        beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -f hive.hql
        exit ${?}
    else
        echo "hive.hql not found."
        exit 1
    fi
    """,
    dag=dag)

# hive transform with inline query
hive_transform_exec = BashOperator(
    task_id='hive_transform_exec',
    bash_command="beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -e 'INSERT INTO TABLE <TARGET_TABLE> SELECT * FROM <SOURCE_TABLE>'",
    dag=dag)
Running on a Kerberized Cluster
• Airflow provides another process (apart from the
webserver, worker and scheduler) which renews Kerberos
tickets for the user it is running as and stores them in the ticket
cache.
• The hooks and DAGs can make use of these tickets to authenticate
against Kerberized services.
• Update airflow.cfg:
[core]
security = kerberos
[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
Use Case
• Daily ETL Batch Process to Ingest data into Hadoop
• Extract
• 23 databases total
• 1226 tables total
• Transform
• Impala scripts to join and transform data
• Load
• Impala scripts to load data into common final tables
• Other requirements
• Make it extensible to allow the client to import more databases and
tables in the future
• Status emails to be sent out after daily job to report on success and
failures
• Solution
• Create a DAG that dynamically generates the workflow based off data
in a Metastore
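The shape of such a dynamically generated workflow can be sketched in plain Python. The talk does not show the actual DAG code, so everything here is an assumption for illustration: the `build_workflow` helper, the task names, and the sample rows standing in for what would be read from the Metastore.

```python
def build_workflow(tables):
    """Build {task_id: set of upstream task_ids} from (database, table) rows.

    One extract task is generated per table; all of them feed a shared
    transform step, followed by load and a final status email.
    """
    workflow = {
        "transform": set(),
        "load": {"transform"},
        "send_status_email": {"load"},
    }
    for database, table in tables:
        extract_id = "extract_{}_{}".format(database, table)
        workflow[extract_id] = set()           # extracts have no upstream tasks
        workflow["transform"].add(extract_id)  # transform waits for every extract
    return workflow


# Hypothetical rows; in the use case these would come from the Metastore.
rows = [("crm", "customers"), ("crm", "orders"), ("billing", "invoices")]
workflow = build_workflow(rows)
for task_id in sorted(workflow):
    print(task_id, "<-", sorted(workflow[task_id]))
```

Adding a database or table then means adding a row to the Metastore rather than editing the workflow definition, which is how the extensibility requirement is met.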
Use Case (Architecture)
Use Case (DAG)
(Figures: 100-foot view and 10,000-foot view of the DAG)
Use Case (Kogni)
• New Product being built by Clairvoyant to facilitate:
• kogni-inspector – Sensitive Data Analyzer
• kogni-ingestor – Ingests Data
• kogni-guardian – Sensitive Data Masking (Encrypt and
Tokenize)
• Other components coming soon
• Utilizes Airflow for Data Ingestion and Masking
• Dynamically creates a workflow based off what is in the
Metastore
• Learn More: http://kogni.io/
Use Case (Kogni) (Architecture)
References
• https://pythonhosted.org/airflow/
• https://gtoonstra.github.io/etl-with-airflow/principles.html
• https://github.com/apache/incubator-airflow
• https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
Q&A

More Related Content

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Featured

Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 

Featured (20)

Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 

Running Apache Airflow Workflows as ETL Processes on Hadoop

  • 1. Running Apache Airflow Workflows as ETL Processes on Hadoop By: Robert Sanders
  • 2. 2Page: Agenda • What is Apache Airflow? • Features • Architecture • Terminology • Operators • ETL Best Practices • How they’re supported in Apache Airflow • Executing Airflow Workflows on Hadoop • Examples • Kerberized Cluster • Use Cases • Q&A
  • 3. 3Page: Robert Sanders • Big Data Manager, Engineer, Architect, etc. • Work for Clairvoyant LLC • 5+ Years of Big Data Experience • Email: robert.sanders@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/robert-sanders- 61446732 • Slide Share: http://www.slideshare.net/RobertSanders49
  • 6. 6Page: What’s the problem? • As a Big Data Engineer you work to create jobs that will perform various operations • Ingest data from external data sources • Transformation of Data • Run Predictions • Export data • etc. • You need to have some mechanism to schedule and run these jobs • Cron • Oozie • Existing Scheduling Services have a number of limitations that make them difficult to work with and not usable in all instances
  • 7. 7Page: What is Apache Airflow? • Airflow is an Open Source platform to programmatically author, schedule and monitor workflows • Workflows as Code • Schedules Jobs through Cron Expressions • Provides monitoring tools like alerts and a web interface • Written in Python • As well as user defined Workflows and Plugins • Was started in the fall of 2014 by Maxime Beauchemin at Airbnb • Apache Incubator Project • Joined Apache Foundation in early 2016 • https://github.com/apache/incubator-airflow/ • Latest Version of Airflow: v1.8.0
  • 8. 8Page: Why use Apache Airflow? • Lightweight Workflow Platform • Define Workflows as Code • Makes workflows more maintainable, versionable, and testable • More flexible execution and workflow generation • Lots of Features • Automatic Retries • SLA monitoring/alerting • Complex dependency rules: branching, joining, sub- workflows • Plugins • Built-in integration with other services • Many more… • Feature Rich Web Interface • Worker Processes can Scale Horizontally and Vertically • Can be a cluster or single node setup
  • 11. 11Page: What is a DAG? • Directed Acyclic Graph • A finite directed graph that doesn’t have any cycles • A collection of tasks to run, organized in a way that reflects their relationships and dependencies • Defines your Workflow
  • 12. 12Page: What is an Operator? • An operator describes a single task in a workflow • Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated • All operators are derived from airflow.models.BaseOperator and inherit all its attributes and methods
  • 13. 13Page: Workflow Operators (Sensors) • A type of operator that keeps running until a certain condition is met or it times out • Parameterized poke interval and timeout • Example • HdfsSensor • HivePartitionSensor • NamedHivePartitionSensor • S3KeyPartition • WebHdfsSensor • Many More…
  • 14. 14Page: Workflow Operators (Transfer) • Operator that moves data from one system to another • Data will be pulled from the source system, staged on the machine where the executor is running and then transferred to the target system • Example: • HiveToMySqlTransfer • MySqlToHiveTransfer • S3ToHiveTransfer • Many More… • WARNING: Avoid using these if you’re dealing with large volumes of data
  • 15. 15Page: Defining a DAG # Library Imports from airflow.models import DAG from airflow.operators import BashOperator from datetime import datetime, timedelta # Define global variables and default arguments START_DATE = datetime.now() - timedelta(minutes=1) default_args = dict( 'owner'='Airflow’, 'retries': 1, 'retry_delay': timedelta(minutes=5), ) # Define the DAG dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *’, start_date=START_DATE) # Define the Tasks task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag) task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag) task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag) # Define the Task Relationships task1.set_downstream(task2) task2.set_downstream(task3) task1 task2 task3
  • 16. 16Page: Defining a DAG (Dynamically) dag = DAG('dag_id', …) last_task = None for i in range(1, 3): task = BashOperator( task_id='task' + str(i), bash_command="echo 'Task" + str(i) + "'", dag=dag) if last_task is None: last_task = task else: last_task.set_downstream(task) last_task = task task1 task2 task3
  • 17. 17Page: ETL Best Practices (Some of Them) • Load Data Incrementally • Operators receive an execution_date entry in the context, which you can use to pull in only the data added since that date • Process Historic Data • Backfill operations are supported • Enforce Idempotency (retry safe) • Execute Conditionally • Branching, Joining • Understand SLAs and Alerts • Alert if there are failures (task failures and SLA misses) • Sense when to Start a Task • Sensor Operators • Build Validation into your Workflows
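Incremental loading and idempotency can be combined by deriving a bounded time window from the execution date and overwriting exactly one partition per run. The sketch below is plain Python under stated assumptions: the one-day interval, the `updated_at` column, the `ds` partition and both function names are hypothetical, and the window is bounded (rather than "until now") so that a retry or backfill produces the same statement every time.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def incremental_window(execution_date, interval=timedelta(days=1)):
    """Return the [start, end) window that a single scheduled run should load."""
    return execution_date, execution_date + interval


def build_load_statement(execution_date, source_table, target_table):
    """Build a retry-safe load: overwriting one partition yields the same
    result no matter how many times the run is repeated."""
    start, end = incremental_window(execution_date)
    return (
        "INSERT OVERWRITE TABLE {target} PARTITION (ds='{ds}') "
        "SELECT * FROM {source} "
        "WHERE updated_at >= '{start}' AND updated_at < '{end}'"
    ).format(
        target=target_table,
        source=source_table,
        ds=start.strftime("%Y-%m-%d"),
        start=start.strftime(TIMESTAMP_FORMAT),
        end=end.strftime(TIMESTAMP_FORMAT),
    )


statement = build_load_statement(
    datetime(2017, 3, 1), "staging.orders", "warehouse.orders")
```

In a DAG, the execution date would come from the task context instead of being hard-coded, and the resulting statement could be handed to a Hive or Impala command.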
  • 18. 18Page: Executing Airflow Workflows on Hadoop • Airflow Workers should be installed on edge/gateway nodes • Allows Airflow to run Hadoop-related commands • Utilize the airflow.operators.BashOperator to run command line functions and interact with Hadoop services • Put all necessary scripts and JARs in HDFS and pull the files down from HDFS during the execution of the script • Avoids requiring you to keep copies of the scripts on every machine where the executors are running
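One way to apply the "pull the script down from HDFS" advice is a small helper that builds the shell snippet for a bash command. This is a minimal sketch: `run_script_from_hdfs` and the example path are hypothetical, and the only Hadoop command used is `hadoop fs -get`.

```python
import shlex


def run_script_from_hdfs(hdfs_path, interpreter="bash"):
    """Return a shell snippet that fetches a script from HDFS and runs it.

    `set -e` makes the snippet stop with a non-zero exit code if the
    download fails, so the calling task is marked as failed.
    """
    local_name = hdfs_path.rstrip("/").split("/")[-1]
    return "\n".join([
        "set -e",
        "hadoop fs -get {} .".format(shlex.quote(hdfs_path)),
        "{} {}".format(interpreter, shlex.quote(local_name)),
    ])


command = run_script_from_hdfs("hdfs:///path/to/transform.sh")
```

The returned string could be passed as the `bash_command` of a BashOperator, so only HDFS holds the script and the worker nodes need no local copy.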
  • 19. 19Page: Executing Airflow Workflows on Hadoop – Example 1
      # sqoop delta import
      sqoop_extract_delta = BashOperator(
          task_id='sqoop_extract_delta',
          bash_command="sqoop job --exec <JOB_ID>",
          dag=dag)

      # sqoop full table refresh
      # (the trailing backslashes join the options into a single command line)
      sqoop_extract_full_refresh = BashOperator(
          task_id='sqoop_extract_full_refresh',
          bash_command="""
          sqoop import \
              --driver com.mysql.jdbc.Driver \
              --connect jdbc:mysql://<DB_HOST>/<DB_SCHEMA> \
              --username <USERNAME> \
              --password-file <PATH_TO_PWD_FILE> \
              --table <TABLE_TO_IMPORT> \
              --hive-import \
              --hive-overwrite \
              --hive-database <TARGET_DB> \
              --hive-table <TARGET_TABLE>
          """,
          dag=dag)
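When many tables need the same full-refresh import, the Sqoop command line can be generated rather than hand-written. The sketch below uses only the Sqoop options shown on this slide; the function name and the sample host, schema and table values are hypothetical.

```python
import shlex


def sqoop_import_command(db_host, db_schema, username, password_file,
                         table, target_db, target_table):
    """Build a single-line `sqoop import` command for a full table refresh."""
    args = [
        "sqoop", "import",
        "--driver", "com.mysql.jdbc.Driver",
        "--connect", "jdbc:mysql://{}/{}".format(db_host, db_schema),
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--hive-import",
        "--hive-overwrite",
        "--hive-database", target_db,
        "--hive-table", target_table,
    ]
    # Quote every argument so unusual table names cannot break the shell command.
    return " ".join(shlex.quote(arg) for arg in args)


command = sqoop_import_command(
    "db1.example.com", "sales", "etl_user", "/user/etl/.pwd",
    "orders", "staging", "orders")
```

Each generated string can serve as the `bash_command` of its own BashOperator, one task per table.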
  • 20. 20Page: Executing Airflow Workflows on Hadoop – Example 2
      # hive transform with file
      hive_transform_file = BashOperator(
          task_id='hive_transform_file',
          bash_command="""
          hadoop fs -get hdfs:///path/to/hive.hql .
          if [ -f "hive.hql" ]
          then
              beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -f hive.hql
              exit ${?}
          else
              echo "hive.hql not found."
              exit 1
          fi
          """,
          dag=dag)

      # hive transform with inline statement
      hive_transform_exec = BashOperator(
          task_id='hive_transform_exec',
          bash_command="beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -e 'INSERT INTO TABLE <TARGET_TABLE> SELECT * FROM <SOURCE_TABLE>'",
          dag=dag)
  • 21. 21Page: Running on a Kerberized Cluster • Airflow provides another process (apart from the webserver, worker and scheduler) that renews Kerberos tickets for the user it is running as and stores them in the ticket cache. • The hooks and DAGs can use these tickets to authenticate against Kerberized services. • Update airflow.cfg:
      [core]
      security = kerberos

      [kerberos]
      keytab = /etc/airflow/airflow.keytab
      reinit_frequency = 3600
      principal = airflow
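Before starting the services, it can help to sanity-check that these settings are present. The sketch below parses the snippet from this slide with Python's standard `configparser`; `kerberos_settings` is a hypothetical helper, not part of Airflow, and it assumes the file follows plain INI syntax.

```python
import configparser

# The airflow.cfg fragment from the slide.
AIRFLOW_CFG_SNIPPET = """
[core]
security = kerberos

[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
"""


def kerberos_settings(cfg_text):
    """Return the Kerberos settings as a dict, or None if Kerberos is not enabled."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if parser.get("core", "security", fallback=None) != "kerberos":
        return None
    section = parser["kerberos"]
    return {
        "keytab": section["keytab"],
        "principal": section["principal"],
        "reinit_frequency": section.getint("reinit_frequency"),
    }


settings = kerberos_settings(AIRFLOW_CFG_SNIPPET)
```

A deployment script could go one step further and check that the keytab path exists on the node before any worker starts.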
  • 22. 22Page: Use Case • Daily ETL Batch Process to Ingest data into Hadoop • Extract • 23 databases total • 1226 tables total • Transform • Impala scripts to join and transform data • Load • Impala scripts to load data into common final tables • Other requirements • Make it extensible to allow the client to import more databases and tables in the future • Status emails to be sent out after the daily job to report on successes and failures • Solution • Create a DAG that dynamically generates the workflow based on data in a Metastore
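The metastore-driven approach can be sketched without Airflow as a function that turns metastore rows into task names and dependencies. Everything here is illustrative: the sample rows, the task naming scheme and the single shared transform, load and email steps are assumptions, not the actual solution's layout.

```python
from collections import OrderedDict

# Hypothetical metastore rows: (database, table) pairs to ingest.
METASTORE_ROWS = [
    ("crm", "customers"),
    ("crm", "orders"),
    ("billing", "invoices"),
]


def build_workflow(rows):
    """Return {task_id: [downstream task_ids]} for an extract/transform/load flow."""
    graph = OrderedDict()
    # One extract task per table; all of them feed the shared transform step.
    for database, table in rows:
        graph["extract_{}_{}".format(database, table)] = ["transform"]
    graph["transform"] = ["load"]
    graph["load"] = ["send_status_email"]
    graph["send_status_email"] = []
    return graph


workflow = build_workflow(METASTORE_ROWS)
```

In a DAG file, each key would become an operator and each edge a `set_downstream` call, so adding a row to the metastore adds a task on the next parse without any code change.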
  • 24. 24Page: Use Case (DAG) • Figures: 100 foot view and 10,000 foot view of the DAG
  • 25. 25Page: Use Case (Kogni) • New Product being built by Clairvoyant to facilitate: • kogni-inspector – Sensitive Data Analyzer • kogni-ingestor – Ingests Data • kogni-guardian – Sensitive Data Masking (Encrypt and Tokenize) • Other components coming soon • Utilizes Airflow for Data Ingestion and Masking • Dynamically creates a workflow based on what is in the Metastore • Learn More: http://kogni.io/
  • 26. 26Page: Use Case (Kogni) (Architecture)
  • 27. 27Page: References • https://pythonhosted.org/airflow/ • https://gtoonstra.github.io/etl-with-airflow/principles.html • https://github.com/apache/incubator-airflow • https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
  • 28. Q&A