
Airflow - Insane power in a Tiny Box

473 views

Published on

A walk-through of what Airflow is and isn't, and how to use Airflow to construct dynamic tasks and automate your entire ETL process. The presentation can be seen here:

Published in: Software


  1. Airflow: Insane power in a tiny box
  2. A BRIEF HISTORY OF DATA PIPELINES. For real, this is how we used to do it…
  3. The dev's answer to EVERYTHING: cron / crontab. This works great for some use cases but falls short in many other ways. It works well, provided the computer is on, and runs jobs at the times you set whenever it can. But there is no recovery, logs are self-managed, it's hard to know when a job actually ran, and it can only execute on one computer.
  4. It keeps tasks alive: Supervisor / supervisord. A fantastic utility that works as expected, with an optional embedded UI and CLI utility. It keeps everything up and lets you see what's going on; it even rotates logs and supports process groups. But it still executes on the one computer, and it isn't more than it advertises to be. Limited scope.
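As a hypothetical sketch of the supervisord approach the slide describes (the program name, command, and paths are illustrative, not from the deck), a minimal program entry might look like:

```ini
[program:etl_worker]
command=python /opt/etl/run_pipeline.py
autorestart=true                        ; restart the process if it dies
stdout_logfile=/var/log/etl_worker.log  ; supervisord handles log rotation
```

This keeps the process alive and captures its logs, but, as the slide notes, only on the one machine where supervisord runs.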
  5. Someone said… we can do better.
  6. Airflow is a "workflow management system" created by Airbnb. "Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform." June 2, 2016 airflow-a-workflow-management-platform-46318b977fd8 And it's all written in Python!
  7. What IS Airflow? BUT REALLY… Dependency control, task management, task recovery, charting, logging, alerting, history, folder watching, trending, dynamic tasks: ANYTHING your pipeline may need…
  8. Airflow is NOT… …perfect. So contribute, and help it get better!
  9. The Airflow architecture: Webserver / UI, Scheduler, Worker.
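These components are wired together through Airflow's configuration file. A minimal sketch of the relevant airflow.cfg entries (key names as in Airflow 1.x-era configs; all values here are illustrative assumptions, not from the deck):

```ini
[core]
dags_folder = /opt/airflow/dags
# LocalExecutor runs everything on one machine;
# CeleryExecutor distributes tasks to separate workers.
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:***@localhost/airflow

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+mysql://airflow:***@localhost/airflow
```

The scheduler, webserver, and workers all read this same file, which is how one codebase can run locally or in a distributed configuration.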
  10. WITH VERY LITTLE WORK… Airflow can be run locally, or be run in much more complex configurations.
  11. Master / Slave / UI configuration, with logs being fed to GCS.
  12. How we provision Airflow.
  13. We place it all on a single Google Compute Engine VM. No bull! Excuse me? CPU: n1-standard-2 (2 vCPUs, 7.5 GB memory). HD: 30 GB standard persistent disk (non-SSD).
  14. LET'S 
  15. A few key Airflow concepts.
      01 DAGs: a Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Written in Python.
      02 Operators: an operator describes how a single task performs in a workflow (DAG). There are many types of operators: BashOperator, PythonOperator, EmailOperator, HTTPOperator, MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, Sensor, DockerOperator.
      03 Tasks: once an operator is instantiated, it's referred to as a task.

      import time
      from pprint import pprint

      from airflow import DAG
      from airflow.operators.python_operator import PythonOperator

      dag = DAG(
          dag_id='example_python_operator',
          schedule_interval=None,
      )

      def my_sleeping_function(random_base):
          '''This is a function that will run within the DAG execution'''
          time.sleep(random_base)

      def print_context(ds, **kwargs):
          pprint(kwargs)
          print(ds)
          return 'Whatever you return gets printed in the logs'

      run_this = PythonOperator(
          task_id='print_the_context',
          provide_context=True,
          python_callable=print_context,
          dag=dag,
      )

      for i in range(10):
          # Generate 10 sleeping tasks, sleeping from 0 to 0.9 seconds respectively
          task = PythonOperator(
              task_id='sleep_for_' + str(i),
              python_callable=my_sleeping_function,
              op_kwargs={'random_base': float(i) / 10},
              dag=dag,
          )
          task.set_upstream(run_this)
  16. Stop doing things the way you always have; think dynamically. You can automate your tasks by reading source code or listing files in a directory. You don't have to worry about execution order; you only need to present Airflow with the relationships. Think in terms of how you can remove human error. Let Airflow work for you.
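As a hypothetical sketch of the "list files in a directory" idea (the directory layout, `.sql` naming convention, and `discover_task_ids` helper are illustrative assumptions, not from the deck), task ids could be derived from job files, with each id then backing a PythonOperator in the DAG:

```python
from pathlib import Path

def discover_task_ids(jobs_dir):
    """Derive one task id per SQL job file found in jobs_dir."""
    return sorted(p.stem for p in Path(jobs_dir).glob('*.sql'))

if __name__ == '__main__':
    # Demonstrate with a throwaway directory holding two job files.
    import tempfile
    d = tempfile.mkdtemp()
    for name in ('dim_account.sql', 'topic_billing.sql'):
        Path(d, name).touch()
    print(discover_task_ids(d))  # ['dim_account', 'topic_billing']
```

Adding a new job then means dropping a file in the directory; no DAG code changes, so one source of human error disappears.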
  17. Airflow really shines with dynamic tasks. A dictionary of dependencies: what if you made a script that parsed all your jobs and detected all dependencies automatically? Now what if you took that dictionary and fed it into Airflow? How would that simplify your pipeline?

      dependencies = {
          'topic_billing_frequency': [
              'dim_billing_frequency',
              'dim_account',
          ],
          'topic_payment_method': [
              'dim_credit_card_type',
              'dim_payment_accounts',
          ],
      }

      Let me show you…
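A hypothetical sketch of how that automatic detection might work (the `dim_` table-naming convention, the sample SQL, and the `parse_dependencies` helper are illustrative assumptions, not from the deck): scan each job's SQL for the dimension tables it references.

```python
import re

def parse_dependencies(jobs):
    """Map each job name to the dim_* tables its SQL references."""
    deps = {}
    for name, sql in jobs.items():
        deps[name] = sorted(set(re.findall(r'\bdim_\w+', sql)))
    return deps

jobs = {
    'topic_billing_frequency':
        'SELECT * FROM dim_billing_frequency b JOIN dim_account a ON b.id = a.bf_id',
    'topic_payment_method':
        'SELECT * FROM dim_credit_card_type c JOIN dim_payment_accounts p ON c.id = p.cc_id',
}
print(parse_dependencies(jobs))
```

The result is exactly the shape of the dependencies dictionary above, built from the jobs themselves rather than maintained by hand.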
  18. Airflow really shines with dynamic tasks. The code to run it all:
      01 Top-level dependencies: the top-level tasks are created first. Each of these tasks depends on creating the cluster and is upstream of deleting it.
      02 Child dependencies: each child dependency is iterated over, and a task is created for each. Each is given the delete task as a downstream, so delete-cluster will never run until the tasks are complete.
      03 Connect children to parents: each child task is set as an upstream of its parent task.

      all_tasks = {}

      # Create all parent tasks, top level
      for key, value in dependencies.all_dependencies.items():
          if key not in all_tasks:
              all_tasks[key] = PythonOperator(
                  task_id=key,
                  python_callable=process,
                  op_kwargs={},
                  provide_context=True,
                  dag=dag,
                  retries=30,
                  retry_delay=timedelta(minutes=10),
                  on_retry_callback=airflow_retry_function,
                  on_failure_callback=airflow_error_function,
                  on_success_callback=airflow_success_function,
              )
              all_tasks[key].set_upstream(task_create_cluster)
              all_tasks[key].set_downstream(task_delete_cluster)

      # Create all nested dependency tasks
      for key, value in dependencies.all_dependencies.items():
          for item in value:
              if item not in all_tasks:
                  all_tasks[item] = PythonOperator(
                      task_id=item,
                      python_callable=process,
                      op_kwargs={},
                      provide_context=True,
                      dag=dag,
                      retries=30,
                      retry_delay=timedelta(minutes=10),
                      on_retry_callback=airflow_retry_function,
                      on_failure_callback=airflow_error_function,
                      on_success_callback=airflow_success_function,
                  )
                  all_tasks[item].set_downstream(task_delete_cluster)
              # Connect every child to its parent, whether or not it was just created.
              all_tasks[item].set_downstream(all_tasks[key])
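To see why this wiring yields the right execution order, here is a small pure-Python sketch (the `execution_waves` helper is illustrative, not from the deck, and no Airflow is required): it resolves the dependency dictionary from the previous slide into "waves" of tasks that can run, which is effectively what the scheduler computes from the upstream/downstream relationships.

```python
def execution_waves(dependencies):
    """Group tasks into waves: a task runs once all its upstreams have run."""
    upstreams = {k: set(v) for k, v in dependencies.items()}
    # Tasks that appear only as upstreams have no dependencies themselves.
    for deps in list(upstreams.values()):
        for d in deps:
            upstreams.setdefault(d, set())
    waves, done = [], set()
    while upstreams:
        ready = sorted(t for t, deps in upstreams.items() if deps <= done)
        if not ready:
            raise ValueError('dependency cycle detected')
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del upstreams[t]
    return waves

dependencies = {
    'topic_billing_frequency': ['dim_billing_frequency', 'dim_account'],
    'topic_payment_method': ['dim_credit_card_type', 'dim_payment_accounts'],
}
print(execution_waves(dependencies))
# [['dim_account', 'dim_billing_frequency', 'dim_credit_card_type',
#   'dim_payment_accounts'], ['topic_billing_frequency', 'topic_payment_method']]
```

All the dims run first (in parallel if workers allow), then the topics; you never had to state that order, only the relationships.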
  19. What does that code do? This is real code being used today.
  20. Dovy Paukstys, Consultant at Caserta. #geek #bigdata #redux How can I help?