Spark Workflow Management with Luigi

Spark Workflow Management
Romi Kuntsman
Senior Big Data Engineer @ Totango
romi@totango.com
https://il.linkedin.com/in/romik
„Big things are happening here“ Meetup
2015-04-29

Agenda
● Totango and Customer Success
● Totango architecture overview
● Apache Spark computing framework
● Luigi workflow engine
● Luigi in Totango

Totango and Customer Success
Your customers' success is your success

SaaS Customer Journey
[Diagram labels: START, FIRST VALUE, ONGOING VALUE, GROW VALUE, INCREASE USERS, INCREASE USAGE, EXPAND FUNCTIONALITY, DECREASE VALUE, CHURN]

Customer Success Platform
● Analytics for SaaS companies
● Clear view of the customer journey
● Proactively prevent churn
● Increase upsell
● Track feature, module and total usage
● Health score based on usage patterns
● Improve conversion from trial to paying

About Totango
● Founded in 2010
● Size: ~50 (half R&D)
● Offices in Tel Aviv and San Mateo, CA
● 120+ customers
● ~70 million events per day
● ~1.5 billion indexed documents per month
● Hosted on Amazon Web Services

Totango Architecture Overview
From usage information to actionable analytics

Terminology
● Service – Totango's customer (e.g. Zendesk)
● Account – Service's (Zendesk's) customer
● SDR (Service Data Record) – User activity event (e.g. user Joe from account Acme did activity Login in module Application)
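The SDR example above could be modeled roughly as follows. This is only an illustrative sketch: the class name and field names are assumptions, since the slides do not show the actual SDR schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDataRecord:
    """One user-activity event (SDR). Field names are hypothetical."""
    service: str   # Totango's customer, e.g. "Zendesk"
    account: str   # the service's customer, e.g. "Acme"
    user: str      # the user who performed the activity
    activity: str  # what the user did
    module: str    # where in the product it happened


# The example from the slide: user Joe from account Acme did Login in Application.
sdr = ServiceDataRecord(
    service="Zendesk",
    account="Acme",
    user="Joe",
    activity="Login",
    module="Application",
)
```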

SDR reception
● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.
● Real-time processing is also notified.

Account Data Flow
1) Raw Data (SDRs)
2) Account Aging (MySQL - legacy)
3) Activity Aggregations (Hadoop – legacy)
4) Metrics (Spark)
5) Health (Spark)
6) Alerts (Spark)
7) Indexing to Elasticsearch

Data Structure
● Account documents stored on Amazon S3
● Hierarchical directory structure per task parameter, e.g. /s-1234/prod/2015-04-27/account/metrics
● Documents have a predefined JSON schema; JSON is mapped directly to a Java document class
● Each file is an immutable collection of documents, one object per line, so it is easily partitioned by lines
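A small sketch of the layout described above. The meaning of each path component (service id, environment, date, stage) is inferred from the single example path, and the function names are hypothetical.

```python
import json
from datetime import date


def account_path(service_id: int, env: str, day: date, stage: str) -> str:
    """Build a directory path shaped like the example on the slide."""
    return f"/s-{service_id}/{env}/{day.isoformat()}/account/{stage}"


def encode_documents(docs):
    """Serialize documents as one JSON object per line."""
    return "\n".join(json.dumps(doc, sort_keys=True) for doc in docs)


def decode_documents(text):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


path = account_path(1234, "prod", date(2015, 4, 27), "metrics")
docs = [{"account": "Acme", "logins": 12}, {"account": "Initech", "logins": 1}]
round_trip = decode_documents(encode_documents(docs))
```

Because every line is a complete document, a file can be split at any line boundary and each part processed independently.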

Apache Spark
One tool to rule all data transformations

Resilient Distributed Datasets
● RDDs – distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way
● Initial RDD created from stable storage
● Programmer defines a transformation from an immutable input object to a new output object
● Transformation function class can (read: should!) be built and tested separately from Spark

Transformation flow
Read: inputRows = sparkContext.textFile(inputPath)
Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())
Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))
… other transformations may be done, all in memory …
Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())
Write: outputRows.saveAsTextFile(outputPath)
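The same decode, transform, encode chain can be tried without a cluster. In the sketch below, plain Python lists and map stand in for RDDs, so it is not Spark code. The health rule and the min_logins setting are invented for illustration; the slides do not describe the real health calculation.

```python
import json


def json_to_account_document(row):
    """Decode: one JSON line -> one document (a dict here)."""
    return json.loads(row)


def augment_document_with_health(doc, health_calc_metadata):
    """Transform: return a new document with a health field added."""
    threshold = health_calc_metadata["min_logins"]  # hypothetical setting
    health = "good" if doc.get("logins", 0) >= threshold else "poor"
    return {**doc, "health": health}  # the input document is left untouched


def account_document_to_json(doc):
    """Encode: one document -> one JSON line."""
    return json.dumps(doc, sort_keys=True)


def run_flow(input_rows, health_calc_metadata):
    input_documents = map(json_to_account_document, input_rows)
    docs_with_health = map(
        lambda doc: augment_document_with_health(doc, health_calc_metadata),
        input_documents,
    )
    return list(map(account_document_to_json, docs_with_health))


rows = ['{"account": "Acme", "logins": 12}', '{"account": "Initech", "logins": 1}']
output_rows = run_flow(rows, {"min_logins": 5})
```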

Examples (Java)
class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument> {
    public AccountDocument call(final AccountDocument document) throws Exception { … return document with health … }
}
class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument> {
    public Iterable<EventDocument> call(final AccountDocument document) throws Exception { … generate alerts … }
}

Transformation function
● Passed as a parameter to Spark operations such as map, reduce, filter, flatMap, mapPartitions
● Can (read: should!!) be checked in unit tests
● Serializable – sent to the Spark workers in serialized form
● Function must be idempotent!
● May be passed immutable metadata
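A unit-test style check for these properties might look like the sketch below. It is written in Python rather than the Java used in the talk. The scoring rule and the points_per_user setting are made up for the example.

```python
import copy
from types import MappingProxyType


class AugmentDocumentWithHealth:
    """Callable transformation that receives read-only metadata up front."""

    def __init__(self, metadata):
        # Keep a private copy behind a read-only view so it cannot change mid-run.
        self._metadata = MappingProxyType(dict(metadata))

    def __call__(self, document):
        # Hypothetical scoring rule, capped at 100.
        score = min(100, document["active_users"] * self._metadata["points_per_user"])
        return {**document, "health_score": score}


def check_idempotent(fn, document):
    """True if a retry gives the same result, re-applying changes nothing,
    and the input document is not mutated."""
    original = copy.deepcopy(document)
    first = fn(document)
    retry = fn(document)
    reapplied = fn(first)
    return first == retry and reapplied == first and document == original


augment = AugmentDocumentWithHealth({"points_per_user": 10})
doc = {"account": "Acme", "active_users": 3}
result = augment(doc)
ok = check_idempotent(augment, doc)
```

A check like this needs no Spark context, which is what makes it practical to test the function class separately.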

Luigi Workflow Engine
You build the tasks, it takes care of the plumbing

Why a workflow engine?
● Managing many ETL jobs
● Dependencies between jobs
● Continue pipeline from point of failure
● Separate workflow per service per date
● Web UI with status overview and drill-down
● Manual intervention

Workflow engines
● Azkaban, by LinkedIn (mostly for Hadoop)
● Oozie, by Apache (only for Hadoop)
● Amazon Simple Workflow Service (too generic)
● Amazon Data Pipeline (deeply tied to AWS)
● Luigi, by Spotify (customizable) – our choice!

What is Luigi
● Like a Makefile – but in Python, and for data
● Dependencies are managed directly in code
● Generic and easily extendable
● Visualization of task status and dependencies
● Command-line interface
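The Makefile analogy can be illustrated with a toy builder that uses only the standard library. It is not Luigi: the build function and the rules dictionary are invented for this sketch, and a step counts as done simply when its output file exists.

```python
import os
import tempfile


def build(target, rules, log):
    """Build target after its dependencies, skipping anything that already exists.

    rules maps a target path to (list of dependency paths, action).
    """
    if os.path.exists(target):
        return  # already done, nothing to re-run
    deps, action = rules[target]
    for dep in deps:
        build(dep, rules, log)
    action(target)
    log.append(os.path.basename(target))


def write_file(path):
    with open(path, "w") as handle:
        handle.write("done\n")


workdir = tempfile.mkdtemp()
metrics = os.path.join(workdir, "metrics")
health = os.path.join(workdir, "health")
alerts = os.path.join(workdir, "alerts")
rules = {
    metrics: ([], write_file),
    health: ([metrics], write_file),
    alerts: ([health], write_file),
}

first_run = []
build(alerts, rules, first_run)   # nothing exists yet, so all three steps run

os.remove(alerts)                 # pretend the last step failed
second_run = []
build(alerts, rules, second_run)  # only the missing step runs again
```

The second call shows the "continue from the point of failure" behavior: finished steps are detected by their outputs and skipped.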

Luigi Task Structure
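● Extend luigi.Task
Implement these methods:
● def requires(self) (optional) – the tasks this task depends on
● def output(self) – the target(s) this task produces
● def run(self) – the actual work
● input() is inherited from luigi.Task and returns the outputs of the required tasks; it is normally called, not overridden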
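The shape of a task can be sketched without installing Luigi. The Task class below is a simplified stand-in, not luigi.Task: it borrows Luigi's method names (in Luigi, dependencies are declared in requires(), and input() returns the outputs of the required tasks), but keeps outputs in an in-memory dict instead of real targets. The Metrics and Health tasks and their data are hypothetical.

```python
STORE = {}  # output name -> content; stands in for files on S3


class Task:
    """Simplified stand-in that mirrors the method names of luigi.Task."""

    def requires(self):
        return []  # no dependencies by default

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def input(self):
        # Derived from requires(): the outputs of the upstream tasks.
        return [task.output() for task in self.requires()]

    def complete(self):
        return self.output() in STORE


class Metrics(Task):
    def output(self):
        return "metrics"

    def run(self):
        STORE[self.output()] = ["acme", "initech"]


class Health(Task):
    def requires(self):
        return [Metrics()]

    def output(self):
        return "health"

    def run(self):
        (metrics_name,) = self.input()
        STORE[self.output()] = [name + ":healthy" for name in STORE[metrics_name]]


def schedule(task):
    """Run dependencies first; skip tasks whose output already exists."""
    if task.complete():
        return
    for dependency in task.requires():
        schedule(dependency)
    task.run()


schedule(Health())
```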

Luigi Predefined Tasks
● HadoopJobTask
● SparkSubmitTask
● CopyToIndex (ES)
● HiveQueryTask
● PigJobTask
● CopyToTable (RDBMS)
● … many others

Luigi in Totango
This is how we do it

Our codebase is in Java
The Java class is called from inside the task's run method
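The slides do not say how the Java class is launched. One possibility, shown here purely as an assumption, is to start a JVM as a subprocess from the task's run method. The class name, jar path and arguments below are hypothetical.

```python
import subprocess


def java_command(main_class, classpath, *args):
    """Build the argument list for launching a Java main class."""
    return ["java", "-cp", classpath, main_class, *args]


def run_java(main_class, classpath, *args):
    """Launch the class and raise CalledProcessError on a non-zero exit code,
    so that the calling task is reported as failed."""
    subprocess.run(java_command(main_class, classpath, *args), check=True)


# Hypothetical names; a task's run method would call run_java(...) with these.
cmd = java_command(
    "com.example.HealthJob",
    "/opt/jobs/pipeline.jar",
    "--service", "1234",
    "--date", "2015-04-27",
)
```

Passing the arguments as a list avoids shell quoting problems, and check=True turns a failed job into a Python exception.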

Gameboy
● Totango-specific controller for Luigi
● Provides a high-level overview
● Enables manual re-run of specific tasks
● Monitors progress, performance, run time, queue, worker load, etc.

Summary
● Typical data flow – from raw data to insights
● We use Spark for fast in-memory transformations; all code is in Java
● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi
● We don't use all of Luigi's Python abilities, and we've added some new management abilities

Questions?
The end is only the beginning
