Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016
Joe MÉSZÁROS
software engineer
@joemesz
joemeszaros
Who are we?
Our vision is to revolutionize the KPIs and metrics the online
advertising industry currently uses. With our products
Antifraud, Brandsafety and Viewability, we provide actionable
data to our customers.
Agenda
● What do we do?
● How do we do it? - the enbrite.ly data platform
● Real-world antifraud example
● Lessons learned (LL) + Spark at scale +/-
What do we do?
[Pipeline overview: DATA COLLECTION → DATA PROCESSING → ANALYZE (ANTI FRAUD, VIEWABILITY, BRAND SAFETY) → REPORT + API]
How do we do it? DATA COLLECTION
How do we do it? DATA PROCESSING
Amazon EMR
● Most popular cloud service provider
● Amazon Big Data ecosystem
● Applications: Hadoop, Spark, Hive, …
● Scaling is easy
● Do not trust the BIG guys (API problem)
● Spark applications in EMR run on YARN (cluster manager)
For more information: https://aws.amazon.com/elasticmapreduce/
Tools we use
https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors
A workflow engine that helps you build
complex data pipelines of batch jobs.
Created by Spotify’s engineering team.
Your friendly plumber that glues your Hadoop, Spark, … jobs together
with simple dependency definitions and failure management.
import luigi

class SparkMeetupTask(luigi.Task):
    param = luigi.Parameter(default=42)

    def requires(self):
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')

    def output(self):
        return luigi.LocalTarget('/meetup/message')

if __name__ == '__main__':
    luigi.run()
Web interface
Web interface
Let me tell you a short story...
Tools we created GABO LUIGI
Luigi + enbrite.ly extensions = Gabo Luigi
● Dynamic task configuration + dependencies
● Reshaped web interface
● Define reusable data pipeline templates
● Monitoring for each task
Tools we created GABO LUIGI
Tools we created GABO LUIGI
We plan to release it to the wild and make it open
source as part of Spotify’s Luigi! If you are
interested, our doors are open :-)
Tools we created GABO MARATHON
Motivation: Testing with large data sets and slow batch jobs is
boring and wasteful!
Tools we created GABO MARATHON
Graphite
Real world example
You are fighting against robots and want to humanize
the ad tech era. You have a simple idea to detect bot traffic,
which saves the world. Let’s implement it!
Real world example
THE IDEA: Analyse events which are too hasty and deviate
from regular, humanlike profiles: too many clicks in a defined
timeframe.
INPUT: Load balancer access log files on S3
OUTPUT: Print invalid sessions
How to solve it?
Step 1: convert access log files to events
Step 2: sessionize events
Step 3: detect too many clicks
The way to the access log

Click event attributes (created by JS tracker):
{
  "session_id": "spark_meetup_jsmmmoq",
  "timestamp": 1456080915621,
  "type": "click"
}

Base64-encoded event:
eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

Access log format:
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
Step 1: log to event
To simplify: the log files are on local storage and contain only click events.

SparkConf conf = new SparkConf().setAppName("LogToEvent");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
// 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
Step 1: log to event
JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);
// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

JavaRDD<String> eventParameter = rawUrls
    .map(u -> parseUrl(u).get("event"));
// eyJzZXNzaW9uX2lkIj...

JavaRDD<String> base64Decoded = eventParameter
    .map(e -> new String(Base64.getDecoder().decode(e)));
// {"session_id": "spark_meetup_jsmmmoq",
//  "timestamp": 1456080915621, "type": "click"}

IoUtil.saveAsJsonGzipped(base64Decoded);
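parseUrl is not shown on the slides; a minimal sketch (a hypothetical helper, assuming only the query string matters and ignoring URL-decoding edge cases) could look like this:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not from the slides: splits the query string of a raw
// URL into a key/value map, so that .get("event") returns the base64 payload.
private static Map<String, String> parseUrl(String rawUrl) {
    Map<String, String> params = new HashMap<>();
    String query = URI.create(rawUrl.replace("GET ", "")).getQuery();
    if (query == null) {
        return params;
    }
    for (String pair : query.split("&")) {
        String[] kv = pair.split("=", 2);  // limit 2 keeps '=' padding inside base64 values
        params.put(kv[0], kv.length > 1 ? kv[1] : "");
    }
    return params;
}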
Step 2: event to session
SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<ClickEvent> clickEvents = jsonEvents
    .map(e -> readJsonObject(e));

JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
    clickEvents.groupBy(e -> e.getSessionId());
JavaPairRDD<String, Session> sessions = groupedEvents
    .mapValues(sessionizer);
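readJsonObject is another helper that is not on the slides; a sketch using Gson (the library choice and the ClickEvent field mapping are assumptions) could be:

import com.google.gson.FieldNamingPolicy;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

// Hypothetical helper, not from the slides: parses one JSON line like
// {"session_id": ..., "timestamp": ..., "type": ...} into a ClickEvent.
private static ClickEvent readJsonObject(String json) {
    Gson gson = new GsonBuilder()
            .setFieldNamingPolicy(FieldNamingPolicy.LOWER_CASE_WITH_UNDERSCORES)
            .create();
    return gson.fromJson(json, ClickEvent.class);
}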
Step 2: event to session
// Sessionizer
public Session call(Iterable<ClickEvent> clickEvents) {
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    Session session = new Session();
    for (ClickEvent event : ordered) {
        session.addClick(event);
    }
    return session;
}
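The slide only shows the call method; the surrounding class is presumably a Spark Function so it can be passed to mapValues. A sketch (class name and wiring assumed):

import java.util.List;
import org.apache.spark.api.java.function.Function;

// Sketch: wraps the call(...) above as a Spark Function over one session's events.
public class Sessionizer implements Function<Iterable<ClickEvent>, Session> {
    @Override
    public Session call(Iterable<ClickEvent> clickEvents) {
        List<ClickEvent> ordered = sortByTimestamp(clickEvents);  // helper from the slide
        Session session = new Session();
        for (ClickEvent event : ordered) {
            session.addClick(event);
        }
        return session;
    }
}

// Usage, as in the previous step:
// JavaPairRDD<String, Session> sessions = groupedEvents.mapValues(new Sessionizer());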
Step 2: event to session
class Session {
    public Boolean isBad = false;
    public List<Long> clickTimestamps = new ArrayList<>();

    public void addClick(ClickEvent e) {
        clickTimestamps.add(e.getTimestamp());
    }

    public void badify() { this.isBad = true; }
}
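One practical detail the slides skip: with Spark’s default Java serializer, objects that cross executor boundaries (here ClickEvent and Session) need to be serializable, roughly:

class Session implements java.io.Serializable { /* fields and methods as above */ }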
Step 3: detect bad sessions
JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<Session> markedSessions = sessions
    .map(s -> {
        if (s.clickTimestamps.size() > THRESHOLD) s.badify();
        return s;
    });
JavaRDD<Session> badSessions = markedSessions
    .filter(s -> s.isBad());
badSessions.collect().forEach(System.out::println);
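The check above only looks at the total click count per session; the stated idea was “too many clicks in a defined timeframe”, which could be approximated with a sliding window over the sorted timestamps. A sketch (tooManyClicks and timeframeMs are assumptions, not from the slides):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: true if more than `threshold` clicks fall into any window of
// `timeframeMs` milliseconds; could replace the size() > THRESHOLD check.
private static boolean tooManyClicks(List<Long> timestamps, int threshold, long timeframeMs) {
    List<Long> sorted = new ArrayList<>(timestamps);
    Collections.sort(sorted);
    int start = 0;
    for (int end = 0; end < sorted.size(); end++) {
        while (sorted.get(end) - sorted.get(start) > timeframeMs) {
            start++;  // shrink the window until it spans at most timeframeMs
        }
        if (end - start + 1 > threshold) {
            return true;
        }
    }
    return false;
}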
Congratulations!
MISSION COMPLETED
YOU just saved the world with a
simple idea in ~10 minutes.
Using Spark: pros
● Sparking is fun: community, tools
● Easy to get started
● Language support: Python, Scala, Java, R
● Unified stack: batch, streaming, SQL, ML
Using Spark: cons
● You need memory, and more memory
● Distributed application, hard to debug
● Hard to optimize
Lessons learned
● Do not use default config, always optimize! (see the sketch after this list)
● Eliminate technical debt + automate
● Failures happen: use monitoring from the very
first breath + a fault-tolerant implementation
● Sparking is fun, but not a hammer for
everything
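As an illustration of the first point, a sketch of moving off the defaults (the values below are made up, not the ones we use; the right numbers depend on the cluster and the job):

import org.apache.spark.SparkConf;

// Sketch only: example of tuning instead of relying on defaults.
SparkConf conf = new SparkConf()
        .setAppName("LogToEvent")
        .set("spark.executor.memory", "4g")       // default is only 1g
        .set("spark.executor.cores", "4")
        .set("spark.default.parallelism", "200")  // match to data size and cluster
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");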
Data platform future
● Would like to play with Redshift
● Change data format (Avro, Parquet, …) - see the sketch below
● Would like to play with streaming
● Would like to play with Spark 2.0
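For the data format change, one possible sketch with the Spark 1.x DataFrame API (the paths are placeholders and the JSON-to-Parquet route is only one option):

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch only: re-encode the JSON events as Parquet; bucket and paths are placeholders.
SQLContext sqlContext = new SQLContext(sparkContext);
DataFrame events = sqlContext.read().json("s3://<bucket>/events/");
events.write().parquet("s3://<bucket>/events-parquet/");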
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
WE ARE HIRING!
… is our mood manager, Bigyó :)
Joe MÉSZÁROS
software engineer
joe@enbrite.ly
@joemesz
@enbritely
joemeszaros
enbritely
THANK YOU!
QUESTIONS?
