Spark users understand the potential of Spark for heavy-weight distributed processing. But how do you migrate an 8-year-old, single-server, MySQL-based legacy system to such a shiny new framework? How do you accurately preserve the behavior of a system that consumes gigabytes of data every day, hides numerous undocumented implicit gotchas, and changes constantly, all while shifting to brand-new development paradigms? In this talk I'll present Kenshoo's attempt at this challenge, in which we migrated a legacy aggregation system to Spark. Our solutions include heavy use of metrics and Graphite for analyzing production data; a "local-mode" client that enables reuse of legacy test suites; data validation using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions use Spark-specific characteristics and features.
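
To make the "local-mode" and side-by-side ideas concrete, here is a minimal sketch, not Kenshoo's actual code: the names `SideBySideValidation`, the legacy stand-in aggregation, and the sample data are all hypothetical. It shows how a Spark job running with a local master can be exercised in-process, the way an existing test suite would, and how its output can be compared against the legacy implementation on the same input:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: run the Spark port in local mode, as a test
// suite would, and compare its output against the legacy computation.
object SideBySideValidation {

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside the current JVM, so legacy test
    // fixtures (in-memory data, no cluster) can be reused as-is.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("side-by-side-validation")
      .getOrCreate()
    import spark.implicits._

    // Sample input; in the real system this would come from MySQL.
    val events = Seq(("campaign-1", 10L), ("campaign-1", 5L), ("campaign-2", 7L))

    // Legacy, single-server aggregation (hypothetical stand-in).
    val legacyResult: Map[String, Long] =
      events.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

    // New Spark-based aggregation over the same input.
    val sparkResult: Map[String, Long] = events.toDF("campaign", "clicks")
      .groupBy("campaign").sum("clicks")
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    // Side-by-side check: any divergence means behavior was not preserved.
    assert(legacyResult == sparkResult,
      s"Mismatch: legacy=$legacyResult spark=$sparkResult")

    spark.stop()
  }
}
```

Running against a local master keeps the test setup identical to the legacy suite's, while the assertion over both outputs is the essence of side-by-side validation.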