Metail lets users discover how clothes look on their body shape online, from a minimal set of measurements. With your avatar you can create outfits, and coupled with our size advice this gives you confidence in the size and fit.
I'm part of the team within Metail that has built a pipeline to collect, enrich and serve data to the company and our clients, and which has been used to validate Metail's product. This talk was given at the AWS Loft in London on 21st April 2016, where I gave an overview of the end-to-end pipeline and then went into detail on how we're using AWS EMR to perform batch processing of the collected data, which is then served internally with Redshift.
1.
April 2016 – AWS Loft, London
Gareth Rogers, Data Engineer
2.
Metail lets you try on clothes online
Discover clothes on your body shape
Create, save and share outfits
Shop with confidence of size and fit
3.
Proven impact as validated by American business schools and A/B tests
“…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…”
“…shows approximately a 5.1 percent reduction in returns compared to the control group… In other words, providing fit information reduces average fulfilment costs”
“…sales for users with access to the tool were substantially higher overall - 22.32 percent larger”
Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College – Tuck School of Business) & Prof Antonio Moreno (Northwestern University), Oct 21, 2015
3M DATA POINTS
1000+ GARMENTS
4.
Architecture Theory
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
• Should include a speed layer to give a real-time view on sampled data
– We’ve not implemented this
[Diagram: Lambda Architecture – new data feeds the batch layer, which holds the master dataset and computes batch views in the serving layer; queries run against those views]
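To make the batch layer idea concrete: every batch view is a pure function recomputed over the immutable master dataset, nothing is updated in place. A minimal sketch in Clojure (the language of our ETL); the event shape and the aggregation are hypothetical:

    ;; A batch view is recomputed from scratch over the whole master dataset.
    ;; The event shape here is a hypothetical illustration.
    (defn page-views-per-day
      [master-dataset]
      (->> master-dataset
           (filter #(= "pv" (:event %)))
           (map :date)
           frequencies))

    ;; e.g. (page-views-per-day [{:event "pv" :date "2016-04-20"} ...])
    ;; => {"2016-04-20" 1, ...}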
6.
Architecture Practice – Data Collection
New Data and Collection
• We’re using Snowplow for the initial stages of our pipeline
• Using their JavaScript tracker and Cloudfront collector configuration
• Tracker performs a GET request on a Cloudfront-distributed image (pixel)
• Query parameters of the request contain the event data, e.g.
GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
• Cloudfront configured to log the requests to S3
• We now have our master record
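As an illustration of how much information one pixel request carries, here is a hedged sketch in Clojure that decodes such a query string into an event map; the parsing is deliberately naive compared with Snowplow's real enrichment:

    (ns example.collect
      (:require [clojure.string :as str])
      (:import [java.net URLDecoder]))

    ;; Turn "e=pv&url=...&page=..." into {"e" "pv", "url" "...", ...}.
    (defn parse-pixel-query [qs]
      (into {}
            (for [pair (str/split qs #"&")
                  :let [[k v] (str/split pair #"=" 2)]]
              [k (URLDecoder/decode (or v "") "UTF-8")])))

    ;; (parse-pixel-query "e=pv&page=Home&url=http%3A%2F%2Fmetail.com%2F")
    ;; => {"e" "pv", "page" "Home", "url" "http://metail.com/"}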
7.
Architecture Practice – Serving Layer
• Initially queries ran over Hadoop, then Redshift came along
• Redshift SQL: good for a small data science team!
• Not so good for everyone else in the company
• Introduced Looker
– Data model in SQL
– Dashboards
– Point and click data exploration
– Permissions
– Version control
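Because Redshift speaks the PostgreSQL wire protocol, the serving layer is reachable from any JDBC client. A minimal sketch with clojure.java.jdbc; the endpoint, credentials and table are hypothetical:

    (ns example.serve
      (:require [clojure.java.jdbc :as jdbc]))

    ;; Hypothetical cluster endpoint and credentials; Redshift listens on
    ;; port 5439 and accepts the PostgreSQL JDBC driver.
    (def redshift
      {:subprotocol "postgresql"
       :subname "//example.abc123.eu-west-1.redshift.amazonaws.com:5439/analytics"
       :user "analyst"
       :password "..."})

    ;; Page views per day from a hypothetical batch-view table.
    (jdbc/query redshift
                ["SELECT collector_date, count(*) AS page_views
                    FROM events
                   WHERE event = ?
                   GROUP BY collector_date" "pv"])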
8.
Architecture Practice – Batch Layer
• Process the raw events daily to create the batch views
• Run on Elastic MapReduce (EMR), AWS’s hosted Hadoop service
• Create views of the master record through enrichment and aggregation
• Populates the schema for speedy Redshift queries (see the COPY sketch below)
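"Populating the schema" usually amounts to Redshift's COPY command pulling the batch views straight from S3. A hedged sketch, reusing the redshift spec above; the table, bucket and IAM role are hypothetical:

    ;; COPY is Redshift's bulk-load path: it reads the batch-view files
    ;; directly from S3, so no data flows through the client.
    (jdbc/execute! redshift
      ["COPY page_views_per_day
          FROM 's3://example-batch-views/page_views_per_day/'
          IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
          DELIMITER '\t' GZIP"])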
9.
Extract Transform and Load (ETL)
• Snowplow’s ETL is driven by config files and executed by a Ruby application
– Initial step executed outside of EMR
– Copy data from the Cloudfront incoming log bucket to another S3 bucket for processing (sketched below)
– Next, create the EMR cluster
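A hedged sketch of that pre-EMR staging step in Clojure, going straight at the AWS Java SDK; the bucket names are hypothetical, and Snowplow's real EmrEtlRunner does considerably more bookkeeping:

    (import '[com.amazonaws.services.s3 AmazonS3ClientBuilder])

    ;; Move each CloudFront log from the incoming bucket into a processing
    ;; bucket, so the EMR run works on a stable snapshot of the logs.
    ;; (Pagination of listings is omitted for brevity.)
    (defn stage-incoming-logs! []
      (let [s3 (AmazonS3ClientBuilder/defaultClient)]
        (doseq [summary (.getObjectSummaries
                          (.listObjects s3 "example-cloudfront-logs"))]
          (let [k (.getKey summary)]
            (.copyObject s3 "example-cloudfront-logs" k
                            "example-etl-processing" k)
            (.deleteObject s3 "example-cloudfront-logs" k)))))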
11.
Extract Transform and Load (ETL)
• To that cluster we add steps
• The initial step uses s3distcp to aggregate the log files (see the sketch after this list)
• Snowplow’s ETL is written in Scalding
– Scalding = Cascading (higher-level Java MapReduce libraries) in Scala
– They provide a compiled JAR hosted in S3
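For flavour, the aggregation step comes down to an s3distcp invocation like the one sketched below as an EMR step expressed in Clojure data; --groupBy and --targetSize are real s3distcp options that coalesce many small CloudFront logs into fewer, larger files, while the JAR path, buckets and regex are assumptions:

    ;; Hedged sketch: one EMR step, expressed as data. s3distcp copies from
    ;; S3 into HDFS while grouping small log files into ~128 MB chunks.
    (def s3distcp-step
      {:name "Aggregate CloudFront logs"
       :jar  "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar" ;; assumed path
       :args ["--src"        "s3://example-etl-processing/"
              "--dest"       "hdfs:///local/snowplow-logs/"
              "--groupBy"    ".*\\.([0-9]+-[0-9]+)\\..*\\.gz" ;; hypothetical
              "--targetSize" "128"]})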
12.
Extract Transform and Load (ETL)
• Metail’s ETL is very similar to Snowplow’s
• Use AWS’ Data Pipeline to drive the workflow
– Really great to get going
– But quickly hit complexity limitations
13.
Extract Transform and Load (ETL)
• Metail’s ETL is written in Cascalog
– Cascalog: logic programming over Hadoop
– Cascalog = Cascading + Datalog in Clojure
– Ridiculously compact and expressive
– But a steep learning curve and impenetrable errors
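To show what "ridiculously compact" means, here is a small Cascalog query in the style of our jobs, counting page views per page; the generator and field names are hypothetical:

    (ns example.etl
      (:require [cascalog.api :refer :all]
                [cascalog.logic.ops :as ops]))

    ;; `events` is assumed to be a 2-field generator of [event-type page],
    ;; e.g. a tap over the enriched events on S3.
    (defn page-view-counts [events]
      (<- [?page ?views]
          (events :> ?event ?page)
          (= ?event "pv")
          (ops/count :> ?views)))

    ;; Run it, writing text output:
    ;; (?- (hfs-textline "s3://example-batch-views/page-views/")
    ;;     (page-view-counts ...))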
14.
Extract Transform and Load (ETL)
• Soon moved to Parkour, a Clojure wrapper over the Hadoop Java API
– Access to the full Hadoop API with no abstractions, just more idiomatic Clojure
– The learning curve is mainly Hadoop’s
– Errors are still impenetrable
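We won't reproduce Parkour's job-graph API here, but the flavour of "just more idiomatic Clojure" is that a task body is an ordinary function over the input tuples rather than a subclass of a Hadoop Mapper; a hypothetical map-side sketch:

    (require '[clojure.string :as str])

    ;; Hypothetical map-side logic: a plain Clojure function from input lines
    ;; to key/value pairs, with no Hadoop interfaces in sight. Parkour wires
    ;; such functions (referenced by var) into the real MapReduce job.
    (defn page-views-m [lines]
      (->> lines
           (map #(str/split % #"\t"))
           (keep (fn [[event page]]
                   (when (= event "pv") [page 1])))))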
15.
Summary
• This pipeline has been built and managed by 3-5 people
• It’s about a year and a half old and continues to evolve
• Composed of a few different technologies, with EMR doing the batch processing
• Using EMR has made cluster management and scaling straightforward
• The synergy between EMR and S3 is a powerful feature
– Encourages immutable infrastructure
– You don’t need your compute cluster running to hold your data!