Metail lets users discover how clothes look on their body shape online, from a minimal set of measurements. With your avatar you can create outfits, and coupled with our size advice this gives you confidence in the size and fit.
I'm part of the team within Metail that has built a pipeline to collect, enrich and serve data to the company and our clients, and which has been used to validate Metail's product. This talk was given at the AWS Loft in London on 21st April 2016, where I gave an overview of the end-to-end pipeline and then went into detail on how we're using AWS EMR to perform batch processing of the collected data, which is then served internally with Redshift.
1.
April 2016 – AWS Loft, London
Gareth Rogers, Data Engineer
2.
Metail lets you try on clothes online
Discover clothes on your body shape
Create, save and share outfits
Shop with confidence of size and fit
3.
Proven impact as validated by American business schools and A/B tests
“…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…”
“…shows approximately a 5.1 percent reduction in returns compared to the control group… In other words, providing fit information reduces average fulfilment costs”
“…sales for users with access to the tool were substantially higher overall - 22.32 percent larger”
Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College – Tuck School of Business) & Prof Antonio Moreno (Northwestern University), Oct 21, 2015
3M DATA POINTS
1000+ GARMENTS
4.
Architecture Theory
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
• Should include a speed layer to give a real-time view on sampled data
– We’ve not implemented this
[Diagram: Lambda Architecture – new data feeds the batch layer, which holds the master dataset and computes batch views in the serving layer; queries run against those views]
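To make the batch layer idea concrete: every batch view is a pure function recomputed over the immutable master dataset, nothing is updated in place. A minimal sketch in Clojure (the language of our ETL); the event shape and the aggregation are hypothetical:

    ;; A batch view is recomputed from scratch over the whole master dataset.
    ;; The event shape here is a hypothetical illustration.
    (defn page-views-per-day
      [master-dataset]
      (->> master-dataset
           (filter #(= "pv" (:event %)))
           (map :date)
           frequencies))

    ;; e.g. (page-views-per-day [{:event "pv" :date "2016-04-20"} ...])
    ;; => {"2016-04-20" 1, ...}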
6.
Architecture Practice – Data Collection
New Data and Collection
• We’re using Snowplow for the initial stages of our pipeline
• Using their JavaScript tracker and Cloudfront collector configuration
• Tracker performs a GET request on a Cloudfront-distributed image (pixel)
• Query parameters of the request contain the event data, e.g.
GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
• Cloudfront configured to log the requests to S3
• We now have our master record
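As an illustration of how much information one pixel request carries, here is a hedged sketch in Clojure that decodes such a query string into an event map; the parsing is deliberately naive compared with Snowplow's real enrichment:

    (ns example.collect
      (:require [clojure.string :as str])
      (:import [java.net URLDecoder]))

    ;; Turn "e=pv&url=...&page=..." into {"e" "pv", "url" "...", ...}.
    (defn parse-pixel-query [qs]
      (into {}
            (for [pair (str/split qs #"&")
                  :let [[k v] (str/split pair #"=" 2)]]
              [k (URLDecoder/decode (or v "") "UTF-8")])))

    ;; (parse-pixel-query "e=pv&page=Home&url=http%3A%2F%2Fmetail.com%2F")
    ;; => {"e" "pv", "page" "Home", "url" "http://metail.com/"}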
7.
Architecture Practice – Serving Layer
• Initially queries ran over Hadoop, then Redshift came along
• Redshift SQL: good for a small data science team!
• Not so good for everyone else in the company
• Introduced Looker
– Data model in SQL
– Dashboards
– Point and click data exploration
– Permissions
– Version control
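Because Redshift speaks the PostgreSQL wire protocol, the serving layer is reachable from any JDBC client. A minimal sketch with clojure.java.jdbc; the endpoint, credentials and table are hypothetical:

    (ns example.serve
      (:require [clojure.java.jdbc :as jdbc]))

    ;; Hypothetical cluster endpoint and credentials; Redshift listens on
    ;; port 5439 and accepts the PostgreSQL JDBC driver.
    (def redshift
      {:subprotocol "postgresql"
       :subname "//example.abc123.eu-west-1.redshift.amazonaws.com:5439/analytics"
       :user "analyst"
       :password "..."})

    ;; Page views per day from a hypothetical batch-view table.
    (jdbc/query redshift
                ["SELECT collector_date, count(*) AS page_views
                    FROM events
                   WHERE event = ?
                   GROUP BY collector_date" "pv"])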
8.
Architecture Practice – Batch Layer
• Process the raw events daily to create the batch views
• Run on Elastic MapReduce (EMR), AWS’s hosted Hadoop service
• Create views of the master record through enrichment and aggregation
• Populates the schema for speedy Redshift queries (see the COPY sketch below)
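"Populating the schema" usually amounts to Redshift's COPY command pulling the batch views straight from S3. A hedged sketch, reusing the redshift spec above; the table, bucket and IAM role are hypothetical:

    ;; COPY is Redshift's bulk-load path: it reads the batch-view files
    ;; directly from S3, so no data flows through the client.
    (jdbc/execute! redshift
      ["COPY page_views_per_day
          FROM 's3://example-batch-views/page_views_per_day/'
          IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
          DELIMITER '\t' GZIP"])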
9.
Extract Transform and Load (ETL)
• Snowplow’s ETL is driven by config files and executed by a Ruby application
– Initial step executed outside of EMR
– Copy data from the Cloudfront incoming log bucket to another S3 bucket for processing (sketched below)
– Next, create the EMR cluster
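A hedged sketch of that pre-EMR staging step in Clojure, going straight at the AWS Java SDK; the bucket names are hypothetical, and Snowplow's real EmrEtlRunner does considerably more bookkeeping:

    (import '[com.amazonaws.services.s3 AmazonS3ClientBuilder])

    ;; Move each CloudFront log from the incoming bucket into a processing
    ;; bucket, so the EMR run works on a stable snapshot of the logs.
    ;; (Pagination of listings is omitted for brevity.)
    (defn stage-incoming-logs! []
      (let [s3 (AmazonS3ClientBuilder/defaultClient)]
        (doseq [summary (.getObjectSummaries
                          (.listObjects s3 "example-cloudfront-logs"))]
          (let [k (.getKey summary)]
            (.copyObject s3 "example-cloudfront-logs" k
                            "example-etl-processing" k)
            (.deleteObject s3 "example-cloudfront-logs" k)))))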
11.
Extract Transform and Load (ETL)
• To that cluster we add steps
• The initial step uses s3distcp to aggregate the log files (see the sketch after this list)
• Snowplow’s ETL is written in Scalding
– Scalding = Cascading (higher-level Java MapReduce libraries) in Scala
– They provide a compiled JAR hosted in S3
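For flavour, the aggregation step comes down to an s3distcp invocation like the one sketched below as an EMR step expressed in Clojure data; --groupBy and --targetSize are real s3distcp options that coalesce many small CloudFront logs into fewer, larger files, while the JAR path, buckets and regex are assumptions:

    ;; Hedged sketch: one EMR step, expressed as data. s3distcp copies from
    ;; S3 into HDFS while grouping small log files into ~128 MB chunks.
    (def s3distcp-step
      {:name "Aggregate CloudFront logs"
       :jar  "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar" ;; assumed path
       :args ["--src"        "s3://example-etl-processing/"
              "--dest"       "hdfs:///local/snowplow-logs/"
              "--groupBy"    ".*\\.([0-9]+-[0-9]+)\\..*\\.gz" ;; hypothetical
              "--targetSize" "128"]})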
12.
Extract Transform and Load (ETL)
• Metail’s ETL is very similar to Snowplow’s
• Use AWS’ Data Pipeline to drive the workflow
– Really great to get going
– But quickly hit complexity limitations
13.
Extract Transform and Load (ETL)
• Metail’s ETL is written in Cascalog
– Cascalog: logic programming over Hadoop
– Cascalog = Cascading + Datalog in Clojure
– Ridiculously compact and expressive
– But a steep learning curve and impenetrable errors
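To show what "ridiculously compact" means, here is a small Cascalog query in the style of our jobs, counting page views per page; the generator and field names are hypothetical:

    (ns example.etl
      (:require [cascalog.api :refer :all]
                [cascalog.logic.ops :as ops]))

    ;; `events` is assumed to be a 2-field generator of [event-type page],
    ;; e.g. a tap over the enriched events on S3.
    (defn page-view-counts [events]
      (<- [?page ?views]
          (events :> ?event ?page)
          (= ?event "pv")
          (ops/count :> ?views)))

    ;; Run it, writing text output:
    ;; (?- (hfs-textline "s3://example-batch-views/page-views/")
    ;;     (page-view-counts ...))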
14.
Extract Transform and Load (ETL)
• Soon moved to Parkour, a Clojure wrapper over the Hadoop Java API
– Access to the full Hadoop API with no abstractions, just more idiomatic Clojure
– The learning curve is mainly Hadoop’s
– Errors are still impenetrable
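We won't reproduce Parkour's job-graph API here, but the flavour of "just more idiomatic Clojure" is that a task body is an ordinary function over the input tuples rather than a subclass of a Hadoop Mapper; a hypothetical map-side sketch:

    (require '[clojure.string :as str])

    ;; Hypothetical map-side logic: a plain Clojure function from input lines
    ;; to key/value pairs, with no Hadoop interfaces in sight. Parkour wires
    ;; such functions (referenced by var) into the real MapReduce job.
    (defn page-views-m [lines]
      (->> lines
           (map #(str/split % #"\t"))
           (keep (fn [[event page]]
                   (when (= event "pv") [page 1])))))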
15.
Summary
• This pipeline has been built and managed by 3-5 people
• It’s about a year and a half old and continues to evolve
• Composed of a few different technologies, with EMR doing the batch processing
• Using EMR has made cluster management and scaling straightforward
• The synergy between EMR and S3 is a powerful feature
– Encourages immutable infrastructure
– You don’t need your compute cluster running to hold your data!