Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Putting the Spark into Functional Fashion Tech Analystics

5 visualizaciones

Publicado el

This talk explains the functional origins of Metail's data analytics pipeline, explains what Apache Spark is and how we are using it.

Publicado en: Datos y análisis
  • Sé el primero en comentar

  • Sé el primero en recomendar esto

Putting the Spark into Functional Fashion Tech Analystics

  1. 1. @metail#VoxxedBristol Putting the Spark into Functional Fashion Tech Analytics Gareth Rogers Metail
  2. 2. 2 ● Who are Metail and what do we do ● Our data pipeline ● What is Apache Spark and our experiences ● Live coding demonstration of Spark? Introduction
  3. 3. 3 Metail Making clothing fit for all
  4. 4. 4 ● Create your MeModel with just a few clicks ● See how the clothes look on you ● Primarily clickstream analysis ● Understanding the user journey The Metail Experience
  5. 5. 5 Composed Photography With Metail Shoot model once Choose poses Style & restyle Compositing Shoot clothes at source ● Understanding process flow ● Discovering bottlenecks ● Optimising the workflow ● Understanding costs
  6. 6. 6 • Batch pipeline modelled on a lambda architecture • Transformed by Clojure, Spark in Clojure or SQL • Managed by applications written in Clojure • Built on and using Amazon Web Services (AWS) Metail’s Data Pipeline
  7. 7. 7 Metail’s Data Pipeline
  8. 8. 8 Metail’s Data Pipeline To analytics dashboards
  9. 9. 9 ● Our pipeline is heavily influenced by functional programmings paradigms ● Declarative ● Immutable data structures ● Pure functions -- effects only dependent on input state ● Minimise side effects Functional Pipeline
  10. 10. 10 • Clojure is predominately a functional programming language • Aims to be approachable • Allow interactive development • A lisp programming language • Everything is dynamically typed Clojure
  11. 11. 11 • Spark is a general purpose distributed data processing engine – Can scale from processing a single line to terabytes of data • Functional paradigm – Declarative – Functions are first class – Immutable datasets • Written in Scala a JVM based language – Just like Clojure – Has a Java API – Clojure has great Java interop Apache Spark
  12. 12. 12 • Operates on a Directed Acyclic Graph (DAG) – Link together data processing steps – May be dependent on multiple parent steps – Transforms cannot return to an older partition Apache Spark
  13. 13. 13 Apache Spark
  14. 14. 14 • Consists of a driver and multiple workers • Declare a Spark session and build a job graph • Driver coordinates with a cluster to execute the graph on the workers Apache Spark
  15. 15. 15 • Operates on in-memory datasets – where possible, it will spill to disk • Based on Resilient Distributed Datasets (RDDs) • Each RDD represents on dataset • Split into multiple partitions distributed through the cluster Apache Spark
  16. 16. 16 Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Cluster: YARN/Meso/Kubernetes
  17. 17. 17 • Using the Sparkling Clojure library – This wraps the Spark Java API – Does the heavy lifting of Clojure data structure serialisation • Mostly using the core API • RDDs holding rows of Clojure maps • Runs on AWS Elastic MapReduce – Tune your cluster • Uses DynamoDB to for inter-batch state Metail and Spark
  18. 18. 18 • Metail is making clothing fit for all • We’re incorporating metrics derived from our collected data • We have several pipelines collecting, transforming and visualising our data • When dealing with datasets functional programming offers many advantages • Give them a go! – Summary
  19. 19. 19 • Learning Clojure: • A random web based Clojure REPL: • Bases of Metail Experience pipeline • Dashboarding and SQL warehouse management: • Tuning you cluster – Required on EMR Resources
  20. 20. 20 Questions?