Data Science at Scale


1. Data Science at Scale
   An introduction to distributed thinking
   Conor Murphy
   "Since the world drives to a delirious state of things, we must drive to a delirious point of view." - Jean Baudrillard
2. About me
   ● BA, MA in Philosophy
   ● 4 years with data in non-profits
   ● galvanizeU faculty
   ● Emerging from a 6-month deep dive into data engineering
3. Our Roadmap
   ● Motivation
   ● Data science vs data engineering
   ● Data engineering crash course
   ● One application: Spark
   ● One paradigm: probabilistic algorithms
4. Motivation for Data Science at Scale
5. What is "big data"?
   ● A buzzword
   ● More data than can fit in memory on one machine
   ● The three (or four) V's
     ○ Volume
     ○ Variety
     ○ Velocity
     ○ (Veracity)
6. Hardware changes
   ● Laptops, bigger laptops, clusters
   ● Cloud computing
   ● Memory vs. disk
   ● Commodity hardware
7. Volume - the scale of data
   ● 40 zettabytes (43 trillion GB) will be collected by 2020
   ● 294B emails/day
   ● 170k tweets/min
   ● 48 hrs of video/min uploaded to YouTube
   ● 5B mobile phones
   ● The web
   Source: Deloitte, IBM
8. Variety - different types of data
   ● 100s of millions of health monitors
   ● Internet of things
9. Velocity - the speed of data
   ● NYSE captures 1 TB of data/session
   ● Streams (and firehoses) of data
   ● Fraud detection, high-frequency trading
10. (Veracity) - how reliable is our data?
   ● 1/3 of business leaders don't trust the information they use to make decisions
   ● Poor data quality costs the US economy $3.1T/yr
   ● User-generated data, data lakes, and dark data
11. In brief, we're on an exponential curve
   ● The four V's are in the realm of Moore's law
   ● Leveraging this could yield exponential impact
   ● Thinking in a distributed manner lets us use this resource and adapt more readily to new technologies
12. Data Science vs Data Engineering
13. A motivating example: the Netflix Prize
   ● Data scientists optimize for accuracy (think Kaggle)
   ● We often can't use the best-performing models
   ● Data engineers optimize for ???
14. Paradigms
   ● Data scientists - probabilistic, statistical, experimental, precision
   ● Data engineers - distributed, scale, resilience, latency, security, ...
15. Other helpful distinctions
   ● Static vs streaming datasets
   ● Choice of programming languages
     ○ Python/R vs. Scala/Java/C
   ● Turning experiments into institutional knowledge (with the headaches of bad code that go with it)
16. Data science is shifting towards engineering
   ● Minimize developer time
   ● Choose tools that are scalable
   ● More software development best practices are required in DS
   Source: Domino Labs
17. Data Engineering Crash Course
18. Source: Cross Industry Standard Process for Data Mining (CRISP-DM) and Nathan Marz's Big Data
19. More data vs tuning
   ● More data generally outperforms more complex models
   "Note that the curves appear to be log-linear even out to one billion words."
   Source: Banko and Brill
20. Scaling Computation and Storage
   Source: Databricks (note: older test done on disk) and DataStax
21. One (exciting) option: the SMACK stack
   ● Spark/Scala
   ● Mesos
   ● Akka
   ● Cassandra
   ● Kafka
22. Enter Spark
23. What is Spark?
   ● A fast and general engine for large-scale data processing
     ○ Can theoretically scale indefinitely
     ○ In-memory processing
     ○ Streaming
     ○ Machine learning
     ○ Deep learning?
   ● A DAG scheduler
   ● In memory
   ● Catalyst Optimizer / Project Tungsten
   Source: Databricks
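Spark's core model, in which transformations only build up a plan (a DAG of steps) and nothing executes until an action forces it, can be sketched in plain Python. This is a toy illustration of lazy evaluation; the class and method names are invented, not the real Spark API:

```python
# Toy sketch of Spark-style lazy evaluation (NOT the real Spark API):
# transformations record steps in a plan; an action replays the plan.

class ToyRDD:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []  # deferred transformation steps

    def map(self, fn):
        # Lazy: append a step to the plan, compute nothing yet
        return ToyRDD(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded plan over the data
        result = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark, the DAG scheduler uses this deferred plan to pipeline steps, skip unnecessary work, and recompute lost partitions, which is where the resilience and scale come from.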
24. Spark as an Ecosystem
   ● There are distinct advantages to approaching Spark as one toolkit for an array of tasks, including:
     ○ OLTP and OLAP
     ○ ETL (Extract, Transform, Load)
     ○ Machine learning
     ○ Stream processing
25. Probabilistic Algorithms
26. A motivating example: inexact databases
   Source: BlinkDB (no longer developed)
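The idea behind approximate-query engines like BlinkDB can be sketched with plain random sampling plus an error bound: answer on a small sample and report a confidence interval instead of scanning everything. This is a toy illustration of the principle, not BlinkDB's actual implementation; the data and sizes are invented:

```python
import math
import random
import statistics

random.seed(42)
population = [random.gauss(100, 15) for _ in range(100_000)]  # the full "table"

sample = random.sample(population, 5_000)                     # small random sample
est = statistics.mean(sample)
# Standard error of the mean gives a ~95% confidence interval
se = statistics.stdev(sample) / math.sqrt(len(sample))
lo, hi = est - 1.96 * se, est + 1.96 * se

true_mean = statistics.mean(population)
print(f"estimate: {est:.2f} in [{lo:.2f}, {hi:.2f}] (true mean {true_mean:.2f})")
```

The query touches 5% of the data yet returns an answer with a quantified error bar, which is the trade BlinkDB offered: bounded error in exchange for bounded latency.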
27. Probabilistic Algorithms
   ● Algorithms that have some degree of randomness (often via hash functions) in their logic
   ● An alternative to sampling
     ○ Has bounded errors
     ○ Uses less space
   ● Use when exact algorithms take too long or won't complete
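The simplest structure fitting this definition is a Bloom filter: set membership via hash functions, with no false negatives and a tunable false-positive rate, using far less space than storing the items. A minimal sketch, with sizes chosen arbitrarily for illustration:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["spark", "kafka", "cassandra"]:
    bf.add(word)
print(bf.might_contain("spark"))   # True (added items are never missed)
print(bf.might_contain("hadoop"))  # almost certainly False
```

Note the bounded, one-sided error: an item that was added always answers True, while an absent item answers True only if all of its bit positions happen to collide with added items.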
28. Examples
   ● Bloom filter
   ● HyperLogLog
   ● Locality-sensitive hashing
   ● In the tools we've seen:
     ○ HashingTF (Spark)
     ○ Database optimization (Cassandra)
29. Final Remarks
30. Summary
   ● Think of this field in terms of thinking and mindsets
     ○ e.g. distributed thinking, probabilistic reasoning
   ● Never chase technologies
   ● Use Spark as an ecosystem
   ● Remember the principles of big data architectures
31. A closing note on ethics
   ● These are impactful tools
   ● The same tools that power precision medicine can also back predatory lending
   ● Be modest; be altruistic
32. Thanks! Now let's grab coffee and geek out
   Twitter: @conorbmurphy
   LinkedIn: linkedin.com/in/conorbmurphy
   Slideshare: slideshare.net/ConorBMurphy
   Email: conor.murphy@galvanize.com
