Data science a practitioner's perspective


  1. 1. Data Science A practitioner’s perspective Amir Ziai @amirziai
  2. 2. Who am I? ● Data Scientist at ZEFR, ad tech, LA ● Previously worked in healthcare, SaaS, and finance
  3. 3. Agenda ● Data Science ● My perspective ○ Problems ○ Pitfalls ○ Minimum skills ○ How to build your skills ● Resources
  4. 4. Data Science, a short history ● 1960, Peter Naur used it as a substitute for computer science ● 1977, Jeff Wu gave the “Statistics = Data Science?” lecture ● 2008, DJ Patil and Jeff Hammerbacher used “data scientist” to describe their job ● 2011, McKinsey, shortage of 140k analysts and 1.5M managers by 2018 ● 2015, Data Scientists don’t scale ● 2016, Why You’re Not Getting Value from Your Data Science https://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/
  5. 5. Data Science, growth
  6. 6. Data Science, hyped? http://www.kdnuggets.com/wp-content/uploads/gartner-2014-hype-cycle.jpeg
  7. 7. Data Science, too broad ● BI Analyst/Engineer ● Analytics Engineer ● Data Engineer ● Statistician ● Research Scientist ● Machine Learning Engineer ● AI Engineer ● Solutions Specialist (with analytical background) ● Software Architect ● Financial Modeler ● Actuary ● ...
  8. 8. Data Science, definition “Data Scientist is a Data Analyst who lives in California” “Data Scientist is statistics on a Mac” “...someone who is better at statistics than any software engineer and better at software engineering than any statistician”
  9. 9. Data Science, the many Venn diagrams
  10. 10. Data Science, process ● Data wrangling (get data from any source, reshape, scale up if needed) ● Problem formulation and modeling (ML, DL, AI) ● Communicate the findings (visualization, UI/UX) ● Productize (SWE, Data Engineering, DevOps) In the context of: ● Benefit (business value) ● Cost (development, infrastructure, and architecture)
  11. 11. My perspective, what does ZEFR do? ● Ingesting hundreds of millions of videos per day ● Help brands show relevant ads ● Identify content for monetization ● Data science ○ Optimize advertising campaigns ○ Forecast inventory ○ Process text, image, audio, and video ○ Petabyte scale
  12. 12. My perspective, scale and automation Requirements ● Billions of examples and millions of features to train the models with ● Scoring on a similar scale of data ● Models to be re-trained in near real time Implications ● Have to use cloud computing and distributed systems ● Small deltas in quality and algorithm efficiency are magnified into massive cost or benefit deltas ● Solid software engineering and automation are key
  13. 13. My perspective, example Task ● Train a better forecasting model (vs. a benchmark statistic) ● Hundreds of terabytes of historical data available Process ● Wrangling: pre-process and featurize (Spark, S3, Redshift) ● Modeling: VW, H2O, hyper-parameter optimization ● Communication: justify the cost of a 100-node EMR cluster ($1,000 per day) ● Productize: test, deploy, and automate with Jenkins, ECS, and Kafka
  14. 14. My perspective, the grind Weeks of tuning the infrastructure, finding the right features, reasoning through algorithm complexity
  15. 15. My perspective, pitfalls ● Unreasonable expectations ○ Hype, just hire a few PhDs ○ Is data science too easy? ● Throwing it over the fence* ○ Data science builds models in R/Python, engineering implements them in Java, C, Scala ● Dismissing the importance of good software engineering practices ○ Use tests, understand algorithmic complexity, do code reviews, make experiments reproducible ● Dismissing the importance of understanding and formulating the problem ○ Get out and talk to people ● Dismissing or not understanding architecture, infrastructure, and cost/benefit * Full disclosure: the article was written by my boss Jonathan Morra at ZEFR
  16. 16. My perspective, data science platforms ● Many companies have recognized the problem with the disconnect between data science and engineering ● Facebook and Uber have in-house platforms ● A number of commercial solutions: Sense, Domino Data Labs, DataScience, Data Robot, Yhat, just to name a few ● Very expensive and inflexible in our case https://blog.dominodatalab.com/uber-and-the-need-for-a-data-science-platform/ https://medium.com/@novakkm/the-purpose-of-platforms-in-data-science-965e2124edf8#.vwlz3idyw https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
  17. 17. My perspective, minimum data science requirements - Statically-typed language (C, Java, Scala) - Dynamically-typed language (Python, R) - SQL (lag, partition, joins, rank, nested subqueries) - NoSQL (JSON, MongoDB, Couch) - Data wrangling (Pandas, dplyr, Julia, PySpark, Dask) - Command-line fu - Cloud computing (spin up instances, S3, ssh) and environment isolation - Software engineering best practices (testing, version control, complexity) - ML theory (bias/variance, complexity, encoding, hashing, feature engineering) - ML practice (sklearn, R, Julia, MLLib, H2O, TensorFlow) - Basic stats (experiment design, hypothesis testing, moments)
  18. 18. My perspective, how to build your skills ● Take courses in areas of weakness (Udacity, Coursera) ● Showcase your skills with projects on GitHub ● Write a blog about things you’re good at to refine your understanding ● Do Kaggle competitions ● Contribute to StackOverflow and/or CrossValidated ● Contribute to open source projects (sklearn, tensorflow, dask, spark)
  19. 19. Resources Newsletters, blogs and people to follow Data Elixir, Data Science Weekly, The Morning Paper, Intuition Machine, The Wild Week in AI, MLConf, Talking Machines, Partially Derivative, Brandon Rohr, Julian Evans, Chris Fregly, Bryan Smith, Stitch Fix, Unofficial Google Data Science Blog, Variance Explained, Wes McKinney, Peter Norvig’s iPython notebooks, Frank Chen of a16z, Fast Forward Labs, Chris Olah, Andrej Karpathy, Open AI, Indico, John Cook, ...
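
Slide 17 lists SQL window operations (lag, partition, rank) among the minimum requirements. The sketch below shows what those look like in practice, using Python's built-in `sqlite3` module. The `daily_views` table, its columns, and the sample rows are invented for illustration and do not come from the talk. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical sample data: daily view counts per channel (not from the talk).
ROWS = [
    ("a", "2016-01-01", 100),
    ("a", "2016-01-02", 150),
    ("a", "2016-01-03", 120),
    ("b", "2016-01-01", 300),
    ("b", "2016-01-02", 280),
]

QUERY = """
SELECT
    channel,
    day,
    views,
    -- LAG: change versus the previous day in the same channel (NULL on the first day)
    views - LAG(views) OVER (PARTITION BY channel ORDER BY day) AS delta,
    -- RANK: 1 marks the channel's best day by views
    RANK() OVER (PARTITION BY channel ORDER BY views DESC) AS rank_in_channel
FROM daily_views
ORDER BY channel, day
"""


def run_window_query(rows):
    """Load rows into an in-memory table and return the windowed result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE daily_views (channel TEXT, day TEXT, views INTEGER)"
        )
        conn.executemany("INSERT INTO daily_views VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_window_query(ROWS):
        print(row)
```

`PARTITION BY channel` restarts both the lag and the rank for each channel, which is the part that is awkward to express with plain `GROUP BY`.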
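
Slide 12 mentions models with millions of features, and slide 17 lists hashing under ML theory. One common reading of that item is the hashing trick, which maps arbitrary feature names into a fixed-length vector so memory stays bounded no matter how large the vocabulary grows. The sketch below assumes that reading; the function name `hash_features`, the bucket count, and the sample tokens are hypothetical and not taken from the talk.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map string tokens to a fixed-length signed count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        # md5 is used only as a stable, well-mixed hash, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # First 4 bytes choose the bucket; the fifth byte chooses the sign.
        index = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1 if digest[4] % 2 == 0 else -1
        vector[index] += sign
    return vector


if __name__ == "__main__":
    # Hypothetical tokens from a video title.
    print(hash_features(["cat", "video", "funny", "cat"], n_buckets=8))
```

The signed update is a common variant: when two different tokens collide in one bucket, their contributions tend to cancel rather than always add up. Production libraries such as scikit-learn's `FeatureHasher` follow the same idea with a faster hash.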
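
Slide 13 quotes $1,000 per day for a 100-node EMR cluster, and slide 14 notes that tuning takes weeks. A small sketch of the arithmetic behind that cost conversation follows. Only the $1,000 and 100-node figures come from the slides; the function names and the one-week duration are assumptions for illustration.

```python
def node_hour_rate(daily_cost, nodes, hours_per_day=24):
    """Cost of one node for one hour, derived from a daily cluster bill."""
    return daily_cost / (nodes * hours_per_day)


def run_cost(rate, nodes, hours):
    """Total cost of running `nodes` machines for `hours` at `rate` per node-hour."""
    return rate * nodes * hours


if __name__ == "__main__":
    rate = node_hour_rate(1000, 100)          # figures from slide 13
    one_week = run_cost(rate, 100, 24 * 7)    # hypothetical one-week tuning run
    print(f"per node-hour: ${rate:.2f}")
    print(f"one week on 100 nodes: ${one_week:,.0f}")
```

At roughly $0.42 per node-hour, a single week of experimentation on the full cluster costs about $7,000, which is why small efficiency gains matter at this scale.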
