Academia to Data Science - A Hitchhiker's Guide

3,364 views

Published on

Recently I gave a talk at UC Berkeley regarding the transition from academia to industry in the context of Machine Learning and Data Science related roles. I based most of my slides on my own transition from being an Astrophysicist to a Machine Learning Expert. I hope this will be useful to many. Feedback is welcome!

Published in: Career

  1. A Hitchhiker’s Guide to Data Science. Sudeep Das, Senior Machine Learning Researcher, @datamusing
  2. My Journey
  3. Ph.D. in Astrophysics: Cosmic Microwave Background, Gravitational Lensing
  4. Beats Music: Core Recommendation Systems Group
  5. What do I do?
  6. The Grand Innovation Workflow
      Stages: Identify Problem, Idea Generation, Prepare Data, Build Models, Implement in Production, Test Hypotheses
      ● Understand what is important to the business
      ● Deep data dives
      ● Visualizations
      ● Communicate to stakeholders
      ● Sometimes top down, sometimes ground up
      ● Slice/dice/massage data
      ● Work with data teams to ensure data integrity
      ● Make sure data tables/feeds that you need are stood up
      ● Offline/online data integrity
      ● Prototype features
      ● Modeling extremes: from out-of-the-box Logistic Regression and GBMs to adapting an emergent idea from a recent paper!
      ● Set up offline training pipeline
      ● Monitor offline metrics
      ● Design the experiment/hypothesis/cell structure
      ● Integrate your models with the production systems (code review, load testing)
      ● Hook up with the testing platform
      ● Read results of experiments to determine significance
      ● Slice and dice the online data to determine if your test affected the intended audience
      ● If results are flat, rinse and repeat!
  7. Workflow steps repeated from slide 6, with the caption: In some companies, this is a data scientist
  8. Workflow steps repeated from slide 6, with the caption: In some other companies, this is a data scientist
  9. Workflow steps repeated from slide 6, with the caption: Yet in some other companies, this is a data scientist
  10. Workflow steps repeated from slide 6, with the caption: At Netflix, this is broadly what I do
  11. Tools of the trade
  12. Workflow steps repeated from slide 6, with the tools: SQL, Spark (Scala), PySpark, Python/pandas, Hive, AWS S3
  13. Workflow steps repeated from slide 6, with the tools: Matplotlib, Tableau, Vega, Plotly, custom JavaScript (D3)
  14. Workflow steps repeated from slide 6, with the tools: Hive, S3, APIs in Flask/Django/Java
  15. Workflow steps repeated from slide 6, with the tools: Python, scikit-learn, Jupyter notebooks, TensorFlow/Keras, XGBoost, SparkML/Scala, Zeppelin ...
  16. Workflow steps repeated from slide 6, with the tools: Docker, company-specific platforms
  17. Workflow steps repeated from slide 6, with the tools: Java, Scala, in some cases Python, company-specific
  18. Types of Problems
  19. At Netflix, we do a bit of everything:
      ● Personalization
      ● Search
      ● Object recognition
      ● Voice/speech recognition
      ● Pattern recognition
      ● Natural Language Processing
      ● Trend prediction
      ● Segmentation/clustering
      ● Dynamic Pricing
      ● Optimization
      ● Outlier Detection
  20. Emergent Trends
  21. Emergent trends:
      ● Probabilistic Graphical Models - Bayes Nets
      ● Deep Learning
      ● Causal Inference
      ● (Deep) Reinforcement Learning
  22. What academia prepares you for
  23. What academia prepares you for:
      ● Perseverance
      ● Ability to pick up new technical skills
      ● Presentation skills
      ● Some quantitative visualization skills
      ● Ability to distil technical research in related areas and adapt it to the problem at hand
      ● If you are from a quantitative and experimental field:
          ○ Mathematical abilities
          ○ Knowledge of basic statistics - error analysis, experiment design
          ○ Some exposure to parameter estimation and Bayesian inference
          ○ Some ability to write code
          ○ Some exposure to general machine learning
      ● Learning from failure: most A/B tests fail - so do experiments in academia
      ● Writing papers, technical blogs, etc.
  24. What academia doesn’t prepare you for
  25. What academia doesn’t prepare you for:
      ● Being a good listener
      ● Asking questions
      ● Understanding and articulating the business value of your technical pursuit
      ● Writing clean, maintainable code with documentation and unit tests
      ● Ability to collaborate across teams and cultures - cross-functionally
      ● Admitting that “Good enough” is better than perfect
      ● Coping with quick project timelines
      ● Documenting, sharing, getting early input on projects
      ● Dealing with live, large, and exceptionally dirty datasets
      ● Understanding that research in industry is results-driven, not publication-driven
      ● Stepping out of your focus area and seeing your problem in the bigger context of where your company is headed
  26. Marketing Yourself
  27. Marketing yourself:
      ● Fill in your basic skills gaps
          ○ Databases, SQL, Spark familiarity
          ○ Data Structures, Algo/CS 101
          ○ Get really strong in one language - highly recommend Python - pandas, scikit ecosystem
          ○ Good coding practices - documentation, modular code, unit tests
      ● Amp up your ML knowledge
          ○ Your friends: online courses and open datasets!
          ○ Do mini projects on ML, esp. Deep Learning, Reinforcement Learning. Get creative!
          ○ Get a rock-solid foundation in basic stats
          ○ Kaggle competitions
      ● Create an online presence
          ○ GitHub repo so recruiters can look at your code
          ○ Put your hobby projects online
          ○ Write a blog post on something new you learned
          ○ Follow/contribute to Stack Overflow
      ● Improve soft skills
          ○ Identify weaknesses in communication skills and work on them
          ○ Pick up speaking engagements at meetups, at your university, and at conferences such as PyData
          ○ Do collaborative projects with people who are also transitioning
      ● Interview prep
          ○ Practise whiteboarding, collaborative coding on CoderPad
          ○ Standard books like Cracking the Coding Interview; Glassdoor
          ○ Go for some “dry run” interviews
          ○ Do background research on the company - be inquisitive, ask questions
          ○ Keep at it!
      ● Landing the First Job!
  28. @datamusing
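
The workflow on slide 6 includes the step "Read results of experiments to determine significance", and slide 23 notes that most A/B tests fail. The slides do not say which statistical test is used, so the following is only a minimal sketch of one common choice, a two-sided two-proportion z-test on conversion counts from two test cells. The function name `two_proportion_z_test` and the counts are hypothetical, invented for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between two cells.

    Assumption: this test is an illustrative choice, not the method
    described in the slides.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both cells convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control cell A versus treatment cell B.
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With a conventional 0.05 threshold, a p-value above it would be read as a flat result, which in the words of slide 6 means "rinse and repeat".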
