How to Prepare for a Career in Data Science


  1. 1. How to Prepare for a Career in Data Science Juuso Parkkinen, PhD - @ouzor Head of Data Science, Nightingale Health - @NgaleHealth Aalto University, November 25, 2019
  2. 2. Outline 1. My Career as a Data Scientist 2. Data Science Workflow 3. Data Science and Business
  3. 3. My Career as a Data Scientist
  4. 4. The Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  5. 5. My career steps: MSc in bioinformation technology from HUT / Aalto; PhD in bioinformatics and machine learning from Aalto; Data Scientist (consultant) at Reaktor; Data Scientist at Nightingale Health
  6. 6. Data Science research: probabilistic models for biomedical problems. More on my research and other projects: https://ouzor.github.io/projects.html
  7. 7. Data Science as a hobby: open tools for open data. Blogging, open source programming, Open Knowledge community. Blogs: https://louhos.github.io/, https://ouzor.github.io/
  8. 8. Open data science example: Biking activity in Helsinki. How do various factors affect biking activity in Helsinki? Data sources: - Automatic bike activity counters from multiple sites - Weather data from FMI. Bike activity was modelled with a Negative Binomial distribution using R (mgcv::gam). Done with Janne Sinkkonen and Antti Poikola. Data, code & results: https://github.com/apoikola/fillarilaskennat
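
A minimal sketch of why a Negative Binomial model can suit count data like bike counter readings, assuming the counts are overdispersed (variance well above the mean). The counts below are made up for illustration and are not from the project; the actual analysis was done in R with mgcv::gam, not with this code. The variance form Var = mu + mu^2 / theta is the one used by the mgcv `nb` family.

```python
from statistics import mean, variance

# Made-up hourly bike counts, for illustration only (not project data).
counts = [12, 3, 45, 7, 0, 88, 15, 2, 61, 9]

m = mean(counts)
v = variance(counts)  # sample variance

# A Poisson model implies variance == mean, so a ratio far above 1
# signals overdispersion that a Poisson model cannot capture.
dispersion_ratio = v / m

# Method-of-moments estimate of theta from Var = mu + mu^2 / theta.
theta = m ** 2 / (v - m) if v > m else float("inf")

print(f"mean={m:.1f} variance={v:.1f} ratio={dispersion_ratio:.1f} theta={theta:.2f}")
```

A small theta means strong overdispersion; as theta grows, the Negative Binomial approaches the Poisson.
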
  9. 9. Open Data Science at Reaktor: Apartment price modelling Kannattaakokauppa.fi by Reaktor: http://kannattaakokauppa.fi More about the model: https://ouzor.github.io/blog/2016/03/08/apartment-price-model.html
  10. 10. Data Science Workflow
  11. 11. Data Science in a vacuum: Typically starts with a clean data set and a clear (modelling) task. Example: Weather data in CSV, and a goal to predict humidity. What might be different in the real world?
  12. 12. Data Science Workflow in the Real World 1. Identifying and defining the problem 2. Accessing data 3. Preprocessing and cleaning the data 4. Exploratory data analysis and visualisation 5. Statistical modelling or machine learning 6. End result. Note the difference between academic interests and practical relevance! (Diagram label: ITERATION)
  13. 13. Identifying and defining the problem: Learn to be critical and ask good questions! • Why is this problem important? • How does solving this improve our user experience? • How does solving this improve our business? • Is the problem really something we should solve, or is it something where we happen to have data or methods available? • Do we even need to solve this problem!? Only after the problem is identified can you start thinking about data science: - Do we have relevant data to support solving the problem? - Can we use modelling to solve the problem (e.g. prediction or classification)?
  14. 14. Accessing data: Data exists in a variety of sources and formats. A data scientist might need to access data from any of these in a reasonable time. Typical data sources: files, APIs, databases, web scraping. Typical data formats: - CSV, TSV, Excel - JSON (XML less nowadays) - Lots of strange structure in text files. Domain-specific formats: - Relational data (networks) - Spatial data - Gene expression, genomic data
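
As a rough illustration of handling two of the formats listed above, the sketch below parses the same records from CSV and from JSON into one common shape using only the Python standard library. The inline strings stand in for a file and an API response, and the temperature values and the helper names `read_csv_records` and `read_json_records` are made up for this example.

```python
import csv
import io
import json

# Inline stand-ins for a CSV file and a JSON API response (made-up values).
csv_text = "station,temperature\nKaisaniemi,4.2\nKumpula,3.9\n"
json_text = (
    '[{"station": "Kaisaniemi", "temperature": 4.2},'
    ' {"station": "Kumpula", "temperature": 3.9}]'
)

def read_csv_records(text):
    """Parse CSV text into a list of dicts; CSV values arrive as strings."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"station": r["station"], "temperature": float(r["temperature"])}
        for r in rows
    ]

def read_json_records(text):
    """Parse a JSON array of objects into the same list-of-dicts shape."""
    return [
        {"station": r["station"], "temperature": float(r["temperature"])}
        for r in json.loads(text)
    ]

csv_records = read_csv_records(csv_text)
json_records = read_json_records(json_text)
print(csv_records == json_records)
```

Converting every source into one record shape early keeps the later steps independent of where the data came from.
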
  15. 15. Example: Weather data from WFS API http://opendata.fmi.fi/wfs?service=WFS&version=2.0.0&request=getFeature&storedquery_id=fmi::forecast::hirlam::surface::point::multipointcoverage&place=helsinki&
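
The query URL on this slide can be assembled from its parts rather than typed by hand. The sketch below only builds and parses the URL and makes no network request; whether the endpoint still serves this stored query is not verified here. The parameters are copied from the slide, and the trailing "&" shown there is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://opendata.fmi.fi/wfs"

# Query parameters exactly as they appear in the slide's URL.
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "getFeature",
    "storedquery_id": "fmi::forecast::hirlam::surface::point::multipointcoverage",
    "place": "helsinki",
}

# safe=":" keeps the "::" separators in the stored query id unescaped.
url = f"{BASE_URL}?{urlencode(params, safe=':')}"
print(url)

# Round trip: parsing the URL gives back the same parameters.
parsed = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
```

Keeping the parameters in a dict makes it easy to change `place` or the stored query without editing a long string.
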
  16. 16. Be very careful with Excel data formatting!
  17. 17. Preprocessing and cleaning (“wrangling” / “munging”): “Tidy datasets are all alike, but every messy dataset is messy in its own way.” (Hadley Wickham) Having data in a tidy format makes data analysis, visualisation and modelling easier. Data frames in R and Python. Read more about tidy data: https://r4ds.had.co.nz/tidy-data.html
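
To make the tidy format concrete, here is a small sketch that reshapes a "wide" table (one column per month) into tidy records with one row per observation. The data and the helper `to_tidy` are made up for illustration; in practice a data frame library would do this, for example `DataFrame.melt` in pandas or `pivot_longer` in tidyr.

```python
# Made-up wide table: one row per station, one column per month.
wide = [
    {"station": "A", "jan": 120, "feb": 95},
    {"station": "B", "jan": 80, "feb": 70},
]

def to_tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows into one record per (id, variable) observation."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                tidy_rows.append({id_col: row[id_col], var_name: key, value_name: value})
    return tidy_rows

tidy = to_tidy(wide, id_col="station", var_name="month", value_name="count")
for record in tidy:
    print(record)
```

In the tidy version, month is a variable with its own column, so filtering, grouping, and plotting by month need no special handling.
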
  18. 18. Exploratory Data Analysis and Visualisation: The goal of Exploratory Data Analysis is to get to know your data, using visual summaries and computing descriptive statistics. Includes identifying missing data, outliers and other possible problems with the data. This informs preprocessing and cleaning, and typically needs a couple of iterations before the data is ready for analysis. You should also contact domain experts and confirm that the data looks as it should. It’s hard to define when the data is really “clean”. You will develop an instinct for this over time.
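
A first pass at the checks described above might look like the sketch below: count missing values, compute descriptive statistics, and flag outliers. The humidity readings are made up, and the 1.5 × IQR rule is just one common convention for flagging outliers, not something prescribed in the talk.

```python
from statistics import mean, median, quantiles

# Made-up humidity readings (%); None marks a missing value.
humidity = [71, 68, None, 74, 70, 69, 5, 72, None, 73]

observed = [x for x in humidity if x is not None]
n_missing = len(humidity) - len(observed)

# Quartiles and the conventional 1.5 * IQR fences for flagging outliers.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in observed if x < low or x > high]

print(f"missing={n_missing} mean={mean(observed):.1f} median={median(observed)}")
print(f"values outside [{low:.1f}, {high:.1f}]: {outliers}")
```

A flagged value is a question rather than a verdict: a reading of 5 could be a sensor fault or a data entry error, which is where a domain expert helps.
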
  19. 19. Statistical modelling and machine learning: Modelling is one way to reach a goal in data analysis, not a goal in itself. Pick a suitable method based on your goals - not the other way around! Start with simple methods, add complexity gradually, if needed. You can get pretty far with linear or logistic regression.
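
One way to read "start with simple methods" is to compare every model against a trivial baseline first. The sketch below fits a one-variable least-squares line by hand and compares it with always predicting the mean. The data and the helper names `fit_line` and `mse` are made up for illustration.

```python
from statistics import mean

# Made-up predictor (x) and response (y) values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

intercept, slope = fit_line(x, y)
baseline_mse = mse(y, [mean(y)] * len(y))            # always predict the mean
linear_mse = mse(y, [intercept + slope * a for a in x])

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"baseline MSE={baseline_mse:.3f} linear MSE={linear_mse:.3f}")
```

If a more complex model cannot clearly beat the linear fit, the added complexity is probably not paying for itself.
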
  20. 20. End result: The end result of a data science project can be many things, such as - A single figure describing the association of two variables - A comprehensive report for a client or business department - A machine learning product ready to be deployed into production. In most projects, it is important to write some kind of report or documentation of what has been done. Learning to communicate effectively is a very important skill for data scientists. This includes producing clear visual summaries of the main results, and using generally understandable language.
  21. 21. Deploying Data Products: Data science is useful in creating insights, increasing understanding, and informing decision making. The biggest impact, however, comes from intelligent systems that operate automatically and continuously, such as recommendation engines. This typically means that data science products are deployed as part of larger software systems. Deploying your first data products can be frightening for data scientists with no programming background. Get support from software developers or data engineers!
  22. 22. Data Science Tools – Some tips: Make everything reproducible and use version control! Tidyverse is a family of R packages that cover most of the data science workflow. Many similar tools exist for Python! Tidyverse: https://www.tidyverse.org/ R for Data Science: https://r4ds.had.co.nz/
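
One small piece of reproducibility is making random steps repeatable. The sketch below, using a hypothetical helper `train_test_split` written for this example, splits a data set with a seeded generator so that rerunning the script gives the same split. It covers only this one aspect; version control and pinned package versions are separate concerns.

```python
import random

def train_test_split(items, test_fraction, seed):
    """Shuffle a copy of items with a seeded generator, then split it."""
    rng = random.Random(seed)  # local generator, avoids hidden global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_a, test_a = train_test_split(data, test_fraction=0.2, seed=42)
train_b, test_b = train_test_split(data, test_fraction=0.2, seed=42)

# The same seed gives the same split on every call.
print(train_a == train_b and test_a == test_b)
```

Passing the seed explicitly, and recording it alongside the results, makes it possible to rerun an analysis and get the same numbers.
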
  23. 23. How to learn the Data Science Workflow? Data Science is an art – you only learn it by doing! • Pick challenging courses with large and realistic projects • Start a hobby project, for example using some open data set, and share the code and results (e.g. GitHub) • Participate in competitions and challenges • Tidytuesdays: https://github.com/rfordatascience/tidytuesday • Kaggle: https://www.kaggle.com/ Learning a proper Data Science Workflow will help you produce reliable results in a reasonable time. This will benefit your career regardless of whether you work in academia, industry, or somewhere else.
  24. 24. Data Science and Business
  25. 25. Agile Data Science: Any sufficiently interesting problem has more than one “correct” answer. You can spend anything between 2 hours and a PhD on a single problem. Try to recognize how much effort each problem is worth. You can often get a satisfactory solution with 20% of the effort compared to a “perfect” solution. Learn to fail fast. Sometimes data science solutions do not work, and it’s good to realise this as soon as possible. Adopting agile software development practices helps! Agile Data Science with R: https://edwinth.github.io/ADSwR/index.html
  26. 26. Data Science in a Team: No single person can master every possible data science skill. Data scientists work effectively in teams, with complementary skill sets and backgrounds. When looking for your first job as a data scientist, look for places where there are senior people who can help you learn and grow.
  27. 27. Data Science Use Cases
  28. 28. Data Science as part of a Product or Project: Data Science is typically only a small part of the larger Product or Project. It is important to know what the overall goal is, and to adjust data science development towards that. You need to collaborate with other people, such as designers, software developers, marketing and sales people, customers, etc.
  29. 29. Some takeaway notes: Data Science is an art – you only learn it by doing. Find ways to continuously learn and practice your skills, e.g. with hobby projects or competitions. Finding a problem worth solving is hard. There is never a single correct solution. Curiosity and critical thinking are invaluable!
  30. 30. Thank you! Juuso Parkkinen, PhD - @ouzor Head of Data Science, Nightingale Health - @NgaleHealth www.nightingalehealth.com
