The Data Science Venn Diagram
My career steps
MSc in bioinformation technology
from HUT / Aalto
PhD in bioinformatics and machine
learning from Aalto
Data Scientist (consultant) at Reaktor
Data Scientist at Nightingale Health
Data Science research: probabilistic
models for biomedical problems
More on my research and other projects:
Data Science as a hobby: open tools for open data
Open source programming
Open Knowledge community
Blogs: https://louhos.github.io/, https://ouzor.github.io/
Open data science example: Biking activity in Helsinki
How do various factors affect biking activity
- Automatic bike activity counters from
- Weather data from FMI
Bike activity modelled with Negative
Binomial distribution using R (mgcv::gam)
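A rough illustration of why a negative binomial distribution suits count data such as bike counter readings: its variance is larger than its mean, which a Poisson model cannot express. The sketch below is plain Python, not the R model (mgcv::gam) used in the analysis. The mean/dispersion parameterisation (variance = mu + mu^2 / theta) and all numbers are illustrative assumptions, not values from the study.

```python
import math


def neg_binom_pmf(k: int, mu: float, theta: float) -> float:
    """P(K = k) for a negative binomial with mean mu and dispersion theta.

    The variance is mu + mu**2 / theta, so a smaller theta means
    more overdispersion relative to a Poisson with the same mean.
    """
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def neg_binom_variance(mu: float, theta: float) -> float:
    """Variance implied by the mean/dispersion parameterisation."""
    return mu + mu ** 2 / theta


# Hypothetical values: 50 bikes per hour on average, strong overdispersion.
mu, theta = 50.0, 5.0

# Numerical sanity checks: probabilities sum to one, the mean equals mu.
total_probability = sum(neg_binom_pmf(k, mu, theta) for k in range(2000))
mean_check = sum(k * neg_binom_pmf(k, mu, theta) for k in range(2000))
```

With these made-up numbers the variance is eleven times the mean, the kind of spread that hourly counts affected by weather and weekday can show.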
Done with Janne Sinkkonen and Antti
Data, code & results:
Open Data Science at Reaktor: Apartment price modelling
Kannattaakokauppa.fi by Reaktor: http://kannattaakokauppa.fi
More about the model: https://ouzor.github.io/blog/2016/03/08/apartment-price-model.html
Data Science in the vacuum
Typically starts with a clean data set and a clear (modelling) task.
Example: Weather data in csv, and a goal to predict humidity.
What might be different in the real world?
Data Science Workflow in the Real World
1. Identifying and defining the problem
2. Accessing data
3. Preprocessing and cleaning the data
4. Exploratory data analysis and visualisation
5. Statistical modelling or machine learning
6. End result
Note the difference between academic interests and practical relevance!
Identifying and defining the problem
Learn to be critical and ask good questions!
• Why is this problem important?
• How does solving this improve our user experience?
• How does solving this improve our business?
• Is the problem really something we should solve, or is it just something where we happen to have data or methods?
• Do we even need to solve this problem!?
Only after the problem is identified can you start thinking about data science:
- Do we have relevant data to support solving the problem?
- Can we use modelling to solve the problem (e.g. prediction or classification)?
Data exists in a variety of sources and formats.
A data scientist might need to access data from any of
these in a reasonable time.
Typical data sources: files, APIs, databases, the web
Typical data formats:
- CSV, TSV, Excel
- JSON (XML less nowadays)
- Lots of strange structure in text files
- Relational data (networks)
- Spatial data
- Gene expression, genomic data
Example: Weather data from WFS API
Be very careful with Excel data formatting!
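A minimal sketch of bringing two of the formats above into one common shape, using only the Python standard library. The sample data, the field names and the helper functions are made up for illustration. Values are converted explicitly, and dates are kept as ISO strings, which is one way to avoid the silent reformatting that spreadsheet tools can apply.

```python
import csv
import io
import json

# Made-up sample data standing in for a file and an API response.
CSV_TEXT = "date,station,temperature_c\n2016-03-01,A,1.5\n2016-03-02,A,\n"
JSON_TEXT = '[{"date": "2016-03-01", "station": "A", "temperature_c": 1.5}]'


def parse_temperature(raw):
    """Convert a raw field to float; empty or missing values become None."""
    if raw is None or raw == "":
        return None
    return float(raw)


def records_from_csv(text):
    """Read CSV text into a list of dicts with explicit types."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "date": row["date"],  # kept as an ISO string on purpose
            "station": row["station"],
            "temperature_c": parse_temperature(row["temperature_c"]),
        }
        for row in reader
    ]


def records_from_json(text):
    """Read a JSON array into the same record shape as the CSV reader."""
    return [
        {
            "date": item["date"],
            "station": item["station"],
            "temperature_c": parse_temperature(item.get("temperature_c")),
        }
        for item in json.loads(text)
    ]


csv_records = records_from_csv(CSV_TEXT)
json_records = records_from_json(JSON_TEXT)
```

Once every source yields the same record shape, the later workflow steps do not need to know where the data came from.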
Preprocessing and cleaning (”wrangling” / ”munging”)
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Having data in a tidy format makes data analysis, visualisation and modelling easier.
Data frames in R and Python.
Read more about tidy data: https://r4ds.had.co.nz/tidy-data.html
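To make "tidy" concrete: each row is one observation and each column is one variable. The sketch below reshapes a small made-up "wide" table into that form in plain Python. The helper name to_tidy and the data are hypothetical. In practice tidyr::pivot_longer in R or pandas.melt in Python do the same job.

```python
# Made-up "wide" table: one row per station, one column per year.
wide = [
    {"station": "A", "2015": 120, "2016": 135},
    {"station": "B", "2015": 80, "2016": 95},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Melt wide rows so that every output row holds one observation."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy_rows.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                }
            )
    return tidy_rows


tidy = to_tidy(wide, "station", "year", "count")
```

In the tidy version the year is a value in a column instead of being hidden in column names, so filtering, grouping and plotting by year become straightforward.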
Exploratory Data Analysis and Visualisation
The goal of Exploratory Data Analysis is to get
to know your data, using visual summaries and
computing descriptive statistics.
Includes identifying missing data, outliers and
other possible problems with the data.
This informs preprocessing and cleaning, and
typically needs a couple of iterations before the
data is ready for analysis.
You should also contact domain experts and
confirm if the data looks as it should.
It’s hard to define when the data is really
”clean”. You will develop an instinct for this.
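A first pass at exploratory data analysis can be sketched with the Python standard library: count the missing values, compute descriptive statistics, and flag possible outliers. The data is made up, and the 1.5 x IQR rule used here is one common convention, not the method from the slides.

```python
import statistics

# Made-up hourly counts; None marks a missing reading.
counts = [12, 15, 14, None, 13, 16, 250, 15, None, 14]


def summarise(values):
    """Return basic descriptive statistics and outlier candidates."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        # Candidates only: a domain expert should confirm whether
        # these are errors or real events.
        "outliers": [v for v in observed if v < low or v > high],
    }


summary = summarise(counts)
```

Here the mean is pulled far above the median by a single extreme reading, which is the kind of signal that should send you back to preprocessing or to a domain expert.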
Statistical modelling and machine learning
Modelling is one way to reach a goal in data analysis, not a
goal in itself.
Pick a suitable method based on your goals - not the other way around.
Start with simple methods, add complexity gradually, if needed.
You can get pretty far with linear or logistic regression.
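As a sketch of how little is needed for a simple baseline, here is ordinary least squares for a single predictor in plain Python. The temperature and humidity values are made up, and the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: temperature (degrees C) and relative humidity (%).
temperature = [0.0, 5.0, 10.0, 15.0, 20.0]
humidity = [90.0, 82.0, 75.0, 69.0, 60.0]

intercept, slope = fit_line(temperature, humidity)

# Predict humidity for a new, hypothetical temperature of 12 degrees C.
predicted = intercept + slope * 12.0
```

A baseline like this gives a reference point: a more complex model is only worth its cost if it clearly beats it.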
The end result of a data science project can be many things, such as
- A single figure describing the association of two variables
- A comprehensive report for a client or business department
- A machine learning product ready to be deployed into production
In most projects, it is important to write some kind of report or documentation of what has been done.
Learning to communicate effectively is a very important skill for data scientists. This includes producing clear visual
summaries of the main results, and using generally understandable language.
Deploying Data Products
Data science is useful in creating insights, increasing understanding, and informing decision making.
The biggest impact, however, comes from intelligent systems that operate automatically and continuously, such as
recommendation engines. This typically means that data science products are deployed as part of larger software systems.
Deploying your first data products can be frightening for data scientists with no programming background.
Get support from software developers or data engineers!
Data Science Tools – Some tips
Make everything reproducible and use version control!
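One small, concrete piece of reproducibility is controlling randomness. The sketch below, with a hypothetical helper train_test_split and an arbitrary seed, shows a data split that gives the same result on every run.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and split them into train and test."""
    rng = random.Random(seed)  # local generator, no hidden global state
    shuffled = list(rows)      # copy, so the caller's data is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = train_test_split(rows)
train_b, test_b = train_test_split(rows)  # identical to the first call
```

Keeping the seed in code that is under version control means a colleague can rerun the analysis and get the same split.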
Tidyverse is a family of R packages that cover most of the
data science workflow.
Many similar tools exist for Python!
R for Data Science: https://r4ds.had.co.nz/
How to learn the Data Science Workflow?
Data Science is an art – you only learn it by doing!
• Pick challenging courses with large and realistic projects
• Start a hobby project, for example using some open data set, and share the code and results (e.g. GitHub)
• Participate in competitions and challenges
• Tidytuesdays: https://github.com/rfordatascience/tidytuesday
• Kaggle: https://www.kaggle.com/
Learning a proper Data Science Workflow will help you in producing reliable results in a reasonable time.
This will benefit your career regardless of whether you work in academia, industry, or somewhere else.
Agile Data Science
Any sufficiently interesting problem has more than one ”correct” answer.
You can spend anything between 2 hours and a PhD on a single problem. Try to recognize how much effort each problem
is worth.
You can often get a satisfactory solution with 20% of the effort compared to a ”perfect” solution.
Learn to fail fast. Sometimes data science solutions do not work, and it’s good to realise this as soon as possible.
Adopting agile software development practices helps!
Agile Data Science with R: https://edwinth.github.io/ADSwR/index.html
Data Science in a Team
No single person can master every possible data science skill.
Data scientists work effectively in teams, with
complementary skill sets and backgrounds.
When looking for your first job as a data scientist, look for
places where there are senior people who can help you
learn and grow as a data scientist.
Data Science as part of a Product or Project
Data Science is typically only a small part of the larger
Product or Project.
It is important to know what the overall goal is, and to
adjust data science development towards that.
You need to collaborate with other people, such as
designers, software developers, marketing and sales
people, customers, etc.
Some takeaway notes
Data Science is an art – you only learn it by doing.
Find ways to continuously learn and practice your skills, with e.g.
hobby projects or competitions.
Finding a problem worth solving is hard.
There is never a single correct solution.
Curiosity and critical thinking are invaluable!
Juuso Parkkinen, PhD - @ouzor
Head of Data Science, Nightingale Health - @NgaleHealth