Most of the time, when you hear about Artificial Intelligence (AI), people talk about new algorithms or the computing power needed to train them. But data is one of the most important factors in AI.
2. Today's Outline
Topic Highlights
• Data is everywhere
• Data access complexity
• Data science projects
• Typical data science workflow
• How to facilitate Data Acquisition / Data Exploitation
• Labelling tool LegIAnnotate
• What I need to start a project
• Concrete examples
3. Data is everywhere
“Without a systematic way to start and
keep data clean, bad data will happen.”
— Donato Diorio
Since the rise of data storage, we have been collecting large quantities of information about users, processes, monitoring, logs, etc.
This information can be stored in flat files, databases, images, and so on.
4. Data access complexity
Companies own large amounts of data, and part of it is not exploited, for various reasons:
• Unawareness of the data's usefulness
• Unclear added value of the data to the company
• Difficulty accessing the data
• Data quality problems
• Storage systems dating back to the 20th century
• No documentation (e.g. no data semantics)
• Complex reverse engineering
• Compatibility problems (e.g. no existing connectors between old and new systems)
• The people who worked on it are no longer available
5. Data science projects
Leverage your business
DATA PROJECTS
• Data collection / labelling
• Database creation / management
• Data warehouse creation / management
• Data architecture / storage
STATISTICAL PROJECTS
• Machine Learning
• Optimization
• Artificial Intelligence
• R&D
DATA VISUALIZATION PROJECTS
• Excel sheets (tables / graphics)
• Visualization and analytics tools (Power BI, Tableau, Plotly, D3.js, etc.)
Data science is not only about analytical projects; projects can be small and simple, or large and complex.
6. Typical data science workflow
1. Detect the sources to use and set up the pipeline to collect data from them.
2. Prepare the data: transform raw data into the desired format.
3. Exploit the data mart in a lab environment, using analytics and/or machine learning algorithms to draw insights.
4. Move to a production environment.
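The four steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a production pipeline: the dataset and column names are made up, and step 3 draws only a simple aggregate insight where a real project might fit analytics or ML models.

```python
import pandas as pd

# 1. Collect: stand-in for a pipeline pulling from source systems.
raw = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [120.0, 80.0, None, 150.0, 130.0],
})

# 2. Prepare: transform raw data into the desired format
#    (here, simply dropping incomplete records).
clean = raw.dropna()

# 3. Exploit in a lab environment: draw a simple insight.
insight = clean.groupby("region")["amount"].mean()
print(insight.to_dict())  # → {'north': 125.0, 'south': 115.0}

# 4. Moving to production would wrap these steps in a scheduled,
#    monitored pipeline.
```

In practice each step grows into its own component (connectors, validation rules, model training, deployment), but the shape of the flow stays the same.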
7. How to facilitate data acquisition / preparation
DATA ACQUISITION
• Ask yourself what data is needed
• Detect where, and on which infrastructure, this data is stored
• Use of external data (data enrichment)
• Free and paid data (e.g. weather, images)
• Custom crawling scripts
• Crowdsourcing (e.g. labelling tools)
DATA PREPARATION
• Data cleaning (data cleansing)
• Data profiling
• Understand the data
• Determine the quality of the output
• Data granularity
• Current infrastructure vs. a new one (e.g. moving Excel files to flat files, relational or non-relational DBs)
• Custom scripts
Acquiring or preparing data is rarely an easy task.
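As a concrete illustration of data profiling, a few lines of pandas can surface row counts, missing values, and types before any cleaning decisions are made. The sensor columns below are hypothetical.

```python
import pandas as pd

# Hypothetical raw extract: hourly sensor readings.
df = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "b"],
    "temperature": [21.5, None, 19.0, 20.2],
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 00:00", "2023-01-01 01:00",
    ]),
})

# A minimal profile: size, completeness, and types say a lot about
# quality and granularity before any modelling starts.
profile = {
    "rows": len(df),
    "missing": df.isna().sum().to_dict(),
    "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
}
print(profile["missing"])  # → {'sensor_id': 0, 'temperature': 1, 'timestamp': 0}
```

A profile like this also exposes granularity questions directly: the timestamp spacing shows the periodicity, and the column list shows the level of detail captured.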
8. A Labelling tool: LegIAnnotate
An image annotation tool for creating the datasets used to train computer vision models.
Benefits:
• Collaborative labelling tool
• Easy to use
• Customizable to suit your needs
• Data storage standardization
• Full application control
Link: https://legiannotate.nrb-ai.nrbdigital.be/
9. Data science project starter pack
• What is my final goal with this project? (Important)
• What will my outcome be?
• What kind of data do I have? Which format? Which quality?
• A good overview of the business
• Different expert profiles (Data Scientists, full stack devs, DB architects, etc.)
• Communication
10. Data science project starter pack
• Lab environment (conda, virtualenv, etc.)
• Jupyter Notebook
• Analytics and ML libraries
• Languages (R, Python, JavaScript, etc.)
• External sources
• Databases
• Document files
• Production environment
• Visualization
• Dashboarding
11. Will a data science project be successful?
• Do I need data? If not, let's start.
• Do I have data? If not, let's collect data.
• Is my data of good quality? If not, it depends on the use case.
• Is this data sufficient? If not, let's collect more data; if yes, let's start.
12. Example: Fraud detection in insurance
• What is my final goal with this project?
Detect fraudulent affiliates.
• What will my outcome be?
A confidence measure (e.g. the probability of being a fraudulent affiliate).
• What kind of data do I have? Which format? Which quality?
Data is stored on old database systems, and we are not sure about its quality. The data is not labelled (i.e. we do not have a target associated with each record).
Challenges: data extraction and quality problems; migration vs. existing data storage.
[Diagram: DATA → MODEL (ML algorithm: decision trees) → OUTPUT ("CLAIM FRAUD!")]
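Because the records are unlabelled, a supervised model cannot be trained directly. One common workaround (a sketch, not the project's actual method) is an unsupervised anomaly score as a first confidence measure; here a simple z-score on claim amounts stands in for it, with made-up numbers and an arbitrary threshold.

```python
import statistics

# Illustrative claim amounts; no fraud labels are available.
amounts = [120.0, 95.0, 110.0, 130.0, 4800.0, 105.0]
mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# z-score: how far each record sits from the typical amount.
# The 1.5 cutoff is arbitrary and would need tuning per use case.
scores = [abs(a - mean) / stdev for a in amounts]
flagged = [a for a, s in zip(amounts, scores) if s > 1.5]
print(flagged)  # → [4800.0]
```

Records flagged this way could then be reviewed by experts, and those reviews become the labels needed to train a proper supervised model later.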
13. Example: Optimization of data center energy consumption
• What is my final goal with this project?
Reduce the energy consumption of a data center.
• What will my outcome be?
Recommendations about which parameters to tweak to reduce consumption.
• What kind of data do I have? Which format? Which quality?
Data about the data center's energy consumption, as well as information about the elements that could influence it (e.g. weather).
• Good comprehension of the business
Data center automation engineers can share their expertise.
[Diagram: DATA (data center information at regular intervals) → MODEL (optimization model) → OUTPUT (recommendations)]
• Difficulties keeping the collectors in sync with the data platform
• The data collection process was done without consulting the data scientists
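To make the recommendation step concrete, here is a minimal sketch (made-up numbers, assuming outside temperature is one of the influencing elements): a linear fit with NumPy quantifies how much consumption each extra degree drives, a sensitivity that could back recommendations such as cooling setpoints.

```python
import numpy as np

# Illustrative hourly observations: outside temperature (°C)
# and the data center's energy consumption (kWh).
temps = np.array([15.0, 18.0, 21.0, 24.0, 27.0, 30.0])
kwh = np.array([200.0, 230.0, 260.0, 290.0, 320.0, 350.0])

# Linear fit: slope = extra kWh consumed per extra degree.
slope, intercept = np.polyfit(temps, kwh, 1)
print(round(slope, 1))  # → 10.0 (kWh per extra degree)
```

A real optimization model would consider many more variables and non-linear effects, but even a first-order sensitivity like this gives engineers a number to reason about.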
14. Meet the Team
LEJOLY LOÏC
Data Scientist at NRB
DOLORIS SAMY
Data Scientist at NRB
LEILA REBBOUH
Head of Data Science at
NRB
@LoicLejoly
in/loic-lejoly/
loic.lejoly@nrb.be
@SamyDoloris
in/samy-doloris-490421158/
samy.doloris@nrb.be
@leilarebbouh
in/leilarebbouh/
leila.rebbouh@nrb.be
Editor's notes
Webinar focused on the data science topic
A trending word
What is behind data science is less well known
As the name suggests, it is related to data as well as science
This talk will be focused on DATA
Also, a point to mention:
Data science projects involve not only data scientists
IT and non-IT profiles
Different data scientist profiles (business-centric, data-centric, statistical/ML-centric)
- Modules in apps and services to easily collect data (Google Analytics, Cloud Services, Cookies, Etc.)
A lot of data is not used properly, or not used enough
Example: an English website whose user data shows that 60% of the users are French
Various types of data science projects
- Flat file = CSV, TSV, etc.
Data granularity:
- Periodicity
- Level of detail (e.g. sensor temperature only vs. temperature, humidity, wind, etc.)
Tool based on the Make Sense GitHub repo
Open source
- Sponsor crowdsourcing
Important: to avoid taking a wrong development path and losing crucial time on an unreachable project
Continuous iteration process
- Depends on the use case: an example is a model that detects bad-quality data based on a certain threshold