Most of the time, when you hear about Artificial Intelligence (AI), people talk about new algorithms or the computing power needed to train them. But data is one of the most important factors in AI.
2. Today's Outline
Topic Highlights
• Data is everywhere
• Data access complexity
• Data science projects
• Typical data science workflow
• How to facilitate Data Acquisition / Data Exploitation
• Labelling tool LegIAnnotate
• What I need to start a project
• Concrete examples
3. Data is everywhere
“Without a systematic way to start and
keep data clean, bad data will happen.”
— Donato Diorio
Since the rise of data storage, we have been collecting large quantities of information about users, processes, monitoring, logs, etc.
This information can be stored in flat files, databases, images, and so on.
4. Data access complexity
Companies own large amounts of data, and part of it is not exploited, for various reasons:
• Unawareness of the data's usefulness
• Unclear added value of the data to the company
• Difficulty accessing the data
• Data quality problems
• Storage systems dating back to the 20th century
• No documentation (e.g. no data semantics)
• Complex reverse engineering
• Compatibility problems (e.g. no existing connectors between old and new systems)
• The people who worked on it are no longer available
5. Data science projects
Leverage your business
DATA PROJECTS
• Data collection / labelling
• Database creation / management
• Data warehouse creation / management
• Data architecture / storage
STATISTICAL PROJECTS
• Machine Learning
• Optimization
• Artificial Intelligence
• R&D
DATA VISUALIZATION PROJECTS
• Excel sheets (tables / graphics)
• Visualization and analytics tools (Power BI, Tableau, Plotly, D3.js, etc.)
Data science is not only about analytical projects; projects can be small and simple, or large and complex.
6. Typical data science workflow
1. Detect the sources to use and set up the pipeline to collect data from them.
2. Prepare the data: transform raw data into the desired format.
3. Exploit the data mart in a lab environment, using analytics and/or machine learning algorithms to draw insights.
4. Move to a production environment.
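The four steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a production pipeline: the dataset and column names are made up, and step 3 draws only a simple aggregate insight where a real project might fit analytics or ML models.

```python
import pandas as pd

# 1. Collect: stand-in for a pipeline pulling from source systems.
raw = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [120.0, 80.0, None, 150.0, 130.0],
})

# 2. Prepare: transform raw data into the desired format
#    (here, simply dropping incomplete records).
clean = raw.dropna()

# 3. Exploit in a lab environment: draw a simple insight.
insight = clean.groupby("region")["amount"].mean()
print(insight.to_dict())  # → {'north': 125.0, 'south': 115.0}

# 4. Moving to production would wrap these steps in a scheduled,
#    monitored pipeline.
```

In practice each step grows into its own component (connectors, validation rules, model training, deployment), but the shape of the flow stays the same.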
7. How to facilitate data acquisition / preparation
DATA ACQUISITION
• Ask yourself what data is needed
• Detect where, and on which infrastructure, this data is stored
• Use of external data (data enrichment)
• Free and paid data (e.g. weather, images)
• Custom crawling scripts
• Crowdsourcing (e.g. labelling tools)
DATA PREPARATION
• Data cleaning (data cleansing)
• Data profiling
• Understand the data
• Determine the quality of the output
• Data granularity
• Current infrastructure vs. a new one (e.g. moving Excel files to flat files, relational or non-relational DBs)
• Custom scripts
Acquiring or preparing data is rarely an easy task.
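As a concrete illustration of data profiling, a few lines of pandas can surface row counts, missing values, and types before any cleaning decisions are made. The sensor columns below are hypothetical.

```python
import pandas as pd

# Hypothetical raw extract: hourly sensor readings.
df = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "b"],
    "temperature": [21.5, None, 19.0, 20.2],
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 00:00", "2023-01-01 01:00",
    ]),
})

# A minimal profile: size, completeness, and types say a lot about
# quality and granularity before any modelling starts.
profile = {
    "rows": len(df),
    "missing": df.isna().sum().to_dict(),
    "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
}
print(profile["missing"])  # → {'sensor_id': 0, 'temperature': 1, 'timestamp': 0}
```

A profile like this also exposes granularity questions directly: the timestamp spacing shows the periodicity, and the column list shows the level of detail captured.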
8. A Labelling tool: LegIAnnotate
An image annotation tool for creating the datasets used to train computer vision models.
Benefits:
• Collaborative labelling tool
• Easy to use
• Customizable to suit your needs
• Data storage standardization
• Full application control
Link: https://legiannotate.nrb-ai.nrbdigital.be/
9. Data science project starter pack
• What is my final goal with this project? (Important)
• What will my outcome be?
• What kind of data do I have? Which format? Which quality?
• A good overview of the business
• Different expert profiles (Data Scientists, full stack devs, DB architects, etc.)
• Communication
10. Data science project starter pack
• Lab environment (conda, virtualenv, etc.)
• Jupyter Notebook
• Analytics and ML libraries
• Languages (R, Python, JavaScript, etc.)
• External sources
• Databases
• Document files
• Production environment
• Visualization
• Dashboarding
11. Will a data science project be successful?
• Do I need data? If not, let's start.
• Do I have data? If not, let's collect data.
• Is my data of good quality? If not, it depends on the use case.
• Is this data sufficient? If not, let's collect more data; if yes, let's start.
12. Example: Fraud detection in insurance
• What is my final goal with this project?
Detect fraudulent affiliates.
• What will my outcome be?
A confidence measure (e.g. the probability of being a fraudulent affiliate).
• What kind of data do I have? Which format? Which quality?
Data is stored on old database systems, and we are not sure about its quality. The data is not labelled (i.e. we do not have a target associated with each record).
Challenges: data extraction and quality problems; migration vs. existing data storage.
[Diagram: DATA → MODEL (ML algorithm: decision trees) → OUTPUT ("CLAIM FRAUD!")]
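Because the records are unlabelled, a supervised model cannot be trained directly. One common workaround (a sketch, not the project's actual method) is an unsupervised anomaly score as a first confidence measure; here a simple z-score on claim amounts stands in for it, with made-up numbers and an arbitrary threshold.

```python
import statistics

# Illustrative claim amounts; no fraud labels are available.
amounts = [120.0, 95.0, 110.0, 130.0, 4800.0, 105.0]
mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# z-score: how far each record sits from the typical amount.
# The 1.5 cutoff is arbitrary and would need tuning per use case.
scores = [abs(a - mean) / stdev for a in amounts]
flagged = [a for a, s in zip(amounts, scores) if s > 1.5]
print(flagged)  # → [4800.0]
```

Records flagged this way could then be reviewed by experts, and those reviews become the labels needed to train a proper supervised model later.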
13. Example: Optimization of data center energy consumption
• What is my final goal with this project?
Reduce the energy consumption of a data center.
• What will my outcome be?
Recommendations about which parameters to tweak to reduce consumption.
• What kind of data do I have? Which format? Which quality?
Data about the data center's energy consumption, as well as information about the elements that could influence it (e.g. weather).
• Good comprehension of the business
Data center automation engineers can share their expertise.
[Diagram: DATA (data center information at regular intervals) → MODEL (optimization model) → OUTPUT (recommendations)]
• Difficulties keeping the collectors in sync with the data platform
• The data collection process was done without consulting the data scientists
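To make the recommendation step concrete, here is a minimal sketch (made-up numbers, assuming outside temperature is one of the influencing elements): a linear fit with NumPy quantifies how much consumption each extra degree drives, a sensitivity that could back recommendations such as cooling setpoints.

```python
import numpy as np

# Illustrative hourly observations: outside temperature (°C)
# and the data center's energy consumption (kWh).
temps = np.array([15.0, 18.0, 21.0, 24.0, 27.0, 30.0])
kwh = np.array([200.0, 230.0, 260.0, 290.0, 320.0, 350.0])

# Linear fit: slope = extra kWh consumed per extra degree.
slope, intercept = np.polyfit(temps, kwh, 1)
print(round(slope, 1))  # → 10.0 (kWh per extra degree)
```

A real optimization model would consider many more variables and non-linear effects, but even a first-order sensitivity like this gives engineers a number to reason about.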
14. Meet the Team
LEJOLY LOÏC
Data Scientist at NRB
DOLORIS SAMY
Data Scientist at NRB
LEILA REBBOUH
Head of Data Science at
NRB
@LoicLejoly
in/loic-lejoly/
loic.lejoly@nrb.be
@SamyDoloris
in/samy-doloris-490421158/
samy.doloris@nrb.be
@leilarebbouh
in/leilarebbouh/
leila.rebbouh@nrb.be
Editor's notes
Webinar focused on the data science topic
A trending word
What is behind data science is less well known
As the name suggests, it is related to data as well as science
This talk will be focused on DATA
Also, a point to mention:
Data science projects involve not only data scientists
IT and non-IT profiles
Different data scientist profiles (business-centric, data-centric, statistical/ML-centric)
- Modules in apps and services to easily collect data (Google Analytics, Cloud Services, Cookies, Etc.)
A lot of data is not used properly, or not used enough
Example: an English website whose user data shows that 60% of the users are French
Various types of data science projects
- Flat file = CSV, TSV, etc.
Data granularity:
- Periodicity
- Level of detail (e.g. sensor temperature only vs. temperature, humidity, wind, etc.)
Tool based on the Make Sense GitHub repo
Open source
- Sponsor crowdsourcing
Important: to avoid taking a wrong development path and losing crucial time on an unreachable project
Continuous iteration process
- Depends on the use case: an example is a model that detects bad-quality data based on a certain threshold