2. I’m Florian and I like data
Year  Domain                  Time Spent Coding  Favorite Language  Largest Dataset
2001  Programming Languages   100%               C                  100GB
2004  Natural Language Proc.  100%               Exascript          100GB
2006  Social Recommendation   80%                Exascript          10GB
2008  Distributed Computing   50%                Exascript          10TB
2009  Web Mining              20%                Python             100TB
2010  (none listed)           0%                 Powerpoint         100kB
2011  Social Gaming           10%                Python             500GB
2012  Advertising             50%                Java               100TB
2013  Dataiku                 20%                None               10TB
4. Goals For Today
• Big Data, with the bias of what I know of it (Analytics …)
• Big Data: History and Feelings
• What are the key technologies to watch?
• Some practical use cases?
• How to get started?
34. A PERSISTENT MEMORY PROBLEM
         2000       2013
Memory   $1000/GB   $6/GB      (memory cost divided by 150)
Disk     $10/GB     $0.06/GB   (disk cost divided by 250)
MAP REDUCE times
HACK REDUCE times
39. HOW BIG IS BIG DATA?
Web Site
– $1 billion revenue per year
– 10 million unique visitors per month
– 100 million orders / actions per day
10TB RAW DATA
1TB REFINED DATA
43. ALL > SPARK
Real-Time Resilient Distributed Memory Framework
• Abstraction with any DAG operation on data:
- Filter
- Map
- Reduce
- Cache
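The four operations listed above can be illustrated with a small single-machine sketch. This is not the Spark API: the `Dataset` class and all its method names are hypothetical, written only to show how filter and map steps can be chained lazily into a graph, how a reduce triggers the actual computation, and how a cache avoids recomputing a shared intermediate result.

```python
from functools import reduce as _reduce


class Dataset:
    """Toy, in-process stand-in for a distributed dataset (hypothetical, not Spark)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None        # filled on first evaluation after cache()

    @classmethod
    def from_list(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()

    def filter(self, predicate):
        # Lazy: nothing runs until a reduce asks for the values.
        return Dataset(lambda: (x for x in self._iterate() if predicate(x)))

    def map(self, fn):
        return Dataset(lambda: (fn(x) for x in self._iterate()))

    def cache(self):
        # Keep the computed values in memory for later reuse.
        self._use_cache = True
        return self

    def reduce(self, fn):
        # The only operation here that actually triggers computation.
        return _reduce(fn, self._iterate())


evaluations = []  # records every call to square(), to make caching visible


def square(x):
    evaluations.append(x)
    return x * x


numbers = Dataset.from_list(range(1, 11))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(square).cache()

total = even_squares.reduce(lambda a, b: a + b)  # computes and caches
largest = even_squares.reduce(max)               # served from the cache
```

Because `even_squares` is cached, the second reduce reuses the values held in memory instead of re-running the filter and map steps, which is the property the slide highlights.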
44. SPARK AND ITS ECOSYSTEM
SHARK: Real-Time Queries
STREAMING: Real-Time Updates
MLBASE: In-Memory Learning
SPARK
46. Turn Device Logs Into Next Year’s Business
www.dataiku.com
• Parking ticket machine data
• OpenStreetMap data
• Cleaning and enrichment of data
• Crossing data
• Data Science Studio
• Creation of a predictive algorithm
• Availability of the predictions
Each street is segmented into small pieces that are enriched with geospatial information.
The parking ticket history is joined with the points of interest from OpenStreetMap.
The availability of parking spaces is predicted per street segment from the joined data.
The algorithm is finally integrated in the iPhone app « Find me a space ».
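The join between the parking ticket history and the OpenStreetMap points of interest could be sketched as follows. The slides do not give the real schemas or the prediction algorithm, so the field names (`segment_id`, `hour`, `tickets_sold`, `kind`), the sample records, and the `join_by_segment` helper are all hypothetical; the sketch covers only the join step, not the prediction.

```python
from collections import defaultdict

# Hypothetical sample records; the real data layout is not described in the slides.
tickets = [
    {"segment_id": "s1", "hour": 9, "tickets_sold": 12},
    {"segment_id": "s1", "hour": 18, "tickets_sold": 4},
    {"segment_id": "s2", "hour": 9, "tickets_sold": 3},
]
points_of_interest = [
    {"segment_id": "s1", "kind": "restaurant"},
    {"segment_id": "s1", "kind": "school"},
    {"segment_id": "s2", "kind": "park"},
]


def join_by_segment(ticket_rows, poi_rows):
    """Attach the points of interest of each street segment to its ticket rows."""
    kinds_by_segment = defaultdict(list)
    for poi in poi_rows:
        kinds_by_segment[poi["segment_id"]].append(poi["kind"])

    joined = []
    for ticket in ticket_rows:
        row = dict(ticket)  # copy so the input rows stay untouched
        row["poi_kinds"] = sorted(kinds_by_segment.get(ticket["segment_id"], []))
        joined.append(row)
    return joined


rows = join_by_segment(tickets, points_of_interest)
```

Each joined row then carries both the ticket history and the surrounding points of interest for its segment, which is the kind of input a per-segment predictive model would need.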
47. Optimizing Last Mile with Data Science Studio
• Historical delivery and retrieval data
• Modeling of a score for each delivery
• Cleaning and temporal enrichment of data
• Data aggregation by geographic location
• Incorporation of new deliveries to the existing model
48. HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?
• Search reformulation
• No answer
• Click on a professional
• Top search
• Navigation or filter click
>200M searches
20M
1.4M queries with >10 occurrences
0.5M prioritized queries
Analysis & corrections, automation
✗ ✓
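The step that keeps only queries seen more than 10 times can be sketched in a few lines. The threshold comes from the slide; the normalization (trimming and lower-casing), the `frequent_queries` helper, and the sample log are assumptions made for illustration. How the 0.5M prioritized queries were selected is not stated, so that step is left out.

```python
from collections import Counter

MIN_OCCURRENCES = 10  # from the slide: keep queries with more than 10 occurrences


def frequent_queries(search_log, min_occurrences=MIN_OCCURRENCES):
    """Count normalized queries and keep those seen more than min_occurrences times."""
    # Assumption: queries differing only by case or surrounding spaces are the same.
    counts = Counter(query.strip().lower() for query in search_log)
    return {query: n for query, n in counts.items() if n > min_occurrences}


# Hypothetical search log, standing in for the real one.
search_log = ["plumber paris"] * 12 + ["Plumber Paris "] * 3 + ["rare query"] * 2
frequent = frequent_queries(search_log)
```

On a real log of this size the counting would run on a distributed engine rather than in a single process, but the filtering logic stays the same.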
51. 2015: BUILD YOUR FACTORY
Multiple Data Sources: CRM, Logs
Analyst Team
Server Cluster
Light Software
Many Models:
• Personalised Experience Model
• Acquisition Cost Opportunity Model
• Stock Optimisation Model
• Optimize Delivery
56. STEP 2 : PRACTICE
• Try entering a contest on kaggle.com or datascience.net
• Join a meetup
57. www.dataiku.com
http://www.dataiku.com/dss/trynow/
Dataiku HQ
2 rue Jean Lantier
75001 Paris France
Dataiku West
2423A Durant Avenue
Berkeley, CA 94704
Florian
florian.douetteau@dataiku.com
You have ideas
“My data is too dirty. I don’t even know where to start”
“We could probably understand our users better. But how?”
“There’s a trend here, but our full historical data is just too big”
You have data
You need a tool