IS DATA PREPARATION THE NEXT BIG DATA DISRUPTION?
The	22nd	International	Conference	on	Distributed	Multimedia	Systems
DMS	2016
Grand	Hotel	Salerno,	Salerno,	Italy
November	25	- 26,	2016
Agenda
• SCENARIO
• BIG DATA IN THE DATA DRIVEN ENTERPRISE
• WHAT DATA PREPARATION SHOULD COVER
• CREATING READY DATA USING FRACTALS
• CASE STUDY
Source	Forrester	2016
1. DOES THE BUSINESS ANALYST UNDERSTAND THE DATA SCIENTIST?
2. WHY ARE DATA DRIVEN COMPANIES HIRING DATA JOURNALISTS?
3. WHY DOES DARK DATA OUTSIDE DATA LAKES CONTINUE TO GROW?
4. WHY DOES MAKING DATA TAKE SO LONG?
5. DATA PLAY AND NARRATIVES?
HOW MUCH TIME IS LEFT TO EXPLOIT THE OUTPUT OF DATA PROCESSING?
77% Data Processing
23% Data Analysis
Source: Bloor 2016
90%	IS	DARK
12%	AVAILABLE	FOR	BUSINESS	INSIGHTS	
88%	IS	JUST	STORED
80%	RECORDINGs,	PDFs AND	TEXTs
source	IDC	2016
+4300%	ANNUAL	DATA	GENERATION
Data	preparation	is	an	iterative	process	for	exploring	and	transforming	raw	data	into	forms	
suitable	for	data	science,	data	discovery,	and	analytics.	
Self-service	data	preparation	tools	(SSDP)	are	user-oriented	tools	that	enable	data	preparation	
capabilities	such	as	data	cataloging	- inventorying,	data	discovery,	data	exploration,	data	
transformation,	data	structuring,	surfacing	of	sensitive	attributes	and	anomaly	detection.	
These	tools	are	aimed	at	reducing	the	time	and	complexity	of	preparing	data	and	improving	
analyst	productivity.
[Figure: data preparation pipeline. Stages: Pre-process, Prepare, Discover, Exploit. Data states: Raw, Formatted, Technically correct, Ready Data, Patterns. Callouts: Multimedia domain; Missing Multimedia.]
Depending on how you count them, there are anywhere from 20 to 50 providers of self-service data preparation tools. However, they’re not all equal, and users should carefully examine each offering to make sure they’re getting what they expect.
Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.) have jumped onto SSDP, even though their capabilities aren’t separate from their core offerings and show limitations in terms of performance, neutrality, and custom processing.
The key reason why self-service data prep will survive as its own category is the growing realization that data preparation needs to be kept separate from analysis and discovery.
The volumes and the number of data sources will not be decreasing, and neither will the number of BI tools.
To that end, it’s likely that self-service data prep will remain a product category unto itself for the foreseeable future.
Source: Bloor 2016
Where	we	are
BIG	DATA	IN	THE	DATA	DRIVEN	ENTERPRISE
WE ALL ARE AWARE: THE I.T. DIVISION IS GOING TO BUILD PLANETS OF DATA, WHICH ARE WORLDS MADE OF DATABASES, DATA LAKES, DATA WAREHOUSES, STRUCTURES, AND SCHEMAS.
IT SEEMS THAT THESE WORLDS ARE CALLED “BIG DATA”.
BUT, WE’RE AFRAID, TO CREATE THEM THE LORDS ARE TAKING LONGER THAN 7 DAYS.
AND, UNFORTUNATELY, WORSE… IT SEEMS THAT HUMANS HAVE NO ACCESS TO THOSE WORLDS.
Bottom	line:
Is data preparation the bridge between the planets of data and the user?
Big data is not just technology; responsibility should be allocated on the basis of the following critical factors:
1. Will raw data be transferred to the preparation unit (push), or
2. will the preparation unit have to read data from the data lake (pull)?
3. Has the data lake been designed to stage or to store raw data?
4. What about the variability of the context and of the data?
[Figure: responsibility matrix. Axes: data communication mode (PUSH / PULL) and data lake purpose (STORE / STAGE), shown for low and high variability. Each cell assigns responsibility to IT or to the END USER.]
Backgrounds
WHAT	DATA	PREPARATION	SHOULD	COVER
Raw data are cold, analytics are hot
Reality
Understanding Comics, 1993
How to connect analytics and details?
A database is required to contextualize languages and realities
Bottom Line:
Using data should be faster and cost less, with minimal data movement.
• materialize reality and language in a consistent database
• couple language and reality using keyback features
• bind external algorithms using Open (Standard?) User Exits
• foster holistic views of data through Grid Data Unification
blending
Context,	languages	and	facts
CREATING	READY	DATA	USING	FRACTAL	ADC
Data source

name  city   DateBirth
Aldo  Miami  11/1/90
Sara  NYC    12/2/89
Anna  Rome   1.1.68
Sara  NYC    31-1-61

Fractal conversion (data complex, storage group)

Map
rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2

Dictionary
Key   Value  NValue
Name  Aldo   1
Name  Sara   2
Name  Anna   3
City  Miami  1
…     …      …

Luggage (Transform DateBirth)
DateBirth  UDateB   Age
11/1/90    1/11/90  26
12/2/89    2/12/89  26
1.1.68     1/1/68   48
31-1-61    1/31/61  56

Hierarchy (Add Geo classification)
Ncity  city   state
1      Miami  FL
2      NYC    NY
3      Rome   Italy
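The tables above can be reproduced with a small sketch. This is not the ADC algorithm itself, only a plain dictionary encoding and date normalization that yields the same Map, Dictionary, and UDateB values as the slide. The function names are hypothetical, and two things are assumptions inferred from the sample rows: dates are read day first, and two-digit years are taken as 19xx.

```python
from datetime import date
import re

# Rows copied from the "Data source" table.
ROWS = [
    {"name": "Aldo", "city": "Miami", "DateBirth": "11/1/90"},
    {"name": "Sara", "city": "NYC", "DateBirth": "12/2/89"},
    {"name": "Anna", "city": "Rome", "DateBirth": "1.1.68"},
    {"name": "Sara", "city": "NYC", "DateBirth": "31-1-61"},
]


def build_dictionary(rows, columns):
    """Give each distinct value of each column an integer code, in first-seen order."""
    dictionary = {col: {} for col in columns}
    for row in rows:
        for col in columns:
            codes = dictionary[col]
            # Existing values keep their code; new values get the next integer.
            codes.setdefault(row[col], len(codes) + 1)
    return dictionary


def encode_rows(rows, dictionary):
    """Replace values with their codes, producing (rowId, Nname, Ncity) tuples."""
    return [
        (row_id, dictionary["name"][row["name"]], dictionary["city"][row["city"]])
        for row_id, row in enumerate(rows, start=1)
    ]


def parse_birth_date(text):
    """Parse day-month-year text with '/', '.', or '-' separators (assumed 19xx)."""
    day, month, year = (int(part) for part in re.split(r"[/.\-]", text))
    return date(1900 + year, month, day)


def to_us_format(d):
    """Render a date as month/day/two-digit-year, like the UDateB column."""
    return f"{d.month}/{d.day}/{d.year % 100}"


dictionary = build_dictionary(ROWS, ["name", "city"])
encoded = encode_rows(ROWS, dictionary)
unified = [to_us_format(parse_birth_date(row["DateBirth"])) for row in ROWS]
```

Because the repeated "Sara / NYC" row maps to codes it has already seen, the Map table stores only small integers, while the Dictionary holds each distinct value once.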
ADC is a fractal-like algorithm that converts raw input data and the related data processing into a set of chained binary blocks, formulas, and long pointers.
We show that ADC represents an important set of computations… The advantages of ADC are that:
• it is described by a small number of parameters and has a priori known sizes of the views
• the views can be generated independently
• the overhead of combining the generated views is predictable
• the data set can be partitioned into a number of independently generated subsets
• the elements of the data set are pseudo random
These properties make ADC a strong candidate for a data intensive grid benchmark (M. Frumkin, NASA NAS Division)
With the fractal engine, performance is extremely high
Use	case
MATERIAL	TESTING
• Complex JSON, Oracle, CSV, and WMV data
• Manual data processing executed using MATLAB
• Hours of scientist work to detect outliers
• Impossible to replicate tests with the same results
• Little capitalization of know-how
• Data is blended only at narrative writing time
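As an illustration of how the outlier step in the list above could be made repeatable, here is a minimal sketch of a rule-based check. It is not the method used in the case study: the function name, the sample readings, and the choice of the modified z-score (median and median absolute deviation, with the commonly used 0.6745 constant and 3.5 cutoff) are all assumptions made for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that a few extreme readings barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the readings are identical, so the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical test readings; the last one is deliberately far from the rest.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 25.0]
flagged = mad_outliers(readings)
```

The rule is deterministic, so running it again on the same readings flags the same positions, which is the replicability that the manual process lacked.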
Terabyte	level	staging
Rigid	batch	processing
No	history
[Figure: Digital reality, Language, Fractal database]
Bottom	Line:	
Every day we hear from entrepreneurs doing their best to turn their big ideas into a consistent and successful online business. Here IT is the enabler but, unfortunately, sometimes the T part has a negative influence on the development of the core idea.
The ideal tool kit is made for those who wish to exploit the I part of IT, so that entrepreneurs with great ideas can craft their business themselves. And they should!
©2016	datonix	Spa
Thank you

Implementing Data Preparation in Distributed Multimedia System