Implementing Data Preparation in Distributed Multimedia System

The keynote presented at DMS 2016 explored the following critical issues:

how to successfully extract, prepare, and manage consistent data and multimedia content from distributed multimedia systems;
how next-generation data discovery techniques will be implemented on data and multimedia content;
how modern data science is evolving to deliver more agile, high-value analytics on multimedia information to all users.


1. IS DATA PREPARATION THE NEXT BIG DATA DISRUPTION? The 22nd International Conference on Distributed Multimedia Systems (DMS 2016), Grand Hotel Salerno, Salerno, Italy, November 25-26, 2016
2. Agenda (source: Forrester 2016):
   • SCENARIO
   • BIG DATA IN THE DATA DRIVEN ENTERPRISE
   • WHAT DATA PREPARATION SHOULD COVER
   • CREATING READY DATA USING FRACTALS
   • CASE STUDY
3. 1. DOES THE BUSINESS ANALYST UNDERSTAND THE DATA SCIENTIST?
   2. WHY ARE DATA-DRIVEN COMPANIES HIRING DATA JOURNALISTS?
   3. WHY DOES DARK DATA EXTERNAL TO DATA LAKES CONTINUE TO GROW?
   4. WHY DOES MAKING DATA READY TAKE SO LONG?
   5. DATA PLAY AND NARRATIVES: HOW MUCH TIME IS AVAILABLE TO EXPLOIT DATA PROCESSING OUTPUT?
   (77% data processing vs. 23% data analysis; source: Bloor 2016)
4. 90% IS DARK; 12% AVAILABLE FOR BUSINESS INSIGHTS; 88% IS JUST STORED; 80% RECORDINGS, PDFs, AND TEXTS; +4300% ANNUAL DATA GENERATION (source: IDC 2016)
5. Data preparation is an iterative process for exploring and transforming raw data into forms suitable for data science, data discovery, and analytics. Self-service data preparation (SSDP) tools are user-oriented tools that enable data preparation capabilities such as data cataloging and inventorying, data discovery, data exploration, data transformation, data structuring, surfacing of sensitive attributes, and anomaly detection. These tools are aimed at reducing the time and complexity of preparing data and improving analyst productivity.
   (Diagram: stages Pre-process → Prepare → Discover → Exploit take data from Raw → Technically correct → Ready → Patterns → Formatted in the multimedia domain, flagging missing multimedia along the way.)
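As a concrete illustration of these stages, here is a minimal sketch in Python of a raw → technically correct → ready flow with basic anomaly surfacing. All function names, column names, and sample data are hypothetical; the slides do not prescribe any particular API.

    # Minimal sketch of the pre-process -> prepare stages described above.
    # Every name here is illustrative, not part of any actual SSDP product.
    import csv
    from datetime import datetime
    from io import StringIO

    RAW = "name,city,date_birth\nAldo,Miami,11/1/90\nSara,NYC,\n"

    def pre_process(raw_text):
        """Raw -> technically correct: parse bytes into typed records."""
        return list(csv.DictReader(StringIO(raw_text)))

    def prepare(records):
        """Technically correct -> ready: fix types, surface anomalies."""
        ready, anomalies = [], []
        for rec in records:
            if not rec["date_birth"]:
                anomalies.append(rec)  # anomaly detection: missing value
                continue
            rec["date_birth"] = datetime.strptime(rec["date_birth"], "%m/%d/%y")
            ready.append(rec)
        return ready, anomalies

    ready, anomalies = prepare(pre_process(RAW))
    print(len(ready), "ready records;", len(anomalies), "flagged for review")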
6. Depending on how you count them, there are anywhere from 20 to 50 providers of self-service data preparation tools. However, they're not all equal, and users should carefully examine each offering to verify they're getting what they expect. Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.) have jumped onto SSDP, even though their capabilities aren't separate from their core offerings and show limitations in terms of performance, neutrality, and custom processing. The key reason why self-service data prep will survive as its own category is the growing realization that data preparation needs to be kept separate from analysis and discovery. The volumes and the number of data sources will not be decreasing, and neither will the number of BI tools. To that end, it's likely that self-service data prep will remain a product category unto itself for the foreseeable future. (Source: Bloor 2016)
7. Where we are: BIG DATA IN THE DATA DRIVEN ENTERPRISE
8. WE ARE ALL AWARE THE I.T. DIVISION IS GOING TO BUILD PLANETS OF DATA
9. WHICH ARE WORLDS MADE OF DATA BASEs, DATA LAKEs, DATA WAREHOUSEs, STRUCTUREs, AND SCHEMAs
10. IT SEEMS THAT THESE WORLDS ARE CALLED “BIGDATA”
11. BUT WE’RE AFRAID TO CREATE THEM: LORDS ARE TAKING LONGER THAN 7 DAYS. AND, UNFORTUNATELY, WORSE: IT SEEMS THAT HUMANS DON’T HAVE ACCESS TO THOSE WORLDS
12. Bottom line: Is data preparation the bridge between planets of data and the user? Big data is not just technology; responsibility should be allocated on the basis of the following critical factors:
   1. Will raw data be transferred to the preparation unit (push), or
   2. will the preparation unit have to read data from the data lake (pull)?
   3. Has the data lake been designed to stage or to store raw data?
   4. What about the variability of the context and the data?
   (Diagram: a 2×2 matrix crossing data communication mode, push vs. pull, with data lake purpose, stage vs. store; its cells assign responsibility to IT or the end user depending on low vs. high variability.)
13. Backgrounds: WHAT DATA PREPARATION SHOULD COVER
14. raw data are cold, analytics are hot
15. Reality (Understanding Comics, 1993): How to connect analytics and details?
16. A database is required to contextualize languages and realities
17. Bottom line: Using data should be faster and cost less, with minimal data movement requirements:
   • materialize reality and language in a consistent database
   • couple language and reality using keyback features
   • bind external algorithms using open (standard?) user exits (see the sketch below)
   • foster holistic views of data through Grid Data Unification
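To make the user-exit bullet concrete, here is a minimal sketch assuming a simple callback-registry design. The hook names and the registry are assumptions for illustration, not datonix's actual interface.

    # Illustrative user-exit mechanism (assumed design, not datonix's API):
    # the engine exposes named hook points and calls whatever external
    # algorithm was registered there, passing the data through otherwise.
    _user_exits = {}

    def register_exit(hook_name, fn):
        _user_exits[hook_name] = fn

    def run_exit(hook_name, payload):
        fn = _user_exits.get(hook_name)
        return fn(payload) if fn else payload  # no exit bound: pass through

    # An external algorithm plugged in without modifying the engine itself.
    def uppercase_names(record):
        record["name"] = record["name"].upper()
        return record

    register_exit("after_parse", uppercase_names)
    print(run_exit("after_parse", {"name": "Aldo", "city": "Miami"}))
    # -> {'name': 'ALDO', 'city': 'Miami'}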
18. Blending context, languages, and facts: CREATING READY DATA USING FRACTAL ADC
19. Example of fractal conversion:

   Data source:
   name  city   DateBirth
   Aldo  Miami  11/1/90
   Sara  NYC    12/2/89
   Anna  Rome   1.1.68
   Sara  NYC    31-1-61

   Map:
   rowId  Nname  Ncity
   1      1      1
   2      2      2
   3      3      3
   4      2      2

   Dictionary:
   Key   Value  NValue
   Name  Aldo   1
   Name  Sara   2
   Name  Anna   3
   City  Miami  1
   …     …      …

   Transform DateBirth:
   DateBirth  UDateB   Age
   11/1/90    1/11/90  26
   12/2/89    2/12/89  26
   1.1.68     1/1/68   48
   31-1-61    1/31/61  56

   Hierarchy (add geo classification):
   Ncity  city   state
   1      Miami  FL
   2      NYC    NY
   3      Rome   Italy

   ADC is a fractal-like algorithm that converts input raw data and the related data processing into a set of chained binary blocks, formulas, and long pointers. “We show that ADC represents an important set of computations… The advantages of ADC are that: it is described by a small number of parameters and has a priori known sizes of the views; the views can be generated independently; the overhead of combining the generated views is predictable; the data set can be partitioned into a number of independently generated subsets; the elements of the data set are pseudo-random. These properties make ADC a strong candidate for a data-intensive grid benchmark.” (M. Frumkin, NASA NAS Division)
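The Map and Dictionary tables above are, at their core, per-column dictionary encoding. Here is a minimal sketch of that conversion step, assuming nothing about the real fractal engine beyond what the slide's tables show; ADC's chained binary blocks, formulas, and long pointers are not modeled.

    # Dictionary-encode each column of the slide's sample table: every
    # distinct value gets a small integer code (the Dictionary), and the
    # original rows become rows of integers (the Map).
    rows = [
        {"name": "Aldo", "city": "Miami"},
        {"name": "Sara", "city": "NYC"},
        {"name": "Anna", "city": "Rome"},
        {"name": "Sara", "city": "NYC"},
    ]

    dictionaries = {}  # column -> {value: code}, e.g. {"name": {"Aldo": 1}}
    encoded_map = []   # one all-integer row per input row

    for row in rows:
        encoded = {}
        for column, value in row.items():
            codes = dictionaries.setdefault(column, {})
            encoded[column] = codes.setdefault(value, len(codes) + 1)
        encoded_map.append(encoded)

    print(encoded_map)   # [{'name': 1, 'city': 1}, {'name': 2, 'city': 2},
                         #  {'name': 3, 'city': 3}, {'name': 2, 'city': 2}]
    print(dictionaries)  # {'name': {'Aldo': 1, 'Sara': 2, 'Anna': 3},
                         #  'city': {'Miami': 1, 'NYC': 2, 'Rome': 3}}

Note how the repeated row (Sara, NYC) maps to the same integer pair, which is what makes the encoded views compact and independently generable.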
20. Using the fractal engine, performance is extreme
21. Use case
22. MATERIAL TESTING
   • Complex JSON, Oracle, CSV, and WMV data
   • Manual data processing executed in MATLAB
   • Hours of scientist work to detect outliers
   • Impossible to replicate tests with the same results
   • Little capitalization of know-how
   • Data blending happens at narrative-writing time
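As a hedged illustration of automating the manual outlier step, here is a sketch using a plain z-score rule on made-up measurement values; the slide does not describe the project's actual method or data, so everything below is an assumption.

    # Reproducible outlier detection: rerunning this on the same input
    # always yields the same outliers, unlike an ad-hoc manual analysis.
    # The z-score rule and the threshold are illustrative choices only.
    import statistics

    def find_outliers(values, threshold=2.0):
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            return []
        return [v for v in values if abs(v - mean) / stdev > threshold]

    measurements = [9.8, 10.1, 9.9, 10.0, 10.2, 42.0]  # made-up sample
    print(find_outliers(measurements))  # -> [42.0]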
23. (Diagram: terabyte-level staging, rigid batch processing, and no history contrasted with a fractal database coupling digital reality and language.)
24. Bottom line: Every day we hear from entrepreneurs doing their best to turn their big ideas into a consistent and successful online business. Here IT is the enabler but, unfortunately, sometimes the T part has a negative influence on the development of the core idea. The ideal toolkit is made for those who wish to exploit the I part of IT, so that entrepreneurs with great ideas can craft their business themselves. And they should!
25. ©2016 datonix SpA. Thank you
