Genoveva Vargas-Solar
Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France
genoveva.vargas@imag.fr
Moving forward data centric sciences
weaving AI, Big Data & HPC
AICCSA, Jordan, October-November 2018
http://www.vargas-solar.com
DATA REVOLUTION
“Data is everything and everything is data”, Pythian
Turning real-world phenomena into data, thanks to the Big Data trend
Rendering into data aspects of the world that have never been quantified
Any individual can analyse huge amounts of data in short periods of time
- Analytical knowledge: most of the crucial algorithms are accessible
- Using rich data to make evidence-based decisions is open to virtually any person or company
DATIFICATION
DIGITAL HUMANITIES
… UNFINISHED FUGUE
Fuga a 3 Soggetti (Contrapunctus XIV):
- 4-voice triple fugue
- the third subject of which is based on the B-A-C-H motif
« At the point where the composer introduces the name BACH in the
countersubject to this fugue, the composer died. »
What makes Bach sound like Bach?
http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
The Art of Fugue is based on a single subject employed in some variation in each canon and fugue
• Identify the notes performed at specific times in a recording
• Classify the instruments that perform in a recording
• Classify the composer of a recording
• Identify precise onset times of the notes in a recording
• Predict the next note in a recording, conditioned on history
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various
constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
DATA SCIENCE
The representation of complex environments by rich data opens up the possibility of applying all the scientific
knowledge regarding how to infer knowledge from data
Definition:
- Methodology by which actionable insights can be inferred from data
- Complex, multifaceted field that can be approached from several points of view: ethics, methodology,
business models, how to deal with big data, data engineering, data governance, etc.
Objective:
- Production of beliefs informed by data and to be used as the basis of decision making
- N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
Computational Science
Digital humanities
Social Data Science
Network Science
DATA CENTRIC SCIENCES
Data collections serve as the backbone for conducting experiments, driving hypotheses and leading to “valid”
conclusions, models, simulations and understanding
Develop methodologies weaving together data management, greedy algorithms and programming
models that must be tuned for deployment on different target computer architectures
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte
Experimental Sciences
Computation
(Algorithm: mathematical model)
Experiment setting
(Architecture: computing environment)
DATA
Consumed data:
• different sizes
• quality, uncertainty, ambiguity degree
• evolution in structure, completeness, production conditions, conditions
in which data is retrieved
• content, explicit cultural, contextual, background properties
• access policies modification
Conditions of consumption:
• reproducibility, transparency degree (avoid “software artefacts”)
NEITHER MANAGEABLE NOR EXPLOITABLE AS SUCH
RAW DATA
• Heterogeneous (variety)
• Huge (volume)
• Incomplete, imprecise, missing, contradictory (veracity)
• Continuous releases produced at different rates (velocity)
• Proprietary, critical, private (value)
DIGITAL DATA COLLECTIONS
EXPLORING DATA COLLECTIONS
https://web.facebook.com/data/
✓ Helping to select the right tool for preprocessing or analysis
✓ Making use of humans’ abilities to recognize patterns
Not always sure what we are looking for (until we find it)
Query expression [guidance | automatic generation] [3, 2]
• Multi-scale query processing for gradual exploration
• Query morphing to adjust for proximity results
• Queries as answers: query alternatives to cope with lack of provenance
Results filtering, analysis, visualization [2]
• Result-set post processing for conveying meaningful data
Data exploration systems & environments [1]
• Data system kernels tailored for data exploration: no preparation, easy-to-use, fast database cracking
• Auto-tuning database kernels : incremental, adaptive, partial indexing
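The incremental, adaptive, partial indexing mentioned above can be illustrated with a toy version of database cracking. This is a minimal sketch, not the implementation of any cited system: the class name `CrackedColumn` and its methods are hypothetical, and the partitioning copies list slices where a real kernel would swap values in place. Each range query partitions only the pieces it touches, so the column becomes more ordered as a side effect of exploration.

```python
import bisect


class CrackedColumn:
    """Toy cracked column (hypothetical name): queries reorganise the data."""

    def __init__(self, values):
        self.values = list(values)  # private copy, reorganised by queries
        self.pivots = []            # sorted pivot values seen so far
        self.positions = []         # positions[i]: index of first value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece that contains pivot; return the split index."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return every value v with low <= v < high, cracking on both bounds."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]


column = CrackedColumn([13, 4, 55, 9, 27, 1, 42, 18])
first = column.range_query(5, 30)    # cracks the whole column on 5 and 30
second = column.range_query(10, 20)  # only touches the piece between 5 and 30
```

No index is built up front: the cost of organising the data is paid gradually, and only for the value ranges that the exploration actually visits.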
1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
3. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
KEY MOTIVATIONS
QUANTITATIVE ANALYSIS OF DATA
Concepts:
- Population: collection of objects, items (“units”)
- Sample: a part of the observed population
Descriptive statistics: simplify data by presenting quantitative descriptions
- Measures and concepts to describe the quantitative features
- Provide summaries about the samples as an approximation of the population
- Frequency of the notes performed at specific intervals in a recording
- Identify precise onset times of the notes in a recording
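As an illustration of such summaries, the sketch below computes a few descriptive statistics over a small sample of notes using only the Python standard library. The sample values and the `describe` helper are hypothetical; a real study would read onsets and pitches from an annotated recording.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical sample: (onset in seconds, MIDI pitch) for a few notes of one recording.
notes = [(0.0, 62), (0.5, 69), (1.0, 65), (1.5, 62),
         (2.0, 61), (2.5, 62), (3.0, 64), (3.5, 65)]


def describe(sample):
    """Summarise a sample of notes with basic descriptive statistics."""
    onsets = [onset for onset, _ in sample]
    pitches = [pitch for _, pitch in sample]
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return {
        "count": len(sample),
        "pitch_frequency": Counter(pitches),  # how often each pitch occurs
        "mean_pitch": mean(pitches),
        "median_pitch": median(pitches),
        "pitch_std": pstdev(pitches),         # spread of the observed sample
        "mean_inter_onset": mean(gaps),       # average time between note onsets
    }


summary = describe(notes)
```

These figures describe the sample only; treating them as properties of the whole population of recordings is the step that inferential statistics addresses.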
LOOKING BEYOND DATA
Inferential statistics: infer characteristics of the population
- draws conclusions beyond the analysed data
- reaches conclusions about the hypotheses made
- Classify the instruments that perform in a recording
- Predict the next note in a recording, conditioned on history
- Inferring a musical score from a recording
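Predicting the next note conditioned on history can be sketched with a first-order Markov model, chosen here only because it is the simplest stand-in for the much richer models used in practice. The melody and the function names `fit_transitions` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Count how often each note is followed by each other note."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions


def predict_next(transitions, history):
    """Return the most frequent successor of the last note, or None if it was never seen."""
    if not history or history[-1] not in transitions:
        return None
    return transitions[history[-1]].most_common(1)[0][0]


# Hypothetical training melody, written as note names.
melody = ["D", "A", "F", "D", "C#", "D", "E", "F", "G", "F", "E", "D"]
model = fit_transitions(melody)
prediction = predict_next(model, ["E", "F", "G"])
```

The model looks only at the last note of the history; conditioning on longer contexts, durations or several voices would be natural extensions.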
DATA CURATION
Curation tasks: Preserving, Describing, Extracting meta-data, Exploring, Harvesting
Platforms and tools named in the figure:
- ETL
- Parallel data processing platforms: Spark (RDD – Tables/Graphs), Hadoop ecosystem tools (e.g., Pig)
- Parallel data processing platforms: NoSQL & NewSQL (parallel)
- Parallel data querying & analytics
- Structured data provision
- Parallel data collection (Flink, Stream, Flume)
- Spark (descriptive statistics functions)
- Hadoop ecosystem tools (e.g., Hive)
- Parallel RDBMS, Big Data analytics stacks (Asterix, BDAS)
- Parallel analytics (Matlab, R)
Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas-Solar, Parisa Ghodous. CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data (submitted)
ARTIFICIAL INTELLIGENCE
BEYOND KNOWLEDGE
https://ai100.stanford.edu/2016-report
LOOKING BEYOND KNOWLEDGE
Music information retrieval
- Automatic music transcription
- Inferring a musical score from a recording
Generative models fabricating performances under various constraints
- Can we learn to synthesize a performance given a score?
- Can we generate a fugue in the style of Bach using a melody by Brahms?
SETTING UP DATA CENTRIC EXPERIMENTS
https://web.facebook.com/data/
https://azure.microsoft.com/
• Data collections with characteristics difficult to process on a single machine or traditional databases
• A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools imposing the use of parallel processing and distributed storage
DATA COLLECTIONS ALIAS BIG DATA
DATA SCIENCE ECOSYSTEM &
INTEGRATED DEVELOPMENT ENVIRONMENT
The integrated development environment (IDE) is an essential tool designed to
maximize programmer productivity.
- Any IDE has three basic pieces: the editor, the compiler (or interpreter) and the debugger.
- Examples: PyCharm, WingIDE, SPYDER (Scientific Python Development EnviRonment)
Programming language:
- Python is one of the most flexible programming languages because it can be seen as a multiparadigm language
- Alternatives are MATLAB and R
Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
WEB INTEGRATED DEVELOPMENT ENVIRONMENT
DATA SCIENCE VIRTUAL MACHINE
COMPUTING CAPACITY
DATA BEYOND THE COMFORT ZONE
[Figure: data stores arranged by data collections rawness degree (up to curated), with two trends: increased versatility & complexity, increased scalability & speed. Stores: key-value stores, document stores, extensible record stores, graph databases, NewSQL, relational databases. Operations: look up (R/W), querying, aggregation, processing, navigation, analytics.]
ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE
[Figure labels: descriptive statistics, inferential statistics, supervised learning, unsupervised learning; sharded & colocated input data; distributed file system; classification; data transformation; tagged opus execution; multimedia multiform data; indexing classes]
INDEXING & STORING
• the precise time of each note in every recording
• the instrument that plays each note
• the note's position in the metrical structure of the composition
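One way to picture an index over such labels is sketched below: labels are kept sorted by start time per recording, so a query for the notes sounding at a given instant can skip every label that starts later. The record layout `NoteLabel` and the class `LabelIndex` are hypothetical and are not the MusicNet schema.

```python
import bisect
from collections import defaultdict, namedtuple

# Hypothetical label record; field names are illustrative only.
NoteLabel = namedtuple("NoteLabel", "start end instrument pitch beat")


class LabelIndex:
    """Per-recording list of note labels, kept sorted by start time."""

    def __init__(self):
        self._labels = defaultdict(list)  # recording id -> sorted labels

    def add(self, recording_id, label):
        bisect.insort(self._labels[recording_id], label)

    def sounding_at(self, recording_id, t):
        """Return the labels whose [start, end) interval contains time t."""
        labels = self._labels[recording_id]
        # Labels that start after t cannot be sounding, so cut them off first.
        upper = bisect.bisect_right(labels, (t, float("inf")))
        return [label for label in labels[:upper] if label.end > t]


index = LabelIndex()
index.add("rec-1", NoteLabel(1.5, 2.5, "violin", 69, 1.5))
index.add("rec-1", NoteLabel(0.0, 1.0, "violin", 62, 0.0))
index.add("rec-1", NoteLabel(0.5, 2.0, "cello", 50, 0.5))
active = index.sounding_at("rec-1", 0.75)
```

The filter over earlier labels is still a linear scan; an interval tree would be the usual choice once a recording carries many thousands of labels.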
SHARDING DATA ACROSS DIFFERENT STORES
MusicNet: 330 classical music recordings, 1 million annotated labels
http://homes.cs.washington.edu/~thickstn/musicnet.html
Automatic, elastic sharding tools for data collections that parametrize data access and
exploitation by parallel programs intended to scale up on different target architectures
Factors:
- RAM
- Disk
- CPU
- Network
Sharded data architecture
Balanced and smooth fragmentation
(size, location, availability)
Optimum distribution across shards
providing storage spaces (chunks)
Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk
Fragmentation/Sharding & replication:
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & Fault tolerance
- Replication & distribution
Memory/Cache
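The range and hash strategies, together with a simple replica placement, can be sketched as follows. The function names and the placement rule (copies on consecutive shards) are assumptions made for illustration; the tagged strategy is left out because it depends on application-specific tags.

```python
import bisect
import hashlib


def hash_shard(key, n_shards):
    """Hash strategy: a stable digest spreads keys evenly but loses key order."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def range_shard(key, boundaries):
    """Range strategy: boundaries are sorted, exclusive upper bounds of each shard."""
    return bisect.bisect_right(boundaries, key)


def replicas(primary, n_shards, factor):
    """Assumed placement rule: keep `factor` copies on consecutive shards."""
    return [(primary + i) % n_shards for i in range(factor)]


# Three range shards: keys below 10, keys from 10 up to 19, keys from 20 upwards.
boundaries = [10, 20]
shard_for_5 = range_shard(5, boundaries)
shard_for_10 = range_shard(10, boundaries)
copies = replicas(primary=3, n_shards=4, factor=2)
```

Range sharding keeps neighbouring keys together, which suits range scans but can produce hot shards; hash sharding balances load at the price of scattering ranges across shards.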
Raw data collections
1. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
DATA DELIVERY FOR GREEDY PROCESSING
“Multi-view computational problem”
Iterative data processing and visualization tasks need to share CPU cycles
Data is a bottleneck
[Figure labels: application; DRAM; disk/database; CPU (multiple cores); GPU (thousands of cores); transfer rates of 1-5 GBps and 1-10 GBps]
Provide data storage, fetching and delivery strategies
- Architecture: distributed file system across nodes
- Data sharding and replication: on storage and memory
- Fetch to fulfil multi-facet application requirements
- Prefetching
- Memory indexing
- Reduce impedance mismatch
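Prefetching, one of the delivery strategies listed above, can be sketched with a background worker that loads the next chunk while the current one is being processed. The name `prefetching_reader` and the `load_chunk` callback are hypothetical; a real loader would read from a distributed file system or a database.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching_reader(load_chunk, n_chunks):
    """Yield chunks in order while the next one loads in the background."""
    if n_chunks <= 0:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, 0)
        for i in range(n_chunks):
            chunk = pending.result()  # wait for the chunk requested earlier
            if i + 1 < n_chunks:
                pending = pool.submit(load_chunk, i + 1)  # start the next load
            yield chunk  # the consumer works on this chunk meanwhile


def load_chunk(i):
    """Stand-in loader (assumption): returns a small in-memory chunk."""
    return [i, i * i]


chunks = list(prefetching_reader(load_chunk, 3))
```

The benefit depends on loading and processing taking comparable time; when one dominates, the overlap hides little of the total cost.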
• Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
  - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
  - the pace (computing speed) at which data must be consumed (harvested, prepared and processed).
• Build underlying value-added data managers that can:
  - exploit available resources, making a compromise between QoS properties and SLA requirements across all the levels of the stack
  - deliver request results at a reasonable price and in a reliable, efficient manner, despite the devices, the resources’ availability and the data properties
OPPORTUNITIES
FINAL COMMENTS
Move from design based on intuition & experience to a more formal & systematic way
to design systems
Addressing data centric sciences problems is a matter of designing complex systems according
to a multidisciplinary vision
Let’s weave a golden trilogy
Big Data, AI & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC

  • 1. Genoveva Vargas-Solar, Senior Scientist, French Council of Scientific Research, LIG-LAFMIA, France (genoveva.vargas@imag.fr). Moving forward data centric sciences: weaving AI, Big Data & HPC. AICSSA, Jordan, October-November 2018. http://www.vargas-solar.com
  • 2. DATA REVOLUTION
  • 3. “Data is everything and everything is data” (Pythian). Turning real-world phenomena into data thanks to the Big Data trend.
  • 4. DATIFICATION: rendering into data aspects of the world that have never been quantified. Any individual can analyse huge amounts of data in short periods of time.
      - Analytical knowledge: most of the crucial algorithms are accessible
      - Using rich data to make evidence-based decisions is open to virtually any person or company
  • 6. … UNFINISHED FUGUE. Fuga a 3 Soggetti (Contrapunctus XIV):
      - 4-voice triple fugue
      - the third subject of which is based on the B A C H motif
      « At the point where the composer introduces the name BACH in the countersubject to this fugue, the composer died. »
  • 7. What makes Bach sound like Bach? http://www.washington.edu/news/2016/11/30/what-makes-bach-sound-like-bach-new-dataset-teaches-algorithms-classical-music/
      The Art of Fugue is based on a single subject employed in some variation in each canon and fugue.
      - Identify the notes performed at specific times in a recording
      - Classify the instruments that perform in a recording
      - Classify the composer of a recording
      - Identify precise onset times of the notes in a recording
      - Predict the next note in a recording, conditioned on history
      Music information retrieval:
      - Automatic music transcription
      - Inferring a musical score from a recording
      Generative models fabricating performances under various constraints:
      - Can we learn to synthesize a performance given a score?
      - Can we generate a fugue in the style of Bach using a melody by Brahms?
  • 9. DATA SCIENCE: The representation of complex environments by rich data opens up the possibility of applying all the scientific knowledge regarding how to infer knowledge from data.
      Definition:
      - Methodology by which actionable insights can be inferred from data
      - Complex, multifaceted field that can be approached from several points of view: ethics, methodology, business models, how to deal with big data, data engineering, data governance, etc.
      Objective:
      - Production of beliefs informed by data, to be used as the basis of decision making
      - N.B. In the absence of data, beliefs are uninformed and decisions are based on best practices or intuition
  • 10. DATA CENTRIC SCIENCES: Computational Science, Digital Humanities, Social Data Science, Network Science.
      - Data collections serve as the backbone for conducting experiments, driving hypotheses and leading to “valid” conclusions, models, simulations and understanding
      - Develop methodologies weaving data management, greedy algorithms and programming models that must be tuned to be deployed on different target computer architectures
  • 11. Experimental Sciences [diagram]: Computational Science, Digital Humanities, Social Data Science, Network Science. Scale labels: 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte.
  • 12. [diagram]: Computational Science, Digital Humanities, Social Data Science, Network Science. Scale labels: 1000 Yottabytes = 1 Brontobyte; 1000 Brontobytes = 1 Geopbyte. Computation (Algorithm: mathematical model). Experiment setting (Architecture: computing environment).
  • 14. DIGITAL DATA COLLECTIONS
      RAW DATA (neither manageable nor exploitable as such):
      - Heterogeneous (variety)
      - Huge (volume)
      - Incomplete, imprecise, missing, contradictory (veracity)
      - Continuous releases produced at different rates (velocity)
      - Proprietary, critical, private (value)
      Consumed data:
      - different sizes
      - quality, uncertainty, ambiguity degree
      - evolution in structure, completeness, production conditions, conditions in which data is retrieved
      - content, explicit cultural, contextual, background properties
      - access policies modification
      Conditions of consumption:
      - reproducibility, transparency degree (avoid “software artefacts”)
  • 15. DIGITAL DATA COLLECTIONS
      Consumed data:
      - different sizes
      - quality, uncertainty, ambiguity degree
      - evolution in structure, completeness, production conditions, conditions in which data is retrieved
      - content, explicit cultural, contextual, background properties
      - access policies modification
      Conditions of consumption:
      - reproducibility, transparency degree (avoid “software artefacts”)
  • 18. EXPLORING DATA COLLECTIONS
      KEY MOTIVATIONS:
      ✓ Helping to select the right tool for preprocessing or analysis
      ✓ Making use of humans’ abilities to recognize patterns
      Not always sure what we are looking for (until we find it)
      Query expression [guidance | automatic generation] [3, 2]:
      - Multi-scale query processing for gradual exploration
      - Query morphing to adjust for proximity results
      - Queries as answers: query alternatives to cope with lack of providence
      Results filtering, analysis, visualization [2]:
      - Result-set post-processing for conveying meaningful data
      Data exploration systems & environments [1]:
      - Data systems kernels are tailored for data exploration: no preparation, easy-to-use, fast; database cracking
      - Auto-tuning database kernels: incremental, adaptive, partial indexing
      References:
      1. Xi, S. L., Babarinsa, O., Wasay, A., Wei, X., Dayan, N., & Idreos, S. (2017, May). Data canopy: Accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data (pp. 557-572). ACM.
      2. Athanassoulis, M., & Idreos, S. (2015, May). Beyond the wall: Near-data processing for databases. In Proceedings of the 11th International Workshop on Data Management on New Hardware (p. 2). ACM.
      3. Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
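The "database cracking" and "incremental, adaptive, partial indexing" keywords on this slide can be illustrated with a toy sketch. This is only a simplified reading of the idea (the column is reorganized a little more each time a range query arrives, so no index has to be prepared up front); the class name `CrackedColumn` and its methods are hypothetical and are not taken from the talk or from any database system.

```python
from bisect import bisect_left, insort


class CrackedColumn:
    """Toy cracked column: each range query partially reorders the data."""

    def __init__(self, values):
        self.col = list(values)  # working copy of the column
        self.pivots = []         # sorted pivot values seen so far
        self.pos = {}            # pivot -> p, with col[:p] < pivot <= col[p:]

    def _crack(self, pivot):
        """Partition the single piece that contains `pivot`; return its split position."""
        if pivot in self.pos:
            return self.pos[pivot]
        i = bisect_left(self.pivots, pivot)
        # Boundaries of the piece delimited by the neighbouring pivots.
        lo = self.pos[self.pivots[i - 1]] if i > 0 else 0
        hi = self.pos[self.pivots[i]] if i < len(self.pivots) else len(self.col)
        piece = self.col[lo:hi]
        smaller = [v for v in piece if v < pivot]
        self.col[lo:hi] = smaller + [v for v in piece if v >= pivot]
        insort(self.pivots, pivot)
        self.pos[pivot] = lo + len(smaller)
        return self.pos[pivot]

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking as a side effect."""
        start = self._crack(low)
        end = self._crack(high)
        return self.col[start:end]


column = CrackedColumn([13, 4, 55, 9, 21, 1, 34, 8])
first = column.range_query(5, 22)   # cracks the column around 5 and 22
second = column.range_query(9, 35)  # only touches the pieces that contain 9 and 35
```

Real systems partition in place and keep the piece boundaries in a dedicated index structure; the list copies here are only for readability.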
  • 19. QUANTITATIVE ANALYSIS OF DATA
      Concepts:
      - Population: collection of objects, items (“units”)
      - Sample: the part of the population that is observed
      Descriptive statistics: simplify data by presenting quantitative descriptions
      - Measures and concepts to describe the quantitative features
      - Provide summaries about the samples as an approximation of the population
      - Frequency of the notes performed at specific intervals in a recording
      - Identify precise onset times of the notes in a recording
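As a minimal sketch of the descriptive statistics named on this slide, the snippet below summarizes note frequency and onset spacing for one sample. The `(onset, note)` pairs and the `describe` helper are made up for illustration; they do not come from the talk or from a real recording.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical sample: (onset time in seconds, note name) pairs from one recording.
sample = [
    (0.0, "D"), (0.5, "A"), (1.0, "F"), (1.5, "D"),
    (2.0, "C#"), (2.5, "D"), (3.0, "E"), (3.5, "F"),
]


def describe(notes):
    """Summarize a sample of annotated notes with a few descriptive measures."""
    onsets = [t for t, _ in notes]
    # Time elapsed between consecutive note onsets.
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    frequency = Counter(name for _, name in notes)
    return {
        "n": len(notes),
        "note_frequency": dict(frequency),
        "most_common": frequency.most_common(1)[0],
        "mean_gap": mean(gaps),
        "median_gap": median(gaps),
        "gap_stdev": pstdev(gaps),
    }


summary = describe(sample)
```

The summary describes the sample only; treating it as an approximation of the whole population (all recordings) is the step the next slide calls inferential.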
  • 20. LOOKING BEYOND DATA
      Inferential statistics: infer characteristics of the population
      - draws conclusions beyond the analysed data
      - reaches conclusions regarding the hypotheses made
      - Classify the instruments that perform in a recording
      - Predict the next note in a recording, conditioned on history
      - Inferring a musical score from a recording
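One deliberately simple way to read "predict the next note, conditioned on history" is a first-order (bigram) model that only looks at the previous note. This is an assumption chosen for brevity, not the method used in the talk or in the cited dataset; the function names and the toy melody are hypothetical.

```python
from collections import Counter, defaultdict


def fit_bigram(notes):
    """Count how often each note follows each other note."""
    counts = defaultdict(Counter)
    for previous, following in zip(notes, notes[1:]):
        counts[previous][following] += 1
    return counts


def predict_next(counts, history):
    """Return the most frequent successor of the last note, or None if unseen."""
    if not history or history[-1] not in counts:
        return None
    return counts[history[-1]].most_common(1)[0][0]


# Toy training data: the B-A-C-H motif (German note names) repeated.
melody = ["B", "A", "C", "H", "B", "A", "C", "H", "B", "A"]
model = fit_bigram(melody)
guess = predict_next(model, ["C", "H", "B"])
```

A longer history, or a learned model over audio features, would replace the lookup table; the interface (history in, next note out) stays the same.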
  • 21. DATA CURATION [diagram]
      Phases: Preserving, Describing, Extracting meta-data, Exploring, Harvesting
      Supporting tools:
      - ETL
      - Parallel data processing platforms: Spark (RDD, tables/graphs), Hadoop ecosystem tools (e.g., Pig)
      - Parallel data processing platforms: NoSQL & NewSQL (parallel)
      - Parallel data querying & analytics
      - Structured data provision
      - Parallel data collection (Flink, Stream, Flume)
      - Spark (descriptive statistics functions), Hadoop ecosystem tools (e.g., Hive)
      - Parallel RDBMS, Big Data analytics stacks (Asterix, BDAS)
      - Parallel analytics (Matlab, R)
      Source: Gavin Kemp, Catarina Ferreira Da Silva, Genoveva Vargas Solar, Parisa Ghodous. CURARE: Maintaining and Managing Data Collections Using Views. IEEE Transactions on Big Data (submitted).
  • 24. LOOKING BEYOND KNOWLEDGE
      Music information retrieval:
      - Automatic music transcription
      - Inferring a musical score from a recording
      Generative models fabricating performances under various constraints:
      - Can we learn to synthesize a performance given a score?
      - Can we generate a fugue in the style of Bach using a melody by Brahms?
  • 25. SETTING UP DATA CENTRIC EXPERIMENTS
  • 27. DATA COLLECTIONS ALIAS BIG DATA
      - Data collections with characteristics that make them difficult to process on a single machine or in traditional databases
      - A new generation of tools, methods and technologies to collect, process and analyse massive data collections
        → Tools imposing the use of parallel processing and distributed storage
  • 28. DATA SCIENCE ECOSYSTEM & INTEGRATED DEVELOPMENT ENVIRONMENT
      The integrated development environment (IDE) is an essential tool designed to maximize programmer productivity.
      - The basic pieces of any IDE are three: the editor, the compiler (or interpreter) and the debugger
      - Examples: PyCharm, WingIDE, SPYDER (Scientific PYthon Development EnviRonment)
      Programming language:
      - Python is one of the most flexible programming languages because it can be seen as a multiparadigm language
      - Alternatives are MATLAB and R
      Fundamental libraries for data scientists in Python: NumPy, SciPy, Scikit-Learn, and Pandas
  • 32. DATA BEYOND THE COMFORT ZONE
  • 34. ELASTIC DATA PROCESSING & MANAGEMENT AT SCALE [diagram]
      Axis labels: Curated; Data collections rawness degree; Increased versatility & complexity; Increased scalability & speed
      Stores: Key-Value stores, Document stores, NewSQL, Relational databases, Graph databases, Extensible record stores
      Operations: Querying, Look up (R/W), Analytics, Aggregation, Processing, Navigation
      Analysis: Descriptive Statistics, Inferential Statistics, Supervised Learning, Unsupervised Learning
  • 35. INDEXING & STORING [diagram]
      Labels: Sharded & colocated; Input data; Distributed File System; Classification; Data transformation; Tagged opus execution; Multimedia multiform data; Indexing classes
      - the precise time of each note in every recording
      - the instrument that plays each note
      - the note's position in the metrical structure of the composition
  • 36. SHARDING DATA ACROSS DIFFERENT STORES [diagram]
      Labels: Sharded & colocated; Input data; Distributed File System; Multimedia multiform data
      MusicNet: 330 classical music recordings, 1 million annotated labels indicating … (http://homes.cs.washington.edu/~thickstn/musicnet.html)
      Automatic and elastic sharding tools for data collections, to parametrize data access & exploitation by parallel programs intended to scale up on different target architectures
  • 37. SHARDING ACROSS DIFFERENT STORES [diagram]
      Labels: Sharded & colocated; Input data; Distributed File System; Sharded data architecture
      Factors:
      - RAM
      - Disk
      - CPU
      - Network
      Balanced and smooth fragmentation (size, location, availability)
      Optimum distribution across shards providing storage spaces (chunks)
  • 38. SHARDING DATA ACROSS DIFFERENT STORES (raw data collections, memory/cache)
      Persistence:
      - Which part of the document must persist?
      - Explicit vs. implicit persistence
      - In memory / hard disk
      Fragmentation/Sharding & replication:
      - Vertical or horizontal fragmentation
      - Strategies: range, hash, tagged
      - Distribution & location
      Availability & fault tolerance:
      - Replication & distribution
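The three horizontal sharding strategies listed on this slide (range, hash, tagged) can be sketched as three small routing functions. This is an illustrative reading under stated assumptions: the function names, the record layout with a `"tag"` field, and the example boundaries are all hypothetical and not taken from the talk.

```python
import hashlib
from bisect import bisect_right


def range_shard(value, boundaries):
    """Range strategy: sorted boundaries split the key space into len(boundaries) + 1 shards."""
    return bisect_right(boundaries, value)


def hash_shard(key, n_shards):
    """Hash strategy: a stable digest spreads keys evenly, regardless of their order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def tagged_shard(record, tag_to_shard, default=0):
    """Tagged strategy: an explicit tag on the record decides where it is stored."""
    return tag_to_shard.get(record.get("tag"), default)


# Hypothetical usage: route recordings by composition year, by identifier, or by instrument tag.
by_year = range_shard(1750, [1700, 1800])
by_id = hash_shard("recording-2303", 4)
by_tag = tagged_shard({"id": "recording-2303", "tag": "violin"}, {"piano": 0, "violin": 1})
```

Range sharding keeps neighbouring keys together, which helps range scans but can unbalance shards; hashing balances load but scatters neighbours; tagging leaves the placement decision to whoever annotates the data.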
  • 39. Source: Idreos, S., Dayan, M. A. N., Guo, D., Kester, M. S., Maas, L., & Zoumpatianos, K. Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity.
  • 40. DATA DELIVERY FOR GREEDY PROCESSING
      “Multi-view computational problem”: iterative data processing and visualization tasks need to share CPU cycles
      Data is a bottleneck [diagram]: APPLICATION; DRAM; DISK/DATABASE; CPU (multiple cores); GPU (thousands of cores); link labels 1-5 GBps and 1-10 GBps
      Provide data storage, fetching and delivery strategies:
      - Architecture: distributed file system across nodes
      - Data sharding and replication: on storage and memory
      - Fetch to fulfil multi-facet application requirements
      - Prefetching
      - Memory indexing
      - Reduce impedance mismatch
  • 41. OPPORTUNITIES
      - Manage data collections with different uses and access patterns, because these properties tend to reach the limits of:
        - the storage capacity (main memory, cache and disks) required for archiving data collections permanently or during a certain period, and
        - the pace (computing speed) at which data must be consumed (harvested, prepared and processed)
      - Build underlying value-added data managers that can:
        - exploit available resources, making a compromise between QoS properties and SLA requirements, considering all the levels of the stack
        - deliver request results at a reasonable economic price and in a reliable and efficient manner, despite the devices, the resources’ availability and the data properties
  • 42. FINAL COMMENTS
  • 43. Move from design based on intuition & experience to a more formal & systematic way to design systems. Addressing data centric sciences problems is a matter of designing complex systems according to a multidisciplinary vision.
  • 44. Let’s weave a golden trilogy: Big Data, AI & HPC