Big Data: Beyond the hype, Delivering value explains Big Data technology and how it is transforming industry and society to members of the IDEAL-IST project.
IDEAL-IST is an international ICT (Information and Communication Technologies) network, with more than 65 ICT national partners from EU and non-EU countries. It assists ICT companies and research organizations worldwide that wish to find project partners to participate in the Horizon 2020 program of the European Commission.
4. Overview
• Part I: What is “Big Data”?
• Part II: Data Driven Innovation: Big Data is Transforming Sectors by Breaking Silos and Driving Ecosystems
• Part III: The Data Value Chain: Tools and Techniques
• Part IV: The Next Wave of Big Data Research and Innovation
• Part V: Data Science and Skills
Agenda
• Understanding of what Big Data is and how it is used
• High-level overview of key technologies
  - No formulas or complex examples
  - Lots of keywords (Sorry!)
• Feel for the key trends and issues
Learning Objectives
8. 09/02/16 8www.bdva.eu
The “V’s” of Big Data
• Volume (Data at Rest): Terabytes to exabytes of existing data to process
• Velocity (Data in Motion): Streaming data, requiring milliseconds to respond
• Variety (Data in Many Forms): Structured, unstructured, text, multimedia, …
• Veracity (Data in Doubt): Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception
• Value (Data into Money): Business models can be associated to the data
Adapted from a post by Michael Walker on 28 November 2012
18.
[Diagram: concentric layers around the Data Value Chain]
Layers: Core Value Chain, Extended Value Chain, Big Data Ecosystem
Actors shown:
• Direct Data Suppliers
• Direct Data End-Users
• Data Value Distribution Channels
• Suppliers of my Data Suppliers
• End-Users of my End-Users
• Suppliers of Complementary Data Products and Services
• Technology Providers
• Co-opetitors (Competitors and cooperation)
Other Stakeholders and Peripheral Actors:
• Government Organisations
• Regulators
• Investors, Venture Capitalists & Incubators
• Industry Associations
• Data Marketplace
• Standardisation Bodies
• Start-ups and Entrepreneurs
• Researchers & Academics
Stakeholders in a Big Data Value Ecosystem
19.
Legal
Social
Economic
Technology
Application
Data &
Skills
Big Data Value Ecosystem
Ownership
Copyright
Liability
Insolvency
Privacy
User Behaviour
Societal Impact
Collaboration
Business Models
Benchmarking
Open Source
Deployment Models
Information Pricing
Data-Driven Decision Making
Risk Management
Competitive Intelligence
Digital Humanities
Internet of Things
Verticals
Industry 4.0
Scalable Data Processing
Real-Time
Statistics/ML
Linguistics
HCI/Visualisation
The Dimensions of a Big Data Value Ecosystem
[adapted from Cavanillas et al. (2014)]
27. BIG
Big Data Public Private Forum
27
DATA POOLS IN HEALTHCARE
MAIN IMPACT BY INTEGRATING VARIOUS AND
HETEROGENEOUS DATA SOURCES
Clinical Data
§ Owned by providers (such as
hospitals, care centers, physicians,
etc.)
§ Encompass any information stored
within the classical hospital
information systems or EHR, such as
medical records, medical images, lab
results, genetic data, etc.
Claims, Cost &
Administrative Data
§ Owned by providers and payors
§ Encompass any data sets relevant for
reimbursement issues, such as
utilization of care, cost estimates,
claims, etc.
Pharmaceutical &
R&D Data
§ Owned by the pharmaceutical
companies, research labs/
academia, government
§ Encompass clinical trials,
clinical studies, population and
disease data, etc.
Patient Behaviour &
Sentiment Data
§ Owned by consumers
or monitoring device
producer
§ Encompass any
information related to
the patient behaviours
and preferences
Health data on the
web
§ Mainly open source
§ Examples are
websites such as
PatientsLikeMe,
Linked Open Data,
etc.
Highest Impact
on integrated data sets
28. Big Data is Impacting All Sectors
Economy, Energy, Environment, Education, Health & Wellbeing, Tourism, Mobility, Governance
35. 35 BIG 318062
THE DATA VALUE CHAIN
Data Acquisition
• Structured data
• Unstructured data
• Event processing
• Sensor networks
• Protocols
• Real-time
• Data streams
• Multimodality
Data Analysis
• Stream mining
• Semantic analysis
• Machine learning
• Information extraction
• Linked Data
• Data discovery
• ‘Whole world’ semantics
• Ecosystems
• Community data analysis
• Cross-sectorial data analysis
Data Curation
• Data Quality
• Trust / Provenance
• Annotation
• Data validation
• Human-Data Interaction
• Top-down/Bottom-up
• Community / Crowd
• Human Computation
• Curation at scale
• Incentivisation
• Automation
• Interoperability
Data Storage
• In-Memory DBs
• NoSQL DBs
• NewSQL DBs
• Cloud storage
• Query Interfaces
• Scalability and Performance
• Data Models
• Consistency, Availability, Partition-tolerance
• Security and Privacy
• Standardization
Data Usage
• Decision support
• Predictions
• In-use analytics
• Simulation
• Exploration
• Modeling
• Control
• Domain-specific usage
Big Data Value Chain
36.
DATA ACQUISITION OVERVIEW
▶ Process of gathering, filtering and cleaning data before the
data is put in a data warehouse or any other storage solution
on which data analysis can be carried out
Definition
▶ Mainly driven by 4 of 9 Vs
• Volume
• Velocity
• Variety
• Value
Scope
▶ Most data acquisition scenarios
assume high-volume, high-
velocity, high-variety but low-
value data
Key Technology
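The definition above (gathering, filtering and cleaning data before it reaches storage) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration only: the record layout, the `clean` and `acquire` function names, and the valid range are hypothetical and not part of the original slides.

```python
from typing import Iterable, Iterator, Optional


def clean(record: dict) -> Optional[dict]:
    """Normalise one raw reading; return None if it cannot be used."""
    sensor = str(record.get("sensor", "")).strip().lower()
    try:
        value = float(record.get("value"))
    except (TypeError, ValueError):
        return None  # measurement is missing or not a number
    if not sensor:
        return None  # reading cannot be attributed to a source
    return {"sensor": sensor, "value": value}


def acquire(raw: Iterable[dict], low: float, high: float) -> Iterator[dict]:
    """Gather, clean and filter readings before they reach storage."""
    for record in raw:
        cleaned = clean(record)
        # Filter step: keep only values inside the (assumed) valid range.
        if cleaned is not None and low <= cleaned["value"] <= high:
            yield cleaned


# Hypothetical raw input, e.g. from a city noise sensor network.
raw_readings = [
    {"sensor": " Noise-01 ", "value": "61.5"},
    {"sensor": "noise-02", "value": "n/a"},   # dropped: not a number
    {"sensor": "", "value": "55"},            # dropped: no source
    {"sensor": "noise-03", "value": "999"},   # dropped: out of range
]
store = list(acquire(raw_readings, low=0.0, high=150.0))
```

Writing the steps as generators means records are processed one at a time, which suits the high-volume, high-velocity setting described on the slide.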
37.
END-TO-END ARCHITECTURES
Architectures
▶ Design end-to-end architectures for full data lifecycle
▶ Support for both “Data-at-Rest” and “Data-in-Motion”
▶ Data Hubs and Markets: Hadoop-based solutions tend to
become the central integration point for all enterprise data
38.
DATA ANALYSIS OVERVIEW
Core Techniques
The techniques associated with Big Data Analysis encompass those
related to data mining and machine learning, to information
extraction, and to new forms of data processing and reasoning, including,
for example, stream data processing and large-scale reasoning.
▶ Big Data Analysis is concerned with making raw data which has
been acquired amenable to use
▶ Supports decision making as well as domain specific usage.
Big Data Analysis
▶ Entity summarisation
▶ Data abstraction based on ontologies and communication workflow patterns
▶ Recommendations and personal data
▶ Stream data processing
▶ Large scale reasoning & Large scale machine learning
State of the art areas
39.
THE ROLE OF COMMUNITY IN ANALYSIS
Community Analysis and Collection
§ Number of data collection points can be dramatically increased;
§ Communities are creating bespoke tools for the particular situation and to
handle any problems in data collection (Developer Ecosystem)
§ Citizen engagement is increased significantly
Real-time radiation monitoring; City Noise Levels
40.
DATA CURATION OVERVIEW
▶ Digital Curation “Selection, preservation, maintenance,
collection, and archiving of digital assets”
▶ Data Curation “Active management of data over its life-cycle”
Definition
▶ Individual Curators
▶ Curation Departments
▶ Community-based (Emerging trend)
Who?
▶ (Semi-)Automated
▶ Crowdsourced Data Management
How?
▶ Accessible
▶ Authenticity
▶ Collaboration
▶ Discoverability
▶ Fitness for Use
▶ Integrity
▶ Reusability
▶ Security
▶ Sustainability
▶ Trustworthy
Why?
41.
BLENDING HUMAN AND ALGORITHM
Blended Approaches
▶ Blended human and algorithmic data processing
approaches for coping with data acquisition, transformation,
curation, access, and analysis challenges for Big Data
[Diagram labels]
Data sources: Web Data, Databases, Sensor Data
Analytics & Algorithms: Entity Linking, Data Fusion, Relation Extraction
Human Computation: Relevance Judgment, Data Verification, Disambiguation
Internal Community
- Domain Knowledge
- High Quality Responses
- Trustable
External Crowd
- High Availability
- Large Scale
- Expertise Variety
Roles: Programmers, Managers
Output: Better Data
42.
RECAPTCHA
• OCR
  - ~1% error rate
  - 20%-30% for 18th and 19th century books
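Where OCR is unreliable, human computation can fill the gap by asking several people to transcribe the same word and accepting an answer only when enough of them agree. The sketch below illustrates that idea in general terms. It is not reCAPTCHA's actual algorithm, and the function name `resolve_word` and the thresholds `min_votes` and `min_agreement` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_word(answers: Sequence[str], min_votes: int = 3,
                 min_agreement: float = 0.6) -> Optional[str]:
    """Return the transcription most people agree on, or None if undecided."""
    # Normalise case and whitespace so trivially different answers match.
    normalised = [a.strip().lower() for a in answers if a.strip()]
    if len(normalised) < min_votes:
        return None  # not enough human input yet
    word, count = Counter(normalised).most_common(1)[0]
    # Accept the top answer only if a large enough share of people gave it.
    return word if count / len(normalised) >= min_agreement else None


# Hypothetical answers for one scanned word that OCR could not read.
decided = resolve_word(["Curation", "curation ", "curatlon"])
undecided = resolve_word(["curation", "duration", "creation"])
```

Words that stay undecided would simply be shown to more people, which is how a blended approach trades a little latency for better data.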
43.
A CROSS-SECTOR TREND…
Telco, Media, & Entertainment
Manufacturing, Retail, Energy & Transport
Public Sector
Life Sciences
44.
DATA STORAGE OVERVIEW
▶ Concerned with the different aspects of storing,
organizing and manipulating information on electronic
data storage devices
Definition
▶ Data organization and modelling
▶ Basic data manipulations (Create,
Read, Update, Delete - CRUD)
▶ Data compression
▶ Data recovery, concurrency,
consistency, integrity and security
▶ Database systems architecture,
availability and partition tolerance
Key Topics
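The basic data manipulations named above (Create, Read, Update, Delete) can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `readings` table and its columns are hypothetical, chosen only to keep the example self-contained.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)"
)

# Create
conn.execute(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)", ("noise-01", 61.5)
)
# Read
row = conn.execute("SELECT sensor, value FROM readings WHERE id = 1").fetchone()
# Update
conn.execute("UPDATE readings SET value = ? WHERE id = 1", (58.0,))
updated = conn.execute("SELECT value FROM readings WHERE id = 1").fetchone()[0]
# Delete
conn.execute("DELETE FROM readings WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

conn.close()
```

NoSQL and NewSQL stores expose the same four operations through different interfaces; the relational form is used here only because it ships with the Python standard library.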
46.
ANALYSIS OF BIG DATA VOLUMES
Towards Integrated Analytics
• Integrated Systems
  - Single data model
  - Potentially higher performance
  - Lower development complexity
• Separate Systems
  - Different data models
  - May negatively impact performance
  - Higher development complexity
[Diagram labels: Analytical Databases; DBMS; Data Management; Analytics]
[Products shown: Mathworks, Rasdaman, SciDB, Revolution Analytics, Cloudera, RDBMS]
47.
TRADEOFF: SIZE VS. COMPLEXITY
48.
§ Decision support
  - Descriptive
  - Predictive
  - Prescriptive analysis
§ Data exploration
  - Extends Visualisation to Visual Analytics
§ Key areas include:
  - Industry 4.0 (industrial internet)
  - Predictive maintenance
  - Smart data and service integration
DATA USAGE OVERVIEW
▶ Key task of Data Usage is to
support business decisions
▶ Lookup, Learn, Investigate
▶ Exploratory browsing
▶ Search
▶ Analytics
▶ Closely related to Business
Intelligence and Data Mining
technologies, but extending them
▶ Off-line vs. real-time support
▶ Automated decisions
Definition Decision Making
49.
IMPROVING USABILITY
Usability
▶ Lowering the usability barrier for data tools: Users should
be able to directly manipulate the data
▶ Improvement of Human-Data interaction: Enabling experts
& casual users to query, explore, transform, & curate data
▶ Interactive exploration: Big Data generates insights beyond
existing models, so new analysis interfaces must support browsing
and modeling (visual analytics)
▶ Convergence within
analytical frameworks
Analytical databases for better
performance and lower
development complexity
(Mahout, Spark, Hadoop/R,
rasdaman, SciDB)
50. PART IV: THE NEXT WAVE OF BIG
DATA RESEARCH AND INNOVATION
51.
The Big Data Value Strategic Research and
Innovation Agenda (BDV SRIA) defines the
overall goals, main technical and non-technical
priorities, and a research and innovation
roadmap for the European contractual Public
Private Partnership (cPPP) on Big Data Value.
What is the SRIA?
Strategic Research and Innovation Agenda
Version 1.0 was published by BDVA in January 2015
Version 2.0 due this month.
Latest Version
• Built upon inputs and analysis from SMEs and Large
Enterprises, public organisations, and research and
academic institutions.
• Multiple workshops and consultations took place to ensure
the widest representation of views and positions
• Approximately 200 organisations and other relevant
stakeholders physically participating and contributing.
SRIA is based on strong community involvement
52.
BDV SRIA Technical Priorities
Data Management
Engineering the management of data
Data Processing Architectures
Optimized architectures for analytics of both data at rest and data in motion, delivering low-latency, real-time analytics
Deep Analytics
Deep analytics to improve data understanding, deep learning, meaningfulness of data
Data Protection and Preservation Mechanism
To make data owners comfortable about sharing data in an experimental setting
Data Visualization and User Experience
Enable intelligent visualization of complex information relying on enhanced user experience and usability
53.
How to semantically annotate unstructured and semi-structured
data without imposing extra effort on data producers?
How to unlock data silos by creating interoperability standards and
technologies for storing and exchanging data?
How to improve and assess the data quality from the various
domains?
How to ensure consistent data provenance along the data value
chain?
How to handle the sheer unbounded size of data while enforcing
consistent quality as the data scales in volume, velocity and
variability?
How to integrate analytics results from two different worlds: the
data and the business processes?
How to bundle and provision data, software and data analytics results
to ensure reuse of intermediate results?
Data Management
Challenges
54.
How to integrate the processing of data in motion and data at rest,
e.g.
• Real-time Analytics & Stream Processing
• New Big Data-specific parallelization techniques
How to parallelize and distribute analytics tasks in order to cope
efficiently with data in motion? The challenge is to develop complex
analytics techniques at scale and for data in motion in order to extract
knowledge out of the data and develop decision support applications
How to analyze data generated by IoT applications? I.e. how to
develop algorithms for IoT dataflow analytics?
How to ensure performance and scalability of the algorithms? I.e.
the performance has to scale by orders of magnitude while reducing
energy consumption with the best effort integration between hardware
and software.
Data Processing Architectures
Challenges
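One common building block for analytics on data in motion is a time-based sliding window, which keeps a running aggregate over only the most recent events instead of re-reading stored data. The sketch below is a minimal single-process illustration, not a distributed stream processor. The function name `windowed_mean` and the (timestamp, value) event layout are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def windowed_mean(events: Iterable[Tuple[float, float]],
                  window_seconds: float) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, mean of values in the trailing window) per event."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window: Deque[Tuple[float, float]] = deque()
    total = 0.0
    for timestamp, value in events:
        window.append((timestamp, value))
        total += value
        # Evict events that are window_seconds old or older. The newest
        # event always stays, so the window is never empty here.
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total / len(window)


# Hypothetical sensor stream: (seconds since start, measured value).
stream = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
means = list(windowed_mean(stream, window_seconds=3))
```

Because each event is appended once and evicted at most once, the cost per event stays constant on average regardless of how long the stream runs.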
55.
How to produce predictive and prescriptive analytics results?
I.e. by deep learning techniques and graph mining techniques applied
on extremely large graphs. Contextualization that combines
heterogeneous data and data streams via graphs to improve the
quality of mining processes, classifiers, and event discovery.
How to foster the semantic analysis of data? I.e. how to improve
data analysis to provide a near-real-time interpretation of the data?
How to validate content? I.e. how to implement veracity models for
validating content?
How to develop new and open analytics frameworks?
How to improve the scalability and processing speed for the
aforementioned algorithms?
How to develop advanced business analytics and intelligence
techniques?
Deep Analytics
Challenges
56.
How to ensure privacy and data anonymisation as key
requirements for data sharing and exchange?
How to foster differential privacy, private information retrieval,
homomorphic encryption?
How to provide technical means that allow data owners to control
the access and usage of their data?
How to ensure irreversibility of the anonymisation?
How to develop scalable solutions?
How to preserve data anonymity while ensuring high data quality?
Data Protection and Anonymisation
Challenges
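Of the techniques named above, differential privacy is the easiest to sketch briefly. A standard approach for a counting query is the Laplace mechanism: add noise with scale 1/ε to the true count, since one person can change a count by at most 1. The code below is an illustrative sketch under those assumptions, not a hardened implementation (production libraries also guard against floating-point side channels). The function names and the example data are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(flags, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the result by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for flag in flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical use: how many patients in a data set have a given condition?
has_condition = [True, False, True, True, False]
noisy = private_count(has_condition, epsilon=0.5, rng=random.Random(42))
```

The trade-off between anonymity and data quality raised on the slide shows up directly in ε: a smaller ε gives stronger privacy but a noisier, less useful answer.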
57.
Data Visualization
How to present data analytics reports that encompass complex
documents containing a variety of data sources?
How to address the various design challenges in representing
complex information?
  - Interfaces need to be humane
  - Just-in-time delivery of relevant information
  - Filtering versus hiding of information
How to enable advanced data visualisation incorporating data
variety?
How to align the user-driven vs. data-driven data access paradigm?
How to develop intuitive interfaces while exploiting the advanced
discovery aspects of Big Data analytics?
Challenges
63. The Data Landscape
▶ Much of Big Data technology is evolutionary
▶ Old technologies applied in a new context
▶ Volume, Variety, Velocity, Value …
Technology Evolution
Process Revolution
▶ Business process change must be revolutionary
to enable new opportunities
▶ Industry 4.0 (Smart Manufacturing)
▶ Predictive maintenance
▶ Opportunities for data-driven improvements
▶ integration with customer and supplier data
▶ Moving from infrastructure services (IaaS) to
software (SaaS) to business processes (BPaaS) to
knowledge (KaaS)
64. The Data Landscape
▶ The long tail of data variety is a major shift in
the data landscape
▶ Coping with data variety and verifiability are
central challenges and opportunities for Big Data
▶ Cross-sectorial uses of Big Data will open up
new business opportunities
▶ Need for scalable approaches to cope with data
under different format and semantic assumptions
Variety and Reuse