Digital transformation is driving a new wave of large-scale datafication in every aspect of our world. Today our society creates data ecosystems where data moves among actors within complex information supply chains that can form around an organization, community, sector, or smart environment. These ecosystems of data can be exploited to transform our world and present new challenges and opportunities in the design of intelligent systems. This talk presents my recent work on using the dataspace paradigm as a best-effort approach to data management within data ecosystems. The talk explores the theoretical foundations and principles of dataspaces and details a set of specialized best-effort techniques and models to enable loose administrative proximity and semantic integration of heterogeneous data sources. Finally, I share my perspectives on future dataspace research challenges, including multimedia data, data governance and the role of dataspaces to enable large-scale data sharing within Europe to power data-driven AI.
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent Systems
1. From Data Platforms to Dataspaces:
Enabling Data Ecosystems for Intelligent Systems
Edward Curry,
Insight SFI Research Centre for Data Analytics
edward.curry@nuigalway.ie
LDAC2021 - 9th Linked Data in Architecture and Construction Workshop (11 - 13 October 2021)
2. Overview
• Part I: Data Ecosystems for Intelligent Systems
• Part II: Real-time Linked Dataspaces
• Part III: Final Thoughts on Research Directions and Data Policy
3. Contents
Part I: Fundamentals and Concepts
Part II: Data Support Services
Part III: Stream and Event Processing Services
Part IV: Intelligent Systems and Applications
Part V: Future Directions
Team
Web: http://dataspaces.info
A Team Effort: Open Access Book
10. Real World Digital World
[Figure: Digital Twins. Real world: Physical Twin (asset-centric), sensors, actuators. Digital world: Digital Twin (system-centric). Loop stages: Observe, Orient, Decide, Act.]
12. Distributed and Decentralised Data Ecosystems
Key Barrier: Interoperability – Protocols and Semantics
Curry, E. and Sheth, A. (2018) ‘Next-Generation Smart Environments: From System of Systems to Data Ecosystems’,
IEEE Intelligent Systems, 33(3), pp. 69–76. doi: 10.1109/MIS.2018.033001418.
14. Data Ecosystem
• socio-technical system
• extracting value from data value chains by interacting organisations and individuals
• oriented to business and societal purposes
• marketplace, competition, collaboration
Curry, E. (2016) ‘The Big Data Value Chain: Definitions, Concepts,
and Theoretical Approaches’, in Cavanillas, J. M., Curry, E., and
Wahlster, W. (eds) New Horizons for a Data-Driven Economy.
16. The “gold mining” metaphor applied to data processing
Transforming Transport has made use of a total of 164 terabytes of data from 160 different data sources
19. Traditional Approaches to Data Integration
[Figure: The Long Tail of Data. Horizontal axis: number of data sources, entities, attributes. Vertical axis: frequency of use (popularity), from low to high. Curve shown: cost of administration and semantic integration using traditional approaches.]
20. Content Space: From Rigid Schemas to Schema-less... and Fundamental Decentralisation
• Heterogeneous, complex and large-scale data
• Very large and dynamic “schemas”
• Open environments: distributed, decentralised, decoupled data sources, anonymous users, multi-domain, lack of global order of information flow
• Multiple perspectives (conceptualisations) of the reality
• Ambiguity, vagueness, inconsistency
21. The Red Queen Hypothesis
“It takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”
Lewis Carroll's Through the Looking-Glass
23. Data Platforms will Fuel AI-Driven Decision-Making
Data Generation and Analysis
(including IoT)
Data Platforms
(Access and Portability)
AI and Decision Platforms
24. IoT-Enablement
A Data Sharing Layer is needed…
• Layer 4 – Intelligent Apps, Analytics, and Users: Predictive Analytics, Situation Awareness, Decision Support, Digital Twin, Machine Learning, Users
• Layer 3 – Data: Schema, Entities, Catalog, Sharing, Access/Control, etc.
• Layer 2 – Middleware: Peer-to-Peer, Events, Pub/Sub, SOA, SDN, etc.
• Layer 1 – Communication and Sensing: IPv6, Wi-Fi, RFID, CoAP, AVB, etc.
• Data sources: Datasets, Things / Sensors, Contextual Data Sources (including legacy systems)
Adapted from: L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Comput. Networks, vol. 54, no. 15, pp. 2787–2805, Oct. 2010.
25. Human Interactivity: Web Search
From Structure to Search to Knowledge Graph
• ~1995: ~100K Websites; Exact Results; Human Curated
• ~1998: ~2.4M Websites; Approximate Results; Computed
• ~2012: ~700M; Approximate Results + Exact; Computed + Crowd
26. Cost of Data Management Solutions
Administrative Proximity
– Close vs. Loose Coordination
– Assumptions concerning guarantees such as data, access, quality, and consistency
Semantic Integration
– Degree to which data schemas are matched up (types, attributes, and names).
Halevy, A., Franklin, M. and Maier, D. 2006. Principles of dataspace
systems. 25th ACM SIGMOD-SIGACT-SIGART symposium on Principles of
database systems - PODS ’06 (New York, New York, USA, 2006), 1–9.
27. Approximate and Best Effort Approaches
[Figure: The Long Tail of Data. Horizontal axis: number of data sources, entities, attributes. Vertical axis: frequency of use (popularity), from low to high. Curves shown: cost of administration and semantic integration using traditional approaches; approximate and best-effort approaches.]
28. Dataspace
“Dataspaces are not a data integration approach; rather, they are
more of a data co-existence approach. The goal of dataspace
support is to provide base functionality over all data sources,
regardless of how integrated they are.”
(Halevy, A., Franklin, M. and Maier, D. 2006.)
29. Real-time Linked Dataspaces
Enabling platform for data management for intelligent systems within smart environments
Combines the pay-as-you-go paradigm of dataspaces, linked data, and knowledge graphs with entity-centric real-time queries
Principles (adapted from Halevy et al.):
• Must deal with many different formats of streams and events.
• Does not subsume the stream and event processing engines; they still provide individual access via their native interfaces.
• Queries are provided on a best-effort and approximate basis.
• Must provide pathways to improve the integration among the data sources, including streams and events, in a pay-as-you-go fashion.
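The best-effort principle above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the adapter functions `sensor_stream` and `legacy_db`, the record fields, and the entity id `room-101` are hypothetical and do not come from the Real-time Linked Dataspace implementation. The idea shown is only that a query returns whatever the reachable sources can answer and reports which sources were skipped, rather than failing as a whole.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical source adapters: each takes an entity id and returns records,
# or raises if the source is unavailable or cannot answer.
Source = Callable[[str], List[dict]]


def sensor_stream(entity_id: str) -> List[dict]:
    # Stand-in for a live stream that can answer the query.
    return [{"entity": entity_id, "temp_c": 21.5}]


def legacy_db(entity_id: str) -> List[dict]:
    # Stand-in for a source that is currently unreachable.
    raise TimeoutError("legacy system not reachable")


def best_effort_query(
    entity_id: str, sources: Dict[str, Source]
) -> Tuple[List[dict], List[str]]:
    """Collect whatever each source returns; one failing source never fails the query."""
    results: List[dict] = []
    skipped: List[str] = []
    for name, source in sources.items():
        try:
            for record in source(entity_id):
                # Tag each record with its origin so callers can judge its quality.
                results.append({**record, "source": name})
        except Exception:
            skipped.append(name)
    return results, skipped


results, skipped = best_effort_query(
    "room-101", {"sensors": sensor_stream, "legacy": legacy_db}
)
```

Returning the list of skipped sources alongside the partial answer is one possible design choice; it leaves room for pay-as-you-go improvement, since a source can be integrated later without changing the caller.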
30. Key Challenge
Investigate techniques to enable approximate and best-effort support services for loose administrative proximity and semantic integration
Incremental support services
• Catalog
• Entity management
• Query and search
• Data discovery
• Human tasks
• Quality of service
• Complex event processing
• Stream dissemination
• Approximate semantic event matching
32. Distributional Semantic Model
• Distributional hypothesis: the context surrounding a given word in a text provides relevant information about its meaning.
– "a word is characterized by the company it keeps", popularized by Firth in the 1950s
• Simplified semantic model: associational and quantitative.
Example text: A wife is a female partner in a marriage. The term "wife" seems to be a close term to bride, the latter is a female participant in a wedding ceremony, while a wife is a married woman during her marriage. ...
33. Distributional Semantic Model
• Semantic statistical knowledge extracted from large Web corpora
• Works as a semantic ranking function
• E.g. esa(room, building) = 0.099
• E.g. esa(room, car) = 0.009
[Figure: words (child, husband, spouse) drawn as vectors over context dimensions c1, c2, …, cn. Each coordinate is a function of the number of times that the word occurs in that context (example values 0.7, 0.5). θ marks the angle between word vectors.]
Gabrilovich, E.; Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. Proc. 20th Int'l Joint Conf. on Artificial Intelligence (IJCAI).
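The idea of a distributional model used as a ranking function can be sketched in a few lines of Python. This is a toy co-occurrence model under stated assumptions, not Explicit Semantic Analysis: the three-sentence corpus, the stop-word list, and the use of whole sentences as contexts are all illustrative choices, and the scores it produces are unrelated to the esa(...) values quoted on the slide.

```python
import math
from collections import Counter
from typing import Iterable

# Illustrative stop-word list and corpus (assumptions, not from the talk).
STOP_WORDS = {"a", "an", "is", "in", "with", "the"}

CORPUS = [
    "a wife is a female partner in a marriage",
    "a bride is a female participant in a wedding ceremony",
    "a car is a vehicle with an engine",
]


def context_vector(word: str, sentences: Iterable[str]) -> Counter:
    """Count the non-stop-words that share a sentence with `word`."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if word in tokens:
            counts.update(
                t for t in tokens if t != word and t not in STOP_WORDS
            )
    return counts


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


wife = context_vector("wife", CORPUS)
bride = context_vector("bride", CORPUS)
car = context_vector("car", CORPUS)

wife_bride = cosine(wife, bride)
wife_car = cosine(wife, car)
```

In this toy corpus "wife" and "bride" share the context word "female", so `wife_bride` is greater than `wife_car`, which is zero because the two words share no context. A real model would use far larger corpora and weighting schemes such as TF-IDF.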
34. Schema-Agnostic Natural Language Queries
Information Need: Who are the children of Marie Curie married to?
Semantic Gap: Possible Data Representations
[Figure: three possible data representations (A, B, C) of the same facts about Marie Curie, each using a different vocabulary, e.g. :motherOf and :wifeOf versus :Child and :Spouse, together with :type (NobelPrizeWinner, Scientist) and :numberOfKids (2). Entities shown: Marie Curie, Ève Curie, Irène Joliot-Curie, Henry R. Labouisse, Frédéric Joliot-Curie.]
Freitas, A. and Curry, E. (2014) ‘Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach’, in 19th International Conference on Intelligent User Interfaces (IUI’14). ACM.
35. Distributional Semantic Search
Information Need: Who are the children of Marie Curie married to?
Query: Marie Curie, children, married to, Person
Linked Data: :Marie Curie, :motherOf, :Ève Curie, :wifeOf, :Henry R. Labouisse
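The alignment between query terms and graph predicates shown above can be sketched as a small Python example. It is a simplified illustration, not the method of Freitas and Curry: the `RELATEDNESS` table holds invented scores standing in for a distributional relatedness measure, and the threshold and traversal strategy are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy linked-data graph built from the entities on the slide.
GRAPH: List[Triple] = [
    ("Marie Curie", "type", "Scientist"),
    ("Marie Curie", "motherOf", "Ève Curie"),
    ("Marie Curie", "motherOf", "Irène Joliot-Curie"),
    ("Ève Curie", "wifeOf", "Henry R. Labouisse"),
    ("Irène Joliot-Curie", "wifeOf", "Frédéric Joliot-Curie"),
]

# Hypothetical relatedness scores; a real system would compute these
# from a distributional semantic model rather than a lookup table.
RELATEDNESS: Dict[Tuple[str, str], float] = {
    ("children", "motherOf"): 0.6,
    ("children", "type"): 0.05,
    ("married to", "wifeOf"): 0.7,
}


def relatedness(term: str, predicate: str) -> float:
    return RELATEDNESS.get((term, predicate), 0.0)


def best_predicate(term: str, entity: str, threshold: float = 0.1) -> Optional[str]:
    """Pick the outgoing predicate of `entity` most related to the query term."""
    candidates = {p for s, p, _ in GRAPH if s == entity}
    scored = sorted(((relatedness(term, p), p) for p in candidates), reverse=True)
    if scored and scored[0][0] >= threshold:
        return scored[0][1]
    return None


def answer(pivot: str, terms: List[str]) -> List[str]:
    """Follow one best-matching predicate per query term, starting at the pivot entity."""
    frontier = [pivot]
    for term in terms:
        next_frontier: List[str] = []
        for entity in frontier:
            predicate = best_predicate(term, entity)
            if predicate is None:
                continue
            next_frontier.extend(
                o for s, p, o in GRAPH if s == entity and p == predicate
            )
        frontier = next_frontier
    return frontier


spouses = answer("Marie Curie", ["children", "married to"])
```

The query never mentions `motherOf` or `wifeOf`; the ranking function bridges the vocabulary gap, which is the point of a schema-agnostic query.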
37. Approximate Semantic Matching of Streams
Challenges
• Heterogeneity in event semantics (000s of schemas)
• Heterogeneity in processing rules (000s of rules tied to schemas)
• Manually implemented
Approximate Semantic Event Matcher
• Distributional event semantics
• Enables pay-as-you-go event matching for data streams
• Replaced 48,000 exact rules with 100 approximate rules with around 85% accuracy
Hasan, S. and Curry, E. (2014) ‘Approximate Semantic Matching of Events
for the Internet of Things’, ACM Transactions on Internet Technology, 14(1).
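A minimal Python sketch of approximate event matching follows. It is an illustration under assumptions, not the matcher from Hasan and Curry: the `TOY_RELATEDNESS` scores are invented stand-ins for a distributional measure, the event and subscription fields are hypothetical, and the 0.6 threshold is arbitrary. Comparing subscription values against all event values is a simplification that sidesteps differing attribute names.

```python
from typing import Dict

# Hypothetical relatedness scores between terms (symmetric pairs).
TOY_RELATEDNESS: Dict[frozenset, float] = {
    frozenset(("energy consumption", "power usage")): 0.8,
    frozenset(("laptop", "computer")): 0.7,
}


def relatedness(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def match_score(subscription: Dict[str, str], event: Dict[str, str]) -> float:
    """Average, over subscription values, of the best relatedness to any event value."""
    scores = [
        max((relatedness(wanted, value) for value in event.values()), default=0.0)
        for wanted in subscription.values()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def matches(subscription: Dict[str, str], event: Dict[str, str], threshold: float = 0.6) -> bool:
    return match_score(subscription, event) >= threshold


subscription = {"type": "energy consumption", "device": "laptop"}
event_a = {"kind": "power usage", "appliance": "computer"}
event_b = {"kind": "water flow", "appliance": "tap"}

score_a = match_score(subscription, event_a)
hit_a = matches(subscription, event_a)
hit_b = matches(subscription, event_b)
```

One approximate subscription here covers events that would otherwise each need an exact rule per vocabulary, at the cost of a tunable chance of false matches.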
38. Intelligent Systems and Applications
Smart Water and Energy Management Pilots
Airport
• Location: Linate Airport, Milan, Italy
• Target users: Corporate users; ~9.5 million passengers; Utilities management; Maintenance staff; Environmental managers
• Infrastructure: Safety critical; 10 km water network; Multiple buildings; Water meters; Energy meters; Legacy systems
Office
• Location: Insight, Galway, Ireland
• Target users: 130 staff; Office consumers; Operations managers; Utility providers; Building managers
• Infrastructure: 2190 m2 space; 22 offices + 160 open plan spaces; Conference room; 4 meeting rooms; 3 kitchens; Data centre; 30 person café; Energy meters
Home
• Location: Houses, Thermi, Greece
• Target users: Domestic consumers (adults, young adults and children); Utility providers
• Infrastructure: 10 households; Typical variety of domestic settings including kitchen, showers, baths, living room, bedrooms, and garden; Water meters
Mixed Use
• Location: Engineering, NUI Galway
• Target users: Mixed/Public consumers; Building managers; 100 staff; 1000 students (ages 18 to 24)
• Infrastructure: Water meters; Energy meters; Rainwater harvesting; Café; Weather station; Wet labs; Showers
School
• Location: Coláiste na Coiribe, Ireland
• Target users: Mixed/Public consumers; School management; Maintenance staff; 500 students (ages 12 to 18); 40 teachers
• Infrastructure: Water meters; Energy meters; Rainwater harvesting
39. Need to target different Target Users
Pilot sites:
• Smart Airport (Milan Linate, Italy)
• Smart Office (Galway, Ireland)
• Smart Homes (Municipality of Thermi, Greece)
• Mixed Use (Galway, Ireland)
• Smart School (CnaC School in Galway, Ireland)
Target users shown: Corporate Staff, Passengers, Operational Staff, Researchers, Families, Building Manager, University Students, Teaching Staff, School Students, Application Developers, Data Scientist
40. IoT-enabled Digital Twins and Intelligent Applications
[Figure: Real-time Linked Dataspace architecture arranged around the “OODA” loop (Observe, Orient, Decide, Act).]
• Data sources: Datasets, Things / Sensors
• Support services: Catalog & Access Control Service, Entity Management Service, Search & Query Service, Entity-Centric Real-Time Query Service, Complex Event Processing Service, Human Task Service
• Applications: Digital Twin, Personal Dashboard, Public Dashboards, Decision Analytics and Machine Learning, Notifications, Apps, Alerts
43. Experiences and Lessons Learnt from Dataspaces
• Developers need education in stream processing and approximate results
• Incremental data management can support agile software development
• Build the business case for data-driven innovation
• Integration with legacy data is a significant cost in smart environments
• The 5 star pay-as-you-go model simplified communication with non-technical users
• A secure canonical source for entity data simplifies application development
• Data quality with things and sensors is challenging in an operational environment
• Working with three pipelines adds overhead (Lambda + Entity Layer)
44. Part III: Final Thoughts on
Research Directions and Data
Policy
45. Future Research Directions
Large-scale Decentralised Support Services
• Enhanced Support Services
• Scaling Entity Management
• Maintenance and Operation Cost
Multimedia/Knowledge-Intensive Event Processing
• Support Services for Multimedia Data
• Placement of Multimedia Data and Workloads
• Adaptive Training of Classifiers
• Complex Multimedia Event Processing
Trusted Data Sharing
• Trusted Platforms
• Usage Control
• Personal/Industrial Dataspaces
Ecosystem Governance and Economic Models
• Decentralised Data Governance
• Economic Models
Incremental Intelligent Systems Engineering: Cognitive Adaptability
• Pay-as-you-go Systems
• Cognitive Adaptability
Towards Human-centric Systems
• Explainable Artificial Intelligence and Data Provenance
• Human-in-the-loop
47. Multimodal Event Processing
Multimodal Data is a game changer for Smart Environments…
Overview
• Shift from Structured to Unstructured
• Enabling Intelligent Systems with Real-time Multimodal Data
• Multimodal Data Streams
– Structured
– Video
– Audio
• Rich-Content Processing
– Larger data volumes
– Larger Content-space
– Content Extraction Costs
• Edge and Resources
– Computationally Intensive
– Network Intensive
48. Real-time Health and Safety Monitoring
[Figure: a site graph combining structured sensor streams (Temp, Wind Speed, Lux) with unstructured sensor streams (detected Person, Vest, Hat), linked by relations such as occupant, wearing, has, and left/right.]
Queries
• Is everyone wearing PPE/hardhat?
• Are there any visitors?
• Is it a safe working temperature?
• Is smoke detected?
• Is the wind speed safe?
• Is there any unsafe behaviour?
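Two of the queries above can be sketched in Python over a combined view of structured readings and relations extracted from video. This is an illustrative sketch only: the detection triples, the reading names, the required PPE set, and the `MAX_SAFE_WIND_KMH` limit are all hypothetical values, and no real object detector or safety standard is implied.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relations extracted from video frames (unstructured stream).
detections: List[Tuple[str, str, str]] = [
    ("person-1", "wearing", "hat"),
    ("person-1", "wearing", "vest"),
    ("person-2", "wearing", "vest"),
]
persons: Set[str] = {"person-1", "person-2"}

# Hypothetical structured sensor readings for the same site.
readings: Dict[str, float] = {"temp_c": 18.0, "wind_speed_kmh": 52.0}

REQUIRED_PPE: Set[str] = {"hat", "vest"}  # assumed site policy
MAX_SAFE_WIND_KMH = 40.0  # assumed limit, not a real regulation


def missing_ppe(
    people: Set[str], triples: List[Tuple[str, str, str]]
) -> Dict[str, Set[str]]:
    """Return, per person, the required items with no 'wearing' relation detected."""
    worn: Dict[str, Set[str]] = {p: set() for p in people}
    for subject, relation, obj in triples:
        if relation == "wearing" and subject in worn:
            worn[subject].add(obj)
    return {
        p: REQUIRED_PPE - items for p, items in worn.items() if REQUIRED_PPE - items
    }


def wind_is_safe(values: Dict[str, float]) -> bool:
    return values["wind_speed_kmh"] <= MAX_SAFE_WIND_KMH


ppe_gaps = missing_ppe(persons, detections)
everyone_protected = not ppe_gaps
wind_ok = wind_is_safe(readings)
```

The first query depends on content extracted from video while the second reads a numeric sensor value, which is why both kinds of stream need to be queryable together.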
49. Gnosis: Neuro-Symbolic Event Processing
[Figure: IoMT sources (cameras, sensors, sound) feed multimedia flows and structured flows into a Single Event Matcher and a Complex Event Matcher, with History and Rules components, serving IoMT applications (Query 1, Query 2, Query 3).]
50. Multimodal Event Processing Language
Yadav, P. et al. (2021) ‘Query-Driven Video Event Processing for
the Internet of Multimedia Things (Demo)’, Proceedings of the
VLDB Endowment, 14(12), pp. 2847–2850.
52. “The future is already here –
it’s just not evenly distributed.” William Gibson
53. (Open) Data is Key to AI
“The world’s most valuable resource is no longer oil, but data. The data economy demands a new approach to antitrust rules” (The Economist)
…startups and established firms that are just beginning to use AI need access to data in order to train their AI systems. Difficulty in accessing the necessary data can create a barrier to entry, potentially reducing competition and innovation. (Forbes)
54. From Open Data to … Public Digital Infrastructures
Forward-thinking societies will see the provision of digital infrastructure (including data platforms) as a shared societal service in the same way as water, sanitation, and healthcare.
57. European Strategy for Data
A common European data space, a single market for data
• Data can flow within the EU and across sectors
• European rules and values are fully respected
• Rules for access and use of data are fair, practical and clear, and clear data governance mechanisms are in place
• Availability of high quality data to create and innovate
58. Cloud Federation, common European data spaces and AI
Common European data spaces: Health, Industrial & Manufacturing, Agriculture, Culture, Mobility, Green Deal, Security, Public Administration, Media
• Driven by stakeholders
• Rich pool of data of varying degree of openness
• Sectoral data governance (contracts, licenses, access rights, usage rights)
• Technical tools for data pooling and sharing
High Value Datasets from public sector
AI Testing and Experimentation Facilities
AI on demand platform
Federation of Cloud & HPC Infrastructure & Services
• SaaS (Software as a Service): Software, ERP, CRM, data analytics
• PaaS (Platforms as a Service): Smart Interoperability Middleware
• IaaS (Infrastructure as a Service): Servers, computing, OS, storage, network
• Edge Infrastructure & Services
• High-Performance Computing
• Cloud stack management and multi-cloud / hybrid cloud, cloud governance
• Marketplace for Cloud to Edge based Services
• Cloud services meeting high requirements for data protection, security, portability, interoperability, energy efficiency