WWW.LEDS-PROJEKT.DE
ECCENCA CORPORATE MEMORY
SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES
September 29, 2016
MOTIVATION
Enterprise Data Management Objective:
“Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports.”
Source: 2015 Data Management Industry Benchmark, EDM Council
In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.
ARCHITECTURE
Management
Accounting
Risk Management
Regulatory Reporting
Treasury
Marketing
Accounting
Corporate
Memory
Inbound
Data Sources
Outbound and
Consumption
Inbound Raw Data Store
Knowledge Graph for Meta Data, KPI Definition and Data Models
Frontend to Access Relationship and KPI Definition / Documentation
Frontend to Access (ad hoc) Reports
Outbound Data Delivery to Target Systems
Big Data
DWH-Infrastructure
Data Ingestion
• Files in the data lake (CSV, XML, Excel)
• (relational) Databases
Data Lake
• Emerging approach to handling large amounts of data
• Cost-effective storage
• Data is held in its native format
Good: Does not force an up-front integration of the ingested data sets
Bad: Without a coherent shared view, it is hard to retain an overview of the disparate data silos in the lake
Data Warehouses
• Existing infrastructure
• Typically relational databases
Metadata Layer
• Dataset Metadata
• Ontologies
• Integration Rules
Graphical User Interface
Customer Applications
INTEGRATION PROCESS
Dataset Management
• Catalog Datasets
• Catalog Ontologies
• Manage Metadata
Dataset Discovery
• Data Profiling
• Dataset Exploration
Dataset Integration
• Dataset Lifting
• Dataset Linking
• Data Quality Validation
Data Access
• Domain Specific Consolidated Views
• Execution on Hadoop
DATASET MANAGEMENT
DATASET CATALOG
• Enables the user to explore and manage datasets in the data lake
• Files in the data lake (CSV, XML, Excel)
• Databases (Apache Hive or external databases)
MANAGING METADATA
• Exploring and editing dataset metadata
• Semantic content information, like textual descriptions, tags and related persons
• Technical information and parameters, like formats, data model and encoding
• Access information, like access path or URL, source system or API call
• Organizational provenance, like organizational units owning or maintaining the dataset
DATASET DISCOVERY
• Goal: Augment a dataset with data from related datasets
• Automatic discovery of datasets with overlapping information
• Explorative interface
• Discovery is based on two kinds of data:
• Business metadata
• Profiling summary
DISCOVERY VIEW
• Datasets are matched based on their metadata (profiling + business data)
DATASET PROFILING
• Datasets often contain implicit and explicit schema information
• Column names, data formats, enumerated values etc.
• Example: column contains formatted dates
• Idea: Extract a dataset summary
• For each column / property the summary contains:
1. Data type (e.g., number, date, industry classification)
2. Data format (e.g., date format)
3. Data statistics (e.g., range, distribution, most frequent values)
• Materialized as RDF with UI view
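A per-column summary of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`guess_type`, `profile_column`), the sample columns, and the restriction to one date format are hypothetical, and the summary is kept as a plain dictionary rather than materialized as RDF.

```python
from collections import Counter
from datetime import datetime


def guess_type(values):
    """Very rough type guess: 'number', 'date' (DD-MM-YYYY only), or 'text'."""
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        try:
            datetime.strptime(v, "%d-%m-%Y")
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in values):
        return "number"
    if all(is_date(v) for v in values):
        return "date"
    return "text"


def profile_column(values, top_n=3):
    """Build a summary dict (type, format, statistics) for one column of strings."""
    data_type = guess_type(values)
    summary = {
        "type": data_type,
        "format": "DD-MM-YYYY" if data_type == "date" else None,
        "most_frequent": Counter(values).most_common(top_n),
        "distinct": len(set(values)),
    }
    if data_type == "number":
        numbers = [float(v) for v in values]
        summary["range"] = (min(numbers), max(numbers))
    return summary


# Made-up sample columns, used only to exercise the sketch.
columns = {
    "coupon": ["5.2", "1.5", "6.5", "5.2"],
    "maturity": ["01-06-2025", "15-03-2020", "26-01-2019", "01-06-2025"],
    "country": ["Canada", "Germany", "France", "Canada"],
}
profile = {name: profile_column(vals) for name, vals in columns.items()}
```

A real profiler would sample large columns and recognize many more types and formats; the point here is only the shape of the summary.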
DETECTING DATA TYPES
• Detecting common datatypes as well as user-defined types
• Common datatypes
• Numbers
• Dates / Times
• Geographic locations (geo-coordinates, states, countries)
• User-defined data types can be integrated by adding an ontology / taxonomy
• Usually a SKOS taxonomy
• Managed as another dataset in the dataset management
• Example: Industry taxonomy
• Standard taxonomy (NACE, SIC, NAICS) or company specific
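One simple way to recognize a user-defined type is to check how many of a column's values match the labels of a taxonomy. The sketch below assumes the taxonomy labels are already available as plain Python sets (a real SKOS taxonomy would be loaded from RDF); `detect_taxonomy_type`, the `min_coverage` threshold, and the sample labels are hypothetical.

```python
def detect_taxonomy_type(values, taxonomies, min_coverage=0.8):
    """Return the name of the taxonomy whose labels cover most of the values.

    `taxonomies` maps a type name to a set of concept labels. A type is
    assigned only if at least `min_coverage` of the values match a label.
    """
    normalized = [v.strip().lower() for v in values]
    if not normalized:
        return None
    best_name, best_coverage = None, 0.0
    for name, labels in taxonomies.items():
        known = {label.lower() for label in labels}
        coverage = sum(v in known for v in normalized) / len(normalized)
        if coverage > best_coverage:
            best_name, best_coverage = name, coverage
    return best_name if best_coverage >= min_coverage else None


# Hypothetical, tiny taxonomies for illustration.
taxonomies = {
    "industry": {"Banking", "Utilities", "Electrical Equipment"},
    "country": {"Canada", "Germany", "France"},
}
```

With these taxonomies, a column holding "Banking", "Utilities" and "Electrical Equipment" would be typed as `industry`, while a column of unrelated strings gets no user-defined type.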
FORMATS AND STATISTICS
• For some types, the data format is detected
• Example: Dates are formatted in DD-MM-YYYY
• Two functions are generated:
1. Parser that is able to read the detected representation
2. Normalizer that converts the parsed values into a configurable, organization-wide target representation
• Statistics summarize the values:
• Value range and distribution
• Most frequent values
• Data selectivity
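The parser and normalizer pair for the date example could look roughly like this. It is a minimal sketch: the candidate format list, the function names, and the ISO-style default target format are assumptions, not the product's actual behaviour.

```python
from datetime import datetime

# Candidate formats to try; a hypothetical, deliberately short list.
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]


def detect_date_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for v in values:
                datetime.strptime(v, fmt)
            return fmt
        except ValueError:
            continue
    return None


def make_parser_and_normalizer(source_format, target_format="%Y-%m-%d"):
    """Generate the two functions for a detected source format."""
    def parse(text):
        # Parser: reads the detected representation.
        return datetime.strptime(text, source_format)

    def normalize(parsed):
        # Normalizer: renders a parsed value in the target representation.
        return parsed.strftime(target_format)

    return parse, normalize


fmt = detect_date_format(["26-01-2019", "15-03-2020"])
parse, normalize = make_parser_and_normalizer(fmt)
```

Here `normalize(parse("26-01-2019"))` yields `"2019-01-26"`. Ambiguous inputs such as `01-02-2019` simply match the first fitting candidate, which a production detector would need to resolve with more evidence.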
DATA INTEGRATION
• The integration process is driven by a set of rules
• Lifting Rules map the source datasets to an ontology
• Linking Rules connect different datasets to a knowledge graph
• Rules are operator trees, consisting of four types of operators
• Data Access Operators
• Transformation Operators
• Similarity Operators
• Aggregation Operators
• Rules can be learned using genetic programming algorithms
• Rules are human understandable and can be edited
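To make the idea of a rule as an operator tree concrete, here is a small sketch with one class per operator type. The class names, the two similarity measures, and the sample entities (including the rating record and its `label` field) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Access:
    """Data access operator: reads one property of an entity."""
    prop: str

    def values(self, entity):
        return entity[self.prop]


@dataclass
class Transform:
    """Transformation operator: applies a function to the child's value."""
    child: object
    fn: Callable[[str], str]

    def values(self, entity):
        return self.fn(self.child.values(entity))


@dataclass
class Similarity:
    """Similarity operator: compares a source value with a target value."""
    source: object
    target: object
    measure: Callable[[str, str], float]

    def score(self, a, b):
        return self.measure(self.source.values(a), self.target.values(b))


@dataclass
class Aggregation:
    """Aggregation operator: combines the child scores into one score."""
    children: List[object]
    combine: Callable[[List[float]], float]

    def score(self, a, b):
        return self.combine([c.score(a, b) for c in self.children])


def equality(x, y):
    return 1.0 if x == y else 0.0


def token_jaccard(x, y):
    xs, ys = set(x.split()), set(y.split())
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0


# Rule: ISINs must be equal AND lower-cased names must share tokens (min = AND).
rule = Aggregation(
    children=[
        Similarity(Access("isin"), Access("isin"), equality),
        Similarity(
            Transform(Access("name"), str.lower),
            Transform(Access("label"), str.lower),
            token_jaccard,
        ),
    ],
    combine=min,
)

bond = {"isin": "CA639832AA25", "name": "NEDWBK CAD 5,2%25"}
rating = {"isin": "CA639832AA25", "label": "nedwbk cad 5,2%25"}
other = {"isin": "DE000A1G85B4", "label": "siemensf1.50%03/20"}
```

Because the tree is plain data, a learning algorithm can mutate and recombine subtrees, and a person can still read the result as "compare these properties, transformed like this, and combine the scores like that".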
DATASET LIFTING
• Objective: Map the datasets in the data lake to a consistent vocabulary.
• A lifting rule consists of a number of mappings.
• Each mapping assigns a term in the original dataset (such as a column in tabular data) to a term in the target ontology (such as a property).
• Multiple mappings can be managed for each dataset to allow different views on the same data.
• Initial mappings are generated automatically from the profiling results; the user can then refine them.
LIFTING EXAMPLE
Bond | ISIN | Country | Industry
NEDWBK CAD 5,2%25 | CA639832AA25 | Canada | Banking
SIEMENSF1.50%03/20 | DE000A1G85B4 | Germany | Electrical Equipment
Electricite de France (EDF), 6,5% 26jan2019 | USF2893TAB29 | France | Utilities

Lifted graph (diagram): the bond "NEDWBK CAD 5,2%25" is linked via fibo:hasSecurityIdentifier to “CA639832AA25”, via fibo:legallyRecordedIn to the Country Ontology (nodes shown: France, Germany, EMEA), and via fibo:industrySector to the Industry Ontology (nodes shown: Utilities, Banking).
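The lifting example can be sketched as a set of column-to-property mappings applied to one row. The property names are taken from the slide; which column maps to which property is partly an assumption (only the ISIN mapping is explicit in the diagram), and `lift_row` is a hypothetical helper.

```python
# A lifting rule as column-to-property mappings (illustrative assignment).
LIFTING_RULE = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}


def lift_row(row, subject_column, rule):
    """Turn one tabular row into (subject, property, value) triples."""
    subject = row[subject_column]
    return [
        (subject, prop, row[column])
        for column, prop in rule.items()
        if row.get(column)  # skip missing or empty cells
    ]


row = {
    "Bond": "NEDWBK CAD 5,2%25",
    "ISIN": "CA639832AA25",
    "Country": "Canada",
    "Industry": "Banking",
}
triples = lift_row(row, "Bond", LIFTING_RULE)
```

In a full lifting step the country and industry values would be resolved to concepts in the respective ontologies rather than kept as strings; the sketch stops at the raw triples.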
LINKING
• Goal: Connect individual datasets to a knowledge graph
• Identify related entities in different datasets and link them
• Linked entities either describe the same real-world object or stand in some other relation to each other
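Linked graph (diagram): the bond "NEDWBK CAD 5,2%25" is connected via hasRating to the entity "Rating CAD 5,2%25", which has the ratingScore “AAA”. Both entities carry fibo:legallyRecordedIn and fibo:industrySector links into the Country Ontology (node shown: EMEA) and the Industry Ontology.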
LINKAGE RULES
• Linking is based on domain-specific rules
• Specify the conditions that must hold true for two entities to be linked
LEARNING LINKAGE RULES
Problem: Manually writing rules is time-consuming and requires expertise
Approach: Interactive machine learning algorithm for generating rules
• Generates a rule based on a number of user-confirmed link candidates.
• The learning algorithm actively selects the link candidates that yield a high information gain.
• The user does not need any knowledge of the characteristics of the dataset or of any particular similarity computation techniques.
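The slides do not say how information gain is estimated. One common strategy, swapped in here purely as an illustration, is query-by-committee: keep several candidate rules and ask the user about the pair on which they disagree most. The committee, the candidate pairs, and the function names below are all hypothetical.

```python
import math


def vote_entropy(votes):
    """Binary entropy of a list of True/False votes (0 means full agreement)."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_query(candidates, committee):
    """Pick the candidate pair on which the committee of rules disagrees most."""
    return max(
        candidates,
        key=lambda pair: vote_entropy([rule(*pair) for rule in committee]),
    )


# Hypothetical committee of three simple candidate rules.
committee = [
    lambda a, b: a["isin"] == b["isin"],
    lambda a, b: a["name"].lower() == b["name"].lower(),
    lambda a, b: a["name"][:3] == b["name"][:3],
]

# Made-up link candidates: all rules agree on the first two, not on the third.
candidates = [
    ({"isin": "A1", "name": "Alpha"}, {"isin": "A1", "name": "Alpha"}),
    ({"isin": "A1", "name": "Alpha"}, {"isin": "B2", "name": "Beta"}),
    ({"isin": "A1", "name": "Alpha"}, {"isin": "A1", "name": "Alpine"}),
]
query = select_query(candidates, committee)
```

The third pair is selected because two rules vote "link" and one votes "no link"; the user's answer on that pair rules out at least one candidate rule, whereas confirming the first two pairs would tell the learner nothing new.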
VIEW GENERATION
• The user selects a set of lifted and linked datasets
Hadoop
Data Lake
DATA ACCESS
• Generate data flows based on Apache Spark
• The data flows utilize Resilient Distributed Datasets (RDDs)
• RDDs derive new data sets from existing data sets by applying a chain of transformations
• A derived data set can either
• be recomputed on the fly, or
• be persisted on stable storage
• Data flows can be executed efficiently on Hadoop clusters.
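The difference between recomputing and persisting a derived data set can be shown with a small pure-Python stand-in. This is not the Spark API: `LazyDataset` is a hypothetical class that only mimics the lineage idea, its `persist` computes eagerly for simplicity, and the sample rows are abbreviated from the example table.

```python
class LazyDataset:
    """A tiny stand-in for an RDD: rows are derived on demand from a parent."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable producing the rows
        self._cache = None        # filled by persist()
        self.computations = 0     # how often the rows were actually computed

    def map(self, fn):
        return LazyDataset(lambda: [fn(r) for r in self.collect()])

    def filter(self, predicate):
        return LazyDataset(lambda: [r for r in self.collect() if predicate(r)])

    def persist(self):
        # Simplification: compute now and keep the result in memory.
        self._cache = self.collect()
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        self.computations += 1
        return self._compute()


# Illustrative source rows (bond name, country).
bonds = LazyDataset(
    lambda: [("NEDWBK", "Canada"), ("SIEMENSF", "Germany"), ("EDF", "France")]
)
lifted = bonds.map(lambda row: {"bond": row[0], "country": row[1]})
european = lifted.filter(lambda row: row["country"] != "Canada")
```

Each call to `european.collect()` re-runs the whole chain unless `lifted.persist()` has been called, in which case the lifted rows are reused. Spark makes the same trade-off between recomputation cost and storage, but distributes the work across a cluster.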
Data flow (diagram): Corporate Bonds → Data Lifting 1 (Apache Spark RDD); Internal Ratings → Data Lifting 2 (Apache Spark RDD); External Ratings → Data Lifting 3 (Apache Spark RDD). The lifted data sets feed Data Linking (Apache Spark RDD) within eccenca Corporate Memory, which serves the Data Consumer via SQL, CSV, Excel and the Spark API.
DEMO
Contact
Dr. Robert Isele
Tel: +49 151 17238616
email: robert.isele@eccenca.com
eccenca
Command your Data!

Más contenido relacionado

La actualidad más candente

Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and Opportunities
Srinath Srinivasa
 

La actualidad más candente (20)

Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016Ontos NLP Stack, Sep. 2016
Ontos NLP Stack, Sep. 2016
 
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
 
Solution architecture for big data projects
Solution architecture for big data projectsSolution architecture for big data projects
Solution architecture for big data projects
 
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for EnterpriseChalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
 
David Kuilman | Creating a Semantic Enterprise Content model to support conti...
David Kuilman | Creating a Semantic Enterprise Content model to support conti...David Kuilman | Creating a Semantic Enterprise Content model to support conti...
David Kuilman | Creating a Semantic Enterprise Content model to support conti...
 
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
Joe Pairman | Multiplying the Power of Taxonomy with Granular, Structured Con...
 
Open Data and News Analytics Demo
Open Data and News Analytics DemoOpen Data and News Analytics Demo
Open Data and News Analytics Demo
 
Semantic E-Commerce - Use Cases in Enterprise Web Applications
Semantic E-Commerce - Use Cases in Enterprise Web ApplicationsSemantic E-Commerce - Use Cases in Enterprise Web Applications
Semantic E-Commerce - Use Cases in Enterprise Web Applications
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
 
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
Nicoletta Fornara and Fabio Marfia | Modeling and Enforcing Access Control Ob...
 
The Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge GraphThe Business Case for Semantic Web Ontology & Knowledge Graph
The Business Case for Semantic Web Ontology & Knowledge Graph
 
Big Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and OpportunitiesBig Data and the Semantic Web: Challenges and Opportunities
Big Data and the Semantic Web: Challenges and Opportunities
 
On demand access to Big Data through Semantic Technologies
 On demand access to Big Data through Semantic Technologies On demand access to Big Data through Semantic Technologies
On demand access to Big Data through Semantic Technologies
 
How to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk AnalyticsHow to Reveal Hidden Relationships in Data and Risk Analytics
How to Reveal Hidden Relationships in Data and Risk Analytics
 
Semantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceSemantic Technology in Publishing & Finance
Semantic Technology in Publishing & Finance
 
Smarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformSmarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing Platform
 
Sebastian Hellmann
Sebastian HellmannSebastian Hellmann
Sebastian Hellmann
 
Solution architecture
Solution architectureSolution architecture
Solution architecture
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 

Destacado

Kostas Kastrantas | Business Opportunities with Linked Open Data
Kostas Kastrantas  | Business Opportunities with Linked Open DataKostas Kastrantas  | Business Opportunities with Linked Open Data
Kostas Kastrantas | Business Opportunities with Linked Open Data
semanticsconference
 

Destacado (19)

Michael Fuchs | How to compute semantic relationships between entities and fa...
Michael Fuchs | How to compute semantic relationships between entities and fa...Michael Fuchs | How to compute semantic relationships between entities and fa...
Michael Fuchs | How to compute semantic relationships between entities and fa...
 
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
Camilo Thorne, Stefano Faralli and Heiner Stuckenschmidt | Entity Linking for...
 
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
Jörg Waitelonis, Henrik Jürges and Harald Sack | Don't compare Apples to Oran...
 
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
Sebastian Bader | Semantic Technologies for Assisted Decision-Making in Indus...
 
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
 
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
Vladimir Alexiev | Semantic Enrichment of Twitter Microposts Helps Understand...
 
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Un...
 
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
Phil Ritchie | Putting Standards into Action: Multilingual and Semantic Enric...
 
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
Shuangyong Song, Qingliang Miao and Yao Meng | Linking Images to Semantic Kno...
 
Victor Charpenay | Standardized Semantics for an Open Web of Things
Victor Charpenay | Standardized Semantics for an Open Web of ThingsVictor Charpenay | Standardized Semantics for an Open Web of Things
Victor Charpenay | Standardized Semantics for an Open Web of Things
 
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
Najmeh Mousavi Nejad, Simon Scerri, Sören Auer and Elisa M. Sibarani | EULAid...
 
Kostas Kastrantas | Business Opportunities with Linked Open Data
Kostas Kastrantas  | Business Opportunities with Linked Open DataKostas Kastrantas  | Business Opportunities with Linked Open Data
Kostas Kastrantas | Business Opportunities with Linked Open Data
 
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
Kolawole John Adebayo, Luigi Di Caro and Guido Boella | A Supervised Keyphras...
 
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
OWL-based validation by Gavin Mendel Gleasonand Bojan Bozic, Trinity College,...
 
Thomas Vavra | New Ways of Handling Old Data
Thomas Vavra | New Ways of Handling Old DataThomas Vavra | New Ways of Handling Old Data
Thomas Vavra | New Ways of Handling Old Data
 
OOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria PovedaOOPS!: on-line ontology diagnosis by Maria Poveda
OOPS!: on-line ontology diagnosis by Maria Poveda
 
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
 
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINEFelix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
 
Miroslav Líška | Methodology data.gov.sk-semanticweb, LOD Slovakia and Slovpe...
Miroslav Líška | Methodology data.gov.sk-semanticweb, LOD Slovakia and Slovpe...Miroslav Líška | Methodology data.gov.sk-semanticweb, LOD Slovakia and Slovpe...
Miroslav Líška | Methodology data.gov.sk-semanticweb, LOD Slovakia and Slovpe...
 

Similar a Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptxBeyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
Prasanna Hegde
 
Data Architecture for Solutions.pdf
Data Architecture for Solutions.pdfData Architecture for Solutions.pdf
Data Architecture for Solutions.pdf
Alan McSweeney
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Nathan Bijnens
 
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
Denodo
 

Similar a Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes (20)

Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricUsing a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
 
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BIAugmentation, Collaboration, Governance: Defining the Future of Self-Service BI
Augmentation, Collaboration, Governance: Defining the Future of Self-Service BI
 
Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptxBeyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
Beyond the Data Horizon Unlocking Growth for 5X through Competitor Analysis.pptx
 
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra...
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business Success
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
ERP technology Areas.pptx
ERP technology Areas.pptxERP technology Areas.pptx
ERP technology Areas.pptx
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
 
Data & Analytics with CIS & Microsoft Platforms
Data & Analytics with CIS & Microsoft PlatformsData & Analytics with CIS & Microsoft Platforms
Data & Analytics with CIS & Microsoft Platforms
 
CRM-UG Summit Phoenix 2018 - What is Common Data Model and how to use it?
CRM-UG Summit Phoenix 2018 - What is Common Data Model and how to use it?CRM-UG Summit Phoenix 2018 - What is Common Data Model and how to use it?
CRM-UG Summit Phoenix 2018 - What is Common Data Model and how to use it?
 
Data Architecture for Solutions.pdf
Data Architecture for Solutions.pdfData Architecture for Solutions.pdf
Data Architecture for Solutions.pdf
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
 
How a Logical Data Fabric Enhances the Customer 360 View
How a Logical Data Fabric Enhances the Customer 360 ViewHow a Logical Data Fabric Enhances the Customer 360 View
How a Logical Data Fabric Enhances the Customer 360 View
 
Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)Data Mesh in Azure using Cloud Scale Analytics (WAF)
Data Mesh in Azure using Cloud Scale Analytics (WAF)
 
Virtualisation de données : Enjeux, Usages & Bénéfices
Virtualisation de données : Enjeux, Usages & BénéficesVirtualisation de données : Enjeux, Usages & Bénéfices
Virtualisation de données : Enjeux, Usages & Bénéfices
 
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
Denodo Partner Connect: A Review of the Top 5 Differentiated Use Cases for th...
 

Más de semanticsconference

Más de semanticsconference (20)

Linear books to open world adventure
Linear books to open world adventureLinear books to open world adventure
Linear books to open world adventure
 
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
Session 1.2   high-precision, context-free entity linking exploiting unambigu...Session 1.2   high-precision, context-free entity linking exploiting unambigu...
Session 1.2 high-precision, context-free entity linking exploiting unambigu...
 
Session 4.3 semantic annotation for enhancing collaborative ideation
Session 4.3   semantic annotation for enhancing collaborative ideationSession 4.3   semantic annotation for enhancing collaborative ideation
Session 4.3 semantic annotation for enhancing collaborative ideation
 
Session 1.1 dalicc - data licenses clearance center
Session 1.1   dalicc - data licenses clearance centerSession 1.1   dalicc - data licenses clearance center
Session 1.1 dalicc - data licenses clearance center
 
Session 1.3 context information management across smart city knowledge domains
Session 1.3   context information management across smart city knowledge domainsSession 1.3   context information management across smart city knowledge domains
Session 1.3 context information management across smart city knowledge domains
 
Session 0.0 aussenac semanticsnl-pwebsem2017-v4
Session 0.0   aussenac semanticsnl-pwebsem2017-v4Session 0.0   aussenac semanticsnl-pwebsem2017-v4
Session 0.0 aussenac semanticsnl-pwebsem2017-v4
 
Session 0.0 keynote sandeep sacheti - final hi res
Session 0.0   keynote sandeep sacheti - final hi resSession 0.0   keynote sandeep sacheti - final hi res
Robert Isele | eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

  • 1. WWW.LEDS-PROJEKT.DE. ECCENCA CORPORATE MEMORY: SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES. September 29, 2016
  • 2. MOTIVATION
      • Enterprise Data Management Objective: “Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports.” Source: 2015 Data Management Industry Benchmark - EDM Council
      • In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.
  • 3. ARCHITECTURE [diagram; labels: Management Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing, Accounting; Corporate Memory; Inbound Data Sources; Outbound and Consumption; Inbound Raw Data Store; Knowledge Graph for Meta Data, KPI Definition and Data Models; Frontend to Access Relationship and KPI Definition / Documentation; Frontend to Access (ad hoc) Reports; Outbound Data Delivery to Target Systems; Big Data DWH-Infrastructure]
  • 4. ARCHITECTURE [diagram repeated]: Data Ingestion
      • Files in the data lake (CSV, XML, Excel)
      • (Relational) databases
  • 5. ARCHITECTURE [diagram repeated]: Data Lake
      • Emerging approach to handle large amounts of data
      • Cost-effective storage
      • Data is held in its native formats
      • Good: Does not force an up-front integration of the ingested data sets
      • Bad: Retaining an overview of disparate data silos in the lake without a coherent shared view is challenging
  • 6. ARCHITECTURE [diagram repeated]: Data Warehouses
      • Existing infrastructure
      • Typically relational databases
  • 7. ARCHITECTURE [diagram repeated]: Metadata Layer
      • Dataset Metadata
      • Ontologies
      • Integration Rules
  • 8. ARCHITECTURE [diagram repeated]: Graphical User Interface, Customer Applications
  • 9. INTEGRATION PROCESS
      • Dataset Management: Catalog Datasets, Catalog Ontologies, Manage Metadata
      • Dataset Discovery: Data Profiling, Dataset Exploration
      • Dataset Integration: Dataset Lifting, Dataset Linking, Data Quality Validation
      • Data Access: Domain Specific Consolidated Views, Execution on Hadoop
  • 10. DATASET MANAGEMENT [integration process overview repeated]
  • 11. DATASET CATALOG
      • Enables the user to explore and manage datasets in the data lake
      • Files in the data lake (CSV, XML, Excel)
      • Databases (Apache Hive or external databases)
  • 12. MANAGING METADATA
      • Exploring and editing dataset metadata
      • Semantic content information, like textual descriptions, tags and related persons
      • Technical information and parameters, like formats, data model and encoding
      • Access information, like access path or URL, source system or API call
      • Organizational provenance, like organizational units owning or maintaining the dataset
  • 13. DATASET DISCOVERY [integration process overview repeated]
  • 14. DATASET DISCOVERY
      • Goal: Augment a dataset with data from related datasets
      • Automatic discovery of datasets with overlapping information
      • Explorative interface
      • Discovery is based on two kinds of data
          • Business metadata
          • Profiling summary
  • 15. DISCOVERY VIEW
      • Datasets are matched based on their metadata (profiling + business data)
  • 16. DATASET PROFILING
      • Datasets often contain implicit and explicit schema information
          • Column names, data formats, enumerated values, etc.
          • Example: a column contains formatted dates
      • Idea: Extract a dataset summary
      • For each column / property, the summary contains:
          1. Data type (e.g., number, date, industry classification)
          2. Data format (e.g., date format)
          3. Data statistics (e.g., range, distribution, most frequent values)
      • Materialized as RDF with UI view
  • 17. DETECTING DATA TYPES
      • Detecting common datatypes as well as user-defined types
      • Common datatypes
          • Numbers
          • Dates / Times
          • Geographic locations (geo-coordinates, states, countries)
      • User-defined data types can be integrated by adding an ontology / taxonomy
          • Usually a SKOS taxonomy
          • Managed as another dataset in the dataset management
          • Example: Industry taxonomy
          • Standard taxonomy (NACE, SIC, NAICS) or company specific
  • 18. FORMATS AND STATISTICS
      • For some types, the data format is detected
          • Example: Dates are formatted in DD-MM-YYYY
      • Two functions are generated:
          1. Parser that is able to read the detected representation
          2. Normalizer that converts the parsed values into a configurable, organization-wide target representation
      • Statistics summarize the values:
          • Value range and distribution
          • Most frequent values
          • Data selectivity
  • 19. DISCOVERY VIEW
      • Datasets are matched based on their metadata (profiling + business data)
  • 20. INTEGRATION PROCESS [integration process overview repeated]
  • 21. DATA INTEGRATION
      • The integration process is driven by a set of rules
          • Lifting Rules map the source datasets to an ontology
          • Linking Rules connect different datasets to a knowledge graph
      • Rules are operator trees, consisting of four types of operators
          • Data Access Operators
          • Transformation Operators
          • Similarity Operators
          • Aggregation Operators
      • Rules can be learned using genetic programming algorithms
      • Rules are human understandable and can be edited
  • 22. DATASET LIFTING
      • Objective: Map the datasets in the data lake to a consistent vocabulary.
      • A lifting rule consists of a number of mappings
      • Each mapping assigns a term in the original dataset (such as a column for tabular data) to a term in the target ontology (such as a property provided by an ontology).
      • Multiple mappings for each dataset can be managed to allow different views on the same data.
      • Initial mappings are generated automatically from the profiling results; the user can then refine them.
  • 23. LIFTING EXAMPLE
      Bond | ISIN | Country | Industry
      NEDWBK CAD 5,2%25 | CA639832AA25 | Canada | Banking
      SIEMENSF1.50%03/20 | DE000A1G85B4 | Germany | Electrical Equipment
      Electricite de France (EDF), 6,5% 26jan2019 | USF2893TAB29 | France | Utilities
      [Graph diagram; labels: NEDWBK CAD 5,2%25; fibo:hasSecurityIdentifier “CA639832AA25”; fibo:legallyRecordedIn; fibo:industrySector; Country Ontology (France, Germany, EMEA); Industry Ontology (Utilities, Banking)]
  • 24. LINKING
      • Goal: Connect individual datasets to a knowledge graph
      • Identify related entities in different datasets and link them
      • The linked entities either describe the same real-world object or stand in another relation
      [Graph diagram; labels: NEDWBK CAD 5,2%25; Rating CAD 5,2%25; hasRating; ratingScore “AAA”; fibo:legallyRecordedIn; fibo:industrySector; Country Ontology (EMEA); Industry Ontology]
  • 25. LINKAGE RULES
      • Linking is based on domain-specific rules
      • Specify the conditions that must hold true for two entities to be linked
  • 26. LEARNING LINKAGE RULES
      • Problem: Manually writing rules is time-consuming and requires expertise
      • Approach: Interactive machine learning algorithm for generating rules
          • Generates a rule based on a number of user-confirmed link candidates.
          • Link candidates are actively selected by the learning algorithm to include link candidates that yield a high information gain.
          • The user does not need any knowledge of the characteristics of the dataset or any particular similarity computation techniques.
  • 27. INTEGRATION PROCESS [integration process overview repeated]
  • 28. VIEW GENERATION
      • The user selects a set of lifted and linked datasets
  • 29. DATA ACCESS
      • Generate data flows based on Apache Spark
      • The data flows utilize Resilient Distributed Datasets (RDDs)
      • RDDs derive new data sets from existing data sets by applying a chain of transformations
      • A derived data set can either be recomputed on the fly or persisted on stable storage
      • Data flows can be executed efficiently on Hadoop clusters.
      [Data flow diagram; labels: Hadoop Data Lake; Corporate Bonds; Internal Ratings; External Ratings; Data Lifting 1, Data Lifting 2, Data Lifting 3 (each Apache Spark RDD); Data Linking (Apache Spark RDD); eccenca Corporate Memory; Data Consumer; SQL, CSV, Excel, Spark API]
  • 30. DEMO
  • 31. Contact: Dr. Robert Isele, Tel: +49 151 17238616, email: robert.isele@eccenca.com. eccenca: Command your Data!
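
Slides 16 and 18 describe detecting a column's date format and generating a parser plus a normalizer. The slides do not show how this is implemented, so the following is only a minimal Python sketch of the idea. The candidate format list, the function names, the sample column, and the ISO target format are all assumptions made for illustration.

```python
from datetime import datetime

# Candidate formats to try, in order. This list is an assumption for
# illustration; a real profiler would likely know many more formats.
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%m/%d/%Y"]


def detect_date_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value did not match this format
        return fmt
    return None


def make_parser(fmt):
    """Parser: reads the detected representation into a date object."""
    return lambda text: datetime.strptime(text, fmt).date()


def make_normalizer(target_fmt="%Y-%m-%d"):
    """Normalizer: writes a date in a configurable target representation."""
    return lambda date_value: date_value.strftime(target_fmt)


# Hypothetical column values in DD-MM-YYYY, the format named on slide 18.
column = ["26-01-2019", "03-12-2020", "15-07-2025"]

fmt = detect_date_format(column)
parse = make_parser(fmt)
normalize = make_normalizer()
normalized = [normalize(parse(value)) for value in column]
```

Because the first format that fits every value wins, a column whose values all have a day of 12 or less would be ambiguous between day-first and month-first layouts. A fuller profiler would need statistics or metadata to break such ties.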

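Slides 22 and 23 describe a lifting rule as a set of mappings from source columns to ontology properties. The sketch below illustrates that idea in plain Python using two rows from the slide 23 table. The column-to-property pairing is inferred from the labels on the slide, and the `ex:bond/` subject prefix and the `lift` function are hypothetical. On the slide, country and industry values point to ontology concepts; this sketch keeps them as plain strings for brevity.

```python
import csv
import io

# One mapping per source column: column name -> target property.
# The pairing is an assumption read off the labels on slide 23.
MAPPINGS = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}

# Two rows from the slide 23 table, written as CSV.
CSV_TEXT = """Bond,ISIN,Country,Industry
"NEDWBK CAD 5,2%25",CA639832AA25,Canada,Banking
SIEMENSF1.50%03/20,DE000A1G85B4,Germany,Electrical Equipment
"""


def lift(rows, mappings, subject_column="ISIN"):
    """Turn tabular rows into (subject, property, value) triples."""
    triples = []
    for row in rows:
        # Hypothetical subject identifier built from the ISIN.
        subject = "ex:bond/" + row[subject_column]
        for column, prop in mappings.items():
            triples.append((subject, prop, row[column]))
    return triples


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
triples = lift(rows, MAPPINGS)
for triple in triples:
    print(triple)
```

Keeping the mappings in a separate dictionary mirrors the slide's point that several mappings can be managed per dataset to give different views on the same data.
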
Editor's notes

  1. TODO more details on linkage rules or rules in general (operators etc.)
  2. - Explain why manually writing a rule is hard?
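
Slide 21 names four operator types that make up a rule: data access, transformation, similarity, and aggregation. The slides do not show a concrete rule, so the following Python sketch is only one possible reading of such an operator tree. All function names, the property names `isin` and `name`, the sample entities, and the threshold are assumptions; the similarity measure is the Python standard library's `difflib.SequenceMatcher`, not necessarily what the product uses.

```python
from difflib import SequenceMatcher


def path(prop):
    """Data access operator: read one property of an entity."""
    return lambda entity: entity[prop]


def lower(inner):
    """Transformation operator: lowercase the value produced by `inner`."""
    return lambda entity: inner(entity).lower()


def equality(left, right):
    """Similarity operator: 1.0 if both values are equal, else 0.0."""
    return lambda a, b: 1.0 if left(a) == right(b) else 0.0


def string_similarity(left, right):
    """Similarity operator: character-based ratio between 0.0 and 1.0."""
    return lambda a, b: SequenceMatcher(None, left(a), right(b)).ratio()


def minimum(*operators):
    """Aggregation operator: every child comparison must score well."""
    return lambda a, b: min(op(a, b) for op in operators)


# The rule as an operator tree: aggregation over two similarity operators,
# each fed by data access (and optionally transformation) operators.
rule = minimum(
    equality(path("isin"), path("isin")),
    string_similarity(lower(path("name")), lower(path("name"))),
)

# Hypothetical entities from two datasets.
bond = {"isin": "CA639832AA25", "name": "NEDWBK CAD 5,2%25"}
rating = {"isin": "CA639832AA25", "name": "nedwbk cad 5,2%25"}
other = {"isin": "DE000A1G85B4", "name": "SIEMENSF1.50%03/20"}

THRESHOLD = 0.9  # assumed cut-off above which two entities are linked
linked = rule(bond, rating) >= THRESHOLD
not_linked = rule(bond, other) >= THRESHOLD
```

Even this small rule requires choosing properties, transformations, a similarity measure, an aggregation, and a threshold. That hints at why writing rules by hand is time-consuming, as slide 26 states.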