Over the last few years we have been building domain-specific knowledge graphs for a variety of real-world problems, including creating virtual museums, combating human trafficking, identifying illegal arms sales, and predicting cyber attacks. We have developed a range of techniques to construct such knowledge graphs, including techniques for extracting data from online sources, aligning the data to a domain ontology, and linking the data across sources. In this talk I will present these techniques and describe our experience in applying Semantic Web technologies to build knowledge graphs for real-world problems.
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies
1. From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies
Craig Knoblock
USC Information Sciences Institute
U.S. Semantic Technologies Symposium
March 1, 2018
4. Center on Knowledge Graphs: Projects
Center on Knowledge Graphs, USC Information Sciences Institute
5. Goal: Building Knowledge Graphs
From: raw, messy, disconnected data (hard to query, analyze & visualize)
To: clean, organized, linked data (easy to query, analyze & visualize)
6. Questions Addressed in this Talk
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
2. What is the “best” representation of the data in a knowledge graph?
• Very detailed domain-specific ontologies?
3. How should we deal with incomplete and incorrect information?
• Manual curation? Automated data cleaning?
4. How do we organize and store the data for efficient access?
• RDF? Triplestore?
7. Steps To Build a KG
Crawling, Extraction (Data Acquisition) → Mapping To Ontology → Entity Linking & Similarity → Knowledge Graph Deployment → Query & Visualization
Stores: ElasticSearch, Graph DB
External resources: schema.org, geonames
Stages: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
Highlighted stage: Feature Extraction
8. Illegal Arms Sales
• 100s of web sites
• ATF wants to find people buying and selling across state lines
• Challenge: extract and align the data across sites
11. Automated Extraction
[Minton et al., Inferlink]
Input: A Pile of Pages
12. Automated Extraction
input: a pile of pages → Classify by Templates → pages clustered by template
13. Automated Extraction
input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (one per cluster; four shown) → extractor
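The slides do not describe how the Inferlink system classifies pages or infers extractors, so the following is only a minimal sketch of the two steps under simple assumptions: pages that share most of their tag paths are treated as coming from one template, and tag paths whose text varies across the pages of a cluster are treated as data fields. The threshold, the function names, and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class PathCollector(HTMLParser):
    """Records the text found at each tag path, e.g. 'html/body/h1'."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy HTML.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.setdefault("/".join(self.stack), []).append(text)

def page_paths(html):
    """Map each tag path on a page to the text found there."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return {path: " ".join(chunks) for path, chunks in collector.texts.items()}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_by_template(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster with similar tag paths."""
    clusters = []
    for html in pages:
        paths = page_paths(html)
        for cluster in clusters:
            if jaccard(set(paths), set(cluster[0])) >= threshold:
                cluster.append(paths)
                break
        else:
            clusters.append([paths])
    return clusters

def infer_extractor(cluster):
    """Paths present on every page whose text varies are treated as data slots."""
    shared = set.intersection(*(set(page) for page in cluster))
    return sorted(p for p in shared if len({page[p] for page in cluster}) > 1)

def extract(extractor, html):
    paths = page_paths(html)
    return {path: paths.get(path) for path in extractor}

# Hypothetical pile of pages: two listings from one template, one unrelated page.
pages = [
    "<html><body><h1>Listing A</h1><p>Price:</p><span>$500</span></body></html>",
    "<html><body><h1>Listing B</h1><p>Price:</p><span>$300</span></body></html>",
    "<html><body><div><ul><li>About us</li></ul></div></body></html>",
]
clusters = cluster_by_template(pages)
extractor = infer_extractor(clusters[0])
print(extractor)
```

The constant "Price:" label is shared by both listing pages and never changes, so it is treated as template text rather than data. A production system would need far more robust template detection than exact tag-path overlap.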
15. Extraction Evaluation
Evaluated on 10 websites, 5 pages each. Accuracy per field:

Field          Perfect        Including partial and extra data
Title          1.0 (50/50)    1.0 (50/50)
Desc           .76 (37/49)    .98 (48/49)
Seller         .95 (40/42)    .95 (40/42)
Date           .83 (40/48)    .83 (40/48)
Price          .87 (39/45)    .98 (44/45)
Loc            .51 (23/45)    .84 (38/45)
Cat            .68 (34/50)    .88 (44/50)
Member Since   1.0 (35/35)    1.0 (35/35)
Expires        .52 (15/29)    .55 (16/29)
Views          .76 (19/25)    1.0 (25/25)
ID             .97 (35/36)    1.0 (36/36)
16. Steps To Build a KG
Stages: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
17. Knowledge Graph for Predicting Cyber Attacks
Sources: Blogs, Twitter, Conferences, CPEs, Darkweb marketplaces, News, CVEs, Darkweb Forums, Abuse.ch, Microsoft Bulletins
Karma Models map the sources to the Cyber Domain Ontology
Storage: ElasticSearch
19. Karma: Mapping Data to Ontologies
Inputs: Services, Relational Sources, Hierarchical Sources
Karma maps the inputs to the Cyber Ontology
Output: { JSON-LD }
[ Knoblock, Szekely, et al. ISWC 2012 ]
20. Map Source to Domain Ontology
Semantic Model: maps source to domain ontology
Domain Ontology (legend: object property, data property)
Classes: Software, Vulnerability, Topic, Post, Person
Properties shown: name, version, author, hasVulnerability, isVulnerabilityOf, description, isTopicOf, hasTopic, location, mentions, datePublished, topic, username, isAuthorOf
Source:
Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Bro can you give me a.. | English | windows xp sp3 | CVE-2016-1052 | 303828
… أناجربتالبرنامجوعملع | Arabic | jp2_cdef_destroy | | 147075
salve a tutti, ultimamento … | Italian | cve-2012-4969 execcommand vuln | cve-2012-4969 | 107075
21. Semantic Types
Semantic type | Post: text | Post: language | Topic: name | Vulnerability: name | Person: userId
Column | Column 1 | Column 2 | Column 3 | Column 4 | Column 5
Row | Bro can you give me a.. | English | windows xp sp3 | CVE-2016-1052 | 303828
Row | … أناجربتالبرنامجوعملع | Arabic | jp2_cdef_destroy | | 147075
Row | salve a tutti, ultimamento … | Italian | cve-2012-4969 execcommand vuln | cve-2012-4969 | 107075
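To make the idea of a semantic model concrete, here is a minimal sketch of applying one to a source row to produce a JSON-LD document. It is not Karma's implementation. The column-to-type assignments are an assumed reading of the slide, and the linking properties (`hasTopic`, `mentions`, `author`), their direction, and the `http://example.org/cyber#` vocabulary URL are hypothetical.

```python
import json

# Assumed semantic types: column -> (ontology class, data property).
SEMANTIC_TYPES = {
    "Column 1": ("Post", "text"),
    "Column 2": ("Post", "language"),
    "Column 3": ("Topic", "name"),
    "Column 4": ("Vulnerability", "name"),
    "Column 5": ("Person", "userId"),
}

# Hypothetical object properties linking the root class (Post) to the others.
LINKS = {"Topic": "hasTopic", "Vulnerability": "mentions", "Person": "author"}

def row_to_jsonld(row, root="Post", vocab="http://example.org/cyber#"):
    """Build one JSON-LD document per source row, nesting linked entities."""
    doc = {"@context": {"@vocab": vocab}, "@type": root}
    for column, value in row.items():
        if value in (None, ""):
            continue  # a missing cell simply produces no property
        cls, prop = SEMANTIC_TYPES[column]
        if cls == root:
            doc[prop] = value
        else:
            node = doc.setdefault(LINKS[cls], {"@type": cls})
            node[prop] = value
    return doc

row = {
    "Column 1": "Bro can you give me a..",
    "Column 2": "English",
    "Column 3": "windows xp sp3",
    "Column 4": "CVE-2016-1052",
    "Column 5": "303828",
}
print(json.dumps(row_to_jsonld(row), indent=2))
```

Because the mapping is plain data, the same function can be reused for any source once its columns have been assigned semantic types.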
24. Karma Learns the Source Models
[Taheriyan et al., ISWC 2013, ICSC 2014]
Inputs: Domain Ontology, Sample Data, Known Semantic Models
Steps: Learn Semantic Types → Construct a Graph → Generate Candidate Models → Rank Results
25. Learning Semantic Types
Requirements:
• Learn from a small number of examples
• Distinguish both string and numeric values
• Can be learned quickly and is highly scalable to large numbers of semantic types
Domain Ontology: Person (name, birthdate), Organization (name), City (name), State (name)
Sample source "Person":
   name | date | city | state | workplace
1  Fred Collins | Oct 1959 | Seattle | WA | Microsoft
2  Tina Peterson | May 1980 | New York | NY | Google
28. Construct a Graph
Construct a graph from semantic types and ontology
29. Determine Relationships
Select minimal tree that connects all semantic types
A customized Steiner tree algorithm [Kou & Markowsky, 1981]
Initial Model
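The slides do not spell out how Karma customizes the algorithm, so the sketch below only follows the general outline of the classic shortest-path approximation for Steiner trees: build the shortest-path closure over the terminals, take its minimum spanning tree, expand each edge back into its path, take a spanning tree again, and prune non-terminal leaves. The ontology graph, its edge weights, and the choice of terminals are hypothetical.

```python
import heapq
from itertools import combinations

def dijkstra(graph, source):
    """Shortest distances and predecessors from source (weighted, undirected)."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def kruskal(nodes, edges):
    """Minimum spanning tree; edges are (weight, u, v) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def steiner_tree(graph, terminals):
    """Approximate minimal tree connecting all terminals (graph assumed connected)."""
    terminals = list(terminals)
    paths = {t: dijkstra(graph, t) for t in terminals}
    # Steps 1-2: shortest-path closure over the terminals and its spanning tree.
    closure = [(paths[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    # Step 3: expand every closure edge back into its shortest path.
    sub_edges = set()
    for _, a, b in kruskal(terminals, closure):
        prev, node = paths[a][1], b
        while node != a:
            p = prev[node]
            sub_edges.add((graph[p][node], *sorted((p, node))))
            node = p
    # Step 4: spanning tree of the expanded subgraph.
    sub_nodes = {n for _, u, v in sub_edges for n in (u, v)}
    tree = kruskal(sub_nodes, sub_edges)
    # Step 5: repeatedly drop leaves that are not terminals.
    while True:
        degree = {}
        for _, u, v in tree:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        leaves = {n for n, d in degree.items() if d == 1 and n not in terminals}
        if not leaves:
            return sorted((u, v) for _, u, v in tree)
        tree = [e for e in tree if e[1] not in leaves and e[2] not in leaves]

# Hypothetical ontology graph: classes as nodes, candidate links as weighted edges.
graph = {
    "Person": {"City": 1, "Organization": 1, "State": 3},
    "City": {"Person": 1, "State": 1, "Organization": 1},
    "State": {"City": 1, "Person": 3},
    "Organization": {"Person": 1, "City": 1},
}
edges = steiner_tree(graph, ["Person", "State", "Organization"])
print(edges)
```

In this toy graph City is not a terminal, yet it ends up in the tree because routing Person to State through City is cheaper than the direct link. That is the behavior that lets a model include ontology classes that no column maps to directly.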
30. Refining the Model
Correct Model
Impose constraints on Steiner Tree Algorithm
31. Knowledge Graphs
Karma uses semantic models to create knowledge graphs
Karma semi-automatically builds semantic models
32. American Art Collaborative
• Consortium of 14 American art museums
• Explore the use of Linked Data for research, education, and outreach
• Build 5-star Linked Data for the museums
• Create tools to support the construction of Linked Data
[Knoblock et al., ISWC 2017]
33. Example Model of Actor for Amon Carter
34. Complete Model of Actor for Amon Carter
38. Statistics on What Was Mapped
39. Steps To Build a KG
Stages: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
50. Predict Links In Original Graph
Figure: graph with nodes Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5
51. Re-Clustering Improves Reconstruction Quality
Figure: the graph is shown twice, each time with nodes Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5
53. Steps To Build a KG
Stages: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
58. Graph Construction
assembling the data for efficient query & analysis
- Data represented in JSON-LD
- Stored in ElasticSearch
• Cloud-based search engine based on Apache Lucene
• Horizontal scaling, replication, load balancing
• Queries are fast!
• Everything is a document
- bulk loading: massive data imports (> 100M web pages)
- real-time updates: live, changing data (~5,000 pages/hour)
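As a rough illustration of "everything is a document", the sketch below serializes JSON-LD documents into the newline-delimited body that the Elasticsearch bulk API accepts: one action line, then one source line per document. The index name and the sample documents are hypothetical, and the snippet only builds the payload without sending it.

```python
import json

def bulk_payload(docs, index="knowledge-graph"):
    """Serialize JSON-LD documents into a newline-delimited bulk request body."""
    lines = []
    for doc in docs:
        # Action line: index this document under its JSON-LD @id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["@id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

# Hypothetical JSON-LD documents.
docs = [
    {"@id": "post/1", "@type": "Post", "text": "Bro can you give me a.."},
    {"@id": "vuln/CVE-2016-1052", "@type": "Vulnerability", "name": "CVE-2016-1052"},
]
payload = bulk_payload(docs)
print(payload)
```

The same payload shape serves both modes on the slide: large batches for the initial import and small batches for live updates.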
60. Indexing for High Performance Knowledge Graph Queries
Chart: average query times in milliseconds, single-user query load, 1.2 billion triples
Compared: State of the Art Graph Database (RDF) vs. DIG indexing deployed in ElasticSearch
61. Knowledge Graph Performance in Cyber Domain
• Index time for 16 million documents: ~2.5 hours
• Query times:
  • Average query time for keyword searches: 8 msec
  • Find a specific CVE: 14 msec
  • Get all mentions of a MS Bulletin in all sources: 48 msec
  • Get all Malware named ‘Locky’ and sort results by observedDate: 68 msec
  • Get all blogs mentioning keyword ‘microsoft’ with a date range: 98 msec
  • Aggregate and give document counts for each publisher/sensor: 409 msec
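For readers unfamiliar with how such queries look, here is a sketch of four of them written as Elasticsearch Query DSL request bodies. The slides do not show the actual index mapping, so every field name except `observedDate` (`name`, `type`, `text`, `datePublished`, `publisher`) is an assumption, as is the premise that the exact-match fields are indexed as keywords.

```python
import json

def find_cve(cve_id):
    """Exact lookup of one vulnerability by identifier."""
    return {"query": {"term": {"name": cve_id}}}

def malware_by_name(name):
    """All malware with a given name, newest observation first."""
    return {
        "query": {"bool": {"filter": [
            {"term": {"type": "Malware"}},
            {"term": {"name": name}},
        ]}},
        "sort": [{"observedDate": {"order": "desc"}}],
    }

def blogs_mentioning(keyword, start, end):
    """Full-text match on blog posts, restricted to a date range."""
    return {
        "query": {"bool": {
            "must": [{"match": {"text": keyword}}],
            "filter": [
                {"term": {"type": "Blog"}},
                {"range": {"datePublished": {"gte": start, "lte": end}}},
            ],
        }}
    }

def counts_per_publisher():
    """Document counts per publisher; size 0 skips returning the hits."""
    return {"size": 0, "aggs": {"by_publisher": {"terms": {"field": "publisher"}}}}

queries = {
    "cve": find_cve("CVE-2016-1052"),
    "malware": malware_by_name("Locky"),
    "blogs": blogs_mentioning("microsoft", "2017-01-01", "2017-12-31"),
    "counts": counts_per_publisher(),
}
print(json.dumps(queries["blogs"], indent=2))
```

Placing exact-match conditions in the `filter` clause keeps them out of relevance scoring, which generally makes them cheaper and cacheable.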
62. Questions Addressed in This Talk: Lessons Learned
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
2. What is the best representation of the data in a knowledge graph?
• Do we want to use the most detailed ontology possible?
3. How should we deal with missing and incomplete information?
• Manual curation? Automated data cleaning?
4. How do we organize and store the data for efficient access?
• RDF? Triplestore?
63. Questions Addressed in This Talk: Lessons Learned
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
The Web!
Waiting for the rest of the world to adopt the Semantic Web and provide the data in RDF is an approach doomed to failure!
64. Questions Addressed in This Talk: Lessons Learned
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
2. What is the “best” representation of the data in a knowledge graph?
• Do we want to use the most detailed ontology possible?
The simplest one you need for the problem you are trying to solve.
Overly complicated ontologies that attempt to be comprehensive for a domain get in the way of solving the real problems.
65. Questions Addressed in This Talk: Lessons Learned
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
2. What is the best representation of the data in a knowledge graph?
• Carefully curated domain-specific ontologies?
3. How should we deal with missing and incorrect information?
• Manual curation? Automated data cleaning?
Clean where possible, but we need techniques that can cope with these problems.
The world is a messy place, and the ability to deal with it allows us to solve real-world problems.
66. Questions Addressed in This Talk: Lessons Learned
1. Where should the Semantic Web data come from?
• Triplestores? Linked data? Schema.org?
2. What is the best representation of the data in a knowledge graph?
• Carefully curated domain-specific ontologies?
3. How should we deal with missing and incomplete information?
• Manual curation? Automated data cleaning?
4. How do we organize and store the data for efficient access?
• RDF? Triplestore?
In whatever datastore best meets the goals of the problem!
It is a mistake to equate the Semantic Web with triples and triplestores.
67. Important Directions for Future Research
1. Techniques for extracting data from online sources
2. Approaches to quickly build, refine, and extend ontologies to solve specific problems
3. Methods for semantically annotating data from extracted sources
4. Scalable and configurable techniques for entity resolution
5. Highly scalable algorithms for querying and reasoning
6. Ability to publish and query semantic data on web pages
Notes on learning semantic types:
Tokenize the values in a given labeled column into pure alphabetic, numeric, and symbol tokens.
Extract features from the tokens and the column name, and associate them with the column’s semantic type.
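The two notes above can be sketched as follows. The tokenizer splits each value into alphabetic, numeric, and symbol tokens as described. The particular features, the cosine-similarity matching, and the `Person.name` style of labels are assumptions made for illustration, not the learner used in Karma.

```python
import re
from collections import Counter

# One token per run of letters, run of digits, or single symbol character.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

def tokenize(value):
    return TOKEN.findall(value)

def features(column_name, values):
    """Bag of features from a column's values and its name."""
    counts = Counter()
    for value in values:
        for token in tokenize(value):
            if token.isalpha():
                counts["alpha:" + token.lower()] += 1
            elif token.isdigit():
                counts["num_len:%d" % len(token)] += 1  # digit-run length
            else:
                counts["sym:" + token] += 1
    for token in tokenize(column_name):
        counts["col:" + token.lower()] += 1
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

# Labeled columns: features associated with a (hypothetically named) semantic type.
labeled = {
    "Person.name": features("name", ["Fred Collins", "Tina Peterson"]),
    "Person.birthdate": features("date", ["Oct 1959", "May 1980"]),
}

# An unlabeled column is assigned the most similar known semantic type.
unknown = features("dob", ["Jan 1975", "May 1962"])
best = max(labeled, key=lambda t: cosine(unknown, labeled[t]))
print(best)
```

Even with two examples per type, the four-digit year pattern is enough to separate dates from names here, which is in the spirit of the requirement to learn from a small number of examples.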