Building knowledge graphs in DIG
Pedro Szekely and Craig Knoblock
University of Southern California
Information Sciences Institute
dig.isi.edu
Goal
USC Information Sciences Institute CC-By 2.0 2
raw • messy • disconnected → clean • organized • linked
hard to query, analyze & visualize → easy to query, analyze & visualize
Use Case: Human Trafficking
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
Salient Statistics on Human Trafficking
• Profits per Year: $32 Billion
• Average Age of Entry To Prostitution in the US: 14
• Pimp’s Profit Per Victim Per Year: $150,000
• Advertising Budget On the Web: $45 Million
Task: Tracking the Victim’s Locations
> 100 million pages advertising adult services
Example: Investigating a Reported Victim
San Diego, where else?
DIG Interface: Find the locations where a potential victim was advertised
Steps To Build a DIG
Pipeline: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface
Diagram labels: Crawling, Extraction, Data Acquisition, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization
Components: Elastic Search, Graph DB, schema.org, geonames
Data Acquisition
downloading relevant data
batch • real-time
Web pages • Web service • database • CSV • Excel • XML • JSON
Feature Extraction
from raw sources to structured data
• trainable text extractors
• extraction from structured Web pages
• image features
• PDF extractor
Feature Extraction from Text
Input text:
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
Extracted features:
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min, $80/30min, $120/60min
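The phone feature above can be illustrated with a small rule-based normalizer. This is only a sketch of the kind of hand-written baseline the deck compares against, not DIG's trainable extractor; the function name `normalize_phone` and the lookup tables are hypothetical, and it assumes the input is a single phone-like token rather than a whole ad.

```python
import re

# Hypothetical lookup tables for this sketch only.
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
LOOKALIKES = {"○": "0", "o": "0"}  # characters commonly used in place of zero


def normalize_phone(token):
    """Turn an obfuscated phone token into NNN-NNN-NNNN, or None if it is not 10 digits."""
    text = token.lower()
    # Replace spelled-out digits first, so the "o" in "two" or "four"
    # is not mistaken for a zero in the next step.
    for word, digit in WORD_DIGITS.items():
        text = text.replace(word, digit)
    for char, digit in LOOKALIKES.items():
        text = text.replace(char, digit)
    digits = re.sub(r"\D", "", text)  # drop separators such as "~"
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


if __name__ == "__main__":
    print(normalize_phone("7○7~7two7~7four77"))
```

Rules like these are brittle: each new obfuscation trick needs a new rule, which is one motivation for training extractors from examples instead.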
20 Examples
1,000’s of Tasks (2 Cents/Sentence)
Performance of CRF Extractors (chart values, in percent)
Attribute  Method               Precision  Recall  F
Eyes       Regular Expressions  80         10      18
Eyes       DIG                  99         91      94
Hair       Regular Expressions  80         6       12
Hair       DIG                  99         73      84
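The F column in the chart is presumably the usual F-measure, the harmonic mean of precision and recall; the slides do not state the formula, so treat that as an assumption. A minimal sketch of the computation follows. Because the chart shows rounded precision and recall, values recomputed this way can differ from the charted F by about a point.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); inputs in the same unit, e.g. percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision/recall pairs read off the chart (percent).
    for label, p, r in [("Eyes, regular expressions", 80, 10),
                        ("Hair, DIG", 99, 73)]:
        print(label, round(f_measure(p, r), 1))
```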
Structured Extraction
Automated Extraction
input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (one per cluster) → extractor
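The "classify by templates" step can be sketched as grouping pages whose HTML tag structure is identical. This is a deliberately simple stand-in, not the method used in DIG: real template clustering has to tolerate pages of the same template that differ in length or optional blocks. The names `TagCollector`, `template_signature`, and `cluster_by_template` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of opening tags (with class names) seen in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{css_class}" if css_class else tag)


def template_signature(html):
    """Return the page's tag sequence, ignoring all text content."""
    collector = TagCollector()
    collector.feed(html)
    return tuple(collector.tags)


def cluster_by_template(pages):
    """Group page ids whose signatures match exactly (a simplifying assumption)."""
    clusters = defaultdict(list)
    for page_id, html in pages.items():
        clusters[template_signature(html)].append(page_id)
    return list(clusters.values())


if __name__ == "__main__":
    pages = {
        "a": "<div class='ad'><h1>First</h1><p>text</p></div>",
        "b": "<div class='ad'><h1>Second</h1><p>other</p></div>",
        "c": "<table><tr><td>listing</td></tr></table>",
    }
    print(cluster_by_template(pages))
```

Once pages are grouped, an extractor can be inferred per cluster, since fields tend to sit at the same position in every page of a template.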
Unsupervised Extraction Tool
Extraction Evaluation
10 websites, 5 pages each

Field         Perfect       Pretty Good
Title         1.0 (50/50)   1.0 (50/50)
Desc          .76 (37/49)   .98 (48/49)
Seller        .95 (40/42)   .95 (40/42)
Date          .83 (40/48)   .83 (40/48)
Price         .87 (39/45)   .98 (44/45)
Loc           .51 (23/45)   .84 (38/45)
Cat           .68 (34/50)   .88 (44/50)
Member Since  1.0 (35/35)   1.0 (35/35)
Expires       .52 (15/29)   .55 (16/29)
Views         .76 (19/25)   1.0 (25/25)
ID            .97 (35/36)   1.0 (36/36)
Feature Alignment
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
Karma: Mapping Data to Ontologies
Inputs: Services, Relational Sources, Hierarchical Sources
Ontology: Schema.org
Output: { JSON-LD }
karma.isi.edu
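To make the idea of mapping source columns to a common ontology concrete, here is a minimal sketch that turns one tabular row into a JSON-LD document. It does not use Karma or its model format; the column names and the `COLUMN_TO_PROPERTY` table are hypothetical, and `eyeColor` and `hairColor` are taken from the property names shown later in the deck rather than from the core schema.org vocabulary.

```python
import json

# Hypothetical mapping from one source's column names to ontology properties.
COLUMN_TO_PROPERTY = {
    "Name": "name",
    "Eyes": "eyeColor",    # property name as used in the deck's graph
    "Hair": "hairColor",   # property name as used in the deck's graph
    "Phone #": "telephone",
}


def row_to_jsonld(row, entity_type="Person"):
    """Map a source row to a JSON-LD dict, skipping unmapped or empty columns."""
    doc = {"@context": "http://schema.org", "@type": entity_type}
    for column, value in row.items():
        prop = COLUMN_TO_PROPERTY.get(column)
        if prop and value:
            doc[prop] = value
    return doc


if __name__ == "__main__":
    row = {"Name": "Kim", "Eyes": "green", "Hair": "black", "Source ID": "17"}
    print(json.dumps(row_to_jsonld(row), indent=2))
```

A second source that spells the same columns differently would only need its own mapping table; the output vocabulary stays the same, which is the point of feature alignment.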
Karma Solves Feature Alignment
Provenance
Domain Schema
took ~30 minutes to align the output of the Stanford name extractor
Feature Alignment Statistics
• 5 contractors provided data
• ~ 15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour of processing on a 20-node Hadoop cluster
Entity Resolution
merging records that refer to the same entity
Challenges:
- missing data
- incorrect data
- scale (~50 million records)
currently working on techniques to address these challenges
Entity Resolution on Strong Attributes
Record 1 (Offer-1, Person-1, AdultService-1; edges: seller, itemProvided)
- availableAt: Santa Barbara
- phone: 619-319-7315
- price: 250/hour
- startDate: 2014-12-07
- name: Jessica
- hairColor: red
- eyeColor: blue
Record 2 (Offer-2, Person-2, AdultService-2; edges: seller, itemProvided)
- availableAt: Washington DC
- phone, email (values not present in the extracted text)
- price: 250/hour
- startDate: 2014-05-28
- name: Jessica
- eyeColor: blue
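Resolution on strong attributes can be sketched as grouping records that share an exact value for an identifying field. The sketch below uses a union-find structure so that links chain transitively (A shares a phone with B, B shares an email with C, so all three are merged). The choice of `phone` and `email` as strong attributes, the record layout, and the sample values are assumptions for illustration, not DIG's implementation.

```python
from collections import defaultdict

STRONG_ATTRIBUTES = ("phone", "email")  # assumed identifying fields


def resolve_by_strong_attributes(records):
    """Group record ids that are connected through shared strong-attribute values."""
    parent = {rid: rid for rid in records}

    def find(rid):
        # Follow parent links to the root, compressing the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first record id carrying it
    for rid, record in records.items():
        for attr in STRONG_ATTRIBUTES:
            value = record.get(attr)
            if not value:
                continue  # missing data: nothing to link on
            key = (attr, value)
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return list(groups.values())


if __name__ == "__main__":
    records = {
        "ad1": {"phone": "555-0100"},
        "ad2": {"phone": "555-0100", "email": "a@example.com"},
        "ad3": {"email": "a@example.com"},
        "ad4": {"phone": "555-0199"},
    }
    print(resolve_by_strong_attributes(records))
```

Exact matching only works when the attribute values are clean, which is why missing and incorrect data are listed as open challenges.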
Linking Using Text Similarity
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
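The three ads above differ only in the name and in spacing tricks, so even a crude similarity measure separates them from unrelated text. The sketch below lowercases the text, keeps only runs of letters, and computes Jaccard similarity over the resulting token sets. The slides do not say which similarity measure DIG uses, so this is an illustrative choice, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return the set of letter-only tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the two texts' token sets (1.0 if both are empty)."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    ad_1 = ("E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. "
            "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
    ad_2 = ("LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. "
            "LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S")
    print(jaccard(ad_1, ad_2))
```

At the scale mentioned in the deck, comparing every pair of ads is not feasible, so a measure like this would normally be paired with blocking or hashing to limit the candidate pairs.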
Linking Using Image Similarity
100 Million Images
Technology: Deep Learning
The two records from the strong-attribute example (Offer-1 / Person-1 / AdultService-1 and Offer-2 / Person-2 / AdultService-2) are annotated: same victim, same trafficker
Unsupervised Collective Entity Resolution
Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
Elastic Search Data Model
Adult Service • Offer • Person • Phone • Web Page
Indexing for High Performance Knowledge Graph Queries
[Chart: Avg. Query Times in Milliseconds, Single User Query Load, 1.2 billion triples; series: State of the Art Graph Database (RDF) vs. DIG indexing deployed in ElasticSearch]
DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers
DIG Applications
Human Trafficking
large, real users
Material Science Research
70,000 paper abstracts (built in 1 week)
Arms Trafficking
Identify illegal sales
Patent Trolls
Identify patent trolls
Cyber Attacks
Predict cyber attacks from dark web data
Conclusions
• Complete tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement
Questions?
dig.isi.edu
Open Source, Apache 2 License
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 

Building Knowledge Graphs in DIG

  • 1. Building knowledge graphs in DIG Pedro Szekely and Craig Knoblock University of Southern California Information Sciences Institute dig.isi.edu
  • 2. Goal: from raw · messy · disconnected (hard to query, analyze & visualize) to clean · organized · linked (easy to query, analyze & visualize). USC Information Sciences Institute, CC-By 2.0
  • 3. Use Case: Human Trafficking: from raw · messy · disconnected (hard to query, analyze & visualize) to clean · organized · linked (easy to query, analyze & visualize)
  • 4. Use Case: Human Trafficking: 100 million pages, ~100 Web sites; help victims, prosecute traffickers
  • 5. Salient Statistics on Human Trafficking • Profits per Year: $32 Billion • Average Age of Entry To Prostitution in the US: 14 • Pimp’s Profit Per Victim Per Year: $150,000 • Advertising Budget On the Web: $45 Million
  • 6. Task: Tracking the Victim’s Locations: > 100 million pages advertising adult services
  • 7. Example: Investigating a Reported Victim: San Diego, where else?
  • 8. DIG Interface: Find the locations where a potential victim was advertised
  • 9. Steps To Build a DIG (pipeline diagram): Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Other diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames. Highlighted: Data Acquisition
  • 10. Data Acquisition: downloading relevant data; batch · real-time; Web pages · Web service · database · CSV · Excel · XML · JSON
  • 11. Steps To Build a DIG (same pipeline diagram as slide 9)
  • 12. Feature Extraction: from raw sources to structured data • trainable text extractors • extraction from structured Web pages • image features • PDF extractor
  • 13. Feature Extraction from Text. Input: “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”. Extracted: name: Kim; eye-color: green; hair-color: black; phone: 707-727-7477; rate: $60/15min, $80/30min, $120/60min
  • 14. 20 Examples
  • 15. 1,000’s of Tasks (2 Cents/Sentence)
  • 16. Performance of CRF Extractors (two bar charts, labeled Eyes and Hair; values are Precision / Recall / F). First chart: Regular Expressions 80 / 10 / 18, DIG 99 / 91 / 94. Second chart: Regular Expressions 80 / 6 / 12, DIG 99 / 73 / 84
  • 20. Extraction Evaluation (10 websites, 5 pages each), per field as Perfect / Pretty Good: Title 1.0 (50/50) / 1.0 (50/50); Desc .76 (37/49) / .98 (48/49); Seller .95 (40/42) / .95 (40/42); Date .83 (40/48) / .83 (40/48); Price .87 (39/45) / .98 (44/45); Loc .51 (23/45) / .84 (38/45); Cat .68 (34/50) / .88 (44/50); Member Since 1.0 (35/35) / 1.0 (35/35); Expires .52 (15/29) / .55 (16/29); Views .76 (19/25) / 1.0 (25/25); ID .97 (35/36) / 1.0 (36/36)
  • 21. Steps To Build a DIG (same pipeline diagram as slide 9)
  • 22. Feature Alignment: from multiple schemas to a common domain schema - CSV, Excel - Database tables - Web services - Extractors - Nomenclature - Spelling (diagram label: Multiple Schemas)
  • 23. Karma: Mapping Data to Ontologies (diagram labels: Services, Relational Sources, Hierarchical Sources, Karma, { JSON-LD }, Schema.org). karma.isi.edu
  • 24. Karma Solves Feature Alignment (diagram labels: Provenance, Domain Schema): took ~30 minutes to align the output of the Stanford name extractor
  • 25. Feature Alignment Statistics • 5 contractors provided data • ~ 15 datasets • > 30 Karma models • > 200 million records • 1 hour processing in a 20-node Hadoop cluster
  • 26. Steps To Build a DIG (same pipeline diagram as slide 9)
  • 27. Entity Resolution: merging records that refer to the same entity; missing data, incorrect data, scale (~50 million records); currently working on techniques to address
  • 28. Entity Resolution on Strong Attributes (graph diagram labels): AdultService-1 Person-1 Offer-1 availableAt seller phone 619-319-7315 Santa Barbara hairColor red price 250/hour startDate 2014-12-07 eyeColor blue name Jessica itemProvided; Offer-2 Person-2 availableAt Washington DC phone seller email price 250/hour startDate 2014-05-28 AdultService-2 eyeColor blue name Jessica itemProvided
  • 29. Linking Using Text Similarity: “E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S” / “LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S” / “L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S”
  • 30. Linking Using Image Similarity: 100 Million Images; Technology: Deep Learning
  • 32. Unsupervised Collective Entity Resolution
  • 33. Steps To Build a DIG (same pipeline diagram as slide 9)
  • 34. Graph Construction: assembling the data for efficient query & analysis - ElasticSearch: scalable, efficient query - graph databases: network analytics - NoSQL: scalable analytics - bulk loading: massive data imports - real-time updates: live, changing data
  • 35. Elastic Search Data Model: Adult Service, Offer, Person, Phone, Web Page
  • 36. Indexing for High Performance Knowledge Graph Queries (chart: Avg. Query Times in Milliseconds; Single User Query Load; 1.2 billion triples; compared: State of the Art Graph Database (RDF) and DIG indexing deployed in ElasticSearch)
  • 37. Steps To Build a DIG (same pipeline diagram as slide 9)
  • 40. DIG Deployment for Human Trafficking: - 100 million Web pages - Live updates (~5,000 pages/hour) - ElasticSearch database (7 nodes) - Hadoop workflows (20 nodes) - District Attorney - Law Enforcement - NGOs
  • 42. DIG Applications: Human Trafficking (large, real users); Material Science Research (70,000 paper abstracts, built in 1 week); Arms Trafficking (identify illegal sales); Patent Trolls (identify patent trolls); Cyber Attacks (predict cyber attacks from dark web data)
  • 43. Conclusions • Complete tool-chain to build domain-specific knowledge graphs • Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc. • Scales to ~100 million pages, ~3 billion facts • Deployed to law enforcement
  • 44. Questions? dig.isi.edu. Open Source, Apache 2 License
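
Linking on strong attributes (slide 28) can be sketched as grouping records that share an explicit value such as a phone number or email. The snippet below is only an illustration under stated assumptions, not DIG's implementation: the function name `link_by_strong_attributes`, the record layout, the shared phone on Person-2 and the Person-3 record are all hypothetical.

```python
from collections import defaultdict


def link_by_strong_attributes(records, keys=("phone", "email")):
    """Cluster record ids that share any strong attribute value (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (attribute, value) -> first record id carrying it
    for r in records:
        for key in keys:
            value = r.get(key)
            if not value:
                continue  # missing data: this attribute gives no evidence
            if (key, value) in first_seen:
                union(first_seen[(key, value)], r["id"])
            else:
                first_seen[(key, value)] = r["id"]

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r["id"])].add(r["id"])
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records; only the first phone number appears on the slide.
records = [
    {"id": "Person-1", "phone": "619-319-7315", "name": "Jessica"},
    {"id": "Person-2", "phone": "619-319-7315", "name": "Jessica"},  # assumed to share the phone
    {"id": "Person-3", "phone": "555-0100", "name": "Jessica"},      # made-up record
]
clusters = link_by_strong_attributes(records)
print(clusters)
```

Matching names alone do not merge Person-3 here, because a name is not treated as a strong attribute in this sketch.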

Editor's notes

  1. Karma offers suggestions on how to do the mapping
  2. The simplest kind of linking we do is based on strong, explicit attributes (phones, emails, websites, etc.). So Person-1 and Person-2 might be the same person … but can we find more attributes to improve our confidence …
  3. Estimating text similarity is challenging. Here we are emphasizing stylometric similarity: map -> n-grams -> Jaccard similarity
  4. Clever scheme for storing pair-wise similarities in a database that can be updated incrementally (so we can bypass hashing that leverages ElasticSearch with Lucene)
  5. Why is linking significant in this domain? Slide shows why.
  6. There are some clever tricks. We produce JSON documents rooted on the classes we care about. Each contains enough of the graph neighborhood for keyword queries to work, so that I can search for an AdultService using a phone number even though the phone number is really part of the seller, or search for all offers that have the same phone number. Basically, we copy over some content.
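
The "map -> n-grams -> Jaccard similarity" pipeline in note 3 can be sketched as follows. The notes do not say what the map step does, so this sketch assumes it replaces each character with a coarse class (uppercase, lowercase, digit) so that writing style rather than wording is compared. The function names and the choice of character trigrams are likewise assumptions.

```python
def shape(text):
    """Map each character to a coarse class (assumed form of the 'map' step)."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        elif ch.isspace():
            out.append(" ")
        else:
            out.append(ch)  # keep punctuation and symbols as they are
    return "".join(out)


def ngrams(s, n=3):
    """Set of character n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(t1, t2, n=3):
    return jaccard(ngrams(shape(t1), n), ngrams(shape(t2), n))


# Different words, same alternating-case style: both map to "aAaA aA".
same_style = style_similarity("cAlL mE", "rInG uS")
# Same word, different casing style: "aaa" versus "AAA" share no trigram.
different_style = style_similarity("abc", "ABC")
print(same_style, different_style)
```

Because only character classes are compared, two texts with different names but the same typing habits score high, which is the property the slide 29 example relies on.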