BUILDING BETTER
KNOWLEDGE
GRAPHS
THROUGH SOCIAL
COMPUTING
Elena Simperl
University of Southampton, UK
@esimperl
Erasmus University Rotterdam
November 16th 2018
OVERVIEW
Knowledge graphs have
become a critical AI resource
We study them as socio-
technical constructs
Our research
 Explores the links between social and
technical qualities of knowledge graphs
 Proposes methods and tools to make
knowledge graphs better
Picture from https://medium.com/@sderymail/challenges-of-knowledge-graph-part-1-d9ffe9e35214
IN THIS TALK
Effects of editing behaviour and
community make-up on the quality
of knowledge graphs
Crowdsourcing methods to enhance
knowledge graphs
EXAMPLE: DBPEDIA
Community project, extracts structured data from Wikipedia
Consistent, centrally defined ontology; support for 125
languages; represents 4.5M items
Open licence
RDF exports, connected to Linked Open Data Cloud
EXAMPLE: WIKIDATA
Wikipedia project creating a knowledge graph
collaboratively
20k active users
52M items, no ‘explicit’ ontology
Open licence
RDF exports, connected to Linked Open Data Cloud
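To make the item-centric structure concrete, here is a minimal Python sketch that fetches a single Wikidata item through the public MediaWiki API and lists its P31 (instance of) statements. It assumes the requests package and network access; Q42 (Douglas Adams) is just an example item.

```python
# Minimal sketch: fetch one Wikidata item over the public API and list its
# P31 (instance of) statements. Assumes `requests` and network access.
import requests

API = "https://www.wikidata.org/w/api.php"

def instance_of(item_id: str) -> list[str]:
    """Return the item IDs that `item_id` is declared an instance of (P31)."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "claims",
        "format": "json",
    }
    entity = requests.get(API, params=params, timeout=30).json()["entities"][item_id]
    targets = []
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") == "value":   # skip 'no value' / 'unknown value' snaks
            targets.append(snak["datavalue"]["value"]["id"])
    return targets

print(instance_of("Q42"))   # Q42 (Douglas Adams) -> e.g. ['Q5'] (human)
```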
‘ONTOLOGIES ARE US’
Piscopo, A., Phethean, C., & Simperl, E. (2017). What Makes a
Good Collaborative Knowledge Graph: Group Composition and
Quality in Wikidata. International Conference on Social
Informatics, 305-322, Springer.
Piscopo, A., & Simperl, E. (2018). Who Models the World?:
Collaborative Ontology Creation and User Roles in
Wikidata. Proceedings of the ACM on Human-Computer
Interaction, 2(CSCW), 141.
BACKGROUND
Wikidata editors have varied tenure and
interests
Editors and editing behaviour impact
outcomes
 Group composition can have multiple effects
 Tenure and interest diversity can increase outcome
quality and group productivity
 Different editor groups focus on different types of
activities
Chen, J., Ren, Y., Riedl, J.: The effects of diversity on group productivity and member withdrawal in online volunteer groups. In: Proceedings of the 28th international
conference on human factors in computing systems - CHI ’10. p. 821. ACM Press, New York, USA (2010)
FIRST STUDY: ITEM QUALITY
Analysed the edit history of items
Corpus of 5k items, whose quality has been
manually assessed (5 levels)*
Edit-history analysis focused on community make-up
Community is defined as the set of editors of an item
Considered features from group diversity
literature and Wikidata-specific aspects
*https://www.wikidata.org/wiki/Wikidata:Item_quality
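As an illustration of how the community of an item can be derived, the sketch below pulls an item's revision history from the MediaWiki API and computes a few simple make-up features (editor set size, proportion of anonymous edits, proportion of bot edits). The study itself worked on full edit-history dumps and used Wikidata's actual bot flags; the name-based bot check here is only a naive stand-in.

```python
# Minimal sketch of the "community of an item" idea: derive simple group
# make-up features from an item's recent revision history.
import requests

API = "https://www.wikidata.org/w/api.php"

def community_features(item_id: str, limit: int = 500) -> dict:
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": item_id,
        "rvprop": "user|timestamp",
        "rvlimit": limit,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    revisions = next(iter(pages.values())).get("revisions", [])

    editors = {rev.get("user", "?") for rev in revisions}
    anon = sum(1 for rev in revisions if "anon" in rev)   # IP (anonymous) edits
    bots = sum(1 for rev in revisions if rev.get("user", "").lower().endswith("bot"))

    n = max(len(revisions), 1)
    return {
        "editors": len(editors),               # community size: set of editors of the item
        "prop_anonymous_edits": anon / n,
        "prop_bot_edits": bots / n,            # naive name heuristic, not the real bot flag
    }

print(community_features("Q42"))
```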
RESEARCH HYPOTHESES
Activity → Outcome
H1 Bot edits → Item quality
H2 Bot-human interaction → Item quality
H3 Anonymous edits → Item quality
H4 Tenure diversity → Item quality
H5 Interest diversity → Item quality
DATA AND METHODS
Ordinal regression analysis, trained four models
Dependent variable: quality level of the 5k labelled Wikidata items
Independent variables
 Proportion of bot edits
 Bot-human edit proportion
 Proportion of anonymous edits
 Tenure diversity: Coefficient of variation
 Interest diversity: User editing matrix
Control variables: group size, item age
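A minimal sketch of this modelling step, assuming a prepared per-item feature table; the column names below are illustrative rather than the study's actual variable names, and the model uses statsmodels' ordered (ordinal) regression.

```python
# Sketch of the modelling step. Assumes a prepared table `item_features.csv`
# with one row per labelled item; column names are illustrative.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

items = pd.read_csv("item_features.csv")     # hypothetical prepared dataset

# Dependent variable: the 5-level quality label, encoded 1..5 here for illustration.
quality = items["quality_level"].astype(
    pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
)

predictors = items[[
    "prop_bot_edits",         # H1: proportion of bot edits
    "prop_bot_human_edits",   # H2: bot-human edit proportion
    "prop_anonymous_edits",   # H3: proportion of anonymous edits
    "tenure_cv",              # H4: tenure diversity (coefficient of variation)
    "interest_diversity",     # H5: derived from the user editing matrix
    "group_size",             # control
    "item_age_days",          # control
]]

model = OrderedModel(quality, predictors, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```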
RESULTS
ALL HYPOTHESES SUPPORTED (H1–H5)
SUMMARY AND IMPLICATIONS
Findings
01 The more is not always the merrier
02 Bot edits are key for quality, but bots and humans together are better
03 Registered editors have a positive impact
04 Diversity matters
Implications
01 Encourage registration
02 Identify further areas for bot editing
03 Design effective human-bot workflows
04 Suggest items to edit based on tenure and interests
SECOND STUDY: ONTOLOGY QUALITY
Analysed the Wikidata ontology and its
edit context
Defined as the graph of all items linked through
P31 (instance of) & P279 (subclass of)
Calculated evolution of quality metrics and
editing activity over time and the links between
them
Based on features from literature on ontology
evaluation and community-driven ontology
engineering
DATA AND METHODS
Wikidata dumps from March 2013 (creation of P279)
to September 2017
 Analysed data in 55 monthly time frames
Literature survey to define a Wikidata ontology
quality framework
Clustering to identify ontology editor roles
Lagged multiple regression to link roles and ontology
features
 Dependent variable: Changes in ontology quality across time
 Independent variables: number of edits by different roles
 Control variables: Bot and anonymous edits
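The lagged regression idea can be sketched as follows, assuming a monthly table of ontology metrics and per-role edit counts has already been built from the dumps; column names are illustrative.

```python
# Sketch of the lagged regression: does role activity in month t-1 help explain
# the change in an ontology metric at month t? Column names are illustrative.
import pandas as pd
import statsmodels.api as sm

monthly = pd.read_csv("monthly_ontology_stats.csv", parse_dates=["month"]).set_index("month")

df = pd.DataFrame({
    "delta_ir": monthly["inheritance_richness"].diff(),        # change in the metric
    "leader_edits_lag": monthly["leader_edits"].shift(1),
    "contributor_edits_lag": monthly["contributor_edits"].shift(1),
    "bot_edits_lag": monthly["bot_edits"].shift(1),             # control
    "anon_edits_lag": monthly["anon_edits"].shift(1),           # control
}).dropna()

X = sm.add_constant(df.drop(columns="delta_ir"))
ols = sm.OLS(df["delta_ir"], X).fit()
print(ols.summary())
```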
ONTOLOGY QUALITY: METRICS
Based on 7 ontology evaluation frameworks
Compiled structural metrics that can be determined from the dumps
Indicator: Description
noi: Number of instances
noc: Number of classes
norc: Number of root classes
nolc: Number of leaf classes
nop: Number of properties
ap, mp: Average and median population
rr: Relationship richness
ir, mr: Inheritance and median richness
cr: Class richness
ad, md, maxd: Average, median, and max explicit depth
Sicilia, M. A., Rodríguez, D., García-Barriocanal, E., & Sánchez-Alonso, S. (2012). Empirical findings on ontology metrics. Expert Systems with
Applications, 39(8), 6706-6711.
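For illustration, a few of these indicators can be computed from lists of P31 and P279 statements as sketched below; the definitions follow common formulations from the ontology-metrics literature (average population as instances per class, inheritance richness as subclasses per class) and may differ in detail from the study's exact operationalisation.

```python
# Sketch of a few structural indicators, computed from two lists of
# (subject, object) pairs extracted elsewhere from a dump:
# `p31` for "instance of" and `p279` for "subclass of" statements.
from collections import Counter
import networkx as nx

def ontology_indicators(p31, p279):
    # A class is any item that is the object of P31 or takes part in P279.
    classes = {o for _, o in p31} | {s for s, _ in p279} | {o for _, o in p279}

    instances_per_class = Counter(o for _, o in p31)
    subclasses_per_class = Counter(o for _, o in p279)

    noc = len(classes)
    noi = len({s for s, _ in p31})
    ap = sum(instances_per_class.values()) / noc                  # average population
    ir = sum(subclasses_per_class.values()) / noc                 # inheritance richness
    cr = sum(1 for c in classes if instances_per_class[c]) / noc  # class richness

    # Max explicit depth: longest P279 chain (only well defined if acyclic).
    g = nx.DiGraph(p279)                                          # edges: subclass -> superclass
    maxd = nx.dag_longest_path_length(g) if nx.is_directed_acyclic_graph(g) else None

    return {"noc": noc, "noi": noi, "ap": round(ap, 2), "ir": round(ir, 2),
            "cr": round(cr, 2), "maxd": maxd}

# Toy example: a small taxonomy with one instance.
print(ontology_indicators(
    p31=[("Misha", "cat")],
    p279=[("cat", "mammal"), ("dog", "mammal"), ("mammal", "animal")],
))
```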
ONTOLOGY QUALITY: RESULTS
LARGE ONTOLOGY, UNEVEN QUALITY
>1.5M classes, ~4,000 properties
Number of classes increases at the same rate as the overall number of items, likely due to users incorrectly using P31 & P279
ap and cr decrease over time (many classes lack instances, subclasses, or both)
ir & maxd increase over time (part of the Wikidata ontology is distributed vertically)
EDITOR ROLES: METHODS
K-means, features based on previous studies
Analysis by yearly cohort
Feature: Description
# edits: Total number of edits per month
# property edits: Total number of edits on properties in a month
# ontology edits: Number of edits on classes
# taxonomy edits: Number of edits on P31 and P279 statements
# discussion edits: Number of edits on talk pages
p batch edits: Number of edits done through automated tools
# modifying edits: Number of revisions on previously existing statements
item diversity: Proportion between number of edits and number of items edited
admin: True if user in an admin user group, false otherwise
lower admin: True if user in a user group with enhanced user rights, false otherwise
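A compact sketch of the role-mining step: standardise per-editor features, pick the number of clusters with a simple gap statistic, and run K-means. The per-editor table and its column names are hypothetical, and the gap-statistic code is a textbook version (maximum gap), not necessarily the study's exact selection rule.

```python
# Sketch: standardise per-editor monthly features, choose k via a simple gap
# statistic, then cluster with K-means. Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

editors = pd.read_csv("editor_features.csv")    # hypothetical per-editor table
X = StandardScaler().fit_transform(editors[[
    "edits", "ontology_edits", "discussion_edits", "modifying_edits",
    "property_edits", "taxonomy_edits", "batch_edits", "item_diversity",
]])

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    ref_logs = []
    for _ in range(n_refs):                      # reference data: uniform over each feature's range
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_logs.append(np.log(KMeans(n_clusters=k, n_init=10).fit(ref).inertia_))
    return np.mean(ref_logs) - log_wk

gaps = {k: gap_statistic(X, k) for k in range(2, 9)}   # tested 2 <= k <= 8
best_k = max(gaps, key=gaps.get)                       # simplest selection rule: maximum gap

editors["role"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(editors.groupby("role").mean(numeric_only=True))
```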
EDITOR ROLES: RESULTS
190,765 unique editors over 55 months (783k
total)
18k editors active for 10+ months
2 clusters, obtained using the gap statistic (tested
2 ≤ k ≤ 8)
Leaders: more active minority (~1%), higher
number of contributions to ontology, engaged
within the community
Contributors: less active, lower number of
contributions to ontology and lower proportion of
batch edits
EDITOR ROLES: RESULTS
People who joined the project early tend to be
more active & are more likely to become leaders
Levels of activity of leaders decrease over time
(alternatively, people move on to different tasks)
RESEARCH HYPOTHESES
H1 Higher levels of leader activity are negatively correlated to
number of classes (noc), number of root classes (norc), and
number of leaf classes (nolc)
H2 Higher levels of leader activity are positively correlated to
inheritance richness (ir), average population (ap), and average
depth (ad)
ROLES & ONTOLOGY: RESULTS
H1 not supported
H2 partially supported
Only inheritance richness (ir) and average depth (ad)
were significantly related to leader edits (p<0.01)
Bot edits significantly and positively affect the number of
subclasses and instances per class (ir & ap) (p<0.05)
SUMMARY AND IMPLICATIONS
Creating ontologies is still a challenging task
The size of the ontology renders existing automatic quality
assessment methods infeasible
Broader curation efforts are needed: large number of
empty classes
Editor roles less well articulated than in other ontology
engineering projects
Possible decline in motivation after several months
NOBODY KNOWS
EVERYTHING, BUT
EVERYBODY KNOWS
SOMETHING
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D.,
Flöck, F., & Lehmann, J. (2016). Detecting Linked Data
quality issues via crowdsourcing: A DBpedia
study. Semantic Web Journal, 1-34.
BACKGROUND
Varying quality of Linked Data sources
Detecting and correcting errors may require manual
inspection
Different crowds are more or less motivated (or
skilled) to undertake specific aspects of this work
We propose a scalable way to carry out this work
Example of an erroneous triple: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
Approach: a Find-Verify workflow
Find (contest, via TripleCheckMate): Linked Data experts, difficult task, final prize
Verify (microtasks, via MTurk): crowd workers, easy task, micropayments
Quality issues covered: incorrect object values, incorrect data types, incorrect outlinks (interlinks)

Results: precision
Approach | Object values | Data types | Interlinks
Linked Data experts | 0.7151 | 0.8270 | 0.1525
MTurk (majority voting) | 0.8977 | 0.4752 | 0.9412

Findings
Use the right crowd for the right task
Experts detect a range of issues, but will not invest additional effort
Turkers can carry out the three tasks and are exceptionally good at data comparisons
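The 'MTurk (majority voting)' row aggregates several worker judgements per triple before comparing against a gold standard. A sketch of that aggregation and of a per-task precision computation follows; the file layout and answer labels ('correct'/'incorrect') are hypothetical.

```python
# Sketch of majority-voting aggregation over worker judgements, plus per-task
# precision against a gold standard. Data layout and labels are hypothetical.
from collections import Counter
import pandas as pd

judgements = pd.read_csv("worker_judgements.csv")   # columns: triple_id, task, worker, answer
gold = pd.read_csv("gold_labels.csv")               # columns: triple_id, task, answer

def majority(answers: pd.Series) -> str:
    return Counter(answers).most_common(1)[0][0]

aggregated = (judgements
              .groupby(["triple_id", "task"])["answer"]
              .agg(majority)
              .rename("crowd_answer")
              .reset_index())

merged = aggregated.merge(gold, on=["triple_id", "task"])

# Precision: of the triples the crowd flagged as incorrect, how many truly are.
flagged = merged[merged["crowd_answer"] == "incorrect"]
precision = (flagged.assign(correct=flagged["answer"] == "incorrect")
             .groupby("task")["correct"].mean())
print(precision)
```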
ALL ROADS LEAD TO
ROME
Bu, Q., Simperl, E., Zerr, S., & Li, Y. (2016). Using
microtasks to crowdsource DBpedia entity classification:
A study in workflow design. Semantic Web Journal, 1-18.
THREE WORKFLOWS
TO ADD MISSING
ITEM TYPES
Free associations
Validating the machine
Exploring the DBpedia ontology
Findings
 Shortlists are easy & fast
 Popular classes are not enough
 Alternative ways to explore the taxonomy
 Freedom comes with a price
 Unclassified entities might be unclassifiable
 Different human data interfaces
 Working at the basic level of abstraction achieves
greatest precision
 But when given the freedom to choose, users suggest more specific
classes
4.58M things described in DBpedia
SUMMARY OF FINDINGS
Social computing offers a useful lens to study knowledge
graphs
The social fabric of a graph affects its quality
Crowdsourcing methods can be used to curate and
enhance knowledge graphs
BUILDING
BETTER
KNOWLEDGE
GRAPHS
THROUGH
SOCIAL
COMPUTING
• Bu, Q., Simperl, E., Zerr, S., & Li, Y. (2016). Using
microtasks to crowdsource DBpedia entity classification: A
study in workflow design. Semantic Web Journal, 1-18
• Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck,
F., & Lehmann, J. (2016). Detecting Linked Data quality
issues via crowdsourcing: A DBpedia study. Semantic Web
Journal, 1-34.
• Piscopo, A., Phethean, C., & Simperl, E. (2017). What
Makes a Good Collaborative Knowledge Graph: Group
Composition and Quality in Wikidata. International
Conference on Social Informatics, 305-322, Springer.
• Piscopo, A., & Simperl, E. (2018). Who Models the
World?: Collaborative Ontology Creation and User Roles
in Wikidata. Proceedings of the ACM on Human-Computer
Interaction, 2(CSCW), 141.