Slides for a presentation at the Fifth International Workshop on Mining Scientific Publications @ JCDL 2016
Paper: http://mirror.dlib.org/dlib/september16/herrmannova/09herrmannova.html
An Analysis of the Microsoft Academic Graph
1. 1/32
An Analysis of the Microsoft Academic Graph
Drahomira Herrmannova (@robodasha)
&
Petr Knoth (@petrknoth)
KMi, The Open University
2. 2/32
Introduction
• Goal: understand the strengths and limitations
of the Microsoft Academic Graph (MAG) when
applied to scholarly communication tasks
• We study the characteristics of the dataset
and perform a correlation analysis with other
similar datasets
3. 3/32
Questions
• How complete/sparse are the data?
• How many of the graph entities have all
associated metadata fields populated, and how
reliable are those fields?
• How well are the data
conflated/disambiguated?
4. 4/32
Dataset
• Heterogeneous graph comprised of more than
120 million publications and the related
authors, venues, organizations, and fields of
study
• The largest publicly available dataset of
scholarly publications
• The largest dataset of open citation data
6. 6/32
External datasets used
• CORE (Connecting Repositories)
• Mendeley
• Webometrics Ranking of World Universities
• Scimago Journal and Country Rank
8. 8/32
Publication age
• Publication dates from MAG compared with
CORE and Mendeley data
• Intersection found using DOI
Unique DOIs in the MAG 35,569,305
Unique DOIs in CORE 2,673,592
Intersection MAG/CORE 1,690,668
Intersection MAG/CORE/Mendeley 1,314,854
Intersection without missing data 1,258,611
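The DOI-based matching above can be sketched as a set intersection. This is a minimal illustration, not the authors' pipeline: the normalization rules (lower-casing, stripping resolver prefixes), the function names and the toy DOIs are all assumptions made for the example.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common resolver prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def doi_intersection(*collections):
    """Return the set of normalized DOIs present in every collection."""
    sets = [{normalize_doi(d) for d in c if d} for c in collections]
    return set.intersection(*sets)


# Toy records, made up for illustration (not taken from MAG, CORE or Mendeley)
mag_dois = ["10.1000/ABC", "10.1000/def", "10.1000/ghi"]
core_dois = ["doi:10.1000/abc", "https://doi.org/10.1000/def", None]
mendeley_dois = ["10.1000/abc", "10.1000/xyz"]

mag_core = doi_intersection(mag_dois, core_dois)
all_three = doi_intersection(mag_dois, core_dois, mendeley_dois)
print(len(mag_core), len(all_three))
```

Records with a missing DOI are skipped before intersecting, which mirrors the drop from each dataset's DOI count to the intersection counts in the table.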
9. 9/32
Publication age
• Compared using two methods
– Spearman's rho correlation coefficient
– Cumulative distribution function of the difference
between the publication years in the different
datasets
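Both comparison methods can be sketched in a few lines of standard-library Python. The example assumes that tied years receive average ranks and that the CDF is taken over the absolute year difference; the slides do not state either detail, and the years below are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def year_difference_cdf(years_a, years_b):
    """Map each absolute year difference d to the share of pairs with difference <= d."""
    diffs = [abs(a - b) for a, b in zip(years_a, years_b)]
    n = len(diffs)
    return {d: sum(1 for v in diffs if v <= d) / n for d in sorted(set(diffs))}


# Invented publication years for five matched papers
mag_years = [2001, 2005, 2010, 2012, 2015]
core_years = [2001, 2006, 2010, 2011, 2015]

rho = spearman_rho(mag_years, core_years)
cdf = year_difference_cdf(mag_years, core_years)
print(rho, cdf)
```

The correlation shows whether the two datasets order papers by age consistently, while the CDF shows how large the disagreements are in years.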
12. 12/32
Authors and affiliations
• Publications linked to author and affiliation
entities
• All publications linked to one or more authors,
however 105,980,107 (~83%) publications not
linked to any affiliation
13. 13/32
Authors and affiliations
Mean number of authors per paper 2.66
Max authors per paper 6,530
Mean number of papers per author 2.94
Max number of papers per author 153,915
Mean number of collaborators 116.93
Max number of collaborators 3,661,912
Number of papers with affiliation 20,928,914
Mean number of affiliations per paper 0.23
Max number of affiliations per paper 181
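Statistics of this kind can be derived from paper-author links, as in the sketch below. The `(paper_id, author_id)` pairs are hypothetical and do not follow the actual MAG file schema, and "collaborators" is assumed to mean distinct co-authors, which the slides do not define.

```python
from collections import defaultdict

# Hypothetical (paper_id, author_id) links; real MAG files use their own schema.
paper_author = [
    ("p1", "a1"), ("p1", "a2"), ("p1", "a3"),
    ("p2", "a1"), ("p2", "a4"),
    ("p3", "a5"),
]

authors_of = defaultdict(set)  # paper -> its authors
papers_of = defaultdict(set)   # author -> their papers
for paper, author in paper_author:
    authors_of[paper].add(author)
    papers_of[author].add(paper)

# Assumed definition: everyone who shares at least one paper with the author.
collaborators = {
    author: set().union(*(authors_of[p] for p in papers)) - {author}
    for author, papers in papers_of.items()
}

mean_authors_per_paper = sum(len(a) for a in authors_of.values()) / len(authors_of)
mean_papers_per_author = sum(len(p) for p in papers_of.values()) / len(papers_of)
mean_collaborators = sum(len(c) for c in collaborators.values()) / len(collaborators)
max_collaborators = max(len(c) for c in collaborators.values())

print(mean_authors_per_paper, mean_papers_per_author,
      mean_collaborators, max_collaborators)
```

Extreme maxima such as those in the table are a useful signal in this kind of computation: a single wrongly merged author entity inflates both the paper count and the collaborator count.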
14. 14/32
Authors and affiliations
• Paper with most authors: "Sunday, 26 August
2012"
• Author with most papers: "united vertical
media gmbh"
15. 15/32
Journals and conferences
• Papers linked to publication venues
• Of all papers in MAG (over 126 million), more
than 51 million (~40%) are linked to a journal
and 1.7 million to a conference entity
16. 16/32
Fields of study
• FoS in MAG organised hierarchically into four
levels (0-3)
– 47,989 at level 3
– 1,966 at level 2
– 293 at level 1
– 18 at level 0
• Over 41 million papers are linked to one or
more fields of study (~33%)
19. 19/32
Citation network
• We study the network by
– looking at the citation distribution, to see whether
it is consistent with previous studies
– comparing the citations received by two types of
entities in the graph (universities and journals)
with citations from external datasets
• Why?
– To understand the quality of the citation data (not
to rank universities or journals)
20. 20/32
Citation network
• 528,682,289 internal citations
• Significant portion of papers disconnected
from the graph
Total number of papers 126,909,021
Papers with zero references 96,850,699
Papers with zero citations 89,647,949
Papers with zero references and citations 80,166,717
Mean citation per paper 4.17
Mean citation per "connected" paper 11.31
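The two means follow directly from the counts in the table, assuming a "connected" paper is one that has at least one reference or one citation (that is, total papers minus papers with neither):

```python
# Counts taken from the table above
internal_citations = 528_682_289
total_papers = 126_909_021
isolated_papers = 80_166_717  # zero references and zero citations

# Assumption: "connected" means not isolated
connected_papers = total_papers - isolated_papers

mean_all = internal_citations / total_papers
mean_connected = internal_citations / connected_papers
isolated_share = isolated_papers / total_papers

print(f"connected papers: {connected_papers:,}")
print(f"mean citations per paper: {mean_all:.2f}")
print(f"mean citations per connected paper: {mean_connected:.2f}")
print(f"share of isolated papers: {isolated_share:.1%}")
```

Under this reading, roughly 63% of papers take no part in the citation network, which is why the mean nearly triples once they are excluded.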
21. 21/32
Citation network
• Comparison of university and journal citation
data found in MAG with the Ranking Web of
Universities (RWoU) and the Scimago Journal
& Country Rank (SJCR) citation data
• Two comparison methods
– Size of overlap of the top university/journal lists
– Pearson’s and Spearman’s correlation (calculated
on matching items)
22. 22/32
Citation network
• Matched 1,255 universities between MAG and
RWoU (2,105 in total), and 13,050 journals
between MAG and SJCR (22,878 in total)
• 4 journals in common among the top 10
• 54 among the top 100
• 677 among the top 1,000 and 1,407 among the
top 2,000
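The overlap figures above come from the first comparison method, the size of the overlap of the top lists. A minimal sketch, assuming each dataset is reduced to a mapping from a matched journal to its citation count (the names and counts are invented):

```python
def top_k_overlap(citations_a, citations_b, k):
    """Count items that are among the k most cited in both datasets."""
    top_a = set(sorted(citations_a, key=citations_a.get, reverse=True)[:k])
    top_b = set(sorted(citations_b, key=citations_b.get, reverse=True)[:k])
    return len(top_a & top_b)


# Invented citation counts for five matched journals
mag = {"J1": 900, "J2": 850, "J3": 400, "J4": 390, "J5": 50}
external = {"J1": 1000, "J3": 700, "J5": 650, "J2": 300, "J4": 100}

for k in (2, 3, 5):
    print(k, top_k_overlap(mag, external, k))
```

The overlap reaches k only when both datasets agree on the membership of the top-k set; it ignores the order inside that set, which is why the slides pair it with correlation measures.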
25. 25/32
Citation network
• To quantify how much the lists differ, we
created histograms of the differences between
the ranks in the MAG and in the external lists
• To produce the histograms
– Sorted the data by number of citations found in
the external dataset
– For top 100/1000 universities/journals created a
histogram of absolute difference between rank in
MAG and in external dataset
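The histogram procedure can be sketched as follows. The example assumes both inputs contain only matched items, that ranks are 1-based with the most cited item first, and that differences are grouped into fixed-width bins; the bin width and the toy counts are illustrative choices, not values from the study.

```python
from collections import Counter


def rank_by_citations(citations):
    """Map each item to its 1-based rank, most cited first."""
    ordered = sorted(citations, key=citations.get, reverse=True)
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def rank_difference_histogram(mag, external, top_n, bin_width):
    """Bin |rank in MAG - rank in external| for the external list's top_n items."""
    mag_rank = rank_by_citations(mag)
    ext_rank = rank_by_citations(external)
    # Sort by the external dataset and keep the top_n items matched in MAG
    top_items = [i for i in sorted(ext_rank, key=ext_rank.get) if i in mag_rank][:top_n]
    diffs = [abs(mag_rank[i] - ext_rank[i]) for i in top_items]
    bins = Counter(d // bin_width * bin_width for d in diffs)
    return dict(sorted(bins.items()))


# Invented citation counts for six matched universities
mag = {"U1": 500, "U2": 100, "U3": 450, "U4": 300, "U5": 50, "U6": 20}
external = {"U1": 900, "U2": 800, "U3": 700, "U4": 600, "U5": 500, "U6": 400}

histogram = rank_difference_histogram(mag, external, top_n=4, bin_width=2)
print(histogram)
```

Each key is the lower edge of a bin, so with `bin_width=2` the key `0` collects rank differences of 0 or 1.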
27. 27/32
Rank difference – top 100 universities
• University citation rank in the MAG differs by
more than 200 positions for about 20% of
universities in the top 100 of the Ranking Web
of Universities list
• The university citation rank differs by less than
25 positions for less than 40% of universities
across these two datasets
31. 31/32
Citation network
• Ranks of top universities differ on average by
163, with standard deviation of 185
• Ranks of top journals differ on average by
1,203 with standard deviation of 1,211
• Correlations calculated on matching items
Measure          Universities        Journals
Pearson’s r      0.8773, p -> 0.0    0.8246, p -> 0.0
Spearman’s rho   0.8266, p -> 0.0    0.8973, p -> 0.0
32. 32/32
Conclusions
• MAG data correlate well with external datasets
• We have identified certain limitations as to the
completeness of links from publications to other
entities
• Existing university and journal rankings
(proprietary data) produce substantially different
results
– MAG is open and transparent at the level of individual
citations, so it is possible to verify and better interpret
the citation data
• Currently the most comprehensive publicly
available dataset of its kind