
EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction Based on Innovation-Adoption Priors

The ontology engineering research community has focused for many years on supporting the creation, development and evolution of ontologies. Ontology forecasting, which aims at predicting semantic changes in an ontology, represents instead a new challenge. In this paper, we contribute to this novel endeavour by focusing on the task of forecasting semantic concepts in the research domain. Indeed, ontologies representing scientific disciplines contain only research topics that are already popular enough to be selected by human experts or automatic algorithms. They are thus unfit to support tasks which require the ability to describe and explore the forefront of research, such as trend detection and horizon scanning. We address this issue by introducing the Semantic Innovation Forecast (SIF) model, which predicts new concepts of an ontology at time t+1, using only data available at time t. Our approach relies on lexical innovation and adoption information extracted from historical data. We evaluated the SIF model on a very large dataset consisting of over one million scientific papers belonging to the Computer Science domain: the outcomes show that the proposed approach offers a competitive boost in mean average precision-at-ten compared to the baselines when forecasting over 5 years.

Published in: Science

EKAW 2016 - Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction Based on Innovation-Adoption Priors

  1. 1. Amparo Elizabeth Cano Basave (1), Francesco Osborne (2), Angelo Salatino (2). (1) Aston University, United Kingdom; (2) KMi, The Open University, United Kingdom. EKAW 2016. Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction based on Innovation-Adoption Priors
  2. 2. Osborne, F., Motta, E. and Mulholland, P.: Exploring scholarly data with Rexplore. International Semantic Web Conference 2013. technologies.kmi.open.ac.uk/rexplore/
  3. 3. The Computer Science Ontology 1. Standard research area taxonomies/classifications/ontologies, such as ACM 2012, are not apt to the task: • Not fine-grained enough, e.g., only 2 topics are classified under Semantic Web; • Static and manually defined, hence prone to become obsolete very quickly.
  4. 4. The Computer Science Ontology 2. The Computer Science Ontology (CSO) was automatically created and updated by applying the Klink-2 algorithm. Osborne, F. and Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In: ISWC 2015.
  5. 5. The Computer Science Ontology 3. • We automatically generated a version of CSO consisting of about 15,000 topics linked by about 70,000 semantic relationships. • It includes very granular, low-level research areas and can be regularly updated by running Klink-2 on a new set of publications. • We also have different versions of CSO, obtained by running Klink-2 on the sets of documents up to a given year (CSO 2012, CSO 2013, CSO 2014, CSO 2015, […]).
  6. 6. A shared conceptualization. “Ontologies are a formal, explicit specification of a shared conceptualization” (Studer et al., 1998). “The conceptualization should express a shared view between several parties, a consensus rather than an individual view” (Guarino et al., 2009). “Ontologies are us: inseparable from the context of the community in which they are created and used.” (Mika, 2005). “Ontology Evolution is the timely adaptation of an ontology to the arisen changes and the consistent propagation of these changes to dependent artefacts.” (Stojanovic, 2004)
  7. 7. But what if we cannot wait for shared consensus? These ontologies reflect the past, and can only contain concepts that are already popular enough to be selected by experts or automatic methods. Hence, they hardly support tasks which involve the ability to describe emerging concepts, e.g.: • Exploring the forefront of research; • Trend detection; • Horizon scanning; • Producing smart analytics to inform business decisions.
  8. 8. Ontology Forecasting. Given an ontology at time t, a team of experts and/or a software system considers a number of relevant knowledge sources and updates the ontology by also including new concepts on which there will (probably) be a shared consensus at time t+1. For example, a forecasted ontology of research topics in 2000 may already include a new topic associated with the dynamics preceding the “Semantic Web” (new collaborations between Knowledge-Based Systems, AI and WWW). [timeline: t-n, …, t-1, t, t+1]
  9. 9. Contributions – a first step towards ontology forecasting: 1. We approach the novel task of ontology forecasting by predicting semantic concepts in the research domain. 2. We introduce metrics to analyse the linguistic and semantic progressiveness of scholarly data. 3. We propose Semantic Innovation Forecast (SIF), a novel weakly-supervised approach for forecasting emerging semantic concepts. 4. We evaluate our approach on a dataset of over 1 million documents in the Computer Science domain: the proposed framework offers competitive boosts in mean average precision at ten for forecasts over 5 years.
  10. 10. Scopus (Computer Science) - # of publications. [chart: number of articles per year]
  11. 11. Scopus (Computer Science) - vocabulary size. [chart: vocabulary size per year]
  12. 12. Klink-2 Computer Science Ontology - # of classes
  13. 13. Linguistic Progressiveness. Language innovation in a corpus refers to the introduction of novel patterns of language. We generate a language model per year using Katz back-off smoothing and analyse the differences between consecutive years using the perplexity metric. [chart: perplexity between consecutive yearly language models]
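A rough sketch of this consecutive-year perplexity analysis, assuming tokenised texts grouped by year in a dict named docs_by_year (an illustrative name). The slides use a Katz back-off language model; for brevity the sketch substitutes an add-one smoothed unigram model, so absolute perplexity values will not match the plot.

```python
# A rough sketch (not the authors' code) of the consecutive-year perplexity
# analysis. `docs_by_year` is an assumed dict: year -> list of tokenised texts.
# NOTE: an add-one smoothed unigram model stands in for the Katz back-off model
# used in the slides, so absolute perplexity values will differ.
import math
from collections import Counter

def unigram_model(docs):
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing so that unseen words receive non-zero probability
    return lambda tok: (counts[tok] + 1) / (total + vocab + 1)

def perplexity(model, docs):
    log_prob, n_tokens = 0.0, 0
    for doc in docs:
        for tok in doc:
            log_prob += math.log(model(tok))
            n_tokens += 1
    return math.exp(-log_prob / max(n_tokens, 1))

def yearly_perplexity(docs_by_year):
    # Perplexity of year t's documents under the model trained on year t-1
    # (assumes consecutive years are present in the dict).
    years = sorted(docs_by_year)
    return {t: perplexity(unigram_model(docs_by_year[t - 1]), docs_by_year[t])
            for t in years[1:]}
```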
  14. 14. Linguistic Progressiveness. We also perform a progressiveness analysis based on lexical innovation and lexical adoption. A large number of new words appear each year, but only a few of them are adopted (i.e., still used in the following year). [chart: number of new words and number of adopted words per year]
  15. 15. Measuring Linguistic Progressiveness. We introduce the linguistic progressiveness metric: LP_t = LA_t / LI_t, i.e., the ratio between the number of adopted words and the number of new words in year t. [chart: linguistic progressiveness per year]
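A rough sketch of how LI_t, LA_t and LP_t could be computed from yearly vocabularies, under one plausible reading of the slides (LA_t taken as the subset of year-t innovations still used in year t+1); vocab_by_year is an illustrative name, not from the paper.

```python
# A rough sketch of lexical innovation (LI), lexical adoption (LA) and the
# linguistic progressiveness ratio LP_t = |LA_t| / |LI_t|, under one plausible
# reading of the slides. `vocab_by_year` is an assumed dict: year -> set of
# words used in that year's publications.
def lexical_innovation(vocab_by_year, t):
    # Words appearing in year t but in no earlier year
    earlier = set().union(*(v for y, v in vocab_by_year.items() if y < t))
    return vocab_by_year[t] - earlier

def lexical_adoption(vocab_by_year, t):
    # New words of year t that are still used in the following year
    return lexical_innovation(vocab_by_year, t) & vocab_by_year[t + 1]

def linguistic_progressiveness(vocab_by_year, t):
    li = lexical_innovation(vocab_by_year, t)
    return len(lexical_adoption(vocab_by_year, t)) / len(li) if li else 0.0
```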
  16. 16. Innovation-Adoption Priors. We assume that emerging topics will be associated with novel words, thus we compute priors at time t by considering innovative (LI) and adopted (LA) words. A word prior is a probability distribution that expresses a word's relevance to, in this case, being characteristic of innovative topics. We build the prior matrix by assigning a weight to each term in the vocabulary: 0.7 if w ∈ LI_{t-2} and 0.9 if w ∈ LA_{t-1}, because our analysis shows that recently adopted words (LA) are more often associated with emerging topics than new words (LI).
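A rough sketch of the prior construction described above; the default weight for words outside LI_{t-2} and LA_{t-1}, and the function name, are illustrative assumptions rather than details taken from the paper.

```python
# A rough sketch of the innovation-adoption word prior: 0.9 for recently
# adopted words (LA_{t-1}), 0.7 for recently introduced words (LI_{t-2}),
# and a small default weight otherwise (the default value is an assumption).
import numpy as np

def build_word_prior(vocab, li_t2, la_t1, default=0.01):
    """vocab: list of terms; li_t2 / la_t1: sets of innovative / adopted words."""
    eta = np.full(len(vocab), default)
    for i, w in enumerate(vocab):
        if w in la_t1:       # adopted words carry the strongest signal
            eta[i] = 0.9
        elif w in li_t2:     # newly introduced words carry a weaker signal
            eta[i] = 0.7
    return eta               # used later as an asymmetric topic-word prior
```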
  17. 17. Semantic Innovation Forecast (SIF) model. SIF is a generative probabilistic topic model that takes as input a set of documents at year t and a set of historical priors, and forecasts topic-word distributions representing new concepts in the ontology O_{t+1}.
  18. 18. Semantic Innovation Forecast (SIF) model. We use collapsed Gibbs sampling to infer the model parameters and topic assignments for a corpus at year t+1, given observed documents at year t.
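The slides do not detail the sampler, so the sketch below is not the authors' implementation: it shows a standard LDA-style collapsed Gibbs sampler in which the innovation-adoption weights act as an asymmetric per-word prior on the topic-word distributions, which is one plausible way to realise the SIF priors.

```python
# A rough sketch, NOT the authors' implementation: LDA-style collapsed Gibbs
# sampling where the innovation-adoption weights `eta` (see build_word_prior
# above) play the role of an asymmetric per-word Dirichlet prior on the
# topic-word distributions. `docs` is a list of documents, each a list of
# word ids into a shared vocabulary.
import numpy as np

def gibbs_sample(docs, vocab_size, num_topics, eta, alpha=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), num_topics))   # document-topic counts
    nkw = np.zeros((num_topics, vocab_size))  # topic-word counts
    nk = np.zeros(num_topics)                 # topic totals
    z = []                                    # topic assignment per token

    # Random initialisation of topic assignments
    for d, doc in enumerate(docs):
        z_d = rng.integers(num_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1

    eta_sum = eta.sum()
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # eta[w] replaces the usual symmetric beta
                p = (ndk[d] + alpha) * (nkw[:, w] + eta[w]) / (nk + eta_sum)
                p /= p.sum()
                k = rng.choice(num_topics, p=p)
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    # Topic-word distributions: the forecast concepts' word distributions
    phi = (nkw + eta) / (nkw + eta).sum(axis=1, keepdims=True)
    return phi
```

With the hypothetical names above, phi = gibbs_sample(docs, len(vocab), 50, build_word_prior(vocab, li_t2, la_t1)) would return one word distribution per forecast concept.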
  19. 19. Evaluation. We apply our framework to the Scopus dataset for Computer Science (> 1M publications). Each collection of documents in a year is randomly partitioned into three subsets: 20% is used to derive the innovation priors, 40% is the training set, and 40% is the testing set. We train a SIF model on year t using innovation priors computed for the two previous years (t-1 and t-2), and we use the SIF model to forecast semantic concepts at year t+1. We then compute the cosine similarity between the predicted semantic concepts for t+1 and the gold standard (GS) concepts for that year. We consider a concept correctly forecast if its similarity with a GS concept is higher than 0.5.
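A rough sketch of this matching step, together with a precision-at-ten helper; the exact average-precision formulation used in the paper is not given in the slides, so the variant below is a standard one chosen for illustration.

```python
# A rough sketch of the evaluation: each forecast concept (a word distribution)
# is compared to the gold-standard concepts via cosine similarity and counted
# as correct if its best match exceeds 0.5. The average-precision-at-ten helper
# is a standard formulation, not necessarily the authors' exact variant.
import numpy as np

def best_match(topic, gold_vectors):
    # topic and gold_vectors are word distributions over a shared vocabulary
    return max(np.dot(topic, g) / (np.linalg.norm(topic) * np.linalg.norm(g))
               for g in gold_vectors)

def average_precision_at_10(ranked_topics, gold_vectors, threshold=0.5):
    hits, precisions = 0, []
    for rank, topic in enumerate(ranked_topics[:10], start=1):
        if best_match(topic, gold_vectors) > threshold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(ranked_topics), 10) if ranked_topics else 0.0
```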
  20. 20. Evaluation - Baselines. We compare SIF against four baselines; for a year t, forecasting for year t+1: 1. LDA Topics (LDA): topics computed on the full training set; this setting makes no assumption about innovative/adopted lexicons. 2. LDA Innovative Topics (LDA-I): topics computed only on documents containing at least one word appearing in LI_t. 3. LDA Adopted Topics (LDA-A): topics computed only on documents containing at least one word appearing in LA_t. 4. LDA Innovation/Adoption Topics (LDA-IA): topics computed only on documents containing at least one word appearing in LI_t or LA_t.
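A rough sketch of the document filtering behind the LDA-I, LDA-A and LDA-IA baselines (the function name and mode flags are illustrative): standard LDA is then run on the filtered documents.

```python
# A rough sketch of the baseline document filtering: keep only documents
# containing at least one word from the chosen lexicon (LI_t, LA_t, or their
# union), then run standard LDA on the result.
def filter_docs(docs, li_t, la_t, mode="IA"):
    lexicon = {"I": li_t, "A": la_t, "IA": li_t | la_t}[mode]
    return [doc for doc in docs if any(tok in lexicon for tok in doc)]
```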
  21. 21. Evaluation - Mean Average Precision @ 10
      Year    SIF    LDA    LDA-A   LDA-I   LDA-IA
      2000    0.70   0.12   0.48    0.00    0.41
      2002    0.87   0.00   0.82    0.64    0.75
      2004    0.91   0.00   0.58    0.57    0.63
      2006    0.87   0.31   0.78    0.84    0.69
      2008    0.99   0.40   0.68    0.57    0.70
      AVG     0.87   0.17   0.67    0.52    0.64
  22. 22. Conclusion. It is possible to reliably forecast emerging semantic concepts if the ontology is associated with a large collection of documents. The next challenge is to forecast a new version of an ontology, that is, to produce an ontology that includes all the concepts and relationships that will (probably) be included in the next version.
  23. 23. Future work: • Integration of explicit and latent semantics; • Including graph-structure information in the model; • Understanding how research topics are created and forecasting topic trends. Salatino, A.A., Osborne, F., Motta, E. (2016): How are topics born? Understanding the research dynamics preceding the emergence of new areas. PeerJ Preprints.
  24. 24. Amparo Elizabeth Cano Basave, Francesco Osborne, Angelo Salatino. Cano-Basave, A. E., Osborne, F., Salatino, A.A. (2016): Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction based on Innovation-Adoption Priors. EKAW 2016, Bologna, Italy. Email: francesco.osborne@open.ac.uk. Twitter: FraOsborne. Site: people.kmi.open.ac.uk/francesco
