STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information
1. STL: A Similarity Measure Based on Semantic, Terminological and Linguistic Information. Nitish Aggarwal, joint work with Tobias Wunner and Mihael Arcan. DERI, NUI Galway, firstname.lastname@deri.org. Friday, 19th Aug 2011, DERI Friday Meeting
2. Overview: Motivation & Applications; Why STL?; Semantic; Terminology; Linguistic; Evaluation; Conclusion and Future Work
3. Motivation & Applications: Semantic Annotation. Similarity between corpus data and ontology concepts. Example: "SAP AG held €1615 million in short-term liquid assets (2009)" maps to "dbpedia:SAP_AG" and "xEBR:LiquidAssets" at "dbpedia:year:2009"
4. Motivation & Applications: Semantic Search. Similarity between query and index object. Queries such as "SAP liquid asset in 2010", "Current asset of SAP last year", "Net cash of SAP in 2010" and "SAP total amount received in 2010" all map to "dbpedia:SAP_AG" and "xEBR:liquid asset" at "dbpedia:year:2010"
6. Classical Approaches: String similarity (Levenshtein distance, Dice coefficient); Corpus-based (LSA, ESA, Google distance, vector-space model); Ontology-based (path distance, information content); Syntax similarity (word order, part of speech)
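As a minimal illustration of the two string-similarity baselines named on the slide, here is a sketch of Levenshtein distance and the Dice coefficient over character bigrams (both standard definitions; the code is not from the talk):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))
```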
7. Why STL? Semantic: semantic structure and relations. Terminology: complex terms expressing the same concept. Linguistic: phrase and dependency structure.
8. STL Definition: a linear combination of the semantic, terminological and linguistic scores, with weights obtained by linear regression. Formula used: STL = w1*S + w2*T + w3*L + constant, where w1, w2, w3 represent the contribution of each component.
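A minimal sketch of this step, assuming the weights are fitted by ordinary least squares against a human benchmark (the data and function names below are illustrative, not from the talk):

```python
import numpy as np

def fit_stl(S, T, L, gold):
    """Fit w1, w2, w3 and the constant by least squares against gold scores."""
    X = np.column_stack([S, T, L, np.ones(len(gold))])
    w, *_ = np.linalg.lstsq(X, gold, rcond=None)
    return w  # [w1, w2, w3, constant]

def stl(w, s, t, l):
    """Apply the fitted combination to one term pair's S, T, L scores."""
    return w[0] * s + w[1] * t + w[2] * l + w[3]
```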
9. Semantic. Wu-Palmer: 2*depth(MSCA) / (depth(c1) + depth(c2)). Resnik's information content: IC(c) = -log p(c). Intrinsic information content (Pirro 2009): avoids the analysis of large corpora.
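The Wu-Palmer measure can be sketched on a toy taxonomy given as a child-to-parent map (the taxonomy below mirrors the xEBR fragment used later in the talk, but the exact structure is illustrative):

```python
# Toy taxonomy: child -> parent (illustrative fragment).
TOY = {"CurrentAssets": "Assets", "FixedAssets": "Assets",
       "Stocks": "CurrentAssets", "TangibleFixedAssets": "FixedAssets"}

def path_to_root(c):
    """Concept followed by all its ancestors up to the root."""
    path = [c]
    while c in TOY:
        c = TOY[c]
        path.append(c)
    return path

def depth(c):
    """Depth counted from the root, which has depth 1."""
    return len(path_to_root(c))

def msca(c1, c2):
    """Most specific common ancestor: first shared node walking up from c2."""
    ancestors = set(path_to_root(c1))
    return next(a for a in path_to_root(c2) if a in ancestors)

def wu_palmer(c1, c2):
    return 2 * depth(msca(c1, c2)) / (depth(c1) + depth(c2))
```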
10. Cont. Intrinsic information content (iIC): iIC(c) = 1 - log(sub(c) + 1) / log(N), where sub(c) is the number of sub-concepts of a given concept c and N is the total number of concepts. Pirro similarity: sim(c1, c2) = iIC(MSCA) / (iIC(c1) + iIC(c2) - iIC(MSCA)).
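A sketch of both formulas; note the similarity form here is one reading of Pirro's measure that reproduces the worked example on the next slide (iIC values 0.32, 0.69, 0.60 giving roughly 0.33), so treat it as an assumption rather than the talk's exact definition:

```python
import math

def iic(sub_count, total_concepts):
    """Intrinsic IC: iIC(c) = 1 - log(sub(c) + 1) / log(N)."""
    return 1 - math.log(sub_count + 1) / math.log(total_concepts)

def pirro_sim(iic_msca, iic_c1, iic_c2):
    """Assumed form: iIC(MSCA) / (iIC(c1) + iIC(c2) - iIC(MSCA))."""
    return iic_msca / (iic_c1 + iic_c2 - iic_msca)
```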
11. Cont. [Diagram: worked example on the xEBR hierarchy.] MSCA = Assets (48 sub-concepts, iIC = 0.32), with children Subscribed Capital Unpaid, Fixed Assets and Current Assets. Comparing Tangible Fixed Assets (9 sub-concepts, iIC = 0.60) with Amount Receivable (6 sub-concepts, iIC = 0.69) gives Pirro_Sim = 0.33. Sub-concepts shown include Stocks; Other Tangible Fixed Assets; Property, Plant and Equipment; Payments on account and assets in construction; Furniture, Fixtures and Equipment; Land and Building; Plant and Machinery; Trade Debtors; Other Debtors; Amount Receivable within one year; Amount Receivable after more than one year.
12. Limitation: does semantic structure always reflect a good similarity? Not necessarily. E.g., in xEBR the parent-child relation is used to describe the layout of concepts: "Work in progress" is not a type of asset, although both are linked via a parent-child relationship.
13. Terminology Definition: common naming conventions. N-grams vs. subterms: in the financial domain, the bigram "Intangible Fixed" is a substring of "Other Intangible Fixed Assets" but not a subterm. Terminological similarity: maximal subterm overlap.
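One way to sketch maximal subterm overlap: enumerate each term's word spans, keep only spans attested in a subterm lexicon, and take a Dice-style score over the two subterm sets. The lexicon below is a stand-in for the resources the talk mentions (IFRS, Investopedia, etc.), so the exact score is illustrative:

```python
# Hypothetical subterm lexicon standing in for external term resources.
LEXICON = {"trade", "trade debts", "debts", "payable",
           "after more than one year"}

def subterms(term):
    """All contiguous word spans of the term that are attested subterms."""
    words = term.lower().split()
    spans = {" ".join(words[i:j]) for i in range(len(words))
             for j in range(i + 1, len(words) + 1)}
    return spans & LEXICON

def term_sim(t1, t2):
    """Dice-style overlap between the two terms' attested subterm sets."""
    a, b = subterms(t1), subterms(t2)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```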
14. Cont. [Diagram: subterm decomposition.] "Trade Debts Payable After More Than One Year" decomposes into [[Trade][Debts]] [Payable] [After More Than One Year], with subterms attested in resources: SAP:Payable, Ifrs:After More Than One Year, Investoword:Debt, FinanceDict:Trade Debts, Investopedia:Trade. "Financial Debts Payable After More Than One Year" decomposes into Financial [Debts] [Payable] [After More Than One Year].
15. Multilingual Subterms: translated subterms available in other languages. Advantage: they reflect terminological similarities that may be visible in one language but not in others, e.g. "Property Plant and Equipment"@en and "Tangible Fixed Asset"@en both translate to "Sachanlagen"@de.
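This idea can be sketched by pivoting each language's subterms into a shared representative via a translation table before comparing the sets (the table below is a made-up toy, not the talk's actual resource):

```python
# Hypothetical translation table mapping subterms to a shared pivot form.
TRANSLATIONS = {"sachanlagen": "tangible fixed asset",
                "tangible fixed asset": "tangible fixed asset",
                "property plant and equipment": "tangible fixed asset"}

def pivot(subterm_set):
    """Map each subterm to its pivot form; unknown subterms pass through."""
    return {TRANSLATIONS.get(s, s) for s in subterm_set}

def multilingual_overlap(sub1, sub2):
    """Jaccard overlap of the pivoted subterm sets."""
    a, b = pivot(sub1), pivot(sub2)
    return len(a & b) / max(len(a | b), 1)
```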
16. Linguistic: syntactic information beyond simple word order: phrase structure and dependency structure. Phrase structure: "Intangible fixed" (adj adj) is not a phrase, while "Intangible fixed assets" (adj adj n) is an NP. Dependency structure: "Amounts receivable" (N Adj) and "Received amounts" (V N) share the same dependencies: receive:mod, amounts:head.
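The dependency example can be sketched by lemmatising head and modifier and comparing the resulting pairs, so that "Amounts receivable" and "Received amounts" match despite opposite word order (the lemmatiser here is a toy lookup, not a real NLP pipeline):

```python
# Toy lemma lookup standing in for a real lemmatiser.
LEMMA = {"amounts": "amount", "receivable": "receive", "received": "receive"}

def dep_pair(head, mod):
    """Lemmatised (head, modifier) pair for one term."""
    return (LEMMA.get(head, head), LEMMA.get(mod, mod))

def same_dependency(t1, t2):
    """True if both (head, mod) terms share the same lemmatised dependency."""
    return dep_pair(*t1) == dep_pair(*t2)
```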
17. Evaluation. Data set: xEBR finance vocabulary, 269 terms (concept labels), 72,361 (269*269) term pairs. Benchmarks: SimSem59, a sample of 59 term pairs; SimSem200, a sample of 200 term pairs (under construction).
19. Experiment Results (SimSem59). STL formula used: STL = 0.1531*S + 0.5218*T + 0.1041*L + 0.1791. [Chart: correlation between similarity scores and SimSem59, broken down into semantic, terminological and linguistic contributions.]
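The evaluation step, correlating a measure's scores against the benchmark, can be sketched with a plain Pearson correlation (the function is standard; any scores fed to it would be illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```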
20. Conclusion: STL outperforms more traditional similarity measures. The largest contribution comes from T (terminological analysis). Multilingual subterms perform better than monolingual ones.
21. Future Work: evaluation on a larger data set and vocabularies (IFRS: 3000+ terms, 9M term pairs); a richer set of linguistic operations, e.g. "recognise" => "recognition" by the derivation rule verb_lemma + "ion"; similarity between subterms such as "Staff Costs" and "Wages And Salaries".