This inaugural Meetup talk, sponsored by the Knowledgent Group, covered aspects of semantic processing, with an emphasis on using Python for lexical semantics. The slides include example code snippets for computing relationships between words using the Natural Language Toolkit (NLTK) in Python, along with a brief overview of the technologies underlying the Semantic Web and of text mining.
2. Discussion Topics
• Semantic Processing
– What is Semantics?
– What is Pragmatics?
• Lexical Semantics
– Computing Semantic Similarity
∗ WordNet
∗ Vector Space Modeling
• Ontology Basics
• Text Mining: Basics
3. Semantic Processing
• What is Semantics?
– Study of literal meanings of words and sentences
∗ Lexical Semantics - word meanings & word relations
– Sometimes stated formally using some logical form
∗ Example: ∀x∃y loves(x, y)
• What is Pragmatics?
– Study of language use and its situational contexts (discourse, deixis,
presupposition, etc.)
4. Lexical Semantics
WordNet: Description
• Word relation database
• Created by George Miller & Christiane Fellbaum (Miller, 1995; Fellbaum, 1998) at Princeton University
• Types of Relationships
Synonymy - word pair similarity
Antonymy - word pair dissimilarity
Meronymy - part-of relation
– Example: ’engine’ is a meronym of ’car’ (an engine is part of a car)
Hyponymy - subordinate relation between words (i.e., a type-of relation)
– Example: ’red’ is a hyponym of ’color’ (’red’ is a type of color)
Hypernymy - superordinate relation between words
– Example: ’color’ is a hypernym of ’red’
Question: What’s the relationship between a hyponym and a hypernym?
• Approx. 150k words, 115k synsets, and 200k word-sense pairs
6. Lexical Semantics
• Adapted from Python Text Processing with NLTK 2.0 Cookbook (Perkins, 2010)
>>> from nltk.corpus import wordnet as wn
>>> word_synset = wn.synsets('cookbook')[0]
>>> word_synset.name()  # a method in NLTK 3; an attribute in NLTK 2
'cookbook.n.01'
>>> word_synset.definition()
'a book of recipes and cooking directions'
7. Lexical Semantics
• Antonymy:
>>> ga1 = wn.synset('good.a.01')
>>> ga1.definition()  # methods in NLTK 3; attributes in NLTK 2
'having desirable or positive qualities especially those suitable for a thing specified'
>>> bad = ga1.lemmas()[0].antonyms()[0]
>>> bad.name()
'bad'
>>> bad.synset().definition()
'having undesirable or negative qualities'
9. Computing Similarity by WordNet
• Similarity by Path Length (see Perkins, 2010, p. 19)
>>> from nltk.corpus import wordnet as wn
>>> cb = wn.synset(’cookbook.n.01’)
>>> ib = wn.synset(’instruction_book.n.01’)
>>> cb.wup_similarity(ib) # Wu-Palmer Similarity
0.91666666666666663
• For path similarity explanations, see Jaganadhg (2010)
10. Advantages & Disadvantages
• Advantages
Quality: developed and maintained by researchers
Practice: applications can use WordNet
Software: SenseRelate (Perl) - http://senserelate.sourceforge.net
• Disadvantages
Coverage: technical terms may be missing
Irregularity: path lengths can be irregular across hierarchies
Relatedness: related terms may not be in the same hierarchies
Example: the Tennis Problem
– ’player’, ’racquet’, ’ball’, and ’net’ are clearly related but fall in different hierarchies
11. Computing Word Similarity by Vector Space Modeling
• Computing Similarity from a Document Corpus
Goal: determine distributional properties of a word
Steps: In general...
– Create vector of size n for each word of interest
– Think of them as points in some n-dimensional space
– Use a similarity metric to compute distance
Algorithm: Brown et al. (1992)
– C(w) - a vector with the distributional properties of w (the context of w)
– C(w) = ⟨#(w1), #(w2), ..., #(wk)⟩, where #(wi) is the number of times wi followed w in a corpus
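The context-vector construction above can be sketched in plain standard-library Python. This is a toy illustration of the counting step only (the corpus, vocabulary, and function name are my own; Brown et al. (1992) work over a large corpus with class-based statistics):

```python
# Sketch: build C(w) by counting how often each vocabulary word
# immediately follows w in a corpus. Toy data, illustrative only.
from collections import Counter

corpus = ("the astronaut landed on the moon and "
          "the cosmonaut landed on the moon").split()

def context_vector(w, tokens, vocab):
    """#(wi) = number of times wi immediately followed w."""
    following = Counter(tokens[i + 1]
                        for i, t in enumerate(tokens[:-1]) if t == w)
    return [following[v] for v in vocab]

vocab = sorted(set(corpus))
print(vocab)
# ['and', 'astronaut', 'cosmonaut', 'landed', 'moon', 'on', 'the']
print(context_vector('the', corpus, vocab))
# [0, 1, 1, 0, 2, 0, 0]  ('the' is followed by astronaut, cosmonaut, moon x2)
```

Each word of interest then becomes a point in a |vocab|-dimensional space, ready for a distance metric.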
15. Similarity Measure: Euclidean

Euclidean distance: |x⃗ − y⃗| = √( Σⁿᵢ₌₁ (xᵢ − yᵢ)² )

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0       0     1     1
American          0          1       0     1     1
spacewalking      1          1       0     0     0
red               0          0       0     1     1
full              0          0       1     0     0
old               0          0       0     1     1

euclidean(cosm, astr) = √((1 − 0)² + (0 − 1)² + (1 − 1)² + (0 − 0)² + (0 − 0)² + (0 − 0)²) = √2 ≈ 1.41

Figure 2: Euclidean Similarity Comparison from Collins (2007)
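The worked example can be re-checked in a few lines of standard-library Python. The vectors are the cosmonaut and astronaut columns of the co-occurrence table (rows Soviet, American, spacewalking, red, full, old); the function names are my own:

```python
# Sketch: Euclidean distance and cosine distance computed by hand
# for the cosmonaut/astronaut context vectors.
import math

cosm = [1, 0, 1, 0, 0, 0]
astr = [0, 1, 1, 0, 0, 0]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = (math.sqrt(sum(xi * xi for xi in x))
             * math.sqrt(sum(yi * yi for yi in y)))
    return 1.0 - dot / norms

print(euclidean(cosm, astr))        # sqrt(2) ≈ 1.4142135623730951
print(cosine_distance(cosm, astr))  # 1 - 1/2 = 0.5
```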
16. Cosine & Euclidean Similarity in Python
>>> import numpy as np
>>> from scipy.spatial import distance as dist
>>> cosm = np.array([1, 0, 1, 0, 0, 0])
>>> astr = np.array([0, 1, 1, 0, 0, 0])
>>> dist.cosine(cosm, astr)
0.5
>>> dist.euclidean(cosm, astr)
1.4142135623730951
17. Computing Word Similarity by Vector Space Modeling
• Advantages
– Requires no database lookups
• Disadvantages
– A high similarity score doesn't imply synonymy, antonymy, meronymy, hyponymy, hypernymy, or any other particular relation
18. Ontology Basics
• Semantic Web Technologies
– Data Models
– Ontology Language
– Distributed Query Language
– Applications
∗ Large knowledge bases
∗ Business Intelligence
20. Ontology Basics
• W3C Semantic Web
– RDF - Resource Description Framework
∗ Data model w/ identifiers and named relations b/t resource pairs
∗ Represented as directed graphs b/t resources and literal values
· Done w/ collections of triples
· triple: subject, predicate and object
1. Na’im Tyson born in 197x
2. Na’im Tyson works for Knowledgent
3. Knowledgent headquartered Warren
– SPARQL - SPARQL Protocol And RDF Query Language
∗ Query language of Semantic Web
∗ Queries RDF stores over HTTP
∗ Very similar to SQL
– Capturing Relationships
RDF Schema: Vocabulary (term definitions), Schema (class definitions) and
Taxonomies (defining hierarchies)
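The triple/pattern idea behind RDF and SPARQL can be sketched in plain Python without an RDF store. This is an illustrative analogy only, not real SPARQL (the triples are adapted from the examples above; the matching helper is my own):

```python
# Sketch: RDF-style triples as (subject, predicate, object) tuples,
# with a tiny pattern matcher standing in for a SPARQL-like query.
triples = [
    ("Na'im Tyson", "worksFor", "Knowledgent"),
    ("Knowledgent", "headquarteredIn", "Warren"),
]

def match(pattern, store):
    """None in the pattern acts like a SPARQL variable (?x)."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Analogue of: SELECT ?o WHERE { :Knowledgent :headquarteredIn ?o }
print(match(("Knowledgent", "headquarteredIn", None), triples))
# [('Knowledgent', 'headquarteredIn', 'Warren')]
```

A real SPARQL endpoint does the same pattern matching over an RDF graph, served over HTTP.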
OWL: Expressive relation definitions (symmetry, transitivity, etc.)
RIF: Rule Interchange Format - a representation for exchanging sets of logical and business rules
22. Text Mining Basics
• What do people think Text Mining is?
– The automated discovery of new, previously unknown information by automatically extracting it from a large number of different unstructured textual resources (Wasilewska, 2014)
23. Text Mining Basics
• What is text mining, really?
– The intersection of Data Mining, Information Retrieval, Statistics, Web Mining, and Computational Linguistics & Natural Language Processing
Figure 4: Venn Diagram of Text Mining (Wasilewska, 2014).
24. Text Mining Basics
• A General Approach to the Text Mining Process
– Text → Text Preprocessing → Text Transformation (Attribute Generation) → Attribute Selection → Data Mining / Pattern Discovery → Interpretation / Evaluation
• Document Clustering
• Text Characteristics
Figure 5: General Approaches to Text Mining Process (Wasilewska, 2014).
25. Text Mining Basics
• Application - Document Clustering
Goal: Group large amounts of textual data
Techniques: High Level
– k-means - top down
∗ cluster documents into k groups using vectors and distance metric
– agglomerative hierarchical clustering - bottom up
∗ Start with each document as its own cluster
∗ Repeatedly merge the closest clusters until all documents belong to one cluster
∗ Documents represented as a hierarchy (dendrogram)
Reference: Taming Text (see Ingersoll et al., 2013, chap. 6)
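The bottom-up merging described above can be sketched in standard-library Python. A toy illustration only: the document vectors and the single-link merge criterion are my own choices, not from Ingersoll et al. (2013):

```python
# Sketch: single-link agglomerative hierarchical clustering on toy
# 2-D document vectors. Start with one cluster per document; merge
# the closest pair until a single cluster remains.
import math

docs = {"d1": (0.0, 0.0), "d2": (0.1, 0.0), "d3": (5.0, 5.0), "d4": (5.1, 5.0)}

def doc_dist(a, b):
    return math.dist(docs[a], docs[b])

clusters = [{name} for name in docs]  # each document is its own cluster
merges = []                           # merge history (a dendrogram trace)

while len(clusters) > 1:
    # single-link: cluster distance = closest pair of members
    i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
               key=lambda ij: min(doc_dist(a, b)
                                  for a in clusters[ij[0]]
                                  for b in clusters[ij[1]]))
    merged = clusters[i] | clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)  # nearby documents merge first; the last merge holds all four
```

For n documents this performs exactly n − 1 merges, which is why the result reads naturally as a dendrogram.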
• Final Remarks
27. References
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and
Jenifer C. Lai. Class-based n-gram models of natural language. Computational
Linguistics, 18:467–479, 1992.
Michael Collins. Lexical Semantics: Similarity Measures and Clustering, November 2007. URL http://www.cs.columbia.edu/~mcollins/6864/slides/wordsim.4up.pdf.
Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris. Taming Text: How
to Find, Organize, and Manipulate It. Manning Publications Co., January 2013.
Jaganadhg. WordNet sense similarity with NLTK: some basics, October 2010. URL http://jaganadhg.freeflux.net/blog/archive/tag/WSD/.
George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
Jacob Perkins. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.
Anita Wasilewska. CSE 634 - Data Mining: Text Mining, January 2014. URL http://www.cs.sunysb.edu/~cse634/presentations/TextMining.pdf.