This work proposes an approach to the construction of WordSpaces which takes into account temporal information. The proposed method is able to build a geometrical space considering several periods of time. This methodology enables the analysis of the time evolution of the meaning of a word. Exploiting this approach, we build a framework, called Temporal Random Indexing (TRI) that provides all the necessary tools for building WordSpaces and performing such linguistic analysis. We propose some examples of usage of our tool by analysing word meanings in two corpora: a collection of Italian books and English scientific papers about computational linguistics.
http://clic.humnet.unipi.it/proceedings/Proceedings-CLICit-2014.pdf
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Analysing Word Meaning over Time by Exploiting Temporal Random Indexing
1. Analysing Word Meaning over Time by Exploiting Temporal Random Indexing
Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
Department of Computer Science – University of Bari Aldo Moro
pierpaolo.basile@uniba.it
Prima conferenza italiana di Linguistica Computazionale
Pisa 9-10 Dicembre 2014
2. Marty, in 2015 people will surf on the web!!!
Surf!?!?! On the web!?!?!?
3. Distributional Semantic Models (DSM)
•Analysis of word-usage statistics over huge corpora
•Geometrical space of concepts
•Similar words are represented close in the space
4. DSM issue…
Corpus
DSM
Word Space
Word Space is a snapshot of word
co-occurrences over a linguistic corpus
5. …DSM issue…
Corpus1
Word Space1
Corpus2
Word Space2
Corpus3
Word Space3
Corpus4
Word Space4
Difficult to compare Word Spaces built on different corpora
6. Temporal DSM
Corpus1900
Word Space1
Corpus1920
Word Space2
Corpus1930
Word Space3
Corpus1940
Word Space4
Each corpus contains documents of a specific time period…
…Word Spaces are still not comparable
7. Temporal Random Indexing (TRI)
Corpus1900
RI Space1
Corpus1920
RI
Space2
Corpus1930
RI
Space3
Corpus1940
RI
Space4
Words (vectors) in different Word Spaces are comparable…
…comparison on different time periods is possible!
8. Random Indexing
Corpus
Vocabulary
Assign a Random Vector ci to each term
ci <1, 0, 0, 0, -1, 0, 1, 0, - 1, 0, 0, 0, 0, 0, 0>
RI
Space
A word vector svi is the sum of random vectors assigned to the co-occurring words
푠푣푖= 푐푖 −푚<푖<+푚푑∈퐶
Co-occurring words are defined as the set of m words that precede and follow wi
9. Temporal Random Indexing (TRI)…
Corpus
T1-T2
Vocabulary
Assign a Random Vector ci to each term
RI
Space T1-T2
Corpus
T2-T3
Corpus
T3-T4
Corpus
Tn-1-Tn
…
RI
Space T2-T3
RI
Space T3-T4
RI
Space
Tn-1-Tn
…
10. …Temporal Random Indexing
•For each Word Space Tk-Tk+1
–sum random vectors taking into account only documents dk in the period Tk-Tk+1
•A word wi has several semantic vectors (sv): one for each time period
•Vectors are comparable
푠푣푖,푇푘= 푐푖 −푚<푖<+푚푑푘∈퐶
11. TRI System
•TRI system1 performs Temporal Random Indexing
1.Build a RI space for each year
2.Merge RI spaces to create new time periods
3.Load RI space and fetch vectors
4.Combine vectors
5.Retrieve similar vectors
6.Extract and compare neighbourhood of words
1https://github.com/pippokill/tri
12. Evaluation
•Two case studies
1.Project Gutenberg (PG): 349 Italian books from 1810 to 1922
2.ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014
•Goals
–Neighborhood: analyze the neighborhood of a word
–Semantic shift: analyze words that clearly change their semantics
13. PG Dataset
•Dataset split in two time periods
–Pre 1900
–Post 1900
•Analyze the neighbourhood: “patria”
•Semantic shift: “cinematografo”
14. PG: neighbourhood of “patria”
Pre 1900
Post 1900
Libertà
Libertà
Opera
Gloria
Pari
Giustizia
Comune
Comune
Gloria
Legge
Nostra
Pari
Causa
Virtù
Italia
Onore
Giustizia
Opera
Guerra
Popolo
15. PG: semantic shift of “cinematografo”
Tpost 1900: “cinematografo” strongly related to “sonoro”
푠푖푚(푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝푟푒1900,푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝표푠푡1900)=0.4
16. ANN Dataset
•Dataset split in decades
•Analyze the neighbourhood: “semantics”
•Semantic shift: “bioscience”, “unsupervised”
17. ANN: neighbourhood of “semantics”
1960-1969
1970-1979
1980-1989
1990-1999
2000-2009
2010-2014
linguistic
natural
syntax
syntax
syntax
syntax
theory
linguistic
natural
theory
theory
theory
semantic
semantic
general
interpretation
interpretation
interpretation
syntactic
theory
theory
general
description
description
natural
syntax
semantic
linguistic
meaning
complex
linguistic
language
syntactic
description
linguistic
meaning
distributional
processing
linguistic
complex
logical
linguistic
process
syntactic
interpretation
natural
complex
logical
computational
description
model
representation
representation
structures
syntax
analysis
description
logical
structures
representation
18. ANN: neighbourhood of “semantics”
1960-1969
1970-1979
1980-1989
1990-1999
2000-2009
2010-2014
linguistic
natural
syntax
syntax
syntax
syntax
theory
linguistic
natural
theory
theory
theory
semantic
semantic
general
interpretation
interpretation
interpretation
syntactic
theory
theory
general
description
description
natural
syntax
semantic
linguistic
meaning
complex
linguistic
language
syntactic
description
linguistic
meaning
distributional
processing
linguistic
complex
logical
linguistic
process
syntactic
interpretation
natural
complex
logical
computational
description
model
representation
representation
structures
syntax
analysis
description
logical
structures
representation
19. ANN: neighbourhood of “semantics”
1960-1969
1970-1979
1980-1989
1990-1999
2000-2009
2010-2014
linguistic
natural
syntax
syntax
syntax
syntax
theory
linguistic
natural
theory
theory
theory
semantic
semantic
general
interpretation
interpretation
interpretation
syntactic
theory
theory
general
description
description
natural
syntax
semantic
linguistic
meaning
complex
linguistic
language
syntactic
description
linguistic
meaning
distributional
processing
linguistic
complex
logical
linguistic
process
syntactic
interpretation
natural
complex
logical
computational
description
model
representation
representation
structures
syntax
analysis
description
logical
structures
representation
20. ANN: neighbourhood of “semantics”
1960-1969
1970-1979
1980-1989
1990-1999
2000-2009
2010-2014
linguistic
natural
syntax
syntax
syntax
syntax
theory
linguistic
natural
theory
theory
theory
semantic
semantic
general
interpretation
interpretation
interpretation
syntactic
theory
theory
general
description
description
natural
syntax
semantic
linguistic
meaning
complex
linguistic
language
syntactic
description
linguistic
meaning
distributional
processing
linguistic
complex
logical
linguistic
process
syntactic
interpretation
natural
complex
logical
computational
description
model
representation
representation
structures
syntax
analysis
description
logical
structures
representation
23. Conclusions
•Temporal Random Indexing
–build Word Spaces taking into account information about time
–developed and published an open-source framework
•Potentiality of our framework
–capture word usage changes over time