SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
Analysing Word Meaning over Time by Exploiting Temporal Random Indexing 
Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro 
Department of Computer Science – University of Bari Aldo Moro 
pierpaolo.basile@uniba.it 
Prima conferenza italiana di Linguistica Computazionale 
Pisa 9-10 Dicembre 2014
Marty, in 2015 people will surf on the web!!! 
Surf!?!?! On the web!?!?!?
Distributional Semantic Models (DSM) 
•Analysis of word-usage statistics over huge corpora 
•Geometrical space of concepts 
•Similar words are represented close in the space
DSM issue… 
Corpus 
DSM 
Word Space 
Word Space is a snapshot of word 
co-occurrences over a linguistic corpus
…DSM issue… 
Corpus1 
Word Space1 
Corpus2 
Word Space2 
Corpus3 
Word Space3 
Corpus4 
Word Space4 
Difficult to compare Word Spaces built on different corpora
Temporal DSM 
Corpus1900 
Word Space1 
Corpus1920 
Word Space2 
Corpus1930 
Word Space3 
Corpus1940 
Word Space4 
Each corpus contains documents of a specific time period… 
…Word Spaces are still not comparable
Temporal Random Indexing (TRI) 
Corpus1900 
RI Space1 
Corpus1920 
RI 
Space2 
Corpus1930 
RI 
Space3 
Corpus1940 
RI 
Space4 
Words (vectors) in different Word Spaces are comparable… 
…comparison on different time periods is possible!
Random Indexing 
Corpus 
Vocabulary 
Assign a Random Vector ci to each term 
ci  <1, 0, 0, 0, -1, 0, 1, 0, - 1, 0, 0, 0, 0, 0, 0> 
RI 
Space 
A word vector svi is the sum of random vectors assigned to the co-occurring words 
푠푣푖= 푐푖 −푚<푖<+푚푑∈퐶 
Co-occurring words are defined as the set of m words that precede and follow wi
Temporal Random Indexing (TRI)… 
Corpus 
T1-T2 
Vocabulary 
Assign a Random Vector ci to each term 
RI 
Space T1-T2 
Corpus 
T2-T3 
Corpus 
T3-T4 
Corpus 
Tn-1-Tn 
… 
RI 
Space T2-T3 
RI 
Space T3-T4 
RI 
Space 
Tn-1-Tn 
…
…Temporal Random Indexing 
•For each Word Space Tk-Tk+1 
–sum random vectors taking into account only documents dk in the period Tk-Tk+1 
•A word wi has several semantic vectors (sv): one for each time period 
•Vectors are comparable 
푠푣푖,푇푘= 푐푖 −푚<푖<+푚푑푘∈퐶
TRI System 
•TRI system1 performs Temporal Random Indexing 
1.Build a RI space for each year 
2.Merge RI spaces to create new time periods 
3.Load RI space and fetch vectors 
4.Combine vectors 
5.Retrieve similar vectors 
6.Extract and compare neighbourhood of words 
1https://github.com/pippokill/tri
Evaluation 
•Two case studies 
1.Project Gutenberg (PG): 349 Italian books from 1810 to 1922 
2.ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014 
•Goals 
–Neighborhood: analyze the neighborhood of a word 
–Semantic shift: analyze words that clearly change their semantics
PG Dataset 
•Dataset split in two time periods 
–Pre 1900 
–Post 1900 
•Analyze the neighbourhood: “patria” 
•Semantic shift: “cinematografo”
PG: neighbourhood of “patria” 
Pre 1900 
Post 1900 
Libertà 
Libertà 
Opera 
Gloria 
Pari 
Giustizia 
Comune 
Comune 
Gloria 
Legge 
Nostra 
Pari 
Causa 
Virtù 
Italia 
Onore 
Giustizia 
Opera 
Guerra 
Popolo
PG: semantic shift of “cinematografo” 
Tpost 1900: “cinematografo” strongly related to “sonoro” 
푠푖푚(푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝푟푒1900,푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝표푠푡1900)=0.4
ANN Dataset 
•Dataset split in decades 
•Analyze the neighbourhood: “semantics” 
•Semantic shift: “bioscience”, “unsupervised”
ANN: neighbourhood of “semantics” 
1960-1969 
1970-1979 
1980-1989 
1990-1999 
2000-2009 
2010-2014 
linguistic 
natural 
syntax 
syntax 
syntax 
syntax 
theory 
linguistic 
natural 
theory 
theory 
theory 
semantic 
semantic 
general 
interpretation 
interpretation 
interpretation 
syntactic 
theory 
theory 
general 
description 
description 
natural 
syntax 
semantic 
linguistic 
meaning 
complex 
linguistic 
language 
syntactic 
description 
linguistic 
meaning 
distributional 
processing 
linguistic 
complex 
logical 
linguistic 
process 
syntactic 
interpretation 
natural 
complex 
logical 
computational 
description 
model 
representation 
representation 
structures 
syntax 
analysis 
description 
logical 
structures 
representation
ANN: neighbourhood of “semantics” 
1960-1969 
1970-1979 
1980-1989 
1990-1999 
2000-2009 
2010-2014 
linguistic 
natural 
syntax 
syntax 
syntax 
syntax 
theory 
linguistic 
natural 
theory 
theory 
theory 
semantic 
semantic 
general 
interpretation 
interpretation 
interpretation 
syntactic 
theory 
theory 
general 
description 
description 
natural 
syntax 
semantic 
linguistic 
meaning 
complex 
linguistic 
language 
syntactic 
description 
linguistic 
meaning 
distributional 
processing 
linguistic 
complex 
logical 
linguistic 
process 
syntactic 
interpretation 
natural 
complex 
logical 
computational 
description 
model 
representation 
representation 
structures 
syntax 
analysis 
description 
logical 
structures 
representation
ANN: neighbourhood of “semantics” 
1960-1969 
1970-1979 
1980-1989 
1990-1999 
2000-2009 
2010-2014 
linguistic 
natural 
syntax 
syntax 
syntax 
syntax 
theory 
linguistic 
natural 
theory 
theory 
theory 
semantic 
semantic 
general 
interpretation 
interpretation 
interpretation 
syntactic 
theory 
theory 
general 
description 
description 
natural 
syntax 
semantic 
linguistic 
meaning 
complex 
linguistic 
language 
syntactic 
description 
linguistic 
meaning 
distributional 
processing 
linguistic 
complex 
logical 
linguistic 
process 
syntactic 
interpretation 
natural 
complex 
logical 
computational 
description 
model 
representation 
representation 
structures 
syntax 
analysis 
description 
logical 
structures 
representation
ANN: neighbourhood of “semantics” 
1960-1969 
1970-1979 
1980-1989 
1990-1999 
2000-2009 
2010-2014 
linguistic 
natural 
syntax 
syntax 
syntax 
syntax 
theory 
linguistic 
natural 
theory 
theory 
theory 
semantic 
semantic 
general 
interpretation 
interpretation 
interpretation 
syntactic 
theory 
theory 
general 
description 
description 
natural 
syntax 
semantic 
linguistic 
meaning 
complex 
linguistic 
language 
syntactic 
description 
linguistic 
meaning 
distributional 
processing 
linguistic 
complex 
logical 
linguistic 
process 
syntactic 
interpretation 
natural 
complex 
logical 
computational 
description 
model 
representation 
representation 
structures 
syntax 
analysis 
description 
logical 
structures 
representation
…ANN: semantic shift “bioscience” 
bioscience 
extraterrestrial, extrasolar 
medline, bionlp, biomedi 
before 1990 
nowadays
…ANN: semantic shift “unsupervised” 
unsupervised 
observe, partition, selective 
supervised, disambiguation, probabilistic, algorithms, statistical 
before 1990 
nowadays
Conclusions 
•Temporal Random Indexing 
–build Word Spaces taking into account information about time 
–developed and published an open-source framework 
•Potentiality of our framework 
–capture word usage changes over time
That’s all folks! 
https://github.com/pippokill/tri

Más contenido relacionado

Similar a Analysing Word Meaning over Time by Exploiting Temporal Random Indexing

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Presentación semántica
Presentación semánticaPresentación semántica
Presentación semántica
juanpgl
 

Similar a Analysing Word Meaning over Time by Exploiting Temporal Random Indexing (20)

Detecting semantic shift in large corpora by exploiting temporal random indexing
Detecting semantic shift in large corpora by exploiting temporal random indexingDetecting semantic shift in large corpora by exploiting temporal random indexing
Detecting semantic shift in large corpora by exploiting temporal random indexing
 
Diachronic Analysis
Diachronic AnalysisDiachronic Analysis
Diachronic Analysis
 
history-of-gfs.ppt
history-of-gfs.ppthistory-of-gfs.ppt
history-of-gfs.ppt
 
woodward-dissertation-main
woodward-dissertation-mainwoodward-dissertation-main
woodward-dissertation-main
 
6
66
6
 
intro.ppt
intro.pptintro.ppt
intro.ppt
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Augmented reality: what and how
Augmented reality: what and howAugmented reality: what and how
Augmented reality: what and how
 
Temporal Semantic Techniques for Text Analysis and Applications
Temporal Semantic Techniques for Text Analysis and ApplicationsTemporal Semantic Techniques for Text Analysis and Applications
Temporal Semantic Techniques for Text Analysis and Applications
 
Natural Language Inference: Logic from Humans
Natural Language Inference: Logic from HumansNatural Language Inference: Logic from Humans
Natural Language Inference: Logic from Humans
 
Data Consistency in OpenStreetMap
Data Consistency in OpenStreetMapData Consistency in OpenStreetMap
Data Consistency in OpenStreetMap
 
Idiosynchratic constructions in English and Spanish
Idiosynchratic constructions in English and SpanishIdiosynchratic constructions in English and Spanish
Idiosynchratic constructions in English and Spanish
 
IMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory CraneIMPACT Final Conference - Gregory Crane
IMPACT Final Conference - Gregory Crane
 
Ontologies and the humanities: some issues affecting the design of digital in...
Ontologies and the humanities: some issues affecting the design of digital in...Ontologies and the humanities: some issues affecting the design of digital in...
Ontologies and the humanities: some issues affecting the design of digital in...
 
Large-Scale Computational Research in Arts & Humanities
Large-Scale Computational Research in Arts & HumanitiesLarge-Scale Computational Research in Arts & Humanities
Large-Scale Computational Research in Arts & Humanities
 
From Semantics to Self-supervised Learning for Speech and Beyond
From Semantics to Self-supervised Learning for Speech and BeyondFrom Semantics to Self-supervised Learning for Speech and Beyond
From Semantics to Self-supervised Learning for Speech and Beyond
 
Tom Crane - An Introduction to IIIF
Tom Crane - An Introduction to IIIF Tom Crane - An Introduction to IIIF
Tom Crane - An Introduction to IIIF
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 
EmoGraph for Age and Gender Identification
EmoGraph for Age and Gender IdentificationEmoGraph for Age and Gender Identification
EmoGraph for Age and Gender Identification
 
Presentación semántica
Presentación semánticaPresentación semántica
Presentación semántica
 

Más de Pierpaolo Basile

Sst evalita2011 basile_pierpaolo
Sst evalita2011 basile_pierpaoloSst evalita2011 basile_pierpaolo
Sst evalita2011 basile_pierpaolo
Pierpaolo Basile
 
Word Sense Disambiguation and Intelligent Information Access
Word Sense Disambiguation and Intelligent Information AccessWord Sense Disambiguation and Intelligent Information Access
Word Sense Disambiguation and Intelligent Information Access
Pierpaolo Basile
 

Más de Pierpaolo Basile (19)

Diachronic analysis of entities by exploiting wikipedia page revisions
Diachronic analysis of entities by exploiting wikipedia page revisionsDiachronic analysis of entities by exploiting wikipedia page revisions
Diachronic analysis of entities by exploiting wikipedia page revisions
 
Come l'industria tecnologica ha cancellato le donne dalla storia
Come l'industria tecnologica ha cancellato le donne dalla storiaCome l'industria tecnologica ha cancellato le donne dalla storia
Come l'industria tecnologica ha cancellato le donne dalla storia
 
EVALITA 2018 NLP4FUN - Solving language games
EVALITA 2018 NLP4FUN - Solving language gamesEVALITA 2018 NLP4FUN - Solving language games
EVALITA 2018 NLP4FUN - Solving language games
 
Buon appetito! Analyzing Happiness in Italian Tweets
Buon appetito! Analyzing Happiness in Italian TweetsBuon appetito! Analyzing Happiness in Italian Tweets
Buon appetito! Analyzing Happiness in Italian Tweets
 
Bi-directional LSTM-CNNs-CRF for Italian Sequence Labeling
Bi-directional LSTM-CNNs-CRF for Italian Sequence LabelingBi-directional LSTM-CNNs-CRF for Italian Sequence Labeling
Bi-directional LSTM-CNNs-CRF for Italian Sequence Labeling
 
INSERT COIN - Storia dei videogame: da Spacewar a Street Fighter
INSERT COIN - Storia dei videogame: da Spacewar a Street FighterINSERT COIN - Storia dei videogame: da Spacewar a Street Fighter
INSERT COIN - Storia dei videogame: da Spacewar a Street Fighter
 
QuestionCube DigithON 2017
QuestionCube DigithON 2017QuestionCube DigithON 2017
QuestionCube DigithON 2017
 
Diachronic Analysis of the Italian Language exploiting Google Ngram
Diachronic Analysis of the Italian Language exploiting Google NgramDiachronic Analysis of the Italian Language exploiting Google Ngram
Diachronic Analysis of the Italian Language exploiting Google Ngram
 
(Open) data hacking
(Open) data hacking(Open) data hacking
(Open) data hacking
 
La macchina più geek dell’universo The Turing Machine
La macchina più geek dell’universo The Turing MachineLa macchina più geek dell’universo The Turing Machine
La macchina più geek dell’universo The Turing Machine
 
UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Link...
UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Link...UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Link...
UNIBA: Exploiting a Distributional Semantic Model for Disambiguating and Link...
 
Building WordSpaces via Random Indexing from simple to complex spaces
Building WordSpaces via Random Indexing from simple to complex spacesBuilding WordSpaces via Random Indexing from simple to complex spaces
Building WordSpaces via Random Indexing from simple to complex spaces
 
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
COLING 2014 - An Enhanced Lesk Word Sense Disambiguation Algorithm through a ...
 
A Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional SpacesA Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional Spaces
 
Exploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question AnsweringExploiting Distributional Semantic Models in Question Answering
Exploiting Distributional Semantic Models in Question Answering
 
Sst evalita2011 basile_pierpaolo
Sst evalita2011 basile_pierpaoloSst evalita2011 basile_pierpaolo
Sst evalita2011 basile_pierpaolo
 
AI*IA 2012 PAI Workshop OTTHO
AI*IA 2012 PAI Workshop OTTHOAI*IA 2012 PAI Workshop OTTHO
AI*IA 2012 PAI Workshop OTTHO
 
Word Sense Disambiguation and Intelligent Information Access
Word Sense Disambiguation and Intelligent Information AccessWord Sense Disambiguation and Intelligent Information Access
Word Sense Disambiguation and Intelligent Information Access
 
Encoding syntactic dependencies by vector permutation
Encoding syntactic dependencies by vector permutationEncoding syntactic dependencies by vector permutation
Encoding syntactic dependencies by vector permutation
 

Último

SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Lokesh Kothari
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Sérgio Sacani
 

Último (20)

SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 

Analysing Word Meaning over Time by Exploiting Temporal Random Indexing

  • 1. Analysing Word Meaning over Time by Exploiting Temporal Random Indexing Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro Department of Computer Science – University of Bari Aldo Moro pierpaolo.basile@uniba.it Prima conferenza italiana di Linguistica Computazionale Pisa 9-10 Dicembre 2014
  • 2. Marty, in 2015 people will surf on the web!!! Surf!?!?! On the web!?!?!?
  • 3. Distributional Semantic Models (DSM) •Analysis of word-usage statistics over huge corpora •Geometrical space of concepts •Similar words are represented close in the space
  • 4. DSM issue… Corpus DSM Word Space Word Space is a snapshot of word co-occurrences over a linguistic corpus
  • 5. …DSM issue… Corpus1 Word Space1 Corpus2 Word Space2 Corpus3 Word Space3 Corpus4 Word Space4 Difficult to compare Word Spaces built on different corpora
  • 6. Temporal DSM Corpus1900 Word Space1 Corpus1920 Word Space2 Corpus1930 Word Space3 Corpus1940 Word Space4 Each corpus contains documents of a specific time period… …Word Spaces are still not comparable
  • 7. Temporal Random Indexing (TRI) Corpus1900 RI Space1 Corpus1920 RI Space2 Corpus1930 RI Space3 Corpus1940 RI Space4 Words (vectors) in different Word Spaces are comparable… …comparison on different time periods is possible!
  • 8. Random Indexing Corpus Vocabulary Assign a Random Vector ci to each term ci  <1, 0, 0, 0, -1, 0, 1, 0, - 1, 0, 0, 0, 0, 0, 0> RI Space A word vector svi is the sum of random vectors assigned to the co-occurring words 푠푣푖= 푐푖 −푚<푖<+푚푑∈퐶 Co-occurring words are defined as the set of m words that precede and follow wi
  • 9. Temporal Random Indexing (TRI)… Corpus T1-T2 Vocabulary Assign a Random Vector ci to each term RI Space T1-T2 Corpus T2-T3 Corpus T3-T4 Corpus Tn-1-Tn … RI Space T2-T3 RI Space T3-T4 RI Space Tn-1-Tn …
  • 10. …Temporal Random Indexing •For each Word Space Tk-Tk+1 –sum random vectors taking into account only documents dk in the period Tk-Tk+1 •A word wi has several semantic vectors (sv): one for each time period •Vectors are comparable 푠푣푖,푇푘= 푐푖 −푚<푖<+푚푑푘∈퐶
  • 11. TRI System •TRI system1 performs Temporal Random Indexing 1.Build a RI space for each year 2.Merge RI spaces to create new time periods 3.Load RI space and fetch vectors 4.Combine vectors 5.Retrieve similar vectors 6.Extract and compare neighbourhood of words 1https://github.com/pippokill/tri
  • 12. Evaluation •Two case studies 1.Project Gutenberg (PG): 349 Italian books from 1810 to 1922 2.ACL Anthology Network dataset (ANN): 21,212 papers published by ACL from 1960 to 2014 •Goals –Neighborhood: analyze the neighborhood of a word –Semantic shift: analyze words that clearly change their semantics
  • 13. PG Dataset •Dataset split in two time periods –Pre 1900 –Post 1900 •Analyze the neighbourhood: “patria” •Semantic shift: “cinematografo”
  • 14. PG: neighbourhood of “patria” Pre 1900 Post 1900 Libertà Libertà Opera Gloria Pari Giustizia Comune Comune Gloria Legge Nostra Pari Causa Virtù Italia Onore Giustizia Opera Guerra Popolo
  • 15. PG: semantic shift of “cinematografo” Tpost 1900: “cinematografo” strongly related to “sonoro” 푠푖푚(푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝푟푒1900,푠푣푐푖푛푒푚푎푡표푔푟푎푓표,푇푝표푠푡1900)=0.4
  • 16. ANN Dataset •Dataset split in decades •Analyze the neighbourhood: “semantics” •Semantic shift: “bioscience”, “unsupervised”
  • 17. ANN: neighbourhood of “semantics” 1960-1969 1970-1979 1980-1989 1990-1999 2000-2009 2010-2014 linguistic natural syntax syntax syntax syntax theory linguistic natural theory theory theory semantic semantic general interpretation interpretation interpretation syntactic theory theory general description description natural syntax semantic linguistic meaning complex linguistic language syntactic description linguistic meaning distributional processing linguistic complex logical linguistic process syntactic interpretation natural complex logical computational description model representation representation structures syntax analysis description logical structures representation
  • 18. ANN: neighbourhood of “semantics” 1960-1969 1970-1979 1980-1989 1990-1999 2000-2009 2010-2014 linguistic natural syntax syntax syntax syntax theory linguistic natural theory theory theory semantic semantic general interpretation interpretation interpretation syntactic theory theory general description description natural syntax semantic linguistic meaning complex linguistic language syntactic description linguistic meaning distributional processing linguistic complex logical linguistic process syntactic interpretation natural complex logical computational description model representation representation structures syntax analysis description logical structures representation
  • 19. ANN: neighbourhood of “semantics” 1960-1969 1970-1979 1980-1989 1990-1999 2000-2009 2010-2014 linguistic natural syntax syntax syntax syntax theory linguistic natural theory theory theory semantic semantic general interpretation interpretation interpretation syntactic theory theory general description description natural syntax semantic linguistic meaning complex linguistic language syntactic description linguistic meaning distributional processing linguistic complex logical linguistic process syntactic interpretation natural complex logical computational description model representation representation structures syntax analysis description logical structures representation
  • 20. ANN: neighbourhood of “semantics” 1960-1969 1970-1979 1980-1989 1990-1999 2000-2009 2010-2014 linguistic natural syntax syntax syntax syntax theory linguistic natural theory theory theory semantic semantic general interpretation interpretation interpretation syntactic theory theory general description description natural syntax semantic linguistic meaning complex linguistic language syntactic description linguistic meaning distributional processing linguistic complex logical linguistic process syntactic interpretation natural complex logical computational description model representation representation structures syntax analysis description logical structures representation
  • 21. …ANN: semantic shift “bioscience” bioscience extraterrestrial, extrasolar medline, bionlp, biomedi before 1990 nowadays
  • 22. …ANN: semantic shift “unsupervised” unsupervised observe, partition, selective supervised, disambiguation, probabilistic, algorithms, statistical before 1990 nowadays
  • 23. Conclusions •Temporal Random Indexing –build Word Spaces taking into account information about time –developed and published an open-source framework •Potentiality of our framework –capture word usage changes over time
  • 24. That’s all folks! https://github.com/pippokill/tri