Lidia Grigorieva
The Institute of Informatics Problems of the Russian
Academy of Sciences (IPI RAN)
Root != Stem
из — prefix
бир — root
а, тель, ниц — suffixes
а — ending
избирательниц — stem
Dimension reduction
— Dimension reduction is the process of reducing the
number of random variables in machine learning
tasks. Techniques for text:
— Lemmatization: grouping together the inflected
forms of a word. Tools: LemmaGen, morpha, pymorphy2,
mystem...
— Stemming: reducing inflected words to their word
stem. The stem need not be identical to
the morphological root of the word. Tools: Snowball,
Lovins, Porter, nltk.stem.* ...
— Root extraction: reducing derivatives to their root,
i.e. to their meaning.
Lemmatization
Mapping from text-word to lemma
мыла → мыть (verb, "wash")
мыла → мыло (noun, "soap")
Stemming
Mapping from text-word to stem (excluding
endings)
лесистый → лесист
лесник → лесник
лесничество → лесничеств
лесничий → леснич
лесной → лесн
Root extraction
Mapping from lemma to meaning
лесистый → лес
лесник → лес
лесничество → лес
лесничий → лес
лесной → лес
5 words to 1 root
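The effect of the two mappings on the feature space can be counted directly. The sketch below only reuses the word, stem, and root pairs shown on the slides; the helper name `distinct_features` is made up for illustration and is not part of the author's system.

```python
# Word -> stem and word -> root pairs copied from the slides above.
STEMS = {
    "лесистый": "лесист",
    "лесник": "лесник",
    "лесничество": "лесничеств",
    "лесничий": "леснич",
    "лесной": "лесн",
}
ROOTS = {word: "лес" for word in STEMS}


def distinct_features(mapping):
    """Number of distinct features left after applying the mapping."""
    return len(set(mapping.values()))


stem_features = distinct_features(STEMS)
root_features = distinct_features(ROOTS)
print(f"{len(STEMS)} words -> {stem_features} stems, {root_features} root")
```

For these five words stemming leaves the number of features unchanged, while root extraction collapses them into a single feature.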
Implementation
— Neural network algorithm
— Training data – 749 cases
— Cross-validation – 84 cases (10%)
— Test data – 93 cases
— Accuracy ~0.7
Tasks
— plagiarism detection;
— paraphrase detection;
— textual similarity;
— semantic disambiguation;
— topic modeling;
— text classification;
— text clustering;
— question answering systems;
— building semantic graphs (entities, links and
relationships between them).
References
— Рацибурская Л.В. Словарь уникальных морфем
современного русского языка. М.: Флинта: Наука, 2009. — 160
с.
— Аванесов Р.И., Ожегов С.И. Морфемно-орфографический
словарь Около 100 000 слов / А. Н. Тихонов. — М.: АСТ:
Астрель, 2002. — 704 с.
— Тихонов А.Н. Морфемно-орфографический словарь русского
языка, 2002.
— Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского
языка Ок. 52000 слов. — М.: Рус. яз., 1986. — 1132 с.
— http://old.kpfu.ru/infres/slovar1/begall.htm
— http://snowball.tartarus.org/algorithms/russian/stemmer.html,
http://snowballstem.org/demo.html
Effective Paraphrase Expansion in Addressing
Lexical Variability
Vasily Konovalov, Meni Adler, Ido Dagan
Department of Computer Science
Bar-Ilan University, Israel
The 5th conference on Artificial Intelligence and Natural
Language
Problem
Lexical Variability
From Negochat negotiation dialogue corpus:
‘Reject’: “I disagree”, “I reject your proposal”, “it’s not
accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the
programmer position”, “I propose you a pension of 10%”.
Solution
Translation-based paraphrase expansion
SENTENCE → MT1 → PL (pivot language) → MT2 → PARAPHRASE
MT engines shown in the diagram: Google, Yandex
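The round trip through a pivot language can be sketched as follows. The translator callables are placeholders: `toy_mt1`, `toy_mt2`, and the canned back-translations are invented for illustration and do not call any real MT API.

```python
from typing import Callable, List

# (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]


def expand_with_paraphrases(sentence: str, pivot_langs: List[str],
                            mt1: Translator, mt2: Translator,
                            source_lang: str = "en") -> List[str]:
    """Translate into each pivot language with mt1 and back with mt2."""
    paraphrases = []
    for lang in pivot_langs:
        pivot_text = mt1(sentence, source_lang, lang)
        back = mt2(pivot_text, lang, source_lang)
        # Keep only back-translations that differ from the input.
        if back.lower() != sentence.lower() and back not in paraphrases:
            paraphrases.append(back)
    return paraphrases


# Toy stand-ins for real MT engines (hypothetical outputs).
def toy_mt1(text, src, tgt):
    return f"[{tgt}] {text}"


TOY_BACK = {
    "[hu] I accept your offer": "I agree to your proposal",
    "[fr] I accept your offer": "I accept your offer",
}


def toy_mt2(text, src, tgt):
    return TOY_BACK.get(text, text)


result = expand_with_paraphrases("I accept your offer", ["hu", "fr"],
                                 toy_mt1, toy_mt2)
print(result)
```

In this toy run only the Hungarian round trip yields a new wording; the French one returns the input unchanged and is dropped.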
Our research questions
◮ What is the ‘best’ performing language? Why is it actually
the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?
Our research settings
Languages: Portuguese, French, German, Hebrew, Russian,
Arabic, Finnish, Chinese, Hungarian.
MT engines: Google Translate API, Microsoft Translator Text
API, Yandex Translate API.
Our findings
◮ Among tested languages Hungarian is the ‘best’ performing
one.
◮ The performance of a language correlates well with the
averaged smoothed BLEU.
◮ A language that generates the most lexically dissimilar
paraphrases is the ‘best’ performing language.
◮ The differences between MT engines are insignificant
according to the averaged smoothed BLEU and are not
reflected in evaluation.
◮ The language family relations are reflected in averaged
smoothed BLEU.
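A minimal, self-contained sketch of an averaged smoothed BLEU between sentences and their paraphrases is given below. The add-one smoothing applied to every n-gram order is an assumption; the slides do not say which smoothing variant was used. Lower scores mean more lexically dissimilar paraphrases.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing (an assumed variant)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


def average_bleu(pairs):
    """Mean smoothed BLEU over (sentence, paraphrase) pairs."""
    return sum(smoothed_bleu(s, p) for s, p in pairs) / len(pairs)


close = smoothed_bleu("I accept your offer", "I accept the offer")
far = smoothed_bleu("I accept your offer", "It is fine with me")
print(round(close, 3), round(far, 3))
```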
Come and see our poster
RESEARCHING
QUANTITATIVE
CHARACTERISTICS OF
SHORT TEXTS: SCIENTIFIC,
NEWS, USE WRITINGS
■ For data analysis, we used several text
collections.
■ Scientific texts: a collection from the Dialogue
conference (2003-2006) and from Corpus Linguistics.
■ News: a collection of short articles from mass media
such as Lenta.ru, the Russian
newspaper, RBC, Independent Newspaper, and
Kompyulenta.
■ To study writings from the Unified State
Examination (USE), we created two collections:
a "reference" collection of writings by
experts, and a second collection of writings by students.
■ For the study we selected the most representative
characteristics: entropy, readability, lexical
diversity, verbal, autosem (all words except
function words), and frequencies (the
ratio of occurrences of the hundred most frequent
words of the Russian language to all words in the
text).
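Three of these characteristics can be sketched in a few lines. The definitions used here are assumptions where the slides give none: entropy is taken as Shannon entropy over word frequencies, and lexical diversity as the type-token ratio. The frequency measure follows the definition above, with a two-word set standing in for the real list of the hundred most frequent Russian words.

```python
import math
from collections import Counter


def entropy(tokens):
    """Shannon entropy (bits) of the word distribution in a text."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by all words."""
    return len(set(tokens)) / len(tokens)


def frequency_share(tokens, top_words):
    """Share of tokens that belong to the list of most frequent words."""
    return sum(1 for t in tokens if t in top_words) / len(tokens)


# Toy text and a hypothetical stand-in for the top-100 word list.
tokens = ["и", "он", "и", "она"]
TOP_WORDS = {"и", "он"}

print(entropy(tokens), lexical_diversity(tokens),
      frequency_share(tokens, TOP_WORDS))
```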
[Bar charts, one per characteristic, comparing four collections (USE expert, USE students, News, Scientific): Entropy, Readability, Lexical Diversity, Verbal, Autosem, Frequencies]
Building a Lexicon-Based
Lemmatizer for Old Irish
Oksana Dereza
oksana.dereza@gmail.com
Old Irish: Grammar
• Changes can occur in any part of the word
o beginning: mutations
o middle: infixed pronouns
o end: inflections
caraid ‘he / she / it loves’
rob-car-si ‘she has loved you’
• Forms within one paradigm can look very different (esp. verbal forms)
do-beir ‘gives, brings’
ní t(h)abair ‘does not give, bring’
Old Irish: Orthography
• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled with a hyphen, with a whitespace, or as one word
• Later texts use mute vowels to indicate the quality (broad / slender) of adjacent consonants
⇨ a great number of possible spellings for every form
Consonant | Mutated spellings
b | bh, mb
c | ch, gc, cc
d | dh, nd
f | fh, ḟ, ḟh, bhf
g | gh, ng
l | ll, l-l
m | mh, mm, m-m
n | nn
p | ph, bp
r | rr
s | sh, ṡ, ss, ts, s-s
t | th, dt
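The mutated spellings listed above can drive a simple demutation step that strips an initial mutation from a word. This is only a sketch under assumptions: the assignment of each spelling to its base consonant is read off the table, the longest-prefix-first matching is a design choice made here, and the function `demutate` is not taken from the author's code.

```python
import unicodedata

# Base consonant -> mutated spellings, read off the table above.
MUTATIONS = {
    "b": ["bh", "mb"],
    "c": ["ch", "gc", "cc"],
    "d": ["dh", "nd"],
    "f": ["fh", "ḟ", "ḟh", "bhf"],
    "g": ["gh", "ng"],
    "l": ["ll", "l-l"],
    "m": ["mh", "mm", "m-m"],
    "n": ["nn"],
    "p": ["ph", "bp"],
    "r": ["rr"],
    "s": ["sh", "ṡ", "ss", "ts", "s-s"],
    "t": ["th", "dt"],
}

# Try longer mutated prefixes first so that "bhf" wins over "bh".
PREFIXES = sorted(
    ((unicodedata.normalize("NFC", mutated), base)
     for base, spellings in MUTATIONS.items() for mutated in spellings),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def demutate(word):
    """Replace an initial mutated spelling with its base consonant."""
    w = unicodedata.normalize("NFC", word.lower())
    for mutated, base in PREFIXES:
        if w.startswith(mutated):
            return base + w[len(mutated):]
    return w


for form in ["charas", "chuain", "cheast", "caraid"]:
    print(form, "->", demutate(form))
```

A rule this blunt will also rewrite words that merely happen to start with one of these letter sequences, so it is best read as a baseline.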
Data
• Dictionary of the Irish Language (DIL)
43,345 entries ⇨ 79,140 unique forms
• Corpus
125 texts, 831,280 tokens
• Gold standard
50 random sentences from the test corpus, 840 tokens
• Not only classical Old Irish
The corpus covers the 7th to 16th centuries
Problems
• DIL covers only ~ 41% of
unique forms in the corpus
• Many contracted forms, but
no unified system of
contractions
• Inconsistent use of markup
and punctuation
caraid
Cite this: eDIL s.v. caraid
or dil.ie/8212
Forms: -carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara,
carthai, caras, charas, caris, carthar, -charam, carait, charaíd, -carat,
cartae, cardda, carda, carde, cartar, carad, caram, carid, -carid, -carad,
carad, carthae, -chartais, carddais, cardáis, care, -charae, -carae, cara,
-rochra, -chara, cara, -carat, -carad, -charad, cechar, -cechra, -cechra,
cechras, -chechrat, -cechrainn, carais, carois, -cair, carsait, carsat,
charus, rob-car-si, ro-car, arro-car, char, rondob-carsam-ni, charsat,
charsad, ros-carsat, serc, carthain, carthi
weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402).
Ind. pres. 1 s. -carim, Wb. 5c7. -cairim, 23c12. caraim, Thes. ii 293.16.
-caraim, Ml. 79d1. -caru, Fél. Ep. 311. 2 s. -cari, Wb. 6c8. 3 s. carid,
Wb. 25d5. caraid, Ml. 75c4. -cara, Wb. 27d9. With suff. pron. 3 s. m.
carthai, Fráech 10. Rel. caras, Wb. 25c19. Ml. 91b17. charas, 30c3. caris,
Thes. ii 247.4. Pass. rel. carthar, Ml. 75c4. Sg. 193b3. 196b4. <…>
(a) loves (persons): nád carad som Iudeiu, Wb. 4d17. carad uir mulierem,
22c19. carsus fiadhu, Snedg. u. Mac R. 11.5. rot charus ar th'airscélaib
I have fallen in love with thee, LU 6084 (TBC). nít charadar nít tágedar,
TBC 2032 = -chara, LU 5797. car do chomnesam amal no-t-cara fén = dilige
proximum, PH 5837. gé no charfuinn fiche fear, KMMisc. 362.7. a fhir Chola
charuid mná `beloved of women', Sc.G. St. iv 62 § 10. ní charabh bean tsean
ná óg, Dánta Gr. 78.11. <…>
Lemmatizer
• Two methods for OOV-words
o Baseline: return a demutated form
o Predict a lemma using modified Damerau-Levenshtein
distance
• Disambiguation
o For homonymous forms, the lemma with the highest lexical
probability is chosen
o Lemma probability equals the sum of probabilities of its forms,
and form probability is its frequency count in the corpus
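The disambiguation rule can be sketched as follows: each lemma is scored by summing the corpus counts of its forms, and a homonymous form takes the highest-scoring lemma. The forms and lemmas come from the slides, but the counts are invented for illustration, and raw counts are used instead of normalised probabilities (which does not change the winner).

```python
from collections import defaultdict


def lemma_scores(form_to_lemmas, form_counts):
    """Score each lemma by the summed corpus counts of its forms."""
    scores = defaultdict(int)
    for form, lemmas in form_to_lemmas.items():
        for lemma in lemmas:
            scores[lemma] += form_counts.get(form, 0)
    return scores


def choose_lemma(form, form_to_lemmas, scores):
    """Pick the highest-scoring lemma among those listed for a form."""
    return max(form_to_lemmas[form], key=lambda lemma: scores[lemma])


# Toy lexicon: one homonymous form with two possible lemmas.
FORM_TO_LEMMAS = {
    "fíarfaigid": ["fíarfaigid", "íarmi-foich"],
    "íarmi-foich": ["íarmi-foich"],
}
FORM_COUNTS = {"fíarfaigid": 3, "íarmi-foich": 5}  # hypothetical counts

scores = lemma_scores(FORM_TO_LEMMAS, FORM_COUNTS)
chosen = choose_lemma("fíarfaigid", FORM_TO_LEMMAS, scores)
print(dict(scores), chosen)
```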
Predicting lemmas for OOV-words
• Generate all possible strings at edit distance 1 and 2
• Look them up in the dictionary
• Add real words to the candidate list
• Filter candidates by the first character
“If the unknown word starts with a vowel, the candidate should also start
with a vowel, and if the unknown word starts with a consonant, the
candidate should start with the same consonant”
• The lemma of the candidate with the highest lexical probability (i.e.
frequency count in the corpus) is taken as the lemma for the unknown word
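These steps can be sketched in the style of a classic spelling corrector. The edit operations below are plain deletions, transpositions, substitutions, and insertions; the author's modifications to Damerau-Levenshtein distance are not described on the slides, so this is an approximation. The tiny dictionary and its counts are hypothetical, and the input word is assumed to be non-empty.

```python
VOWELS = set("aeiouáéíóú")


def edits1(word, alphabet):
    """All strings one edit operation away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def same_initial_class(unknown, candidate):
    """Vowel-initial words need a vowel; consonants must match exactly."""
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]


def predict_lemma(word, form_counts, form_to_lemma):
    """Lemma of the most frequent dictionary form within two edits."""
    alphabet = sorted({ch for form in form_counts for ch in form})
    one = edits1(word, alphabet)
    two = {e2 for e1 in one for e2 in edits1(e1, alphabet)}
    candidates = [w for w in one | two
                  if w in form_counts and same_initial_class(word, w)]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: form_counts[w])
    return form_to_lemma[best]


# Toy dictionary: forms from the slides, hypothetical counts.
FORM_COUNTS = {"eólas": 12, "ceist": 7}
FORM_TO_LEMMA = {"eólas": "eólas", "ceist": "ceist"}

print(predict_lemma("eólais", FORM_COUNTS, FORM_TO_LEMMA))
print(predict_lemma("cheast", FORM_COUNTS, FORM_TO_LEMMA))
```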
Evaluation
Lexicon | Forms | ‘Recall’
DIL forms only | 79,140 | 74.7 %
DIL + 1000 most frequent OOV-words | 80,206 | 80.0 %
! 4,889 homonymous forms

 | Baseline | Predicted lemmas
Lemmatized correctly | 483 / 840 | 552 / 840
Accuracy | 57.50 % | 65.71 %
Evaluation
Tokens: 840
Known words: 654
Unknown words: 186
Lemmatized correctly: 552
Lemmas predicted for unknown words: 157
Predicted correctly: 84
Predicted incorrectly: 68
Several lemmas predicted including the correct one, but the wrong one is chosen: 5
~ 60 % of lemmas are predicted correctly
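The accuracy figures can be recomputed from the raw counts on the evaluation slides. The snippet below is only an arithmetic check; the variable names are chosen here for readability.

```python
# Counts copied from the evaluation slides.
tokens, known, unknown = 840, 654, 186
baseline_correct, predicted_correct = 483, 552
predicted_for_unknown = 157
correct, incorrect, ambiguous_wrong = 84, 68, 5


def percent(part, whole):
    """Percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


baseline_accuracy = percent(baseline_correct, tokens)
predicted_accuracy = percent(predicted_correct, tokens)

# Consistency checks: the sub-counts add up to their totals.
counts_add_up = (known + unknown == tokens
                 and correct + incorrect + ambiguous_wrong
                 == predicted_for_unknown)

print(baseline_accuracy, predicted_accuracy, counts_add_up)
```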
Predicted lemmas
 | Token | Best candidate from closest dictionary forms | Best candidate’s lemma | Chosen lemma
+ | eólais | eólas | eólas | eólas
+ | fiarfaigid | fíarfaigid | fíarfaigid, íarmi-foich | íarmi-foich
+ | cheast | ceist | ceist | ceist
* | déa | dia | dá, de, do, día | de
+ | bréithir | bréthir | bríathar | bríathar
– | n-uaill | aill | aile, aill, all, aille | aile
– | chuain | cain | cain, canaid, cani, caingen | canaid
– | christ | ceist | ceist | ceist
– | caeme | caíme | caíme | caíme
– | chniss | cliss | cles | cles
Source Code & Corpora
Source code
https://github.com/ancatmara/old_irish_lemmatizer
Texts
https://github.com/ancatmara/old_irish_corpora
Extraction of Social
Networks from Literary Text
Tsygankova Viktoria,
National Research University
Higher School of Economics, Moscow
NovelGraphs
a tool for automatic annotation
of texts and for extracting social
networks of characters from text,
where nodes represent
characters and edges are
relations between them.
It can also analyze structural
balance of the resulting graphs.
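A character graph of this kind can be sketched with a simple co-occurrence rule. Treating two characters named in the same sentence as interacting is an assumption made for this example; the slides do not describe the extractors and aggregators that NovelGraphs actually uses. The sentences are invented.

```python
from collections import Counter
from itertools import combinations


def build_graph(sentences, characters):
    """Weighted edges between characters named in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        present = sorted(c for c in characters if c in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


# Toy input: invented sentences using character names from the slides.
SENTENCES = [
    "Holmes greeted Lestrade and Gregson.",
    "Gregson argued with Lestrade.",
    "Holmes smiled.",
]
CHARACTERS = ["holmes", "lestrade", "gregson"]

edges = build_graph(SENTENCES, CHARACTERS)
for (a, b), weight in sorted(edges.items()):
    print(a, "-", b, weight)
```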
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde. Node labels: prince paradox, duke de valentinois, henry wotton, narborough, borgia, filippo, hallward, louis xii, lady henry, erskine, adrian, gian maria visconti, romeo, gray, mercutio, ruxton]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle. Node labels: lestrade, gregson, murcher, rance, holmes, narrator, eph stangerson]
[Figure: example graph of “A Study in Scarlet” by A. Conan Doyle, with sentiment]
[Figure: example graph of “The Picture of Dorian Gray” by Oscar Wilde, with sentiment]
Conclusions
  NovelGraphs, a tool for English-language literary fiction,
was created; it uses a new approach to extracting characters
and the connections between them.
  Nodes represent characters found in the text,
and edges connect them to other characters
with whom they interact.
  At the moment, combinations of extractors and
aggregators detect characters better than
interactions between them.
  Analysis of structural balance identifies key
passages of the text that correspond to the
minima and maxima on the balance plot.
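One common way to quantify structural balance is the share of balanced triangles in a signed graph, where a triangle is balanced if the product of its edge signs is positive. Whether NovelGraphs computes balance this way is an assumption; the signed edges below are invented for illustration.

```python
from itertools import combinations


def balance_ratio(signed_edges):
    """Share of fully connected triads whose sign product is positive."""
    sign = {frozenset(pair): s for pair, s in signed_edges.items()}
    nodes = sorted({n for pair in sign for n in pair})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        sides = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(side in sign for side in sides):
            total += 1
            if sign[sides[0]] * sign[sides[1]] * sign[sides[2]] > 0:
                balanced += 1
    return balanced / total if total else None


# Toy signed graph: +1 for a positive relation, -1 for a negative one.
TENSE = {
    ("holmes", "lestrade"): 1,
    ("holmes", "gregson"): 1,
    ("lestrade", "gregson"): -1,
}
FRIENDLY = {pair: 1 for pair in TENSE}

print(balance_ratio(TENSE), balance_ratio(FRIENDLY))
```

Computing this ratio over successive windows of a text would give a curve whose minima and maxima can be inspected, in the spirit of the balance plot mentioned above.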
Thanks for watching!
Are the results of your corpus
research really reliable?
Getting automatic result analysis on
GICR.
Tatiana Shavrina, Daniil Selegey
AINL FRUCT, SPb, 12.11.2016
Big Corpora Problem:
1. Billions of words, mostly coming from
social media
2. Getting just the IPM (instances per million) and search
results in KWIC format doesn’t tell
you whether the results are biased
3. A lot of metatext attributes (URLs,
doc IDs, author IDs, region, gender,
genre, etc.) are all potential sources
of bias
Users need corpus tools that show all statistics of the
search area, to check for homogeneity with the
whole corpus.
Our solution:
Search results analysis right in the interface!
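A homogeneity check of this kind can be sketched by comparing how one metatext attribute is distributed in the search results and in the whole corpus. The function names, the 0.2 threshold, and the toy data are assumptions made for illustration and do not describe the GICR interface; both inputs are assumed to be non-empty.

```python
from collections import Counter


def share(values):
    """Relative frequency of each attribute value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def skewed_values(hits, corpus, threshold=0.2):
    """Attribute values whose share in the hits deviates from the corpus."""
    hit_share, corpus_share = share(hits), share(corpus)
    report = {}
    for value, p_corpus in corpus_share.items():
        diff = hit_share.get(value, 0.0) - p_corpus
        if abs(diff) > threshold:
            report[value] = round(diff, 3)
    return report


# Toy data: author gender for the whole corpus and for one search result set.
corpus_gender = ["f"] * 50 + ["m"] * 50
hits_gender = ["m"] * 9 + ["f"]

report = skewed_values(hits_gender, corpus_gender)
print(report)
```

Here the search results are dominated by one gender although the corpus is evenly split, which is the kind of bias a KWIC listing alone would not reveal.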
See you at our
Demo stand!

Immunoblott technique for protein detection.ppt
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
Advances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of CancerAdvances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of Cancer
 

Extracting Social Networks from Literary Texts

  • 1. Lidia Grigorieva The Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN)
  • 2. Root != Stem. In избирательница ('female voter'): из is the prefix, бир is the root, а, тель and ниц are suffixes, а is the ending, and избирательниц is the stem.
  • 3. Dimension reduction: the process of reducing the number of random variables in machine learning tasks. Lemmatization: grouping together the inflected forms of a word (LemmaGen, morpha, pymorphy2, mystem...). Stemming: reducing inflected words to their word stem; the stem need not be identical to the morphological root of the word (Snowball, Lovins, Porter, nltk.stem.* ...). Root extraction: reducing derivatives to their root, i.e. meaning.
  • 4. Lemmatization: mapping from text-word to lemma. Example: the text-word мыла maps either to the lemma мыть (verb, 'to wash') or to the lemma мыло (noun, 'soap').
  • 5. Stemming: mapping from text-word to stem (excluding endings). лесистый → лесист; лесник → лесник; лесничество → лесничеств; лесничий → леснич; лесной → лесн.
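The word-to-stem pairs on this slide can be reproduced with a toy ending stripper. This is only an illustrative sketch: it is not the Snowball algorithm, the `ENDINGS` list is a hypothetical hand-picked set that covers just these five words, and the common-prefix step is a crude stand-in for root extraction, not the neural network the presentation reports.

```python
import os

# Hypothetical, hand-picked endings that cover only the examples on the slide.
ENDINGS = ("ый", "ий", "ой", "о")

def strip_ending(word: str) -> str:
    """Drop the first matching ending; leave the word unchanged otherwise."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word

words = ["лесистый", "лесник", "лесничество", "лесничий", "лесной"]
stems = [strip_ending(w) for w in words]

# Crude stand-in for root extraction: the longest common prefix of the stems.
shared_root = os.path.commonprefix(stems)

print(stems)
print(len(set(stems)), "distinct stems; shared prefix:", shared_root)
```

Stemming leaves five distinct stems for five words, while a root-level mapping collapses them to a single item, which is the dimension reduction the slides argue for.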
  • 6. Root extraction: mapping from lemma to meaning. лесистый → лес; лесник → лес; лесничество → лес; лесничий → лес; лесной → лес (5 lemmas to 1 root).
  • 7. Implementation: neural network algorithm; training data: 749 cases; cross-validation: 84 cases (10%); test data: 93 cases; accuracy ~0.7.
  • 8. Tasks: plagiarism detection; paraphrase detection; textual similarity; semantic disambiguation; topic modeling; text classification; text clustering; question answering systems; building semantic graphs (entities, links and relationships between them).
  • 9. References: Рацибурская Л. В. Словарь уникальных морфем современного русского языка. М.: Флинта: Наука, 2009. 160 с.; Аванесов Р. И., Ожегов С. И. Морфемно-орфографический словарь. Около 100 000 слов / А. Н. Тихонов. М.: АСТ: Астрель, 2002. 704 с.; Тихонов А. Н. Морфемно-орфографический словарь русского языка, 2002; Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского языка. Ок. 52000 слов. М.: Рус. яз., 1986. 1132 с.; http://old.kpfu.ru/infres/slovar1/begall.htm; http://snowball.tartarus.org/algorithms/russian/stemmer.html; http://snowballstem.org/demo.html
  • 10. Effective Paraphrase Expansion in Addressing Lexical Variability Vasily Konovalov, Meni Adler, Ido Dagan Department of Computer Science Bar-Ilan University, Israel The 5th conference on Artificial Intelligence and Natural Language
  • 11. Problem Lexical Variability From Negochat negotiation dialogue corpus: ‘Reject’: “I disagree”, “I reject your proposal”, “it’s not accepted”. ‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”. ‘Offer’: “I offer you a salary of 60,000 USD”, “How about the programmer position”, “I propose you a pension of 10%”.
  • 12. Solution: translation-based paraphrase expansion. [Diagram: SENTENCE → MT1 → PL → MT2 → PARAPHRASE; MT engines shown: Google, Yandex]
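A minimal sketch of translation-based paraphrase expansion, assuming the diagram means a round trip through a pivot language with one MT engine per direction. The `Translator` signature, `expand_with_paraphrases`, and the stub engines are hypothetical; no real Google or Yandex API call is shown, and the stub "translations" are made up.

```python
from typing import Callable, Dict, List

# Hypothetical translator interface: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]

def expand_with_paraphrases(sentence: str, pivot_langs: List[str],
                            engines: Dict[str, Translator],
                            source_lang: str = "en") -> List[str]:
    """Round-trip the sentence through each pivot language and engine pair."""
    paraphrases = set()
    for pivot in pivot_langs:
        for mt1 in engines.values():          # source -> pivot
            for mt2 in engines.values():      # pivot -> source
                pivot_text = mt1(sentence, source_lang, pivot)
                back = mt2(pivot_text, pivot, source_lang)
                if back.strip().lower() != sentence.strip().lower():
                    paraphrases.add(back)
    return sorted(paraphrases)

def make_stub(table: Dict[tuple, str]) -> Translator:
    """Build a fake engine from a lookup table (for illustration only)."""
    def translate(text: str, src: str, tgt: str) -> str:
        return table.get((text, src, tgt), text)
    return translate

stub_a = make_stub({("I reject your proposal", "en", "hu"): "HU:reject-proposal",
                    ("HU:reject-proposal", "hu", "en"): "I refuse your offer"})
stub_b = make_stub({("I reject your proposal", "en", "hu"): "HU:reject-proposal",
                    ("HU:reject-proposal", "hu", "en"): "I decline your proposal"})

result = expand_with_paraphrases("I reject your proposal", ["hu"],
                                 {"a": stub_a, "b": stub_b})
print(result)
```

Mixing engines across the two directions is what lets the research question about the "best" combination of MT engines be posed at all: each (MT1, MT2) pair yields its own set of paraphrases.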
  • 13. Our research questions ◮ What is the ‘best’ performing language? Why is it actually the ‘best’ one? ◮ What is the ‘best’ performing combination of MT engines?
  • 14. Our research settings Languages: Portuguese, French, German, Hebrew, Russian, Arabic, Finnish, Chinese, Hungarian. MT engines: Google Translate API, Microsoft Translator Text API, Yandex Translate API.
  • 15. Our findings ◮ Among tested languages Hungarian is the ‘best’ performing one. ◮ The performance of a language correlates well with the averaged smoothed BLEU. ◮ A language that generates the most lexically dissimilar paraphrases is the ‘best’ performing language. ◮ The differences between MT engines are insignificant according to the averaged smoothed BLEU and are not reflected in evaluation. ◮ The language family relations are reflected in averaged smoothed BLEU.
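The slides do not say which smoothing was used for the "averaged smoothed BLEU", so the following is only a sketch under an assumption: add-one smoothing applied to every n-gram order, which is one of several common choices. The function names are hypothetical. It shows how a lexically dissimilar paraphrase scores lower against the original sentence than a close one.

```python
import math
from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing (an assumed variant)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)

def averaged_bleu(pairs: List[Tuple[str, str]]) -> float:
    """Mean smoothed BLEU over (original, paraphrase) pairs."""
    return sum(smoothed_bleu(r, h) for r, h in pairs) / len(pairs)

original = "I reject your proposal"
close = smoothed_bleu(original, "I reject your offer")
far = smoothed_bleu(original, "I refuse your offer")
print(round(close, 3), round(far, 3))
```

Under this reading, a pivot language whose round trips give a lower average score produces paraphrases that differ more from the originals.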
  • 16. Come and see our poster
  • 18. ■ For data analysis, we used several text collections. ■ Scientific texts: a collection from the Dialogue conference (2003-2006) and from Corpus Linguistics. ■ News: a collection of short mass-media articles from sources such as Lenta.ru, the Russian newspaper, RBC, Independent Newspaper, and Kompyulenta. ■ To study essays from the Unified State Examination (USE), we created two collections: a "reference" collection of essays written by experts, and a second collection of essays written by students.
  • 19. ■ For the study we selected the most representative characteristics: entropy, readability, lexical diversity, verbal, autosem (all words except function words), and frequencies (the ratio of occurrences of the hundred most frequent words of the Russian language to all words in the text).
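The slides name these characteristics without defining them, so the sketch below uses common textbook definitions as assumptions: Shannon entropy over the word distribution, type-token ratio for lexical diversity, and the share of tokens that belong to a supplied list of frequent words. All function names are hypothetical, and the frequent-word list is passed in rather than invented here.

```python
import math
import re
from collections import Counter
from typing import Iterable, List

def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; \\w also matches Cyrillic letters."""
    return re.findall(r"\w+", text.lower())

def word_entropy(tokens: List[str]) -> float:
    """Shannon entropy (bits) of the word frequency distribution."""
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def lexical_diversity(tokens: List[str]) -> float:
    """Type-token ratio: distinct words divided by all words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def frequent_word_share(tokens: List[str], frequent_words: Iterable[str]) -> float:
    """Share of tokens that occur in the given list of frequent words."""
    frequent = set(frequent_words)
    return sum(t in frequent for t in tokens) / len(tokens) if tokens else 0.0

sample = tokenize("и лес и поле")
print(word_entropy(sample), lexical_diversity(sample),
      frequent_word_share(sample, ["и", "в", "не"]))
```

Computing the same three numbers per collection would give one value per bar in the charts on the following slides.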
  • 20. [Bar chart] Entropy for USE expert, USE students, News, Scientific (axis 0 to 14).
  • 21. [Bar chart] Readability for USE expert, USE students, News, Scientific (axis 0 to 0.25).
  • 22. [Bar chart] Lexical Diversity for USE expert, USE students, News, Scientific (axis 0 to 0.7).
  • 24. [Bar chart] Autosem for USE expert, USE students, News, Scientific (axis 0.68 to 0.8).
  • 25. [Bar chart] Frequencies for USE expert, USE students, News, Scientific (axis 0 to 0.3).
  • 26. Building a Lexicon-Based Lemmatizer for Old Irish Oksana Dereza oksana.dereza@gmail.com
  • 27. Old Irish: Grammar • Changes can occur to any part of the word o beginning: mutations o middle: infixed pronouns o end: flections caraid ‘he / she / it loves’ rob-car-si ‘she has loved you’ • Very differently looking forms in a paradigm (esp. verbal) do-beir ‘gives, brings’ ní t(h)abair ‘does not give, bring’
  • 28. Old Irish: Orthography • Inconsistent use of length marks • Mutations are not always shown in writing • Complex verb forms can be spelled either with or without a hyphen or a whitespace • In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them ⇨ a great number of possible spellings for every form • Consonants and their mutated spellings: b: bh, mb; c: ch, gc, cc; d: dh, nd; f: fh, ḟ, ḟh, bhf; g: gh, ng; l: ll, l-l; m: mh, mm, m-m; n: nn; p: ph, bp; r: rr; s: sh, ṡ, ss, ts, s-s; t: th, dt
  • 29. Data • Dictionary of the Irish Language (DIL): 43,345 entries ⇨ 79,140 unique forms • Corpus: 125 texts, 831,280 tokens • Gold standard: 50 random sentences from the test corpus, 840 tokens • Not only classical Old Irish: the corpus covers the 7th to 16th centuries
  • 30. Problems • DIL covers only ~ 41% of unique forms in the corpus • Many contracted forms, but no unified system of contractions • Inconsistent use of markup and punctuation caraid Cite this: eDIL s.v. caraid or dil.ie/8212 Forms: -carim, -cairim, caraim, -caraim, -caru, - cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, -charam, carait, charaíd, -carat, cartae, cardda, carda, carde, cartar, carad, caram, carid, -carid, - carad, carad, carthae, - chartais, carddais, cardáis, care, -charae, -carae, cara, -rochra, -chara, cara, - carat, -carad, -charad, cechar, -cechra, -cechra, cechras, -chechrat, - cechrainn, carais, carois, - cair, carsait, carsat, charus, rob-car-si, ro-car, arro-car, char, rondob- carsam-ni, charsat, charsad, ros-carsat, serc, carthain, carthi weak vb. with reduplicated fut. on analogy of canaid ( Thurn. Gramm. 402 ). Ind. pres. 1 s. -carim, Wb. 5c7 . -cairim, 23c12 . caraim, Thes. ii 293.16 . -caraim, Ml. 79d1 . -caru, Fél. Ep. 311 . 2 s. -cari, Wb. 6c8 . 3 s. carid, Wb. 25d5 . caraid , Ml. 75c4 . - cara, Wb. 27d9 . With suff. pron. 3 s. m. carthai, Fráech 10 . Rel. caras, Wb. 25c19 . Ml. 91b17 . charas, 30c3 . caris, Thes. ii 247.4 . Pass. rel. carthar, Ml. 75c4 . Sg. 193b3 . 196b4 .. <…> (a) loves (persons): nád carad som Iudeiu, Wb. 4d17 . carad uir mulierem, 22c19 . carsus fiadhu, Snedg. u. Mac R. 11.5 . rot charus ar th'airscélaib I have fallen in love with thee, LU 6084 (TBC). nít charadar nít tágedar, TBC 2032 = - chara, LU 5797 . car do chomnesam amal no-t-cara fén = dilige proximum, PH 5837 . gé no charfuinn fiche fear, KMMisc. 362.7 . a fhir Chola charuid mná `beloved of women', Sc.G. St. iv 62 § 10 . ní charabh bean tsean ná óg, Dánta Gr. 78.11 . <…>
  • 31. Lemmatizer • Two methods for OOV-words o Baseline: return a demutated form o Predict a lemma using modified Damerau-Levenshtein distance • Disambiguation o For homonymous forms, the lemma with the highest lexical probability is chosen o Lemma probability equals the sum of probabilities of its forms, and form probability is its frequency count in the corpus
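The disambiguation rule can be sketched as follows, assuming form probability means relative frequency in the corpus. The function names, the two-lemma lexicon, and the counts are hypothetical toy data, not taken from DIL or the author's repository.

```python
from typing import Dict, Optional, Set

def lemma_probabilities(lemma_forms: Dict[str, Set[str]],
                        form_counts: Dict[str, int]) -> Dict[str, float]:
    """Lemma probability = sum of the corpus probabilities of its forms."""
    total = sum(form_counts.values()) or 1
    return {lemma: sum(form_counts.get(f, 0) for f in forms) / total
            for lemma, forms in lemma_forms.items()}

def choose_lemma(form: str, lemma_forms: Dict[str, Set[str]],
                 form_counts: Dict[str, int]) -> Optional[str]:
    """For a homonymous form, pick the lemma with the highest probability."""
    probs = lemma_probabilities(lemma_forms, form_counts)
    candidates = [l for l, forms in lemma_forms.items() if form in forms]
    if not candidates:
        return None
    return max(candidates, key=lambda l: probs[l])

# Toy lexicon: the form "cara" is listed under two lemmas (made-up data).
lemma_forms = {"caraid": {"caraid", "carat", "cara"},
               "cara": {"cara", "carat"}}
form_counts = {"caraid": 10, "carat": 4, "cara": 6}  # hypothetical counts

probs = lemma_probabilities(lemma_forms, form_counts)
chosen = choose_lemma("cara", lemma_forms, form_counts)
print(probs, chosen)
```

Because a lemma with many attested forms accumulates more probability mass, this rule favors lemmas with large, frequent paradigms.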
  • 32. Predicting lemmas for OOV-words • Generate all possible strings at edit distance 1 and 2 • Look them up in the dictionary • Add real words to the candidate list • Filter candidates by the first character: “If the unknown word starts with a vowel, the candidate should also start with a vowel, and if the unknown word starts with a consonant, the candidate should start with the same consonant” • The lemma of the candidate with the highest lexical probability (i.e. frequency count in the corpus) is taken as the lemma for the unknown word
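The steps above can be sketched with a plain Damerau-Levenshtein candidate generator. The slides mention a modified distance without describing the modification, so none is attempted here. The three-entry dictionary reuses forms and lemmas from the predicted-lemmas table later in the deck, while the counts, the vowel set, and all function names are assumptions.

```python
from typing import Dict, Optional, Set

VOWELS = set("aeiouáéíóú")  # assumed vowel inventory for the first-letter filter

def edits1(word: str, alphabet: str) -> Set[str]:
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def same_start(unknown: str, candidate: str) -> bool:
    """Vowel-initial words need a vowel-initial candidate; else same consonant."""
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]

def predict_lemma(unknown: str, form_to_lemma: Dict[str, str],
                  form_counts: Dict[str, int]) -> Optional[str]:
    if not unknown:
        return None
    alphabet = "".join(sorted({ch for form in form_to_lemma for ch in form}))
    near = edits1(unknown, alphabet)
    far = {e2 for e1 in near for e2 in edits1(e1, alphabet)}
    candidates = [w for w in near | far
                  if w and w in form_to_lemma and same_start(unknown, w)]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: (form_counts.get(w, 0), w))
    return form_to_lemma[best]

form_to_lemma = {"eólas": "eólas", "bréthir": "bríathar", "ceist": "ceist"}
form_counts = {"eólas": 12, "bréthir": 7, "ceist": 9}  # hypothetical counts

for token in ["eólais", "bréithir", "cheast"]:
    print(token, "->", predict_lemma(token, form_to_lemma, form_counts))
```

Generating candidates and then checking them against the dictionary keeps the search independent of dictionary size, at the cost of a candidate set that grows quickly with word length and alphabet size.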
  • 33. Evaluation
      Lexicon | Forms | ‘Recall’
      DIL forms only | 79,140 | 74.7 %
      DIL + 1000 most frequent OOV-words | 80,206 | 80.0 %
      ! 4,889 homonymous forms
      Measure | Baseline | Predicted lemmas
      Lemmatized correctly | 483 / 840 | 552 / 840
      Accuracy | 57.50 % | 65.71 %
  • 34. Evaluation
      Tokens: 840
      Known words: 654
      Unknown words: 186
      Lemmatized correctly: 552
      Lemmas predicted for unknown words: 157
      Predicted correctly: 84
      Predicted incorrectly: 68
      Several lemmas predicted including the correct one, but the wrong one is chosen: 5
      ~ 60 % of lemmas are predicted correctly
  • 35. Predicted lemmas
      Mark | Token | Best candidate from closest dictionary forms | Best candidate’s lemma | Chosen lemma
      + | eólais | eólas | eólas | eólas
      + | fiarfaigid | fíarfaigid | fíarfaigid, íarmi-foich | íarmi-foich
      + | cheast | ceist | ceist | ceist
      * | déa | dia | dá, de, do, día | de
      + | bréithir | bréthir | bríathar | bríathar
      – | n-uaill | aill | aile, aill, all, aille | aile
      – | chuain | cain | cain, canaid, cani, caingen | canaid
      – | christ | ceist | ceist | ceist
      – | caeme | caíme | caíme | caíme
      – | chniss | cliss | cles | cles
  • 36. Source Code & Corpora Source code https://github.com/ancatmara/old_irish_lemmatizer Texts https://github.com/ancatmara/old_irish_corpora
  • 37. Extraction of Social Networks from Literary Text Tsygankova Viktoria, National Research University Higher School of Economics, Moscow
  • 38. NovelGraphs a tool for automatic annotation of texts and for extracting social networks of characters from text, where nodes represent characters and edges are relations between them. It can also analyze structural balance of the resulting graphs.
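The slides do not describe NovelGraphs' extractors or aggregators, so the sketch below swaps in a much simpler technique: sentence-level co-occurrence for edges, and the usual triangle test for structural balance (a triangle is balanced when the product of its edge signs is positive). The sentences, edge signs, and function names are hypothetical; only the character names come from the example graph.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str]

def build_graph(sentences: Iterable[str], characters: Iterable[str]) -> Counter:
    """Count how often two characters are mentioned in the same sentence."""
    names = set(characters)
    edges: Counter = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        present = sorted(names & words)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1  # keys are alphabetically ordered pairs
    return edges

def balance_ratio(signs: Dict[Edge, int]) -> Optional[float]:
    """Share of fully connected triangles whose sign product is positive."""
    nodes = sorted({n for edge in signs for n in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        triple = [(a, b), (a, c), (b, c)]
        if all(e in signs for e in triple):
            total += 1
            balanced += signs[(a, b)] * signs[(a, c)] * signs[(b, c)] > 0
    return balanced / total if total else None

sentences: List[str] = ["Holmes greeted Lestrade and Gregson.",
                        "Lestrade envied Gregson.",
                        "Holmes thanked Gregson."]
edges = build_graph(sentences, ["holmes", "lestrade", "gregson"])

# Hypothetical sentiment signs: +1 friendly, -1 hostile.
signs = {("gregson", "holmes"): 1, ("gregson", "lestrade"): -1,
         ("holmes", "lestrade"): 1}
print(dict(edges), balance_ratio(signs))
```

Recomputing the balance ratio over a sliding window of the text would give a balance curve whose minima and maxima can be inspected.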
  • 39. Example graph of the “Picture of Dorian Gray” by Oscar Wilde. Node labels: prince paradox, duke de valentinois, henry wotton, narborough, borgia, filippo, hallward, louis xii, lady henry, erskine, adrian, gian maria visconti, romeo, gray, mercutio, ruxton
  • 40. Example graph of the “Study in Scarlet” by A. Conan Doyle. Node labels: lestrade, gregson, murcher, rance, holmes, narrator, eph, stangerson
  • 41. Example graph of the “Study in Scarlet” by A. Conan Doyle with sentiment
  • 42. Example graph of the “Picture of Dorian Gray” by Oscar Wilde with sentiment
  • 43. Conclusions   A tool NovelGraphs was created for English-language literary fiction, which uses a new approach of extracting characters and connections between them.   Nodes represent characters found in the text, and edges connect them to other characters with whom they interact.   At the moment, combinations of extractors and aggregators detect characters better than interactions between them.   Analysis of structural balance identifies key passages of the text that correspond to the minima and maxima on the balance plot.
  • 45. Are the results of your corpus research really reliable? Getting automatic result analysis on GICR. Tatiana Shavrina, Daniil Selegey AINL FRUCT, SPb, 12.11.2016
  • 46. Big Corpora Problem: 1. Billions of words, mostly coming from social media 2. Getting just the IPM and search results in KWIC format doesn’t tell you if the results are biased 3. A lot of metatext attributes (URLs, doc IDs, author IDs, region, gender, genre etc.) are all potential sources of bias. Users need corpus tools to see all statistics of the search area to check for homogeneity with the whole corpus.
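One way to check search results for homogeneity with the whole corpus is a chi-square goodness-of-fit statistic over a single metatext attribute. This is a generic sketch, not the method implemented in GICR; the function name and the counts are hypothetical, and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
from typing import Dict, Tuple

def chi_square_vs_corpus(result_counts: Dict[str, int],
                         corpus_counts: Dict[str, int]) -> Tuple[float, int]:
    """Compare the attribute distribution of search hits with the corpus.

    Returns the chi-square statistic and the degrees of freedom.
    Attribute values missing from corpus_counts are ignored.
    """
    n_results = sum(result_counts.values())
    corpus_total = sum(corpus_counts.values())
    statistic = 0.0
    for value, corpus_count in corpus_counts.items():
        expected = n_results * corpus_count / corpus_total
        observed = result_counts.get(value, 0)
        if expected > 0:
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(corpus_counts) - 1

# Hypothetical counts of documents by author gender.
corpus = {"female": 500, "male": 500}
skewed_hits = {"female": 90, "male": 10}
even_hits = {"female": 50, "male": 50}

CRITICAL_DF1_P05 = 3.841  # chi-square critical value, df = 1, alpha = 0.05

for hits in (skewed_hits, even_hits):
    stat, df = chi_square_vs_corpus(hits, corpus)
    print(hits, stat, "biased" if stat > CRITICAL_DF1_P05 else "homogeneous")
```

Running the same check per attribute (region, genre, author ID) would flag which metadata dimension a query's hits are skewed on.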
  • 47. Our solution: Search results analysis right in the interface!
  • 48. See you at our Demo stand!