SlideShare una empresa de Scribd logo
1 de 27
Descargar para leer sin conexión
OpenWordnet-PT: A Project Report
Alexandre Rademaker1,5 Valeria de Paiva2 Gerard de Melo3
Livy Maria Real Coelho4 Maria Gatti5
FGV/EMAp
Nunance Comm.
Tsinghua University
UFP
IBM Research

February 2, 2014
Why we started openWordnet-PT?

We need a Portuguese Wordnet for our work, but none of the previous
projects is openly available.

Aren’t all wordnets open?
Getulio Vargas Foundation (FGV)

Brazilian higher education and
research institution founded in 1944.
It offers regular courses of
Economics, Business Administration,
Law, Social Sciences and Applied
Mathematics. Its original goal was to
train people for the country’s public
and private-sector management.
Considered a top-5 policymaker
think-tank worldwide.
http://portal.fgv.br
CPDOC - Center of Brazilian Contemporary History
A major center for teaching and researching in the Social Sciences and
Contemporary History located in Rio de Janeiro. It holds:
Personal Archives (Acessus) ≈ 200 archives, up to 1,8M docs or
5.2M pages (700K digitalized), among text (handwritten and
printed), letters, memos, diaries, images and videos.
Oral History Program (PHO) A huge set of testimonies (in audio
and video) consisting of more than 2K interviews, which correspond
to up to 6K hours of recordings. 90% in digital format. Almost all
transcribed. Limit access, not online.
Brazilian Historical Biographic Dictionary (DHBB) 7,5K entries,
6,5K are of biographical and 1K related to institutions, events and
concepts of interest for the Brazilian history after 1930. Carefully
revised entries by researchers. Few metadata.
The Long Run Project

Joint project between CPDOC and EMAp (Mathematical School);
Enrich the structure (semantics) of CPDOC data;
Open and expose CPDOC’s data and architecture making it more
maintainable and dynamic;
Uniform and integrated data treatment (standards and interlinks
between collections).
NLP of CPDOC’s data

Linking to dbpedia (Presidents of Brazil, presidents of the Senate,
political parties etc)
NLP and text mining of DHBB entries: (1) proper names; (2) word
sense disambiguation using the openWordnet-PT; and (3) named
entity recognition and creation of links between DHBB entries.
133,036 proper names identified (some few mistakes). Potentially
entities (people, locations, organizations etc)
Use grammars, lexical resources, formal ontologies, and logical tools
to reason about knowledge obtained from processing text in
Portuguese: QA, Knowledge Extraction, Computational Semantics
(KB, KR and ATP).
NLP of CPDOC’s data (cont.)
Previous Portuguese Wordnets
WordNet.PT e WordNet.PT Global (P. Marrafa) since 1999, part of
EuroWordNet, 19K expressions, manually curated, online consulting
only, some domains.
MWN.PT - MultiWordnet of Portuguese (A. Branco), since 2008, part
of MWN, over 17,200 manually validated concepts/synsets, not free.
WN.Br (B. Dias da Silva) since 2000, not open, not available online.
REBECA system (LREC 2010) only for “wheeled vehicles” domain,
not clear the diff from Adam1 , based on WN.Pr 2.0. Some names
confusion WordNet.br 2 and TEP 3 .
More recently, Onto.PT 4 .

1

pease2009formal.
http://www.nilc.icmc.usp.br/wordnetbr/
3
http://www.nilc.icmc.usp.br/tep2/
4
http://ontopt.dei.uc.pt
2
OpenWordnet-PT: What?

Leverage EuroWordNet, MultiWordNet, Global WordNet experience.
Recruited Gerard de Melo for project. Leverage YAGO, UWN/Menta
experience. A large-scale multilingual lexical knowledge base built
using statistical methods, transforming WordNet into a massively
multilingual resource.
Portuguese “projection” of UWN/Menta is the basis of automated
version of a OpenWordNet-PT, publicly available.
The basis

Princeton WordNet 3.0 used to obtain English glosses and English
terms for each synset.
The unreleased 2010-12 version UWN and MENTA provided
candidate terms in Portuguese, few candidate glosses in PT (from
Wikipedia), and candidate terms in Spanish.
The EuroWordNet base concept list (5000 bc.xml) provides the base
concept numbers. The core concepts are also considered.
The original file was mapped from WordNet 2.0 to 3.0 using the
mappings from WN-Map. When multiple mappings for a WordNet
2.0 synset existed, all possible WordNet 3.0 synsets were kept.
OpenWordnet-PT: the method
a two-tiered methodology: high precision for the more frequent words
of the language, but also high to cover a wide range of words in the
long tail.
Translation dictionaries to map the English members of a synset to
possible Portuguese translation candidates. To disambiguate and
choose the correct translations, feature vectors for possible
translations are created by computing graph-based statistics in the
graph of words, translations, and synsets. Monolingual wordnets and
parallel corpora used to enrich this graph. Statistical learning
techniques used to iteratively refine this information and build an
output graph connecting Portuguese words to synsets.
Wikipedia pages are then linked to relevant WordNet synsets by
learning from similar graph-based features as well as gloss similarity
scores.
OpenWordnet-PT: the method (cont.)

To have high precision for the most important concepts of a
language, rely on human annotators.
Set of 4689 “Common Base Concepts” from GWA.
2,498 manually entered sense-word pairs as well as an additional
1,299 manually written Portuguese synset glosses. Native speakers,
but not linguists. Plenty of errors.
Results
Good and bad cases: capitalized items, plurals, duplicates (6K words diff
only in upper/lower case), a few gender issues, missing items (true lexical
gaps?) etc. Easy and hard cases.
RDF Representation
Interoperability between wordnets. Linked Data and Semantic Web
standards such as RDF and OWL.
The emergence of Linked Data projects for lexical and reasoning
resources make OpenWN-PT encoded and distributed in RDF/OWL.
Standards allow both data model and data in the same format. Tools
including databases (triple stores) with SQL-like query interfaces
(SPARQL). Schema Free.
Standard W3C encoding of WordNet in RDF since 20065 .
OpenWN-PT is modelled after and fully interoperable with Princeton
WordNet. Our own lisp parser 6 .
Part of a large ecosystem of compatible resources, including domain
identifiers and mappings to Wikipedia.

5
6

wn-rdf.
https://github.com/arademaker/wordnet2rdf
RDF Representation (cont.)
One can easily find Portuguese equivalents for specific English word senses
and vice versa. See http://bit.ly/1aPxd7J.
URIs for name resources

http://arademaker.github.com/wn30/schema/ (instead of
http://purl.org/vocabularies/princeton/wn30/ or
http://www.w3.org/2006/03/wn/wn20/schema/ or
http://wordnet.princeton.edu/wn20/schema/)
http://arademaker.github.com/wn30/instances/
http://arademaker.github.com/wn30-br/instances/
We are still thinking in better and stable URIs!
Progress Report
Checking is much easier than starting from scratch.
But long and tedious work to check even the initial 5k synsets
suggested by GWA (not done, yet!), let alone all synsets in
OpenWN-PT.
Necessary? YES! Lexical gaps of all sorts.
But resource is being used.
Improving the resource: new data from Bond7 and some manual
additions (NOMLEX-BR project).

synsets
words
senses

7

bond-foster:2013:ACL2013.

2011
41,810
52,220
68,285

2013
43,895
54,125
74,054

increase
5%
3%
8%
Synsets missing PT words by type
Synsets missing PT words by lexicographer File
See http://bit.ly/1fm6fUC.
lexFile
adj.ppl
verb.competition
noun.possession
verb.creation
adv.all
.
.
.
noun.phenomenon
noun.feeling
noun.object
noun.location
noun.Tops

total PT
5
100
271
184
979
.
.
.

total Pr
60
459
1061
694
3621
.
.
.

percent
8
22
26
27
27
.
.
.

324
223
908
2096
51

641
428
1545
3209
51

51
52
59
65
100
Use cases: FreeLing8

Word Sense Disambiguation via
FreeLing 3.0 An Open Source
Suite of Language Analyzers.
OpenWN-PT has been
incorporated into FreeLing.
A given Portuguese text can
automatically be annotated with
word senses

8

freeling.
Use Cases: Sentiment Analysis

Sentiment Analysis, using tweets
about 2013 Confederation Coup
games.
OpenWN-PT and SentiWordNet
to compare/develop the
MachineLearning-based
sentiment analysis integrated
into IBM InfoSphere Streams
(ISS) platform.
1 million tweets, 4 friendly
matches Brazilian team in 2013,
7 classes of positivity
IBM Research Brazil Project.
Use cases: Nomlex-BR
Extension of OpenWN-PT aims at incorporating links to connect
deverbal nouns with their corresponding verbs.
We have created over 2,000 entries integrated into OpenWN-PT, will
facilitate their use for linguistic research as well as information
extraction
Incorporating NOMLEX-BR data into OpenWN-PT has shown itself
useful in pinpointing some issues with the coherence and richness of
OpenWN-PT.
the word abasement corresponds in NOMLEX to the verb abase,
and thus we would like a similar correspondence between the
Portuguese noun aviltamento and the verb aviltar (suggested
translations). OpenWN-PT simply has two synsets “humilhar,
abaixar” and “humilhar, rebaixar”. The more common verb humilhar
is repeated, while the uncommon aviltar was left out.
More about Nomlex-BR in the last day of GWC 2014!
Miscellaneous Experiments: adding antonoym relations
OpenWordnet-PT: accuracy
But how good are these entries? How to measure? How to improve?
Following9 , from 6 relations (hypernymOf, memberHolonymOf,
instanceOf, substanceHolonymOf, entails and causes) we randomly
picked 30 pairs of synsets and then random words from each synset.
From 180 sentences, 150 sentences marked as correct (83% of the
sentences), 17 marked as wrong (one of the two words used to fill the
template is probably placed in a wrong synset), and 13 marked as
dubious.
More experiments must be done. E.g. remove trivial pairs with same
words.
Some data mining could help. Synsets with an uncommonly high
number of senses or words with an unexpected number of senses
should be reviewed.
9

cruse1986.
Conclusion

We discussed the implementation and some applications of
OpenWordNet-PT, an open Wordnet for Brazilian Portuguese.
Recent improvements include better coverage and nominalization
links connecting nouns and verbs.
Used in high-throughput commercial system, cultural heritage project,
hopefully more soo.
Freely available from
http://github.com/arademaker/openWordnet-PT/ and a
SPARQL Endpoint at http://logics.emap.fgv.br:10035.
Browsing via Open Multilingual Wordnet is fun.
Next steps
We are developing our own web interface for browsing and
collaborative editing. Most important pending issue!
First finish translating the “core” synsets in the Princeton WordNet
to Portuguese.
Finish to embed Nomlex-BR into OpenWN-PT (anchor floating
words, http://bit.ly/1aQdpkr).
Adding the Portuguese terms that satisfy different relations?
Since we have a first target corpus, DHBB, we can also calculate
word frequency to prioritize expansion of the OpenWN-PT and go
back to the ontology building.
Use and test the accuracy of the resource! More applications!
OpenVerbNet-PT?
FOIS 2014 10 Workshop, “Logics and Ontologies for
non-English NLP”. Website coming soon.
10

http://fois2014.inf.ufes.br/
Thanks! Obrigado!

Más contenido relacionado

La actualidad más candente

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)ThennarasuSakkan
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processingCreditas
 
EDF2012 Mariana Damova - Factforge
EDF2012   Mariana Damova - FactforgeEDF2012   Mariana Damova - Factforge
EDF2012 Mariana Damova - FactforgeEuropean Data Forum
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeRajpootBhatti5
 
An overview of Portuguese WordNets
An overview of Portuguese WordNetsAn overview of Portuguese WordNets
An overview of Portuguese WordNetsAlexandre Rademaker
 
Matching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesMatching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesIJwest
 
Representing Translations on the Semantic Web
Representing Translations on the Semantic WebRepresenting Translations on the Semantic Web
Representing Translations on the Semantic WebOscar Corcho
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetAlexandre Rademaker
 
Ontologies and Semantics for Portuguese
Ontologies and Semantics for PortugueseOntologies and Semantics for Portuguese
Ontologies and Semantics for PortugueseValeria de Paiva
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerCOMRADES project
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingSean Golliher
 
What can corpus software do? Routledge chpt 11
 What can corpus software do? Routledge chpt 11 What can corpus software do? Routledge chpt 11
What can corpus software do? Routledge chpt 11RajpootBhatti5
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpusThennarasuSakkan
 
Open-source machine translation for Icelandic: the Apertium platform as an o...
Open-source machine translation for Icelandic:
 the Apertium platform as an o...Open-source machine translation for Icelandic:
 the Apertium platform as an o...
Open-source machine translation for Icelandic: the Apertium platform as an o...Forcada Mikel
 

La actualidad más candente (20)

11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)11 terms in corpus linguistics1 (1)
11 terms in corpus linguistics1 (1)
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processing
 
EDF2012 Mariana Damova - Factforge
EDF2012   Mariana Damova - FactforgeEDF2012   Mariana Damova - Factforge
EDF2012 Mariana Damova - Factforge
 
Treebank annotation
Treebank annotationTreebank annotation
Treebank annotation
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
An overview of Portuguese WordNets
An overview of Portuguese WordNetsAn overview of Portuguese WordNets
An overview of Portuguese WordNets
 
Matching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesMatching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sources
 
Representing Translations on the Semantic Web
Representing Translations on the Semantic WebRepresenting Translations on the Semantic Web
Representing Translations on the Semantic Web
 
Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNet
 
Ontologies and Semantics for Portuguese
Ontologies and Semantics for PortugueseOntologies and Semantics for Portuguese
Ontologies and Semantics for Portuguese
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source Lemmatizer
 
OWN-PT: Taking Stock
OWN-PT: Taking Stock OWN-PT: Taking Stock
OWN-PT: Taking Stock
 
5 rdfs
5 rdfs5 rdfs
5 rdfs
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document Parsing
 
What can corpus software do? Routledge chpt 11
 What can corpus software do? Routledge chpt 11 What can corpus software do? Routledge chpt 11
What can corpus software do? Routledge chpt 11
 
4 salient features of corpus
4 salient features of corpus4 salient features of corpus
4 salient features of corpus
 
Python NLTK
Python NLTKPython NLTK
Python NLTK
 
Open-source machine translation for Icelandic: the Apertium platform as an o...
Open-source machine translation for Icelandic:
 the Apertium platform as an o...Open-source machine translation for Icelandic:
 the Apertium platform as an o...
Open-source machine translation for Icelandic: the Apertium platform as an o...
 
Ivan Derganskyi
Ivan DerganskyiIvan Derganskyi
Ivan Derganskyi
 
Fact forge20 edf
Fact forge20 edfFact forge20 edf
Fact forge20 edf
 

Similar a OpenWordnet-PT: A Project Report

Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingValeria de Paiva
 
Seeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseSeeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseValeria de Paiva
 
Embedding Nomlex-BR into OpenWN-PT
Embedding Nomlex-BR into OpenWN-PTEmbedding Nomlex-BR into OpenWN-PT
Embedding Nomlex-BR into OpenWN-PTValeria de Paiva
 
FrameNet development for Latvian
FrameNet development for LatvianFrameNet development for Latvian
FrameNet development for LatvianNormunds Grūzītis
 
Enabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the WebEnabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the WebJorge Gracia
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftSebastian Hellmann
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataDave Lewis
 
Introduction to Ontology Concepts and Terminology
Introduction to Ontology Concepts and TerminologyIntroduction to Ontology Concepts and Terminology
Introduction to Ontology Concepts and TerminologySteven Miller
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
 
Finding out Noisy Patterns for Relation Extraction of Bangla Sentences
Finding out Noisy Patterns for Relation Extraction of Bangla SentencesFinding out Noisy Patterns for Relation Extraction of Bangla Sentences
Finding out Noisy Patterns for Relation Extraction of Bangla Sentenceskevig
 
Virtual libraries in research funding agencies an open source approach to dis...
Virtual libraries in research funding agencies an open source approach to dis...Virtual libraries in research funding agencies an open source approach to dis...
Virtual libraries in research funding agencies an open source approach to dis...Guilherme GM
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESijnlc
 
Towards Open Methods: Using Scientific Workflows in Linguistics
Towards Open Methods: Using Scientific Workflows in LinguisticsTowards Open Methods: Using Scientific Workflows in Linguistics
Towards Open Methods: Using Scientific Workflows in LinguisticsRichard Littauer
 
Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...IMPACT Centre of Competence
 
Making topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingMaking topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingLars Marius Garshol
 

Similar a OpenWordnet-PT: A Project Report (20)

Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese Understanding
 
Seeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseSeeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for Portuguese
 
Embedding Nomlex-BR into OpenWN-PT
Embedding Nomlex-BR into OpenWN-PTEmbedding Nomlex-BR into OpenWN-PT
Embedding Nomlex-BR into OpenWN-PT
 
FrameNet development for Latvian
FrameNet development for LatvianFrameNet development for Latvian
FrameNet development for Latvian
 
Enabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the WebEnabling Language Resources to Expose Translations as Linked Data on the Web
Enabling Language Resources to Expose Translations as Linked Data on the Web
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization data
 
4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)
 
Introduction to Ontology Concepts and Terminology
Introduction to Ontology Concepts and TerminologyIntroduction to Ontology Concepts and Terminology
Introduction to Ontology Concepts and Terminology
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
ijcai11
ijcai11ijcai11
ijcai11
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
Finding out Noisy Patterns for Relation Extraction of Bangla Sentences
Finding out Noisy Patterns for Relation Extraction of Bangla SentencesFinding out Noisy Patterns for Relation Extraction of Bangla Sentences
Finding out Noisy Patterns for Relation Extraction of Bangla Sentences
 
Virtual libraries in research funding agencies an open source approach to dis...
Virtual libraries in research funding agencies an open source approach to dis...Virtual libraries in research funding agencies an open source approach to dis...
Virtual libraries in research funding agencies an open source approach to dis...
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
Php packages
Php packagesPhp packages
Php packages
 
Towards Open Methods: Using Scientific Workflows in Linguistics
Towards Open Methods: Using Scientific Workflows in LinguisticsTowards Open Methods: Using Scientific Workflows in Linguistics
Towards Open Methods: Using Scientific Workflows in Linguistics
 
Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...Applying Linked Open Data to a digital library: best practices and lessons le...
Applying Linked Open Data to a digital library: best practices and lessons le...
 
Pargram2011
Pargram2011Pargram2011
Pargram2011
 
Making topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizingMaking topic maps from Subject Headings for linking and organizing
Making topic maps from Subject Headings for linking and organizing
 

Más de Alexandre Rademaker

On the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal LogicOn the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal LogicAlexandre Rademaker
 
Embedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTEmbedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTAlexandre Rademaker
 
A linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archivesA linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archivesAlexandre Rademaker
 
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...Alexandre Rademaker
 
On the proof theory for Description Logics
On the proof theory for Description LogicsOn the proof theory for Description Logics
On the proof theory for Description LogicsAlexandre Rademaker
 
A database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF storesA database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF storesAlexandre Rademaker
 
Intuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal ReasoningIntuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal ReasoningAlexandre Rademaker
 
Is it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQIIs it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQIAlexandre Rademaker
 

Más de Alexandre Rademaker (9)

On the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal LogicOn the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal Logic
 
Embedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTEmbedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PT
 
A linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archivesA linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archives
 
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
 
On the proof theory for Description Logics
On the proof theory for Description LogicsOn the proof theory for Description Logics
On the proof theory for Description Logics
 
A database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF storesA database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF stores
 
Intuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal ReasoningIntuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal Reasoning
 
First Order Logic
First Order LogicFirst Order Logic
First Order Logic
 
Is it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQIIs it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQI
 

Último

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Último (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

OpenWordnet-PT: A Project Report

  • 1. OpenWordnet-PT: A Project Report Alexandre Rademaker1,5 Valeria de Paiva2 Gerard de Melo3 Livy Maria Real Coelho4 Maria Gatti5 FGV/EMAp Nunance Comm. Tsinghua University UFP IBM Research February 2, 2014
  • 2. Why we started openWordnet-PT? We need a Portuguese Wordnet for our work, but none of the previous projects is openly available. Aren’t all wordnets open?
  • 3. Getulio Vargas Foundation (FGV) Brazilian higher education and research institution founded in 1944. It offers regular courses of Economics, Business Administration, Law, Social Sciences and Applied Mathematics. Its original goal was to train people for the country’s public and private-sector management. Considered a top-5 policymaker think-tank worldwide. http://portal.fgv.br
  • 4. CPDOC - Center of Brazilian Contemporary History A major center for teaching and researching in the Social Sciences and Contemporary History located in Rio de Janeiro. It holds: Personal Archives (Acessus) ≈ 200 archives, up to 1,8M docs or 5.2M pages (700K digitalized), among text (handwritten and printed), letters, memos, diaries, images and videos. Oral History Program (PHO) A huge set of testimonies (in audio and video) consisting of more than 2K interviews, which correspond to up to 6K hours of recordings. 90% in digital format. Almost all transcribed. Limit access, not online. Brazilian Historical Biographic Dictionary (DHBB) 7,5K entries, 6,5K are of biographical and 1K related to institutions, events and concepts of interest for the Brazilian history after 1930. Carefully revised entries by researchers. Few metadata.
  • 5. The Long Run Project Joint project between CPDOC and EMAp (Mathematical School); Enrich the structure (semantics) of CPDOC data; Open and expose CPDOC’s data and architecture making it more maintainable and dynamic; Uniform and integrated data treatment (standards and interlinks between collections).
  • 6. NLP of CPDOC’s data Linking to dbpedia (Presidents of Brazil, presidents of the Senate, political parties etc) NLP and text mining of DHBB entries: (1) proper names; (2) word sense disambiguation using the openWordnet-PT; and (3) named entity recognition and creation of links between DHBB entries. 133,036 proper names identified (some few mistakes). Potentially entities (people, locations, organizations etc) Use grammars, lexical resources, formal ontologies, and logical tools to reason about knowledge obtained from processing text in Portuguese: QA, Knowledge Extraction, Computational Semantics (KB, KR and ATP).
  • 7. NLP of CPDOC’s data (cont.)
  • 8. Previous Portuguese Wordnets WordNet.PT e WordNet.PT Global (P. Marrafa) since 1999, part of EuroWordNet, 19K expressions, manually curated, online consulting only, some domains. MWN.PT - MultiWordnet of Portuguese (A. Branco), since 2008, part of MWN, over 17,200 manually validated concepts/synsets, not free. WN.Br (B. Dias da Silva) since 2000, not open, not available online. REBECA system (LREC 2010) only for “wheeled vehicles” domain, not clear the diff from Adam1 , based on WN.Pr 2.0. Some names confusion WordNet.br 2 and TEP 3 . More recently, Onto.PT 4 . 1 pease2009formal. http://www.nilc.icmc.usp.br/wordnetbr/ 3 http://www.nilc.icmc.usp.br/tep2/ 4 http://ontopt.dei.uc.pt 2
  • 9. OpenWordnet-PT: What? Leverage EuroWordNet, MultiWordNet, Global WordNet experience. Recruited Gerard de Melo for project. Leverage YAGO, UWN/Menta experience. A large-scale multilingual lexical knowledge base built using statistical methods, transforming WordNet into a massively multilingual resource. Portuguese “projection” of UWN/Menta is the basis of automated version of a OpenWordNet-PT, publicly available.
  • 10. The basis Princeton WordNet 3.0 used to obtain English glosses and English terms for each synset. The unreleased 2010-12 version UWN and MENTA provided candidate terms in Portuguese, few candidate glosses in PT (from Wikipedia), and candidate terms in Spanish. The EuroWordNet base concept list (5000 bc.xml) provides the base concept numbers. The core concepts are also considered. The original file was mapped from WordNet 2.0 to 3.0 using the mappings from WN-Map. When multiple mappings for a WordNet 2.0 synset existed, all possible WordNet 3.0 synsets were kept.
  • 11. OpenWordnet-PT: the method a two-tiered methodology: high precision for the more frequent words of the language, but also high to cover a wide range of words in the long tail. Translation dictionaries to map the English members of a synset to possible Portuguese translation candidates. To disambiguate and choose the correct translations, feature vectors for possible translations are created by computing graph-based statistics in the graph of words, translations, and synsets. Monolingual wordnets and parallel corpora used to enrich this graph. Statistical learning techniques used to iteratively refine this information and build an output graph connecting Portuguese words to synsets. Wikipedia pages are then linked to relevant WordNet synsets by learning from similar graph-based features as well as gloss similarity scores.
  • 12. OpenWordnet-PT: the method (cont.) To have high precision for the most important concepts of a language, rely on human annotators. Set of 4689 “Common Base Concepts” from GWA. 2,498 manually entered sense-word pairs as well as an additional 1,299 manually written Portuguese synset glosses. Native speakers, but not linguists. Plenty of errors.
  • 13. Results Good and bad cases: capitalized items, plurals, duplicates (6K words diff only in upper/lower case), a few gender issues, missing items (true lexical gaps?) etc. Easy and hard cases.
  • 14. RDF Representation Interoperability between wordnets. Linked Data and Semantic Web standards such as RDF and OWL. The emergence of Linked Data projects for lexical and reasoning resources make OpenWN-PT encoded and distributed in RDF/OWL. Standards allow both data model and data in the same format. Tools including databases (triple stores) with SQL-like query interfaces (SPARQL). Schema Free. Standard W3C encoding of WordNet in RDF since 20065 . OpenWN-PT is modelled after and fully interoperable with Princeton WordNet. Our own lisp parser 6 . Part of a large ecosystem of compatible resources, including domain identifiers and mappings to Wikipedia. 5 6 wn-rdf. https://github.com/arademaker/wordnet2rdf
  • 15. RDF Representation (cont.) One can easily find Portuguese equivalents for specific English word senses and vice versa. See http://bit.ly/1aPxd7J.
  • 16. URIs for name resources http://arademaker.github.com/wn30/schema/ (instead of http://purl.org/vocabularies/princeton/wn30/ or http://www.w3.org/2006/03/wn/wn20/schema/ or http://wordnet.princeton.edu/wn20/schema/) http://arademaker.github.com/wn30/instances/ http://arademaker.github.com/wn30-br/instances/ We are still thinking in better and stable URIs!
  • 17. Progress Report Checking is much easier than starting from scratch. But long and tedious work to check even the initial 5k synsets suggested by GWA (not done, yet!), let alone all synsets in OpenWN-PT. Necessary? YES! Lexical gaps of all sorts. But resource is being used. Improving the resource: new data from Bond7 and some manual additions (NOMLEX-BR project). synsets words senses 7 bond-foster:2013:ACL2013. 2011 41,810 52,220 68,285 2013 43,895 54,125 74,054 increase 5% 3% 8%
  • 18. Synsets missing PT words by type
  • 19. Synsets missing PT words by lexicographer File See http://bit.ly/1fm6fUC. lexFile adj.ppl verb.competition noun.possession verb.creation adv.all . . . noun.phenomenon noun.feeling noun.object noun.location noun.Tops total PT 5 100 271 184 979 . . . total Pr 60 459 1061 694 3621 . . . percent 8 22 26 27 27 . . . 324 223 908 2096 51 641 428 1545 3209 51 51 52 59 65 100
  • 20. Use cases: FreeLing8 Word Sense Disambiguation via FreeLing 3.0 An Open Source Suite of Language Analyzers. OpenWN-PT has been incorporated into FreeLing. A given Portuguese text can automatically be annotated with word senses 8 freeling.
  • 21. Use Cases: Sentiment Analysis Sentiment Analysis, using tweets about 2013 Confederation Coup games. OpenWN-PT and SentiWordNet to compare/develop the MachineLearning-based sentiment analysis integrated into IBM InfoSphere Streams (ISS) platform. 1 million tweets, 4 friendly matches Brazilian team in 2013, 7 classes of positivity IBM Research Brazil Project.
  • 22. Use cases: Nomlex-BR Extension of OpenWN-PT aims at incorporating links to connect deverbal nouns with their corresponding verbs. We have created over 2,000 entries integrated into OpenWN-PT, will facilitate their use for linguistic research as well as information extraction Incorporating NOMLEX-BR data into OpenWN-PT has shown itself useful in pinpointing some issues with the coherence and richness of OpenWN-PT. the word abasement corresponds in NOMLEX to the verb abase, and thus we would like a similar correspondence between the Portuguese noun aviltamento and the verb aviltar (suggested translations). OpenWN-PT simply has two synsets “humilhar, abaixar” and “humilhar, rebaixar”. The more common verb humilhar is repeated, while the uncommon aviltar was left out. More about Nomlex-BR in the last day of GWC 2014!
  • 23. Miscellaneous Experiments: adding antonoym relations
  • 24. OpenWordnet-PT: accuracy But how good are these entries? How to measure? How to improve? Following9 , from 6 relations (hypernymOf, memberHolonymOf, instanceOf, substanceHolonymOf, entails and causes) we randomly picked 30 pairs of synsets and then random words from each synset. From 180 sentences, 150 sentences marked as correct (83% of the sentences), 17 marked as wrong (one of the two words used to fill the template is probably placed in a wrong synset), and 13 marked as dubious. More experiments must be done. E.g. remove trivial pairs with same words. Some data mining could help. Synsets with an uncommonly high number of senses or words with an unexpected number of senses should be reviewed. 9 cruse1986.
  • 25. Conclusion We discussed the implementation and some applications of OpenWordNet-PT, an open Wordnet for Brazilian Portuguese. Recent improvements include better coverage and nominalization links connecting nouns and verbs. Used in high-throughput commercial system, cultural heritage project, hopefully more soo. Freely available from http://github.com/arademaker/openWordnet-PT/ and a SPARQL Endpoint at http://logics.emap.fgv.br:10035. Browsing via Open Multilingual Wordnet is fun.
  • 26. Next steps We are developing our own web interface for browsing and collaborative editing. Most important pending issue! First finish translating the “core” synsets in the Princeton WordNet to Portuguese. Finish to embed Nomlex-BR into OpenWN-PT (anchor floating words, http://bit.ly/1aQdpkr). Adding the Portuguese terms that satisfy different relations? Since we have a first target corpus, DHBB, we can also calculate word frequency to prioritize expansion of the OpenWN-PT and go back to the ontology building. Use and test the accuracy of the resource! More applications! OpenVerbNet-PT? FOIS 2014 10 Workshop, “Logics and Ontologies for non-English NLP”. Website coming soon. 10 http://fois2014.inf.ufes.br/