NLP Research Today …
Models, models, models, models
(preferably deep ones)
1felipe@BUCC 2017
But the User Matters too
• Not much reflected in current evaluation
protocols
• e.g. how useful is a model that identifies the
translation of frequent words with an accuracy
of 40%?
• Smart evaluations do exist though
• (Sharoff et al., 2006) for translating difficult MWEs
TransSearch
•  Users are tolerant to alignment problems
•  We implemented a detector of erroneous alignments but decided
not to integrate it
•  We tested a number of alignment models but used one of the
simplest (IBM model 2 with contiguity constraint)
•  Users care more about translation diversity in the top
candidate list
•  so clustering translations is of better use for a user
•  See (Bourdaillet et al., 2010)
TransType
•  Keystroke saving is a poor measure of the usefulness of
the tool
•  often, users do not watch the screen… (Langlais et al., 2002)
•  Cognitive load for reading completions
•  predicting when a user will find a completion useful is a plus
(Foster et al., 2002)
•  Users do not like repetitive errors
•  adapting online is important (Nepveu et al., 2004)
See (González-Rubio et al., 2012) for advances in
target-text mediated interactive MT
And Data Matters too
• Often the main issue
• once you have data, use your "best hammer"
• e.g.
• domain specific MT
• organizing a dataflow
• Know-how in handling users and data is needed
in practical settings
• but not much rewarded academically
Of course …
• Better models are important !
• e.g. NMT (Sutskever et al., 2014; Cho et al.,
2014; Bahdanau et al., 2014)
• simpler (single model)
• better results
• (Bentivogli et al., 2016; Isabelle et al., 2017)
• good properties for further improvements
• multilingual
• continuously learning
• multi-tasking, etc.
• And have the potential to impact users
Plan
  Data Management (Acquisition and Organization)
•  What has been done
•  What (I feel) is still missing
•  Domain specific bilingual parallel and comparable corpora
•  Assessing the quality/usefulness of a pair of documents
  Parallel Material Extraction from a Comparable
Corpus
•  Document pairs
•  Sentence pairs
•  Word pairs
Data Management Overview
[Figure: data-management pipeline. Stages shown: Web, crawling, dispatching (parallel / quasi-comparable / comparable), sentence alignment, parallel pair extraction, parallel fragment extraction, parallel material, update]
Input
•  seed terms/urls
•  documents of interest (mono or bilingual)
Objective
•  e.g. improving a generic SMT engine
Data Management: What has been done
•  Web2ParallelDoc/Sent
•  BITS (Ma & Liberman, 1999)
•  a sensible pipeline, leveraging lexicons, cognates and a list of webdomains
•  PTMiner (Chen & Nie, 2000), STRAND (Resnik & Smith, 2003)
•  rule-based URL matching, HTML-tag and lexicon-based pairing
•  (Nadeau & Foster, 2004; Fry, 2005)
•  specific to newswire feeds
•  WebMining (Tomas et al., 2005)
•  pipeline which only requires a list of bilingual websites (dictionaries are trained
online)
•  (Fukushima et al., 2006)
•  leveraging a dictionary2graph algorithm, and paying attention to speed issues
•  BABYLON (Mohler and Mihalcea, 2008)
•  low-density language oriented, a pipeline whose input is a source document
•  BITEXTOR (Esplà-Gomis and Forcada, 2010)
•  pipeline of URL- and content-based matching
•  See (Kulkarni, 2012) for a survey
Data Management: What has been done
•  Notable Large-scale Efforts
•  (Callison-Burch et al., 2009)
•  10^9 word parallel corpus crawled from specific URLs, URL-based pairing,
sentence alignment (Moore, 2002) + cleaning: 105M sentence pairs
•  (Uszkoreit et al., 2010)
•  parallel document detection using MT and near-duplicate detection
•  2.5B webpages crawled from the Web + 1.5M public-domain books
•  24 hours on a 2000-node cluster
•  (Ture & Lin, 2012)
•  English-German parallel sentences extracted from Wikipedia tackling the
cross-lingual pairwise similarity problem without heuristics
•  (Smith et al., 2013)
•  Open-source extension of STRAND, impressive deployment on the
CommonCrawl corpus (32TB), using Amazon’s elastic map-reduce
Data Management: Issues
•  No systematic comparison of those systems
•  Not much sensitivity to specific domains
•  Crawling
•  which input? manual set of webdomains / urls / terms ?
•  iterative or not ?
•  concerns: efficiency, coverage, resources expected
•  Dispatching
•  grade a pair of documents (Li and Gaussier, 2010; Fung and Cheung,
2004; Babych and Hartley, 2014)
•  measure of the usefulness of a resource (Barker and Gaizauskas,
2012)
•  Bootstrapping
•  model/lexicon update (Fung and Cheung, 2004)
• Experiments in those directions would lead to :
• a better know-how
•  valuable in practical (industry) settings
• a shared repository of domain specific collections
•  as OPUS does for parallel texts (Tiedemann, 2012)
• (potentially) a reference bilingual collection where
all parallel units are marked
•  would help measure progress
•  would ease reproducibility
Data Management: Issues
About Domain-Specific Comparable Corpora (CC)
•  Domain specific crawling (monolingual)
•  WebBootCat (Baroni and Bernardini, 2004)
•  TerminoWeb (Barrière and Agbago, 2006)
•  Babouk (De Groc, 2011)
•  (Azoulay, 2017) considering only pdf files
•  Specialized (bilingual) comparable corpora acquisition
•  See the ACCURAT project (Pinnis et al., 2012; Aker et al., 2012)
•  METRICC (Alonso et al., 2012)
•  Not just a matter of quantity (Morin et al., 2007)
Plan
  Data Management (Acquisition and Organization)
•  What has been done
•  What (I feel) is still missing
•  Domain specific bilingual parallel and comparable corpora
•  Assessing the quality/usefulness of a pair of documents
  Parallel Material Extraction from a Comparable
Corpus
•  Document pairs
•  Sentence pairs
•  Word pairs
Identifying parallel documents
•  Has received some attention (see previous slides)
•  (Enright and Kondrak, 2010)
•  number of shared hapax words
•  (Uszkoreit et al., 2010)
•  machine translation + monolingual duplicate detection
•  (Patry and Langlais, 2011)
•  light doc. representation + IR + classifier
•  Often just one component of a pipeline which is not
evaluated as such
Paradocs (Patry and Langlais, 2011)
•  Light way of representing a document:
•  sequence of numerical entities
•  sequence of hapax words (of at least 4 symbols)
•  sequence of punctuation marks . ! ? ( ) :
as in (Nadeau and Foster, 2004)
•  Avoid Cartesian product thanks to IR
•  no need for a bilingual lexicon
•  Train a classifier to recognize parallel documents
•  normalized edit-distance between each sequence representation
•  count of each entity (numerals, punctuation marks, hapaxes)
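The light representation and the classifier features listed on this slide can be sketched roughly as follows. This is an illustration only, not the Paradocs code: the tokenizing regular expression, the function names, and the normalization by the longer sequence length are assumptions.

```python
import re
from collections import Counter

PUNCT = set(".!?():")  # the punctuation marks listed on the slide


def represent(text):
    """Return the three light sequences for one document:
    numerical entities, hapax words (>= 4 characters), punctuation marks."""
    # Assumed tokenization: digit runs, letter runs, or one listed punctuation mark.
    tokens = re.findall(r"\d+|[^\W\d_]+|[.!?():]", text)
    counts = Counter(t.lower() for t in tokens)
    numbers = [t for t in tokens if t.isdigit()]
    hapaxes = [t.lower() for t in tokens
               if t.isalpha() and len(t) >= 4 and counts[t.lower()] == 1]
    punct = [t for t in tokens if t in PUNCT]
    return numbers, hapaxes, punct


def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def features(src_text, tgt_text):
    """One normalized distance plus two counts per sequence type."""
    feats = []
    for s, t in zip(represent(src_text), represent(tgt_text)):
        feats += [normalized_edit_distance(s, t), len(s), len(t)]
    return feats
```

The resulting vector would then be fed to whatever classifier is trained to recognize parallel documents.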
Paradocs: Avoiding the Cartesian product
Europarl setting
- 6000 bitexts per language pair
- Dutch2English
•  nodoc: no document returned by Lucene
•  nogood: no good document returned (n=20)
•  >40% failure rate for very short documents (<10 sentences)
•  15% for long documents (>64 sentences)
WMT 2016 Shared Task on
Bilingual Document Alignment
•  Goal: a better understanding of best practices
•  11 teams, 21 systems submitted (+ 1 baseline)
•  Easy entrance task:
•  train: 49 webdomains crawled (ex: virtualhospice.ca)
•  only HTML pages considered
•  text pre-extracted, duplicates removed, language pre-identified (en-fr),
machine translation provided
•  test: 203 other webdomains
•  evaluation:
•  retrieving 2400 pairs of documents known to be parallel (within webdomains)
•  strict rule: a target document should be proposed at most once
RALI’s participation
(Jakubina and Langlais, 2016)
•  motivation: simplicity !
•  no use of MT, possibly not even lexicon-based translation
•  an IR approach (Lucene-based) comparing/combining
•  IR (monolingual)
•  CLIR (based on a bilingual lexicon for « translating »)
•  a simple but efficient URL-based IR (tokenizing URLs)
•  badLuc ended up with a recall of 79.3%
•  the best system recorded 95%
•  the organizers’ baseline 59.8%
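As a rough illustration of the URL-based idea (tokenizing URLs so that they can be indexed and matched like text), here is a minimal sketch. The actual system relies on Lucene; the Jaccard overlap used below is only a stand-in for an IR score, and the function names and example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the path and query of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.path} {parsed.query}"
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", raw) if t]


def overlap(url_a, url_b):
    """Jaccard overlap between the token sets of two URLs (assumed scoring)."""
    a, b = set(url_tokens(url_a)), set(url_tokens(url_b))
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical pair of pages differing only by the language marker.
en = "http://example.org/en/about-us/index.html?id=12"
fr = "http://example.org/fr/about-us/index.html?id=12"
score = overlap(en, fr)
```

Two pages whose URLs differ only by a language code share most of their tokens, which is one plausible reason why such a variant can do well on short documents with little text to match.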
Results on train
Strategy                 @1
text monolingual
  default                 6.4
  default+tok            35.4
  best (w/o length)      64.9
  best (w length)        76.6
text bilingual
  best (w length)        83.3
url
  baseline WMT16         67.9
  best                   80.1
badLuc
  w/o post-treat         88.6
  w post-treat           92.1
•  Playing with metaparameters helps a lot
•  Applying a length filter also helps a lot
•  Involving translation is a must
•  even with a simple lexicon (which
covers ~half of the words in the
collection)
•  Our URL variant is performing
impressively well
•  outperforms the baseline
•  useful on short documents
•  Combining both indexes (text and urls)
helps
•  Post-filtering is a plus
Plan
  Data Management (Acquisition and Organization)
•  What has been done
•  What (I feel) is still missing
•  Domain specific bilingual parallel and comparable corpora
•  Assessing the quality/usefulness of a pair of documents
  Parallel Material Extraction from a Comparable
Corpus
•  Document pairs
•  Sentence pairs
•  Word pairs
MUNT (Bérard, 2014)
A reimplementation of (Munteanu and Marcu, 2005) with
features borrowed from (Smith et al., 2010)
MUNT (Bérard, 2014)
• classifier (logistic regression)
•  31 features
•  length based, alignment-based features, fertility, etc.
• pre-filter (selecting interesting sentence pairs)
•  ratio of sentence length no greater than 2
•  at least 50% of tokens with alignment in the other side
•  removes ~98% of the Cartesian product !!
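The pre-filter can be sketched as below. This is not the MUNT code: the lexicon format (a source word mapped to a set of target words), the function name, and the choice to require 50% coverage in both directions are assumptions made for illustration.

```python
def passes_prefilter(src_tokens, tgt_tokens, lexicon, max_ratio=2.0, min_coverage=0.5):
    """Keep a sentence pair only if the length ratio is at most max_ratio and
    at least min_coverage of the tokens on each side have a lexicon link
    to some token on the other side."""
    if not src_tokens or not tgt_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_ratio:
        return False
    tgt_set = set(tgt_tokens)
    # Source tokens with at least one translation present in the target sentence.
    src_cov = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set) / len(src_tokens)
    # Target tokens that are a translation of at least one source token.
    translated = set().union(*(lexicon.get(w, set()) for w in src_tokens))
    tgt_cov = sum(1 for w in tgt_tokens if w in translated) / len(tgt_tokens)
    return src_cov >= min_coverage and tgt_cov >= min_coverage


# Toy lexicon, for illustration only.
toy_lexicon = {"the": {"le", "la"}, "cat": {"chat"}, "sleeps": {"dort"}}
```

Only pairs that pass this cheap test would be scored by the classifier, which is how most of the Cartesian product can be discarded early.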
MUNT on Wikipedia
•  Done before (Smith et al., 2010)
•  Wikipedia dump (2014)
      #articles  #paired  #sent.  #tokens
en    4.5M       919k     29.3M   630.8M
fr    1.5M       919k     16.8M   354.5M
•  Configuration (default)
-  lexicon trained with GIZA++ on 100k sentence pairs of Europarl
-  classifier trained on 800 positive and 800 negative examples of
news data, threshold: 0.8
MUNT on Wikipedia
Manually evaluated 500 (random) sentence pairs
-  parallel
-  quasi-parallel (at least partial)
-  not parallel
At the cellular level, the nervous system
is defined by the presence of a special
type of cell, called the neuron, also
known as a ”nerve cell” .
À l’échelle cellulaire, le système nerveux est
défini par la présence de cellules hautement
spécialisées appelées neurones, qui ont la
capacité, très particulière, de véhiculer un
signal électrochimique .
MUNT on Wikipedia
-  15 hours (on a cluster of 8 nodes)
-  MUNT detected 2.61M sentence pairs
-  2.26M once duplicates removed
-  1.92M after removing sentences shorter than 4 words
-  64% of sentence pairs are (quasi-)parallel

grade            %
parallel         48
quasi-parallel   16
not parallel     36

We tried something odd:
•  We applied Yasa (Lamraoui and Langlais, 2013) on Wikipedia article
pairs (pretending they were parallel)
•  Asked MUNT to classify the sentence pairs identified
-  11M sentence pairs identified by Yasa
-  1.6M kept as parallel by MUNT
-  1.3M once duplicates removed
-  86% of sentence pairs are (quasi-)parallel
-  much faster!

grade            %
parallel         71
quasi-parallel   15
not parallel     14
MUNT on Wikipedia
•  50 comparable Wiki article
pairs manually aligned at
the sentence level (Rebout
and Langlais, 2014)
•  measured the performance
of YASA and MUNT on
those articles
•  MUNT has a better
precision, but a lower
recall
npara # parallel sentences
nfr (nen) # sentences in fr (en) doc.
BUCC 2017 Shared Task
(Zweigenbaum et al., 2016)
•  Detecting parallel sentences in a large text collection
•  2 sets of monolingual Wikipedia sentences (2014 dumps):
•  1.4M French sentences
•  1.9M English sentences
•  + 17k parallel sentences from News Commentary (v9)
•  Evaluation: precision, recall, F1
•  pros
•  no metadata (text-based)
•  Cartesian product is large (with few positive examples)
•  Smartness in inserting parallel sentences (to avoid simple solutions)
•  cons
•  Artificial task
•  True parallel sentences in Wikipedia EN-FR are not known
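The evaluation measures named on this slide can be computed over sets of sentence-pair identifiers as in the sketch below. The identifiers and the function name are hypothetical, and the official scorer may differ in its details.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of predicted sentence pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (French id, English id) pairs.
predicted_pairs = [("fr1", "en1"), ("fr2", "en9"), ("fr3", "en3"), ("fr4", "en4")]
gold_pairs = [("fr1", "en1"), ("fr3", "en3"), ("fr5", "en5")]
p, r, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
```

With few positive examples in a very large Cartesian product, precision is the measure most easily damaged by a loose decision threshold.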
RALI’s participation
Will be presented this afternoon (Grégoire and Langlais, 2017)
RALI’s participation
•  Training: Europarl v7 French-English
•  first 500K sentence pairs
•  negative sampling: random selection of sentence pairs
•  Test: newstest2012 (out-domain)
•  1000 first sentence pairs + noise
•  Pre-processing
•  maximum sentence length: 80
•  tokenization with Moses’ toolkit, lowercased
•  mapping digits to 0 (e.g. 1982 -> 0000)
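A minimal sketch of this pre-processing is given below. It uses whitespace splitting as a stand-in for the Moses tokenizer, and it assumes that the maximum length is counted in tokens and that longer sentences are discarded; neither detail is stated on the slide.

```python
import re


def preprocess(sentence, max_len=80):
    """Lowercase, split on whitespace (stand-in for Moses tokenization),
    map every digit to 0, and drop sentences longer than max_len tokens."""
    tokens = sentence.lower().split()
    if len(tokens) > max_len:
        return None  # assumed behaviour: over-long sentences are discarded
    return [re.sub(r"\d", "0", t) for t in tokens]


example = preprocess("In 1982 , Prices Rose 3.5 %")
```

Mapping digits this way makes 1982 and 2012 identical to the model, so two sentences that differ only by a number become indistinguishable.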
RALI’s participation
•  Embedding-based filter for avoiding the Cartesian product
•  word-based embeddings computed with BILBOWA (Gouws, 2015)
•  sentence representation = average word embeddings
•  40 first target sentences for each source one
•  Not paying attention to "details" (only the model matters, right?)
•  digit preprocessing
•  random negative sampling at training does not match the testing condition
•  The next slides summarize what we have learnt after our
participation in BUCC 2017 (new material)
model     precision  recall  F1
official  12.1       70.9    20.7
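The embedding-based filter (sentence representation as the average of word embeddings, then keeping the 40 closest target sentences per source sentence) can be sketched as follows. Cosine similarity and the toy two-dimensional vectors are assumptions; in the setting described here the embeddings would come from BILBOWA and live in a shared bilingual space.

```python
import math


def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens (zero vector if none is known)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_candidates(src_vec, tgt_vecs, k=40):
    """Indices of the k target sentences closest to the source sentence."""
    scored = sorted(((cosine(src_vec, v), j) for j, v in enumerate(tgt_vecs)), reverse=True)
    return [j for _, j in scored[:k]]


# Toy bilingual embeddings, for illustration only.
toy_emb = {"cat": [1.0, 0.0], "chat": [0.9, 0.1], "dog": [0.0, 1.0], "chien": [0.1, 0.9]}
src = sentence_vector(["cat"], toy_emb, 2)
targets = [sentence_vector(["chien"], toy_emb, 2), sentence_vector(["chat"], toy_emb, 2)]
best = top_candidates(src, targets, k=1)
```

Only the retained candidates are then passed to the (much more expensive) sentence-pair classifier.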
Influence of the decision threshold
•  Cartesian product
•  1M ex., 1k positive
•  BiRNN trained with 7
negative ex.
•  MUNT trained with a
balanced corpus
Precision Recall F1 ρ
BiRNN 83.0 69.6 75.7 0.99
MUNT 31.8 24.1 27.7 0.99
Influence of the decision threshold
•  Pre-filtering
•  8053 ex., 1k positive
•  BiRNN trained with 7
negative ex.
•  MUNT trained with a
balanced corpus
Precision Recall F1 ρ
BiRNN 91.0 62.4 74.0 0.97
MUNT 73.3 57.0 64.1 0.91
Influence of Post-filtering
•  Each decision taken independently
•  A source sentence may be associated with several target ones, and vice versa
•  post-filtering (greedy algorithm; the Hungarian algorithm is too slow)
huge boost in precision
at a small recall loss for
both approaches
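One common way to implement such a greedy post-filter is sketched below: candidate pairs are visited by decreasing score, and a pair is kept only if neither of its sentences has been used already. The tuple layout and the function name are assumptions, not the code behind the slides.

```python
def greedy_one_to_one(scored_pairs):
    """Greedy one-to-one selection over (score, source_id, target_id) triples."""
    used_src, used_tgt, kept = set(), set(), []
    for score, src, tgt in sorted(scored_pairs, key=lambda triple: triple[0], reverse=True):
        if src in used_src or tgt in used_tgt:
            continue  # one of the two sentences is already paired
        used_src.add(src)
        used_tgt.add(tgt)
        kept.append((src, tgt, score))
    return kept


# Hypothetical classifier scores.
candidates = [(0.9, "s1", "t1"), (0.8, "s1", "t2"), (0.7, "s2", "t1"), (0.6, "s2", "t2")]
selected = greedy_one_to_one(candidates)
```

The cost is dominated by the sort, whereas the Hungarian algorithm solves the optimal assignment in cubic time, which is consistent with the remark that it is too slow at this scale.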
Plan
  Data Management (Acquisition and Organization)
•  What has been done
•  What (I feel) is still missing
•  Domain specific bilingual parallel and comparable corpora
•  Assessing the quality/usefulness of a pair of documents
  Parallel Material Extraction from a Comparable
Corpus
•  Document pairs
•  Sentence pairs
•  Word pairs
Bilingual Lexicon Induction
•  Received a lot of attention
•  Pioneering works: (Rapp, 1995; Fung, 1995)
•  See (Sharoff et al., 2013)
•  Revisited as a way to measure the quality of word
embeddings
•  Seminal work of (Mikolov et al., 2013)
•  Comprehensive comparisons
•  (Levy et al., 2014, 2017; Upadhyay et al., 2016)
Bilingual Lexicon Induction
(Jakubina & Langlais, 2016)
•  We thoroughly revisited those approaches:
•  (Rapp, 1995)
•  (Mikolov et al. 2013)
•  training the projection matrix with the toolkit of (Dinu and Baroni, 2015)
•  (Faruqui and Dyer, 2014)
•  and a few others, but without success
•  investigating their meta-parameters
•  paying attention to the frequency of the terms
•  after (Pekar et al., 2006)
•  showing their complementarity
Experiments
•  Wikipedia (dumps of June 2013)
•  EN: 7.3M token forms (1.2G tokens)
•  FR: 3.6M token forms (330M tokens)
•  Test sets
•  Wiki≤25: English words occurring at most 25 times in Wiki-EN
•  6.8M such tokens (92%)
•  Wiki>25: English words seen more than 25 times in Wiki-EN
•  Euro5-6k: words ranked 5000 to 6000 by frequency in WMT2011

          Frequency
          min   max    avg    cov (%)
Wiki≤25   1     25     10     100.0
Wiki>25   27    19.4k  2.8k   100.0
Euro5-6k  1     2.6M   33.6k  87.3
Metaparameters explored
•  Rapp
•  window size (3,7,15,31)
•  association measure (LLR, discounted odds-ratio)
•  projecting UNK words (yes, no)
•  similarity measure (fixed, cosine similarity)
•  seed lexicon (fixed, in-house 107k entries)
•  Word2Vec
•  skip-gram versus continuous bag-of-words
•  negative sampling (5 or 10) versus hierarchical softmax
•  window size (6,10,20,30)
•  dimension (from 50 up to 250 for SKG and 200 for CBOW)
•  seed lexicon (2k-low, 5-high, 5k-rand)
•  Faruqui & Dyer
•  ratio (0.5, 0.8 and 1.0)
•  fixed dimension (the best one for Word2Vec)
No ad hoc filtering as is usually done, so the Rapp approach is very
demanding in time and memory

                  (Prochasson and Fung, 2011)   this work
document pairs    ~20k                          ~700k
target voc.       128k words (nouns)            3M words

We did apply a few filters:
-  context vectors: 1000 top-ranked words
-  50k first occurrences of a source term
Best variant (per test set)
                       @1    @5    @20
Wiki>25 (ed@1 19.3)
  Rapp                 20.0  33.0  43.0
  Miko                 17.0  32.6  41.6
  Faru                 13.3  26.0  33.3
Wiki≤25 (ed@1 17.6)
  Rapp                  2.6   4.3   7.3
  Miko                  1.6   4.6  10.6
  Faru                  1.6   2.6   5.0
Euro5-6k (ed@1 8.0)
  Rapp                 16.6  31.8  41.2
  Miko                 42.0  59.0  67.8
  Faru                 30.6  47.7  59.8
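The @1, @5 and @20 columns are read here as top-k accuracy: the percentage of test words for which a reference translation appears among the first k candidates. That reading is an assumption, since the slides do not define the measure. A minimal sketch under that assumption, with hypothetical names and toy data:

```python
def accuracy_at_k(candidates, references, k):
    """Percentage of source terms with at least one reference translation
    among their first k ranked candidates."""
    if not references:
        return 0.0
    hits = sum(1 for term, refs in references.items()
               if set(candidates.get(term, [])[:k]) & set(refs))
    return 100.0 * hits / len(references)


# Toy ranked candidate lists and reference translations.
toy_candidates = {"cat": ["chien", "chat"], "dog": ["chien"], "bird": ["poisson"]}
toy_references = {"cat": {"chat"}, "dog": {"chien"}, "bird": {"oiseau"}}
at_1 = accuracy_at_k(toy_candidates, toy_references, 1)
at_2 = accuracy_at_k(toy_candidates, toy_references, 2)
```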
Reranking Candidate Translations
(Jakubina and Langlais, 2016)
•  Reranking has been shown useful in a number of settings
(Delpech et al., 2012; Harastani et al., 2013; Kontonatsios et al., 2014)
•  Trained a reranker (random forest) with RankLib
•  700 terms for training, remaining 300 terms for testing
•  3-fold cross-validation
•  Light features for each pair (s,t):
•  frequency-based features
•  freq of s,t and their difference
•  string-based features
•  length of s and t, their difference, their ratio, edit distance(s,t)
•  rank-based features
•  score and rank of t in the native list
•  number of lists in which t appears (when several n-best lists are considered)
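The light features listed above can be assembled for one candidate pair (s,t) as in the sketch below. The feature names and the dictionary layout are illustrative assumptions; the actual reranker is trained with RankLib, which expects its own input format.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def pair_features(s, t, freq_s, freq_t, score, rank, n_lists=1):
    """Frequency-, string- and rank-based features for a candidate pair."""
    return {
        "freq_s": freq_s,
        "freq_t": freq_t,
        "freq_diff": freq_s - freq_t,
        "len_s": len(s),
        "len_t": len(t),
        "len_diff": len(s) - len(t),
        "len_ratio": len(s) / len(t) if t else 0.0,
        "edit_distance": edit_distance(s, t),
        "score": score,      # score of t in the native n-best list
        "rank": rank,        # rank of t in the native n-best list
        "n_lists": n_lists,  # number of n-best lists containing t
    }


# Hypothetical frequencies and native score.
feats = pair_features("table", "tableau", freq_s=120, freq_t=80, score=0.42, rank=3)
```

None of these features needs a bilingual resource beyond the n-best lists themselves, which is what keeps the reranker light.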
Reranking Individual n-best Lists
            Individual          1-Reranked
            @1    @5    @20     @1    @5    @20
Wiki>25
  Rapp      20.0  33.0  43.0    36.3  48.8  53.8
  Miko      17.0  32.6  41.6    38.1  49.0  54.3
  Faru      13.3  26.0  33.3    34.3  44.0  47.9
Wiki≤25
  Rapp       2.6   4.3   7.3     8.6   9.4  10.2
  Miko       1.6   4.6  10.6    16.6  19.0  20.1
  Faru       1.6   2.6   5.0     7.9   8.7   8.9
Euro5-6k
  Rapp      16.6  31.8  41.2    34.6  48.6  51.9
  Miko      42.0  59.0  67.8    47.0  68.1  73.0
  Faru      30.6  47.7  59.8    41.2  58.0  66.0
Reranking several n-best Lists
            1-Reranked                 n-Reranked
            @1    @5    @20            @1    @5    @20
Wiki>25
  Rapp      36.3  48.8  53.8
  Miko      38.1  49.0  54.3    R+M    43.3  58.4  62.4
  Faru      34.3  44.0  47.9    R+M+F  45.6  59.6  64.0
Wiki≤25
  Rapp       8.6   9.4  10.2
  Miko      16.6  19.0  20.1    R+M    18.9  22.0  23.6
  Faru       7.9   8.7   8.9    R+M+F  21.3  24.4  25.7
Euro5-6k
  Rapp      34.6  48.6  51.9
  Miko      47.0  68.1  73.0    R+M    49.5  68.7  76.1
  Faru      41.2  58.0  66.0    R+M+F  47.6  68.5  76.2
One-Slide Wrap up
• Hypothesis: model-centric research somehow hides the
value of:
•  know-how in managing bilingual data (parallel and
comparable)
•  evaluation protocols involving real users (or proxies)
• Handling data better is part of the game
• We should learn from users
Thank you for your attention

Applicative evaluation of bilingual terminologies
 
Word Formation in English
Word Formation in EnglishWord Formation in English
Word Formation in English
 

Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bilingual Natural Language Processing Research

  • 1.
  • 2. NLP Research Today … Models, models, models, models (preferably deep ones) 1felipe@BUCC 2017
  • 3. But User Matters too • Not much reflected in current evaluation protocols • e.g. how useful is a model that can identify the translation of frequent words at an accuracy of 40% ? • Smart evaluations do exist though • (Sharoff et al., 2006) for translating difficult MWEs 2felipe@BUCC 2017
  • 5. TransSearch •  Users are tolerant to alignment problems •  We implemented a detector of erroneous alignments but decided not to integrate it •  We tested a number of alignment models but used one of the simplest (IBM model 2 with contiguity constraint) •  Users care more about translation diversity in the top candidate list •  so clustering translations is of better use for a user •  See (Bourdaillet et al., 2010) 4felipe@BUCC 2017
  • 7. TransType •  Keystroke saving is a poor measure of the usefulness of the tool •  often, users do not watch the screen… (Langlais et al., 2002) •  Cognitive load for reading completions •  predicting when a user will find a completion useful is a plus (Foster et al., 2002) •  Users do not like repetitive errors •  adapting online is important (Nepveu et al., 2004) See (Gonzales-Rubio et al., 2012) for advances in targeted mediated interactive MT 6felipe@BUCC 2017
  • 8. And Data Matters too • Often the main issue • once you have data, use your ``best hammer’’ • e.g. • domain specific MT • organizing a dataflow • Know-how in handling Users and Data is in need in practical settings • but not much rewarded academically 7felipe@BUCC 2017
  • 9. Of course … • Better models are important! • e.g. NMT (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014) • simpler (single model) • better results • (Bentivogli et al., 2016; Isabelle et al., 2017) • good properties for further improvements • multilingual • continuously learning • multi-tasking, etc. • And have the potential to impact users
  • 10. Plan   Data Management (Acquisition and Organization) •  What has been done •  What (I feel) is still missing •  Domain specific bilingual parallel and comparable corpora •  Assessing the quality/usefulness of a pair of documents   Parallel Material Extraction from a Comparable Corpus •  Document pairs •  Sentence pairs •  Word pairs 9felipe@BUCC 2017
  • 11. Data Management Overview
      [Pipeline diagram; labels: Web, crawling, dispatching (parallel / quasi-comparable / comparable), sentence alignment, parallel pair extraction, parallel fragment extraction, parallel material, update]
      Input • seed terms/urls • documents of interest (mono or bilingual)
      Objective • e.g. improving a generic SMT engine
  • 12. Data Management: What has been done • Web2ParallelDoc/Sent • BITS (Ma & Liberman, 1999) • a sensible pipeline, leveraging lexicons, cognates and a list of webdomains • PTMiner (Chen & Nie, 2000), STRAND (Resnik & Smith, 2003) • rule-based URL matching, HTML-tag and lexicon-based pairing • (Nadeau & Foster, 2004; Fry, 2005) • specific to newswire feeds • WebMining (Tomas et al., 2005) • pipeline which only requires a list of bilingual websites (dictionaries are trained online) • (Fukushima et al., 2006) • leveraging a dictionary2graph algorithm, and paying attention to speed issues • BABYLON (Mohler and Mihalcea, 2008) • low-density language oriented, a pipeline whose input is a source document • BITEXTOR (Espla-Gomis and Forcada, 2010) • pipeline of URL- and content-based matching • See (Kulkarni, 2012) for a survey
  • 13. Data Management: What has been done • Notable Large-scale Efforts • (Callison-Burch et al., 2009) • 10^9 word parallel corpus crawled from specific URLs, URL-based pairing, sentence alignment (Moore, 2002) + cleaning: 105M sentence pairs • (Uszkoreit et al., 2010) • parallel document detection using MT and near-duplicate detection • 2.5B webpages crawled from the Web + 1.5M public-domain books • 24 hours on a 2000-node cluster • (Ture & Lin, 2012) • English-German parallel sentences extracted from Wikipedia, tackling the cross-lingual pairwise similarity problem without heuristics • (Smith et al., 2013) • Open-source extension of STRAND, impressive deployment on the CommonCrawl corpus (32TB), using Amazon’s elastic map-reduce
  • 14. Data Management: Issues • No systematic comparison of those systems • Not much sensitivity to specific domains • Crawling • which input? manual set of webdomains / urls / terms? • iterative or not? • concerns: efficiency, coverage, resources expected • Dispatching • grade a pair of documents (Li and Gaussier, 2010; Fung and Cheung, 2004; Babych and Hartley, 2014) • measure of the usefulness of a resource (Barker and Gaizauskas, 2012) • Bootstrapping • model/lexicon update (Fung and Cheung, 2004)
  • 15. Data Management: Issues • Experiments in those directions would lead to: • better know-how • valuable in practical (industry) settings • a shared repository of domain-specific collections • as OPUS for parallel texts (Tiedemann, 2012) • (potentially) a reference bilingual collection where all parallel units are marked • would help measure progress • would ease reproducibility
  • 16. About Domain Specific CC •  Domain specific crawling (monolingual) •  WebBootCat (Baroni and Bernardini, 2004) •  TerminoWeb (Barrière and Agbago, 2006) •  Babouk (De Groc, 2011) •  (Azoulay, 2017) considering only pdf files •  Specialized (bilingual) comparable corpora acquisition •  See the ACCURAT project (Pinnis et al., 2012; Aker et al., 2012) •  METRICC (Alonso et al., 2012) •  Not just a matter of quantity (Morin et al., 2007) 15felipe@BUCC 2017
  • 17. Plan   Data Management (Acquisition and Organization) •  What has been done •  What (I feel) is still missing •  Domain specific bilingual parallel and comparable corpora •  Assessing the quality/usefulness of a pair of documents   Parallel Material Extraction from a Comparable Corpus •  Document pairs •  Sentence pairs •  Word pairs 16felipe@BUCC 2017
  • 18. Identifying parallel documents •  Has received some attention (see previous slides) •  (Enright and Kondrak, 2010) •  nb of shared hapax words •  (Uszkoreit et al., 2010) •  machine translation + monolingual duplicates detection •  (Patry and Langlais, 2011) •  light doc. representation + IR + classifier •  Often just one component of a pipeline which is not evaluated as such 17felipe@BUCC 2017
  • 19. Paradocs (Patry and Langlais, 2011) •  Light way of representing a document: •  sequence of numerical entities •  sequence of hapax words (of at least 4 symbols) •  sequence of punctuations . ! ? ( ) : as in (Nadeau and Foster 2004) •  Avoid Cartesian product thanks to IR •  no need for a bilingual lexicon •  Train a classifier to recognize parallel documents •  normalized edit-distance between each sequence representation •  count of each entity (numerals, punctuations, hapaxes) 18felipe@BUCC 2017
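The light document representation and the edit-distance feature listed on the Paradocs slide can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tokenizing regular expression, the lower-casing, counting hapaxes within a single document, and the function names `light_representation` and `normalized_edit_distance` are all hypothetical choices.

```python
import re
from collections import Counter

# Punctuation marks kept in the representation (as listed on the slide).
PUNCT = set(".!?():")

def light_representation(text):
    """Return three sequences: numbers, hapax words (>= 4 chars), punctuation.

    Assumption: a hapax is a word occurring exactly once in this document.
    """
    tokens = re.findall(r"\d+|[^\W\d_]+|[.!?():]", text)
    counts = Counter(t.lower() for t in tokens)
    numbers = [t for t in tokens if t.isdigit()]
    hapaxes = [t.lower() for t in tokens
               if t.isalpha() and len(t) >= 4 and counts[t.lower()] == 1]
    puncts = [t for t in tokens if t in PUNCT]
    return numbers, hapaxes, puncts

def normalized_edit_distance(a, b):
    """Levenshtein distance between two sequences, divided by the longer length."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))
```

A classifier could then receive one normalized distance per sequence type (numbers, hapaxes, punctuation) for each candidate document pair, plus the raw counts of each entity.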
  • 20. Paradocs: Avoiding the Cartesian product • Europarl setting, 6000 bitexts per language pair, Dutch2English • nodoc: no document returned by Lucene • nogood: no good document returned (n=20) • >40% of failure for very short documents (<10 sentences) • 15% for long documents (>64 sentences)
  • 21. WMT 2016 Shared Task on Bilingual Document Alignment •  Goal: better understanding best practices •  11 teams, 21 systems submitted (+ 1 baseline) •  Easy entrance task: •  train: 49 webdomains crawled (ex: virtualhospice.ca) •  only HTML pages considered •  text pre-extracted, duplicates removed, language pre-identified (en-fr), machine translation provided •  test: 203 other webdomains •  evaluation: •  retrieving 2400 pairs of documents known parallel (within webdomains) •  strict rule: a target document should be proposed at most once 20felipe@BUCC 2017
  • 22. RALI’s participation (Jakubina and Langlais, 2016) •  motivation: simplicity ! •  no use of MT, eventually even no lexicon-based translation •  an IR approach (Lucene-based) comparing/combining •  IR (monolingual) •  CLIR (based on a bilingual lexicon for « translating ») •  a simple but efficient URL-based IR (tokenizing URLs) •  badLuc ended up with a recall of 79.3% •  the best system recorded 95% •  the organizers’ baseline 59.8% 21felipe@BUCC 2017
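As a rough illustration of what "tokenizing URLs" can mean, the sketch below splits a URL into lowercase alphanumeric tokens and compares two URLs by token overlap. The actual system indexed such tokens with Lucene; the Jaccard score, the function names and the example URLs here are assumptions made only for this sketch.

```python
import re

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def jaccard(a, b):
    """Token-set overlap between two token lists (a stand-in for an IR score)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical URLs of an English page and its French counterpart.
en_url = "http://example.org/en/About-Us.html"
fr_url = "http://example.org/fr/About-Us.html"
score = jaccard(url_tokens(en_url), url_tokens(fr_url))
```

Two URLs that differ only in a language marker share most of their tokens, which is one plausible reason a URL index helps on short documents with little text.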
  • 23. Results on train
        Strategy                              @1
        text, monolingual  default            6.4
                           default+tok       35.4
                           best (w/o length) 64.9
                           best (w length)   76.6
        text, bilingual    best (w length)   83.3
        url                baseline WMT16    67.9
                           best              80.1
        badLuc             w/o post-treat    88.6
                           w post-treat      92.1
      • Playing with metaparameters helps a lot • Applying a length filter also helps a lot • Involving translation is a must • even with a simple lexicon (which covers ~half of the words in the collection) • Our URL variant is performing impressively well • outperforms the baseline • useful on short documents • Combining both indexes (text and urls) helps • Post-filtering is a plus
  • 24. Plan   Data Management (Acquisition and Organization) •  What has been done •  What (I feel) is still missing •  Domain specific bilingual parallel and comparable corpora •  Assessing the quality/usefulness of a pair of documents   Parallel Material Extraction from a Comparable Corpus •  Document pairs •  Sentence pairs •  Word pairs 23felipe@BUCC 2017
  • 25. MUNT (Bérard, 2014) A reimplementation of (Munteanu and Marcu, 2005) with features borrowed from (Smith et al., 2010) 24felipe@BUCC 2017
  • 26. MUNT (Bérard, 2014) • classifier (logistic regression) •  31 features •  length based, alignment-based features, fertility, etc. • pre-filter (selecting interesting sentence pairs) •  ratio of sentence length no greater than 2 •  at least 50% of tokens with alignment in the other side •  removes ~98% of the Cartesian product !! 25felipe@BUCC 2017
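The pre-filter described on the MUNT slide can be sketched as below, assuming a bilingual lexicon stored as a dictionary from source words to sets of target words. Applying the 50% coverage test to both sides, and the name `passes_prefilter`, are assumptions of this sketch rather than details confirmed by the slide.

```python
def passes_prefilter(src, tgt, lexicon, max_ratio=2.0, min_cov=0.5):
    """Keep a sentence pair only if it looks plausibly parallel.

    src, tgt: token lists; lexicon: dict mapping a source word to a set of
    target words (hypothetical format).
    """
    if not src or not tgt:
        return False
    # Length ratio between the longer and the shorter sentence.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_ratio:
        return False
    tgt_set = set(tgt)
    # Share of source tokens with at least one translation in the target.
    src_cov = sum(1 for w in src if lexicon.get(w, set()) & tgt_set) / len(src)
    # Share of target tokens that translate some source token.
    linked = set().union(*(lexicon.get(w, set()) for w in src))
    tgt_cov = sum(1 for w in tgt if w in linked) / len(tgt)
    return src_cov >= min_cov and tgt_cov >= min_cov
```

Because both tests are cheap, they can be run on every candidate pair before the 31-feature classifier, which is what makes discarding most of the Cartesian product worthwhile.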
  • 27. MUNT on Wikipedia • Done before (Smith et al., 2010) • Wikipedia dump (2014)
              #articles  #paired  #sent.  #tokens
        en    4.5M       919k     29.3M   630.8M
        fr    1.5M       919k     16.8M   354.5M
      • Configuration (default) - lexicon trained with GIZA++ on 100k sentence pairs of Europarl - classifier trained on 800 positive and 800 negative examples of news data, threshold: 0.8
  • 28. MUNT on Wikipedia • Manually evaluated 500 (random) sentence pairs - parallel - quasi-parallel (at least partial) - not parallel
      Example pair:
        EN: At the cellular level, the nervous system is defined by the presence of a special type of cell, called the neuron, also known as a "nerve cell".
        FR: À l’échelle cellulaire, le système nerveux est défini par la présence de cellules hautement spécialisées appelées neurones, qui ont la capacité, très particulière, de véhiculer un signal électrochimique.
  • 29. MUNT on Wikipedia
        grade            %
        parallel         71
        quasi-parallel   15
        not parallel     14
      We tried something odd: • We applied Yasa (Lamraoui and Langlais, 2013) on Wikipedia article pairs (pretending they were parallel) • Asked MUNT to classify the sentence pairs identified
        grade            %
        parallel         48
        quasi-parallel   16
        not parallel     36
      - 11M sentence pairs identified by Yasa - 1.6M kept parallel by MUNT - 1.3M once duplicates removed - 86% of sentence pairs are (quasi-)parallel - much faster! - 15 hours (on a cluster of 8 nodes) - MUNT detected 2.61M sentence pairs - 2.26M once duplicates removed - 1.92M after removing sentences shorter than 4 words - 64% of sentence pairs are (quasi-)parallel
  • 30. MUNT on Wikipedia • 50 comparable Wiki article pairs manually aligned at the sentence level (Rebout and Langlais, 2014) • measured the performance of YASA and MUNT on those articles • MUNT has a better precision, but a lower recall
      [Table not recovered; legend: npara = # parallel sentences, nfr (nen) = # sentences in the fr (en) doc.]
  • 31. BUCC 2017 Shared Task (Zweigenbaum et al., 2016) •  Detecting parallel sentences in a large text collection •  2 sets of monolingual Wikipedia sentences (2014 dumps): •  1.4M French sentences •  1.9M English sentences •  + 17k parallel sentences from News Commentary (v9) •  Evaluation: precision, recall, F1 •  pros •  no metadata (text-based) •  Cartesian product is large (with few positive examples) •  Smartness in inserting parallel sentences (to avoid simple solutions) •  cons •  Artificial task •  True parallel sentences in Wikipedia EN-FR are not known 30felipe@BUCC 2017
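The evaluation measures named on the shared-task slide can be computed as in this minimal sketch, assuming system output and gold standard are both sets of (source id, target id) pairs. The pair format and the function name are assumptions, not the official scorer.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of (source_id, target_id) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly proposed pairs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With few true pairs hidden in a very large Cartesian product, proposing many candidates raises recall but quickly degrades precision, so F1 is strongly affected by the decision threshold.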
  • 32. RALI’s participation • Will be presented this afternoon (Grégoire and Langlais, 2017)
  • 33. RALI’s participation • Training: Europarl v7 French-English • first 500K sentence pairs • negative sampling: random selection of sentence pairs • Test: newstest2012 (out-of-domain) • first 1000 sentence pairs + noise • Pre-processing • maximum sentence length: 80 • tokenization with Moses’ toolkit, lowercased • mapping digits to 0 (e.g. 1982 -> 0000)
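The pre-processing steps listed above can be approximated as follows. Whitespace splitting stands in for the Moses tokenizer, and discarding (rather than truncating) sentences longer than 80 tokens is an assumption; the function name is hypothetical.

```python
import re

def preprocess(sentence, max_len=80):
    """Lowercase, map every digit to 0, split on whitespace.

    Returns None for sentences longer than max_len tokens (assumed to be
    discarded; the slide only states a maximum length).
    """
    tokens = re.sub(r"\d", "0", sentence.lower()).split()
    return tokens if len(tokens) <= max_len else None
```

Mapping digits to 0 collapses all numbers of the same length into one token, which reduces vocabulary size but also removes a cue that is often very informative for detecting parallel sentences.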
  • 34. RALI’s participation • Embedding-based filter for avoiding the Cartesian product • word-based embeddings computed with BILBOWA (Gouws, 2015) • sentence representation = average word embeddings • 40 first target sentences for each source one • Not paying attention to "details" (only model matters, right?) • digit preprocessing • random negative sampling at training does not match testing condition • The next slides summarize what we have learnt after our participation in BUCC 2017 (new material)
        model      precision  recall  F1
        official   12.1       70.9    20.7
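The embedding-based filter can be pictured with this small sketch: each sentence is the average of its word vectors, and each source sentence keeps only its k closest target sentences by cosine similarity. It assumes source and target words already live in a shared bilingual space (as BILBOWA embeddings would); the toy vectors, the plain-list representation and the function names are hypothetical.

```python
import math

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens; zero vector if none is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def top_k_candidates(src_vec, tgt_vecs, k=40):
    """Indices of the k target sentences most similar to the source sentence."""
    order = sorted(range(len(tgt_vecs)),
                   key=lambda j: cosine(src_vec, tgt_vecs[j]), reverse=True)
    return order[:k]
```

Only the retained candidates are then passed to the more expensive classifier, so the cost grows with k times the number of source sentences instead of with the full Cartesian product.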
  • 35. Influence of the decision threshold • Cartesian product • 1M ex., 1k positive • BiRNN trained with 7 negative ex. • MUNT trained with a balanced corpus
                 Precision  Recall  F1    ρ
        BiRNN    83.0       69.6    75.7  0.99
        MUNT     31.8       24.1    27.7  0.99
  • 36. Influence of the decision threshold • Pre-filtering • 8053 ex., 1k positive • BiRNN trained with 7 negative ex. • MUNT trained with a balanced corpus
                 Precision  Recall  F1    ρ
        BiRNN    91.0       62.4    74.0  0.97
        MUNT     73.3       57.0    64.1  0.91
  • 37. Influence of Post-filtering • Each decision taken independently • A src sent. may be associated to several target ones, and vice versa • post-filtering (greedy algorithm; the Hungarian algorithm was too slow) • huge boost in precision at a small recall loss for both approaches
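A greedy one-to-one post-filter of the kind mentioned on this slide can be sketched as follows: pairs are visited by decreasing classifier score, and a pair is kept only if neither of its sentences has been used already. The tuple format and the function name are assumptions; the slide does not detail the exact procedure.

```python
def greedy_one_to_one(scored_pairs):
    """Keep the best-scoring pairs so each sentence appears at most once.

    scored_pairs: iterable of (source_id, target_id, score) tuples.
    """
    kept, used_src, used_tgt = [], set(), set()
    for src, tgt, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if src not in used_src and tgt not in used_tgt:
            kept.append((src, tgt, score))
            used_src.add(src)
            used_tgt.add(tgt)
    return kept
```

Unlike the Hungarian algorithm, this does not guarantee the assignment with the highest total score, but it only needs one sort and one pass over the candidate pairs.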
  • 38. Plan   Data Management (Acquisition and Organization) •  What has been done •  What (I feel) is still missing •  Domain specific bilingual parallel and comparable corpora •  Assessing the quality/usefulness of a pair of documents   Parallel Material Extraction from a Comparable Corpus •  Document pairs •  Sentence pairs •  Word pairs 37felipe@BUCC 2017
  • 39. Bilingual Lexicon Induction •  Received a lot of attention •  Pioneering works: (Rapp, 1995; Fung, 1995) •  See (Sharoff et al., 2013) •  Revisited as a way to measure the quality of word embeddings •  Seminal work of (Mikolov et al., 2013) •  Comprehensive comparisons •  (Levy et al., 2014, 2017; Upadhyay et al., 2016) 38felipe@BUCC 2017
  • 40. Bilingual Lexicon Induction (Jakubina & Langlais, 2016) • We thoroughly revisited those approaches: • (Rapp, 1995) • (Mikolov et al., 2013) • training the projection matrix with the toolkit of (Dinu and Baroni, 2015) • (Faruqui and Dyer, 2014) • and a few others, but without success • investigating their meta-parameters • paying attention to the frequency of the terms • after (Pekar et al., 2006) • showing their complementarity
  • 41. Experiments • Wikipedia (dumps of June 2013) • EN: 7.3M token forms (1.2G tokens) • FR: 3.6M token forms (330M tokens) • Test sets • Wiki≤25: English words occurring at most 25 times in Wiki-EN • 6.8M such tokens (92%) • Wiki>25: English words seen more than 25 times in Wiki-EN • Euro5-6k: top frequent 5000 to 6000 words of WMT2011
        Frequency   min   max     avg     cov (%)
        Wiki≤25     1     25      10      100.0
        Wiki>25     27    19.4k   2.8k    100.0
        Euro5-6k    1     2.6M    33.6k   87.3
  • 42. Metaparameters explored • Rapp • window size (3,7,15,31) • association measure (LLR, discontinuous odd-ratios) • projecting UNK words (yes, no) • similarity measure (fixed, cosine similarity) • seed lexicon (fixed, in-house 107k entries) • Word2Vec • skip-gram versus continuous bag-of-word • negative sampling (5 or 10) versus hierarchical softmax • window size (6,10,20,30) • dimension (from 50 up to 250 for SKG and 200 for CBOW) • seed lexicon (2k-low, 5-high, 5k-rand) • Faruqui & Dyer • ratio (0.5, 0.8 and 1.0) • fixed dimension (the best one for Word2Vec)
      No ad hoc filtering as usually done, so very time and memory challenging for the Rapp approach:
                          (Prochasson and Fung, 2011)   this work
        document pairs    ~20k                          ~700k
        target voc.       128k words (nouns)            3M words
      We did apply a few filters: - context vectors: 1000 top-ranked words - 50k first occurrences of a source term
  • 43. Best variant (per test set)
                                      @1     @5     @20
        Wiki>25 (ed@1 19.3)    Rapp   20.0   33.0   43.0
                               Miko   17.0   32.6   41.6
                               Faru   13.3   26.0   33.3
        Wiki≤25 (ed@1 17.6)    Rapp    2.6    4.3    7.3
                               Miko    1.6    4.6   10.6
                               Faru    1.6    2.6    5.0
        Euro5-6k (ed@1 8.0)    Rapp   16.6   31.8   41.2
                               Miko   42.0   59.0   67.8
                               Faru   30.6   47.7   59.8
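The @1, @5 and @20 columns are presumably top-k accuracies: the percentage of test words for which a reference translation appears among the first k candidates. The slide does not define the measure, so the sketch below is an assumed reading, with hypothetical data structures and function name.

```python
def precision_at_k(candidates, reference, k):
    """Percentage of source words with a reference translation in the top k.

    candidates: dict mapping a source word to its ranked candidate list.
    reference: dict mapping a source word to its set of accepted translations.
    """
    if not candidates:
        return 0.0
    hits = sum(1 for word, ranked in candidates.items()
               if set(ranked[:k]) & reference.get(word, set()))
    return 100.0 * hits / len(candidates)
```

Under this reading, @20 can only be greater than or equal to @5, which in turn is at least @1, as is the case in every row of the table.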
  • 44. Reranking Candidate Translations (Jakubina and Langlais, 2016) •  Reranking shown useful in a number of settings (Delpech et al., 2012; Harastani et al., 2013; Kontonatsios et al., 2014) •  Trained a reranker (random forest) with RankLib •  700 terms for training, remaining 300 terms for testing •  3-fold cross-validation •  Light features for each pair (s,t): •  frequency-based features •  freq of s,t and their difference •  string-based features •  length of s and t, their difference, their ratio, edit distance(s,t) •  rank-based features •  score and rank of t in the native list •  number of lists in which t appears (when several n-best lists are considered) 43felipe@BUCC 2017
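The light features listed for each pair (s, t) could be assembled as in the sketch below before being handed to a reranker such as the random forest trained with RankLib. Feature names, the use of absolute differences, and the function names are assumptions; only the feature families come from the slide.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def light_features(s, t, freq_s, freq_t, score, rank, n_lists=1):
    """Frequency-, string- and rank-based features for a candidate pair (s, t)."""
    return {
        # frequency-based features
        "freq_s": freq_s, "freq_t": freq_t, "freq_diff": abs(freq_s - freq_t),
        # string-based features
        "len_s": len(s), "len_t": len(t), "len_diff": abs(len(s) - len(t)),
        "len_ratio": len(s) / len(t) if t else 0.0,
        "edit_distance": edit_distance(s, t),
        # rank-based features
        "score": score, "rank": rank, "n_lists": n_lists,
    }
```

The `n_lists` feature only becomes informative when several n-best lists (for instance from the Rapp, Mikolov and Faruqui variants) are merged, since a candidate proposed by several approaches is more likely to be correct.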
  • 45. Reranking Individual n-best Lists
                          Individual            1-Reranked
                          @1     @5     @20     @1     @5     @20
        Wiki>25    Rapp   20.0   33.0   43.0    36.3   48.8   53.8
                   Miko   17.0   32.6   41.6    38.1   49.0   54.3
                   Faru   13.3   26.0   33.3    34.3   44.0   47.9
        Wiki≤25    Rapp    2.6    4.3    7.3     8.6    9.4   10.2
                   Miko    1.6    4.6   10.6    16.6   19.0   20.1
                   Faru    1.6    2.6    5.0     7.9    8.7    8.9
        Euro5-6k   Rapp   16.6   31.8   41.2    34.6   48.6   51.9
                   Miko   42.0   59.0   67.8    47.0   68.1   73.0
                   Faru   30.6   47.7   59.8    41.2   58.0   66.0
  • 46. Reranking several n-best Lists
                          1-Reranked                    n-Reranked
                          @1     @5     @20             @1     @5     @20
        Wiki>25    Rapp   36.3   48.8   53.8
                   Miko   38.1   49.0   54.3    R+M     43.3   58.4   62.4
                   Faru   34.3   44.0   47.9    R+M+F   45.6   59.6   64.0
        Wiki≤25    Rapp    8.6    9.4   10.2
                   Miko   16.6   19.0   20.1    R+M     18.9   22.0   23.6
                   Faru    7.9    8.7    8.9    R+M+F   21.3   24.4   25.7
        Euro5-6k   Rapp   34.6   48.6   51.9
                   Miko   47.0   68.1   73.0    R+M     49.5   68.7   76.1
                   Faru   41.2   58.0   66.0    R+M+F   47.6   68.5   76.2
  • 47. One-Slide Wrap up • Hyp: Model-centric research somehow hides the value of: • know-how in managing bilingual data (parallel and comparable) • evaluation protocols involving real users (or proxies) • Better handling data is part of the game • We should learn from users
  • 48. Thank you for your attention felipe@BUCC 2017 47