Neural Semi-supervised Learning under Domain Shift
Sebastian Ruder
Research overview: Transfer learning
‣ Across domains
  (Ruder & Plank, EMNLP 2017, ACL 2018;
  Howard* & Ruder*, ACL 2018)
‣ Across tasks
  (Ruder et al., arXiv 2017;
  Augenstein* & Ruder* et al., NAACL 2018)
‣ Across languages
  (Ruder et al., JAIR 2018;
  Søgaard, Ruder & Vulić, ACL 2018;
  Ruder* & Cotterell* et al., EMNLP 2018;
  Kementchedjhieva, Ruder et al., CoNLL 2018)
* equal contribution
Beware of non-i.i.d. data!
‣ We never know how well our models truly generalise if we just test them on data from the same distribution.
‣ CIFAR-10 classifiers don't even generalise to CIFAR-10 (Recht et al., 2018).
‣ "A challenge to the community: we should evaluate on out-of-distribution data or on a new task."
  - Percy Liang, DeepGen workshop, NAACL-HLT 2018

Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2018). Do CIFAR-10 Classifiers Generalize to CIFAR-10?
Learning under Domain Shift
[Diagram: labeled source data and unlabeled target data, possibly for a different task. How do we select the most relevant data?]
Data setting 1: Multiple source domains
[Diagram: several source domains and a single target domain.]

Ruder, S., & Plank, B. (2017). Learning to select data for transfer learning with Bayesian Optimization. In Proceedings of EMNLP 2017.
Background
Why select data for domain adaptation at all? Why don't we just train on all source data?
‣ To prevent negative transfer between dissimilar domains.
‣ e.g. "electrifying" is positive in one domain, but negative in another.
Existing approaches
‣ use a single similarity metric in isolation;
‣ focus on a single task.
Our approach
Intuition
‣ Different tasks and domains require different notions of similarity.
Idea
‣ Learn a data selection policy using Bayesian Optimisation.
Our approach
‣ A linear selection policy scores every training example x₁, …, xₙ via S = φ(x)⊤w, where φ(x) is a feature vector and w the policy's weights.
‣ Examples are sorted by score and the top m are selected for training (sketched below).
‣ Related: curriculum learning (Tsvetkov et al., 2016)

Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
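As a minimal sketch of this selection step (the feature matrix, weights, and m = 2,000 below are illustrative placeholders, not the paper's exact setup):

```python
import numpy as np

def select_data(features, w, m):
    """Score each source example with the linear policy S = phi(x)^T w,
    sort by score, and keep the indices of the top m examples.

    features: (n, d) array, one row of similarity/diversity features per example
    w:        (d,) weight vector (proposed by Bayesian Optimisation)
    """
    scores = features @ w          # S = phi(x)^T w for every example at once
    order = np.argsort(-scores)    # sort examples by descending score
    return order[:m]

# Illustrative usage: 6,000 source examples, 12 features each, select 2,000
phi = np.random.default_rng(0).random((6000, 12))
w = np.linspace(-1.0, 1.0, num=12)
top_m = select_data(phi, w, m=2000)
```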
Learning the data selection policy
‣ Treat the objective as a black box, which we iteratively approximate.
‣ Use Bayesian Optimisation (BO) to obtain the best parameter setting;
  cf. Fang & Cohn (2017), who use RL to select data for active learning.
‣ Sample-efficient: only needs about 100-200 samples to converge.
‣ BO is typically used for hyper-parameter tuning (Snoek et al., 2012; Melis et al., 2018).
‣ Alternative: learn a latent permutation with the Sinkhorn operator (Adams and Zemel, 2011; Mena et al., 2018).

Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of EMNLP 2017.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of NIPS 2012.
Melis, G., Dyer, C., & Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR 2018.
Mena, G. E., Belanger, D., Linderman, S., & Snoek, J. (2018). Learning Latent Permutations with Gumbel-Sinkhorn Networks. In Proceedings of ICLR 2018.
Optimisation framework
[Flowchart: feature extraction φ(X) → scoring & sorting with the current policy S_t → training the task model M_t on the selected data X_t → evaluation ŷ on the validation set → Bayesian Optimisation proposes the next weights w_{t+1} from w_t.]
Bayesian Optimisation
Two important choices
‣ Surrogate model: used to approximate the objective function; e.g. a Gaussian Process (GP).
‣ Acquisition function u: proposes new samples; trades off exploration vs. exploitation; e.g. Expected Improvement (EI).
Procedure (sketched below)
‣ Sample the next weight vector w_t by optimising the acquisition function over the GP:
  w_t = arg max_w u(w | D_{1:t−1})
‣ Obtain a noisy validation score ŷ_t from the trained model.
‣ Append the sample to D_{1:t} = {D_{1:t−1}, (w_t, ŷ_t)} and update the GP.
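A hedged sketch of this loop, using scikit-optimize's gp_minimize (GP surrogate with EI acquisition) as a stand-in for the paper's BO setup; the task-model training inside the objective is replaced by a synthetic score so the sketch runs end-to-end:

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

rng = np.random.default_rng(0)
phi = rng.random((6000, 12))  # per-example features, as in the sketch above

def objective(w):
    """Black-box objective: select the top examples under policy w, train the
    task model on them, and return the validation error to minimise.
    The training step is a stand-in here (synthetic noisy score)."""
    scores = phi @ np.asarray(w)           # S = phi(x)^T w
    selected = np.argsort(-scores)[:2000]  # would be passed to the task model
    return float(rng.random())             # stand-in for 1 - validation accuracy

result = gp_minimize(
    objective,
    dimensions=[Real(-1.0, 1.0)] * phi.shape[1],  # one weight per feature
    acq_func="EI",   # Expected Improvement balances exploration/exploitation
    n_calls=150,     # the slides report convergence within ~100-200 samples
    random_state=0,
)
best_w = result.x    # weights of the learned data selection policy
```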
Features
‣ Treat each source example and the entire target domain as distributions P and Q based on term and topic probabilities.

Similarity features          Diversity features
Jensen-Shannon divergence    # word types
Rényi divergence             Type-token ratio
Bhattacharyya distance       Entropy
Cosine similarity            Simpson's index
Euclidean distance           Rényi entropy
Variational distance         Quadratic entropy
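For illustration, a small sketch of one similarity feature (Jensen-Shannon divergence between term distributions P and Q) and one diversity feature (type-token ratio); the toy distributions below are made up:

```python
import numpy as np

def kl(p, q):
    """KL divergence for discrete distributions, with 0 log 0 := 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M) with M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def type_token_ratio(tokens):
    """Diversity feature: number of distinct word types / number of tokens."""
    return len(set(tokens)) / len(tokens)

# Toy term distributions: one source example (p) vs. the target domain (q)
p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(jensen_shannon(p, q))
print(type_token_ratio("the film was the best film".split()))
```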
Data & Tasks
Three tasks:
‣ Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
‣ POS tagging and dependency parsing on the SANCL 2012 dataset (Petrov and McDonald, 2012)

Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL 2007.
Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Sentiment Analysis Results
Selecting 2,000 from 6,000 source domain examples.
[Chart: accuracy (%) per target domain (Book, DVD, Electronics, Kitchen) for Random, JS divergence (examples), JS divergence (domain), Similarity (topics), Diversity, Similarity + diversity, and All source data (6,000 examples).]
‣ Selecting relevant data is useful when domains are very different.
POS Tagging Results
Selecting 2,000 from 14-17.5k source domain examples.
[Chart: accuracy (%) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for JS divergence (examples), Similarity (terms), Diversity, Similarity + diversity, and All source data.]
‣ Learned data selection outperforms static selection, but is less useful when domains are very similar.
Dependency Parsing Results
Selecting 2,000 from 14-17.5k source domain examples.
[Chart: labeled attachment score (LAS) per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ) for JS divergence (examples), Similarity (terms), Diversity, Similarity + diversity, and All source data.]
Cross-Model Transfer Results
Training a BiLSTM with the policy learned by a BiLSTM vs. by a Structured Perceptron for POS tagging.
[Chart: accuracy (%) per target domain for the BiLSTM (similarity + diversity) and Structured Perceptron (similarity + diversity) policies.]
‣ The data selection policy can be learned with a cheap model and transferred to more expensive models.
Takeaways
‣ Bayesian Optimisation is an efficient way to optimise an expensive function, e.g. the order of training examples.
‣ Different domains & tasks have different notions of similarity.
‣ Preferring certain examples is mainly useful when domains are dissimilar.
‣ Diversity complements similarity.
‣ The learned policy transfers (to some extent) across models, tasks, and domains.
Learning under Domain Shift
[Diagram: labeled source data and unlabeled target data for the same task. How well does SSL work with NNs?]

Data setting 2: Single source domain
[Diagram: a single source domain and a single target domain.]

Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In Proceedings of ACL 2018.
Learning under Domain Shift
‣ State-of-the-art domain adaptation approaches
  ‣ leverage task-specific features;
  ‣ evaluate on proprietary datasets or on a single benchmark;
  ‣ only compare against weak baselines;
  ‣ almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature.
Revisiting Semi-Supervised Learning Classics in a Neural World
‣ How do classics in SSL compare to recent advances?
‣ Can we combine the best of both worlds?
‣ How well do these approaches work on out-of-distribution data?
Bootstrapping algorithms
• Self-training
• (Co-training*)
• Tri-training
• Tri-training with disagreement
* used in concurrent work: Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of NAACL-HLT 2018.

Self-training
1. Train a model on labeled data.
2. Use confident predictions on unlabeled data as training examples. Repeat.
− Drawback: error amplification.
‣ Mixed success in NLP; some recent success in CV (Radosavovic et al., 2018).

Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni-Supervised Learning. In Proceedings of CVPR 2018.
Self-training variants
‣ Calibration
  ‣ Output probabilities in neural networks are poorly calibrated.
  ‣ Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best (see the sketch below).
‣ Online learning
  ‣ Training until convergence on labeled data and then on unlabeled data works best.
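A minimal self-training sketch with throttling, assuming a scikit-learn-style classifier (fit / predict_proba) and integer class labels; the round count and n are illustrative:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, n=100, rounds=5):
    """Self-training with throttling: each round, train on the current labeled
    set, then move the n most confident unlabeled predictions into it as
    pseudo-labeled examples."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_unlab)
        top = np.argsort(-proba.max(axis=1))[:n]   # throttling: top-n by confidence
        X_train = np.vstack([X_train, X_unlab[top]])
        y_train = np.concatenate([y_train, proba[top].argmax(axis=1)])
        X_unlab = np.delete(X_unlab, top, axis=0)  # pseudo-labeled examples leave the pool
    return model.fit(X_train, y_train)
```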
Tri-training
1. Train three models on bootstrapped samples.
2. If two models agree on the label of an unlabeled example, add it with that label to the third model's training set.
3. Final prediction: majority voting.

Tri-training with disagreement
1. Train three models on bootstrapped samples.
2. If two models agree on the label of an unlabeled example and the third model's own prediction differs, add it to the third model's training set.
− Drawback: requires 3 independent models (see the sketch below).
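A compact sketch of both variants, assuming scikit-learn-style base models and integer class labels (an illustration of the algorithm rather than the paper's exact implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def tri_train(base_model, X_lab, y_lab, X_unlab, rounds=3, disagreement=False):
    """Train three models on bootstrap samples; per round, model i receives an
    unlabeled example as a pseudo-label when the other two models agree on it
    (and, in the disagreement variant, model i itself predicts differently)."""
    models = [clone(base_model).fit(*resample(X_lab, y_lab, random_state=s))
              for s in range(3)]
    for _ in range(rounds):
        preds = [m.predict(X_unlab) for m in models]
        for i in range(3):
            j, k = (x for x in range(3) if x != i)
            agree = preds[j] == preds[k]
            if disagreement:
                agree &= preds[i] != preds[j]  # only where model i disagrees
            Xi = np.vstack([X_lab, X_unlab[agree]])
            yi = np.concatenate([y_lab, preds[j][agree]])
            models[i] = clone(base_model).fit(Xi, yi)

    def predict(X):
        votes = np.stack([m.predict(X) for m in models]).astype(int)
        # final prediction: majority vote over the three models
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return predict
```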
Tri-training hyper-parameters
‣ Sampling unlabeled data
  ‣ Producing predictions for all unlabeled examples is expensive.
  ‣ Instead, sample a number of unlabeled examples each round.
‣ Confidence thresholding
  ‣ Not effective for the classic approaches, but essential for our method.
Multi-task tri-training
1. Train one model with 3 objective functions.
2. If two objectives agree on the label of an unlabeled example, use it as a training example for the third.
3. Restrict the final layers to use different representations.
4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.
Multi-task Tri-training
[Diagram: each word w_i is encoded by a character BiLSTM feeding a word-level BiLSTM (Plank et al., 2016); the shared representation feeds three softmax output layers m1, m2, m3.]
‣ Orthogonality constraint on the output layers (Bousmalis et al., 2016):
  L_orth = ‖W_m1⊤ W_m2‖²_F
‣ Loss:
  L(θ) = − Σ_i Σ_{1,…,n} log P_mi(y | h) + γ L_orth
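A small PyTorch sketch of the orthogonality penalty; the layer sizes and γ below are illustrative, not the paper's values:

```python
import torch

def orthogonality_loss(W_m1, W_m2):
    """L_orth = || W_m1^T W_m2 ||_F^2: penalises overlap between the
    representations used by two of the softmax output layers."""
    return torch.norm(W_m1.t() @ W_m2, p="fro") ** 2

# Illustrative output-layer weights of heads m1 and m2 (hidden dim 128, 17 tags)
W_m1 = torch.randn(128, 17, requires_grad=True)
W_m2 = torch.randn(128, 17, requires_grad=True)

task_loss = torch.tensor(0.0)  # stand-in for -sum_i sum log P_mi(y | h)
gamma = 0.01                   # weight of the orthogonality term
loss = task_loss + gamma * orthogonality_loss(W_m1, W_m2)
loss.backward()
```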
Data & Tasks
Two tasks:
‣ Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007)
‣ POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)
Sentiment Analysis Results
[Chart: accuracy averaged over 4 target domains for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri. * result from Saito et al. (2017)]
‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
POS Tagging Results
Trained on 10% of the labeled data (WSJ).
[Chart: accuracy averaged over 5 target domains for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.]
‣ Tri-training with disagreement works best with little data.
POS Tagging Results
Trained on the full labeled data (WSJ).
[Chart: accuracy averaged over 5 target domains for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri. * result from Schnabel & Schütze (2014)]
‣ Tri-training works best in the full data setting.
POS Tagging Analysis
Accuracy on out-of-vocabulary (OOV) tokens.
[Chart: per-domain OOV rate (%) and OOV accuracy for Src, Tri, and MT-Tri on Answers, Emails, Newsgroups, Reviews, and Weblogs.]
‣ Classic tri-training works best on OOV tokens.
‣ MT-Tri does worse than the source-only baseline on OOV.
POS Tagging Analysis
POS accuracy per binned log frequency.
[Chart: accuracy delta vs. the source-only baseline for Tri and MT-Tri across frequency bins 0-14.]
‣ Tri-training works best on low-frequency tokens (leftmost bins).
POS Tagging Analysis
Accuracy on unknown word-tag (UWT) tokens (very difficult cases).
[Chart: per-domain UWT rate (%) and UWT accuracy for Src, Tri, MT-Tri, and FLORS* on Answers, Emails, Newsgroups, Reviews, and Weblogs. * result from Schnabel & Schütze (2014)]
‣ No bootstrapping method works well on unknown word-tag combinations.
‣ The less lexicalized FLORS approach is superior.
Takeaways
‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis.
‣ We address the drawback of tri-training (space & time complexity) with the proposed MT-Tri model.
‣ MT-Tri works best on sentiment, but not for POS.
‣ Importance of:
  ‣ comparing neural methods to classics (strong baselines);
  ‣ evaluating on multiple tasks & domains.
Learning under Domain Shift
[Diagram: labeled source data for a different task and unlabeled target data. How can we leverage pretrained LMs?]

Data setting 3: Different target task
[Diagram: source domain & source task vs. target domain & target task.]

Howard, J.*, & Ruder, S.* (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. (*: equal contribution)
Transfer learning for NLP: status quo
‣ Best practice: initialise the first layer with pretrained word embeddings.
‣ Recent approaches (McCann et al., 2017; Peters et al., 2018) use pretrained embeddings as fixed features; Peters et al. (2018) is task-specific.
‣ Why not initialise the remaining parameters as well?
‣ Dai and Le (2015) first proposed fine-tuning a LM. However: no pretraining, naive fine-tuning.

McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word Vectors. In Proceedings of NIPS 2017.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018.
Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. In Proceedings of NIPS 2015.
Universal Language Model Fine-tuning (ULMFiT)
A 3-step recipe:
1. Train a language model (LM) on general-domain data.
2. Fine-tune the LM on target data.
3. Train a classifier on labeled data on top of the LM.
Language Model Pretraining
‣ Model: AWD-LSTM language model
  ‣ 3-layer LSTM
  ‣ tuned dropout hyperparameters
‣ Data: WikiText-103
  ‣ 103 million tokens of Wikipedia text
  ‣ trained for ~24 hours on a Tesla V100
‣ Recently: deeper models, trained on more data, for longer (Radford et al., 2018).

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
Language Model Fine-tuning
‣ Discriminative fine-tuning
  Different layers capture different types of information, so they should be fine-tuned to different extents: each layer l gets its own learning rate η^l (sketched below):
  θ^l_t = θ^l_{t−1} − η^l · ∇_{θ^l} J(θ)
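In PyTorch, discriminative fine-tuning can be expressed with per-layer parameter groups; the 3-layer stack below is a toy stand-in for the AWD-LSTM, using the paper's η^{l−1} = η^l / 2.6 decay:

```python
import torch
from torch import nn, optim

# Toy 3-layer stack standing in for the pretrained AWD-LSTM layers
layers = nn.ModuleList([nn.Linear(400, 400) for _ in range(3)])

base_lr, decay = 0.01, 2.6  # ULMFiT: eta^{l-1} = eta^l / 2.6
param_groups = [
    # lower layers capture more general information -> smaller learning rates
    {"params": layer.parameters(),
     "lr": base_lr / decay ** (len(layers) - 1 - l)}
    for l, layer in enumerate(layers)
]
optimizer = optim.SGD(param_groups, lr=base_lr)
```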
‣ Slanted triangular learning rates
  The model should converge quickly to a suitable region and then refine its parameters (schedule sketched below).
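The schedule as a small self-contained function; cut_frac = 0.1 and ratio = 32 follow the paper, while the horizon T is illustrative:

```python
def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate: a short linear increase over the
    first cut_frac of all T updates, then a long linear decay, so training
    moves quickly to a good region and then refines the parameters."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                     # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decaying phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Example: the learning rate for each of 1,000 updates
lrs = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
```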
Classifier Fine-tuning
‣ Concat pooling
  Concatenate pooled representations of the hidden states to capture long document contexts (sketched below):
  h_c = [h_T, maxpool(H), meanpool(H)]
‣ Gradual unfreezing
  Gradually unfreeze the layers, starting from the last layer, to prevent catastrophic forgetting.
‣ Bidirectional language model
  Pretrain both forward and backward LMs and fine-tune them independently.
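Concat pooling is a few lines in PyTorch; the batch, sequence, and hidden sizes below are illustrative:

```python
import torch

def concat_pool(H):
    """h_c = [h_T, maxpool(H), meanpool(H)] for LM hidden states
    H of shape (batch, seq_len, hidden); returns (batch, 3 * hidden)."""
    h_T = H[:, -1]                # hidden state at the last time step
    h_max = H.max(dim=1).values   # max-pooled over time
    h_mean = H.mean(dim=1)        # mean-pooled over time
    return torch.cat([h_T, h_max, h_mean], dim=1)

h_c = concat_pool(torch.randn(8, 50, 400))  # e.g. 8 documents, 50 tokens each
```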
Data & Tasks

Dataset     Type        # classes   # examples
TREC-6      question    6           5.5k
IMDb        sentiment   2           25k
Yelp-bi     sentiment   2           560k
Yelp-full   sentiment   5           650k
AG News     topic       4           120k
DBpedia     topic       14          560k
Results
Previous SOTA vs. ULMFiT, error rate (%):

Dataset     Previous SOTA   ULMFiT
IMDb        5.9             4.6
TREC-6      3.9             3.6
AG News     6.57            5.01
DBpedia     0.84            0.8
Yelp-bi     2.64            2.16
Yelp-full   30.58           29.98

‣ ULMFiT outperforms the state-of-the-art by a significant margin on many of the datasets.
Few-shot Learning
[Charts: error rate (%) on IMDb and AG-News vs. number of training examples (100 to 20k / 108k) for training from scratch, ULMFiT (supervised), and ULMFiT (semi-supervised).]
‣ With 100 labeled examples, ULMFiT matches the performance of training from scratch with 10x and 20x more data.
‣ With 50-100k additional unlabeled examples, it matches the performance of training with 50x and 20x more data.
Takeaways
‣ Proposed a general approach for fine-tuning a pretrained language model.
‣ Proposed new techniques to reduce catastrophic forgetting during fine-tuning.
‣ The approach achieves a new SOTA on 6 text classification tasks.
‣ Very sample-efficient.
Final Takeaways
‣ In order to understand how well our models truly generalise, we need to measure their performance on out-of-distribution data.
‣ It is important to evaluate our models on different domains and tasks.
‣ Using pretrained language models is an effective way of doing transfer / semi-supervised learning (SSL).
‣ This can be complemented by "explicit" SSL; we can take lessons from traditional approaches.
‣ Dealing with stark domain differences is still a challenge and requires ways to explicitly avoid negative transfer.

More Related Content

What's hot

Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
(Hierarchical) topic modeling
(Hierarchical) topic modeling (Hierarchical) topic modeling
(Hierarchical) topic modeling Yueshen Xu
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Daniele Di Mitri
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Roman Stanchak
 
The Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and PlanningThe Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and PlanningYoonho Lee
 
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement LearningEvolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement LearningYoonho Lee
 

What's hot (20)

Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
Lecture20 xing
Lecture20 xingLecture20 xing
Lecture20 xing
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Collaborative DL
Collaborative DLCollaborative DL
Collaborative DL
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
(Hierarchical) topic modeling
(Hierarchical) topic modeling (Hierarchical) topic modeling
(Hierarchical) topic modeling
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008Survey of Generative Clustering Models 2008
Survey of Generative Clustering Models 2008
 
The Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and PlanningThe Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and Planning
 
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement LearningEvolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
 
Making Sense of Word Embeddings
Making Sense of Word EmbeddingsMaking Sense of Word Embeddings
Making Sense of Word Embeddings
 

Similar to Neural Semi-supervised Learning under Domain Shift

Deep learning ensembles loss landscape
Deep learning ensembles loss landscapeDeep learning ensembles loss landscape
Deep learning ensembles loss landscapeDevansh16
 
data_mining_Projectreport
data_mining_Projectreportdata_mining_Projectreport
data_mining_ProjectreportSampath Velaga
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.pptRithikRaj25
 
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deren Lei
 
Learning when to give up: theory, practice and perspectives
Learning when to give up: theory, practice and perspectivesLearning when to give up: theory, practice and perspectives
Learning when to give up: theory, practice and perspectivesGiuseppe (Pino) Di Fabbrizio
 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...広樹 本間
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataPablo Bernabeu
 
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...Antonio Tejero de Pablos
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELijcsit
 
Oversampling technique in student performance classification from engineering...
Oversampling technique in student performance classification from engineering...Oversampling technique in student performance classification from engineering...
Oversampling technique in student performance classification from engineering...IJECEIAES
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Ian Morgan
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Bayes Nets meetup London
 
Qualitative approaches to learning analytics
Qualitative approaches to learning analyticsQualitative approaches to learning analytics
Qualitative approaches to learning analyticsRebecca Ferguson
 
Part 1
Part 1Part 1
Part 1butest
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic rankingFELIX75
 
Pdi conditioning-sum2018-milan-20181004
Pdi conditioning-sum2018-milan-20181004Pdi conditioning-sum2018-milan-20181004
Pdi conditioning-sum2018-milan-20181004University of Twente
 
Es credit scoring_2020
Es credit scoring_2020Es credit scoring_2020
Es credit scoring_2020Eero Siljander
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 

Similar to Neural Semi-supervised Learning under Domain Shift (20)

Deep learning ensembles loss landscape
Deep learning ensembles loss landscapeDeep learning ensembles loss landscape
Deep learning ensembles loss landscape
 
data_mining_Projectreport
data_mining_Projectreportdata_mining_Projectreport
data_mining_Projectreport
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.ppt
 
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
 
Learning when to give up: theory, practice and perspectives
Learning when to give up: theory, practice and perspectivesLearning when to give up: theory, practice and perspectives
Learning when to give up: theory, practice and perspectives
 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open data
 
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...
CVPR2022 paper reading - Balanced multimodal learning - All Japan Computer Vi...
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
 
Oversampling technique in student performance classification from engineering...
Oversampling technique in student performance classification from engineering...Oversampling technique in student performance classification from engineering...
Oversampling technique in student performance classification from engineering...
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
 
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati...
 
Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013
 
Qualitative approaches to learning analytics
Qualitative approaches to learning analyticsQualitative approaches to learning analytics
Qualitative approaches to learning analytics
 
Part 1
Part 1Part 1
Part 1
 
Cluster
ClusterCluster
Cluster
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
 
Pdi conditioning-sum2018-milan-20181004
Pdi conditioning-sum2018-milan-20181004Pdi conditioning-sum2018-milan-20181004
Pdi conditioning-sum2018-milan-20181004
 
Es credit scoring_2020
Es credit scoring_2020Es credit scoring_2020
Es credit scoring_2020
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 

More from Sebastian Ruder

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language ProcessingSebastian Ruder
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionSebastian Ruder
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSebastian Ruder
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoSebastian Ruder
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiSebastian Ruder
 
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimSebastian Ruder
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingSebastian Ruder
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Sebastian Ruder
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSebastian Ruder
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderSebastian Ruder
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoSebastian Ruder
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENSebastian Ruder
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Sebastian Ruder
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Sebastian Ruder
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisSebastian Ruder
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Sebastian Ruder
 

More from Sebastian Ruder (17)

Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary Induction
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
 
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian Ruder
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIEN
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 

Recently uploaded

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlshansessene
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Biological classification of plants with detail
Biological classification of plants with detailBiological classification of plants with detail
Biological classification of plants with detailhaiderbaloch3
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 

Recently uploaded (20)

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girls
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Biological classification of plants with detail
Biological classification of plants with detailBiological classification of plants with detail
Biological classification of plants with detail
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 

Neural Semi-supervised Learning under Domain Shift

  • 2. ‣ Across domains
 (Ruder & Plank, EMNLP 2017, ACL 2018;
 Howard* & Ruder*, ACL 2018) ‣ Across tasks
 (Ruder et al., arXiv 2017;
 (Augenstein* & Ruder* et al., NAACL 2018) ‣ Across languages
 (Ruder et al., JAIR 2018;
 Søgaard, Ruder & Vulic, ACL 2018;
 Ruder* & Cotterell* et al., EMNLP 2018;
 Kementchedjhieva, Ruder et al., CoNLL 2018) 2 Research overview: Transfer learning * equal contribution 12/6/2017 label_embedding_layer.html Label Embedding Layer
  • 3. ‣ Across domains
 (Ruder & Plank, EMNLP 2017, ACL 2018;
 Howard* & Ruder*, ACL 2018) 3 Research overview: Transfer learning * equal contribution Beware of non-i.i.d. data! ‣ We never know how well our models truly generalise if we just test them on data of the same distribution. ‣ CIFAR-10 classifiers don’t even generalise to CIFAR-10 (Recht et al., 2018). ‣ “A challenge to the community: we should evaluate on out-of- distribution data or on a new task.”
 - Percy Liang, DeepGen workshop, NAACL-HLT 2018 Recht, B., Roelofs, R., Schmidt, L., & Berkeley, U. C. (2018). Do CIFAR-10 Classifiers Generalize to CIFAR-10 ?
  • 4. 4 Learning under Domain Shift Labeled source data Different task Un- labeled target data How to select the most relevant data?
  • 5. 5 Data setting 1:
 Multiple source domains Target domain Source domains Ruder, S., & Plank, B. (2017). Learning to select data for transfer learning with Bayesian Optimization. In Proceedings of EMNLP 2017.
  • 6. Why select data for domain adaptation at all? Why don’t we just train on all source data? ‣ Prevent negative transfer for dissimilar domains. ‣ e.g. “electrifying” is positive in , but negative in Existing approaches ‣ use a single similarity metric in isolation; ‣ focus on a single task. 6 Background
  • 7. Intuition ‣ Different tasks and domains require different notions of similarity. Idea ‣ Learn a data selection policy using Bayesian Optimisation. 7 Our approach
  • 8. 8 Our approach x1 x2 xm ⋮ S = ϕ(x)⊤ w Training examples ⋮ Selection policy xn Sorted examples m ‣ Related: curriculum learning (Tsvetkov et al., 2016) Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
  • 9. ‣ Treat objective as a black box, which we iteratively approximate ‣ Use Bayesian Optimisation (BO) to obtain best parameter setting.
 cf. Fang & Cohn (2017) who use RL for selecting data for active learning ‣ Sample-efficient; only need about 100-200 samples to converge. ‣ BO is typically used for hyper-parameter tuning (Snoek et al., 2012; Melis et al., 2018). ‣ Alternative: Learn latent permutation with Sinkhorn operator (Adams and Zemel, 2011; Mena et al., 2018) 9 Learning the data selection policy Fang, M., Li, Y., & Cohn, T. (2017). Learning how to Active Learn: A Deep Reinforcement Learning Approach. In Proceedings of EMNLP 2017. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of NIPS 2012. Melis, G., Dyer, C., & Blunsom, P. (2018). On the State of the Art of Evaluation in Neural Language Models. In Proceedings of ICLR 2018. Mena, G. E., Belanger, D., Linderman, S., & Snoek, J. (2018). Learning Latent Permutations with Gumbel-Sinkhorn Networks. In Proceedings of ICLR 2018.
  • 10. 10 Optimisation framework X Feature
 extraction Bayesian Optimisation Task model training Evaluation on validation set Scoring & 
 sorting with St ̂y Mt Xtϕ(X) wt wt+1
  • 11. Two important choices ‣ Surrogate model: used to approximate objective function; e.g. Gaussian Process (GP) ‣ Acquisition function: propose new samples; trades off exploration vs. exploitation; e.g. Expected Improvement (EI) Procedure: ‣ Sample next weight vector by optimising the acquisition function over the GP: ‣ Obtain noisy validation score from trained model ‣ Append sample to , update GP 11 Bayesian Optimisation wt = arg max w u(w|D1:t−1) wt u ̂yt D1:t = {D1:t−1, (wt, ̂yt)}
  • 12. ‣ Treat each source example and entire target domain as distributions and based on term and topic probabilities 12 Features P Q Similarity feature Diversity feature Jensen-Shannon divergence # word types Rényi divergence Type-token ratio Bhattacharyya distance Entropy Cosine similarity Simpson's index Euclidean distance Rényi entropy Variational distance Quadratic entropy
  • 13. 13 Data & Tasks Three tasks: Domains: Sentiment analysis on Amazon reviews dataset (Blitzer et al., 2007) POS tagging and dependency parsing on SANCL 2012 dataset (Petrov and McDonald, 2012) Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL 2007. Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
  • 14. 14 Sentiment Analysis Results Selecting 2,000 from 6,000 source domain examples Accuracy(%) 62 68 74 80 86 Book DVD Electronics Kitchen Random JS divergence (examples) JS divergence (domain) Similarity (topics) Diversity Similiarity + diversity All source data (6,000 examples) ‣ Selecting relevant data is useful when domains are very different.
  • 15. 15 POS Tagging Results Selecting 2,000 from 14-17.5k source domain examples Accuracy(%) 92 93.25 94.5 95.75 97 Answers Emails Newsgroups Reviews Weblogs WSJ JS divergence (examples) Similarity (terms) Diversity Similiarity + diversity All source data ‣ Learned data selection outperforms static selection, but is less useful when domains are very similar.
  • 16. 16 Dependency Parsing Results Selecting 2,000 from 14-17.5k source domain examples LabeledAttachmentScore(LAS) 80 82.25 84.5 86.75 89 Answers Emails Newsgroups Reviews Weblogs WSJ JS divergence (examples) Similarity (terms) Diversity Similiarity + diversity All source data
  • 17. 17 Cross-Model Transfer Results Training a BiLSTM with the policy learned by a BiLSTM and a Structured Perceptron for POS tagging Accuracy(%) 92 93 94 95 96 Answers Emails Newsgroups Reviews Weblogs WSJ BiLSTM, similarity + diversity Structured Perceptron, similarity + diversity ‣ The data selection policy can be learned with a cheap model and transferred to more expensive models.
  • 18. ‣ Bayesian Optimisation is an efficient way to optimise an expensive function, e.g. order of training examples. ‣ Different domains & tasks have different notions of similarity. ‣ Preferring certain examples is mainly useful when domains are dissimilar. ‣ Diversity complements similarity. ‣ The learned policy transfers (to some extent) across models, tasks, and domains. 18 Takeaways …
  • 19. 19 Learning under Domain Shift Labeled source data Un- labeled target data How well does SSL work with NNs?
  • 20. 20 Data setting 2:
 Single source domain Target domain Source domain Ruder, S., & Plank, B. (2018). Strong Baselines for Neural Semi-supervised Learning under Domain Shift. In Proceedings of ACL 2018.
  • 21. ‣ State-of-the-art domain adaptation approaches ‣ leverage task-specific features ‣ evaluate on proprietary datasets or on a single benchmark ‣ Only compare against weak baselines ‣ Almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature 21 Learning under Domain Shift
  • 22. ‣ How do classics in SSL compare to recent advances? ‣ Can we combine the best of both worlds? ‣ How well do these approaches work on out-of-distribution data? 22 Revisiting Semi-Supervised Learning Classics in a Neural World
  • 23. • Self-training • (Co-training*) • Tri-training • Tri-training with disagreement Bootstrapping algorithms * used in concurrent work: Wu, J., Li, L., & Wang, W. Y. (2018). Reinforced Co-Training. In Proceedings of NAACL-HLT 2018.
  • 24. 1. Train model on labeled data. 2. Use confident predictions on unlabeled data as training examples. Repeat. 24 Self-training - Error amplification ‣ Mixed success in NLP. Some recent success in CV (Radosavic et al., 2018). Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni-Supervised Learning. In Proceedings of CVPR 2018.
  • 25. ‣ Calibration ‣ Output probabilities in neural networks are poorly calibrated. ‣ Throttling (Abney, 2007), i.e. selecting the top highest confidence unlabeled examples works best. ‣ Online learning ‣ Training until convergence on labeled data and then on unlabeled data works best. 25 Self-training variants Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., & He, K. (2018). Data Distillation: Towards Omni- Supervised Learning. In Proceedings of CVPR 2018. n
  • 26. 26 Tri-training 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two models agree. [Diagram: two models both predict y = 1 for x, so (x, y = 1) is added to the third model's training set.]
  • 27. 27 Tri-training 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two models agree. 3. Final prediction: majority voting. [Diagram: the three models predict y = 1, y = 1, and y = 0 for x; the majority vote is y = 1.]
  • 28. 28 Tri-training with disagreement 1. Train three models on bootstrapped samples. 2. Use predictions on unlabeled data as training examples for the third model if the other two models agree and the third model's own prediction differs. [Diagram: two models predict y = 1 for x while the third predicts y = 0, so (x, y = 1) is added to the third model's training set.] - Requires 3 independent models
  • 29. 29 Tri-training hyper-parameters ‣ Sampling unlabeled data ‣ Producing predictions for all unlabeled examples is expensive ‣ Instead, sample a number of unlabeled examples ‣ Confidence thresholding ‣ Not effective for the classic approaches, but essential for our method
  • 30. 30 Multi-task tri-training 1. Train one model with 3 objective functions. 2. Use predictions on unlabeled data as training examples for the third objective function if the other two agree. 3. Restrict the final layers to use different representations. 4. Train the third objective function only on pseudo-labeled data to bridge the domain shift. [Diagram: two output layers both predict y = 1 for x, so (x, y = 1) is added as training data for the third.]
  • 31. 31 Multi-task Tri-training
[Architecture diagram: a word- and character-level BiLSTM encoder feeds three output layers m1, m2, m3, with an orthogonality constraint (Bousmalis et al., 2016) between them.]
Orthogonality constraint: L_orth = ∥W_m1⊤ W_m2∥²_F
Loss (Plank et al., 2016): L(θ) = −∑_{i=1,…,n} ∑ log P_{m_i}(y | h⃗) + γ L_orth
  • 32. 32 Data & Tasks Two tasks and their domains: sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007) and POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012).
  • 33. 33 Sentiment Analysis Results
[Chart: accuracy averaged over 4 target domains for VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri; * results from Saito et al. (2017).]
‣ Multi-task tri-training slightly outperforms tri-training, but has higher variance.
  • 34. 34 POS Tagging Results: trained on 10% labeled data (WSJ)
[Chart: accuracy averaged over 5 target domains for Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.]
‣ Tri-training with disagreement works best with little data.
  • 35. 35 POS Tagging Results: trained on full labeled data (WSJ)
[Chart: accuracy averaged over 5 target domains for TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri; * result from Schnabel & Schütze (2014).]
‣ Tri-training works best in the full data setting.
  • 36. 36 POS Tagging Analysis: accuracy on out-of-vocabulary (OOV) tokens
[Chart: OOV-token accuracy for Src, Tri, and MT-Tri per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs), with each domain's % of OOV tokens overlaid.]
‣ Classic tri-training works best on OOV tokens. ‣ MT-Tri does worse than the source-only baseline on OOV.
  • 37. 37 POS Tagging Analysis: accuracy per binned log frequency
[Chart: accuracy delta vs. the source-only baseline for Tri and MT-Tri across frequency bins 0-14.]
‣ Tri-training works best on low-frequency tokens (leftmost bins).
  • 38. 38 POS Tagging Analysis: accuracy on unknown word-tag (UWT) tokens, very difficult cases
[Chart: UWT-token accuracy for Src, Tri, MT-Tri, and FLORS* per target domain (Answers, Emails, Newsgroups, Reviews, Weblogs), with each domain's UWT rate overlaid; * result from Schnabel & Schütze (2014).]
‣ No bootstrapping method works well on unknown word-tag combinations. ‣ The less lexicalized FLORS approach is superior.
  • 39. 39 Takeaways ‣ Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis. ‣ We address the main drawback of tri-training (space & time complexity) with the proposed MT-Tri model. ‣ MT-Tri works best on sentiment, but not for POS. ‣ Importance of: ‣ comparing neural methods to classics (strong baselines) ‣ evaluating on multiple tasks & domains
  • 40. 40 Learning under Domain Shift Labeled source data Different task Unlabeled target data How can we leverage pretrained LMs?
  • 41. 41 Data setting 3: Different target task
[Diagram: source domain + source task vs. target domain + target task.]
Howard, J.*, & Ruder, S.* (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018. *: equal contribution
  • 42. 42 Transfer learning for NLP: status quo ‣ Best practice: initialise the first layer with pretrained word embeddings. ‣ Recent approaches (McCann et al., 2017; Peters et al., 2018) use pretrained embeddings as fixed features; Peters et al. (2018) is task-specific. ‣ Why not initialise the remaining parameters too? ‣ Dai and Le (2015) first proposed fine-tuning a LM. However: no large-scale pretraining and naive fine-tuning. McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in Translation: Contextualized Word Vectors. In Proceedings of NIPS 2017. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of NAACL-HLT 2018. Dai, A. M., & Le, Q. V. (2015). Semi-supervised Sequence Learning. In Proceedings of NIPS 2015.
  • 43. 43 Universal Language Model Fine-tuning (ULMFiT) 3-step recipe: 1. Train a language model (LM) on general-domain data. 2. Fine-tune the LM on target data. 3. Train a classifier on labeled data on top of the LM.
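A heavily compressed PyTorch sketch of the three stages; it omits the AWD-LSTM regularisation and uses dummy batches in place of WikiText-103 and the target corpus, so it is a schematic of the recipe rather than the released implementation.

```python
# Schematic of the ULMFiT recipe: (1) pretrain LM, (2) fine-tune LM on
# target data, (3) train a classifier head on top. Heavily simplified.
import torch
import torch.nn as nn

class LM(nn.Module):
    def __init__(self, vocab=1000, emb=64, hid=128, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers, batch_first=True)
        self.decoder = nn.Linear(hid, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.decoder(h), h

def train_lm(model, batches, epochs=1):
    opt = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in batches:               # y is x shifted by one token
            logits, _ = model(x)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

# Dummy stand-ins for WikiText-103 and the target corpus.
dummy = [(torch.randint(0, 1000, (8, 35)), torch.randint(0, 1000, (8, 35)))]

lm = LM()
train_lm(lm, dummy)        # 1. general-domain LM pretraining
train_lm(lm, dummy)        # 2. LM fine-tuning on target data
head = nn.Linear(128, 2)   # 3. classifier trained on top of the LM's states
```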
  • 44. 44 Language Model Pretraining ‣ Model: AWD-LSTM language model ‣ 3-layer LSTM ‣ Tuned dropout hyperparameters ‣ Data: WikiText-103 ‣ 103 million tokens of Wikipedia text ‣ Train for ~24 hours on a Tesla V100 ‣ Recently: deeper models, trained on more data, for longer (Radford et al., 2018) Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
  • 45. 45 Language Model Fine-tuning ‣ Discriminative fine-tuning
Different layers capture different types of information. They should be fine-tuned to different extents:
θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ)
‣ Slanted triangular learning rates
The model should converge quickly to a suitable region and then refine its parameters.
  • 46. 46 Classifier Fine-tuning ‣ Concat pooling
Concatenate pooled representations of the hidden states to capture long document contexts:
h_c = [h_T, maxpool(H), meanpool(H)]
‣ Gradual unfreezing
Gradually unfreeze the layers, starting from the last layer, to prevent catastrophic forgetting.
‣ Bidirectional language model
Pretrain both forward and backward LMs and fine-tune them independently.
  • 47. 47 Data & Tasks
Dataset    Type       # classes  # examples
TREC-6     question       6        5.5k
IMDb       sentiment      2        25k
Yelp-bi    sentiment      2        560k
Yelp-full  sentiment      5        650k
AG News    topic          4        120k
DBpedia    topic         14        560k
  • 48. 48 Results: previous SOTA vs. ULMFiT, error rate (%)
Dataset    Previous SOTA  ULMFiT
IMDb            5.9         4.6
TREC-6          3.9         3.6
AG News         6.57        5.01
DBpedia         0.84        0.8
Yelp-bi         2.64        2.16
Yelp-full      30.58       29.98
‣ ULMFiT outperforms the state-of-the-art by a significant margin on many of the datasets.
  • 49. 49 Few-shot Learning
[Chart: error rate (%) on IMDb and AG-News vs. number of labeled training examples (100 to 20k / 108k) for training from scratch, ULMFiT supervised, and ULMFiT semi-supervised.]
‣ With 100 labeled examples, matches the performance of training from scratch with 10x and 20x more data. ‣ With 50-100k additional unlabeled examples, matches the performance of training with 50x and 20x more data.
  • 50. 50 Takeaways ‣ Proposed a general approach for fine-tuning a pretrained language model. ‣ Proposed new techniques to reduce catastrophic forgetting during fine-tuning. ‣ Approach achieves new SOTA on 6 text classification tasks. ‣ Very sample-efficient.
  • 51. 51 Final Takeaways ‣ In order to understand how well our models truly generalise, we need to measure their performance on out-of-distribution data. ‣ It is important to evaluate our models on different domains and tasks. ‣ Using pretrained language models is an effective way of doing transfer / semi-supervised learning (SSL). ‣ Can be complemented by “explicit” SSL. We can take lessons from traditional approaches. ‣ Dealing with stark domain differences is still a challenge and requires ways to explicitly avoid negative transfer.