11
Generating Natural-Language
Text with Neural Networks
Jonathan Mugan
@jmugan
22
Computers are Illiterate
Reading requires mapping the words on a page to shared concepts in our culture.
Writing requires mapping those shared concepts into other words on a page.
We currently don’t know how to endow computer systems with a conceptual
system rich enough to represent even what a small child knows.
However, AI researchers have developed a method that can
1. encode representations of our world, such as images or words on a page, into a
space of meanings (proto-read)
2. convert points in that meaning space into language (proto-write)
This method uses neural networks, so we call it neural text generation.
33
Overview of neural text generation
[Diagram: world state → encoder → meaning space → decoders 1…n → generated text]
Neural text generation is useful for
• Machine translation
• Image captioning
• Style manipulation
44
Overview of neural text generation
• Imagine having the text on your website customized
for each user.
• Imagine being able to have your boring HR manual
translated into the style of Cormac McCarthy.
The coffee in the break room is available to all employees,
like some primordial soup handed out to the huddled and
wasted damned.
55
Limited by a lack of common sense
• It lacks precision: if you ask it to make you a reservation on Tuesday, it may make
it on Wednesday because the word vectors are similar (also see [0]).
• It also makes mistakes that no human would make: in image captioning it
can mistake a toothbrush for a baseball bat [1].
[0] A Random Walk Through EMNLP 2017 http://approximatelycorrect.com/2017/09/26/a-random-walk-through-emnlp-2017/
[1] Deep Visual-Semantic Alignments for Generating Image Descriptions http://cs.stanford.edu/people/karpathy/deepimagesent/
The beautiful: as our neural networks get richer, the
meaning space will get closer to being able to
represent the concepts a small child can understand,
and we will get closer to human-level literacy. We are
taking baby steps toward real intelligence.
66
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Conclusion
77
Outline
• Introduction
• Learning from pairs of sentences
• Overview
• Encoders
• Decoders
• Attention
• Uses
• Mapping to and from meaning space
• Adversarial text generation
• Conclusion
88
Sequence-to-sequence models
Both the encoding and decoding are often done using recurrent neural
networks (RNNs).
The initial big application for this was machine translation. For example,
where the source sentences are English and the target sentences are
Spanish.
A sequence-to-sequence (seq2seq) model
1. encodes a sequence of tokens, such as a sentence, into a single vector
2. decodes that vector into another sequence of tokens
99
[Diagram: encoder RNN reading “The woman ate pizza .” and producing hidden states h0 through h4]
As each word is examined, the encoder RNN integrates it
into the hidden state vector h.
Encoding a sentence for machine translation
1010
[Diagram: decoder RNN generating “La mujer comió pizza .” with hidden states h4 through h8]
The model begins with the last hidden state from the
encoder and generates words until a special stop symbol is
generated.
Decoding the sentence into Spanish
Note that the lengths of the source and target sentences do
not need to be the same, because it goes until it generates the
stop symbol.
1111
Generalized view
[Diagram: text → encoder → meaning space (the final hidden state h4) → decoder → generated text]
1212
Outline
• Introduction
• Learning from pairs of sentences
• Overview
• Encoders
• Decoders
• Attention
• Uses
• Mapping to and from meaning space
• Adversarial text generation
• Conclusion
1313
[Diagram: encoder RNN reading “The woman ate pizza .” and producing hidden states h0 through h4]
Let’s look at a simple cell so we can get
a grounded understanding.
Concrete example of an RNN encoder
aardvark 1.32 1.56 0.31 –1.21 0.31
ate 0.36 –0.26 0.31 –1.99 0.11
... –0.69 0.33 0.77 0.22 –1.29
zoology 0.41 0.21 –0.32 0.31 0.22
Vocabulary table
• V: vocabulary table of size 50,000 by 300
• v_t: word vector, the row of V for the current word
• W_h: transition matrix of size 300 by 300
• W_x: input matrix of size 300 by 300
• b is a bias vector of size 300
• h_t = tanh(h_{t-1} W_h + v_t W_x + b)
See Christopher Olah’s blog post for great
presentation of more complex cells like LSTMs
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
(each row is actually 300 dimensions)
Parameters start
off as random
values and are
updated through
backpropagation.
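To make the update rule above concrete, here is a minimal numpy sketch of the encoder cell, assuming toy sizes and random, untrained parameters; the word ids for the example sentence are hypothetical.

```python
import numpy as np

# Toy sizes; a real model uses a 50,000 x 300 vocabulary table and
# 300-dimensional hidden states.
np.random.seed(0)
vocab_size, embed_dim, hidden_dim = 10, 8, 8

V   = np.random.randn(vocab_size, embed_dim) * 0.1   # vocabulary table
W_h = np.random.randn(hidden_dim, hidden_dim) * 0.1  # transition matrix
W_x = np.random.randn(embed_dim, hidden_dim) * 0.1   # input matrix
b   = np.zeros(hidden_dim)                           # bias vector

def encode(word_ids):
    """Run the RNN over a sentence and return the final hidden state."""
    h = np.zeros(hidden_dim)
    for t in word_ids:                 # h_t = tanh(h_{t-1} W_h + v_t W_x + b)
        v_t = V[t]
        h = np.tanh(h @ W_h + v_t @ W_x + b)
    return h

sentence = [3, 1, 7, 2, 0]             # hypothetical ids for "The woman ate pizza ."
h_final = encode(sentence)             # this vector stands for the sentence's meaning
```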
1414
[Diagram: the example sentence “Tom ate with his brother” as a matrix of word vectors (one 300-dimensional row per word), alongside an example convolution filter three words wide (size 300 by 3)]
• 128 filters of width 3, 4, and 5
• Total of 384 filters
• 2) Slide each filter over the
sentence and element-
wise multiply and add (dot
product); then take the
max of the values seen.
• Result is a sentence vector
of length 384 that
represents the sentence.
• Analogous to the final hidden state h4 of the RNN encoder
• For classification: each class has a vector of
weights of length 384
• 3) To decide which class, the class vector is
multiplied element wise by the sentence
vector and summed (dot product)
• The class with the highest sum wins
aardvark 1.32 1.56 0.31 –1.21 0.31
ate 0.36 –0.26 0.31 –1.99 0.11
... –0.69 0.33 0.77 0.22 –1.29
zoology 0.41 0.21 –0.32 0.31 0.22
Vocabulary table
1) For each word in the sentence, get its
word vector from the vocabulary table.
Convolutional Neural Network (CNN) Text Encoder
(each row is actually 300 dimensions)
• See the code here by Denny Britz https://github.com/dennybritz/cnn-text-classification-tf
• README.md contains reference to his blog post and the paper by Y. Kim
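As a companion to the description above, here is a minimal numpy sketch of this kind of CNN text encoder, assuming random, untrained filters and toy class weights; a real model (such as Denny Britz’s TensorFlow code) learns these by backpropagation.

```python
import numpy as np

np.random.seed(0)
embed_dim = 300
widths, filters_per_width = [3, 4, 5], 128  # 128 filters of width 3, 4, and 5

filters = {w: np.random.randn(filters_per_width, w, embed_dim) * 0.01 for w in widths}

def encode(word_vectors):
    """word_vectors: (sentence_length, embed_dim) -> sentence vector of length 384."""
    features = []
    for w, bank in filters.items():
        for f in bank:                                       # f has shape (w, embed_dim)
            # Slide the filter over the sentence: dot product at each position,
            # then keep the max response.
            responses = [np.sum(f * word_vectors[i:i + w])
                         for i in range(len(word_vectors) - w + 1)]
            features.append(max(responses))
    return np.array(features)                                # length 128 * 3 = 384

sentence = np.random.randn(5, embed_dim)                     # stand-in for "Tom ate with his brother"
h = encode(sentence)

# Classification: one weight vector of length 384 per class; highest dot product wins.
class_weights = np.random.randn(2, 384) * 0.01
predicted_class = int(np.argmax(class_weights @ h))
```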
1515
Outline
• Introduction
• Learning from pairs of sentences
• Overview
• Encoders
• Decoders
• Attention
• Uses
• Mapping to and from meaning space
• Adversarial text generation
• Conclusion
1616
Concrete example of RNN decoder
• V: vocabulary table of size 50,000 by 300
• v_{t-1}: word vector row from V; v_5 is “mujer”
• W'_h: the decoder's transition matrix of size 300 by 300
• W'_x: the decoder's input matrix of size 300 by 300
• b is a bias vector of size 300
• h_t = tanh(h_{t-1} W'_h + v_{t-1} W'_x + b)
[Diagram: decoder RNN generating “La mujer comió pizza .” with hidden states h4 through h8; h6 is highlighted]
• W'_o: output matrix of size 300 by 50,000
• A probability distribution over next words: p(words) = softmax(h_t W'_o)
• The softmax function makes them all sum to 1
• softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
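Here is a minimal numpy sketch of one decoder step based on the equations above, assuming toy sizes, random untrained parameters, and hypothetical word ids.

```python
import numpy as np

np.random.seed(0)
vocab_size, dim = 10, 8
V   = np.random.randn(vocab_size, dim) * 0.1   # vocabulary table
W_h = np.random.randn(dim, dim) * 0.1          # decoder transition matrix W'_h
W_x = np.random.randn(dim, dim) * 0.1          # decoder input matrix W'_x
W_o = np.random.randn(dim, vocab_size) * 0.1   # output matrix W'_o
b   = np.zeros(dim)

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract max for numerical stability
    return e / e.sum()

def decode_step(h_prev, prev_word_id):
    """h_t = tanh(h_{t-1} W'_h + v_{t-1} W'_x + b); p(words) = softmax(h_t W'_o)."""
    v_prev = V[prev_word_id]
    h_t = np.tanh(h_prev @ W_h + v_prev @ W_x + b)
    p_words = softmax(h_t @ W_o)
    return h_t, p_words

h = np.zeros(dim)                              # would be the encoder's final state h4
h, p = decode_step(h, prev_word_id=5)          # 5 standing in for "mujer"
next_word_id = int(np.argmax(p))               # greedy choice of the next word
```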
1717
Models are trained with paired sentences to minimize cost
At time 𝑡 = 6 we know the correct
word is “comió”, so the cost is
𝑐𝑜𝑠𝑡 = −log(𝑝(comió))
So the more likely “comió” is, the
lower the cost.
Also, note that regardless of what the model
says at time 𝑡 − 1, we feed in “mujer” at time 𝑡.
This is called teacher forcing.
[Diagram: decoder RNN generating “La mujer comió pizza .” with hidden states h4 through h8]
• W'_o: output matrix of size 300 by 50,000
• A probability distribution over next words: p(words) = softmax(h_t W'_o)
• The softmax function makes them all sum to 1
• softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
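A minimal sketch of the training cost with teacher forcing, reusing decode_step from the decoder sketch above; START_ID and the target word ids are hypothetical.

```python
import numpy as np

START_ID = 0   # hypothetical start-of-sentence token id

def sentence_cost(h_encoder_final, target_ids):
    """Sum of -log p(correct word) over the target sentence, with teacher forcing."""
    h, cost = h_encoder_final, 0.0
    prev_id = START_ID
    for true_id in target_ids:                   # e.g. ids for "La mujer comió pizza ."
        h, p_words = decode_step(h, prev_id)     # decode_step from the sketch above
        cost += -np.log(p_words[true_id] + 1e-12)
        prev_id = true_id                        # teacher forcing: feed in the true word
    return cost

# Gradients of this cost with respect to V, W_h, W_x, W_o, and b are computed by
# backpropagation (autodiff in a framework) and used to update the parameters.
```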
1818
When it comes time to generate text
When you are actually using the model to generate text, you don’t know what
the right answer is.
You could do a greedy decoder
• Take the most likely first word, then given that word, take the most likely next word, and so on.
• But that could lead you down a suboptimal path (too cheap).
You could look at all sequences of words up to a given length
• Choose the sequence with the lowest cost
• But there are far too many (too expensive).
A compromise is beam search
• Start generating with the first word
• Keep the k (around 10) most promising sequences
[Diagram: decoder RNN generating “La mujer comió pizza .” with hidden states h4 through h8]
Beam search can sometimes lack diversity, and there have been proposals to address this;
see Li and Jurafsky, https://arxiv.org/pdf/1601.00372v2.pdf and Ashwin et al., https://arxiv.org/pdf/1610.02424v1.pdf
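A minimal beam-search sketch, assuming a stand-in next_word_probs function in place of the real decoder and a <stop> symbol.

```python
import math

STOP = "<stop>"

def beam_search(next_word_probs, k=10, max_len=20):
    """next_word_probs(prefix) returns {word: probability} for the next word."""
    beams = [([], 0.0)]                           # (word sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for word, p in next_word_probs(seq).items():
                new_seq, new_logp = seq + [word], logp + math.log(p)
                if word == STOP:
                    finished.append((new_seq, new_logp))
                else:
                    candidates.append((new_seq, new_logp))
        if not candidates:
            break
        # Keep only the k most promising unfinished sequences.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:k]
    best = max(finished + beams, key=lambda x: x[1])
    return best[0]

def example_probs(seq):
    # A fixed toy distribution; a real model would run the decoder here.
    return {"la": 0.5, "mujer": 0.3, STOP: 0.2}

print(beam_search(example_probs, k=3, max_len=5))
```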
1919
Outline
• Introduction
• Learning from pairs of sentences
• Overview
• Encoders
• Decoders
• Attention
• Uses
• Mapping to and from meaning space
• Adversarial text generation
• Conclusion
2020
The network learns how to determine which previous memory is
most relevant at each decision point [Bahdanau et al., 2014].
[Diagram: encoder RNN reading “The woman ate tacos .” (h0 through h4) and decoder RNN generating “La mujer comió tacos .” (h4 through h8), with attention connecting each decoder step back to the encoder states]
Attention helps the model consider more information
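A minimal sketch of attention as a weighted average of encoder states, using plain dot-product scores as a simplified stand-in for the small scoring network of Bahdanau et al.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """encoder_states: (num_words, dim); decoder_state: (dim,)."""
    scores = encoder_states @ decoder_state      # one relevance score per source word
    weights = softmax(scores)                    # attention weights sum to 1
    return weights @ encoder_states              # weighted average of encoder states

np.random.seed(0)
encoder_states = np.random.randn(5, 8)           # h0..h4 for "The woman ate tacos ."
decoder_state = np.random.randn(8)
context = attention_context(decoder_state, encoder_states)
# The context vector is combined with the decoder state when predicting the next word.
```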
2121
RNN seq2seq model with attention
[Diagram: text → encoder → meaning space (h4) → decoder → generated text, with attention linking the decoder back to the encoder states]
2222
Outline
• Introduction
• Learning from pairs of sentences
• Overview
• Encoders
• Decoders
• Attention
• Uses
• Mapping to and from meaning space
• Adversarial text generation
• Conclusion
2323
Seq2seq works on all kinds of sentence pairs
For example, take a bunch of dialogs and use machine learning
to predict what the next statement will be given the last
statement.
• Movie and TV subtitles
• OpenSubtitles http://opus.lingfil.uu.se/OpenSubtitles.php
• Can mine Twitter
• Look for replies to tweets using the API
• Ubuntu Dialog Corpus
• Dialogs of people wanting technical support
https://arxiv.org/abs/1506.08909
Another example, text summarization. E.g., learn to generate
headlines for news articles. See
https://memkite.com/deeplearningkit/2016/04/23/deep-learning-for-text-summarization/
2424
Style and more
[0] Seq2seq style transfer: Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus,
Benchmarks and Metrics for Formality Style Transfer https://arxiv.org/pdf/1803.06535.pdf
• You can even do style transfer [0].
• They had Mechanical Turk workers write different styles of
sentences for training.
• We need to get them on the Cormac McCarthy task.
2525
[Diagram: a convolutional neural network encodes the image into h4, and an RNN decoder generates the caption “Weird dude on table .” with hidden states h4 through h8]
Karpathy and Fei-Fei, 2015
http://cs.stanford.edu/people/karpathy/deepimagesent/
Vinyals et al., 2014
http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
Can even automatically caption images
Chavo protects my office from Evil spirits.
Seq2seq takeaway: works well if you
have lots of pairs of things to train on.
2626
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Meaning space should be smooth
• Variational methods ensure smoothness
• Navigating the meaning space
• Moving beyond pairs of sentences
• Conclusion
2727
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Meaning space should be smooth
• Variational methods ensure smoothness
• Navigating the meaning space
• Moving beyond pairs of sentences
• Conclusion
2828
Meaning Space Should be Smooth
Meaning space: mapping of concepts and ideas
to points in a continuous, high-dimensional
grid. We want:
• Semantically meaningful things to be near
each other.
• To be able to interpolate between sentences.
• Could be the last state of the RNN or the output of the CNN
• But we want to use the space as a representation of meaning.
• We can get more continuity in the space when we use a bias to encourage the neural
network to represent the space compactly [0].
[Diagram: sentences such as “This is an outrage!”, “I love tacos”, “tacos are me”, and “we are all tacos” as points in the meaning space]
[0] Generating Sentences from a Continuous Space, https://arxiv.org/abs/1511.06349
2929
Meaning Space Should be Smooth
• Get bias by adding prior probability
distribution as organizational constraint.
• Prior wants to put all points at (0,0,...0)
unless there is a good reason not to.
• That “good reason” is the meaning we
want to capture.
• By forcing the system to maintain that
balance, we encourage it to organize
such that concepts that are near each
other in space have similar meaning.
3030
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Meaning space should be smooth
• Variational methods ensure smoothness
• Navigating the meaning space
• Moving beyond pairs of sentences
• Conclusion
3131
Variational methods provide structure to meaning space
Variational methods can provide this
organizational constraint in the form of a prior.
• In the movie Arrival, variational principles in
physics allowed one to know the future.
• In our world, variational comes from the
calculus of variations; optimization over a space
of functions.
• Variational inference is coming up with a
simpler function that closely approximates the
true function.
3232
Variational methods provide structure to meaning space
Our simpler function for p(h|x) is q(h|x) = N(μ, σ).
We use a neural network (RNN) to output μ and σ.
To maximize this approximation, it turns out that we
maximize the evidence lower bound (ELBO):
L(x, q) = E_{q(h|x)}[log p(x|h)] - KL(q(h|x) || p(h)).
See Kingma and Welling https://arxiv.org/pdf/1312.6114.pdf
In practice, we only take
one sample and we use
the reparameterization
trick to keep it
differentiable.
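A minimal sketch of a one-sample ELBO estimate with the reparameterization trick, assuming the encoder has already produced mu and log_var for q(h|x) and that log_p_x_given_h is a stand-in for the decoder's reconstruction term log p(x|h).

```python
import numpy as np

def elbo_sample(mu, log_var, log_p_x_given_h):
    # Reparameterization trick: h = mu + sigma * eps keeps the sample
    # differentiable with respect to mu and sigma.
    eps = np.random.randn(*mu.shape)
    h = mu + np.exp(0.5 * log_var) * eps
    reconstruction = log_p_x_given_h(h)          # one-sample estimate of E_q[log p(x|h)]
    # KL(q(h|x) || N(0, I)) has a closed form for diagonal Gaussians:
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return reconstruction - kl                   # the ELBO L(x, q)

mu, log_var = np.zeros(16), np.zeros(16)                        # toy encoder outputs
print(elbo_sample(mu, log_var, lambda h: -0.5 * np.sum(h**2)))  # dummy decoder term
```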
3333
Variational methods provide structure to meaning space
Note that the equation
L(x, q) = E_{q(h|x)}[log p(x|h)] - KL(q(h|x) || p(h))
has the prior p(h).
The prior p(h) = N(0, I) forces the algorithm to use the space efficiently,
which forces it to put semantically similar latent representations
together. Without this prior, the algorithm could put h values anywhere
in the latent space that it wanted, willy-nilly.
[Diagram: the same sentences clustered around the prior N(0, I) in the meaning space]
3434
Trained using an autoencoder
The prior acts like regularization, so we are
1. maximizing the reconstruction fidelity p(x|h) and
2. minimizing the difference of h from the prior N(0, I).
Can be tricky to balance these things. Use KL annealing,
see Bowman et al., https://arxiv.org/pdf/1511.06349.pdf
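A minimal sketch of one possible KL-annealing schedule; the linear ramp and warmup length are assumptions, not prescribed by the paper.

```python
# Weight the KL term by a factor that ramps from 0 to 1, so the decoder learns
# to reconstruct before the prior constraint fully kicks in.
def kl_weight(step, warmup_steps=10000):
    return min(1.0, step / warmup_steps)

# loss = -reconstruction_log_likelihood + kl_weight(step) * kl_divergence
```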
[Diagram: text x → encoder → q(h|x), pulled toward the prior N(0, I) → decoder → reconstructed text x]
3535
We can also do classification over meaning space
When we train with variational
methods to regenerate the input text,
this is called a variational
autoencoder.
This can be the basis for learning with less
labeled training data (semi-supervised learning).
See Kingma et al., https://arxiv.org/pdf/1406.5298.pdf
[Diagram: a classifier operating over the N(0, I) meaning space of sentences]
3636
We can place a hierarchical structure on the meaning space
[Diagram: a hierarchical meaning space with clusters such as Food (tacos, candy), computers (GPUs, VIC-20), and Cars (old cars, self-driving)]
In the vision domain, Goyal et al. allow
one to learn a hierarchy on the prior space
using a hierarchical Dirichlet process
https://arxiv.org/pdf/1703.07027.pdf
Videos on understanding Dirichlet processes and the related
Chinese restaurant processes by Tamara Broderick
part I https://www.youtube.com/watch?v=kKZkNUvsJ4M
part II https://www.youtube.com/watch?v=oPcv8MaESGY
part III https://www.youtube.com/watch?v=HpcGlr19vNk
The Dirichlet process allows for an infinite
number of clusters. As the world changes and new
concepts emerge, you can keep adding clusters.
3737
Outline
• Introduction
• Seq2seq models and aligned data
• Representing text in meaning space
• Meaning space should be smooth
• Variational methods ensure smoothness
• Navigating the meaning space
• Adversarial text generation
• Conclusion
3838
We can learn functions to find desired parts of the meaning space
meaning space
In the vision domain, Engel et al. allow
one to translate a point in meaning
space to a new point with desired
characteristics
https://arxiv.org/pdf/1711.05772.pdf
[Diagram: a function f maps the point for Bob in the meaning space to the point for Bob with glasses: f(Bob)]
3939
We can learn functions to find desired parts of the meaning space
In the text domain, [0] provides a way to
rewrite sentences to change them
according to some function, such as
• making the sentiment more positive or
• making a Shakespearean sentence
sound more modern.
[0] Sequence to Better Sequence: Continuous Revision of Combinatorial Structures, http://www.mit.edu/~jonasm/info/Seq2betterSeq.pdf
4040
We can learn functions to find desired parts of the meaning space
Since the space is smooth, we might be able to hill climb to get whatever digital
artifacts we wanted, such as computer programs.
[see Liang et al., https://arxiv.org/pdf/1611.00020.pdf and Yang et al., https://arxiv.org/pdf/1711.06744.pdf]
This movie looks pretty good, but make it a little more steampunk ...
[0] Sequence to Better Sequence: Continuous Revision of Combinatorial Structures, http://www.mit.edu/~jonasm/info/Seq2betterSeq.pdf
Method [0]
1. Encode sentence s using a variational autoencoder to get h.
2. Look around the meaning space to find the point h* of the nearby space that
maximizes your evaluation function.
3. Decode h* to create a modified version of sentence s (a sketch of this loop follows).
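A minimal sketch of this loop, assuming simple random-restart hill climbing in the latent space; encode, decode, and score are stand-ins for the variational autoencoder and the evaluation function of [0].

```python
import numpy as np

def revise(sentence, encode, decode, score, steps=200, radius=0.1):
    """Look around the encoded point for a nearby h* that scores higher, then decode it."""
    h = encode(sentence)
    best_h, best_score = h, score(decode(h))
    for _ in range(steps):
        candidate = best_h + radius * np.random.randn(*best_h.shape)  # look nearby
        s = score(decode(candidate))
        if s > best_score:                       # keep the candidate if it scores better
            best_h, best_score = candidate, s
    return decode(best_h)                        # the modified version of the sentence
```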
4141
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Conclusion
4242
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Bridging meanings through shared space
• Adversarial text generation
• Learning through reinforcement learning
• Conclusion
4343
You have pairs of languages, but not the right ones
• You may have pairs of languages or styles but not the pairs you want.
• For example, you have lots of translations from English to Spanish and from English
to Tagalog but you want to translate directly from Spanish to Tagalog.
Google faced this scenario with language translation [0].
• Use a single encoder and decoder for all languages.
• Use word segments so there is one vocabulary table; and
the encoder can encode all languages to the same space.
• Add the target language as the first symbol of the source
text, so the decoder knows the language to use.
[0] Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System, https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
[Diagram: source-language text with a target-language symbol prepended → encoder → meaning space → decoder → text in the target language]
After, they could
translate pairs of
languages even when
there was no pairwise
training data.
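A minimal sketch of the data format: prepend a token naming the target language to the source text; the token spellings here are illustrative, not Google's exact ones.

```python
# Each training pair is (source text with target-language token, target text),
# so a single shared encoder/decoder learns which language to produce.
training_pairs = [
    ("<2es> The woman ate pizza .", "La mujer comió pizza ."),   # English -> Spanish
    ("<2en> La mujer comió pizza .", "The woman ate pizza ."),   # Spanish -> English
]
# At test time, "<2tl> La mujer comió pizza ." asks for Tagalog output even if
# no Spanish -> Tagalog pairs were ever seen in training (zero-shot translation).
```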
4444
You have no pairs at all
[Diagram: a shared meaning space feeding two decoders, one per style, each producing generated text]
Imagine you have text written in two styles, but it is not tied together in pairs.
For each style, you can learn a language model, which gives a probability
distribution over the next word given the previous ones.
Language models are the technology behind those posts of the form:
“I forced my computer to watch 1000 hours of Hee Haw and this was the result.”
But how do you associate your language model with a meaning space?
How do you get the probability of the first word?
4545
Assign a meaning space using an autoencoder!
You can associate a language model with a meaning space using an autoencoder.
But how can you line up the meaning spaces of the different languages and styles so
you can translate between them?
[Diagram: text x → encoder → q(h|x), pulled toward the prior N(0, I) → decoder → reconstructed text x]
4646
1. Train languages 𝐴 and 𝐵 with autoencoders using
same encoder and different decoders.
2. Start with a sentence 𝑏 ∈ 𝐵 (start with 𝐵 because we want to train in the other direction)
3. Encode 𝑏 and decode it into 𝑎 ∈ 𝐴
4. You now have a synthetic pair (𝑎, 𝑏)
5. Train using the synthetic pairs
Bridging meaning spaces with synthetic pairs [0]
[Diagram: text in language A or B → shared encoder → meaning space → decoder A → generated A, or decoder B → generated B]
[0] Unsupervised Neural Machine Translation
https://arxiv.org/pdf/1710.11041.pdf
Creating pairs this way is called backtranslation
Improving Neural Machine Translation Models with Monolingual Data,
http://www.aclweb.org/anthology/P16-1009
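A minimal sketch of one backtranslation round over steps 2–5 above, assuming stand-in encode, decode_A, and train_pair functions.

```python
def backtranslation_round(sentences_B, encode, decode_A, train_pair):
    """Create synthetic (a, b) pairs from monolingual B sentences and train on them."""
    for b in sentences_B:                 # start from real sentences in language B
        a = decode_A(encode(b))           # synthetic translation into language A
        train_pair(source=a, target=b)    # train the A -> B direction on the synthetic pair
```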
4747
Bridging meaning spaces with round-trip cost [0]
1. Start with sentence 𝑎 ∈ 𝐴
2. Backprop with autoencoder cost 𝑎 → 𝑎
3. Backprop with round trip cost 𝑎 → 𝑏 → 𝑎
4. Repeat with a sentence 𝑏 ∈ 𝐵
[0] Improved Neural Text Attribute Transfer with Non-
parallel Data https://arxiv.org/pdf/1711.09395.pdf
• For example, for a Yelp review, they automatically converted the positive sentiment phrase
“love the southwestern burger” to the negative sentiment phrase “avoid the grease burger.”
• Notice that the word “burger” is in both, as it should be. Maintaining semantic meaning can
sometimes be a challenge with these systems.
• In their work [0], they add an additional constraint that nouns should be maintained.
4848
Making the most of data with dual learning [0]
[Diagram: Text A → Encoder A → meaning space → Decoder B → Text B (the model from A to B); Decoder B on its own → Text B (the language model for B)]
[Diagram: Text B → Encoder B → meaning space → Decoder A → Text A (the model from B to A); Decoder A on its own → Text A (the language model for A)]
Policy gradient methods to maximize
1. communication reward, which is the fidelity of a
round-trip translation of 𝐴 back to 𝐴 (and 𝐵 back to 𝐵)
2. language model reward, which for 𝐴 → 𝐵 is how
likely 𝐵 is in the monolingual language model for 𝐵
(and likewise for 𝐵 → 𝐴).
[0] Achieving Human Parity on Automatic Chinese to English
News Translation https://arxiv.org/pdf/1803.05567.pdf
4949
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Bridging meanings through shared space
• Adversarial text generation
• Learning through reinforcement learning
• Conclusion
5050
Examples on movie reviews:
(“negative”, “past”)
the acting was also kind of hit or miss .
(“positive”, “past”)
his acting was impeccable
(“negative”, “present”)
the movie is very close to the show in plot and characters
(“positive”, “present”)
this is one of the better dance films
Using a discriminator [0]
[Diagram: text → encoder → N(0, 1) meaning space, with the desired feature specified → decoder → generated text with that feature; a learned discriminator on that feature guides training]
[0] Hu et al. allow one to manipulate single
values of the meaning vector to change what
is written https://arxiv.org/abs/1703.00955
5151
Generative Adversarial networks (GANs)
[Diagram: Generator → Synthetic Data → Discriminator ← Real Data; the discriminator provides guidance to improve the generator]
• Generative Adversarial Networks
(GANs) are built on deep learning.
• A GAN consists of a generator
neural network and a
discriminator neural network
training together in an adversarial
relationship.
Goodfellow et al., Generative Adversarial
Networks, https://arxiv.org/abs/1406.2661
5252
GAN for text generation
[Diagram: text → encoder → meaning space → decoder → generated text, with a learned discriminator on the text]
Problem: Generating text in this way is not differentiable.
Three solutions
1. Soft selection (take weighted average of words) (Hu does this.)
2. Run discriminator on continuous meaning space or
continuous decoder state. E.g., Professor Forcing [0]
3. Use reinforcement learning.
The discriminator could
try to differentiate
generated from real text.
[0] Professor Forcing: A New Algorithm for Training Recurrent Networks
https://arxiv.org/pdf/1610.09038.pdf
[1] Adversarial Learning for Neural Dialogue Generation
https://arxiv.org/pdf/1701.06547.pdf
[Note, encoding could be a line in dialog, like in [1]].
5353
Outline
• Introduction
• Sequence-to-sequence models: learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Bridging meanings through shared space
• Adversarial text generation
• Learning through reinforcement learning
• Conclusion
5454
Reinforcement learning is a gradual stamping in of behavior
• Reinforcement learning (RL) was studied in
animals by Thorndike [1898].
• Became the study of Behaviorism [Skinner, 1953]
(see Skinner box on the right).
• Formulated into artificial intelligence; see
leading book by Sutton and Barto, 1998.
(new draft available online
http://incompleteideas.net/book/bookdraft2018jan1.pdf)
By Andreas1 (Adapted from Image:Boite skinner.jpg) [GFDL
(http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0
(http://creativecommons.org/licenses/by-sa/3.0/)], via
Wikimedia Commons
A “Skinner box”
5555
Beginning with random exploration
In reinforcement learning, the agent begins by
randomly exploring until it reaches its goal.
5656
• When it reaches the goal, credit is propagated back to its previous
states.
• The agent learns the function Q^π(s, a), which gives the cumulative
expected discounted reward of being in state s and taking action
a and acting according to policy π thereafter.
Reaching the goal
5757
Eventually, the agent learns the value of being in each state and
taking each action and can therefore always do the best thing in
each state.
Learning the behavior
5858
RL is now like supervised learning over actions
With the rise of deep learning, we can now more effectively work in high
dimensional and continuous domains.
When domains have large action spaces, the kind of reinforcement learning that
works best is policy gradient methods.
• D. Silver http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
• A. Karpathy http://karpathy.github.io/2016/05/31/rl/
Policy gradient methods learn a policy directly on top of a neural network.
Basically softmax over actions on top of a neural network.
Reinforcement learning starts to look more like supervised learning, where the
labels are actions.
Instead of doing gradient descent on the error function as in supervised learning,
you are doing gradient ascent on the expected reward.
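A minimal sketch of a REINFORCE-style policy-gradient update for a softmax policy over actions (words), assuming a toy linear policy and no baseline; real systems use a neural network and subtract a learned baseline from the reward.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def policy_gradient_update(theta, state, action, reward, lr=0.01):
    """theta: (state_dim, num_actions); nudge theta so the taken action becomes
    more likely in proportion to the reward (gradient ascent on expected reward)."""
    probs = softmax(state @ theta)
    grad_logp = np.outer(state, -probs)      # d log p(action) / d theta, all columns
    grad_logp[:, action] += state            # extra term for the action actually taken
    return theta + lr * reward * grad_logp

np.random.seed(0)
theta = np.zeros((8, 5))                     # 5 possible actions (words)
theta = policy_gradient_update(theta, state=np.random.randn(8), action=2, reward=1.0)
```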
5959
Searching forward to get this reward
You have to do a search forward to
get the reward.
[Diagram: decoder that has produced “La” (h4) and “mujer” (h5) and is considering “comió” at h6]
Since we are not doing teacher forcing where
we know what the next word should be, we
do a forward search: If we were to write
“comió” and follow our current policy after
that, how good would the final reward be?
That “how good” is the estimate we use to
determine whether writing “comió” should be
the action this network takes in the future.
[see Lantao et al., https://arxiv.org/pdf/1609.05473v3.pdf]
Called Monte Carlo search
This is similar to how game-playing AI is done, for example with the recent success in Go
[see Silver et al., https://arxiv.org/pdf/1712.01815.pdf ]
In the language domain, this kind of training can be very slow. There are 50,000 choices.
6060
Outline
• Introduction
• Learning from pairs of sentences
• Mapping to and from meaning space
• Moving beyond pairs of sentences
• Conclusion
6161
Conclusion: How to Get Started?
GPT2 https://gpt2.apps.allenai.org/?text=Joel%20is
https://github.com/huggingface/pytorch-pretrained-BERT
If you have pairs, OpenNMT http://opennmt.net/
Texar (Built on TensorFlow) https://github.com/asyml/texar
6262
Conclusion: when should you use current neural
text generation?
If you can enumerate all of the patterns for the things you want the computer to
understand and say: use context-free grammars for understanding and templates for
language generation.
If you can’t enumerate what you want the computer to say and you have a lot of data,
neural text generation might be the way to go.
• Machine translation is an excellent example.
• Or, if you want your text converted to different styles.
• Machines will be illiterate until we can teach them common sense.
• But as algorithms get better at controlling and navigating the meaning space, neural
text generation has the potential to be transformative.
• Few things are more enjoyable than reading a story by your favorite author, and if
we could have all of our information given to us with a style tailored to our tastes,
the world could be enchanting.
Text can’t be long, and
it won’t be exact.
6363
Thank You!
Questions ?
6500 River Place Blvd.
Bldg. 3, Suite 120
Austin, TX. 78730
@DeUmbraAI

Más contenido relacionado

La actualidad más candente

Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information RetrievalShadi Saleh
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysisodsc
 
Improving Neural Abstractive Text Summarization with Prior Knowledge
Improving Neural Abstractive Text Summarization with Prior KnowledgeImproving Neural Abstractive Text Summarization with Prior Knowledge
Improving Neural Abstractive Text Summarization with Prior KnowledgeGaetano Rossiello, PhD
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Suraj Aavula
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Simplilearn
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoningShiwani Gupta
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxMrUnknown820784
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMDivya Gera
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measuresankit_ppt
 
LSTM Based Sentiment Analysis
LSTM Based Sentiment AnalysisLSTM Based Sentiment Analysis
LSTM Based Sentiment Analysisijtsrd
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial IntelligenceBise Mond
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesPier Luca Lanzi
 

La actualidad más candente (20)

Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
Improving Neural Abstractive Text Summarization with Prior Knowledge
Improving Neural Abstractive Text Summarization with Prior KnowledgeImproving Neural Abstractive Text Summarization with Prior Knowledge
Improving Neural Abstractive Text Summarization with Prior Knowledge
 
Machine learning projects
Machine learning projectsMachine learning projects
Machine learning projects
 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le...
 
Cnn method
Cnn methodCnn method
Cnn method
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoning
 
Recurrent neural networks
Recurrent neural networksRecurrent neural networks
Recurrent neural networks
 
Decision tree
Decision treeDecision tree
Decision tree
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptx
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 
Text summarization
Text summarizationText summarization
Text summarization
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
LSTM Based Sentiment Analysis
LSTM Based Sentiment AnalysisLSTM Based Sentiment Analysis
LSTM Based Sentiment Analysis
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers Ensembles
 

Similar a Generating Natural-Language Text with Neural Networks

Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchNatasha Latysheva
 
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...NILESH VERMA
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0Plain Concepts
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018Olaf de Leeuw
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdfFEG
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowBruno Gonçalves
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportAkshit Arora
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Lecture 2 Hierarchy of NLP & TF-IDF.pptx
Lecture 2 Hierarchy of NLP & TF-IDF.pptxLecture 2 Hierarchy of NLP & TF-IDF.pptx
Lecture 2 Hierarchy of NLP & TF-IDF.pptxKunalSingh560957
 

Similar a Generating Natural-Language Text with Neural Networks (20)

Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
Building a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From ScratchBuilding a Neural Machine Translation System From Scratch
Building a Neural Machine Translation System From Scratch
 
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018
 
5_RNN_LSTM.pdf
5_RNN_LSTM.pdf5_RNN_LSTM.pdf
5_RNN_LSTM.pdf
 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
 
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Ruby Egison
Ruby EgisonRuby Egison
Ruby Egison
 
Word2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlowWord2vec in Theory Practice with TensorFlow
Word2vec in Theory Practice with TensorFlow
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Word embedding
Word embedding Word embedding
Word embedding
 
Lecture 2 Hierarchy of NLP & TF-IDF.pptx
Lecture 2 Hierarchy of NLP & TF-IDF.pptxLecture 2 Hierarchy of NLP & TF-IDF.pptx
Lecture 2 Hierarchy of NLP & TF-IDF.pptx
 

Más de Jonathan Mugan

How to build someone we can talk to
How to build someone we can talk toHow to build someone we can talk to
How to build someone we can talk toJonathan Mugan
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedJonathan Mugan
 
Data Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIData Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIJonathan Mugan
 
Data Day Seattle, Chatbots from First Principles
Data Day Seattle, Chatbots from First PrinciplesData Day Seattle, Chatbots from First Principles
Data Day Seattle, Chatbots from First PrinciplesJonathan Mugan
 
Chatbots from first principles
Chatbots from first principlesChatbots from first principles
Chatbots from first principlesJonathan Mugan
 
From Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceFrom Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceJonathan Mugan
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingJonathan Mugan
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 

Más de Jonathan Mugan (9)

How to build someone we can talk to
How to build someone we can talk toHow to build someone we can talk to
How to build someone we can talk to
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow Extended
 
Data Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AIData Day Seattle, From NLP to AI
Data Day Seattle, From NLP to AI
 
Data Day Seattle, Chatbots from First Principles
Data Day Seattle, Chatbots from First PrinciplesData Day Seattle, Chatbots from First Principles
Data Day Seattle, Chatbots from First Principles
 
Chatbots from first principles
Chatbots from first principlesChatbots from first principles
Chatbots from first principles
 
From Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial IntelligenceFrom Natural Language Processing to Artificial Intelligence
From Natural Language Processing to Artificial Intelligence
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Generating Natural-Language Text with Neural Networks

  • 1. 11 Generating Natural-Language Text with Neural Networks Jonathan Mugan @jmugan
  • 2. 22 Computers are Illiterate Reading requires mapping the words on a page to shared concepts in our culture. Writing requires mapping those shared concepts into other words on a page. We currently don’t know how to endow computer systems with a conceptual system rich enough to represent even what a small child knows. However, AI researchers have developed a method that can 1. encode representations of our world, such as images or words on a page, into a space of meanings (proto-read) 2. convert points in that meaning space into language (proto-write) This method uses neural networks, so we call it neural text generation.
  • 3. 33 Overview of neural text generation world state encoder meaning space decoder n generated text decoder 1 generated text ... Neural text generation is useful for • Machine translation • Image captioning • Style manipulation
  • 4. 44 Overview of neural text generation world state encoder meaning space decoder n generated text decoder 1 generated text ... • Imagine having the text on your website customized for each user. • Imagine being able to have your boring HR manual translated into the style of Cormac McCarthy. The coffee in the break room is available to all employees, like some primordial soup handed out to the huddled and wasted damned.
  • 5. 55 Limited by a lack of common sense world state encoder meaning space decoder n generated text decoder 1 generated text ... • It lacks precision: if you ask it to make you a reservation on Tuesday, it may make it on Wednesday because the word vectors are similar (also see [0]). • It also makes mistakes that no human would not make: in image captioning it can mistake a toothbrush for a baseball bat [1]. [0] A Random Walk Through EMNLP 2017 http://approximatelycorrect.com/2017/09/26/a-random-walk-through-emnlp-2017/ [1] Deep Visual-Semantic Alignments for Generating Image Descriptions http://cs.stanford.edu/people/karpathy/deepimagesent/ The beautiful: as our neural networks get richer, the meaning space will get closer to being able to represent the concepts a small child can understand, and we will get closer to human-level literacy. We are taking baby steps toward real intelligence.
  • 6. 66 Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Conclusion
  • 7. 77 Outline • Introduction • Learning from pairs of sentences • Overview • Encoders • Decoders • Attention • Uses • Mapping to and from meaning space • Adversarial text generation • Conclusion
  • 8. 88 Sequence-to-sequence models Both the encoding and decoding are often done using recurrent neural networks (RNNs). The initial big application for this was machine translation. For example, where the source sentences are English and the target sentences are Spanish. A sequence-to-sequence (seq2seq) model 1. encodes a sequence of tokens, such as a sentence, into a single vector 2. decodes that vector into another sequences of tokens
  • 9. 99 The ℎ" woman ℎ# ate ℎ$ pizza ℎ% . ℎ& As each word is examined, the encoder RNN integrates it into the hidden state vector h. Encoding a sentence for machine translation
  • 10. 1010 La ℎ& mujer ℎ' comió ℎ( pizza ℎ) . ℎ* The model begins with the last hidden state from the encoder and generates words until a special stop symbol is generated. Decoding the sentence into Spanish Note that the lengths of the source and target sentences do not need to be the same, because it goes until it generates the stop symbol.
  • 11. 1111 Generalized view text meaning space generated text ℎ&
  • 12. Outline • Introduction • Learning from pairs of sentences • Overview • Encoders • Decoders • Attention • Uses • Mapping to and from meaning space • Adversarial text generation • Conclusion
  • 13. Concrete example of an RNN encoder
Let's look at a simple cell so we can get a grounded understanding. As each word of "The woman ate pizza ." is read, the encoder updates hidden states h_1 ... h_5.
Vocabulary table (each row is actually 300 dimensions), with example rows for aardvark, ate, ..., zoology.
• V: vocabulary table of size 50,000 by 300
• v_t: word vector, a row of V
• W_h: transition matrix of size 300 by 300
• W_x: input matrix of size 300 by 300
• b: bias vector of size 300
• h_t = tanh(h_{t-1} W_h + v_t W_x + b)
Parameters start off as random values and are updated through backpropagation.
See Christopher Olah's blog post for a great presentation of more complex cells like LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
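To make the cell concrete, here is a minimal NumPy sketch of that encoder step, using the slide's sizes (50,000 × 300 vocabulary, 300 × 300 matrices). The variable names W_h and W_x and the token ids are stand-ins for illustration, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, EMB_DIM = 50_000, 300                       # sizes from the slide
V   = rng.normal(size=(VOCAB_SIZE, EMB_DIM))            # vocabulary table
W_h = rng.normal(scale=0.01, size=(EMB_DIM, EMB_DIM))   # transition matrix
W_x = rng.normal(scale=0.01, size=(EMB_DIM, EMB_DIM))   # input matrix
b   = np.zeros(EMB_DIM)                                 # bias vector

def encode(token_ids):
    """Run the simple RNN cell over a sentence and return the final hidden state."""
    h = np.zeros(EMB_DIM)
    for t in token_ids:                      # one step per word
        v = V[t]                             # look up the word vector
        h = np.tanh(h @ W_h + v @ W_x + b)   # h_t = tanh(h_{t-1} W_h + v_t W_x + b)
    return h

# hypothetical token ids for "The woman ate pizza ."
h_final = encode([12, 845, 33, 1024, 7])
print(h_final.shape)                         # (300,)
```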
  • 14. Convolutional Neural Network (CNN) text encoder
1) For each word in the sentence ("Tom ate with his brother"), get its word vector from the vocabulary table (each row is actually 300 dimensions).
• 128 filters of width 3, 4, and 5; a total of 384 filters (an example filter three words wide has size 300 by 3).
2) Slide each filter over the sentence and element-wise multiply and add (dot product); then take the max of the values seen.
• The result is a sentence vector of length 384 that represents the sentence, analogous to the RNN encoder's final state h_5.
• For classification: each class has a vector of weights of length 384.
3) To decide which class, the class vector is multiplied element-wise by the sentence vector and summed (dot product); the class with the highest sum wins.
• See the code by Denny Britz: https://github.com/dennybritz/cnn-text-classification-tf (README.md references his blog post and the paper by Y. Kim).
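A rough NumPy sketch of the same max-over-time idea, with the filter counts and widths from the slide; the random word vectors and weights are illustrative stand-ins, and Denny Britz's repository linked above is the real reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, N_FILTERS = 300, 128

# one weight tensor per filter width (3, 4, 5), as on the slide
filters = {w: rng.normal(scale=0.01, size=(N_FILTERS, w, EMB_DIM)) for w in (3, 4, 5)}

def cnn_encode(word_vectors):
    """word_vectors: (sentence_len, 300) -> sentence vector of length 384."""
    feats = []
    for width, bank in filters.items():
        windows = [word_vectors[i:i + width] for i in range(len(word_vectors) - width + 1)]
        for f in bank:                                     # slide each filter over the sentence
            scores = [np.sum(win * f) for win in windows]  # dot product at each position
            feats.append(max(scores))                      # max over time
    return np.array(feats)                                 # 3 * 128 = 384 features

sentence = rng.normal(size=(7, EMB_DIM))                   # a 7-word sentence (random stand-in)
print(cnn_encode(sentence).shape)                          # (384,)
```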
  • 15. Outline • Introduction • Learning from pairs of sentences • Overview • Encoders • Decoders • Attention • Uses • Mapping to and from meaning space • Adversarial text generation • Conclusion
  • 16. Concrete example of an RNN decoder
The decoder generates "La mujer comió pizza ." with hidden states h_5 ... h_9.
• V: vocabulary table of size 50,000 by 300
• v_{t-1}: word vector, a row of V; v_6 is "mujer"
• W_h': transition matrix of size 300 by 300
• W_x': input matrix of size 300 by 300
• b: bias vector of size 300
• h_t = tanh(h_{t-1} W_h' + v_{t-1} W_x' + b)
• W_o': output matrix of size 300 by 50,000
• p(words) = softmax(h_t W_o'), a probability distribution over next words
• The softmax function makes them all sum to 1: softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
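A minimal sketch of one decoder step, assuming the same simple cell as the encoder plus the output matrix described above (named W_o in the code, corresponding to W_o' on the slide); the token id for "La" is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMB_DIM = 50_000, 300
V    = rng.normal(size=(VOCAB_SIZE, EMB_DIM))              # shared vocabulary table
W_hd = rng.normal(scale=0.01, size=(EMB_DIM, EMB_DIM))     # decoder transition matrix (W_h')
W_xd = rng.normal(scale=0.01, size=(EMB_DIM, EMB_DIM))     # decoder input matrix (W_x')
W_o  = rng.normal(scale=0.01, size=(EMB_DIM, VOCAB_SIZE))  # output matrix (W_o')
b    = np.zeros(EMB_DIM)

def softmax(x):
    e = np.exp(x - x.max())        # subtract the max for numerical stability
    return e / e.sum()

def decoder_step(h_prev, prev_token_id):
    """One decoding step: update the hidden state, then score every word in the vocabulary."""
    v = V[prev_token_id]
    h = np.tanh(h_prev @ W_hd + v @ W_xd + b)  # h_t = tanh(h_{t-1} W_h' + v_{t-1} W_x' + b)
    p_next = softmax(h @ W_o)                  # distribution over the 50,000 next words
    return h, p_next

h0 = np.zeros(EMB_DIM)                         # in practice: the encoder's final state
h1, p = decoder_step(h0, prev_token_id=123)    # 123 is a hypothetical id for "La"
print(p.shape, round(p.sum(), 4))              # (50000,) 1.0
```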
  • 17. Models are trained with paired sentences to minimize cost
At time t = 6 we know the correct word is "comió", so the cost is cost = −log(p(comió)). So the more likely "comió" is, the lower the cost.
Also, note that regardless of what the model says at time t − 1, we feed in "mujer" at time t. This is called teacher forcing.
• W_o': output matrix of size 300 by 50,000
• p(words) = softmax(h_t W_o'), a probability distribution over next words
• The softmax function makes them all sum to 1: softmax(x_i) = e^{x_i} / Σ_j e^{x_j}
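A toy illustration of teacher forcing and the per-word cost −log p(correct word); the tiny vocabulary, random weights, and start-of-sentence id are stand-ins so the sketch runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1_000, 32                        # toy sizes to keep the sketch small
V   = rng.normal(size=(VOCAB, DIM))
W_h = rng.normal(scale=0.1, size=(DIM, DIM))
W_x = rng.normal(scale=0.1, size=(DIM, DIM))
W_o = rng.normal(scale=0.1, size=(DIM, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def teacher_forced_cost(h, target_ids):
    """Sum of -log p(correct word) over a target sentence, feeding in the true
    previous word at every step (teacher forcing)."""
    cost, prev = 0.0, 0                       # 0 stands in for a start-of-sentence id
    for t in target_ids:
        h = np.tanh(h @ W_h + V[prev] @ W_x)  # decoder step (bias omitted for brevity)
        p = softmax(h @ W_o)
        cost += -np.log(p[t])                 # penalize low probability on the true word
        prev = t                              # feed the *true* word, not the model's guess
    return cost

print(round(teacher_forced_cost(np.zeros(DIM), target_ids=[5, 42, 7, 3]), 2))
```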
  • 18. When it comes time to generate text
When you are actually using the model to generate text, you don't know what the right answer is.
• You could use a greedy decoder: take the most likely first word, then, given that word, take the most likely next word, and so on. But that could lead you down a suboptimal path (too cheap).
• You could look at all sequences of words up to a given length and choose the sequence with the lowest cost. But there are far too many (too expensive).
• A compromise is beam search: start generating with the first word and keep the k (around 10) most promising sequences.
Beam search can sometimes lack diversity, and there have been proposals to address this; see Li and Jurafsky, https://arxiv.org/pdf/1601.00372v2.pdf and Ashwin et al., https://arxiv.org/pdf/1610.02424v1.pdf
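A compact beam-search sketch over an arbitrary next-word function; the toy step function, the ids bos_id/eos_id, and the vocabulary size are placeholders for a real decoder.

```python
import numpy as np

def beam_search(step_fn, start_state, bos_id, eos_id, k=10, max_len=20):
    """step_fn(state, token_id) -> (new_state, log_probs over the vocabulary).
    Keeps the k lowest-cost (highest log-probability) partial sentences."""
    beams = [(0.0, [bos_id], start_state)]               # (cost, tokens, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for cost, tokens, state in beams:
            if tokens[-1] == eos_id:                      # this hypothesis is complete
                finished.append((cost, tokens))
                continue
            new_state, log_p = step_fn(state, tokens[-1])
            for tok in np.argsort(log_p)[-k:]:            # expand only the k best next words
                candidates.append((cost - log_p[tok], tokens + [int(tok)], new_state))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0])[:k]   # keep the k most promising
    finished.extend((c, t) for c, t, _ in beams)          # include unfinished hypotheses
    return min(finished, key=lambda c: c[0])[1]           # lowest total cost wins

# toy step function that just returns random log-probabilities, to show the interface
rng = np.random.default_rng(0)
def toy_step(state, token_id):
    logits = rng.normal(size=50)
    return state, logits - np.log(np.exp(logits).sum())   # log-softmax

print(beam_search(toy_step, start_state=None, bos_id=0, eos_id=1, k=5))
```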
  • 19. Outline • Introduction • Learning from pairs of sentences • Overview • Encoders • Decoders • Attention • Uses • Mapping to and from meaning space • Adversarial text generation • Conclusion
  • 20. Attention helps the model consider more information
The network learns how to determine which previous memory is most relevant at each decision point [Bahdanau et al., 2014]. (Encoder states h_1 ... h_5 for "The woman ate tacos ."; decoder states h_5 ... h_9 while generating "La mujer comió tacos .")
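For intuition, here is a sketch of one attention step using simple dot-product scores. Bahdanau et al. actually score states with a small additive network, so treat this as the general shape of attention, not their exact formula; the random encoder and decoder states are stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Score every encoder hidden state against the current decoder state,
    then return their weighted average (the context vector)."""
    scores = encoder_states @ decoder_state       # one scalar per source word
    weights = softmax(scores)                     # how relevant each source word is right now
    context = weights @ encoder_states            # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 300))                   # h_1..h_5 for "The woman ate tacos ."
dec = rng.normal(size=300)                        # current decoder state
context, weights = attend(dec, enc)
print(weights.round(2), context.shape)            # attention weights sum to 1, (300,)
```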
  • 21. RNN seq2seq model with attention: text → meaning space (h_5) → generated text, with attention connecting the decoder back to the encoder states.
  • 22. Outline • Introduction • Learning from pairs of sentences • Overview • Encoders • Decoders • Attention • Uses • Mapping to and from meaning space • Adversarial text generation • Conclusion
  • 23. Seq2seq works on all kinds of sentence pairs
For example, take a bunch of dialogs and use machine learning to predict what the next statement will be given the last statement.
• Movie and TV subtitles: OpenSubtitles http://opus.lingfil.uu.se/OpenSubtitles.php
• You can mine Twitter: look for replies to tweets using the API.
• Ubuntu Dialog Corpus: dialogs of people wanting technical support https://arxiv.org/abs/1506.08909
Another example is text summarization, e.g., learning to generate headlines for news articles. See https://memkite.com/deeplearningkit/2016/04/23/deep-learning-for-text-summarization/
  • 24. Style and more
• You can even do style transfer [0]. They had Mechanical Turk workers write different styles of sentences for training.
• We need to get them on the Cormac McCarthy task.
[0] Seq2seq style transfer: Dear Sir or Madam, May I introduce the YAFC Corpus: Corpus, Benchmarks and Metrics for Formality Style Transfer https://arxiv.org/pdf/1803.06535.pdf
  • 25. Can even automatically caption images
A convolutional neural network encodes the image into the meaning space (h_5), and an RNN decoder generates the caption "Weird dude on table ." (Pictured: "Chavo protects my office from Evil spirits.")
Karpathy and Fei-Fei, 2015 http://cs.stanford.edu/people/karpathy/deepimagesent/
Vinyals et al., 2014 http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
Seq2seq takeaway: works well if you have lots of pairs of things to train on.
  • 26. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Meaning space should be smooth • Variational methods ensure smoothness • Navigating the meaning space • Moving beyond pairs of sentences • Conclusion
  • 27. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Meaning space should be smooth • Variational methods ensure smoothness • Navigating the meaning space • Moving beyond pairs of sentences • Conclusion
  • 28. Meaning space should be smooth
Meaning space: a mapping of concepts and ideas to points in a continuous, high-dimensional grid.
We want:
• Semantically meaningful things to be near each other ("I love tacos" near "tacos are me" and "we are all tacos", far from "This is an outrage!").
• To be able to interpolate between sentences.
The meaning space could be the last state of the RNN or the output of the CNN, but we want to use the space as a representation of meaning. We can get more continuity in the space when we use a bias that encourages the neural network to represent the space compactly [0].
[0] Generating Sentences from a Continuous Space, https://arxiv.org/abs/1511.06349
  • 29. Meaning space should be smooth
• We get this bias by adding a prior probability distribution as an organizational constraint.
• The prior wants to put all points at (0, 0, ..., 0) unless there is a good reason not to.
• That "good reason" is the meaning we want to capture.
• By forcing the system to maintain that balance, we encourage it to organize such that concepts that are near each other in space have similar meaning.
  • 30. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Meaning space should be smooth • Variational methods ensure smoothness • Navigating the meaning space • Moving beyond pairs of sentences • Conclusion
  • 31. Variational methods provide structure to meaning space
Variational methods can provide this organizational constraint in the form of a prior.
• In the movie Arrival, variational principles in physics allowed one to know the future.
• In our world, "variational" comes from the calculus of variations: optimization over a space of functions.
• Variational inference is coming up with a simpler function that closely approximates the true function.
  • 32. Variational methods provide structure to meaning space
Our simpler function for p(h|x) is q(h|x) = N(μ, σ). We use a neural network (RNN) to output μ and σ.
To maximize this approximation, it turns out that we maximize the evidence lower bound (ELBO):
L(x, q) = E_{q(h|x)}[log p(x|h)] − KL(q(h|x) || p(h)).
See Kingma and Welling https://arxiv.org/pdf/1312.6114.pdf
In practice, we only take one sample and we use the reparameterization trick to keep it differentiable (sketched below).
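A small sketch of the two pieces mentioned above: the reparameterization trick and the closed-form KL term against N(0, I). The encoder outputs and the reconstruction term are placeholders, since they would come from the encoder and decoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: h = mu + sigma * eps, so the sample stays
    differentiable with respect to mu and sigma (the randomness lives in eps)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), the second term of the ELBO."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = rng.normal(size=64)                  # stand-in for the encoder's mu output
log_var = rng.normal(scale=0.1, size=64)  # stand-in for the encoder's log sigma^2 output
h = sample_latent(mu, log_var)

reconstruction_log_prob = -42.0           # placeholder: E_q[log p(x|h)] would come from the decoder
elbo = reconstruction_log_prob - kl_to_standard_normal(mu, log_var)
print(h.shape, round(elbo, 2))
```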
  • 33. Variational methods provide structure to meaning space
Note that the equation L(x, q) = E_{q(h|x)}[log p(x|h)] − KL(q(h|x) || p(h)) contains the prior p(h).
The prior p(h) = N(0, I) forces the algorithm to use the space efficiently, which forces it to put semantically similar latent representations together. Without this prior, the algorithm could put h values anywhere in the latent space that it wanted, willy-nilly.
  • 34. Trained using an autoencoder
(text x → encoder → q(h|x), regularized toward N(0, I) → decoder → text x)
The prior acts like regularization, so we are
1. maximizing the reconstruction fidelity p(x|h), and
2. minimizing the difference of h from the prior N(0, I).
It can be tricky to balance these things. Use KL annealing; see Bowman et al., https://arxiv.org/pdf/1511.06349.pdf
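One common way KL annealing is implemented is a warm-up schedule on the KL coefficient; the linear schedule, step counts, and toy loss values below are illustrative assumptions, not the exact schedule from Bowman et al.

```python
def kl_weight(step, warmup_steps=10_000):
    """Linearly anneal the KL coefficient from 0 to 1 so the decoder does not
    learn to ignore the latent code early in training (one common schedule)."""
    return min(1.0, step / warmup_steps)

def vae_cost(reconstruction_nll, kl, step):
    """Total cost: reconstruction term plus the annealed KL term."""
    return reconstruction_nll + kl_weight(step) * kl

# toy numbers just to show the schedule's effect over training
for step in (0, 2_500, 5_000, 10_000, 50_000):
    print(step, round(vae_cost(reconstruction_nll=30.0, kl=8.0, step=step), 2))
```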
  • 35. We can also do classification over meaning space
When we train with variational methods to regenerate the input text, this is called a variational autoencoder. A classifier on top of the meaning space can be the basis for learning with less labeled training data (semi-supervised learning). See Kingma et al., https://arxiv.org/pdf/1406.5298.pdf
  • 36. We can place a hierarchical structure on the meaning space
(Hierarchical meaning space, e.g.: Food → tacos, candy; computers → GPUs, VIC-20; cars → old cars, self-driving.)
In the vision domain, Goyal et al. allows one to learn a hierarchy on the prior space using a hierarchical Dirichlet process https://arxiv.org/pdf/1703.07027.pdf
The Dirichlet process allows for an infinite number of clusters. As the world changes and new concepts emerge, you can keep adding clusters.
Videos on understanding Dirichlet processes and the related Chinese restaurant processes by Tamara Broderick: part I https://www.youtube.com/watch?v=kKZkNUvsJ4M, part II https://www.youtube.com/watch?v=oPcv8MaESGY, part III https://www.youtube.com/watch?v=HpcGlr19vNk
  • 37. Outline • Introduction • Seq2seq models and aligned data • Representing text in meaning space • Meaning space should be smooth • Variational methods ensure smoothness • Navigating the meaning space • Adversarial text generation • Conclusion
  • 38. We can learn functions to find desired parts of the meaning space
In the vision domain, Engel et al. allows one to translate a point in meaning space to a new point with desired characteristics (e.g., Bob → Bob with glasses, f(Bob)). https://arxiv.org/pdf/1711.05772.pdf
  • 39. We can learn functions to find desired parts of the meaning space
In the text domain, [0] provides a way to rewrite sentences to change them according to some function, such as
• making the sentiment more positive, or
• making a Shakespearean sentence sound more modern.
[0] Sequence to Better Sequence: Continuous Revision of Combinatorial Structures, http://www.mit.edu/~jonasm/info/Seq2betterSeq.pdf
  • 40. We can learn functions to find desired parts of the meaning space
Method [0]:
1. Encode sentence s using a variational autoencoder to get h.
2. Look around the meaning space to find the nearby point h* that maximizes your evaluation function (a sketch follows below).
3. Decode h* to create a modified version of sentence s.
Since the space is smooth, we might be able to hill climb to get whatever digital artifacts we wanted, such as computer programs [see Liang et al., https://arxiv.org/pdf/1611.00020.pdf and Yang et al., https://arxiv.org/pdf/1711.06744.pdf]. "This movie looks pretty good, but make it a little more steampunk ..."
[0] Sequence to Better Sequence: Continuous Revision of Combinatorial Structures, http://www.mit.edu/~jonasm/info/Seq2betterSeq.pdf
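A sketch of step 2 as simple random hill climbing in the latent space. The encoded sentence, the scoring function, and the step size are all stand-ins, and the cited method may search the space differently; the point is only that a smooth space makes local search meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def hill_climb(h, score_fn, n_steps=200, step_size=0.1):
    """Local search in the meaning space: propose a small perturbation of h
    and keep it whenever the evaluation function improves."""
    best, best_score = h, score_fn(h)
    for _ in range(n_steps):
        candidate = best + step_size * rng.standard_normal(h.shape)
        s = score_fn(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best

h = rng.normal(size=64)                        # "encode sentence s" (hypothetical latent vector)
score = lambda z: -np.linalg.norm(z - 1.0)     # stand-in for a sentiment or style scorer
h_star = hill_climb(h, score)
print(round(score(h), 2), "->", round(score(h_star), 2))   # then decode h_star back into text
```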
  • 41. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Conclusion
  • 42. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Bridging meanings through shared space • Adversarial text generation • Learning through reinforcement learning • Conclusion
  • 43. You have pairs of languages, but not the right ones
• You may have pairs of languages or styles, but not the pairs you want.
• For example, you have lots of translations from English to Spanish and from English to Tagalog, but you want to translate directly from Spanish to Tagalog.
Google faced this scenario with language translation [0]:
• Use a single encoder and decoder for all languages.
• Use word segments so there is one vocabulary table, and the encoder can encode all languages to the same space.
• Add the target language as the first symbol of the source text, so the decoder knows the language to use (source text plus target symbol → encoder → meaning space → decoder → target language).
Afterward, they could translate pairs of languages even when there was no pairwise training data.
[0] Zero-Shot Translation with Google's Multilingual Neural Machine Translation System, https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
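A trivial illustration of the target-symbol trick; the token format <2es> and the helper name are made up for this example and are not necessarily the exact symbols Google used.

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a symbol naming the desired output language, so one shared
    decoder knows which language to produce (illustrative token format)."""
    return f"<2{target_lang}> {source_sentence}"

# Tagalog source, Spanish target: the pair never seen together in training
print(add_target_token("Kumain ng pizza ang babae.", "es"))
```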
  • 44. You have no pairs at all
Imagine you have text written in two styles, but it is not tied together in pairs. For each style, you can learn a language model, which gives a probability distribution over the next word given the previous ones.
Language models are the technology behind those posts of the form: "I forced my computer to watch 1000 hours of Hee Haw and this was the result."
But how do you associate your language model with a meaning space? How do you get the probability of the first word?
  • 45. Assign a meaning space using an autoencoder!
You can associate a language model with a meaning space using an autoencoder (text x → encoder → q(h|x), with prior N(0, I) → decoder → text x).
But how can you line up the meaning spaces of the different languages and styles so you can translate between them?
  • 46. Bridging meaning spaces with synthetic pairs [0]
(One shared encoder for language A or B → meaning space → decoder A or decoder B.)
1. Train languages A and B with autoencoders using the same encoder and different decoders.
2. Start with a sentence b ∈ B (start with B because we want to train in the other direction).
3. Encode b and decode it into a ∈ A.
4. You now have a synthetic pair (a, b).
5. Train using the synthetic pairs (a loop is sketched below).
Creating pairs this way is called backtranslation: Improving Neural Machine Translation Models with Monolingual Data, http://www.aclweb.org/anthology/P16-1009
[0] Unsupervised Neural Machine Translation https://arxiv.org/pdf/1710.11041.pdf
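A sketch of one backtranslation round built from the steps above, with throwaway stand-ins for the encoder, decoder, and training step so the shape of the loop is clear; real models would replace all four callables.

```python
def backtranslation_round(sentences_B, encode, decode_to_A, train_step_A_to_B):
    """Translate B -> A with the current (imperfect) model, then train the
    A -> B direction on the resulting synthetic (a, b) pairs."""
    synthetic_pairs = []
    for b in sentences_B:
        h = encode(b)                  # shared encoder
        a = decode_to_A(h)             # current decoder for language A
        synthetic_pairs.append((a, b))
    for a, b in synthetic_pairs:
        train_step_A_to_B(source=a, target=b)   # supervised update on synthetic data
    return synthetic_pairs

# stand-in components so the sketch runs; they only mimic the data flow
encode = lambda s: s.upper()
decode_to_A = lambda h: "<A> " + h.lower()
train_step_A_to_B = lambda source, target: None
print(backtranslation_round(["hola mundo"], encode, decode_to_A, train_step_A_to_B))
```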
  • 47. Bridging meaning spaces with round-trip cost [0]
1. Start with sentence a ∈ A.
2. Backprop with autoencoder cost a → a.
3. Backprop with round-trip cost a → b → a.
4. Repeat with a sentence b ∈ B.
• For example, for a Yelp review, they automatically converted the positive-sentiment phrase "love the southwestern burger" to the negative-sentiment phrase "avoid the grease burger."
• Notice that the word "burger" is in both, as it should be. Maintaining semantic meaning can sometimes be a challenge with these systems.
• In their work [0], they add an additional constraint that nouns should be maintained.
[0] Improved Neural Text Attribute Transfer with Non-parallel Data https://arxiv.org/pdf/1711.09395.pdf
  • 48. Making the most of data with dual learning [0]
• Model from A to B: text A → encoder A → meaning space → decoder B → text B. Decoder B on its own serves as the language model for B.
• Model from B to A: text B → encoder B → meaning space → decoder A → text A. Decoder A on its own serves as the language model for A.
Policy gradient methods maximize
1. a communication reward, which is the fidelity of a round-trip translation of A back to A (and B back to B), and
2. a language model reward, which for A → B is how likely B is under the monolingual language model for B (and likewise for B → A).
[0] Achieving Human Parity on Automatic Chinese to English News Translation https://arxiv.org/pdf/1803.05567.pdf
  • 49. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Bridging meanings through shared space • Adversarial text generation • Learning through reinforcement learning • Conclusion
  • 50. Using a discriminator [0]
Specify a desired feature; the text is encoded toward N(0, 1), and the decoder generates text with that feature, guided by a learned discriminator on that feature.
Examples on movie reviews:
("negative", "past") the acting was also kind of hit or miss .
("positive", "past") his acting was impeccable
("negative", "present") the movie is very close to the show in plot and characters
("positive", "present") this is one of the better dance films
[0] Hu et al. allow one to manipulate single values of the meaning vector to change what is written https://arxiv.org/abs/1703.00955
  • 51. Generative Adversarial Networks (GANs)
• Generative Adversarial Networks (GANs) are built on deep learning.
• A GAN consists of a generator neural network and a discriminator neural network training together in an adversarial relationship: the generator produces synthetic data, the discriminator compares it with real data, and the discriminator's guidance improves the generator.
Goodfellow et al., Generative Adversarial Networks, https://arxiv.org/abs/1406.2661
  • 52. GAN for text generation
Text → encoder → meaning space → decoder → generated text, with a learned discriminator on the text. The discriminator could try to differentiate generated text from real text. [Note: the encoding could be a line in a dialog, as in [1].]
Problem: generating text in this way is not differentiable. Three solutions:
1. Soft selection (take a weighted average of words). (Hu does this.)
2. Run the discriminator on the continuous meaning space or the continuous decoder state, e.g., Professor Forcing [0].
3. Use reinforcement learning.
[0] Professor Forcing: A New Algorithm for Training Recurrent Networks https://arxiv.org/pdf/1610.09038.pdf
[1] Adversarial Learning for Neural Dialogue Generation https://arxiv.org/pdf/1701.06547.pdf
  • 53. Outline • Introduction • Sequence-to-sequence models: learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Bridging meanings through shared space • Adversarial text generation • Learning through reinforcement learning • Conclusion
  • 54. Reinforcement learning is a gradual stamping in of behavior
• Reinforcement learning (RL) was studied in animals by Thorndike [1898].
• It became the study of Behaviorism [Skinner, 1953] (see the "Skinner box" pictured; image by Andreas1, adapted from Image:Boite skinner.jpg, GFDL / CC-BY-SA-3.0, via Wikimedia Commons).
• It was formulated into artificial intelligence; see the leading book by Sutton and Barto, 1998 (new draft available online: http://incompleteideas.net/book/bookdraft2018jan1.pdf).
  • 55. Beginning with random exploration
In reinforcement learning, the agent begins by randomly exploring until it reaches its goal.
  • 56. Reaching the goal
• When it reaches the goal, credit is propagated back to its previous states.
• The agent learns the function Q^π(s, a), which gives the cumulative expected discounted reward of being in state s, taking action a, and acting according to policy π thereafter.
  • 57. Learning the behavior
Eventually, the agent learns the value of being in each state and taking each action and can therefore always do the best thing in each state.
  • 58. RL is now like supervised learning over actions
With the rise of deep learning, we can now work more effectively in high-dimensional and continuous domains. When domains have large action spaces, the kind of reinforcement learning that works best is policy gradient methods.
• D. Silver http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
• A. Karpathy http://karpathy.github.io/2016/05/31/rl/
Policy gradient methods learn a policy directly on top of a neural network: basically a softmax over actions on top of a neural network. Reinforcement learning starts to look more like supervised learning, where the labels are actions. Instead of doing gradient descent on the error function as in supervised learning, you are doing gradient ascent on the expected reward (see the sketch below).
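A bare-bones REINFORCE sketch for a softmax-over-words policy: sample a word, then take a gradient-ascent step weighted by the reward. The sizes, state vector, and reward value are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, VOCAB = 32, 1_000
W = rng.normal(scale=0.1, size=(DIM, VOCAB))   # policy parameters: softmax over words

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_update(state, reward, lr=0.01):
    """One REINFORCE step: sample a word from the softmax policy, then nudge the
    parameters in the direction grad log pi(word|state) * reward."""
    p = softmax(state @ W)
    word = rng.choice(VOCAB, p=p)
    grad_logits = -p                            # d log pi / d logits = onehot(word) - p
    grad_logits[word] += 1.0
    W[:] += lr * reward * np.outer(state, grad_logits)   # gradient *ascent* on expected reward
    return word

state = rng.normal(size=DIM)                    # e.g. the decoder's hidden state
print(reinforce_update(state, reward=1.0))      # the sampled word id
```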
  • 59. Searching forward to get this reward
You have to search forward to get the reward. Since we are not doing teacher forcing, where we know what the next word should be, we do a forward search: if we were to write "comió" and follow our current policy after that, how good would the final reward be? That "how good" is the estimate we use to determine whether writing "comió" should be the action this network takes in the future [see Lantao et al., https://arxiv.org/pdf/1609.05473v3.pdf]. This is called Monte Carlo search (sketched below).
This is similar to how game-playing AI is done, for example the recent success in Go [see Silver et al., https://arxiv.org/pdf/1712.01815.pdf].
In the language domain, this kind of training can be very slow: there are 50,000 choices at every step.
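A sketch of the Monte Carlo rollout estimate described above: finish the sentence several times with the current policy and average the final rewards. The toy vocabulary, rollout policy, and reward function are placeholders for the real decoder and discriminator or task reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_action_value(prefix, candidate_word, rollout_policy, reward_fn, n_rollouts=16):
    """Monte Carlo search: append the candidate word, complete the sentence several
    times with the current policy, and average the rewards of the completions."""
    returns = []
    for _ in range(n_rollouts):
        sentence = prefix + [candidate_word]
        while sentence[-1] != "<eos>" and len(sentence) < 20:
            sentence.append(rollout_policy(sentence))    # follow the current policy
        returns.append(reward_fn(sentence))
    return float(np.mean(returns))

# stand-ins: a random rollout policy and a reward that prefers short sentences
vocab = ["la", "mujer", "comió", "tacos", ".", "<eos>"]
rollout_policy = lambda sent: vocab[rng.integers(len(vocab))]
reward_fn = lambda sent: 1.0 / len(sent)
print(round(estimate_action_value(["la", "mujer"], "comió", rollout_policy, reward_fn), 3))
```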
  • 60. Outline • Introduction • Learning from pairs of sentences • Mapping to and from meaning space • Moving beyond pairs of sentences • Conclusion
  • 61. Conclusion: How to get started?
• GPT-2: https://gpt2.apps.allenai.org/?text=Joel%20is and https://github.com/huggingface/pytorch-pretrained-BERT
• If you have pairs: OpenNMT http://opennmt.net/
• Texar (built on TensorFlow): https://github.com/asyml/texar
  • 62. Conclusion: when should you use current neural text generation?
• If you can enumerate all of the patterns for the things you want the computer to understand and say, use context-free grammars for understanding and templates for language generation.
• If you can't enumerate what you want the computer to say and you have a lot of data, neural text generation might be the way to go. Machine translation is an excellent example; so is converting text to different styles. The text can't be long, and it won't be exact.
• Machines will be illiterate until we can teach them common sense. But as algorithms get better at controlling and navigating the meaning space, neural text generation has the potential to be transformative.
• Few things are more enjoyable than reading a story by your favorite author, and if we could have all of our information given to us with a style tailored to our tastes, the world could be enchanting.
  • 63. Thank You! Questions? 6500 River Place Blvd., Bldg. 3, Suite 120, Austin, TX 78730 @DeUmbraAI