A characteristic of information processing performed by humans is that it consists of both System 1, which performs fast automatic processing, and System 2, which performs slow conscious processing. In this lecture, I will introduce computational models for A) natural language understanding/generation and B) understanding the state of mind from the perspective of Systems 1 and 2, providing an opportunity to think about what intelligence is.
(The lecture is given in Japanese, but most slides are written in English.)
Theory of Mind and Language Processing, Fast and Slow
1. Theory of Mind and Language
Processing, Fast and Slow
Oka Natsuki
Faculty of Information and Human Sciences
Kyoto Institute of Technology
Cognitive Interaction Design
August 3, 2020
1
2. Summary
A characteristic of information processing performed by humans is that it consists of both System 1, which performs fast automatic processing, and System 2, which performs slow conscious processing. In this lecture, I will introduce computational models for A) natural language understanding/generation and B) understanding the state of mind from the perspective of Systems 1 and 2, providing an opportunity to think about what intelligence is.
(The lecture is given in Japanese, but most slides are written in English.)
2
4. Context-Free Grammar
Syntax tree
[Figure: two alternative syntax trees for the phrase "theory of mind and language processing", combining N, PREP, and CONJ leaves into NP and PP nodes.]
NP → NP PP
NP → N
NP → NP CONJ NP
PP → PREP NP
・・・
4
7. System 1 operates automatically and quickly, with little or no effort and no sense of voluntary control. System 2 allocates attention to the effortful mental activities that demand it, including complex computations. The operations of System 2 are often associated with the subjective experience of agency, choice, and concentration.
7
9. Report Assignment
Deadline: August 17; submit a report of about 1000 words in PDF format.
Answer the following questions 1 and 2 by choosing either A or B, where A is "Understanding of the state of mind" and B is "Natural language understanding/generation". You may choose both A and B.
1. In many cases, both System 1 and System 2 are likely to be involved in the execution of A and B. List specific situations in which both systems are likely to be involved, and describe your hypothesis in as much detail as possible about what each system does and how they interact.
2. Describe, as specifically and in as much detail as possible, how to check whether the hypothesis is correct.
9
11. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
11
14. Context-free grammar
S → NP VP
NP → D N|NP PP
VP → V NP|V NP PP
PP → P NP
D → a
N → boy|girl|telescope
V → saw
P → with
S: sentence, NP: noun phrase, VP: verb phrase, D: determiner, N: noun, PP: prepositional phrase, V: verb, P: preposition
The vertical bar means OR.
14
15.–39. Top-down parser
S → NP VP
NP → D N|NP PP
VP → V NP|V NP PP
PP → P NP
D → a
N → boy|girl|telescope
V → saw
P → with
[Slides 15–39 animate a top-down parse of "A boy saw a girl with a telescope": starting from S, the parser expands S → NP VP, matches "A boy" as D N, expands the VP, and backtracks through the alternative NP and VP rules until the whole input is consumed.]
40. [Figure: the two complete parse trees, showing the PP-attachment ambiguity.
PP attached inside the object NP, [S [NP A boy] [VP [V saw] [NP [NP a girl] [PP with a telescope]]]], means "The boy saw a girl who had a telescope."
PP attached to the VP, [S [NP A boy] [VP [V saw] [NP a girl] [PP with a telescope]]], means "The boy saw a girl by using a telescope."]
40
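The parse walked through on the preceding slides can be sketched as a backtracking top-down parser in plain Python. One flagged deviation: the left-recursive rule NP → NP PP would make naive recursive descent loop forever, so the sketch applies it iteratively (parse D N first, then attach zero or more PPs).

```python
# Backtracking top-down parser for the slide grammar (a sketch).
LEXICON = {"a": "D", "boy": "N", "girl": "N", "telescope": "N",
           "saw": "V", "with": "P"}

def parse_word(cat, words, i):
    """Match one terminal of category cat at position i."""
    if i < len(words) and LEXICON.get(words[i].lower()) == cat:
        return [((cat, words[i]), i + 1)]
    return []

def parse_np(words, i):
    results = []
    for d, j in parse_word("D", words, i):
        for n, k in parse_word("N", words, j):
            results.append((("NP", d, n), k))            # NP -> D N
            stack = [results[-1]]
            while stack:                                 # NP -> NP PP, iteratively
                np, m = stack.pop()
                for pp, m2 in parse_pp(words, m):
                    results.append((("NP", np, pp), m2))
                    stack.append(results[-1])
    return results

def parse_pp(words, i):                                  # PP -> P NP
    return [(("PP", p, np), k)
            for p, j in parse_word("P", words, i)
            for np, k in parse_np(words, j)]

def parse_vp(words, i):
    results = []
    for v, j in parse_word("V", words, i):
        for np, k in parse_np(words, j):
            results.append((("VP", v, np), k))           # VP -> V NP
            for pp, m in parse_pp(words, k):
                results.append((("VP", v, np, pp), m))   # VP -> V NP PP
    return results

def parse_s(words):                                      # S -> NP VP, full span only
    return [("S", np, vp)
            for np, j in parse_np(words, 0)
            for vp, k in parse_vp(words, j)
            if k == len(words)]

trees = parse_s("A boy saw a girl with a telescope".split())
print(len(trees))  # 2: the PP-attachment ambiguity shown on slide 40
```

Enumerating all complete parses makes the ambiguity explicit: one tree attaches the PP inside the object NP, the other attaches it to the VP.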
41. Parsing with grammar rules
• A finite number of rules can generate/parse trees of unbounded depth.
• You can generate/parse even an unknown language by following its rules.
• Effortful.
→ These are characteristics of processing by System 2.
41
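The first point above, a finite rule set licensing unboundedly deep trees, can be illustrated with a tiny random generator (a sketch: the depth cap only stops the random derivation from recursing forever, it is not a property of the grammar itself).

```python
import random

# The finite rule set from the slides, written as a dictionary.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["NP", "PP"]],     # the recursive rule
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "D":  [["a"]],
    "N":  [["boy"], ["girl"], ["telescope"]],
    "V":  [["saw"]],
    "P":  [["with"]],
}

def generate(symbol="S", depth=0, max_depth=6):
    if symbol not in RULES:               # terminal word
        return [symbol]
    options = RULES[symbol]
    if depth >= max_depth:                # force the non-recursive option
        options = [options[0]]
    rhs = random.choice(options)
    return [w for sym in rhs for w in generate(sym, depth + 1, max_depth)]

random.seed(0)
print(" ".join(generate()))
```

Raising `max_depth` lets the same eight rules emit arbitrarily deeply nested NPs, which is exactly the point the slide makes about System 2-style rule following.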
43. Semantic constraints
A boy saw a girl with a telescope.
A boy saw a girl with a book.
You cannot see with a book.
But if you roll it up, you can peek through it.
Writing down every such constraint is infeasible.
→ This was the wall that good old-fashioned AI ran into.
43
44. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
44
50. Long Short-Term Memory
[Figure: LSTM diagram highlighting the cell state and the recurrent information path.]
Yu+, A review of recurrent neural networks, Neural Computation 31(7), 1235-1270, 2019.
50
51. original LSTM
[Figure: the original LSTM cell, showing the input, the output, the cell state, and the recurrent information.]
51
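The gating structure in the figure can be sketched as a single-unit LSTM step (a toy sketch: the weight values below are arbitrary made-up scalars, not trained parameters).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate sees the current input x and the recurrent information h_prev.
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g    # cell state: the long-term memory path
    h = o * math.tanh(c)      # recurrent information passed to the next step
    return h, c

# Arbitrary toy weights, just to run the recurrence.
weights = {k: 0.5 for k in
           ["wf", "uf", "bf", "wi", "ui", "bi",
            "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:
    h, c = lstm_step(x, h, c, weights)
print(round(h, 3))
```

The additive update of the cell state `c` is what lets gradients flow over long spans, which is the property the review figure is illustrating.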
52. [Vinyals+, Show and Tell: A Neural Image Caption Generator, 2015]
Caption generation
[Figure: an image is encoded by the final hidden layer of a CNN pretrained for image classification; the CNN is kept fixed to prevent overfitting, since captioned data are scarce. Starting from a start symbol, an LSTM then outputs a probability distribution over the next word, with one-hot vectors mapped to 512-dimensional word embeddings. Training maximizes the sum of the log probabilities. Decoding uses a beam search of size 20 rather than sampling one word from the distribution.]
53. sequence-to-sequence learning framework with attention
stacked LSTM, residual connections, bidirectional LSTM, sub-word units, …
[Figure: translating the Chinese sentence 知识就是力量 into "Knowledge is power". A vector represents the meaning of all the words read so far; the decoder outputs one word at a time, shifting where it attends as it goes.]
Y. Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, arXiv 2016
56. Transformer
The Batch: How did the idea of self-attention evolve?
Shazeer: I'd been working with LSTMs, the state-of-the-art language architecture before transformer. There were several frustrating things about them, especially computational problems. Arithmetic is cheap and moving data is expensive on today's hardware. If you multiply an activation vector by a weight matrix, you spend 99 percent of the time reading the weight matrix from memory. You need to process a whole lot of examples simultaneously to make that worthwhile. Filling up memory with all those activations limits the size of your model and the length of the sequences you can process. Transformers can solve those problems because you process the entire sequence simultaneously. I heard a few of my colleagues in the hallway saying, "Let's replace LSTMs with attention." I said, "Heck yeah!"
THE BATCH, June 17, 2020
56
57. Attention Is All You Need
arXiv:1706.03762
Figure 1: The Transformer - model architecture.
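The Transformer's core operation, scaled dot-product self-attention, can be sketched from scratch (single head and, as a simplifying assumption, no learned query/key/value projections: the input vectors serve directly as queries, keys, and values).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """X is a list of token vectors; every position attends to all positions."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)   # attention distribution over positions
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
Y = self_attention(X)
```

Because every position is computed independently of the others, the whole sequence can be processed in parallel, which is exactly the advantage over LSTMs described in the interview on slide 56.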
64. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
64
https://arxiv.org/abs/1810.04805
Task #1: Masked LM
Task #2: Next Sentence Prediction
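Task #1's data preparation can be sketched as follows, following the masking recipe described in the BERT paper: 15% of tokens are selected as prediction targets, of which 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. The tiny vocabulary here is a made-up stand-in for WordPiece.

```python
import random

VOCAB = ["the", "boy", "saw", "girl", "telescope", "with", "a"]

def mask_tokens(tokens, rng):
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < 0.15:        # selected as a prediction target
            targets.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")        # 80%: mask it
            elif r < 0.9:
                masked.append(rng.choice(VOCAB))  # 10%: random token
            else:
                masked.append(tok)             # 10%: keep, still predicted
        else:
            targets.append(None)       # not a prediction target
            masked.append(tok)
    return masked, targets

rng = random.Random(0)
masked, targets = mask_tokens("a boy saw a girl with a telescope".split(), rng)
```

The model is then trained to recover the original token at every position where `targets` is not None, without ever seeing the true token at the masked positions.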
65. BERT Rediscovers the Classical NLP Pipeline (ACL 2019)
Ian Tenney, Dipanjan Das, Ellie Pavlick
Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.
https://arxiv.org/abs/1905.05950
65
66. Summary statistics on BERT-large
66
part-of-speech (POS), constituents (Consts.), dependencies (Deps.), entities,
semantic role labeling (SRL), coreference (Coref.), semantic proto-roles (SPR;
Reisinger et al., 2015), and relation classification (SemEval).
https://arxiv.org/abs/1905.05950
67. Theoretical studies
• Merrill (2019) showed that—in the finite
precision setting—LSTMs recognize a subset of
the counter languages, whereas GRUs and
simple RNNs recognize regular languages.
• Korsky and Berwick (2019) showed that
arbitrary-precision RNNs can emulate
pushdown automata, and can therefore
recognize all deterministic context-free
languages.
67
Hahn, Theoretical Limitations of Self-Attention in Neural Sequence Models, Transactions of the
Association for Computational Linguistics 2020 Vol. 8, 156-171.
68. Theoretical studies
• Siegelmann and Sontag (1995) showed that, given unlimited computation time, recurrent networks can emulate the computation of Turing machines.
• Pérez et al. (2019) have shown the same result for both (argmax-attention) Transformers and Neural GPUs.
68
Hahn, Theoretical Limitations of Self-Attention in Neural Sequence Models, Transactions of the
Association for Computational Linguistics 2020 Vol. 8, 156-171.
69. Theoretical Limitations of Self-Attention in Neural Sequence Models
Hahn, Transactions of the Association for Computational Linguistics 2020 Vol. 8, 156-171
• Self-attention cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length.
• This stands in contrast to the practical success of self-attention.
→ Natural language can be approximated well with models that are too weak for the formal languages typically assumed in theoretical linguistics.
69
70. Summary so far
• RNN, LSTM, Transformer, and BERT capture syntactic information fairly accurately, by a method different from parsing with explicit grammar rules. With finite computational resources, however, they cannot handle infinitely deep recursive structure.
• Language processing by the human System 1 seems to have the same property (my view).
• Humans can also handle infinitely deep recursive structure via System 2. But what matters about System 2 in everyday language processing is probably not that; rather, it is things like one-shot learning and the points shown on the next page (my view).
70
73. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration (not exhaustive; only three examples)
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
73
74. Building End-To-End Dialogue Systems Using
Generative Hierarchical Neural Network Models
74
Published in AAAI 2016 (Special Track on Cognitive Systems)
76. Systematicity in a Recurrent Neural Network by Factorizing Syntax and Semantics
Standard methods in deep learning fail to capture compositional or systematic structure in their training data, as shown by their inability to generalize outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. …
https://cognitivesciencesociety.org/cogsci20/papers/0027/index.html
76
77. Topic of the second half: Theory of mind
Similarities to and differences from natural language processing
• Theory of mind also handles recursive structure ("you probably think that I don't know that you know that I like you, but...") (similar)
• System 1 and System 2 are probably running in parallel (similar)
• For theory of mind, System 2 explanation and conviction seem likely to occur in everyday life, whereas the only people who want to explain the syntax and grammar of natural language are grammarians and teachers (different)
77
78. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
78
82. Implicit or spontaneous ToM
• The Social Sense: Susceptibility to Others’ Beliefs in Human Infants and
Adults
– https://science.sciencemag.org/content/sci/330/6012/1830.full.pdf
– https://science.sciencemag.org/content/suppl/2010/12/20/330.6012.1830.DC1
• Do 18-Month-Olds Really Attribute Mental States to Others?: A Critical
Test
– http://brainmind.umin.jp/PDF/wt12/Senju2011PsycholSci.pdf
– https://journals.sagepub.com/doi/suppl/10.1177/0956797611411584
• Brain activation for spontaneous and explicit false belief tasks overlaps:
new fMRI evidence on belief processing and violation of expectation
– https://academic.oup.com/scan/article/12/3/391/2593935
• Measuring spontaneous mentalizing with a ball detection task: putting
the attention-check hypothesis by Phillips and colleagues (2015) to the
test
– https://link.springer.com/article/10.1007/s00426-019-01181-7
82
83. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
83
84. Baker, C., Jara-Ettinger, J., Saxe, R., & Tenenbaum, J. B. (2017). Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1, 0064. DOI: 10.1038/s41562-017-0064
mentalize: to understand the behavior of others as a product of their mental state
84
87. Outline
Introducing computational models (in a broad sense)
1. Natural Language Processing
– Computational model of System 2
• Top-down parser
– Computational models of System 1
• RNN, LSTM, Transformer, BERT
– Integration
2. Theory of Mind
– Computational model of System 2
• “Rational quantitative attribution of beliefs, desires and percepts
in human mentalizing”
– Computational model of System 1
• “Machine Theory of Mind”
– Integration
87
88. Machine Theory of Mind
Neil C. Rabinowitz, Frank Perbet, H. Francis
Song, Chiyuan Zhang, S.M. Ali Eslami, Matthew
Botvinick
arXiv:1802.07740v2
88
89. [Figure: ToMnet outputs. A character embedding and a mental state embedding feed three prediction heads: next-step action probabilities, probabilities of whether certain objects will be consumed, and predicted successor representations.]
89