38. References
• [Allen-Zhu et al., 19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via
over-parameterization. Proceedings of the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pages 242–252. PMLR, 2019.
• [Alsentzer et al., 19] Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., and McDermott, M.
(2019). Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language
Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
• [Arora et al., 18] Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep
nets via a compression approach. Proceedings of the 35th International Conference on Machine Learning, volume
80 of Proceedings of Machine Learning Research, pages 254–263, PMLR.
• [Banerjee et al., 20] Banerjee, A., Chen, T., and Zhou, Y. (2020). De-randomized PAC-Bayes margin bounds: Applications to non-convex and non-smooth predictors. arXiv preprint arXiv:2002.09956.
• [Belkin et al., 18] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.
39. References
• [Cybenko, 89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
• [Devlin et al., 18] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
• [Du et al., 19] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes
over-parameterized neural networks. In International Conference on Learning Representations, 2019.
• [Eldan and Shamir, 16] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In 29th
Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 907–940.
PMLR, 2016.
• [Gilmer et al., 17] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural
message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning,
volume 70, pages 1263–1272. PMLR, 2017.
40. References
• [Hardt et al., 16] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic
gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of
Proceedings of Machine Learning Research, pages 1225–1234. PMLR, 2016.
• [Jacot et al., 18] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. In Advances in Neural Information Processing Systems 31, pages 8571–8580.
Curran Associates, Inc., 2018.
• [Jin et al., 18] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for
molecular graph generation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International
Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2323–2332.
PMLR, 2018.
• [Jin et al., 19] Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecule optimization. In International Conference on Learning Representations, 2019.
• [Kipf and Welling, 17] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In International Conference on Learning Representations, 2017.
41. References
• [Lan et al., 20] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
• [LeCun et al., 98] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324.
• [Li et al., 18] Li, Q., Han, Z., and Wu, X.-M. (2018). Deeper insights into graph convolutional networks for semi-
supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
• [Li et al., 18] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6389–6399. Curran Associates, Inc., 2018.
• [Liu et al., 18] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational
autoencoders for molecule design. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R.
Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7795–7804. Curran Associates,
Inc., 2018.
42. References
• [Qi et al., 19] Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, and Jiebo Luo. Attentive relational networks for mapping images to scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3957–3966, 2019.
• [Nagarajan and Kolter, 19] Nagarajan, V. and Kolter, J. Z. (2019). Uniform convergence may be unable to explain
generalization in deep learning. In Advances in Neural Information Processing Systems 32, pages 11615–11626.
Curran Associates, Inc.
• [Nakkiran et al., 20] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2020). Deep double
descent: Where bigger models and more data hurt. In International Conference on Learning Representations.
• [Oono and Suzuki, 20] Oono, K. and Suzuki, T. (2020). Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations.
• [Shchur et al., 18] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
43. References
• [Sonoda and Murata, 17] Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is
universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
• [Suzuki et al., 2020] Suzuki, T., Abe, H., and Nishimura, T. (2020). Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations.
• [Telgarsky, 16] Matus Telgarsky. Benefits of depth in neural networks. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539. PMLR, 2016.
• [Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and
Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5998–
6008. Curran Associates, Inc.
• [Wang et al., 19] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution
for point cloud semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2019.
44. References
• [Wesley et al., 2020] Maddox, W. J., Benton, G., and Wilson, A. G. (2020). Rethinking parameter counting in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139.
• [Wu et al., 18] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
• [Xu et al., 17] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative
message passing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
• [Yang et al., 18] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph
generation. In The European Conference on Computer Vision (ECCV), September 2018.
• [Yang et al., 19] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Are learned molecular representations ready for prime time? arXiv preprint arXiv:1904.01561, 2019.
45. References
• [Zhang and Meng, 2019] Zhang, J. and Meng, L. (2019). GResNet: Graph residuals for reviving deep graph neural nets from suspended animation. arXiv preprint arXiv:1909.05729.
• [Zhao and Akoglu, 2020] Zhao, L. and Akoglu, L. (2020). PairNorm: Tackling oversmoothing in GNNs. In International Conference on Learning Representations.
• [Zitnik et al., 18] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with
graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.