Kaggle Participation Report: Quora Insincere Questions Classification
(4th place solution)
Kazuki Fujikawa
AI Research and Development Group 3, AI System Department
DeNA Co., Ltd.

Agenda
• Competition overview
• Walkthrough of the main kernel (Pytorch starter by Heng Zheng)
• My solution
• Top 3 solutions

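Competition overview (problem setting)
• Task: classify whether a question posted on Quora is inappropriate (insincere) or not (binary classification)
  • Questions containing obscene expressions (Bag-of-words can probably identify these to some extent)
    • Which races have the smallest penis?
    • Would you date a girl who had sex with a horse or a bull?
  • Discriminatory or aggressive questions (hard to judge at the word level)
    • Do all women tell lies?
    • Why do Indians not gives a shit on Pakistan?
  • Questions not grounded in fact (hard to judge at the word level)
    • Why did an 0bama fan kill ovr 50 people in Las Vegas?
    • Why is the US so good at brainwashing?
• Imbalanced classification problem with few positive examples
  • negative : positive ≈ 15 : 1
• Evaluation metric: F1 score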

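Competition overview (constraints)
• Kernel only competition
  • Submit a single Kernel that runs all processing (data loading, training, and prediction) within 2h
  • Two-stage format: after the deadline the test set is replaced, and the submission is re-run and re-scored on it
• No internet access
  • Downloading external data sources for pretraining, using pretrained models, etc. is not allowed
  • Only four pretrained word embeddings are whitelisted for use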

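Key points of the competition
• Choosing models that balance accuracy and computational cost
  • A few large models competing on single-model performance vs. many small models competing via ensembling
• How to reduce computational cost
  • How efficiently preprocessing and training-time processing can be done
• How to use the provided pretrained word embeddings, and how to handle words they do not contain
  • How to bridge differences at the level of spelling variants, and domain gaps
• A submission strategy that does not rely solely on the Public LB score
  • The public test set is much smaller than the private test set, so its score is not reliable (public: 56k rows vs private: 376k rows)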

Walkthrough of the main kernel (Pytorch starter by Heng Zheng)
[Diagram: overall pipeline]
• Raw texts → Normalize & Tokenize → Words
• Build word features: average of Glove (300d) and Paragram (300d) → Glove & Paragram (300d)
• Classification model: Word features (300d) → BiLSTM → BiGRU (2 * 60d) → concat of the following (4 * 2 * 60d) → 1 Layer MLP (480d) → Prediction (1d) → BCE Loss
  • 1st layer weighted sum with attention
  • 2nd layer weighted sum with attention
  • 2nd layer avg pooling
  • 2nd layer max pooling

① Normalize and tokenize the text
[Diagram: Raw texts → Normalize & Tokenize]
• Normalize
  • Lowercase (ex. Trump → trump)
  • Add spaces around punctuation (ex. pen. → pen .)
  • Replace digits with a special character (ex. 2019 → ####)
  • Resolve spelling variants with rules (ex. isn’t → is not)
  • Strip special characters with the Keras Tokenizer (ex. ####th → th)
• Tokenize
  • Split on spaces
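The individual normalization steps listed above can be sketched as small functions. This is a minimal illustration rather than the kernel's actual code: the one-entry rule table `RULES`, the punctuation set, and all function names are assumptions, and `FILTERS` is modeled on the default filter string of the Keras Tokenizer.

```python
import re

# Hypothetical, tiny rule table; a real kernel would carry many more entries.
RULES = {"isn't": "is not"}
# Characters treated as "special"; modeled on the Keras Tokenizer default filters.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def space_punct(text):
    """Put spaces around punctuation, then collapse repeated whitespace."""
    return " ".join(re.sub(r"([.,!?])", r" \1 ", text).split())


def mask_digits(text):
    """Replace every digit with '#'."""
    return re.sub(r"[0-9]", "#", text)


def apply_rules(text):
    """Resolve spelling variants with a lookup table."""
    for src, dst in RULES.items():
        text = text.replace(src, dst)
    return text


def strip_special(text):
    """Replace filtered characters with spaces, as a Tokenizer filter would."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return " ".join(text.translate(table).split())


def tokenize(text):
    """Split on spaces."""
    return text.split(" ")
```

Each function reproduces one of the examples on the slide, for instance `mask_digits("2019")` gives `"####"` and `strip_special("####th")` gives `"th"`.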

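② Build the word embedding according to the following policy
[Diagram: Raw texts → Normalize & Tokenize → Words → Build word features: average of Glove (300d) and Paragram (300d) → Glove & Paragram (300d)]
Policy by presence in the Glove / Paragram pretrained vectors and by frequency of the word in the Quora dataset:
• Present, high frequency: average of the Glove and Paragram vectors
• Absent, high frequency: random vector generated per word
• Low frequency: the word itself is removed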
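The policy above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: `build_embedding`, its parameters, the Gaussian initialization, and the choice to treat a word as "present" only when both vector sets contain it are all hypothetical.

```python
import random
from collections import Counter


def build_embedding(tokenized_texts, glove, paragram, max_words, dim, seed=0):
    """Return {word: vector} for the max_words most frequent words.

    Words outside the top max_words are dropped (the "word removed" case).
    """
    rng = random.Random(seed)
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    embedding = {}
    for word, _ in counts.most_common(max_words):
        if word in glove and word in paragram:
            # Present in the pretrained vectors: average Glove and Paragram.
            embedding[word] = [(g + p) / 2 for g, p in zip(glove[word], paragram[word])]
        else:
            # Absent: one random vector per word (assumed Gaussian).
            embedding[word] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return embedding
```

Because the resulting vectors are kept fixed during training, the random vectors for absent words act only as word identifiers.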

③ Build the classification model that takes the constructed word features as input
[Diagram: the same pipeline, highlighting the classification model: Word features (300d) → BiLSTM → BiGRU (2 * 60d) → attention and pooling outputs concatenated → 1 Layer MLP (480d) → Prediction (1d)]
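To make the 480d input of the MLP concrete, the four pooled vectors (two attention-weighted sums, one average pooling, one max pooling, each 2 * 60d) can be sketched in plain Python. This is only a shape-level illustration: in the real model the attention scores are learned, whereas here `scores1` and `scores2` are passed in as assumed inputs, and all function names are hypothetical.

```python
import math


def attention_pool(states, scores):
    """Softmax the per-step scores, then take the weighted sum of the states."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(states[0])
    return [sum(e / total * h[d] for e, h in zip(exps, states)) for d in range(dim)]


def avg_pool(states):
    """Average over time steps, per dimension."""
    return [sum(col) / len(states) for col in zip(*states)]


def max_pool(states):
    """Maximum over time steps, per dimension."""
    return [max(col) for col in zip(*states)]


def pooled_features(h1, h2, scores1, scores2):
    """Concatenate the four pooled vectors: 4 * (2 * 60d) = 480d for 120d states."""
    return (attention_pool(h1, scores1) + attention_pool(h2, scores2)
            + avg_pool(h2) + max_pool(h2))
```

Here `h1` and `h2` stand for the per-token outputs of the first (BiLSTM) and second (BiGRU) recurrent layers.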

④ Train the whole network to minimize the Binary Cross Entropy loss (the word features are kept fixed)
[Diagram: the same pipeline, now ending in Prediction (1d) → BCE Loss]
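For reference, the loss being minimized can be written out directly. The sketch below is a plain-Python version of mean binary cross entropy over predicted probabilities; the function name and the clipping constant `eps` are assumptions added for numerical safety.

```python
import math


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

A confident correct prediction gives a small loss, while a confident wrong one is penalized heavily, which matters with a roughly 15 : 1 class imbalance.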

⑤ Run 5-fold CV to build five models, and use the average of the five models' predictions as the final prediction
[Diagram: the dataset is split into folds 1 to 5; each model (model: 1, ...) outputs a prediction (prediction: 1, ...), with prediction ∈ ℝ]
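The fold-averaging step can be sketched as below. The `fit` callback is hypothetical: it is assumed to take training inputs and labels and return a function that maps one input to a predicted score. The plain (non-stratified) split is also an assumption.

```python
import random
from statistics import mean


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cv_average_predict(train_x, train_y, test_x, fit, n_folds=5, seed=0):
    """Train one model per fold and average their predictions on test_x."""
    fold_predictions = []
    for valid_idx in kfold_indices(len(train_x), n_folds, seed):
        held_out = set(valid_idx)
        keep = [i for i in range(len(train_x)) if i not in held_out]
        model = fit([train_x[i] for i in keep], [train_y[i] for i in keep])
        fold_predictions.append([model(x) for x in test_x])
    # Average across the fold models for each test example.
    return [mean(per_model) for per_model in zip(*fold_predictions)]
```

In practice the held-out fold would also be scored to monitor each model; that part is omitted here for brevity.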

⑥ Search for the optimal threshold on the training data and binarize the predictions
[Diagram: over dataset folds 1 to 5, prediction ∈ ℝ is converted to prediction ∈ {0, 1} by thresholding]
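A threshold search for F1 can be sketched as a simple grid scan. The grid of candidates (0.01 to 0.99) and the function names are assumptions; the slide does not state how the kernel scans thresholds.

```python
def f1_score(labels, preds):
    """F1 for binary 0/1 labels and predictions; 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, probs, candidates=None):
    """Return (threshold, f1) maximizing F1 over a grid of candidate thresholds."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        score = f1_score(labels, [1 if p > t else 0 for p in probs])
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1
```

With few positives, the best threshold is often well below 0.5, which is consistent with the 0.35 to 0.36 values reported by the top solutions.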

My Solution: overview
The source code is also public, so please take a look if you are interested!
https://github.com/k-fujikawa/Kaggle-Quora-Insincere-Questions-Classification

Agenda
• Competition overview
• Walkthrough of the main kernel (Pytorch starter by Heng Zheng)
• My solution
  • Minor fixes to Pytorch starter (LB: 0.691 → 0.705)
  • Word embedding fine-tuning (LB: 0.705 → 0.708)
  • Word- and sentence-level statistical features (LB: 0.708 → 0.708)
  • Word embedding sampling (LB: 0.708 → 0.710)
  • Model evaluation
• Top 3 solutions

Minor fixes to Pytorch starter (LB: 0.691 → 0.705)
[Diagram: the Pytorch starter pipeline: Raw texts → Normalize & Tokenize → Words → Build word features (average of Glove (300d) and Paragram (300d)) → Word features (300d) → BiLSTM → BiGRU (2 * 60d) → concat with attention → 1 Layer MLP (480d) → Prediction (1d) → BCE Loss]

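① Small fixes to preprocessing
• Minor fix to the rule-based replacements
  • before: I’d’ve → I would’ve
  • after: I’d’ve → I would have
• Swapped the order of punctuation spacing and the rule-based replacements
  • before: isn’t → isn ’ t
  • after: isn’t → is not
• Added spaces around the special character used for digits
  • before: 1990th → ####th
  • after: 1990th → #### th
• Disabled the filters processing of the Keras Tokenizer
  • before: #### th → th
  • after: #### th → #### th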
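The revised ordering can be sketched as a single function. This is an illustration only: the two-entry `RULES` table, the punctuation set, and the step that maps the typographic apostrophe to an ASCII one are assumptions, not details taken from the slides.

```python
import re

# Hypothetical subset of the rule table, already containing the corrected entry.
RULES = {"i'd've": "i would have", "isn't": "is not"}


def normalize(text):
    """Lowercase, apply rules first, then space punctuation and digit masks."""
    text = text.lower().replace("’", "'")  # assumed apostrophe normalization
    # 1) Rules run before punctuation spacing, so "isn't" is still intact here.
    for src, dst in RULES.items():
        text = text.replace(src, dst)
    # 2) Put spaces around punctuation.
    text = re.sub(r"([.,!?'])", r" \1 ", text)
    # 3) Mask digit runs with '#' and surround the mask with spaces.
    text = re.sub(r"[0-9]+", lambda m: " " + "#" * len(m.group()) + " ", text)
    # 4) No character filtering afterwards, so the "####" token survives.
    return text.split()
```

For example, `normalize("Isn't it 1990th?")` yields the tokens `is`, `not`, `it`, `####`, `th`, `?`.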

② Replaced the model
• BiLSTM + BiGRU (2 * 60d) → 2 Layer BiLSTM (2 * 128d)
• Four kinds of pooling → Max pooling → 2 Layer MLP with BN
[Diagram: Raw texts → Normalize & Tokenize → Words → Build word features (average of Glove (300d) and Paragram (300d)) → Word features (300d) → 2 Layer BiLSTM (2 * 128d) → 2nd layer max pooling (2 * 128d) → 2 Layer MLP with BN (480d) → Prediction (1d) → BCE Loss, with an Extra Features (6d) input also shown]

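③ Changed how the word embedding is built
[Diagram: Raw texts → Normalize & Tokenize → Words → Build word features: average of Glove (300d) and Paragram (300d) → Glove & Paragram (300d)]
Before, by presence in the Glove / Paragram pretrained vectors:
• Present: average of the Glove and Paragram vectors
• Absent: random vector
After, by presence in the pretrained vectors and by frequency (high / low):
• Present, high frequency: average of the Glove and Paragram vectors
• Present, low frequency: average of the Glove and Paragram vectors
• Absent, high frequency: random vector (generated per CV)
• Absent, low frequency: zero vector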
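The revised policy can be sketched as a lookup over the four cases. The names `build_embedding`, `pretrained_mean`, `min_count`, and `fold_seed` are hypothetical, and seeding the random vectors per fold is one possible reading of "generated per CV".

```python
import random


def build_embedding(vocab_counts, pretrained_mean, min_count, dim, fold_seed):
    """Return {word: vector} following the revised four-case policy."""
    rng = random.Random(fold_seed)
    embedding = {}
    for word, count in vocab_counts.items():
        if word in pretrained_mean:
            # Present: use the Glove/Paragram average regardless of frequency.
            embedding[word] = list(pretrained_mean[word])
        elif count >= min_count:
            # Absent but frequent: random vector, regenerated for each fold seed.
            embedding[word] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        else:
            # Absent and rare: zero vector.
            embedding[word] = [0.0] * dim
    return embedding
```

Compared with the starter, rare words that the pretrained vectors do cover (such as typos like 0bama) are no longer discarded.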

[Diagram: resulting pipeline: Raw texts → Normalize & Tokenize → Words → Build word features (average of Glove (300d) and Paragram (300d)) → Word features (300d) → 2 Layer BiLSTM (2 * 128d) → 2nd layer max pooling (2 * 128d) → 2 Layer MLP w/ BN (480d) → Prediction (1d) → BCE Loss]

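Examples of words in the dataset
By frequency in the dataset (high / low) and by whether Glove / Paragram (pretrained vectors) contain the word:
• Contained, high frequency
  • the, what (general words)
  • obama, trump (words that appear in Wikipedia)
• Contained, low frequency
  • compresses (inflected forms of general words)
  • 0bama, germnay (typos)
  • 2sinxcosx, cython (some technical terms)
• Not contained, high frequency
  • coinbase, gdpr, brexit (new words)
  • tensorflow, kubernetes (technical terms)
• Not contained, low frequency
  • germeny, fathar, f*ck (typos)
  • 5gfwdhf4rz (hash values)
  • ॡ (foreign languages)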

[Same word-example table, with a callout]
We want to make use of the information held by the pretrained vectors

[Same word-example table, with a callout]
The Quora dataset alone should be enough to learn these words to some extent

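Examples of word embedding update strategies
Columns: compute cost / word representation learning (Train) / word representation learning (Test) / example models
• Embedding only, unsupervised learning: ◎ / ○ / ○ / Word2Vec, FastText (unsupervised)
• Embedding only, semi-supervised learning: ○ / ◎ / ○ / Doc2Vec, FastText
• Embedding only, supervised learning: ◎ / ○? / ☓ / Doc2Vec, FastText (supervised)
• Whole NN: ☓ / ◎ / ○ / Masked LM + supervised learning
• Whole NN, supervised learning: ☓ (all epochs), ○ (last epoch) / ○? / ☓ / Embedding unfreeze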

[Same strategy table, with a callout]
Prioritizing compute cost, I chose the strategy of training only the embedding with unsupervised learning

[Same strategy table, with a callout]
I also tried using only FastText supervised, but it did not work well without further effort

[Same strategy table, with a callout]
For the model as well, I prioritized compute cost and chose Word2Vec (CBOW), without using char n-grams

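Preliminary experiment: Word embedding fine-tuning
• Built the following three models and, for each, checked the top 10 words judged most similar to a set of benchmark words
  • Glove only
    • Uses the provided pretrained model (Glove) as is
  • Word2Vec Scratch
    • Trains CBOW for 5 epochs from scratch, without a pretrained model
  • Word2Vec Fine-tuning
    • Trains CBOW for 5 epochs with the pretrained model (Glove) as the initial values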
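The "top 10 similar words" check can be sketched with cosine similarity over a word-to-vector dictionary. The function names and the dictionary representation are assumptions; any of the three models above would simply supply a different `vectors` mapping.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Return the topn (word, similarity) pairs closest to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

Comparing the neighbor lists of the same benchmark word across the three models gives a quick qualitative read on embedding quality.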

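Preliminary experiment results: Word embedding fine-tuning
• Quora high-frequency words & known to Glove
  • Glove: ○, Scratch: ○, Finetune: ○
  • Examples: obama (DF: 2838), cosx (DF: 98)
  • Any of Glove, Scratch, and Finetune looks fine

• Quora high-frequency words & unknown to Glove
  • Glove: ☓, Scratch: ○, Finetune: ○
  • Examples: coinbase (DF: 169), kubernetes (DF: 30)
  • Word vectors that Glove has no entry for appear to be learned well by Scratch and Finetune

• Quora low-frequency words & known to Glove
  • Glove: ○, Scratch: ☓, Finetune: ○
  • Examples: 0bama (DF: 5), compresses (DF: 2)
  • Hard to learn with Scratch, but Glove represents them well, and fine-tuning does not seem to hurt

• Quora low-frequency words & unknown to Glove
  • Glove: ☓, Scratch: △~☓, Finetune: △~☓
  • Examples: xgboost (DF: 8), germeny (DF: 2)
  • Sometimes better than Glove (zero vector), but the negative impact does not look negligible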

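Word embedding fine-tuning (LB: 0.705 → 0.708)
[Diagram: Raw texts → Normalize & Tokenize → Words → Build word features: average of Glove (300d) and Paragram (300d) → Glove & Paragram (300d) → Word2Vec fine-tuning → Finetuned (300d); Glove & Paragram (300d) and Finetuned (300d) are averaged → Glove & Paragram with Finetuned (300d)]
Changed how the word embedding is built
Before, by presence in the Glove / Paragram pretrained vectors and by frequency (high / low):
• Present, high frequency: average of the Glove and Paragram vectors
• Present, low frequency: average of the Glove and Paragram vectors
• Absent, high frequency: random vector
• Absent, low frequency: zero vector
After:
• Present, high frequency: average of the following vectors
  • Average of the Glove and Paragram vectors
  • Word2Vec Finetune
• Present, low frequency: average of the Glove and Paragram vectors
• Absent, high frequency: random vector generated per word
• Absent, low frequency: zero vector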

Word- and sentence-level statistical features (LB: 0.708 → 0.708)
[Diagram: Glove & Paragram with Finetuned (300d) is concatenated with Extra Features [is_unk, IDF] (2d) → Word features (302d) → 2 Layer BiLSTM (2 * 128d) → 2nd layer max pooling (2 * 128d) → concat with Extra Features (6d) → 2 Layer MLP w/ BN (480d) → Prediction (1d) → BCE Loss]
Added word- and sentence-level statistical features
• Computed the following features and concatenated them to the word and sentence feature vectors
  • Word features: whether the word is unknown to Glove, IDF
  • Sentence features: features discussed in Kernels, such as the number of characters and the number of words
• There was a slight improvement locally, but no change on the Public LB
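The two word-level features and two of the sentence-level features can be sketched as follows. The IDF variant used here, log(N / DF), and the function names are assumptions; the slides do not specify the exact formula or the full list of six sentence features.

```python
import math
from collections import Counter


def word_stats(tokenized_texts, known_words):
    """Return {word: [is_unk, idf]} computed over the given tokenized texts."""
    n_docs = len(tokenized_texts)
    # Document frequency: in how many texts each word appears at least once.
    doc_freq = Counter(word for tokens in tokenized_texts for word in set(tokens))
    return {
        word: [0.0 if word in known_words else 1.0, math.log(n_docs / df)]
        for word, df in doc_freq.items()
    }


def sentence_stats(text):
    """Two simple sentence-level features: character count and word count."""
    return [len(text), len(text.split())]
```

The per-word pair would be appended to each 300d word vector (giving 302d), while the sentence features are concatenated after pooling.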

Word Embedding sampling (LB: 0.708 → 0.710)
[Diagram: Glove & Paragram (300d) and Finetuned (300d, from Word2Vec fine-tuning) → concat → Glove & Paragram with Finetuned (600d) → sampling per CV → Glove & Paragram with Finetuned (400d); concatenated with Extra Features [is_unk, IDF] (2d) → Word features (labeled 302d on the slide) → 2 Layer BiLSTM (2 * 128d) → 2nd layer max pooling (2 * 128d) → concat with Extra Features (6d) → 1 Layer MLP (480d) → Prediction (1d) → BCE Loss]
Changed how the word embedding is built
• Averaging the vectors before and after fine-tuning still felt somewhat questionable
  • Especially for words that are high-frequency in Quora but missing from the pretrained vectors
• On the other hand, using the averaged vectors scored better in experiments than using only the fine-tuned vectors
• Changed to sampling 400 dimensions per CV from the 600-dimensional vector obtained by concatenating the word embeddings before and after fine-tuning
  • Handling all 600 dimensions was too expensive in computation time
  • The aim is to increase the diversity among models and improve performance after ensembling
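The dimension sampling can be sketched as below. The names `sample_dims` and `fold_embedding` are hypothetical, and drawing one fixed set of dimensions per fold seed is one possible reading of "sampling per CV".

```python
import random


def sample_dims(total_dim=600, keep=400, fold_seed=0):
    """Pick `keep` distinct dimension indices out of `total_dim` for one fold."""
    return sorted(random.Random(fold_seed).sample(range(total_dim), keep))


def fold_embedding(pretrained, finetuned, keep, fold_seed):
    """Concatenate both vectors per word, then keep only the sampled dimensions."""
    words = list(pretrained)
    total_dim = len(pretrained[words[0]]) + len(finetuned[words[0]])
    dims = sample_dims(total_dim, keep, fold_seed)
    return {
        word: [(pretrained[word] + finetuned[word])[d] for d in dims]
        for word in words
    }
```

Because every word in a fold shares the same sampled dimensions, each fold model sees a consistent but different 400d view of the 600d space, which is what creates diversity for the ensemble.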

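Model evaluation method (background)
• The score after ensembling cannot be measured by Local CV, so evaluation on data outside the CV was needed
  • I suspected that Embedding sampling might score higher after ensembling even if its Local CV score is lower
• The Public LB score is based on few rows and fluctuates greatly just from changing the random seed, so another metric was needed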

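Model evaluation method
• Ran Holdout & CV multiple times with different random seeds, and evaluated by mean and standard deviation
• Selected the experimental setting with the highest mean score, then ran only CV with the same setting for the submission
[Diagram: the labeled train set is shuffled with Seed: 0 and split into Train 0 (used for CV) and Test 0; the mean and standard deviation are taken over Test 0 ~ 4]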
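The repeated Holdout & CV protocol can be sketched as follows. The callback `run_cv_and_score` is hypothetical: it is assumed to run CV on the train indices, ensemble the fold models, and return the F1 score on the held-out indices. The 20% holdout ratio is also an assumption, as the slides do not state the split size.

```python
import random
from statistics import mean, stdev


def evaluate_setting(n_samples, run_cv_and_score, seeds=range(5), holdout_ratio=0.2):
    """Repeat shuffle -> holdout split -> CV for each seed; return (mean, std)."""
    scores = []
    for seed in seeds:
        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)
        n_test = int(n_samples * holdout_ratio)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        # The callback trains with CV on train_idx and scores the ensemble on test_idx.
        scores.append(run_cv_and_score(train_idx, test_idx, seed))
    return mean(scores), stdev(scores)
```

Comparing settings by the mean, with the standard deviation as a noise estimate, avoids over-reading a single Public LB score.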

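1st Place Solution (The Zoo)
• Model
  • Embedding (300) → BiLSTM (128) → 1DCNN (64) → MLP (128)
• Embeddings
  • 0.7 * Glove + 0.3 * Paragram (weights chosen by CV)
• Ensembling and threshold selection
  • 10-fold CV, Rank Average
  • For each fold, measured the F1 at each threshold, and selected the threshold whose minimum F1 across folds is the largest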
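The threshold rule and rank averaging described above can be sketched like this. Both functions are illustrative readings of the slide, not the winning team's code: the input layout `{threshold: [F1 per fold]}`, the scaling of ranks to [0, 1], and the lack of tie handling are assumptions.

```python
def maximin_threshold(fold_f1_by_threshold):
    """Pick the threshold whose worst per-fold F1 is the highest."""
    return max(fold_f1_by_threshold, key=lambda t: min(fold_f1_by_threshold[t]))


def rank_average(prediction_lists):
    """Average the rank of each example across models (ranks scaled to [0, 1])."""
    def scaled_ranks(preds):
        order = sorted(range(len(preds)), key=lambda i: preds[i])
        ranks = [0.0] * len(preds)
        for rank, index in enumerate(order):
            ranks[index] = rank / max(len(preds) - 1, 1)
        return ranks

    ranked = [scaled_ranks(preds) for preds in prediction_lists]
    return [sum(column) / len(column) for column in zip(*ranked)]
```

Rank averaging makes models with differently calibrated outputs comparable, and the max-min rule favors a threshold that is robust across folds over one that is best on average.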

2nd Place Solution (takapt)
• Model
  • Embedding (668) → BiGRU (128) → MLP (64)
• Embeddings
  • Concat (668) of Glove (300), WikiNews (300), FastText scratch (64), and word features (4)
  • Word features: [all upper chars?, first char upper?, only first char upper?, OOV?]
• Statistical Features
  • Number of words, number of unique words, number of characters, number of uppercase characters, Bag of characters
• 6-fold CV, threshold fixed by Local CV (0.36)
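The four word-level flags listed above can be sketched as a small function. This is a guess at reasonable definitions, not the 2nd place team's code: the function name, the lowercase lookup for the OOV check, and the exact meaning of "only first char upper" are assumptions.

```python
def word_case_features(word, vocab):
    """Return [all upper?, first char upper?, only first char upper?, OOV?] as floats."""
    all_upper = word.isupper()
    first_upper = word[:1].isupper()
    only_first_upper = first_upper and not any(c.isupper() for c in word[1:])
    oov = word.lower() not in vocab  # assumed: vocabulary holds lowercased words
    return [float(all_upper), float(first_upper), float(only_first_upper), float(oov)]
```

Such flags let the model keep casing information even when the embedding lookup itself is case-insensitive.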

3rd Place Solution (Guanshuo Xu)
• Model
  • Embedding (600) → BiGRU (256) → BiGRU (128) → Linear (64)
• Embeddings
  • Model1: concat of Glove (300) and FastText (300)
  • Model2: concat of Glove (300) and Paragram (300)
  • Applied stemming, lemmatizing, spell correction, etc. when mapping words to word vectors
• Snapshot ensemble of the 2 models: 2 * (0.15 * 3rd epoch + 0.35 * 4th epoch)
• No validation set was held out
• Threshold determined by Local CV (0.35)
