rank phylum # of entries
1 Ascomycota 4629 (79%)
2 Basidiomycota 1116 (19%)
3 Mucoromycotina 38 (1%)
Prediction of ecological attribute values from microbial DNA sequences using deep learning
神沼英里1,2, 藤澤貴智1, 林史1, 中村保一1, 高木利久1, 瀬々潤2, 小笠原理1
(1. National Institute of Genetics, Center for Information Biology, 2. AIST Artificial Intelligence Research Center)
ABSTRACT
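The International Nucleotide Sequence Database operated by DDBJ can serve as source data for machine learning models. We have proposed "DNASmartTagger", a tool that automatically recommends annotation tags when researchers submit DNA sequences to DDBJ. In this work we targeted the DDBJ ecology-related annotation tag "/altitude" (elevation information) and built convolutional neural network (CNN) classifiers, a deep learning model, to predict fungal annotations from DNA sequences. We extracted 5,831 sequences from DDBJ data using the conditions keyword "Fungi", /altitude, and PLN Division, and then built the training data through curation. Of the training data, 79% carry an Ascomycota annotation and 4% carry an ITS1/ITS4 primer set annotation. The altitude values to be predicted were converted from quantitative values into three categories (high, middle, low), and a binary classification task was set up for each category. Two kinds of input were prepared: 128 bp 5'-end fragments and k-mer frequencies. Under 10-fold cross-validation, the accuracy of the CNN models was 0.73 (fragment) and 0.80 (5-mer), higher than the 0.72 (fragment) and 0.77 (5-mer) of the SVM classifiers. The k-mer frequency input gives higher classification accuracy than the sequence fragment input, but the computational cost of deep learning is also higher, so care is needed. We also report findings on the classification accuracy and computational cost of recurrent neural networks, although their accuracy was lower than that of the CNNs.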
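The k-mer frequency input mentioned in the abstract can be illustrated with a small sketch. The poster does not say how the frequencies were computed, so the function name `kmer_frequencies`, the normalization to relative frequencies, and the handling of ambiguous bases are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=5, alphabet="acgt"):
    """Return relative k-mer frequencies in a fixed lexicographic order.

    Assumption: overlapping windows, normalized so the vector sums to 1.
    For k=5 the vector has 4**5 = 1024 entries.
    """
    seq = seq.lower()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. 'n') are not in `kmers`
    # and are therefore ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Toy example with k=2: windows are ac, cg, gt, ta, ac, cg, gt, ta, ac.
vec = kmer_frequencies("acgtacgtac", k=2)
```

A fixed-length vector like this does not depend on sequence length, which may be one reason such input is convenient for both SVM and CNN classifiers.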
Acknowledgments: ・Jun Mashima ・Masanori Arita ・Tatsuya Nishizawa
・ This work was supported by CREST, JST. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.
Publication:
• DNA Data Bank of Japan. Mashima J, Kodama Y, Fujisawa T, Katayama T, Okuda Y, Kaminuma E, Ogasawara O, Okubo K, Nakamura Y, Takagi T. Nucleic Acids Res. 2017 45:D25-D31.
• The International Nucleotide Sequence Database Collaboration. Cochrane G, Karsch-Mizrachi I, Takagi T, International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 2016 44:D48-50.
• DDBJ new system and service refactoring. Ogasawara O, Mashima J, Kodama Y, Kaminuma E, Nakamura Y, Okubo K, Takagi T. Nucleic Acids Res. 2013 41:D25-9.
Result(1) ML Model's Predictive Performance for the INSDC Qualifier Tag /PCR_Primers
Problem: high labor costs of manual annotation at the submission stage to the DNA data bank
Experimental Conditions
■ Future Work
• Developing an API system for user services
• Implementing a model retraining function for successive new data releases
• Extending attribute value prediction to other INSDC qualifier tags
Result(3) Prediction of Ecological Attribute Values from Fungal DNA Sequences using Deep Learning
Background: Next-Generation DNA Sequencing Produces Large Quantities of Data
Problem: Detailed Manual Annotations Lead to High Labor Costs
# of Released Entries
Year | DDBJ Trad | DDBJ SRA (NGS raw reads)
2017 | 948M | 125K
2016 | 790M | 91K
DBCLS SRA statistics (Nakazato et al., 2013), http://sra.dbcls.jp/
DNASmartTagger: A Proposed Machine Learning Tool for DNA Sequence Annotation
[Figure: overview of DNASmartTagger (DDBJ annotation help)]
- Input: DNA sequence, e.g. in INSDC flat file format
- Machine learning models
- Output: annotation tags, e.g. altitudinal zonation (continuous variable → categorical code)
- Open data: INSDC (89 qualifier key tags), BioSample (452 attribute tags), GBIF, etc.
Result(2) Generating Training Datasets from the INSDC Qualifier Tag "/altitude"
[Figures: /altitude INSDC qualifier tag; /PCR_primers INSDC qualifier tag]
■ Prediction target keys for the pilot studies = INSDC qualifier keys (/altitude, /PCR_primers)
* The annotation quality of SRA BioSample is low and requires exhaustive data cleansing.
* INSDC sequence annotations are well controlled compared to SRA BioSample annotations.
■ Sequences and annotation data were retrieved with the DDBJ ARSA search program.
TAG | Output variable type | # of entries | ML model design | Classification performance (AUC with cross-validation) | Data retrieval condition
1 /PCR_Primers | Categorical (multilabel) | 4,850 | Support Vector Machine (SVM), 5'-end fragment (L=60) | 0.83 [37 PrimerFwd models]; 0.81 [104 PrimerRev models] | BCT Division; 16S rRNA

TAG | Output variable type | # of entries | ML model design | Classification performance (accuracy) | Data retrieval condition
2 /altitude | Continuous → categorical (3 labels) ※ | 5,831 | CNN, 3 models | 0.73 (5'-end fragment); 0.80 (k-mer frequency) | PLN Division; Fungi (keyword)
2 /altitude | Continuous → categorical (3 labels) ※ | 5,831 | SVM, 3 models | 0.72 (5'-end fragment); 0.77 (k-mer frequency) | PLN Division; Fungi (keyword)
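For the 5'-end fragment input of the CNN models, one common way to turn a DNA fragment into a numeric matrix is one-hot encoding. The poster does not state which encoding was used, so the sketch below is an assumption; the function name `one_hot_fragment`, the zero-padding of short sequences, and the all-zero rows for ambiguous bases are illustrative choices.

```python
def one_hot_fragment(seq, length=128):
    """Encode the first `length` bases as a length x 4 matrix.

    Rows are positions; columns are a, c, g, t (assumed order).
    """
    index = {"a": 0, "c": 1, "g": 2, "t": 3}
    rows = []
    for base in seq.lower()[:length]:
        row = [0, 0, 0, 0]
        if base in index:  # ambiguous bases such as 'n' stay all-zero
            row[index[base]] = 1
        rows.append(row)
    # Zero-pad sequences that are shorter than `length`.
    rows.extend([0, 0, 0, 0] for _ in range(length - len(rows)))
    return rows


matrix = one_hot_fragment("acgtn", length=8)
```

The resulting matrix has a fixed shape regardless of the original sequence length, which is what a convolutional layer with a fixed input size would need.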
[Figure: fragmentation of a 16S rRNA sequence into a 5'-end fragment and a 3'-end fragment]
16S rRNA sequence (5'→3'):
acagagttttcggactgctgacgaccggcgcacgggtgcgtaacgcgtatacaatctaccttttgctaa
gggatagcccagagaaatttggattaatactttatggtatgtatttatggcatcatatatacattaaaggtt
acggcaaaagatgagtatgcgttctattagctagatggtaaggtaacggcttaccatggctacgatag
ataggggccctgagagggggatcccccacactggtactgagacacggaccagactcctacggga
ggcagcagtgaggaatattggacaatggagggaactctgatccagccatgccgcgtgcaggaaga
ctgccctatgggttgtaaactgcttttatacaagaagaataagagatacgtgtatcttgatgacggtattg
taagaataagcaccggctaactccgtgccagcagccgcggtaatacggagggtgcaagcgttatcc
ggaatcattgggtttaaagggtccgtaggcggattaataagtcagtggtgaaagtctgcagcttaactg
tagaattgccattgatactgttagtcttgaattattatgaagtagttagaatatgtagtgtagcggtgaaat
gcatagatattacatagaataccgattgcgaaggcaggctactaataatatattgacgctgatggacg
aaagcgtgggtagcgaacaggattagataccctggtagtccacgccgtaaacgatggtcactagct
gttcggacttcggtctgagtggctaagcgaaagtgataagtgacccacctggggagtacgttcgcaa
gaatgaaact
5'-end fragment:
acagagttttcggactgctgacgaccggcgcacgggtgcgtaacgcgtatacaatctaccttttgctaagggatagcccagagaaatttggattaatact
3'-end fragment:
gccgtaaacgatggtcactagctgttcggacttcggtctgagtggctaagcgaaagtgataagtgacccacctggggagtacgttcgcaagaatgaaact
PrimerFWD label: F27 (agagtttgatcmtggctcag)
PrimerREV label: 1525R (aaggaggtgwtccarcc)
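The fragmentation step illustrated above can be sketched as simple slicing of a sequence written 5'→3'. This is only a minimal sketch: the function name `end_fragments` is hypothetical, and the behavior for sequences shorter than the fragment length (both fragments equal the whole sequence) is an assumption. The poster uses L=60 for /PCR_Primers and L=128 for /altitude.

```python
def end_fragments(seq, length=60):
    """Return (5'-end fragment, 3'-end fragment) of a 5'->3' sequence."""
    if length <= 0:
        raise ValueError("length must be positive")
    seq = seq.lower()
    # Slicing never fails on short input: a sequence shorter than `length`
    # is returned whole for both ends.
    return seq[:length], seq[-length:]


five_prime, three_prime = end_fragments("ACGTACGTGG", length=4)
```

Under this reading, a 5'-end fragment would feed the forward-primer models and a 3'-end fragment the reverse-primer models, although the poster does not spell out that pairing.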
■ Model Performances
■ Fragmentation processing of input sequence for /PCR_Primers tag prediction
■ Frequency of entries by sequence length
[Figure axis: # of entries by models]
■ Generating ML training datasets
Zone | Attribute value | Altitude zone code
Alpine zone | 1500 m -- | Z3
Montane zone | 800 m -- 1500 m | Z2
Lowland zone | 0 -- 800 m | Z1
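The conversion from a continuous altitude value to a zone code can be sketched as follows. The zone boundaries come from the table above, but which zone a boundary value (exactly 800 m or 1500 m) falls into is not stated, so the half-open intervals used here are an assumption. The helper `parse_altitude` assumes /altitude values written as a number followed by "m" (for example "330.12 m"); both function names are hypothetical.

```python
import re


def parse_altitude(value):
    """Parse a string such as '330.12 m' into meters (assumed format)."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*m\.?\s*", value)
    if match is None:
        raise ValueError(f"unrecognized altitude value: {value!r}")
    return float(match.group(1))


def altitude_zone(altitude_m):
    """Map meters to Z1 (lowland), Z2 (montane) or Z3 (alpine).

    Assumption: lower bounds are inclusive, upper bounds exclusive.
    Negative altitudes are not covered by the table and return None.
    """
    if altitude_m < 0:
        return None
    if altitude_m < 800:
        return "Z1"
    if altitude_m < 1500:
        return "Z2"
    return "Z3"


zone = altitude_zone(parse_altitude("1200 m"))
```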
■ Length of training sequences
Unique taxonomy IDs = 3,667
Unique attribute values = 257
Unique PrimerFWD sequences = 107
Unique PrimerREV sequences = 115
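Primer sequences such as F27 (agagtttgatcmtggctcag) and 1525R (aaggaggtgwtccarcc) contain IUPAC degenerate base codes (m, w, r), so one primer label can correspond to several concrete sequences. As a hedged illustration of how such a primer could be matched against a sequence, the sketch below expands the standard IUPAC nucleotide codes into a regular expression. The function name `primer_regex` is hypothetical, and this is not claimed to be part of the poster's pipeline.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}


def primer_regex(primer):
    """Compile a regex that matches every sequence the primer can denote."""
    return re.compile("".join(IUPAC[base] for base in primer.lower()))


f27 = primer_regex("agagtttgatcmtggctcag")
r1525 = primer_regex("aaggaggtgwtccarcc")
hit = f27.fullmatch("agagtttgatcctggctcag") is not None
```

Here `m` stands for a or c, so both `...gatcatgg...` and `...gatcctgg...` match F27, while `...gatcgtgg...` does not.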
■ Classification performance with frequency of entries by PrimerFWD sequence [fragment length=60]
■ Classification performance by input fragment length (training dataset)
Boxplot of 37 models [# of entries ≥ 20]
AUC = Area Under the Curve; 37 PrimerFwd models [# of entries ≥ 20]
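Since the /PCR_Primers models are evaluated by AUC, a small worked example may help. The sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. This pairwise formulation is a standard definition, not necessarily the implementation used for the poster, and the function name `auc` is illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for positive and 0 for negative examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Positives score 0.9 and 0.3, negatives 0.8 and 0.1:
# three of the four pairs are ordered correctly.
example = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a per-primer model with a few hundred entries but would be replaced by a rank-based computation for large datasets.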
■ Top 5 primer sets
Rank | Primer set | Target loci | # of entries
1 | ITS1 - ITS4 | ITS | 228
2 | ITS5 - ITS4 | ITS | 103
3 | ITS5 - NL4 | ITS, LSU | 93
4 | nu-ssu-0817 - nu-ssu-1536 | SSU | 76
5 | niaD15F - niaD12R | euknr | 21
■ Top 3 phyla
■ Model Performances
■ Management of imbalanced data
■ Framing the task as binary classification
■ Restricting entries to those with the /altitude tag
- DDBJ ARSA sequence retrieval tool
- 5,431 sequences with annotation
- PLN Division
- Keyword: Fungi
* True sequences per class: Z1 (46%), Z2 (12% => 40%, downsampled), Z3 (41%)
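One possible reading of the Z2 figures is that, in the binary "Z2 vs. rest" task, negatives were randomly downsampled until the true (positive) class rose from 12% to about 40% of the data. The poster does not describe the procedure, so the sketch below is an assumption throughout: the function name `downsample_negatives`, the fixed random seed, and the toy class counts are all made up for illustration.

```python
import random


def downsample_negatives(labels, target_pos_fraction, seed=0):
    """Keep all positives and a random subset of negatives.

    Returns sorted indices such that positives make up roughly
    `target_pos_fraction` of the kept examples.
    """
    if not 0 < target_pos_fraction < 1:
        raise ValueError("target_pos_fraction must be between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = round(len(pos) * (1 - target_pos_fraction) / target_pos_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg))


# Toy data: 12 Z2 entries among 100 (made-up counts, not the real dataset).
zones = ["Z2"] * 12 + ["Z1"] * 46 + ["Z3"] * 42
labels = [1 if z == "Z2" else 0 for z in zones]  # binary task: Z2 vs. rest
kept = downsample_negatives(labels, 0.40)
```

With 12 positives and a 40% target, 18 negatives are kept, so the reduced set has 30 entries.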
<Experimental conditions>
* 5’end fragment (L=128)
* evaluation measure (accuracy)
* convolutional neural network(CNN)
* cross validation (kfold=10)
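The evaluation protocol listed above (accuracy under 10-fold cross-validation) can be sketched independently of the classifier. The code below is a generic illustration, not the poster's implementation: `kfold_indices`, `cross_validated_accuracy`, and the `fit`/`predict` callables are hypothetical names, and the shuffled, non-stratified split is an assumption.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every sample lands in one fold
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean test accuracy over k folds for any fit/predict pair."""
    scores = []
    for train, test in kfold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        scores.append(accuracy([y[i] for i in test], preds))
    return sum(scores) / len(scores)


# Usage with a trivial majority-class baseline standing in for a CNN or SVM.
def fit_majority(X, y):
    return max(set(y), key=y.count)


def predict_majority(model, X):
    return [model] * len(X)


baseline = cross_validated_accuracy(
    list(range(20)), [1] * 20, fit_majority, predict_majority, k=10
)
```

A majority-class baseline like this is a useful reference point when classes are imbalanced, because a model must beat it for its accuracy to be meaningful.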
<Comparison among RNN, CNN and SVM>
Method | Accuracy (training) | # of epochs | Comp. time (# of epochs)
SVM | 0.948 | - | -
CNN | 0.977 | 1000 | 4 sec (10)
RNN (LSTM) | 0.745 | 10 | 13 min (10)
(※ Z1 model only)
<Number of CNN parameters>
Three ML models

[2018-03-06] Prediction of ecological attribute values from microbial DNA sequences using deep learning