DNABind: A hybrid algorithm for structure-based prediction of
DNA-binding residues by combining machine learning- and
template-based approaches. Proteins. 2013 Jun 5.

20131019
Biophysics Young Researchers, Kansai Branch: Journal Club
Topics
Prediction of protein-DNA binding residues
Statistics of network
Machine learning
Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.
[Figure: predictions for EcoRV (1RVE:A) and CprK (3E6C:C); panels: Template, Machine learning, DNABind. Legend: query protein, template protein, TP, FN.]
DNABind improves classification of true positive residues.
Aim

Protein-DNA interactions are important in cell biology.
Determining them experimentally is time-consuming and costly.

Computational approaches are desirable.
Computational approaches
Data bank (PDB)
Characteristics of binding residues
Solvent-exposed
Higher electrostatic potential
More conserved
Hotspots as clusters of conserved residues

Structural properties (DNA-binding residue vs surface)
Packing density
Surface curvature
B-factor
Residue fluctuation
Hydrogen bond donor
http://www.rcsb.org/pdb/home/home.do
Computational algorithms
Feature-based
Extract effective features

Template-based
Align template and retrieve the best match

Template!!
Features used in machine learning
Structure-based
PSSM (position specific scoring matrix)
Evolutionary conservation
Solvent accessibility
Local geometry (depth and protrusion index)
Topological features
degree, closeness, betweenness, clustering coefficient

Relative position (distance to centroid)
Statistical potential (Boltzmann distribution)

Sequence-based (more difficult than structure)
Amino acid identity
Residue physicochemical properties
polarity, secondary structure, molecular volume, codon diversity, electrostatic charge

Predicted structure (no 3D structure needed)
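The "Statistical potential (Boltzmann distribution)" feature can be illustrated with a generic inverse-Boltzmann score. This is only a minimal sketch of the general idea, not the potential used in the DNABind paper; the function name, the pseudocount, and the example frequencies are all assumptions made for illustration.

```python
import math

def statistical_potential(observed, expected, pseudocount=1e-6):
    """Inverse-Boltzmann score in units of kT: -ln(P_observed / P_expected).

    The pseudocount (an assumption of this sketch) avoids log(0) for
    contacts that were never observed.
    """
    return -math.log((observed + pseudocount) / (expected + pseudocount))

# Hypothetical contact frequencies, invented for illustration only.
favourable = statistical_potential(observed=0.30, expected=0.10)    # seen more often than expected
unfavourable = statistical_potential(observed=0.05, expected=0.10)  # seen less often than expected
print(favourable, unfavourable)
```

A contact observed more often than the reference expects gets a negative (favourable) score; one observed less often gets a positive score.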
Features used in machine learning
Structure-based
PSSM
Relative solvent accessibility
Depth and protrusion index
Topological features
Distance to centroid
Statistical potentials

Sequence-based
PSSM
Predicted structures
Amino acid indices
Statistical potentials

Construct machine learning (SVM)
Template-based approach
Also used in image recognition and similar tasks,
for example face detection in a camera.
Match!!

Template!!
Template-based prediction
Template-based
Structural alignment and statistical potential
Binding residues are predicted only if the
target protein is judged to be a DNA-binding protein.

312 templates were selected.
Network

Degree is a commonly used measure to reflect the local
connectivity of a node.
Closeness is a global centrality metric used to determine
how critical a residue is in a residue interaction network.
Betweenness of residue i is defined to be the sum of the
fraction of shortest paths between all pairs of residues
that pass through residue i.
Motif, hub, and community
are also important…

The clustering coefficient (transitivity) quantifies how close
a node's neighbors are to being a clique, i.e., the probability that
two neighbors of a vertex are themselves connected.
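The four topological features above can be computed on a small graph with the standard library alone. The sketch below assumes an undirected, unweighted, connected graph stored as a dict of neighbor sets; the toy graph and all function names are hypothetical, and betweenness is left unnormalized. How the paper builds its residue interaction network is not shown here.

```python
from collections import deque

def shortest_paths(graph, source):
    """BFS from source: distance to, and number of shortest paths to, each node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:              # first visit fixes the distance
                dist[nb] = dist[node] + 1
                count[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:  # node precedes nb on a shortest path
                count[nb] += count[node]
    return dist, count

def degree(graph, v):
    """Number of direct neighbors (local connectivity)."""
    return len(graph[v])

def closeness(graph, v):
    """(reachable nodes) / (sum of shortest-path distances from v)."""
    dist, _ = shortest_paths(graph, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def clustering_coefficient(graph, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbs = list(graph[v])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbs[j] in graph[nbs[i]])
    return 2 * links / (k * (k - 1))

def betweenness(graph, v):
    """Sum over pairs (s, t) of the fraction of shortest s-t paths through v."""
    dist_v, count_v = shortest_paths(graph, v)
    others = [n for n in graph if n != v]
    total = 0.0
    for i, s in enumerate(others):
        dist_s, count_s = shortest_paths(graph, s)
        for t in others[i + 1:]:
            on_path = (v in dist_s and t in dist_s and t in dist_v
                       and dist_s[v] + dist_v[t] == dist_s[t])
            if on_path:
                total += count_s[v] * count_v[t] / count_s[t]
    return total

# Toy network (hypothetical): a triangle A-B-C with a tail C-D-E.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D"}}
for node in sorted(toy):
    print(node, degree(toy, node), closeness(toy, node),
          clustering_coefficient(toy, node), betweenness(toy, node))
```

In the toy graph, C has the highest degree and betweenness because every path from the triangle to the tail passes through it.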
Network sample; human protein interactome
Scale-free
Small-world
Cluster
Power law (Pareto distribution)

Bioinformatics. 2012 Jan 1;28(1):84-90.
Machine learning
Example; spam
4601 samples, 57 parameters.
Classification; spam or nonspam
Machine learning
Support vector machine (SVM)
Decision tree
RandomForest
Logistic regression
LASSO (Elastic net and Ridge)
Neural networks (Deep learning)
Evolutionary algorithm
Gaussian process
k nearest neighbor
Clustering
Bayesian networks
Association rule learning
Inductive logic programming (ILP)
Support vector machine (SVM)
Finds a hyperplane that separates the groups.
Kernel method: maps a non-linear problem to a linear one.
Easy to run.
Computationally expensive.
Tuning is very difficult.
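The "non-linear to linear" idea can be shown without any SVM library. In this sketch, XOR-like points cannot be split by a line in 2D, but adding the product x1*x2 as a third coordinate makes them separable by a plane. The feature map, weights, and data are hand-picked assumptions for illustration; a real SVM would learn the hyperplane and would normally use a kernel function rather than an explicit map.

```python
def feature_map(x1, x2):
    """Explicit map (x1, x2) -> (x1, x2, x1*x2); a hand-picked stand-in for a kernel."""
    return (x1, x2, x1 * x2)

# XOR-like data with coordinates in {-1, +1}: the label is +1 when the signs differ.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]

def classify(point, w=(0.0, 0.0, -1.0), b=0.0):
    """Linear decision rule sign(w . phi(x) + b) in the mapped space.

    The weights are chosen by hand here, not learned.
    """
    phi = feature_map(*point)
    score = sum(wi * fi for wi, fi in zip(w, phi)) + b
    return 1 if score > 0 else -1

predictions = [classify(p) for p in points]
print(predictions == labels)
```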
Decision tree
Builds a single tree of branching decision rules.
Easy to understand graphically.
Performance is not very good.
RandomForest
Builds many decision trees.
Much more accurate.
Somewhat time-consuming.
Logistic regression
Widely used by medical researchers.
Easy to use, but tuning is very difficult.
(to tell the truth…)
LASSO, Elastic net, and Ridge regression
Least Absolute Shrinkage and Selection Operator

LASSO
Elastic Net
Ridge
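The three methods differ only in the penalty added to the regression loss. The sketch below writes the penalties out under one common convention (conventions differ between libraries, for example in a factor of 1/2 on the squared term); the function names and the example coefficients are assumptions for illustration.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: lam * sum(|beta_j|); can shrink coefficients exactly to zero."""
    return lam * sum(abs(b) for b in beta)

def ridge_penalty(beta, lam):
    """L2 penalty: lam * sum(beta_j ** 2); shrinks coefficients but keeps them non-zero."""
    return lam * sum(b * b for b in beta)

def elastic_net_penalty(beta, lam, alpha):
    """Mix of both: alpha = 1 gives the LASSO penalty, alpha = 0 the ridge penalty."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) * l2)

# Hypothetical coefficient vector.
beta = [0.5, -2.0, 0.0]
print(lasso_penalty(beta, 0.1), ridge_penalty(beta, 0.1),
      elastic_net_penalty(beta, 0.1, 0.5))
```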
Neural networks
Loosely modeled on the mammalian brain (perceptron).
Multiple hidden layers.
Deep learning is a hot topic!!
(hard to understand…)

http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html
n-fold cross validation
Evaluates how well the results of a statistical analysis will
generalize to an independent data set.
Train data
Test 1
Leave-one-out CV
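The splitting step of n-fold cross validation can be sketched as follows. Each sample lands in the test set exactly once, and leave-one-out CV is the special case where the number of folds equals the number of samples. The function name is hypothetical, and the sketch does not shuffle the data, which a real analysis usually would.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))        # 5-fold CV on 10 samples
leave_one_out = list(k_fold_indices(4, 4))  # leave-one-out: one sample per test set
for train, test in folds:
    print("train:", train, "test:", test)
```

The model is fitted on each train list and scored on the matching test list; the scores are then averaged over the folds.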
Performance

             SVM    Tree   RandomForest  LASSO  Elastic net  Ridge  Logistic  nnet
Recall       0.917  0.872  0.927         0.894  0.892        0.852  0.893     0.930
Precision    0.948  0.914  0.954         0.932  0.926        0.926  0.930     0.935
F            0.932  0.893  0.940         0.913  0.911        0.887  0.911     0.932
MCC          0.890  0.826  0.902         0.858  0.856        0.821  0.856     0.888
Combine two approaches
Statistical features of structure
A: Binding residues are highly solvent accessible.
B, C: Binding residues have low depth and high protrusion.
D-G: Little difference in the network features.
H: Binding residues are closer to the centroid.
Performance
Performance

A higher TM-score is required for good prediction.

TM-score measures the similarity between two protein tertiary structures.
A score below 0.2 indicates a random relationship; a score above 0.5 indicates highly related structures.
Proteins. 2004 Dec 1;57(4):702-10.
Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.
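For reference, the TM-score sums a distance-dependent term over aligned residue pairs and normalizes by the target length, with the length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates that sum for one fixed superposition; the real score is the maximum over superpositions, which is not attempted here. The function names, the length guard, and the example distances are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) from the TM-score definition."""
    if target_length <= 21:
        # Guard added for this sketch: d0 is not positive for very short chains.
        raise ValueError("target_length must be greater than 21 for this sketch")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score(distances, target_length):
    """TM-score for ONE fixed superposition.

    distances: C-alpha distances (angstroms) of the aligned residue pairs.
    The published score maximizes this value over superpositions.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Hypothetical example: 100-residue target, 80 aligned pairs at 2 angstroms each.
print(tm_score([2.0] * 80, 100))
```

Because the sum is divided by the target length, unaligned residues lower the score even when the aligned part fits perfectly.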
Performance
Comparison among ML, TL, and DNABind.

Comparison between DNABind and other software.
