
Summer Training Report (June-July 2016)

A Synopsis submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Engineering in the Department of Electronics and Communication Engineering

by
Rahul Kumar (Reg. No: 101456010)
under the supervision of
Prof. PS Rana

Thapar University Patiala-147004, Punjab, India

Submission Date - August 8, 2016

Abstract

Nature inspired algorithms have been gaining popularity in recent years because many real-world optimization problems have become increasingly large, complex and dynamic. Nature inspired algorithms find acceptable results, rather than exact solutions, within a reasonable amount of time. Machine learning, a branch of artificial intelligence, focuses on the construction and study of systems that can learn from data; thanks to advances in machine learning, several classification and regression models have been developed for efficient and accurate prediction from huge amounts of experimental data. Protein structure prediction is considered the hardest problem in computational biology and remains intractable even six decades after the report of the first crystal structure. The objective is to find the native, stable structure that has minimum energy, maximum total surface area, minimum Euclidean distance and minimum secondary structure penalty. Due to its high computational complexity, protein structure prediction is considered an NP-Hard problem. This report contains a brief review of the Differential Evolution (DE) algorithm and its advances, machine learning approaches, and two real-world optimization problems, i.e. cluster structure stability and protein structure prediction. Both real-world problems are solved using nature inspired algorithms and machine learning approaches. Furthermore, to improve efficiency, accuracy and reliability, a new variant of DE is proposed, and the diversity of DE and its advanced variants is analyzed. Research gaps are identified and objectives are formulated. The report gives a brief overview of all the proposed methods and their comparison with well-known existing methods.

Keywords: Evolutionary Algorithms, Differential Evolution, Cluster Structure Stability, Protein Structure Prediction, Machine Learning Models.

Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Research Motivation
  1.2 Key Research Area
2 Literature Review
  2.1 Differential Evolution Algorithm and its Advances
  2.2 Machine Learning Approaches
  2.3 Cluster Structure Stability
  2.4 Protein Structure Prediction
3 Research Gaps and Objectives
  3.1 Research Gaps
  3.2 Research Objectives
4 Research Methodology
5 Work Done
  5.1 Diversity Analysis of Differential Evolution and its Variants
  5.2 Guided Reproduction in Differential Evolution (GRDE)
  5.3 Cluster Structure Stability Analysis of Ni_n (n = 2-22) using Nature Inspired Algorithms
  5.4 Protein Tertiary Structure Prediction using Differential Evolution and Physicochemical Properties
  5.5 Protein Tertiary Structure Data Set
  5.6 Protein Tertiary Structure Prediction using Physicochemical Properties, SaDE and Machine Learning Approach
  5.7 Classification of Protein Tertiary Structure using Physicochemical Properties, OBDE and Machine Learning
6 Research Contribution
7 Conclusion and Future Work
8 Organization of Thesis
9 List of Publications
Key References

List of Figures

1 Machine learning categorization.
2 Protein structure prediction problem (Source: SCFBio, IIT Delhi).
3 Levinthal's paradox.
4 Participation in CASP (Source: Protein Prediction Center).
5 Protein structure prediction accuracy (Source: Protein Prediction Center).
6 Research methodology.
7 GRDE methodology.
8 Cluster structure stability methodology.
9 Methodology used for protein structure prediction using DE.
10 Result comparison of protein structure prediction using GA, PSO and DE.
11 Snapshot of web application for protein structure prediction using machine learning approaches; web link: http://www.psrana.com/psp.html
12 Methodology used for protein structure prediction using regression models.
13 Methodology used for classification of proteins using classification models.
14 Research contribution.

List of Tables

1 Advances in Differential Evolution
2 Machine learning models (Source: Author)
3 Protein structure prediction methods (Source: Author)
4 Diversity measure parameters
5 Test problems used for diversity measures
6 Result comparison of GRDE and DE
7 Parameter settings for NIA for energy minimization
8 Minimum energy obtained from different NIA
9 Parameter settings for PSP using GA, PSO and DE
10 Data set used in Protein Structure Prediction
11 Description of the data set used in Protein Structure Prediction
12 Description of the features used in Protein Structure Prediction
13 Sample dataset used in Protein Structure Prediction
14 Regression machine learning methods
15 Avg. performance comparison of all nine regression models on all data sets
16 Classification machine learning methods
17 Avg. performance comparison of classification models on all data sets

LIST OF NOTATIONS

x_i(G) : parent vector of population G
u_i(G) : trial vector of population G
x'_i(G) : offspring
f(x_i(G)) : fitness of parent vector of population G
jr : jump rate used in OBDE
X_best,G : best individual of population G
NP : population size
D : population diameter
R : population radius
DA : average distance around population center
DGM : geometric average distance around population center
DN : normalized average distance around the population center
S : population coherence
CR : crossover rate
F : scaling factor
R^2 : coefficient of determination
E_nia : energy obtained through nature inspired algorithms
E_r : energy obtained after relaxation
ΔE_err : error in the energy
E_be : binding energy
ΔE_{n,m} : fragmentation energy between atom n and atom m
Δ²E_n : second difference in energy

LIST OF ABBREVIATIONS

ACO : Ant Colony Optimization
GA : Genetic Algorithm
OBDE : Opposition Based DE
PSO : Particle Swarm Optimization
SaDE : Self Adaptive DE
PSP : Protein Structure Prediction
GRDE : Guided Reproduction in DE
BFO : Bacterial Foraging Optimization
NIA : Nature Inspired Algorithm
DE : Differential Evolution
ABC : Artificial Bee Colony

1 Introduction

In recent years, many researchers and scientists have shown interest in the field of Nature Inspired Algorithms (NIA). The field began in the late 1950s and early 1960s, as the availability of computing permitted scientists and engineers to build and experiment with various models of evolutionary processes. The early work produced a number of important evolutionary computation paradigms that served as the basis for much of the work of the 1970s, a period of intense exploration and refinement of these ideas. The result was a variety of robust algorithms with significant potential for addressing difficult scientific and engineering problems. Swarm-based NIA draw on the decentralized, collective and cooperative behavior of social insects and of animal groups such as flocks of birds or schools of fish; more generally, a swarm can be considered any collection of interacting agents or individuals. Bonabeau described swarm intelligence as "any attempt to design algorithms or distributed problem-solving devices inspired by the collective behaviors of social insect colonies and other animal societies" [1].

The methods currently available in the literature for solving global optimization problems may be broadly classified into deterministic and probabilistic methods. Deterministic methods produce exact solutions and do not use any probabilistic technique, but they demand analytic properties of the objective function such as linearity, differentiability and convexity, and they often take a long time or are unable to solve real-world problems at all. Probabilistic methods, on the other hand, are preferred over deterministic ones because they do not require any characteristic properties of the function and so can be applied to a wider class of problems. NIA cover a large number of algorithms, such as Genetic Algorithms (GA) [2], Differential Evolution (DE) [3], Particle Swarm Optimization (PSO) [4], Artificial Bee Colony Optimization (ABC) [5], Ant Colony Optimization (ACO) [6], Bacterial Foraging Optimization (BFO) [7], etc.

This report contains a brief review of the DE algorithm and its advances, machine learning approaches and two real-world optimization problems, i.e. cluster structure stability and protein structure prediction. Both real-world problems are solved using NIA and machine learning approaches. Furthermore, to improve efficiency, accuracy and reliability, a new variant of DE is proposed, and the diversity of DE and its advanced variants is analyzed.

1.1 Research Motivation

Optimization is the process of searching, among the permissible options in a function's domain of definition, for the best alternative(s) that produce the desired response. For example, in mechanical engineering one is interested in the best possible design of a car that yields a safe and economical structure. Problems of this type can be resolved scientifically through modeling and optimization. Modeling refers to converting the original problem into a mathematical structure; the model must be designed so that it includes all key characteristics of the original system, because it stands in for the original problem throughout the algorithmic optimization process. The created models are usually formulated as functions, called objective functions, of one or more variables that correspond to the adjustable parameters of the system. In most optimization problems, complex systems are modeled with complicated multi-dimensional functions that cannot be expressed simply. To solve such optimization problems numerically, algorithmic procedures that take full advantage of modern computer systems can be implemented; time constraints, computational accuracy and implementation effort then become important aspects of the numerical optimization procedure. In many such problems the objective function is noisy, discontinuous and lacks an analytical representation. Under these restrictions, the applicability and efficiency of classical, deterministic optimization algorithms are questionable: their main drawbacks are that they are not robust, i.e. they can be applied only to a restricted class of problems, and that they are too time consuming or sometimes unable to solve real-world problems at all. To avoid these drawbacks, there is a great need to develop fast, accurate and efficient algorithms that can solve real-world optimization problems.

1.2 Key Research Area

NIA often perform well and give approximate solutions to many NP-Hard problems for which deterministic methods cannot produce a result in the desired time; they are commonly grouped with machine learning techniques. Most NIA work in two phases, the variation phase and the selection phase, which are jointly responsible for maintaining the balance between exploration and exploitation: the variation phase explores different areas of the search space, while the selection phase exploits the previous experience. However, these algorithms also suffer from premature convergence and stagnation, i.e. they may occasionally stop proceeding toward the global optimum even though the set of potential solutions has not converged to a local optimum [8, 9, 10]. Many algorithms fall under NIA, viz. GA, PSO, ABC, DE, ACO, BFO, the Firefly Algorithm, Glowworm algorithms and many more, each with its own merits and demerits.

The key research focuses on (a) hybridization and/or modification of DE, (b) machine learning approaches and (c) two NP-Hard problems, viz. cluster structure stability and protein structure prediction. The key research areas are the following:

1. The inherent drawback of most NIA is premature convergence. A population-based algorithm is regarded as efficient if it converges fast and is able to explore the maximum area of the search space; in other words, if it is capable of balancing exploration and exploitation of the search space.

2. Furthermore, Mezura-Montes et al. [10] show that DE, too, does not establish a proper trade-off between exploitation and exploration of the search space. Some studies have also shown that stagnation is a significant drawback of DE, i.e. DE sometimes stops proceeding towards the global optimum even though the population has not converged to a local optimum or any other point.

3. Protein structure prediction has been one of the main holy grails of computational biology for many decades. PSP is an open problem and is treated as an optimization problem; the prediction of structural aspects of protein residues is generally treated as a machine learning problem. The structure of a protein is very difficult to determine experimentally, and in some cases almost impossible, due to its large dimensionality.

4. Better protein structure models can help in genetic therapy, the synthesis of drugs for incurable diseases, improved crops, environmental remediation and much more.

2 Literature Review

This section briefly reviews DE and its advanced variants, machine learning approaches, the cluster structure stability problem and the protein structure prediction problem.

2.1 Differential Evolution Algorithm and its Advances

DE is a relatively simple, fast, population-based stochastic search technique proposed by Storn and Price [3]. DE falls under the category of EAs but differs from them in some important respects: the trial vector generation process uses distance and direction information from the current population to generate a new trial vector, and whereas in EAs crossover is applied first to generate a trial vector that is then used within the mutation operation to produce one offspring, in DE mutation is applied first and then crossover. In DE the population is initially generated randomly with a uniform distribution, and mutation, crossover and selection operations are then applied to generate a new population. The generation of the trial vector is a crucial step in the DE process: the mutation and crossover operators are used to generate trial vectors, and the selection operator is used to select the best trial vector for the next generation. The DE operators are explained briefly in the following subsections.

Mutation: A trial vector is generated by the DE mutation operator for each individual of the current population. To generate the trial vector, a target vector is mutated with a weighted differential; an offspring is then produced in the crossover operation using the newly generated trial vector. If G is the generation counter, the mutation operator for generating a trial vector u_i(G) from the parent vector x_i(G) is defined as follows:

• Select a target vector x_{i1}(G) from the population, such that i ≠ i1.
• Randomly select two further individuals, x_{i2}(G) and x_{i3}(G), from the population, such that i ≠ i1 ≠ i2 ≠ i3.
• The target vector is then mutated to obtain the trial vector:

$$u_i(G) = x_{i1}(G) + F \times \big(x_{i2}(G) - x_{i3}(G)\big) \qquad (1)$$

The weighted difference $F \times (x_{i2}(G) - x_{i3}(G))$ is the variation component, whose magnitude is the step size,

where F ∈ (0, 1) is the mutation scale factor which is used in controlling the amplification of the differential variation [11].

Crossover: The offspring $x'_i(G)$ is generated by crossover of the parent vector $x_i(G)$ and the trial vector $u_i(G)$ as follows:

$$x'_{ij}(G) = \begin{cases} u_{ij}(G), & \text{if } j \in J \\ x_{ij}(G), & \text{otherwise} \end{cases} \qquad (2)$$

Here J is the set of crossover points, i.e. the points that will undergo perturbation, and $x_{ij}(G)$ is the j-th element of the vector $x_i(G)$.

Selection: The selection operator has two functions. First, it selects the individuals for the mutation operation that generates the trial vector; second, it selects the better of the parent and the offspring, based on their fitness values, for the next generation. If the fitness of the offspring is better than that of the parent, the offspring replaces the parent; otherwise the parent remains in the population:

$$x_i(G+1) = \begin{cases} x'_i(G), & \text{if } f(x'_i(G)) > f(x_i(G)) \\ x_i(G), & \text{otherwise} \end{cases} \qquad (3)$$
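Equations (1)-(3) together define one DE generation. Below is a minimal, self-contained Python sketch of this classical DE loop (DE/rand/1 mutation with binomial crossover), written for minimization; the sphere objective and all parameter values are illustrative assumptions, not settings used in this report.

```python
import numpy as np

def sphere(x):
    # Illustrative objective (minimization); any fitness function works here.
    return np.sum(x ** 2)

def classical_de(f, dim=30, NP=50, F=0.5, CR=0.9, bounds=(-5.12, 5.12), gens=1000):
    lo, hi = bounds
    pop = np.random.uniform(lo, hi, (NP, dim))           # random initial population
    fit = np.array([f(x) for x in pop])
    for _ in range(gens):
        for i in range(NP):
            # Mutation (eq. 1): three mutually distinct indices, all different from i
            i1, i2, i3 = np.random.choice([j for j in range(NP) if j != i], 3, replace=False)
            u = pop[i1] + F * (pop[i2] - pop[i3])
            # Binomial crossover (eq. 2): J is the set of perturbed dimensions
            J = np.random.rand(dim) < CR
            J[np.random.randint(dim)] = True             # ensure at least one component crosses
            child = np.clip(np.where(J, u, pop[i]), lo, hi)
            # Greedy selection (eq. 3), written for minimization
            fc = f(child)
            if fc < fit[i]:
                pop[i], fit[i] = child, fc
    return pop[np.argmin(fit)], fit.min()

best_x, best_f = classical_de(sphere)
```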

The greedy selection rule in eq. (3) ensures that the population's average fitness never deteriorates. The pseudo code for the classical DE strategy is described in Algorithm 0.1 [11], where F is the scale factor, CR is the crossover rate and P is the population vector. Researchers are continuously working to improve the efficiency and accuracy of the DE algorithm; Table 1 gives a brief review of DE modifications.

Table 1: Advances in Differential Evolution
S.No. | Modification and Description | Yr/Ref.
1 | Determined different parameter values for DE | 2002 [12]
2 | Proposed a self-adapted strategy for the crossover rate CR to solve multi-objective optimization problems | 2002 [13]
3 | Proposed a parameter adaptation strategy for DE (ADE) based on controlling the population diversity | 2003 [14]
4 | Two improved DE schemes for faster global search | 2005 [15]
5 | Improved DE for handling noisy optimization problems | 2005 [16]
6 | Introduced a self-adaptive scaling factor | 2005 [17]
7 | Introduced a crossover-based local search method for DE called Fittest Individual Refinement (FIR) | 2005 [18]
8 | Performance of modified DE for optimal design of complex and non-linear chemical processes | 2006 [19]
9 | Self-Adapting Control Parameters in DE (jDE) | 2006 [20]
10 | Proposed simulated annealing DE (SADE): each individual carries a set of F values, rather than a single value, within the range [0.1, 1]; the control parameters F and CR are encoded into the individual and changed based on two new probability factors | 2006 [21]
11 | Opposition Based DE (OBDE) | 2008 [22]
12 | An adaptive DE variant (JADE) | 2009 [23]
13 | Introduced Self-adaptive DE (SaDE), in which the control parameters used in trial vector generation and selection are steadily self-adapted by learning from previous experience | 2009 [24]
14 | Proposed scale factor local search DE (SFLSDE), a self-adaptive scheme with two local search algorithms | 2009 [25]
15 | Proposed DE using a neighborhood-based mutation operator (DEGL), which balances the exploration and exploitation abilities of DE | 2009 [26]
16 | Proposed modifications in the crossover probability and mutation operator | 2010 [27]
17 | Proposed DE with composite trial vector generation strategies and control parameters | 2011 [28]
18 | Proposed a hybrid of DE with PSO | 2011 [29]
19 | Self-adaptive DE with a small and varying population size | 2012 [30]
20 | Adaptive DE with novel mutation and crossover strategies | 2012 [31]

2.2 Machine Learning Approaches

Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages; after learning, it can be used to classify new email messages into spam and non-spam folders. Figure 1 shows the categorization of machine learning approaches into two categories, supervised learning and unsupervised learning. Supervised learning generates a function that maps inputs to desired outputs (also called labels, because they are often provided by human experts labeling the training examples); it divides into (i) classification, which predicts discrete-valued output, and (ii) regression, which predicts real-valued output. For example, in a classification problem the learner approximates a function mapping a vector into classes by looking at input-output examples of that function. Unsupervised learning models a set of inputs, as in clustering, where labels are not known during training; see also data mining and knowledge discovery.
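As a concrete illustration of the supervised classification setting just described, the sketch below trains a classifier on labeled examples and then predicts a label for an unseen input. The tiny synthetic data set and the scikit-learn model choice are illustrative assumptions; the models actually surveyed in this report are R packages (see Table 2).

```python
from sklearn.ensemble import RandomForestClassifier

# Toy labeled training data: feature vectors with known labels (e.g. spam=1, non-spam=0).
X_train = [[0.9, 12], [0.8, 30], [0.1, 2], [0.05, 1]]
y_train = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)          # supervised learning: inputs mapped to known outputs

print(clf.predict([[0.7, 25]]))    # classify a new, unlabeled example
```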

Algorithm 0.1: Classical DE
    Initialize the control parameters;
    Initialize the population;
    while stopping condition(s) not true do
        for each individual x_i(G) ∈ P(G) do
            (i) Evaluate the fitness f(x_i(G));
            (ii) Create the trial vector u_i(G) by applying the mutation operator;
            (iii) Create an offspring x'_i(G) by applying the crossover operator;
            if f(x'_i(G)) is better than f(x_i(G)) then
                Add x'_i(G) to P(G + 1);
            else
                Add x_i(G) to P(G + 1);
            end if
        end for
    end while
    Return the individual with the best fitness as the solution;

Figure 1: Machine learning categorization.


Table 2 shows recently developed classification and regression models. All the models are available in the open source R software; R is licensed under the GNU GPL.

Table 2: Machine learning models (Source: Author). For every entry the caret method string is identical to the model name.
S.No | Model Name | Model Type | Package | Tuning Parameter(s)
1 | ada | Classification | ada | maxdepth, iter, nu
2 | avNNet | Dual Use | caret | decay, size, bag
3 | bag | Dual Use | caret | vars
4 | bdk | Dual Use | kohonen | xweight, topo, xdim, ydim
5 | blackboost | Dual Use | mboost | maxdepth, mstop
6 | Boruta | Dual Use | Boruta | mtry
7 | bstTree | Dual Use | bst | maxdepth, nu, mstop
8 | C5.0 | Classification | C50 | winnow, trials, model
9 | cforest | Dual Use | party | mtry
10 | ctree | Dual Use | party | mincriterion
11 | cubist | Regression | Cubist | committees, neighbors
12 | enet | Regression | elasticnet | lambda, fraction
13 | foba | Regression | foba | lambda, k
14 | GAMens | Classification | GAMens | fusion, iter, rsm_size
15 | gamLoess | Dual Use | gam | degree, span
16 | gbm | Dual Use | gbm | trees, shrinkage, depth
17 | gcvEarth | Dual Use | earth | degree
18 | glm | Dual Use | stats | None
19 | icr | Regression | caret | n.comp
20 | J48 | Classification | RWeka | C
21 | JRip | Classification | RWeka | NumOpt
22 | knn | Dual Use | caret | k
23 | lars | Regression | lars | fraction
24 | lda | Classification | MASS | None
25 | leapSeq | Regression | leaps | nvmax
26 | Linda | Classification | rrcov | None
27 | lm | Regression | stats | None
28 | logforest | Classification | LogForest | None
29 | M5 | Regression | RWeka | smoothed, pruned, rules
30 | nb | Classification | klaR | usekernel, fL
31 | neuralnet | Regression | neuralnet | layer1, layer2, layer3
32 | nnet | Dual Use | nnet | size, decay
33 | obliqueTree | Dual Use | obliqueTree | splits, selection
34 | OneR | Classification | RWeka | None
35 | ORFsvm | Classification | obliqueRF | mtry
36 | pam | Classification | pamr | threshold
37 | parRF | Dual Use | randomForest | mtry
38 | PART | Classification | RWeka | pruned, threshold
39 | partDSA | Dual Use | partDSA | cut.off.growth, MPD
40 | pcaNNet | Dual Use | caret | decay, size
41 | pcr | Regression | pls | ncomp
42 | pda | Classification | mda | lambda
43 | plr | Classification | stepPlr | cp, lambda
44 | rbf | Dual Use | RSNNS | size
45 | rbfDDA | Classification | RSNNS | negativeThreshold
46 | rda | Classification | klaR | lambda, gamma
47 | relaxo | Regression | relaxo | phi, lambda
48 | rf | Dual Use | randomForest | mtry
49 | rpart | Dual Use | rpart | cp
50 | RRF | Dual Use | RRF | mtry, coefReg, coefImp
51 | rvmLinear | Regression | kernlab | None
52 | sda | Classification | sda | diagonal
53 | slda | Classification | ipred | None
54 | smda | Classification | sparseLDA | R, lambda, NumVars
55 | svmLinear | Dual Use | kernlab | C
56 | svmPoly | Dual Use | kernlab | degree, scale, C
57 | svmRadial | Dual Use | kernlab | C, sigma
58 | svmRadialCost | Dual Use | kernlab | C
59 | treebag | Dual Use | ipred | None

2.3 Cluster Structure Stability

Molecular dynamics (MD) is a computer simulation of the physical movements of atoms, molecules, residues, ligands, etc. Cluster structure determination is a problem of computational chemistry/physics and is considered NP-Hard. In MD, finding the lowest potential energy of a cluster is the first step in understanding the physical and chemical properties of a biomolecule. A cluster has a large number of local minima; e.g. a cluster with 13 atoms may have on the order of 10^3 local minima [32], and the number of local minima increases rapidly with the number of atoms n in the cluster [33, 34]. Since searching for the lowest-energy cluster is an NP-Hard problem [35, 36], it is impossible to examine all possible structures through exhaustive search, and a global optimization method is therefore needed.

2.4 Protein Structure Prediction

Proteins are complex, high-molecular-mass organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells. Protein structure prediction (or protein folding), considered to be the holy grail of molecular biology, remains intractable even six decades after the report of the first crystal structure. Figure 2 describes the PSP problem: the objective is to find the native, stable structure that has minimum energy.

Levinthal's Paradox: In 1969, Cyrus Levinthal argued that protein folding cannot proceed by trial and error. Suppose a protein has 100 peptide bonds (101 amino acids) and assume 3 states for each of the 100 phi and psi bond-angle pairs; the total number of combinations is then 3^100 ≈ 10^47.

Figure 2: Protein structure prediction problem (Source: SCFBio, IIT Delhi).

Figure 3: Levinthal's paradox

If 10^13 conformations could be sampled per second, about 10^27 years would be required [37], which is greater than the age of the universe; yet proteins fold in milliseconds to seconds.

Critical Assessment of Structure Prediction (CASP): CASP is a worldwide competition for protein structure prediction, held once every two years, whose aim is to help advance methods for identifying protein structure from sequence. Figure 4 shows the number of participants in each CASP, and Figure 5 shows prediction accuracy increasing slowly but steadily. Computerized protein structure prediction will enable faster development of novel drugs and treatments for many diseases, and is becoming more feasible due to better models of protein folding and increasing computing power. Ab initio structure prediction attempts to model proteins by starting from an extended chain and folding up the sequence on the computer; this method has the advantage of not depending on a previously determined structure to serve as a template. It is generally assumed that a protein sequence folds to a native conformation, or ensemble of conformations, that is at or near the global free-energy minimum.
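The arithmetic behind Levinthal's estimate is easy to reproduce; the sketch below recomputes the numbers quoted above (the three-states-per-angle assumption and the 10^13 samples-per-second rate come from the text).

```python
# Levinthal's paradox, recomputed: 100 peptide bonds, 3 states per phi/psi angle pair.
conformations = 3 ** 100                 # ≈ 5.15e47, i.e. on the order of 10^47
rate = 1e13                              # assumed conformations sampled per second
seconds_per_year = 3600 * 24 * 365
years = conformations / rate / seconds_per_year
print(f"{conformations:.2e} conformations -> {years:.2e} years")  # ≈ 1.6e27 years
```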

Figure 4: Participations in CASP (Source: Protein prediction center)

Figure 5: Protein structure prediction accuracy (Source: Protein Prediction Center)

2.4.1 Protein structure prediction models

Many methods have been developed over the last 30 years of research in protein structure prediction; some of them are briefly described in Table 3. Most of the methods are broadly classified into the following categories (the definitions below follow the CASP10 experiment):

• Template Based Modeling (TBM): includes domains for which a suitable template can be identified that covers all or nearly all of the target.
• Template Free Modeling (FM): includes models of proteins for which no suitable template can be identified.
• Refinement Modeling (RM): includes selected targets from among those released in the main modeling experiment, used to analyze success in refining models beyond the quality obtained by simply copying from a single template. For suitable CASP10 targets, one of the best models received during the prediction season is selected and reissued as a starting structure for refinement.
• Contact Assisted Structure Modeling (CASM): shows how knowledge of a few (usually 3 to 5) long-range contacts influences the ability of predictors to model the complete structure. This experiment is carried out only for the more challenging CASP10 targets for which coordinates are available in advance, with at least two weeks for re-prediction.
• Chemical Shifts Guided Modeling (CSGM) of NMR structures: performed on selected CASP10 targets for which a chemical-shifts table can be obtained from the NMR spectroscopists at least two weeks ahead of public release of the target.
• Structure Modeling (SM) based on molecular replacement with ab initio

models and crystallographic diffraction data: carried out for selected targets, provided the structure factors are supplied by the crystallographers.

Table 3: Protein structure prediction methods (Source: Author)
S.No. | PSP Method | Brief Description
1 | AIDISORDER | Protein disorder prediction using machine learning and trivial features.
2 | ANTHROPIC DREAMS | Multiplayer online game-based homology and ab-initio modeling.
3 | AOBA-SERVER | Protein structure modeling guided by homology and hydrophobic residue interactions.
4 | BAKER | Modeling of protein structures using Rosetta.
5 | BHAGEERATH-H | A homology/ab-initio method for predicting tertiary structures of soluble monomeric proteins.
6 | BILAB-ENABLE | Tertiary structure prediction by a combination of fold recognition and fragment assembly, with semi-automated quaternary structure prediction from the monomer structure.
7 | BIOMINE | Prediction of disordered regions by multilayered information fusion.
8 | CASPITA | RING: a new network method for quality assessment of homology models.
9 | CHUO-FAMS | Tertiary structure prediction by a consensus method composed of GDT-TS and secondary structure arrangement, checked against the packing free energy using Stage-2 server models.
10 | COFACTOR-HUMAN | Protein-ligand binding site prediction using COFACTOR.
11 | CORNELL-GDANSK | Physics-based protein structure prediction with the coarse-grained UNRES force field.
12 | FALCON-TOPO | Improving remote homologue recognition using evolutionarily conserved topology of protein structures.
13 | FEIG | Structure refinement with molecular dynamics simulations combined with an effective scoring and selection protocol.
14 | FLOUDAS | ASTRO-FOLD 2.0: prediction of protein tertiary structure from first principles and global optimization.
15 | FLOUDAS-REFINE | Hydrogen bond network optimization for improved tertiary structure refinement.
16 | FOLDIT | Multiplayer online game-based homology and ab-initio modeling.
17 | FRESS-SERVER | Structure refinement using energy-guided fragment regrowth.
18 | HHPREDA | Template-based structure prediction with hhpreda-thread.
19 | ICOS | Residue-residue contact prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features.
20 | INTFOLD2 | Fully automated prediction of tertiary structures, intrinsic disorder and binding site residues using the IntFOLD2 server.
21 | JIANG-FOLD | A packing-cluster-based fold recognition server.
22 | JONES-UCL | Protein structure prediction using pGenTHREADER and FRAGFOLD.
23 | KIAS-GDANSK | Prediction of protein structure using the UNRES force field with conformational space annealing and dynamic fragment assembly.
24 | KIM-KIHARA | Structure prediction and refinement using the constraints-based CABS model and knowledge-based secondary structural fragment interaction and residue environmental potentials.
25 | KNOWMIN | Iterative knowledge-based minimization protocol.
26 | LAUFERCENTERMETA | Ranking protein structures using free energy as a scale.
27 | MEILERLAB | BCL::Fold: de novo prediction of complex and large protein topologies by assembly of secondary structure elements.
28 | MUFOLD2 | EA for protein model refinement.
29 | OSSIA | A novel procedure for constructing multiple alignments for protein structure prediction.
30 | PCONSD | Distance-based modeling.
31 | PHYRE2-A | Simulated protein synthesis and folding with residue-residue distance constraints from templates and sequence.
32 | PRDOS-CNF | Prediction of protein disordered regions based on conditional neural fields.
33 | RAPTORX | Protein threading by maximizing a new alignment potential.
34 | STRINGS | Selection of templates recursively by integrating exhaustive strategies (STRINGS) server.
35 | TSLAB-TBQA | Single-model quality assessment method using a template-based evaluation score.
36 | WFFUGT | Foldit with selection by the knowledge-based potential GOAP and refinement by TASSER.
37 | ZHANG-SERVER | Protein structure prediction by a combination of the I-TASSER and QUARK pipelines.
38 | ZHOU-SPARKS-X | Improving the single fold-recognition technique by employing statistical error potentials.

There are several kinds of prediction problems within the scope of PSP:

• Tertiary structure prediction (TS).
• Detecting residue-residue contacts in proteins (RR).
• Identifying disordered regions in target proteins (DR).
• Function prediction (prediction of binding sites) (FN).
• Quality assessment of models in general (without knowing native structures), and of the reliability of predicting certain residues in particular (QA).

2.4.2 Quality assessment for PSP models

The Critical Assessment of protein Structure Prediction (CASP) has been assessing the state of the art in the a priori estimation of the accuracy of protein structure predictions since 2006. QA predictions are accepted in two modes:

• QMODE 1 (QA1), for assessing the overall reliability of models. In QMODE 1, predictors are asked to score each model on a scale from 0 to 1, with higher values corresponding to better models and a value of 1.0 corresponding to a model virtually identical to the native structure.
• QMODE 2 (QA2), for assessing the per-residue accuracy of models. In QMODE 2, predictors are asked to report the estimated distances in Angstroms between the corresponding residues in the model and target structures after optimal superposition.

3 Research Gaps and Objectives

3.1 Research Gaps

1. Mezura-Montes et al. [10] show that DE, too, does not establish a proper trade-off between exploitation and exploration of the search space. Some studies have also shown that stagnation is another inherent drawback of DE [38].

2. In DE, due to the large step size, there is a real chance of skipping the true solution [39].

3. As no fully efficient NIA exists, there is a strong need to develop a new efficient nature-inspired algorithm and/or a hybrid algorithm that improves on the existing ones.

4. Real-world complex optimization problems remain to be solved efficiently, such as cluster structure stability, protein structure prediction, data clustering, data mining, image processing and pattern recognition.

3.2 Research Objectives

The following research objectives are formulated:

1. Study the diversity of DE and its advanced variants.
2. Propose a new variant of DE that maintains a proper balance between exploration and exploitation in the search space.
3. Solve the cluster structure stability problem using nature inspired algorithms and study their performance.
4. Search for native protein structures using DE and its advanced variants, and test the performance against benchmark structures.
5. Predict protein tertiary structure from physicochemical properties using machine learning models.
6. Use advanced variants of DE for feature selection; new algorithms such as Binary Random Forest also need to be explored.
7. Develop a web application for the protein structure prediction problem using machine learning approaches.

4 Research Methodology

In this section, the research methodology adopted to carry out the research work is discussed. The steps of the research process, shown briefly in Fig. 6, are as follows.

Research problem definition: protein structure prediction using Differential Evolution and machine learning approaches.

Literature survey: (1) nature inspired algorithms; (2) Differential Evolution; (3) machine learning approaches; (4) cluster structure stability; (5) protein structure prediction.

Identification of research gaps and research objectives: (1) maximum utilization of the fast solving speed of DE; (2) minimizing the drawbacks of DE; (3) using advanced variants of DE for feature selection; (4) solving the cluster structure stability problem; (5) solving the protein structure prediction problem using DE; (6) developing a web application for protein structure prediction using machine learning approaches.

Work done: (1) diversity study of DE and its advances; (2) proposed modification of DE; (3) solving the cluster structure stability problem; (4) solving the protein structure prediction problem using DE; (5) predicting protein structure using machine learning approaches.

These steps are followed by testing over benchmark problems, analysis of the results with comparison against existing solutions, and publications and report writing.

Figure 6: Research methodology


5 Work Done

This section briefly discusses the work done so far.

5.1 Diversity Analysis of Differential Evolution and its Variants

In this work, seven important diversity measures (Table 4) have been used to quantify the diversity of DE and its prominent variants JADE [23], jDE [20], OBDE [22] and SaDE [24]. Table 5 lists the benchmark problems considered for the diversity measures, which quantify the dispersion of individuals in the population. Further, the effect of outliers on the diversity measures has been analyzed.

Table 4: Diversity measure parameters
S.No. | Diversity Measure
1 | Population diameter
2 | Population radius
3 | Average distance around population center
4 | Geometric average distance around the population center
5 | Normalized average distance around the population center
6 | Average of the average distance around all individuals in the population
7 | Population coherence

Table 5: Test problems used for diversity measures
Test Problem | Objective Function | Search Range | I | Acceptable Error
Sphere | $f_1(x) = \sum_{i=1}^{I} x_i^2$ | [-5.12, 5.12] | 30 | 1.0E-15
Griewank | $f_2(x) = 1 + \frac{1}{4000}\sum_{i=1}^{I} x_i^2 - \prod_{i=1}^{I}\cos\!\left(\frac{x_i}{\sqrt{i}}\right)$ | [-600, 600] | 30 | 1.0E-15
Rosenbrock | $f_3(x) = \sum_{i=1}^{I-1}\left(100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2\right)$ | [-30, 30] | 30 | 1.0E-15
Ackley | $f_4(x) = -20\exp\!\left(-0.2\sqrt{\frac{1}{I}\sum_{i=1}^{I} x_i^2}\right) - \exp\!\left(\frac{1}{I}\sum_{i=1}^{I}\cos(2\pi x_i)\right) + 20 + e$ | [-1, 1] | 30 | 1.0E-15
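To make the measures in Table 4 concrete, the following is a minimal sketch of how a few of them can be computed for a population stored as an NP×D array; the exact formulas used in the cited study may differ in detail, so treat these as illustrative definitions.

```python
import numpy as np

def diversity_measures(pop):
    """pop: (NP, D) array, one individual per row. Returns a few Table 4 measures."""
    center = pop.mean(axis=0)
    dists_to_center = np.linalg.norm(pop - center, axis=1)
    pairwise = np.linalg.norm(pop[:, None, :] - pop[None, :, :], axis=2)
    return {
        "diameter": pairwise.max(),              # largest distance between any two individuals
        "radius": dists_to_center.max(),         # farthest individual from the population center
        "avg_distance_to_center": dists_to_center.mean(),
        # normalizing by the population diameter is an assumed normalization choice
        "normalized_avg_distance": dists_to_center.mean() / pairwise.max(),
    }

pop = np.random.uniform(-5.12, 5.12, (50, 30))   # e.g. a Sphere-problem population
print(diversity_measures(pop))
```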

Key findings of this study are as follows:

1. The diversity measures decrease proportionally over the iterations of DE and its variants.
2. A high value of a diversity measure indicates dispersion in the population, while a low value indicates convergence of the individuals to a solution point.
3. The diversity measures are affected to varying degrees by outliers in the population.
4. Some diversity measures are less affected by outliers and could be used for analyzing exploration and exploitation.

Prashant Singh Rana, Mahua Bhattcharya, Anupam Shukla, “A diversity based comparative study for advance variants of Differential Evolution”, International Conference on Soft Computing for Problem Solving (SocPros 2012), Springer, December (2012). [In Press]

16

5.2

Guided Reproduction in Differential Evolution (GRDE) 2

To overcome the drawbacks of classical DE a novel strategy named Guided Reproduction in DE (GRDE) algorithm is proposed. GRDE try to maintain the proper balance between exploration and exploitation,for this two new phases are introduced that maintains the diversity in the population. Figure 7 describes the methodology of GRDE. During experiments, it is observed that classical DE solves the most of the benchmark problems very quickly or unable to solve. GRDE trying to utilize the fast solving capability of DE. In the proposed strategy two new phases are introduced namely phase 2 and phase 3. In the phase 1, the classical DE is executed for finding the required result of the given problem. Therefore phase 1 is executed for 500 iterations that explore and exploit the search space. If the required results are not obtained in the phase 1 then it enters into the phase 2. Both Phase 2 and Phase 3 are run for 50 iterations and consider as learning period. In Phase 2, worst 30% population is replaced by using the best 50% population. Every dimension of new individual is selected from randomly selected individuals corresponding dimensions from the best 50%population. This strategy is considered as guided reproduction of the population. The new individuals are reproduced using the best population, therefore the new population exploits the search space around the best populations. Further, as shown in Figure 7, the DE algorithm is executed for next 50 iterations and checked that the objective is achieved or not. Phase 3 maintains the exploration in which the worst 10% population is replaced by randomly initialized population in the search space. This phase is used to maintain the diversity in the population and to get-ride of stagnation problem. After it, the DE algorithm is again executed for further 50 iteration and checked the objective value. The phase 1 and 2 is repeatedly executed till the termination criteria is not satisfied, therefore it is expected that these two phases maintain the balance between exploration and exploitation, without increasing the function evaluation. The percentage of population selected for reproduction and randomly initialization are adopted based on the empirical study. There may be other strategies also for selection of the percentage of population for regeneration and randomly initialization. DE and GRDE are also compared through standard deviation (SD), mean error (ME), average function evaluations (AF E) and success rate (SR). First SR is compared for both the algorithms and if it is not possible to distinguish the algorithms based on SR then comparison is made on the basis of ME. AF E is used for comparison if it is not 2 Prashant Singh Rana, Harish Sharma, Mahua Bhattacharya, and Anupam Shukla. Guided reproduction in differential evolution. SEAL, volume 7673, pages 117-127. Springer, December 2012.

17

Figure 7: GRDE methodology

possible on the basis of SR and ME both. In Table 6, ’+’ indicates that the goodness of GRDE and ’-’ indicate that there is no significant improvement of GRDE over classical DE. It is observed from Table 6 that the GRDE outperforms for 15 problems compared to the classical DE. GRDE is implemented in C++ and source code and manual is available at http://bit.ly/grde

Table 6: Result comparison of GRDE and DE Test Problem f1 f2 f3 f4 f5 f6 f7 f8

GRDE + + + + +

Test Problem f9 f10 f11 f12 f13 f14 f15 f16

18

GRDE + + + + + -

Test Problem f17 f18 f19 f20 f21 f22 f23

GRDE + + + + +

5.3

Cluster Structure Stability Analysis of Nin (n=2-22) using Nature Inspired Algorithms

3

To solve the cluster structure stability problem, NIA such as GA, DE, PSO and ABC are used with PWSCF (Plane-Wave Self-Consistent Field), DFT (Density Functional Theory) and pseudo-potentials, the minimization of potential energy and relative stability for Nin clusters (n=2-22) were studied. NIA are used for global structural optimizations with space-fixed cartesian coordinates. Clusters obtained from NIA search were further optimized by using first principals total energy calculations. We found that GA gives minimum energy for Nin (n=6,11,16,21), DE gives minimum energy for Nin (n=2,4,7,9,15,17), PSO gives minimum energy for Nin (n=3,10,12,14,18,20) and ABC gives minimum energy for Nin (n=5,8,13,19,22). Here real value encoding scheme is used for the all NIAs. There are two representations method that are available for the atomic cluster, i.e. space-fixed cartesian coordinates [32] and internal-coordinates method. The space-fixed representation required 3n variables whereas internal-coordinates method required n (n-1)/2 variables. Most of the studies have shown that space-fixed cartesian coordinates representation improves the efficiency of the algorithm as compared with internal-coordinates representation, therefore spacefixed cartesian coordinates representation is used. The set of atomic coordinates are represented as x = (x1 , y1 , z1 , x2 , y2 , z2 ,.......) and V (x) is the potential energy, then the objective is to find x that minimize the V (x). F itnessF unction = MIN(V (x))

(4)

The initial population is generated randomly, with each sampled uniformly between (0, √ ω 3 6n), where ω = 2.0591, e.g. a four atom cluster is represented by real string of length 12. x1 5.547

y1 0.035

z1 1.116

x2 2.732

y2 4.917

z2 5.060

x3 1.686

y3 3.890

z3 0.231

x4 5.719

y4 2.749

z4 3.266

At the end of the global minimum structure search, the selected candidates are further optimizes using first principal total energy calculation. Process Flow: Figure 8 describe the flow of process. In the phase 1, the population is initialize randomly then fitness is evaluated, update or generate the new population according to the NIA till the termination criteria is met. The optimize coordinates are 3

Prashant Singh Rana, Sandeep Kumar Jain, Mahua Bhattcharya, Anupam Shukla, “Structure Stability analysis of Nin (n=2-22) using Nature Inspired Algorithms: A Performance Study”, International Journal of Advanced Intelligence Paradigms, InderScience, Nov 2012. [In Press]

19

followed by relaxation in the phase 2. In the final phase 3 the post processing and analysis of the result is carried out. The whole process is automated and the final result is stored in the file that contain the energy getting from NIA and relaxation. The final optimized coordinates are also stored in the separate file. The supplement data contain the detail documentation.

Figure 8: Cluster structure stability methodology

Table 7: Parameter settings for NIA for energy minimization
GA: CR = 0.9, MR = 0.1
DE: CR = 0.9, F = 0.5
PSO: c1 = c2 = 1.5, w = 0.8
ABC: NP = 20, FOOD SOURCE = NP/2

Table 8: Minimum energy obtained from different NIA
GA: Ni_n (n = 6, 11, 16, 21)
DE: Ni_n (n = 2, 4, 7, 9, 15, 17)
PSO: Ni_n (n = 3, 10, 12, 14, 18, 20)
ABC: Ni_n (n = 5, 8, 13, 19, 22)

The parameter setting is very important for NIA. The initial population size is 20 and each algorithm is run for up to 200 generations; the dimension is 3n and every coordinate is bounded in $(0, \omega\sqrt[3]{6n})$. Table 7 summarizes the parameter settings for each NIA.

5.4 Protein Tertiary Structure Prediction using Differential Evolution and Physicochemical Properties

To solve the protein structure prediction problem, DE is used for global structural optimization with space-fixed Cartesian coordinates and six physicochemical properties: total empirical energy, total surface area, non-polar exposed area, Euclidean distance, secondary structure penalty and spatial distribution constraint. Figure 9 briefly describes the methodology used. First, the initial population is generated and the physicochemical properties are measured. After the fitness value is calculated, every individual in the population undergoes the DE operators, viz. mutation, crossover and then selection. The role of DE is to generate candidate conformations for the protein sequence, and selection is based on the objective function defined in eq. (5).

Figure 9: Methodology used for protein structure prediction using DE. Here, real value encoding scheme is used for DE with fixed space cartesian coordinates representation. The set of amino acid coordinates, of protein sequence P , are represented as x = (x1 , y1 , z1 , x2 , y2 , z2 ,.......), then the objective is to find x that have minimum fitness value defined in eq 5. F itnessF unction = MIN(min(f1 (P )) + max(f2 (P )) + max(f3 (P ))+ min(f4 (P )) + min(f5 (P )) + min(f6 (P 6)))

(5)

where: P is the protein sequence, f1 to f6 defined below. The objection is the minimize 4 Prashant Singh Rana, Mahua Bhattcharya, Anupam Shukla, “Protein tertiary structure prediction using DE and physicochemical properties. BMC Bioinformatics, Oxford [Under Review]

21

the sum of all the physicochemical properties. ˆ f1 (P ) is the total empirical energy [? ] of P and defined as:

∆G = Wvdw ∗ ∆Gvdw + WsolvH ∗ ∆GsolvH + WsolvP ∗ ∆GsolvP + ∆Gwb + ∆Ghbond + ∆Gel + ∆GKon + Wmc ∗ T ∗ ∆Smc + Wsc ∗ T ∗ ∆Ssc where ∆Gvdw is the sum of the van der Waals contributions of all atoms with respect to the same interactions with the solvent. ∆GsolvH and ∆GsolvP are the differences in sol0vation energy for apolar and polar groups respectively when these change from the unfolded to the folded state. ∆Ghbond is the free energy difference between the formation of an intra-molecular hydrogen bond compared to intermolecular hydrogen-bond formation (with solvent). ∆Gwb is the extra stabilising free energy provided by a water molecule making more than one hydrogen bond to the protein (water bridges) that cannot be taken into account with non- explicit solvent approximations. ∆Gel is the electrostatic contribution of charged groups, including the helix dipole. ∆Smc is the entropy cost of fixing the backbone in the folded state; this term is dependent on the intrinsic tendency of a particular amino acid to adopt certain dihedral angles. Finally ∆Ssc is the entropic cost of fixing a side chain in a particular conformation. ∆GKon , which reflects the effect of electrostatic interactions on the association constant kon (this applies only to the subunit binding energies) and ∆Str , which is the loss of translational and rotational entropy that ensues on formation of the complex. The latter term cancels out when we are looking at the effect of point mutations on complexes. The energy values of ∆Gvdw , ∆GsolvH , ∆GsolvP and∆Ghbond attributed to each atom type have been derived from a set of experimental data, and ∆Smc have been taken from theoretical estimates. The terms Wvdw , WsolvH , WsolvP , Wmc andWsc correspond to the weighting factors applied to the raw energy terms. They are all 1, except for the van der Waals contribution which is 0.33 (the van der Waals contributions are derived from vapor to water energy transfer, while in the protein we are going from solvent to protein). ˆ f2 (P ) calculate the total surface area of P . ˆ f3 (P ) calculate the non polar exposed area of P . ˆ f4 (P ) calculate the euclidian distance of P . ˆ f5 (P ) calculate the secondary structure penalty of P . ˆ f6 (P ) calculate the spacial distribution constraint on P .
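A sketch of how the composite objective in eq. (5) can be assembled, assuming each property function returns a scalar for a candidate coordinate vector; turning the maximized terms (f2, f3) into negated penalties is one standard way to fold them into a single minimization, and is an assumption of this sketch rather than the report's exact formulation.

```python
def composite_fitness(x, props):
    """props: dict of callables f1..f6 mapping a coordinate vector to a scalar.
    f1, f4, f5, f6 are to be minimized; f2, f3 are to be maximized (so they
    enter with a negative sign in this single-objective formulation)."""
    return (props["f1"](x)          # total empirical energy (minimize)
            - props["f2"](x)        # total surface area (maximize)
            - props["f3"](x)        # non-polar exposed area (maximize)
            + props["f4"](x)        # Euclidean distance term (minimize)
            + props["f5"](x)        # secondary structure penalty (minimize)
            + props["f6"](x))       # spatial distribution constraint (minimize)
```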

The initial population is generated randomly, with each coordinate sampled uniformly from $(0, \omega\sqrt[3]{6n})$, where ω = 2.0591; e.g. a four-amino-acid sequence of protein P is represented by a real string of length 12:

x1 = 5.547, y1 = 0.035, z1 = 1.116, x2 = 2.732, y2 = 4.917, z2 = 5.060, x3 = 1.686, y3 = 3.890, z3 = 0.231, x4 = 5.719, y4 = 2.749, z4 = 3.266

Table 9 summarizes the parameter settings for GA, PSO and DE. The initial population size is 100 and each algorithm is run for up to 1400 generations; the dimension is 3n and every coordinate is bounded in $(0, \omega\sqrt[3]{6n})$. At the end of the global minimum structure search, the selected candidates are further optimized using first-principles total energy calculations. Figure 10 shows the total empirical energy obtained over the generations; the results clearly show that DE converges faster than GA and PSO.

Table 9: Parameter settings for PSP using GA, PSO and DE
GA: CR = 0.9, MR = 0.1
PSO: c1 = c2 = 1.5, w = 0.8
DE: CR = 0.9, F = 0.5

Figure 10: Result comparison of protein structure prediction using GA, PSO and DE.

5.5 Protein Tertiary Structure Data Set

The raw experimental data from CASP-5 to CASP-9 were downloaded from the Protein Prediction Center [40]. The raw data were then filtered, normalized and transformed into an experimental data set. After cleansing and removing duplicate entries from each data set, CASP-5, CASP-6, CASP-7, CASP-8 and CASP-9 have data counts of 3579, 6728, 7025, 13191 and 15207 respectively, for a total of 45730 decoys in CASP-(5-9) (Table 10). Each data set is in CSV (comma separated values) format and is available in the UCI Machine Learning Repository (web link: http://archive.ics.uci.edu/ml/datasets/ProteinTertiaryStructureDS). Tables 10 and 11 describe the data set and its properties.

The dataset is multivariate, widely distributed and highly overlapping. Table 13 shows a sample of the data set, and the features are described in Table 12. Each data set has nine real-valued features, and the size (RMSD) of each decoy lies between 0 Å and 21 Å; F1 to F9 are the features and the class is the size of the decoy in Å.

Table 10: Data set used in Protein Structure Prediction
S.No. | Data Set | Data Count (No. of Decoys)
1 | CASP-5 | 3579
2 | CASP-6 | 6728
3 | CASP-7 | 7025
4 | CASP-8 | 13191
5 | CASP-9 | 15207
Total | CASP-(5-9) | 45730

Table 11: Description of the data set used in Protein Structure Prediction
Number of features: 9
Feature characteristics: Real
No. of classes (RMSDs): 0 Å to 21 Å
Dataset characteristics: Multivariate
Associated tasks: Regression
Missing values: No

Table 12: Description of the features used in Protein Structure Prediction
F1: Total surface area
F2: Non-polar exposed area
F3: Fractional area of exposed non-polar residue
F4: Fractional area of exposed non-polar part of residue
F5: Molecular-mass-weighted exposed area
F6: Average deviation from standard exposed area of residue
F7: Euclidean distance
F8: Secondary structure penalty
F9: Spatial distribution constraint (NK value)
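Given this CSV layout, loading one of the data sets is straightforward; a minimal sketch (the file name is hypothetical, and the column headers are assumed to match the Table 13 layout; the archive itself is hosted at the UCI link above):

```python
import pandas as pd

df = pd.read_csv("CASP9.csv")

X = df[[f"F{i}" for i in range(1, 10)]]   # nine physicochemical features F1..F9
y = df["Class"]                           # decoy size (RMSD) in angstroms, 0-21
print(X.shape, y.describe())
```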

Table 13: Sample dataset used in Protein Structure Prediction
Class | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9
4.5 | 5769.3 | 1634.9 | 0.3 | 57.0 | 769460.0 | 89.6 | 2705.3 | 33 | 46.55
3.0 | 12962.3 | 3389.2 | 0.3 | 141.3 | 1761875.6 | 159.8 | 3971.0 | 186 | 32.37
12.7 | 5960.2 | 2230.7 | 0.4 | 64.3 | 845557.5 | 92.7 | 2246.9 | 60 | 28.65
11.5 | 9926.8 | 3276.7 | 0.3 | 102.0 | 1386953.4 | 159.7 | 4095.0 | 71 | 31.98
14.9 | 6658.5 | 2590.6 | 0.4 | 62.2 | 951212.1 | 92.8 | 2366.4 | 65 | 30.88
2.5 | 12272.7 | 2836.1 | 0.2 | 140.0 | 1665676.7 | 157.2 | 3950.5 | 226 | 38.64
2.2 | 12579.2 | 3473.6 | 0.3 | 129.4 | 1737181.1 | 188.4 | 4186.3 | 71 | 33.55
18.8 | 11969.7 | 4721.9 | 0.4 | 110.1 | 1570036.7 | 187.2 | 4553.8 | 82 | 32.30
19.4 | 21779.3 | 8269.9 | 0.4 | 250.2 | 2957489.5 | 393.6 | 5686.3 | 171 | 35.60
19.6 | 9020.8 | 2509.4 | 0.3 | 97.9 | 1239283.1 | 134.4 | 3717.6 | 62 | 31.79

Protein Tertiary Structure Prediction using Physicochemical Properties, SaDE and Machine Learning Approach 6

Physicochemical properties of a protein always guide us in determining the quality of the protein structure; therefore they have been rigorously used to distinguish native or native-like structures from other predicted structures. Here, we extracted eight physicochemical properties, such as area, secondary structure penalty, geometrical constraints and all-atom energy, from the protein structure prediction center (CASP) for 45730 decoys whose RMSD space lies between 0 Å and 20.99 Å. A real-coded Self-adaptive DE (SaDE) is used for feature selection, giving an optimal weight to each feature. The selected features are used by nine machine learning approaches, including random forest, support vector machine, neural network and linear model; further machine learning methods such as M5P, cubist, foba and decision stump are also tested for predicting a protein structure's deviation from its true native state. Our work supports that the random forest method outperforms the other machine learning approaches in protein structure prediction. The performance results show an average R2 of 0.85, an average RMSE of 2.35 Å and an average prediction accuracy of 77.68% with an acceptable error of ±2 Å. The data used in the study and the source code are available at http://bit.ly/native_regression.

The approach is described in Fig. 12. The first and second steps are explained in Section 5.5. Third, a real-coded Self-adaptive DE (SaDE) is used for feature selection [24]; feature selection makes the regression and classification models efficient and accurate. Table 12 describes the feature information, and Eq. 6 defines the objective function used for feature selection. Fourth, all nine machine learning approaches (Table 14) were trained and tested on each data set with default parameters (a rough Python analogue is sketched after Table 14). Finally, the evaluation of the models is done on RMSE (Root Mean Square Error), R2 (Coefficient of Determination) and Accuracy (see the metric sketch after Table 15).

Prashant Singh Rana, Mahua Bhattacharya, Anupam Shukla, "Protein structure prediction using physicochemical properties and SaDE based machine learning approaches", IEEE Transactions on Computational Biology and Bioinformatics. [Under Review]


Table 14: Regression machine learning methods

      Model            Model Type   Method          Package        Tuning Parameter(s)       Ref.
M1    Decision Trees   Regression   C5.0            C50            winnow, model, trials     [41]
M2    Random Forest    Regression   rf              randomForest   mtry                      [42]
M3    SVM              Regression   svm             e1071          nu, epsilon               [43]
M4    LM               Regression   lm              stats          None                      [44]
M5    NN               Regression   neuralnet       neuralnet      layer1, layer2, layer3    [45]
M6    M5P              Regression   M5P             RWeka          rules, pruned, smoothed   [46]
M7    DecisionStump    Regression   DecisionStump   RWeka          rules, pruned, smoothed   [47]
M8    cubist           Regression   cubist          Cubist         committees, neighbors     [48]
M9    foba             Regression   foba            foba           lambda, k                 [49]
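The experiments themselves use the R packages listed in Table 14 with default parameters; as a rough Python analogue (an assumption, not the report's code), the best-performing model, random forest, could be trained as follows:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("casp_5_9.csv")  # hypothetical combined file: RMSD, F1..F9
X = df[[f"F{i}" for i in range(1, 10)]].values  # the nine physicochemical features
y = df["RMSD"].values                           # class: decoy RMSD in angstroms

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
```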

Figure 11: Snapshot of web application for protein structure prediction using machine learning approaches; Web link - http://www.psrana.com/psp.html

Table 15: Avg. performance comparison* of all nine regression models on all data sets

      Model           RMSE   R2     Accuracy%   Learn Time (s)   Prediction Time (s)   Total Time (s)
M1    DecisionTree    3.51   0.65   60.13       0.307            0.006                 0.313
M2    RandomForest    2.31   0.86   78.47       45.397           0.380                 45.777
M3    SVM             2.97   0.73   69.68       7.709            0.925                 8.634
M4    LM              4.11   0.49   44.03       0.017            0.006                 0.023
M5    NN              4.10   0.49   44.20       3.319            0.013                 3.332
M6    M5P             2.73   0.79   72.41       1.567            0.040                 1.608
M7    DecisionStump   4.84   0.26   28.01       0.079            0.094                 0.174
M8    cubist          2.38   0.84   79.11       1.407            0.100                 1.507
M9    foba            4.13   0.48   43.66       0.159            0.005                 0.164

* Best three are underlined and bold.
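For completeness, the three reported measures can be computed as in the sketch below, where the ±2 Å "acceptable error" accuracy follows the definition used in the text (a minimal sketch, assuming arrays of true and predicted RMSDs):

```python
import numpy as np

def evaluate(y_true, y_pred, tol=2.0):
    """Return RMSE, R2 and the accuracy within +/- tol angstroms."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    acc = 100.0 * np.mean(np.abs(y_true - y_pred) <= tol)
    return rmse, r2, acc
```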

Figure 12: Methodology used for protein structure prediction using regression models

Figure 13: Methodology used for classification of proteins using classification models

$$\mathrm{ObjectiveFunction} = \min \sqrt{\sum_{i=1}^{T} \left( \alpha_i - \sum_{j=1}^{n} w_j \cdot F_{i,j} \right)^{2}} \qquad (6)$$

where T is the total number of instances in the training data set, αi is the class of the i-th training instance, n is the number of features, Fi,j is the j-th feature value of the i-th instance, and wj ∈ [0, 1] is the weight given to each feature.
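A direct transcription of Eq. 6 into Python (a sketch; in the actual pipeline the weight vector w would be evolved by SaDE or OBDE rather than fixed):

```python
import numpy as np

def feature_weight_objective(w, F, alpha):
    """Eq. 6: root of the summed squared error between the weighted
    feature sum and the class values.

    w: feature weights in [0, 1], shape (n,)
    F: feature matrix, shape (T, n)
    alpha: class values, shape (T,)
    """
    return np.sqrt(np.sum((alpha - F @ w) ** 2))
```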

5.7 Classification of Protein Tertiary Structure using Physicochemical Properties, OBDE and Machine Learning

Evolutionary information, such as the physicochemical properties of a protein, always guides us in determining the quality of the protein structure; therefore it has been rigorously used to distinguish native or native-like structures from other modelled or predicted structures. Here, we extracted eight physicochemical properties, such as area, secondary structure penalty, geometrical constraints and all-atom energy, from the protein structure prediction center (CASP) for 45730 decoys whose RMSD space lies between 0 Å and 21 Å. A real-coded OBDE is used for feature selection, giving an optimal weight to each feature. The selected features are used by five machine learning approaches, namely Decision Trees, SVM, LM, Random Forest and Naive Bayesian, for predicting a protein structure's deviation from its true native state. Our work supports that the random forest method outperforms the other machine learning methods. The performance results show an average R2 of 0.91, an average RMSE of 2.03 Å and an average prediction accuracy of 82.72% with an acceptable error of ±2 Å. The data used in the study and the source code are available at http://bit.ly/native_classification.

7 Prashant Singh Rana, Mahua Bhattacharya, Anupam Shukla, "Classification of proteins from its native structure using evolutionary information", Genomics Proteomics and Bioinformatics (Elsevier). [Under Review]



The approach is described in Fig. 13. The first and second steps are explained in Section 5.5. Third, a real-coded OBDE is used for feature selection [22]; Table 12 describes the feature information, and feature selection makes the regression and classification models efficient and accurate. Eq. 6 defines the objective function used for feature selection. Fourth, all five machine learning approaches (Table 16) were trained and tested on each data set with default parameters (a rough Python analogue follows Table 16). Finally, the evaluation of the models is done on RMSE (Root Mean Square Error), R2 (Coefficient of Determination) and Accuracy.

Table 16: Classification machine learning methods

      Model            Model Type       Method       Package        Tuning Parameter(s)     Ref.
M1    Decision Trees   Classification   C5.0         C50            winnow, model, trials   [41]
M2    SVM              Classification   ksvm         e1071          nu, epsilon             [43]
M3    LM               Classification   lm           stats          None                    [44]
M4    Random Forest    Classification   rf           randomForest   mtry                    [42]
M5    Naive Bayesian   Classification   naiveBayes   e1071          None                    [50]
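As with the regression study, the experiments use the R methods of Table 16 with default parameters; a rough Python analogue for the best model, random forest, is sketched below, where binning the RMSDs into integer classes is an assumption based on the 0 Å to 21 Å class description:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("casp_5_9.csv")  # hypothetical combined file: RMSD, F1..F9
X = df[[f"F{i}" for i in range(1, 10)]].values
y = df["RMSD"].round().astype(int).values  # assumed discretisation into 0-21 A classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # exact-class accuracy on the held-out decoys
```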

Table 17: Avg. performance comparison of classification models on all data sets

      Model          RMSE   R2     Accuracy%   Learn Time (s)   Prediction Time (s)   Total Time (s)
M1    DecisionTree   5.19   0.33   60.31       3.60             0.01                  3.60
M2    SVM            3.21   0.74   74.05       57.84            12.63                 70.47
M3    LM             4.47   0.50   63.66       27.10            0.02                  27.12
M4    RandomForest   2.03   0.91   82.72       10.63            0.41                  11.04
M5    Naive Bayes    8.66   0.10   26.70       0.05             1.70                  1.75

6 Research Contribution

This section briefly discusses the research contributions.

1. Study of the diversity analysis of DE and its advanced variants (Section 5.1).
2. GRDE: a new variant of DE is developed (Section 5.2).
3. Solved the cluster structure stability of Nin (n=2-22) using NIA (Section 5.3).
4. Solved protein tertiary structure prediction using DE; implemented in Bhageerath, the protein tertiary structure prediction server at IIT Delhi (Section 5.4).
5. Protein tertiary structure prediction using physicochemical properties, SaDE and machine learning approaches (Section 5.6).
6. Classification of protein tertiary structure using OBDE and machine learning approaches (Section 5.7).
7. Used SaDE and OBDE for feature selection (Sections 5.6 and 5.7).

Figure 14: Research contribution

8. Developed a web-based application for protein structure prediction based on physicochemical properties (Figure 11). Web link - http://www.psrana.com/psp.html
9. Donated the protein tertiary structure data set to the UCI Machine Learning Repository (Section 5.5). Web link - http://archive.ics.uci.edu/ml/datasets/ProteinTertiaryStructureDS

7 Conclusion and Future Work

The complexity of real-world optimization problems is increasing day by day, and this demands robust, fast and accurate optimizers in various fields. NIA such as PSO, ABC, DE and GA, together with their advanced variants, address global optimization over continuous and discrete spaces. This report attempted to provide a brief review of DE and its advanced variants, machine learning approaches, the cluster structure stability problem and the protein structure prediction problem. Furthermore, some inherent drawbacks of NIA are analyzed and research gaps are identified accordingly. On the basis of these research gaps, the report suggests some research objectives to surmount the drawbacks. Since there is a high demand for fast and accurate optimization methods for complex real-world optimization problems, a new modification of DE, namely GRDE, is proposed in order to reduce premature convergence and stagnation. SaDE and OBDE are used for feature selection. Two real-world problems are solved using DE and machine learning approaches: (i) cluster structure stability and (ii) protein structure prediction. Future work may include:

• Many new machine learning approaches are available; they need to be explored for accurate and fast results.
• New variants of DE can be used for feature selection and multi-objective optimization.
• DE algorithms can be applied to other real-world bioinformatics optimization problems, such as prediction of novel genes or novel drug combinations.
• New feature selection methods, such as Binary Random Forest, can be used.

8 Organization of Thesis

The thesis will be organized into eight chapters. A brief outline is given below:

• Chapter 1: Introduces basic optimization, DE and its advanced variants, machine learning approaches, the cluster structure stability problem and the protein structure prediction problem. It also presents a survey of the literature in the related areas and identifies the prominent research gaps. Further, research objectives are defined along with the research methodology used in this study.
• Chapter 2: Studies the diversity analysis of DE and its advanced variants.
• Chapter 3: Describes the proposed strategy, GRDE, which utilizes the fast solving speed of DE while maintaining a proper balance between exploration and exploitation.
• Chapter 4: Analyzes cluster structure stability using NIA.
• Chapter 5: Solves the protein structure prediction problem using DE.
• Chapter 6: Describes the prediction of protein structure using physicochemical properties and SaDE-based machine learning approaches.
• Chapter 7: Classifies protein structures using evolutionary information, OBDE and classification machine learning approaches.
• Chapter 8: Summarizes the key findings and main contributions of the thesis and lists possible future research directions.

9 List of Publications

Published

1. Prashant Singh Rana, Sandeep Kumar Jain, Mahua Bhattacharya, Anupam Shukla, "Structure Stability Analysis of Nin (n=2-22) using Nature Inspired Algorithms: A Performance Study", International Journal of Advanced Intelligence Paradigms, InderScience, November 2012. [In Press]
2. Prashant Singh Rana, Harish Sharma, Mahua Bhattacharya, and Anupam Shukla, "Guided Reproduction in Differential Evolution", SEAL, volume 7673, pages 117-127, Springer, December 2012.
3. Prashant Singh Rana, Mahua Bhattacharya, Anupam Shukla, "A Diversity Based Comparative Study for Advance Variants of Differential Evolution", International Conference on Soft Computing for Problem Solving (SocProS 2012), Springer, December 2012. [In Press]

Under Review

1. Prashant Singh Rana, Mahua Bhattacharya, Anupam Shukla, "Protein structure prediction using physicochemical properties and SaDE based machine learning approaches", IEEE Transactions on Computational Biology and Bioinformatics. [Under Review]
2. Prashant Singh Rana, Mahua Bhattacharya, Anupam Shukla, "Classification of proteins from its native structure using evolutionary information", Genomics Proteomics and Bioinformatics (Elsevier). [Under Review]


Key References

[1] E. Bonabeau, M. Dorigo, and G. Theraulaz. Swarm intelligence: from natural to artificial systems. Oxford University Press, 1999.
[2] J.H. Holland. Adaptation in natural and artificial systems. MIT University Press, 1975.
[3] K.V. Price. Differential evolution: a fast and simple numerical optimizer. In IEEE Conference on Fuzzy Information Processing Society, volume 4, pages 524–527, 1996.
[4] J. Kennedy and R. Eberhart. Particle swarm optimization. In IEEE International Conference on Neural Networks, volume 4, pages 1942–1948, 1995.
[5] D. Karaboga. An idea based on honey bee swarm for numerical optimization. Technical report, Erciyes University Press, 2005.
[6] M. Dorigo and T. Stützle. Ant colony optimization. MIT University Press, 2004.
[7] K.M. Passino. Bacterial foraging optimization. International Journal of Swarm Intelligence Research (IJSIR), IGI Global, 1:1–16, 2010.
[8] D. Karaboga and B. Akay. A comparative study of artificial bee colony algorithm. Applied Mathematics and Computation, Elsevier, 214(1):108–132, 2009.
[9] J. Vesterstrom and R. Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In IEEE Congress on Evolutionary Computation (CEC), volume 2, pages 1980–1987, 2004.
[10] E. Mezura-Montes, J. Velázquez-Reyes, and C.A. Coello Coello. A comparative study of differential evolution variants for global optimization. In ACM Eighth Annual Conference on Genetic and Evolutionary Computation, pages 485–492, 2006.
[11] A.P. Engelbrecht. Computational intelligence: an introduction. Wiley, 2007.
[12] R. Gamperle, S.D. Muller, and A. Koumoutsakos. A parameter study for differential evolution. Advances in Intelligent Systems, Fuzzy Systems and Evolutionary Computation, Citeseer, 10:293–298, 2002.
[13] H.A. Abbass. The self-adaptive pareto differential evolution algorithm. In IEEE Congress on Evolutionary Computation (CEC), volume 1, pages 831–836, 2002.
[14] D. Zaharie. Control of population diversity and adaptation in differential evolution algorithms. In Proceedings of MENDEL, Citeseer, pages 41–46, 2003.
[15] S. Das, A. Konar, and U.K. Chakraborty. Two improved differential evolution schemes for faster global search. In ACM International Conference on Genetic and Evolutionary Computation, pages 991–998, 2005.
[16] S. Das, A. Konar, and U.K. Chakraborty. Improved differential evolution algorithms for handling noisy optimization problems. In IEEE Congress on Evolutionary Computation (CEC), volume 2, pages 1691–1698, 2005.
[17] M. Omran, A. Salman, and A. Engelbrecht. Self-adaptive differential evolution. Computational Intelligence and Security, pages 192–199, 2005.
[18] N. Noman and H. Iba. Enhancing differential evolution performance with local search for high dimensional function optimization. In ACM International Conference on Genetic and Evolutionary Computation, pages 967–974, 2005.
[19] R. Angira and B.V. Babu. Performance of modified differential evolution for optimal design of complex and non-linear chemical processes. Journal of Experimental & Theoretical Artificial Intelligence, Taylor & Francis, 18(4):501–512, 2006.
[20] J. Brest, S. Greiner, B. Boskovic, M. Mernik, and V. Zumer. Self-adapting control parameters in differential evolution: a comparative study on numerical benchmark problems. IEEE Transactions on Evolutionary Computation, 10(6):646–657, 2006.
[21] J.Y. Yan, Q. Ling, and D.M. Sun. A differential evolution with simulated annealing updating method. In IEEE International Conference on Machine Learning and Cybernetics, pages 2103–2106, 2006.


[22] S. Rahnamayan, H.R. Tizhoosh, and M.M.A. Salama. Opposition-based differential evolution. IEEE Transactions on Evolutionary Computation, 12(1):64–79, 2008.
[23] J. Zhang and A.C. Sanderson. JADE: adaptive differential evolution with optional external archive. IEEE Transactions on Evolutionary Computation, 13(5):945–958, 2009.
[24] A.K. Qin, V.L. Huang, and P.N. Suganthan. Differential evolution algorithm with strategy adaptation for global numerical optimization. IEEE Transactions on Evolutionary Computation, 13(2):398–417, 2009.
[25] F. Neri and V. Tirronen. Scale factor local search in differential evolution. Memetic Computing, Springer, 1(2):153–171, 2009.
[26] S. Das, A. Abraham, U.K. Chakraborty, and A. Konar. Differential evolution using a neighborhood-based mutation operator. IEEE Transactions on Evolutionary Computation, 13(3):526–553, 2009.
[27] L. Qu, D. He, and Y. Li. Self-adaptive improved differential evolution algorithm. In Sixth IEEE International Conference on Natural Computation (ICNC), volume 5, pages 2472–2475, 2010.
[28] Y. Wang, Z. Cai, and Q. Zhang. Differential evolution with composite trial vector generation strategies and control parameters. IEEE Transactions on Evolutionary Computation, 15(1):55–66, 2011.
[29] H.M. Elragal, M.A. Mangoud, and M.T. Alsharaa. Hybrid differential evolution and enhanced particle swarm optimisation technique for design of reconfigurable phased antenna arrays. Microwaves, Antennas & Propagation, IET, 5(11):1280–1287, 2011.
[30] J. Brest, B. Boskovic, A. Zamuda, I. Fister, and M.S. Maucec. Self-adaptive differential evolution algorithm with a small and varying population size. In IEEE Congress on Evolutionary Computation (CEC), volume 2, pages 1–8, 2012.
[31] S.M. Islam, S. Das, S. Ghosh, S. Roy, and P.N. Suganthan. An adaptive differential evolution algorithm with novel mutation and crossover strategies for global numerical optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(2):482–500, 2012.
[32] J.A. Niesse and H.R. Mayne. Global geometry optimization of atomic clusters using a modified genetic algorithm in space-fixed coordinates. The Journal of Chemical Physics, 105:47–51, 1996.
[33] M.R. Hoare. Structure and dynamics of simple microclusters. Advances in Chemical Physics, Wiley Online Library, pages 49–135, 1979.
[34] J.A. Northby. Structure and binding of Lennard-Jones clusters: 13 ≤ N ≤ 147. The Journal of Chemical Physics, 87:6166, 1987.
[35] D.M. Deaven and K.M. Ho. Molecular geometry optimization with a genetic algorithm. Physical Review Letters, APS, 75(2):288–291, 1995.
[36] L.T. Wille and J. Vennik. Computational complexity of the ground-state determination of atomic clusters. Journal of Physics A: Mathematical and General, IOP Publishing, 18:419–425, 1985.
[37] Robert Zwanzig, Attila Szabo, and Biman Bagchi. Levinthal's paradox. Proceedings of the National Academy of Sciences, 89(1):20–22, 1992.
[38] J. Lampinen and I. Zelinka. On stagnation of the differential evolution algorithm. In Proceedings of MENDEL, Citeseer, pages 76–83, 2000.
[39] G. Zhu and S. Kwong. Gbest-guided artificial bee colony algorithm for numerical function optimization. Applied Mathematics and Computation, Elsevier, 217(7):3166–3173, 2010.
[40] CASP. Protein Structure Prediction Center, 2012. www.predictioncenter.org.
[41] J.R. Quinlan. Induction of decision trees. Machine Learning, Springer, 1(1):81–106, 1986.
[42] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News (http://CRAN.R-project.org/doc/Rnews/), 2(3):18–22, 2002.
[43] S.S. Keerthi and E.G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, Springer, 46(1):351–360, 2002.
[44] J.M. Chambers. Computational methods for data analysis. Applied Statistics, Wiley, 1(2):1–10, 1977.


[45] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586–591, 1993.
[46] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I.H. Witten. Using model trees for classification. Machine Learning, Springer, 32(1):63–76, 1998.
[47] M. Dettling and P. Bühlmann. Boosting for tumor classification with gene expression data. Bioinformatics, Oxford University Press, 19(9):1061–1069, 2003.
[48] RuleQuest. Data Mining with Cubist (http://www.rulequest.com/cubist-info.html, last accessed: 10 Feb 2013), 2006.
[49] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011.
[50] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, Springer, 29(2):131–163, 1997.
