### Aggregate gene function prediction

### Supplementary figures

Benchmarking ‘off-the-shelf’ machine learning algorithms against the participants of the MouseFunc critical assessment. (A) The performance of the submissions to the MouseFunc critical assessment. Biological process terms from the Gene Ontology, in three intervals of annotation counts, are used to compare the participants. The participants were: A, calibrated ensembles of SVMs; B, kernel logistic regression; C, GeneMANIA; D, multi-label hierarchical classification and Bayesian integration; E, a combination of classifier ensembles and gene networks; F, GeneFAS; and H, function prediction through query retrieval. (B) ‘Off-the-shelf’ algorithms benchmarked on the parsed biological data from participant C (GeneMANIA). The ‘off-the-shelf’ machine learning algorithms perform comparably to participant C (A, neighbor voting; B, logistic regression; C, GeneMANIA; D, random forest; E, stochastic gradient descent; F, passive aggressive).

Comparing the overlap of the different performance metrics. The metrics used are: (i) area under the ROC curve (AUROC), (ii) area under the precision-recall curve (AUPR), (iii) maximum F-measure (Fmax), and (iv) precision at 20% recall (P20R). Algorithms perform at similar levels on different data sources, but the overlap between metrics is only 78-95%. We ran the six algorithms on four data resources: protein-protein interaction (PPI), sequence similarity (SQ), co-expression (Co), and semantic similarity (SM). The six algorithms are: neighbor voting (NV), GeneMANIA (GM), logistic regression (LR), random forests (RF), passive aggressive (PA), and stochastic gradient descent (SGD).
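As an illustration of the P20R metric, precision at a fixed recall level can be read off the precision-recall curve. The sketch below uses scikit-learn; the helper `precision_at_recall` is our own illustration, not the code used to produce the figures:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, scores, target_recall=0.20):
    # precision_recall_curve returns points ordered by increasing
    # threshold, so recall decreases along the arrays
    precision, recall, _ = precision_recall_curve(y_true, scores)
    # Precision at the smallest achievable recall >= the target
    return precision[recall >= target_recall][-1]

# Toy example: 4 positives among 10 genes, ranked by prediction score
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
p20r = precision_at_recall(y_true, scores)
```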

### Machine learning algorithms

We selected a set of machine learning algorithms with distinct mathematical formulations. The first two, neighbor voting and GeneMANIA, are network-inference algorithms: they infer novel function-associated genes from the network connectivity. The other four are well-established machine learning algorithms. Logistic regression and random forest are offline (eager) learning algorithms, while stochastic gradient descent and passive aggressive are applied in an online (lazy) setting, meaning the algorithm processes its training data sample by sample in a serial sequence.

Machine learning algorithms applied to gene function prediction are typically built into rather complicated, task-specific pipelines. In this study we instead picked ‘default’ implementations of machine learning algorithms and benchmarked them on the recently organized MouseFunc assessment. Finally, we constructed meta-learners (referred to in the manuscript as algorithm aggregation), which lets us quantify the performance improvement obtained by aggregation.

We compared aggregation based upon averaging the probabilistic output scores with aggregation based upon correlation-based feature selection (CFS). Correlation-based feature selection, in analogy with an ensemble algorithm, retains learners that are uncorrelated with each other (i.e., each provides additional predictive power) yet correlated with the output label. In the following sections, a concise description of each algorithm is provided.
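Aggregation by averaging simply takes the mean of each learner's probabilistic output scores per gene; a minimal numpy sketch (illustrative only, not the study's code):

```python
import numpy as np

def aggregate_by_averaging(score_vectors):
    # Stack one score vector per algorithm and average per gene
    return np.mean(np.vstack(score_vectors), axis=0)

# Example: three algorithms scoring four genes
scores = [np.array([0.9, 0.2, 0.6, 0.1]),
          np.array([0.8, 0.3, 0.5, 0.2]),
          np.array([0.7, 0.1, 0.7, 0.3])]
combined = aggregate_by_averaging(scores)
```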

##### Neighbor voting

Neighbor voting is one of the first algorithms used for gene function prediction. In essence, it applies guilt-by-association (GBA) on the interactome: each gene is scored as the fraction of its network connections that go to function-associated genes, i.e., the ratio of its connections to function-associated genes over its total number of connections in the network.

This algorithm has the advantage of being very fast and having a simple mathematical formulation; its drawback is that it only predicts novel candidates that are direct neighbors of function-associated genes (the local consistency constraint). Following the application of neighbor (majority) voting in PFP, newly proposed algorithms introduced heuristics for scoring candidates beyond the direct neighbors of function-associated genes. The next algorithm, GeneMANIA, applies a label propagation heuristic for scoring novel candidates.
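The fraction-of-annotated-neighbors rule described above can be sketched in a few lines of numpy (a toy illustration, not the code used in the study):

```python
import numpy as np

def neighbor_voting(adjacency, labels):
    # Connections to function-associated genes, per gene
    annotated = adjacency @ labels
    # Total connections per gene
    degree = adjacency.sum(axis=1)
    # Fraction of a gene's connections that go to annotated genes
    return np.where(degree > 0, annotated / np.maximum(degree, 1), 0.0)

# Toy network: gene 0 is connected to genes 1 and 2; gene 1 is annotated
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
y = np.array([0, 1, 0])
nv_scores = neighbor_voting(A, y)
```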

##### GeneMANIA

GeneMANIA (Multiple Association Network Integration Algorithm) is a fast, real-time machine learning algorithm that was the best overall performer in the MouseFunc I benchmark. The algorithm scores genes in two distinct steps: (1) linear regression with ridge regularization to compute a single consensus functional association network from heterogeneous “omics” networks, and (2) Gaussian field label propagation on this consensus network.

The Gaussian label propagation incorporates both a local and a global consistency constraint. To overcome the limitation of neighbor voting (the local consistency constraint), GeneMANIA is less restrictive: by iteratively propagating label scores it assigns similar scores to nodes that are part of the same network module.
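The propagation step can be sketched with an iterative local-and-global-consistency update on a toy symmetric network. This is a simplified illustration, not GeneMANIA's actual solver, and `alpha` is an assumed propagation weight:

```python
import numpy as np

def label_propagation(W, y, alpha=0.9, n_iter=100):
    # Symmetrically normalize the network: S = D^(-1/2) W D^(-1/2)
    d = W.sum(axis=1)
    d[d == 0] = 1.0  # guard against isolated nodes
    S = W / np.sqrt(np.outer(d, d))
    # Iteratively propagate scores while re-injecting the known labels
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * (S @ f) + (1 - alpha) * y
    return f

# Toy chain network 0-1-2 with gene 0 annotated: the score spreads through
# the module, reaching gene 2 even though it is not a direct neighbor
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
y = np.array([1, 0, 0])
f = label_propagation(W, y)
```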

##### Logistic regression

Logistic regression is one of the most traditional classification algorithms in machine learning. In this study we used logistic regression as implemented in LIBLINEAR. This very popular implementation applies a coordinate descent method and solves the dual form of logistic regression, whereas most implementations solve the primal problem. Coordinate descent is an optimization method that iteratively updates one block of variables at a time by solving the corresponding subproblem. This is an alternative to the kernel-based logistic regression proposed for gene function prediction.

We applied L2-regularized logistic regression with an inverse regularization strength C of 1.0. Class weights are set inversely proportional to class frequencies to compensate for the class imbalance.

```python
model = LogisticRegression(penalty='l2', dual=True, class_weight="auto", C=1.0, tol=0.0001)
model.fit(network, labels)
scores = model.predict_proba(network)
```

##### Random forest

Random forest is a popular ensemble machine learning algorithm based upon decision trees. Decision trees are among the first rule-based algorithms proposed in artificial intelligence (AI): a rule-based tree is learned from the training data through divide-and-conquer induction of decision rules, and the leaves of the tree represent the final decision. Ensemble methods combine a collection of “weaker”, uncorrelated learners and fuse them into a “stronger” learner. Ensembling, also called aggregation, is a method to avoid overfitting, and random-forest-based estimators are often very robust.

Random forest classifiers are applied to a wide range of bioinformatics tasks, such as prediction of protein-protein interactions and variant prioritization; random forest has also been applied to PFP.

The number of trees in the forest is set to 300, and the criterion to split a node is the Gini impurity. The maximum number of features (proteins) considered when building one tree is the square root of the total number of proteins.

```python
model = RandomForestClassifier(n_estimators=nTree, criterion="gini", compute_importances=True,
                               bootstrap=True, max_features="auto", n_jobs=8)
model.fit(network, labels)
scores = model.predict_proba(network)
```

##### Stochastic gradient descent

Stochastic gradient descent, an example of a stochastic approximation algorithm, has recently become more popular in machine learning because it is well suited to ‘big-data’ analysis, i.e., problems with high sample size and high dimensionality. Stochastic gradient descent is widely used for parameter inference, for back-propagation in neural networks, etc.

In an online setting, stochastic gradient descent randomly selects a sample from the data set and uses it to update the weight vector of the classifier so as to minimize the loss function.

We used L2-regularized stochastic gradient descent with a modified Huber loss function, which dampens the effect of outliers.

```python
model = SGDClassifier(loss='modified_huber', class_weight='auto', penalty='l2',
                      n_iter=10, shuffle=True)
model.fit(network, labels)
scores = model.decision_function(network)
```

##### Passive aggressive

Passive aggressive is a margin-based online learning algorithm. Margin-based classifiers, e.g., support vector machines and boosting, learn maximum-margin decision boundaries. The passive aggressive algorithm has been used successfully for the bioinformatics task of gene prediction (gene finding).

Passive aggressive operates in two modes to update the classifier: a passive and an aggressive update step. The passive step leaves the weight vector of the classifier unchanged whenever the loss function is zero (the sample falls within a margin of Euclidean distance ε). The aggressive step updates the current weight vector so that the loss function becomes zero.

We applied the PA-I implementation.
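The PA-I update can be illustrated with a small numpy sketch. This is our own illustration of the update rule from the original passive aggressive formulation (in PA-I the step size is additionally capped at an aggressiveness parameter C), not the scikit-learn internals; `pa1_update` is a hypothetical helper:

```python
import numpy as np

def pa1_update(w, x, y, C=1.0):
    # Hinge loss of the current classifier on this sample (y in {-1, +1})
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        # Passive step: the sample is classified with sufficient margin
        return w
    # Aggressive step: correction bounded by the aggressiveness parameter C
    tau = min(C, loss / np.dot(x, x))
    return w + tau * y * x

w = np.zeros(2)
x = np.array([1.0, 0.0])
w = pa1_update(w, x, +1)   # aggressive step: the sample was misclassified
w = pa1_update(w, x, +1)   # passive step: the weight vector is unchanged
```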

```python
model = PassiveAggressiveClassifier(loss='hinge', n_iter=10)
model.fit(network, labels)
scores = model.decision_function(network)
```

### Networks

Networks are constructed based upon (i) protein-protein interactions, (ii) sequence similarity, (iii) co-expression, and (iv) semantic similarity. The sequence similarity and co-expression networks are very dense, so a threshold is used to reduce the connectivity. Please contact us if you are interested in the complete networks.

### GO annotation

### Performance analysis

The difference in performance increment between algorithm and data aggregation is illustrated in the following sections.

#### Algorithm aggregation

###### Area under the ROC curve

Algorithm aggregation improves gene function prediction performance modestly. (A) The mean performance improvement for gene function prediction by algorithm aggregation by averaging predictions (mean std AUROC = 0.026 over window size 100; one-sided Wilcoxon rank-sum p = 3.8e-132). (B) The improvement with algorithm aggregation for each data type and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by algorithm aggregation (mean AUROC improvement is 0.025, std 0.030). (D) The performance increment for the four data types and the two aggregation schemes (averaging and CFS). The performance increase from algorithm aggregation is modest for each data type.

###### Area under the PR curve

Algorithm aggregation improves gene function prediction performance modestly. (A) The mean performance improvement for gene function prediction by algorithm aggregation by averaging predictions (mean std AUPR = 0.015 over window size 100; one-sided Wilcoxon rank-sum p = 3.9e-212). (B) The improvement with algorithm aggregation for each data type and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by algorithm aggregation (mean AUPR improvement is 0.032, std 0.025). (D) The performance increment for the four data types and the two aggregation schemes (averaging and CFS). The performance increase from algorithm aggregation is modest for each data type.

###### Maximum F-measure

Algorithm aggregation improves gene function prediction performance modestly. (A) The mean performance improvement for gene function prediction by algorithm aggregation by averaging predictions (mean std Fmax = 0.018 over window size 100; one-sided Wilcoxon rank-sum p = 7.1e-213). (B) The improvement with algorithm aggregation for each data type and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by algorithm aggregation (mean Fmax improvement is 0.037, std 0.025). (D) The performance increment for the four data types and the two aggregation schemes (averaging and CFS). The performance increase from algorithm aggregation is modest for each data type.

#### Data aggregation

###### Area under the ROC curve

Data aggregation improves gene function prediction performance substantially. (A) The mean performance improvement for gene function prediction by data aggregation by averaging predictions (mean std AUROC = 0.023 over window size 100; one-sided Wilcoxon rank-sum p = 6.9e-216). (B) The improvement with data aggregation for each algorithm and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by data aggregation (mean AUROC improvement is 0.128, std 0.033). (D) The performance increment for the six machine learning algorithms and the two aggregation schemes (averaging and CFS). Data aggregation improves performance substantially across the set of algorithms.

###### Area under the PR curve

Data aggregation improves gene function prediction performance substantially. (A) The mean performance improvement for gene function prediction by data aggregation by averaging predictions (mean std AUPR = 0.035 over window size 100; one-sided Wilcoxon rank-sum p = 1.1e-215). (B) The improvement with data aggregation for each algorithm and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by data aggregation (mean AUPR improvement is 0.093, std 0.067). (D) The performance increment for the six machine learning algorithms and the two aggregation schemes (averaging and CFS). Data aggregation improves performance substantially across the set of algorithms.

###### Maximum F-measure

Data aggregation improves gene function prediction performance substantially. (A) The mean performance improvement for gene function prediction by data aggregation by averaging predictions (mean std Fmax = 0.035 over window size 100; one-sided Wilcoxon rank-sum p = 7.6e-216). (B) The improvement with data aggregation for each algorithm and two aggregation schemes: aggregation by averaging and aggregation using correlation-based feature selection (CFS). (C) The distribution of the mean performance increment for all GO groups by data aggregation (mean Fmax improvement is 0.098, std 0.057). (D) The performance increment for the six machine learning algorithms and the two aggregation schemes (averaging and CFS). Data aggregation improves performance substantially across the set of algorithms.

### ‘Generic’ experiment

Each experiment uses a combination of an algorithm and a data resource (functional annotations are always based upon GO). This section describes the code used to run an experiment and is meant as guidance for the publicly available source code. The code for neighbor voting and GeneMANIA is available upon request.

###### Load the data

```python
from scipy.io import loadmat

# IDs of GO terms
goid = loadmat(GOIDFileName, squeeze_me=False, mat_dtype=True)
goid = goid['GOID']

# IDs of genes
geneid = loadmat(geneIDFileName, squeeze_me=False, mat_dtype=True)
geneid = geneid['geneid']

# Gene-to-GO annotation matrix
gene2goMatrix = loadmat(gene2GOFileName, squeeze_me=False, mat_dtype=True)
gene2goMatrix = gene2goMatrix['gene2goMatrix']

# Load the network
network = loadmat(networkFileName, squeeze_me=False, mat_dtype=True)
network = network['network']
```

###### Run the experiment

```python
from math import floor
from functools import reduce
from numpy.random import permutation
from sklearn.linear_model import LogisticRegression

# 3-fold cross-validation
nFold = 3

# Random permutation of the proteins for the fold assignment
pp = permutation(numproteins)
resultid = 0
i = 0

# For each GO term
for term in goid:
    # Annotations (labels) for this GO term
    annotations = gene2goMatrix[:, i]
    labelIx = range(numproteins)

    # One of the algorithms
    model = LogisticRegression(penalty='l2', dual=True, class_weight="auto",
                               C=1.0, tol=0.0001)
    model.fit(network, annotations)
    scores = model.predict_proba(network)

    # Performance analysis on the full data
    per = Performance(annotations, scores)
    roc = per.AUROC()
    pr = per.AUPR()
    fmax = per.Fmax()
    del per

    meanroc = []
    meanpr = []
    meanfmax = []

    fold = 0
    offset = 0
    while fold < nFold:
        # Proteins held out in this fold
        lastelem = min(numproteins, offset + floor(numproteins / nFold))
        ix = [labelIx[index] for index in pp[offset:lastelem]]
        offset = lastelem

        # Mask the held-out annotations
        labeltmp = [float(value) for value in annotations]
        for index in ix:
            labeltmp[index] = 0

        # One of the algorithms
        model = LogisticRegression(penalty='l2', dual=True, class_weight="auto",
                                   C=1.0, tol=0.0001)
        model.fit(network, labeltmp)
        scores = model.predict_proba(network)

        # Performance analysis against the full annotations
        per = Performance(annotations, scores)
        meanroc.append(per.AUROC())
        meanpr.append(per.AUPR())
        meanfmax.append(per.Fmax())
        del per

        fold += 1

    # Mean performance over the folds
    roc_mean = reduce(lambda x, y: x + y / float(len(meanroc)), meanroc, 0)
    pr_mean = reduce(lambda x, y: x + y / float(len(meanpr)), meanpr, 0)
    fmax_mean = reduce(lambda x, y: x + y / float(len(meanfmax)), meanfmax, 0)

    i += 1
```