**Introduction**

My name is Samyadeep Basu and I am currently a Senior undergraduate student in Computer Science and Biology at BITS Pilani, India. This is my second Google Summer of Code with Open Genome Informatics. In the past I worked with Tripal under the guidance of Dr. Stephen Ficklin, Assistant Professor at Washington State University. In my previous year with Google Summer of Code, I developed a Data Visualisation tool for Tripal to visualise large Biological Graphs consisting of around 100,000 edges just using the web browser’s rendering capability. At the end of the project we were successfully able to load large Biological Graphs faster than their existing visualisation tool powered by Cytoscape. The open source project was a great learning opportunity and motivated me to apply to GSoC again.

So coming to this year’s GSoC !

This time around, I am working with Joshua from the InterMine team based out of University of Cambridge. The project deals with the development of a statistical module to find Similar Genes in InterMine by leveraging the huge amount of Biological Information lying in the InterMine Databases ultimately leading to the development of a recommendation engine. For the first month of the coding period I have used concepts from Graph Theory, Data Mining and Machine Learning to get some interesting results about the Similarity of Genes, which are described in depth in the following sections of the Blog.

The InterMine data currently resides in a relational database, though currently a concurrent GSoC project is dealing with the development of a Neo4j API for storage of information in Neo4j, a Graph Database. Considering the various advantages a Graph Database provides, in the initial stages of the project, we developed a toy loader for storing some essential information into Neo4j (It’s not a full fledged loader, but works fine for our project! ) which can be essential for the Similarity Measurement.

**A brief introduction to the InterMine model**

InterMine is a open source data warehouse for storage and analysis of complex biological data. In InterMine complex queries could be formed by the Biological Researchers to answer various Biological Questions. However, currently the InterMine warehouse lacks any kind of recommendation system which might assist users in their searches .

Lets see a simple example to see how the recommendation system for InterMine might work.

Let’s say the user is obtaining certain information corresponding to Gene X to answer some Biological Question. In hindsight, he might also be interested to find any other genes which might be similar to the Gene X. This Google Summer of Code project aims to solve this problem with the help of the huge amount of Biological Data lying in InterMine with some statistics, math, data visualisation and Machine Learning.

Before going into an in depth analysis of the approaches used, just to get a head start, following are the implementations that have been carried out in the first month of the coding period :

– Highly Connected Sub-graph Algorithm to find components of interest in the Gene Interaction Network.

– Girvan-Newmann Algorithm to detect communities of interest in the Gene Interaction Network.

– Spectral Clustering Algorithm to find similar Genes from the Gene Interaction Network.

-Construction of Features using Network Centrality measures from Gene Interaction Network and combining them with Genetic Features (eg. Protein Domains, Pathways)

-An extension to the Hierarchical Clustering Algorithm to specifically deal with the Mixed type of Data Set we have.

-SimRank Algorithm to find Similar Nodes from Gene Regulatory Network.

-MinHash Algorithm to find similar sets of Genes from the Genetic Features.

-Usage of Tf-Idf measures to construct Genetic feature vectors and cluster them using K-means (Similar to how documents are clustered).

-Neo4j Analytics

-A toy Neo4j loader

**InterMine Dataset and Feature Construction**

Until now, we have worked our way with the FlyMine Dataset which is actually a good starting point for the project considering it contains lesser number of genes with all the essential genetic features which are relevant in our case. The data consists of certain features which can be very well expressed in the form of Graphs and certain features which can be treated as categorical variables. For example we have treated information pertaining to the Gene Interaction Network as an Undirected Graph, whereas the regulatory networks have been treated as Directed Graph. Augmenting this information, for each gene, we have important features like Protein Domains in the proteins translated from the transcripts of the Gene, Pathways in which the Gene participates, Tissues in which the Gene is expressed, chromosomal information, Phenotypes associated with the Gene, diseases associated with any mutation in the given Gene etc. The basic idea that we have used is to leverage both the categorical as well as the information from the Graphs to construct meaningful features which can be used for similarity measurement.

For Graphical Features, we have used information from the Interaction network as well as the Regulatory Networks in the form of centrality measures. The centrality measures also give a rough idea about the important genes in the network based on topological properties.

**Drawback of the InterMine Dataset**

Although the InterMine Dataset is rich in information to come to a certain conclusion about Similar Genes, there is a major problem underlying the dataset. This disadvantage is mainly owing to the property that a Gene can have multiple values for a single categorical feature. For example : A gene can have multiple protein domains if we treat protein domains as a categorical feature. Most of the existing Machine Learning Algorithms for clustering are not capable of handling such datasets without modification or any workaround. One way to tackle this problem is to use one hot encoding for the categorical features, however that ends up blowing our dimensionality to a large number. Let’s say we treat protein domains as a categorical variable and in total there are 5000 unique protein domains which cover all the genes. One hot encoding will result into increasing the dimensionality of the feature vector by 5000. Add to this the other features like pathway, diseases, tissues etc and we might end up looking at a dimensionality of around 60000 for the small Flymine Dataset itself. Dimensionality of 60000 is still manageable, however as we use more complex datasets, we will face the issue of curse of dimensionality which will in turn hinder in the efficiency of the clustering algorithms.

To tackle this problem, Hierarchical Agglomerative Clustering, an Unsupervised Machine Learning Algorithm was modified to handle our InterMine Datasets.

**An Extension to HAC for Similarity Measurement**

Let’s start with a brief summary of the original HAC Algorithm!

There are quite a few variants of HAC, however here we will focus on the basic method. HAC follows a bottom up approach in which initially all the features vectors are treated as different clusters. At each step, two clusters are merged which are nearest to one another. At the end of all iterations the algorithm will produce a single cluster. If there are N feature vectors, then the algorithm converges in N-1 iterations. The merging at each and every step can be visualised as a dendogram.

Considering that our feature vectors consists of both numerical(due to Network Centrality Measures) as well as categorical variables, we made certain changes in the step where the shortest distance between two clusters are found. The redefined distance measure is a combination of Jaccard Distance and Euclidean Distance.

Due to this modification, there is no need to convert our vectors into one hot encodings and therefore the dimensionality will not be blown up.

However as in the world of Machine Learning there is no free lunch, therefore this method also has certain disadvantages associated with it. The main issue this algorithm faces is the running time, which will create a problem if we have let’s say a million Genes compared to the run of the mill K-means algorithm. Inspite of the drawback of the running time, for our particular dataset, it is wise to use this particular method over K-means considering the inability of K-means to handle non-convex datasets and difficulty in assigning centroids over mixed datasets like the InterMine Data.

However if we can somehow find a way to assign centroids to mixed data types, we can very well create an extension of K-means which will run much faster than the above method (Work in progress for that! ). Centroids for mixed type data can also help us to determine the optimum number of clusters by measuring the compactness of the clusters using Silhouette Analysis. Currently in our module, we have used Silhouette Analysis only for K-means algorithm which was fed with the one-hot encoded vectors for Genes. For our HAC extension for mixed type of dataset, it is quite tricky to apply Silhouette Analysis as they are devoid of any centroids. Hence this feature is still lacking in our HAC extension.

**Similarity Measurement using Tf-Idf**

In the world of Natural Language Processing, for constructing feature vectors for documents, Tf-Idf measures are used to construct feature vectors. A similar idea was implemented for genes, as they hold information just like documents in the form of tokens.

Although they are also of high dimensions, these features vectors are of much better quality than one-hot encoded vectors. Due to the betterment in quality, on the top of this, a dimensionality reduction technique called SVD was applied. As the feature vectors are sparse in nature, SVD works much better than PCA which is in general widely used.

The number of dimensions were chosen by checking the variance ratios, which were then fed to either the HAC extension that we built or the K-means Algorithm.

**Min Hash with Locality Sensitive Hashing**

MinHash and LSH are used in a wide variety of Data Mining problems to find Similarity amongst sets. Lets say we have around 100 sets and we have to find the similarity measurement amongst all the sets. One of the naive approaches could be to compare each and every pair of sets using Jaccard Distance. The higher the Jaccard Value is the higher is the similarity between the two sets. However this approach is not at all scalable. If we have around N genes to compare, then there will be around the complexity will be to the tune of N^2, which is pretty abysmal if we say have around 10 million Genes.

One way to tackle this problem is to construct signatures, which is a type of compression of information (Done by MinHash) and use these signatures to find compare only those genes, who similarity value might be of interest (Done by LSH).

**SimRank for Gene Regulatory Network**

The Gene Regulatory Network is a Directed Graph where the directed edges signify how the regulations take place. In general the source gene is responsible for regulating the destination gene. The SimRank method leverage this directed information to construct a Similarity Matrix for the Genes of the regulatory network.

The algorithm is implemented in the GitHub repository given at the end of the blog.

**Spectral Clustering for Gene Interaction Network**

The module developed until now also consists of a clustering method for the Interaction Network by Spectral Analysis. The basic idea is to convert the Graph into a Normalized Laplacian Matrix and then extract Eigen Values and Eigen Vectors for the Laplacian. In graph theory, Eigen values of a Laplacian can give us quite a few information about the Global Structure of the Graphs. For example if Lambda is designated as the Eigen Values and for a particular Laplacian ‘k’ Lambdas are equal to zero, then it has been proved that there are K connected components.

Leveraging these informations, Spectral Clustering extracts the top few Eigen values for the Normalized Laplacian along with their corresponding Eigen Vectors. Then a generic clustering algorithm like K-means can be used to cluster the Eigen vectors with the top Eigen values.

Following are the visualisations of the Gene Interaction Network for different parameters corresponding to number of clusters:

**Fig** : For k = 10

**Fig** : For k = 30

**Fig** : For k = 40

**Fig** : For k = 50

From the visualisations it can be deciphered that as we increase the number of clusters, a community structure is being developed in the graphs, with the similar type of subgraphs getting clustered and hence coloured similarly.

**Community Detection in Interaction Networks using Girvan Newman Modularity Approach**

The basic idea of the method is to utilise the connections in the interaction network to find genes which interact strongly amongst themselves, therefore forming a community. The common mathematical approach to extract communities from an undirected graph relies on an objective function. The objective function is defined with respect to the definition of community for the problem context and then the objective function is maximised or minimised to extract communities. In our problem statement, we define the objective function with respect to modularity. The modularity function inherently measures how compact the communities are within themselves and our objective is to maximise this modularity function.

Currently our module is only capable of detecting communities in undirected graphs (For Gene Interaction Network). For directed graphs, the task of detecting communities is a bit trickier considering the significance of directions. Currently I am working on a flow-based approach for detecting communities in directed graphs with respect to Gene Regulatory Networks. The basic idea is for this specific algorithm is quite simple. Let’s say a user randomly starts from a certain point in the directed graph, then I will define community as that region where the probability of getting trapped is maximum for the user.

** Figure 1** : Different communities in the Gene Interaction Network are coloured differently. The above is small part of the Interaction Network on which the algorithm was applied.

**Highly Connected Sub Graph Algorithm**

HCS is yet another Graph Clustering Algorithm quite frequently used in solving network problems. The detailed code along with implementation has been described in the GitHub repository.

**Neo4j Analytics and Loader**

One of the major reasons why we have used Neo4j for storing our data lies in the interesting features it offers. In the module until now, I have added certain features which might be biologically relevant, which are as follows :

Cycles in the Interaction Network and the genes in each cycle

Given Gene X and Gene Y, find all the paths between them.

Find triangular relationships

Along with this, we also have a toy Neo4j loader (while the main one is being developed by the other Google Summer of Code Project) to just be able to load essential data relevant to our project.

**Plans for the Month of July and August**

In the coming months there are certain new algorithms that I will be implementing along with packaging the whole module into a usable form, which can be integrated easily with InterMine front end.

So coming to the new algorithms!

-From the datasets in InterMine we have seen that there are certain genes for whom annotations are missing. We plan to predict these missing annotations by using a semi-supervised learning method with the help of the Gene Interaction Network, Regulatory Network and Genetic Features of the nodes which are available. This will also give a superior result with the above mentioned algorithms to detect similarities thus solving the problem of missing values.

-I plan to use a flow based approach for detecting communities in the Gene Regulatory Network.

-A Deep Representation of the Interaction Network with Neural Networks (Autoencoders). This implementation is based on a paper published by Microsoft Research, where they were able to extract the essential features of the Graph using Deep Learning.

-Implementation of an alternative to Mixed HAC for speeding up clustering as well as a method to find centroids for mixed type of data.

-Develop an approach to use all the algorithms together in an ensemble, give weights and produce a superior similarity measurement.

For the month of July, I have extensively gone through the literature of Graph Theory and Machine Learning to find some new algorithms suitable to solving our problem and add to the ensemble module that we are building. I strongly believe as there is no gold standard for solving our problem as well as no ground truth for the data set in question, hence a combination of Similarity Measurement methods is suitable to gain some insight into the datasets that we have!

**Markov Clustering Algorithm**

Markov Clustering Algorithm is a scalable unsupervised learning algorithm for clustering nodes in a network based on the simulation of stochastic flows in the network. I have added this algorithm in the module for finding similar genes in the Gene Interaction Network. The basic concept of the algorithm is as follows :

“A random walk in a graph G will not leave a dense cluster until all the vertices in the cluster is visited”. So in a way, it simulates promotes flow in a highly connected region in a graph or network. The random walks in the Graph are calculated using Markov Chains, which essentially state that while moving to the next node from the present node, the dependency is only on the current node but not on the previous visited nodes.

The steps of the algorithm are as follows:

- Create an adjacency matrix for the given graph, transpose it and normalise to create an associated matrix.
- Addition of Self Loops to the Graph, which is optional.
- Normalise the matrix
- Expand by taking a power (input parameter 1) of the matrix.
- Inflate by taking power (input parameter 2) of the columns in the matrix.
- Repeat Steps 4 and 5 until convergence is reached.
- Once convergence is reached, interpret the results to obtain dense clusters in the network.

**Checking for convergence in the matrix**

In the published paper, convergence hasn’t been proved, but it can be observed after a number of iterations. Generally, it can be observed that once you have a convergence, then the matrix almost always converges to a double idempotent matrix. The clusters could be extracted by those columns which have the same value in a particular row. However it has been proven that once the algorithm is in the neighbourhood of being a double idempotent matrix, then the algorithm converges in a quadratic manner.

**Interpreting Clusters **

For each row, check the columns which have almost the same value and put those nodes in the network into the same cluster.

The implementation has been done in Python and is present in the Git repository, which produces results on the FlyMine Dataset.

**Analysis of Regulatory Networks **

The InterMine Dataset also contains information about the Gene Regulatory Networks in the form of directed graphs. The basic idea of the analysis was to construct various features related to Graph Theory (like centrality measures, shortest paths, degrees), reduce the feature space and do some clustering on them to get the similar nodes in the network.

Although the constructed features were not of high dimensions, but still it was important to reduce the dimensions so that we can visualise the results efficiently. Most of the commonly used dimension reduction techniques like PCA are unable to capture the non linear structure in the datasets, which is extremely essential for visualisation purposes. Therefore this time around, I explored a bit of non linear dimension reduction techniques published in the literature like IsoMaps, Diffusion Maps, t-SNE and Auto encoders. Another important think is to capture the global as well the local structure of the data when we project it in lower dimensions. Out of the above mentioned methods only t-SNE and Auto Encoders are able to do so. t-SNE is extremely beneficial for high dimensional data visualisation, however it suffers from a major problem of crowding effect and parameter tuning (perplexity) to get the best result. It might be required to run t-SNE multiple times with tuning the perplexity to get the best results. Auto-encoders on the other hand have been shown to capture both the local structure as well as the global structure of the data quite efficiently. However the catch with auto encoders is that they are a bit hard to train. But considering the dataset that we have and recent tweaks about efficient auto encoder training, it was possible to train them for our dataset to reach convergence with a minimalistic loss value.

The shape of the two dimensional data was ellipsoid in nature and clustering via generic K-means was not useful as they are unable handle ellipsoid shapes in general. Therefore, for this specific case, an Expectation-Maximization algorithm was run to fit a Gaussian distribution to the dataset to obtain the clusters. From the visualised data, it was seen that there were three distinct clusters and our clustering algorithm based on Gaussian Mixture Models also recognised the three clusters efficiently.

We can also get a better 2D subspace representation by stacking auto encoders (This will be updated in the code in the next few days!).

The above mentioned method was though run on a smaller dataset is scalable to large regulatory network !

** Figure 2** : Linear Dimensionality Reduction via PCA and then clustering using Gaussian Mixture Models.

** Figure 3** : Nonlinear Dimensionality Reduction using Auto encoders and then fitted a Gaussian Mixture Model to cluster the information. As it can be seen after dimensionality reduction that the shape of the data came out to be ellipsoid, which is tailor made for fitting a Gaussian Mixture Model.

**Overlapping Community Detection using K-Cliques**

Our module already has one efficient implementation of Community detection algorithm based on the Girvan Newman Method. However it does not take into account that a gene can belong to multiple communities that is have multiple biological functions. Therefore to tackle this problem I have added an implementation of detecting overlapping communities based on the K-Clique method or the clique percolation method.

* How does the clique percolation method work*?

A clique in a graph is a set of nodes, which are fully connected amongst themselves. Therefore k-clique refers to a completely connected subgraph consisting of k nodes. In general two K-cliques are known to be adjacent if they share k-1 nodes between them. The definition of community in this respect is the maximum union of k-cliques that can be reached through a series of adjacent k-cliques. This method thus ensures the possibility of a node to be present in multiple communities due to the percolation of cliques. One of the major benefits of this method is that it is local in nature. What it means is that, if a particular subgraph of the network is designated to be a community by the algorithm, then it is not affected by what happens in some other part of the network.

This approach could be bettered in terms of finding maximal cliques, as the graph might have multiple k-cliques, however finding maximal cliques is much more tougher than finding individual k-cliques in terms of computational complexity as it is a NP-hard problem.

Currently the method implemented in the module does not take into account the weights present in the network. However this algorithm could be modified to take into account weights to detect overlapping communities (Weighted Clique Percolation Method).

**Blondel Graph Similarity Measures**

Blondel Graph Similarity Measure is based on the paper published by Blondel as ‘A Measure of Similarity between Graph Vertices: Applications to Synonym Extraction and Web Searching’ . It has been widely used in the detection of similar web pages in the internet by treating it as a graph. In our module, we have used the algorithm on the detection of similar nodes in the Gene Regulatory Network. The algorithm is backed by some really strong mathematics and has some long proofs associated with it! Hence I will skip to the main equation / update rule which is used until the algorithm reaches convergence.

** The Equation** :

X(k+1) =A.X(k).A(Transpose) +A(Transpose). X(k). A

In the implementation :

**A** : Adjacency Matrix

**X** : Matrix initialised with random values

The rule is used to update the matrix until and unless a convergence is reached! However one of the major disadvantage of this method is the ability to reach convergence when produced with a large number of nodes. However for the InterMine datasets, this algorithm will work perfectly fine as it does not have an extremely large number of nodes!

**Plans Ahead**

Currently the statistical module for InterMine is complete with the implementations of different algorithms from the literature for finding similar nodes in InterMine from Gene Interaction Graphs, Gene Regulatory Networks as well as feature information about Genes (eg. Protein Domains, Pathways, ). In the coming few days, I am going to run the algorithms implemented on a larger dataset to get visualisations, store the results and also get the results analysed by a Biologist !

Also in the pipeline is a singular script to call the different algorithms in the module, score the results and get a singular similarity matrix! I am currently working on the method to score the different algorithms according to some importance score to leverage the ensemble method.