GitHub - lucagiovagnoli/Markov_clustering-Graph_API: Graph API and Markov Clustering implementation in Python using numpy.

lucagiovagnoli / Markov_clustering-Graph_API Public

Notifications You must be signed in to change notification settings
Fork 2
Star 4

Graph API and Markov Clustering implementation in Python using numpy.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.txt		README.txt
input_graph.txt		input_graph.txt

Repository files navigation

Graph Structure in the test.py example:

0 4
/|\ /|\
1---2 ------ 6---5
\|/ \|/
3 7

Output clusters of the Markov Clustering Algorithm after only 14 iterations:
['1', '0', '3', '2']
['5', '4', '7', '6']

The Graph is represented using the adjacency list method. For the example above:
1: 0->3->2->
0: 1->3->2->
3: 1->0->2->
2: 1->0->3->6->
5: 4->7->6->
4: 5->7->6->
7: 5->4->6->
6: 2->5->4->7->

It is slower than the matrix representation but better from a memory point of view if the graph is sparse. Matrix
representation for the example above (indexes mapping to keys: ['1', '0', '3', '2', '5', '4', '7', '6']):
[[0 1 1 1 0 0 0 0]
[1 0 1 1 0 0 0 0]
[1 1 0 1 0 0 0 0]
[1 1 1 0 0 0 0 1]
[0 0 0 0 0 1 1 1]
[0 0 0 0 1 0 1 1]
[0 0 0 0 1 1 0 1]
[0 0 0 1 1 1 1 0]]

The list of nodes is a python dictionary in order to be able to retrieve node's information and list of neighbors in
O(1). The lists of neighbors are python dictionaries too, so it takes O(1) to retrieve a edge's weight.
The methods have been partially implemented, more details can be found in the docstrings inside the code.

GRAPH ALGORITHMS - High level Concepts

Depth First Search
DFS is a search algorithm that takes two nodes of a graph as input and returns a path between them. Depth-First is
referred to the visiting order of the neighbor nodes. A stack is used in order to store all nodes to be visited (a
queue is used in BFS). Instead of visiting all neighbor nodes at current level before processing the next level (Breadth
First Search), we keep going deep level after level visiting always the last node added to the stack (pop operation).
Steps:
1) Push neighbors on a stack data structure.
2) Pop node and visit it.
4) Repeat the process until stack is empty or visited node is the one we were looking for.

Dijkstra
Computes the shortest path from a node to all other nodes.
The starting node has a distance of zero, all others start with an infinite distance value. The idea is to scan
through all nodes in the graph updating these distances until at the end each of them will have the minimum
possible distance from the starting node.
Relaxation is obtained checking all possible ways of arriving to a certain node and saving the minimum. To be sure
that the distances of the previous nodes in the path are already definitive we need to visit the nodes of the graph in a
certain way. Hence nodes are put in a priority queue and processed in ascending value of distance.

Kruskal
Computes the minimum spanning tree of the graph. The minimum spanning tree is a tree composed by all nodes of the graph
so that the sum of the arc costs connecting these nodes is the minimum.
Input is the list of all edges.
We keep track of two sets, the connected nodes and the nodes yet to be visited.
We process the edges in order of ascending weight.
For each processed edge, we mark the two nodes as visited and we add the edge to the solution. The algorithm stops
after processing all edges or when the set nodes-yet-to-be-visited is empty. We can use a connectivity algorithm like
the union-find for the connectivity purposes of the algorithm.
At the end there will be a list of edges representing the minimum spanning tree.

Centrality
There are many ways of defining centrality. Degree centrality is a measure of how likely it is for a node to “catch” a
flow going through the network. Higher the node’s degree, higher the centrality. The degree of a node is the number of
adjacent edges to which it is connected. We can keep track of this number each time we add or remove edges and nodes
from the graph. The algorithm simply returns a list of nodes associated to their degrees.
For example in a social network graph, the degree of a node could be the friends of a person. Higher the number of
friends, higher Its centrality.

Connectivity (Union-Find)
Union-find is an algorithm for computing connectivity of a graph. There are two main connectivity algorithms. The
quick-find allows lookup for connectivity in O(1) and connecting (union) of nodes in O(n). Once connectivity computation
is completed, using the data structure will be very fast. The alternative is the quick-union algorithm where we can
connect nodes in O(1) but lookup takes O(n) time.
Input is a list of all edges in the graph. Each time we process an edge, we make sure to connect the nodes clusters
together. An auxiliary hash table is used for this purpose. At each hashed node corresponds a value that is equal for
all nodes connected together. To connect two nodes together we scan through the data structure putting the same value to
all nodes belonging to either the clusters of the two edge’s nodes.

MARKOV CLUSTERING ALGORITHM

MCL algorithm.
Unlike some other clustering techniques like K-means clustering, in MCL
the number of clusters is not predetermined. The idea is to operate random walks through the graph.
While walking inside a cluster the probability of staying inside the same cluster is high.
Numpy library is used in order to speed up matrix operations.

Arguments:
matrix -- a square matrix representation of a graph. Matrix should be symmetric.
'The reason that mcl dislikes uni-directed graphs is not very mcl specific, it has more to do with the
clustering problem itself. Somehow, directionality thwarts the notion of cluster structure.' (Stijn van
Dongen)
e -- power parameter. The matrix is multiplied by itself e times each iteration
r -- inflation parameter. Increasing r will make inflation stronger and will
increase granularity/tightness of clusters).

Algorithm - steps:
1) column normalize the matrix
2) raise the matrix to the e-th power (simulates the random walk)
3) raise each element to the r-th power and column normalize. It strengthens more likely currents
while weakening the less likely (faster convergence).
4) repeat steps 2 and 3 until we are in a steady state or max number of iterations T is reached
5) interpret clusters (non-zero elements along rows)

A possible future improvement could be the pruning of cells in case of huge graphs. It is similar
to inflation, we set little matrix values directly to zero to speed up the process.

Sources:
http://www.micans.org/mcl/
http://www.cs.ucsb.edu/~xyan/classes/CS595D-2009winter/MCL_Presentation2.pdf