algorithms.clustering.hierarchical_clustering

Module: algorithms.clustering.hierarchical_clustering

Inheritance diagram for nipy.algorithms.clustering.hierarchical_clustering:

digraph inheritance1a5d4ec8c5 { rankdir=LR; size="8.0, 12.0"; "clustering.hierarchical_clustering.WeightedForest" [URL="#nipy.algorithms.clustering.hierarchical_clustering.WeightedForest",fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5)",target="_top",tooltip="This is a weighted Forest structure, i.e. a tree"]; "graph.forest.Forest" -> "clustering.hierarchical_clustering.WeightedForest" [arrowsize=0.5,style="setlinewidth(0.5)"]; "graph.forest.Forest" [URL="nipy.algorithms.graph.forest.html#nipy.algorithms.graph.forest.Forest",fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5)",target="_top",tooltip="Forest structure, i.e. a set of trees"]; "graph.graph.WeightedGraph" -> "graph.forest.Forest" [arrowsize=0.5,style="setlinewidth(0.5)"]; "graph.graph.Graph" [URL="nipy.algorithms.graph.graph.html#nipy.algorithms.graph.graph.Graph",fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5)",target="_top",tooltip="Basic topological (non-weighted) directed Graph class"]; "graph.graph.WeightedGraph" [URL="nipy.algorithms.graph.graph.html#nipy.algorithms.graph.graph.WeightedGraph",fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5)",target="_top",tooltip="Basic weighted, directed graph class"]; "graph.graph.Graph" -> "graph.graph.WeightedGraph" [arrowsize=0.5,style="setlinewidth(0.5)"]; }

These routines perform some hierrachical agglomerative clustering of some input data. The following alternatives are proposed: - Distance based average-link - Similarity-based average-link - Distance based maximum-link - Ward’s algorithm under graph constraints - Ward’s algorithm without graph constraints

In this latest version, the results are returned in a ‘WeightedForest’ structure, which gives access to the clustering hierarchy, facilitates the plot of the result etc.

For back-compatibility, *_segment versions of the algorithms have been appended, with the old API (except the qmax parameter, which now represents the number of wanted clusters)

Author : Bertrand Thirion,Pamela Guevara, 2006-2009

Class

WeightedForest

class nipy.algorithms.clustering.hierarchical_clustering.WeightedForest(V, parents=None, height=None)[source]

Bases: nipy.algorithms.graph.forest.Forest

This is a weighted Forest structure, i.e. a tree - each node has one parent and children (hierarchical structure) - some of the nodes can be viewed as leaves, other as roots - the edges within a tree are associated with a weight: +1 from child to parent -1 from parent to child - additionally, the nodes have a value, which is called ‘height’, especially useful from dendrograms

__init__(V, parents=None, height=None)[source]
Parameters

V: the number of edges of the graph

parents=None: array of shape (V)

the parents of the graph by default, the parents are set to range(V), i.e. each node is its own parent, and each node is a tree

height=None: array of shape(V)

the height of the nodes

set_height(height=None)[source]

Set the height array

get_height()[source]

Get the height array

check_compatible_height()[source]

Check that height[parents[i]]>=height[i] for all nodes

plot(ax=None)[source]

Plot the dendrogram associated with self the rank of the data in the dendogram is returned

Parameters

ax: axis handle, optional

Returns

ax, the axis handle

partition(threshold)[source]

Partition the tree according to a cut criterion

split(k)[source]

idem as partition, but a number of components are supplied instead

plot_height()[source]

Plot the height of the non-leaves nodes

list_of_subtrees()[source]

returns the list of all non-trivial subtrees in the graph Caveat: theis function assumes that the vertices are sorted in a way such that parent[i]>i forall i Only the leaves are listeed, not the subtrees themselves

adjacency()

returns the adjacency matrix of the graph as a sparse coo matrix

Returns

adj: scipy.sparse matrix instance,

that encodes the adjacency matrix of self

all_distances(seed=None)

returns all the distances of the graph as a tree

Parameters

seed=None array of shape(nbseed) with valuesin [0..self.V-1]

set of vertices from which tehe distances are computed

Returns

dg: array of shape(nseed, self.V), the resulting distances

Notes

By convention infinite distances are given the distance np.inf

anti_symmeterize()

anti-symmeterize self, i.e. produces the graph whose adjacency matrix would be the antisymmetric part of its current adjacency matrix

cc()

Compte the different connected components of the graph.

Returns

label: array of shape(self.V), labelling of the vertices

check()

Check that self is indeed a forest, i.e. contains no loop

Returns

a boolean b=0 iff there are loops, 1 otherwise

Notes

Slow implementation, might be rewritten in C or cython

cliques()

Extraction of the graphe cliques these are defined using replicator dynamics equations

Returns

cliques: array of shape (self.V), type (np.int)

labelling of the vertices according to the clique they belong to

compact_neighb()

returns a compact representation of self

Returns

idx: array of of shape(self.V + 1):

the positions where to find the neighors of each node within neighb and weights

neighb: array of shape(self.E), concatenated list of neighbors

weights: array of shape(self.E), concatenated list of weights

compute_children()

Define the children of each node (stored in self.children)

copy()

returns a copy of self

cut_redundancies()

Returns a graph with redundant edges removed: ecah edge (ab) is present ony once in the edge matrix: the correspondng weights are added.

Returns

the resulting WeightedGraph

define_graph_attributes()

define the edge and weights array

degrees()

Returns the degree of the graph vertices.

Returns

rdegree: (array, type=int, shape=(self.V,)), the right degrees

ldegree: (array, type=int, shape=(self.V,)), the left degrees

depth_from_leaves()

compute an index for each node: 0 for the leaves, 1 for their parents etc. and maximal for the roots.

Returns

depth: array of shape (self.V): the depth values of the vertices

dijkstra(seed=0)

Returns all the [graph] geodesic distances starting from seed x

seed (int, >-1, <self.V) or array of shape(p)

edge(s) from which the distances are computed

Returns

dg: array of shape (self.V),

the graph distance dg from ant vertex to the nearest seed

Notes

It is mandatory that the graph weights are non-negative

floyd(seed=None)

Compute all the geodesic distances starting from seeds

Parameters

seed= None: array of shape (nbseed), type np.int

vertex indexes from which the distances are computed if seed==None, then every edge is a seed point

Returns

dg array of shape (nbseed, self.V)

the graph distance dg from each seed to any vertex

Notes

It is mandatory that the graph weights are non-negative. The algorithm proceeds by repeating Dijkstra’s algo for each seed. Floyd’s algo is not used (O(self.V)^3 complexity…)

from_3d_grid(xyz, k=18)

Sets the graph to be the topological neighbours graph of the three-dimensional coordinates set xyz, in the k-connectivity scheme

Parameters

xyz: array of shape (self.V, 3) and type np.int,

k = 18: the number of neighbours considered. (6, 18 or 26)

Returns

E(int): the number of edges of self

get_E()

To get the number of edges in the graph

get_V()

To get the number of vertices in the graph

get_children(v=-1)

Get the children of a node/each node

Parameters

v: int, optional

a node index

Returns

children: list of int the list of children of node v (if v is provided)

a list of lists of int, the children of all nodes otherwise

get_descendants(v, exclude_self=False)

returns the nodes that are children of v as a list

Parameters

v: int, a node index

Returns

desc: list of int, the list of all descendant of the input node

get_edges()

To get the graph’s edges

get_vertices()

To get the graph’s vertices (as id)

get_weights()
is_connected()

States whether self is connected or not

isleaf()

Identification of the leaves of the forest

Returns

leaves: bool array of shape(self.V), indicator of the forest’s leaves

isroot()

Returns an indicator of nodes being roots

Returns

roots, array of shape(self.V, bool), indicator of the forest’s roots

kruskal()

Creates the Minimum Spanning Tree of self using Kruskal’s algo. efficient is self is sparse

Returns

K, WeightedGraph instance: the resulting MST

Notes

If self contains several connected components, will have the same number k of connected components

leaves_of_a_subtree(ids, custom=False)

tests whether the given nodes are the leaves of a certain subtree

Parameters

ids: array of shape (n) that takes values in [0..self.V-1]

custom == False, boolean

if custom==true the behavior of the function is more specific - the different connected components are considered as being in a same greater tree - when a node has more than two subbranches, any subset of these children is considered as a subtree

left_incidence()

Return left incidence matrix

Returns

left_incid: list

the left incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[0] = i

list_of_neighbors()

returns the set of neighbors of self as a list of arrays

main_cc()

Returns the indexes of the vertices within the main cc

Returns

idx: array of shape (sizeof main cc)

merge_simple_branches()

Return a subforest, where chained branches are collapsed

Returns

sf, Forest instance, same as self, without any chain

normalize(c=0)

Normalize the graph according to the index c Normalization means that the sum of the edges values that go into or out each vertex must sum to 1

Parameters

c=0 in {0, 1, 2}, optional: index that designates the way

according to which D is normalized c == 0 => for each vertex a, sum{edge[e, 0]=a} D[e]=1 c == 1 => for each vertex b, sum{edge[e, 1]=b} D[e]=1 c == 2 => symmetric (‘l2’) normalization

Notes

Note that when sum_{edge[e, .] == a } D[e] = 0, nothing is performed

propagate_upward(label)

Propagation of a certain labelling from leves to roots Assuming that label is a certain positive integer field this propagates these labels to the parents whenever the children nodes have coherent properties otherwise the parent value is unchanged

Parameters

label: array of shape(self.V)

Returns

label: array of shape(self.V)

propagate_upward_and(prop)

propagates from leaves to roots some binary property of the nodes so that prop[parents] = logical_and(prop[children])

Parameters

prop, array of shape(self.V), the input property

Returns

prop, array of shape(self.V), the output property field

remove_edges(valid)

Removes all the edges for which valid==0

Parameters

valid : (self.E,) array

remove_trivial_edges()

Removes trivial edges, i.e. edges that are (vv)-like self.weights and self.E are corrected accordingly

Returns

self.E (int): The number of edges

reorder_from_leaves_to_roots()

reorder the tree so that the leaves come first then their parents and so on, and the roots are last.

Returns

order: array of shape(self.V)

the order of the old vertices in the reordered graph

right_incidence()

Return right incidence matrix

Returns

right_incid: list

the right incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[1] = i

set_edges(edges)

Sets the graph’s edges

Preconditions:

  • edges has a correct size

  • edges take values in [1..V]

set_euclidian(X)

Compute the weights of the graph as the distances between the corresponding rows of X, which represents an embdedding of self

Parameters

X array of shape (self.V, edim),

the coordinate matrix of the embedding

set_gaussian(X, sigma=0)

Compute the weights of the graph as a gaussian function of the distance between the corresponding rows of X, which represents an embdedding of self

Parameters

X array of shape (self.V, dim)

the coordinate matrix of the embedding

sigma=0, float: the parameter of the gaussian function

Notes

When sigma == 0, the following value is used: sigma = sqrt(mean(||X[self.edges[:, 0], :]-X[self.edges[:, 1], :]||^2))

set_weights(weights)

Set edge weights

Parameters

weights: array

array shape(self.V): edges weights

show(X=None, ax=None)

Plots the current graph in 2D

Parameters

X : None or array of shape (self.V, 2)

a set of coordinates that can be used to embed the vertices in 2D. If X.shape[1]>2, a svd reduces X for display. By default, the graph is presented on a circle

ax: None or int, optional

ax handle

Returns

ax: axis handle

Notes

This should be used only for small graphs.

subforest(valid)

Creates a subforest with the vertices for which valid > 0

Parameters

valid: array of shape (self.V): idicator of the selected nodes

Returns

subforest: a new forest instance, with a reduced set of nodes

Notes

The children of deleted vertices become their own parent

subgraph(valid)

Creates a subgraph with the vertices for which valid>0 and with the correponding set of edges

Parameters

valid, array of shape (self.V): nonzero for vertices to be retained

Returns

G, WeightedGraph instance, the desired subgraph of self

Notes

The vertices are renumbered as [1..p] where p = sum(valid>0) when sum(valid==0) then None is returned

symmeterize()

Symmeterize self, modify edges and weights so that self.adjacency becomes the symmetric part of the current self.adjacency.

to_coo_matrix()

Return adjacency matrix as coo sparse

Returns

sp: scipy.sparse matrix instance

that encodes the adjacency matrix of self

tree_depth()

Returns the number of hierarchical levels in the tree

voronoi_diagram(seeds, samples)

Defines the graph as the Voronoi diagram (VD) that links the seeds. The VD is defined using the sample points.

Parameters

seeds: array of shape (self.V, dim)

samples: array of shape (nsamples, dim)

Notes

By default, the weights are a Gaussian function of the distance The implementation is not optimal

voronoi_labelling(seed)

Performs a voronoi labelling of the graph

Parameters

seed: array of shape (nseeds), type (np.int),

vertices from which the cells are built

Returns

labels: array of shape (self.V) the labelling of the vertices

Functions

Agglomerative function based on a (hopefully sparse) similarity graph

Parameters

G the input graph

Returns

t a weightForest structure that represents the dendrogram of the data

Agglomerative function based on a (hopefully sparse) similarity graph

Parameters

G the input graph

stop: float

the stopping criterion

qmax: int, optional

the number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns

u: array of shape (G.V)

a labelling of the graph vertices according to the criterion

cost: array of shape (G.V (?))

the cost of each merge step during the clustering procedure

nipy.algorithms.clustering.hierarchical_clustering.fusion(K, pop, i, j, k)[source]

Modifies the graph K to merge nodes i and j into nodes k

The similarity values are weighted averaged, where pop[i] and pop[j] yield the relative weights. this is used in average_link_slow (deprecated)

nipy.algorithms.clustering.hierarchical_clustering.ward(G, feature, verbose=False)[source]

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters

G : graph

the input graph (a topological graph essentially)

feature : array of shape (G.V,dim_feature)

vectorial information related to the graph vertices

verbose : bool, optional

If True, print diagnostic information

Returns

t : WeightedForest instance

structure that represents the dendrogram

Notes

When G has more than 1 connected component, t is no longer a tree. This case is handled cleanly now

nipy.algorithms.clustering.hierarchical_clustering.ward_field_segment(F, stop=-1, qmax=-1, verbose=False)[source]

Agglomerative function based on a field structure

Parameters

F the input field (graph+feature)

stop: float, optional

the stopping crterion. if stop==-1, then no stopping criterion is used

qmax: int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns

u: array of shape (F.V)

labelling of the graph vertices according to the criterion

cost array of shape (F.V - 1)

the cost of each merge step during the clustering procedure

Notes

See ward_quick_segment for more information

Caveat : only approximate

nipy.algorithms.clustering.hierarchical_clustering.ward_quick(G, feature, verbose=False)[source]

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters

G : graph instance

topology-defining graph

feature: array of shape (G.V,dim_feature)

some vectorial information related to the graph vertices

verbose : bool, optional

If True, print diagnostic information

Returns

t: weightForest instance,

that represents the dendrogram of the data

Notes

Hopefully a quicker version

A euclidean distance is used in the feature space

Caveat : only approximate

nipy.algorithms.clustering.hierarchical_clustering.ward_quick_segment(G, feature, stop=-1, qmax=1, verbose=False)[source]

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters

G: labs.graph.WeightedGraph instance

the input graph (a topological graph essentially)

feature array of shape (G.V,dim_feature)

vectorial information related to the graph vertices

stop1 : int or float, optional

the stopping crterion if stop==-1, then no stopping criterion is used

qmax : int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns

u: array of shape (G.V)

labelling of the graph vertices according to the criterion

cost: array of shape (G.V - 1)

the cost of each merge step during the clustering procedure

Notes

Hopefully a quicker version

A euclidean distance is used in the feature space

Caveat : only approximate

nipy.algorithms.clustering.hierarchical_clustering.ward_segment(G, feature, stop=-1, qmax=1, verbose=False)[source]

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters

G : graph object

the input graph (a topological graph essentially)

feature : array of shape (G.V,dim_feature)

some vectorial information related to the graph vertices

stop : int or float, optional

the stopping crterion. if stop==-1, then no stopping criterion is used

qmax : int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns

u: array of shape (G.V):

a labelling of the graph vertices according to the criterion

cost: array of shape (G.V - 1)

the cost of each merge step during the clustering procedure

Notes

A euclidean distance is used in the feature space

Caveat : when the number of cc in G (nbcc) is greter than qmax, u contains nbcc values, not qmax !