algorithms.clustering.hierarchical_clustering
Module: algorithms.clustering.hierarchical_clustering
Inheritance diagram for nipy.algorithms.clustering.hierarchical_clustering:
These routines perform hierarchical agglomerative clustering of input data. The following alternatives are proposed:
- Distance-based average-link
- Similarity-based average-link
- Distance-based maximum-link
- Ward’s algorithm under graph constraints
- Ward’s algorithm without graph constraints
In this latest version, the results are returned in a ‘WeightedForest’ structure, which gives access to the clustering hierarchy and facilitates plotting of the result.
For backward compatibility, *_segment versions of the algorithms are provided with the old API (except for the qmax parameter, which now represents the desired number of clusters).
Author: Bertrand Thirion, Pamela Guevara, 2006-2009
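As a quick orientation, the sketch below runs the Ward-under-graph-constraints route end to end on toy data. It is a minimal, hedged example: the WeightedGraph import path and its (V, edges, weights) constructor are assumptions inferred from the Bases line of the class below, and the chain graph and feature values are invented for illustration.

import numpy as np
from nipy.algorithms.graph.graph import WeightedGraph  # assumed import path
from nipy.algorithms.clustering.hierarchical_clustering import ward, ward_segment

# a chain graph over 30 vertices (edges stored in both directions) and a
# 1D feature with a jump in the middle
n = 30
edges = np.vstack((np.arange(n - 1), np.arange(1, n))).T
edges = np.vstack((edges, edges[:, ::-1]))
G = WeightedGraph(n, edges, np.ones(len(edges)))
feature = np.concatenate((np.zeros(n // 2), 10. * np.ones(n // 2)))[:, None]

tree = ward(G, feature)                            # full hierarchy (WeightedForest)
labels, cost = ward_segment(G, feature, qmax=2)    # old-style flat segmentation
print(tree.tree_depth(), np.unique(labels))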
Class
WeightedForest
class nipy.algorithms.clustering.hierarchical_clustering.WeightedForest(V, parents=None, height=None)
Bases: nipy.algorithms.graph.forest.Forest
This is a weighted Forest structure, i.e. a tree:
- each node has one parent and children (hierarchical structure)
- some of the nodes can be viewed as leaves, others as roots
- the edges within a tree are associated with a weight: +1 from child to parent, -1 from parent to child
- additionally, the nodes have a value, called ‘height’, which is especially useful for dendrograms
__init__(V, parents=None, height=None)
- Parameters
V: int
the number of vertices of the graph
parents=None: array of shape (V)
the parents of the graph; by default, the parents are set to range(V), i.e. each node is its own parent, and each node is a tree
height=None: array of shape (V)
the height of the nodes
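A minimal sketch of building a WeightedForest by hand; the parent array and heights are invented toy values, and matplotlib is assumed for the plot() call documented below.

import numpy as np
import matplotlib.pyplot as plt
from nipy.algorithms.clustering.hierarchical_clustering import WeightedForest

# a dendrogram over 3 leaves: node 3 merges leaves 0 and 1,
# node 4 (the root) merges node 3 with leaf 2
parents = np.array([3, 3, 4, 4, 4])
height = np.array([0., 0., 0., 0.5, 1.])
wf = WeightedForest(5, parents, height)

print(wf.isleaf())   # indicator of the leaves (nodes 0, 1, 2)
print(wf.isroot())   # indicator of the root (node 4)
wf.plot()            # dendrogram of the toy forest
plt.show()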
plot(ax=None)
Plot the dendrogram associated with self; the rank of the data in the dendrogram is returned.
- Parameters
ax: axis handle, optional
- Returns
ax, the axis handle
list_of_subtrees()
Returns the list of all non-trivial subtrees in the graph. Caveat: this function assumes that the vertices are sorted such that parent[i] > i for all i. Only the leaves are listed, not the subtrees themselves.
adjacency()
Returns the adjacency matrix of the graph as a sparse coo matrix.
- Returns
adj: scipy.sparse matrix instance,
that encodes the adjacency matrix of self
all_distances(seed=None)
Returns all the distances of the graph as a tree.
- Parameters
seed=None: array of shape (nbseed) with values in [0..self.V-1]
set of vertices from which the distances are computed
- Returns
dg: array of shape(nseed, self.V), the resulting distances
Notes
By convention infinite distances are given the distance np.inf
anti_symmeterize()
Anti-symmetrize self, i.e. produce the graph whose adjacency matrix would be the antisymmetric part of its current adjacency matrix.
cc()
Compute the different connected components of the graph.
- Returns
label: array of shape(self.V), labelling of the vertices
check()
Check that self is indeed a forest, i.e. contains no loop.
- Returns
a boolean b=0 iff there are loops, 1 otherwise
Notes
Slow implementation, might be rewritten in C or cython
cliques()
Extraction of the graph’s cliques; these are defined using replicator dynamics equations.
- Returns
cliques: array of shape (self.V), type (np.int)
labelling of the vertices according to the clique they belong to
compact_neighb()
Returns a compact representation of self.
- Returns
idx: array of shape (self.V + 1)
the positions where to find the neighbors of each node within neighb and weights
neighb: array of shape(self.E), concatenated list of neighbors
weights: array of shape(self.E), concatenated list of weights
compute_children()
Define the children of each node (stored in self.children).
copy()
Returns a copy of self.
cut_redundancies()
Returns a graph with redundant edges removed: each edge (a, b) is present only once in the edge matrix; the corresponding weights are added.
- Returns
the resulting WeightedGraph
define_graph_attributes()
Define the edges and weights arrays.
degrees()
Returns the degrees of the graph vertices.
- Returns
rdegree: (array, type=int, shape=(self.V,)), the right degrees
ldegree: (array, type=int, shape=(self.V,)), the left degrees
depth_from_leaves()
Compute an index for each node: 0 for the leaves, 1 for their parents, etc., and maximal for the roots.
- Returns
depth: array of shape (self.V): the depth values of the vertices
dijkstra(seed=0)
Returns all the [graph] geodesic distances starting from seed.
- Parameters
seed: (int, >-1, <self.V) or array of shape (p)
vertex (or vertices) from which the distances are computed
- Returns
dg: array of shape (self.V),
the graph distance dg from any vertex to the nearest seed
Notes
It is mandatory that the graph weights are non-negative
floyd(seed=None)
Compute all the geodesic distances starting from seeds.
- Parameters
seed=None: array of shape (nbseed), type np.int
vertex indexes from which the distances are computed; if seed==None, then every vertex is a seed point
- Returns
dg: array of shape (nbseed, self.V)
the graph distance dg from each seed to any vertex
Notes
It is mandatory that the graph weights are non-negative. The algorithm proceeds by repeating Dijkstra’s algorithm for each seed; Floyd’s algorithm is not used (O(self.V^3) complexity).
from_3d_grid(xyz, k=18)
Sets the graph to be the topological neighbours graph of the three-dimensional coordinate set xyz, in the k-connectivity scheme.
- Parameters
xyz: array of shape (self.V, 3) and type np.int
k=18: the number of neighbours considered (6, 18 or 26)
- Returns
E (int): the number of edges of self
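A short sketch of building a grid-neighbourhood graph and querying geodesic distances with dijkstra(); the WeightedGraph import path and the fact that it can be built from a vertex count alone are assumptions, and the coordinates are toy values.

import numpy as np
from nipy.algorithms.graph.graph import WeightedGraph  # assumed import path

# integer coordinates of a 3 x 3 x 1 grid
xyz = np.array([(i, j, 0) for i in range(3) for j in range(3)], dtype=int)

G = WeightedGraph(len(xyz))
n_edges = G.from_3d_grid(xyz, 6)   # 6-connectivity
print(n_edges)

dg = G.dijkstra(0)                 # geodesic distances from vertex 0
print(dg)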
get_E()
To get the number of edges in the graph.
get_V()
To get the number of vertices in the graph.
get_children(v=-1)
Get the children of a node/each node.
- Parameters
v: int, optional
a node index
- Returns
children: list of int, the list of children of node v (if v is provided);
a list of lists of int, the children of all nodes otherwise
get_descendants(v, exclude_self=False)
Returns the nodes that are descendants of v as a list.
- Parameters
v: int, a node index
- Returns
desc: list of int, the list of all descendants of the input node
get_edges()
To get the graph’s edges.
get_vertices()
To get the graph’s vertices (as id).
get_weights()
To get the graph’s edge weights.
is_connected()
States whether self is connected or not.
isleaf()
Identification of the leaves of the forest.
- Returns
leaves: bool array of shape(self.V), indicator of the forest’s leaves
isroot()
Returns an indicator of nodes being roots.
- Returns
roots, array of shape(self.V, bool), indicator of the forest’s roots
kruskal()
Creates the Minimum Spanning Tree of self using Kruskal’s algorithm. Efficient if self is sparse.
- Returns
K, WeightedGraph instance: the resulting MST
Notes
If self contains several connected components, K will have the same number of connected components
leaves_of_a_subtree(ids, custom=False)
Tests whether the given nodes are the leaves of a certain subtree.
- Parameters
ids: array of shape (n) that takes values in [0..self.V-1]
custom=False: boolean
if custom==True the behavior of the function is more specific:
- the different connected components are considered as being in the same greater tree
- when a node has more than two sub-branches, any subset of these children is considered as a subtree
left_incidence()
Return the left incidence matrix.
- Returns
left_incid: list
the left incidence matrix of self as a list of lists, i.e. the list [[e.0.0, .., e.0.i(0)], .., [e.V.0, e.V.i(V)]], where e.i.j is the set of edge indexes such that e.i.j[0] = i
list_of_neighbors()
Returns the set of neighbors of self as a list of arrays.
main_cc()
Returns the indexes of the vertices within the main connected component.
- Returns
idx: array of shape (size of the main cc)
merge_simple_branches()
Return a subforest, where chained branches are collapsed.
- Returns
sf, Forest instance, same as self, without any chain
normalize(c=0)
Normalize the graph according to the index c. Normalization means that the edge values that go into or out of each vertex must sum to 1.
- Parameters
c=0 in {0, 1, 2}, optional: index that designates the way according to which D is normalized
c == 0 => for each vertex a, sum_{edge[e, 0]=a} D[e] = 1
c == 1 => for each vertex b, sum_{edge[e, 1]=b} D[e] = 1
c == 2 => symmetric (‘l2’) normalization
Notes
Note that when sum_{edge[e, .] == a } D[e] = 0, nothing is performed
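A small sketch of the c == 0 (outgoing) normalization on a toy directed graph; the WeightedGraph import path is an assumption, and normalize() is assumed to act in place.

import numpy as np
from nipy.algorithms.graph.graph import WeightedGraph  # assumed import path

# 3 vertices, edges 0->1, 0->2, 1->2 with unequal weights
G = WeightedGraph(3, np.array([[0, 1], [0, 2], [1, 2]]), np.array([1., 3., 2.]))

G.normalize(0)           # outgoing weights of each vertex now sum to 1
print(G.get_weights())   # expected roughly [0.25, 0.75, 1.0]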
propagate_upward(label)
Propagation of a certain labelling from leaves to roots. Assuming that label is a certain positive integer field, this propagates these labels to the parents whenever the children nodes have coherent properties; otherwise the parent value is unchanged.
- Parameters
label: array of shape(self.V)
- Returns
label: array of shape(self.V)
propagate_upward_and(prop)
Propagates from leaves to roots some binary property of the nodes so that prop[parents] = logical_and(prop[children]).
- Parameters
prop, array of shape(self.V), the input property
- Returns
prop, array of shape(self.V), the output property field
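A small sketch of upward propagation of a binary property on a toy forest (the parent array, heights and property values are invented).

import numpy as np
from nipy.algorithms.clustering.hierarchical_clustering import WeightedForest

# node 3 merges leaves 0 and 1; node 4 is the root
wf = WeightedForest(5, np.array([3, 3, 4, 4, 4]), np.array([0., 0., 0., .5, 1.]))

prop = np.array([True, True, False, False, False])  # property set on the leaves
print(wf.propagate_upward_and(prop))
# node 3 should become True (both its children are True); the root stays False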
remove_edges(valid)
Removes all the edges for which valid==0.
- Parameters
valid : (self.E,) array
remove_trivial_edges()
Removes trivial edges, i.e. edges that are (v, v)-like; self.weights and self.E are corrected accordingly.
- Returns
self.E (int): The number of edges
reorder_from_leaves_to_roots()
Reorder the tree so that the leaves come first, then their parents, and so on, with the roots last.
- Returns
order: array of shape(self.V)
the order of the old vertices in the reordered graph
right_incidence()
Return the right incidence matrix.
- Returns
right_incid: list
the right incidence matrix of self as a list of lists, i.e. the list [[e.0.0, .., e.0.i(0)], .., [e.V.0, e.V.i(V)]], where e.i.j is the set of edge indexes such that e.i.j[1] = i
set_edges(edges)
Sets the graph’s edges.
Preconditions:
- edges has a correct size
- edges take values in [1..V]
set_euclidian(X)
Compute the weights of the graph as the distances between the corresponding rows of X, which represents an embedding of self.
- Parameters
X: array of shape (self.V, edim)
the coordinate matrix of the embedding
set_gaussian(X, sigma=0)
Compute the weights of the graph as a Gaussian function of the distance between the corresponding rows of X, which represents an embedding of self.
- Parameters
X: array of shape (self.V, dim)
the coordinate matrix of the embedding
sigma=0, float: the parameter of the Gaussian function
Notes
When sigma == 0, the following value is used:
sigma = sqrt(mean(||X[self.edges[:, 0], :]-X[self.edges[:, 1], :]||^2))
set_weights(weights)
Set edge weights.
- Parameters
weights: array of shape (self.E), the edge weights
show(X=None, ax=None)
Plots the current graph in 2D.
- Parameters
X : None or array of shape (self.V, 2)
a set of coordinates that can be used to embed the vertices in 2D. If X.shape[1] > 2, an SVD reduces X for display. By default, the graph is presented on a circle
ax: None or int, optional
ax handle
- Returns
ax: axis handle
Notes
This should be used only for small graphs.
subforest(valid)
Creates a subforest with the vertices for which valid > 0.
- Parameters
valid: array of shape (self.V): indicator of the selected nodes
- Returns
subforest: a new forest instance, with a reduced set of nodes
Notes
The children of deleted vertices become their own parent
subgraph(valid)
Creates a subgraph with the vertices for which valid > 0 and with the corresponding set of edges.
- Parameters
valid: array of shape (self.V): nonzero for vertices to be retained
- Returns
G, WeightedGraph instance, the desired subgraph of self
Notes
The vertices are renumbered as [1..p] where p = sum(valid > 0); when sum(valid) == 0, None is returned
symmeterize()
Symmetrize self: modify edges and weights so that self.adjacency becomes the symmetric part of the current self.adjacency.
to_coo_matrix()
Return the adjacency matrix as a sparse coo matrix.
- Returns
sp: scipy.sparse matrix instance
that encodes the adjacency matrix of self
tree_depth()
Returns the number of hierarchical levels in the tree.
voronoi_diagram(seeds, samples)
Defines the graph as the Voronoi diagram (VD) that links the seeds. The VD is defined using the sample points.
- Parameters
seeds: array of shape (self.V, dim)
samples: array of shape (nsamples, dim)
Notes
By default, the weights are a Gaussian function of the distance. The implementation is not optimal.
voronoi_labelling(seed)
Performs a Voronoi labelling of the graph.
- Parameters
seed: array of shape (nseeds), type (np.int),
vertices from which the cells are built
- Returns
labels: array of shape (self.V) the labelling of the vertices
Functions
nipy.algorithms.clustering.hierarchical_clustering.average_link_graph(G)
Agglomerative function based on a (hopefully sparse) similarity graph.
- Parameters
G: the input graph
- Returns
t: a WeightedForest structure that represents the dendrogram of the data
nipy.algorithms.clustering.hierarchical_clustering.average_link_graph_segment(G, stop=0, qmax=1, verbose=False)
Agglomerative function based on a (hopefully sparse) similarity graph.
- Parameters
G: the input graph
stop: float
the stopping criterion
qmax: int, optional
the number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
- Returns
u: array of shape (G.V)
a labelling of the graph vertices according to the criterion
cost: array of shape (G.V (?))
the cost of each merge step during the clustering procedure
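A hedged sketch of similarity-based average-link clustering; the WeightedGraph import path is an assumption, the blobs are invented toy data, and set_gaussian() (documented above) is used to turn a complete distance graph into a similarity graph.

import numpy as np
from nipy.algorithms.graph.graph import WeightedGraph  # assumed import path
from nipy.algorithms.clustering.hierarchical_clustering import (
    average_link_graph, average_link_graph_segment)

# two well-separated 2D blobs
np.random.seed(0)
X = np.vstack((np.random.randn(10, 2), 10. + np.random.randn(10, 2)))
n = len(X)

# complete graph (no self-loops) whose weights become Gaussian similarities
i, j = np.where(~np.eye(n, dtype=bool))
G = WeightedGraph(n, np.vstack((i, j)).T, np.ones(i.size))
G.set_gaussian(X)

t = average_link_graph(G)                        # dendrogram (WeightedForest)
u, cost = average_link_graph_segment(G, qmax=2)  # flat labelling
print(np.unique(u))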
nipy.algorithms.clustering.hierarchical_clustering.fusion(K, pop, i, j, k)
Modifies the graph K to merge nodes i and j into node k.
The similarity values are weighted averages, where pop[i] and pop[j] yield the relative weights. This is used in average_link_slow (deprecated).
nipy.algorithms.clustering.hierarchical_clustering.ward(G, feature, verbose=False)
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters
G : graph
the input graph (a topological graph essentially)
feature : array of shape (G.V,dim_feature)
vectorial information related to the graph vertices
verbose : bool, optional
If True, print diagnostic information
- Returns
t : WeightedForest instance, a structure that represents the dendrogram
Notes
When G has more than 1 connected component, t is no longer a tree. This case is handled cleanly now
nipy.algorithms.clustering.hierarchical_clustering.ward_field_segment(F, stop=-1, qmax=-1, verbose=False)
Agglomerative function based on a field structure.
- Parameters
F: the input field (graph + feature)
stop: float, optional
the stopping criterion; if stop==-1, then no stopping criterion is used
qmax: int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
- Returns
u: array of shape (F.V)
labelling of the graph vertices according to the criterion
cost: array of shape (F.V - 1)
the cost of each merge step during the clustering procedure
Notes
See ward_quick_segment for more information
Caveat : only approximate
nipy.algorithms.clustering.hierarchical_clustering.ward_quick(G, feature, verbose=False)
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters
G : graph instance
topology-defining graph
feature: array of shape (G.V,dim_feature)
some vectorial information related to the graph vertices
verbose : bool, optional
If True, print diagnostic information
- Returns
t: WeightedForest instance
that represents the dendrogram of the data
Notes
Hopefully a quicker version
A Euclidean distance is used in the feature space
Caveat: only approximate
nipy.algorithms.clustering.hierarchical_clustering.ward_quick_segment(G, feature, stop=-1, qmax=1, verbose=False)
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters
G: WeightedGraph instance
the input graph (a topological graph essentially)
feature: array of shape (G.V, dim_feature)
vectorial information related to the graph vertices
stop : int or float, optional
the stopping criterion; if stop==-1, then no stopping criterion is used
qmax : int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
- Returns
u: array of shape (G.V)
labelling of the graph vertices according to the criterion
cost: array of shape (G.V - 1)
the cost of each merge step during the clustering procedure
Notes
Hopefully a quicker version
A Euclidean distance is used in the feature space
Caveat: only approximate
nipy.algorithms.clustering.hierarchical_clustering.ward_segment(G, feature, stop=-1, qmax=1, verbose=False)
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters
G : graph object
the input graph (a topological graph essentially)
feature : array of shape (G.V,dim_feature)
some vectorial information related to the graph vertices
stop : int or float, optional
the stopping criterion; if stop==-1, then no stopping criterion is used
qmax : int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
verbose : bool, optional
If True, print diagnostic information
- Returns
u: array of shape (G.V):
a labelling of the graph vertices according to the criterion
cost: array of shape (G.V - 1)
the cost of each merge step during the clustering procedure
Notes
A Euclidean distance is used in the feature space
Caveat: when the number of connected components in G (nbcc) is greater than qmax, u contains nbcc values, not qmax!
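To close, a hedged sketch of ward_segment on an image-like lattice; the WeightedGraph import path and its construction from a vertex count are assumptions, and the feature values are invented.

import numpy as np
from nipy.algorithms.graph.graph import WeightedGraph  # assumed import path
from nipy.algorithms.clustering.hierarchical_clustering import ward_segment

# 10 x 10 grid graph as the topological constraint
xyz = np.array([(i, j, 0) for i in range(10) for j in range(10)], dtype=int)
G = WeightedGraph(len(xyz))
G.from_3d_grid(xyz, 6)

# feature with two spatial levels plus a little noise
feature = np.where(xyz[:, 0] >= 5, 10., 0.)[:, None]
feature = feature + 0.01 * np.random.randn(len(xyz), 1)

u, cost = ward_segment(G, feature, stop=-1, qmax=2)
print(np.unique(u))   # two labels expected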