algorithms.utils.pca¶
Module: algorithms.utils.pca
¶
This module provides a class for principal components analysis (PCA).
PCA is an orthonormal, linear transform (i.e., a rotation) that maps the data to a new coordinate system such that the maximal variability of the data lies on the first coordinate (or the first principal component), the second greatest variability is projected onto the second coordinate, and so on. The resulting data has unit covariance (i.e., it is decorrelated). This technique can be used to reduce the dimensionality of the data.
More specifically, the data is projected onto the eigenvectors of the covariance matrix.
Functions¶
-
nipy.algorithms.utils.pca.
pca
(data, axis=0, mask=None, ncomp=None, standardize=True, design_keep=None, design_resid='mean', tol_ratio=0.01)[source]¶ Compute the SVD PCA of an array-like thing over axis.
- Parameters
data : ndarray-like (np.float)
The array on which to perform PCA over axis axis (below)
axis : int, optional
The axis over which to perform PCA (axis identifying observations). Default is 0 (first)
mask : ndarray-like (np.bool), optional
An optional mask, should have shape given by data axes, with axis removed, i.e.:
s = data.shape; s.pop(axis); msk_shape=s
ncomp : {None, int}, optional
How many component basis projections to return. If ncomp is None (the default) then the number of components is given by the calculated rank of the data, after applying design_keep, design_resid and tol_ratio below. We always return all the basis vectors and percent variance for each component; ncomp refers only to the number of basis_projections returned.
standardize : bool, optional
If True, standardize so each time series (after application of design_keep and design_resid) has the same standard deviation, as calculated by the
np.std
function.design_keep : None or ndarray, optional
Data is projected onto the column span of design_keep. None (default) equivalent to
np.identity(data.shape[axis])
design_resid : str or None or ndarray, optional
After projecting onto the column span of design_keep, data is projected perpendicular to the column span of this matrix. If None, we do no such second projection. If a string ‘mean’, then the mean of the data is removed, equivalent to passing a column vector matrix of 1s.
tol_ratio : float, optional
If
XZ
is the vector of singular values of the projection matrix from design_keep and design_resid, and S are the singular values ofXZ
, then tol_ratio is the value used to calculate the effective rank of the projection of the design, as inrank = ((S / S.max) > tol_ratio).sum()
- Returns
results : dict
\(G\) is the number of non-trivial components found after applying
tol_ratio to the projections of design_keep and design_resid.
results has keys:
basis_vectors
: series over axis, shape (data.shape[axis], G) -the eigenvectors of the PCA
pcnt_var
: percent variance explained by component, shape(G,)
basis_projections
: PCA components, with components varyingover axis axis; thus shape given by:
s = list(data.shape); s[axis] = ncomp
axis
: axis over which PCA has been performed.
Notes
See
pca_image.m
fromfmristat
for Keith Worsley’s code on which some of this is based.See: http://en.wikipedia.org/wiki/Principal_component_analysis for some inspiration for naming - particularly ‘basis_vectors’ and ‘basis_projections’
Examples
>>> arr = np.random.normal(size=(17, 10, 12, 14)) >>> msk = np.all(arr > -2, axis=0) >>> res = pca(arr, mask=msk, ncomp=9)
Basis vectors are columns. There is one column for each component. The number of components is the calculated rank of the data matrix after applying the various projections listed in the parameters. In this case we are only removing the mean, so the number of components is one less than the axis over which we do the PCA (here axis=0 by default).
>>> res['basis_vectors'].shape (17, 16)
Basis projections are arrays with components in the dimension over which we have done the PCA (axis=0 by default). Because we set ncomp above, we only retain ncomp components.
>>> res['basis_projections'].shape (9, 10, 12, 14)
-
nipy.algorithms.utils.pca.
pca_image
(img, axis='t', mask=None, ncomp=None, standardize=True, design_keep=None, design_resid='mean', tol_ratio=0.01)[source]¶ Compute the PCA of an image over a specified axis
- Parameters
img : Image
The image on which to perform PCA over the given axis
axis : str or int, optional
Axis over which to perform PCA. Default is ‘t’. If axis is an integer, gives the index of the input (domain) axis of img. If axis is a str, can be an input (domain) name, or an output (range) name, that maps to an input (domain) name.
mask : Image, optional
An optional mask, should have shape == image.shape[:3] and the same coordinate map as img but with axis dropped
ncomp : {None, int}, optional
How many component basis projections to return. If ncomp is None (the default) then the number of components is given by the calculated rank of the data, after applying design_keep, design_resid and tol_ratio below. We always return all the basis vectors and percent variance for each component; ncomp refers only to the number of basis_projections returned.
standardize : bool, optional
If True, standardize so each time series (after application of design_keep and design_resid) has the same standard deviation, as calculated by the
np.std
function.design_keep : None or ndarray, optional
Data is projected onto the column span of design_keep. None (default) equivalent to
np.identity(data.shape[axis])
design_resid : str or None or ndarray, optional
After projecting onto the column span of design_keep, data is projected perpendicular to the column span of this matrix. If None, we do no such second projection. If a string ‘mean’, then the mean of the data is removed, equivalent to passing a column vector matrix of 1s.
tol_ratio : float, optional
If
XZ
is the vector of singular values of the projection matrix from design_keep and design_resid, and S are the singular values ofXZ
, then tol_ratio is the value used to calculate the effective rank of the projection of the design, as inrank = ((S / S.max) > tol_ratio).sum()
- Returns
results : dict
\(L\) is the number of non-trivial components found after applying
tol_ratio to the projections of design_keep and design_resid.
results has keys: *
basis_vectors
: series over axis, shape (data.shape[axis], L) -the eigenvectors of the PCA
pcnt_var
: percent variance explained by component, shape(L,)
basis_projections
: PCA components, with components varyingover axis axis; thus shape given by:
s = list(data.shape); s[axis] = ncomp
axis
: axis over which PCA has been performed.
Examples
>>> from nipy.testing import funcfile >>> from nipy import load_image >>> func_img = load_image(funcfile)
Time is the fourth axis
>>> func_img.coordmap.function_range CoordinateSystem(coord_names=('aligned-x=L->R', 'aligned-y=P->A', 'aligned-z=I->S', 't'), name='aligned', coord_dtype=float64) >>> func_img.shape (17, 21, 3, 20)
Calculate the PCA over time, by default
>>> res = pca_image(func_img) >>> res['basis_projections'].coordmap.function_range CoordinateSystem(coord_names=('aligned-x=L->R', 'aligned-y=P->A', 'aligned-z=I->S', 'PCA components'), name='aligned', coord_dtype=float64)
The number of components is one less than the number of time points
>>> res['basis_projections'].shape (17, 21, 3, 19)