UMAP#

class spark_rapids_ml.umap.UMAP(*, n_neighbors: Optional[float] = 15, n_components: Optional[int] = 2, metric: str = 'euclidean', n_epochs: Optional[int] = None, learning_rate: Optional[float] = 1.0, init: Optional[str] = 'spectral', min_dist: Optional[float] = 0.1, spread: Optional[float] = 1.0, set_op_mix_ratio: Optional[float] = 1.0, local_connectivity: Optional[float] = 1.0, repulsion_strength: Optional[float] = 1.0, negative_sample_rate: Optional[int] = 5, transform_queue_size: Optional[float] = 4.0, a: Optional[float] = None, b: Optional[float] = None, precomputed_knn: Optional[List[List[float]]] = None, random_state: Optional[int] = None, sample_fraction: Optional[float] = 1.0, featuresCol: Optional[Union[str, List[str]]] = None, labelCol: Optional[str] = None, outputCol: Optional[str] = None, num_workers: Optional[int] = None, verbose: Union[int, bool] = False, **kwargs: Any)#

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique used for low-dimensional data visualization and general non-linear dimension reduction. The algorithm finds a low dimensional embedding of the data that approximates an underlying manifold. The fit() method constructs a KNN-graph representation of an input dataset and then optimizes a low dimensional embedding; it runs on a single node. The transform() method maps an input dataset into the optimized embedding space and runs in a distributed fashion across the cluster.

Parameters:
n_neighbors: float (optional, default=15)

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components: int (optional, default=2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

metric: str (optional, default='euclidean')

Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.
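As a rough illustration of how some of these metric names relate to one another, the pure-Python sketch below computes a Minkowski distance and shows that order 1 and order 2 reproduce the Manhattan ('l1', 'cityblock', 'taxicab') and Euclidean ('l2') distances. The `minkowski` helper and its `p` argument are illustrative assumptions and are not part of the spark_rapids_ml API.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


u, v = [0.0, 0.0], [3.0, 4.0]

# Order 1 sums the absolute coordinate differences (Manhattan distance).
manhattan = minkowski(u, v, p=1)

# Order 2 is the usual straight-line (Euclidean) distance.
euclidean = minkowski(u, v, p=2)

print(manhattan, euclidean)
```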

n_epochs: int (optional, default=None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default=1.0)

The initial learning rate for the embedding optimization.

init: str (optional, default='spectral')

How to initialize the low dimensional embedding. Options are:

  • 'spectral': use a spectral embedding of the fuzzy 1-skeleton.

  • 'random': assign initial embedding positions at random.

min_dist: float (optional, default=0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default=1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default=1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
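The interpolation can be sketched for a single pair of points. The snippet below is a minimal illustration, assuming the two directed membership strengths between points i and j are already known; the function name `mix_memberships` and its arguments are hypothetical and do not correspond to anything exposed by the library.

```python
def mix_memberships(w_ij, w_ji, set_op_mix_ratio=1.0):
    """Combine two directed fuzzy membership strengths into one symmetric weight."""
    product = w_ij * w_ji            # fuzzy intersection under the product t-norm
    union = w_ij + w_ji - product    # matching fuzzy union (probabilistic sum)
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * product


pure_union = mix_memberships(0.8, 0.5, set_op_mix_ratio=1.0)
pure_intersection = mix_memberships(0.8, 0.5, set_op_mix_ratio=0.0)
halfway = mix_memberships(0.8, 0.5, set_op_mix_ratio=0.5)

print(pure_union, pure_intersection, halfway)
```

With a ratio of 1.0 an edge survives if either point considers the other a neighbor; with 0.0 both must agree, which tends to produce a sparser graph.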

local_connectivity: int (optional, default=1)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default=1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default=5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size: float (optional, default=4.0)

For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a: float (optional, default=None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default=None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
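To give a feel for how a and b relate to min_dist and spread, the sketch below assumes the curve family used by the reference UMAP implementation: a smooth curve 1 / (1 + a * d^(2b)) is fitted to a target that equals 1 up to min_dist and then decays exponentially at a rate set by spread. The helper names are hypothetical, and the values a ≈ 1.577 and b ≈ 0.895 are the ones commonly quoted for min_dist=0.1 and spread=1.0; they are treated here as an assumption rather than output of this library.

```python
import math


def target_curve(d, min_dist, spread):
    """Target membership: 1 inside min_dist, exponential decay beyond it."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)


def fitted_curve(d, a, b):
    """Smooth approximation controlled by a and b."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))


def squared_error(a, b, min_dist=0.1, spread=1.0, n=300):
    """Sum of squared differences on an even grid over [0, 3 * spread]."""
    total = 0.0
    for i in range(n):
        d = 3.0 * spread * i / (n - 1)
        total += (fitted_curve(d, a, b) - target_curve(d, min_dist, spread)) ** 2
    return total


# Assumed near-optimal values for the defaults, compared with a naive guess.
err_fitted = squared_error(1.577, 0.895)
err_naive = squared_error(1.0, 1.0)
print(err_fitted < err_naive)
```

Leaving a and b as None lets this fit happen automatically; setting them directly bypasses min_dist and spread.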

precomputed_knn: array / sparse array / tuple - device or host (optional, default=None)

Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples) or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings.

random_state: int, RandomState instance (optional, default=None)

The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.

verbose
Logging level.
  • 0 - Disables all log messages.

  • 1 - Enables only critical messages.

  • 2 - Enables all messages up to and including errors.

  • 3 - Enables all messages up to and including warnings.

  • 4 or False - Enables all messages up to and including information messages.

  • 5 or True - Enables all messages up to and including debug messages.

  • 6 - Enables all messages up to and including trace messages.
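Because verbose accepts either an integer or a boolean, it can help to see the mapping listed above written out. The helper below is a hypothetical sketch of that mapping, not a function provided by spark_rapids_ml.

```python
def normalize_verbose(verbose):
    """Map a bool or int verbose setting to a numeric level from 0 to 6."""
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(verbose, bool):
        return 5 if verbose else 4
    if not 0 <= verbose <= 6:
        raise ValueError("verbose must be a bool or an integer from 0 to 6")
    return int(verbose)


print(normalize_verbose(False))  # information messages
print(normalize_verbose(True))   # debug messages
print(normalize_verbose(2))      # errors and more severe
```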

sample_fraction: float (optional, default=1.0)

The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node’s memory and execute in a reasonable time. Smaller fractions will result in faster training, but may result in sub-optimal embeddings.
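One rough way to choose a starting value is to compare the size of the raw feature matrix with the memory available on the fitting node. The helper below is a hypothetical back-of-the-envelope sketch: it assumes 4-byte (float32) features and ignores the KNN graph and optimizer working memory, so the result is best treated as an upper bound.

```python
def suggest_sample_fraction(n_rows, n_features, memory_budget_bytes, bytes_per_value=4):
    """Largest fraction whose raw feature matrix fits within the memory budget."""
    dataset_bytes = n_rows * n_features * bytes_per_value
    if dataset_bytes <= memory_budget_bytes:
        return 1.0
    return memory_budget_bytes / dataset_bytes


# One billion rows of 128 float32 features against a 16 GiB budget.
fraction = suggest_sample_fraction(1_000_000_000, 128, 16 * 1024**3)

# A small dataset fits entirely, so no subsampling is needed.
small = suggest_sample_fraction(500, 5, 16 * 1024**3)

print(round(fraction, 4), small)
```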

featuresCol: str or List[str]

The feature column name or names. spark-rapids-ml supports vector, array, and columnar input.

  • When the value is a string, the feature columns must be assembled into 1 column with vector or array type.

  • When the value is a list of strings, the feature columns must be numeric types.

labelCol: str (optional)

The name of the column that contains labels. If provided, supervised fitting will be performed, where labels will be taken into account when optimizing the embedding.

outputCol: str (optional)

The name of the column that contains embeddings. If not provided, the default name of “embedding” will be used.

num_workers:

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU. If not set, spark-rapids-ml tries to infer the number of cuML workers (i.e. GPUs in cluster) from the Spark environment.

Examples

>>> # Assumes an active SparkSession named `spark`.
>>> import numpy as np
>>> import cupy as cp
>>> from cuml.datasets import make_blobs
>>> from pyspark.sql.functions import array
>>> from spark_rapids_ml.umap import UMAP
>>> X, _ = make_blobs(500, 5, centers=42, cluster_std=0.1, dtype=np.float32, random_state=10)
>>> feature_cols = [f"c{i}" for i in range(X.shape[1])]
>>> schema = [f"{c} float" for c in feature_cols]
>>> df = spark.createDataFrame(X.tolist(), ",".join(schema))
>>> df = df.withColumn("features", array(*feature_cols)).drop(*feature_cols)
>>> df.show(10, False)

features
[1.5578103, -9.300072, 9.220654, 4.5838223, -3.2613218]
[9.295866, 1.3326015, -4.6483326, 4.43685, 6.906736]
[1.1148645, 0.9800974, -9.67569, -8.020592, -3.748023]
[-4.6454153, -8.095899, -4.9839406, 7.954683, -8.15784]
[-6.5075264, -5.538241, -6.740191, 3.0490158, 4.1693997]
[7.9449835, 4.142317, 6.207676, 3.202615, 7.1319785]
[-0.3837125, 6.826891, -4.35618, -9.582829, -1.5456663]
[2.5012932, 4.2080708, 3.5172815, 2.5741744, -6.291008]
[9.317718, 1.3419528, -4.832837, 4.5362573, 6.9357944]
[-6.65039, -5.438729, -6.858565, 2.9733503, 3.99863]
only showing top 10 rows

>>> umap_estimator = UMAP(sample_fraction=0.5, num_workers=3).setFeaturesCol("features")
>>> umap_model = umap_estimator.fit(df)
>>> output = umap_model.transform(df).toPandas()
>>> embedding = cp.asarray(output["embedding"].to_list())
>>> print("First 10 embeddings:")
>>> print(embedding[:10])

First 10 embeddings:
[[ 5.378397 6.504756 ]
 [ 12.531521 13.946098 ]
 [ 11.990916 6.049594 ]
 [-14.175631 7.4849815]
 [ 7.065363 -16.75355 ]
 [ 1.8876278 1.0889664]
 [ 0.6557462 17.965862 ]
 [-16.220764 -6.4817486]
 [ 12.476492 13.80965 ]
 [ 6.823325 -16.71719 ]]

Methods

clear(param)

Resets a Spark ML Param to its default value and, if a matching cuML parameter exists, resets that as well.

copy([extra])

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits multiple models to the input dataset for all param maps in a single pass.

getA()

Gets the value of a.

getB()

Gets the value of b.

getFeaturesCol()

Gets the value of featuresCol or featuresCols.

getFeaturesCols()

Gets the value of featuresCols or its default value.

getInit()

Gets the value of init.

getLabelCol()

Gets the value of labelCol or its default value.

getLearningRate()

Gets the value of learning_rate.

getLocalConnectivity()

Gets the value of local_connectivity.

getMetric()

Gets the value of metric.

getMinDist()

Gets the value of min_dist.

getNComponents()

Gets the value of n_components.

getNEpochs()

Gets the value of n_epochs.

getNNeighbors()

Gets the value of n_neighbors.

getNegativeSampleRate()

Gets the value of negative_sample_rate.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets the value of outputCol.

getParam(paramName)

Gets a param by its name.

getPrecomputedKNN()

Gets the value of precomputed_knn.

getRandomState()

Gets the value of random_state.

getRepulsionStrength()

Gets the value of repulsion_strength.

getSampleFraction()

Gets the value of sample_fraction.

getSetOpMixRatio()

Gets the value of set_op_mix_ratio.

getSpread()

Gets the value of spread.

getTransformQueueSize()

Gets the value of transform_queue_size.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setA(value)

Sets the value of a.

setB(value)

Sets the value of b.

setFeaturesCol(value)

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value)

Sets the value of featuresCols.

setInit(value)

Sets the value of init.

setLabelCol(value)

Sets the value of labelCol.

setLearningRate(value)

Sets the value of learning_rate.

setLocalConnectivity(value)

Sets the value of local_connectivity.

setMetric(value)

Sets the value of metric.

setMinDist(value)

Sets the value of min_dist.

setNComponents(value)

Sets the value of n_components.

setNEpochs(value)

Sets the value of n_epochs.

setNNeighbors(value)

Sets the value of n_neighbors.

setNegativeSampleRate(value)

Sets the value of negative_sample_rate.

setOutputCol(value)

Sets the value of outputCol.

setPrecomputedKNN(value)

Sets the value of precomputed_knn.

setRandomState(value)

Sets the value of random_state.

setRepulsionStrength(value)

Sets the value of repulsion_strength.

setSampleFraction(value)

Sets the value of sample_fraction.

setSetOpMixRatio(value)

Sets the value of set_op_mix_ratio.

setSpread(value)

Sets the value of spread.

setTransformQueueSize(value)

Sets the value of transform_queue_size.

write()

Attributes

a

b

cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

featuresCol

featuresCols

init

labelCol

learning_rate

local_connectivity

metric

min_dist

n_components

n_epochs

n_neighbors

negative_sample_rate

num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

outputCol

params

Returns all params ordered by name.

precomputed_knn

random_state

repulsion_strength

sample_fraction

set_op_mix_ratio

spread

transform_queue_size

Methods Documentation

clear(param: Param) None#

Resets a Spark ML Param to its default value and, if a matching cuML parameter exists, resets that as well.

copy(extra: Optional[ParamMap] = None) P#

explainParam(param: Union[str, Param]) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:
extra: dict, optional

extra param values

Returns:
dict

merged param map
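The precedence described above (default param values < user-supplied values < extra) behaves like successive dictionary updates. The sketch below illustrates that ordering with plain dicts keyed by parameter name; `merge_param_maps` is a hypothetical stand-in, not the method itself.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three layers of parameter values; later layers win on conflict."""
    merged = dict(defaults)
    merged.update(user_supplied)
    merged.update(extra or {})
    return merged


defaults = {"n_neighbors": 15, "min_dist": 0.1, "spread": 1.0}
user_supplied = {"n_neighbors": 30, "min_dist": 0.3}
extra = {"min_dist": 0.5}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here n_neighbors comes from the user-supplied layer, min_dist from extra, and spread falls back to its default.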

fit(dataset: DataFrame, params: Optional[Union[ParamMap, List[ParamMap], Tuple[ParamMap]]] = None) Union[M, List[M]]#

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
dataset: pyspark.sql.DataFrame

input dataset.

params: dict or list or tuple, optional

an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:
Transformer or a list of Transformer

fitted model(s)

fitMultiple(dataset: DataFrame, paramMaps: Sequence[ParamMap]) Iterator[Tuple[int, _CumlModel]]#

Fits multiple models to the input dataset for all param maps in a single pass.

Parameters:
dataset: pyspark.sql.DataFrame

input dataset.

paramMaps: collections.abc.Sequence

A Sequence of param maps.

Returns:
_FitMultipleIterator

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
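Since the returned indices may arrive out of order, callers usually place each model by its index rather than appending. The sketch below shows that pattern with a stand-in iterator of (index, model) pairs; `collect_models` and the string "models" are hypothetical placeholders for real fitted models.

```python
def collect_models(model_iterator, n_param_maps):
    """Arrange (index, model) pairs so that models[i] matches paramMaps[i]."""
    models = [None] * n_param_maps
    for index, model in model_iterator:
        models[index] = model
    return models


# Stand-in for the iterator returned by fitMultiple; order is deliberately shuffled.
fake_iterator = iter([(2, "model-c"), (0, "model-a"), (1, "model-b")])

models = collect_models(fake_iterator, 3)
print(models)
```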

getA() float#

Gets the value of a.

getB() float#

Gets the value of b.

getFeaturesCol() Union[str, List[str]]#

Gets the value of featuresCol or featuresCols.

getFeaturesCols() List[str]#

Gets the value of featuresCols or its default value.

getInit() str#

Gets the value of init.

getLabelCol() str#

Gets the value of labelCol or its default value.

getLearningRate() float#

Gets the value of learning_rate.

getLocalConnectivity() float#

Gets the value of local_connectivity.

getMetric() str#

Gets the value of metric.

getMinDist() float#

Gets the value of min_dist.

getNComponents() int#

Gets the value of n_components.

getNEpochs() int#

Gets the value of n_epochs.

getNNeighbors() float#

Gets the value of n_neighbors.

getNegativeSampleRate() int#

Gets the value of negative_sample_rate.

getOrDefault(param: Union[str, Param[T]]) Union[Any, T]#

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol() str#

Gets the value of outputCol. Contains the embeddings of the input data.

getParam(paramName: str) Param#

Gets a param by its name.

getPrecomputedKNN() List[List[float]]#

Gets the value of precomputed_knn.

getRandomState() int#

Gets the value of random_state.

getRepulsionStrength() float#

Gets the value of repulsion_strength.

getSampleFraction() float#

Gets the value of sample_fraction.

getSetOpMixRatio() float#

Gets the value of set_op_mix_ratio.

getSpread() float#

Gets the value of spread.

getTransformQueueSize() float#

Gets the value of transform_queue_size.

hasDefault(param: Union[str, Param[Any]]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, Param[Any]]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, Param[Any]]) bool#

Checks whether a param is explicitly set by user.
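The relationship between hasDefault, isSet, isDefined and getOrDefault can be summarized with a toy model. The class below is a hypothetical, simplified sketch that keys parameters by name; it mirrors the documented behavior but is not the Spark ML implementation, and the choice of KeyError for the undefined case is an assumption.

```python
class TinyParams:
    """Toy parameter store with separate default and user-supplied maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        # The user-supplied value wins; otherwise fall back to the default.
        if name in self._user:
            return self._user[name]
        return self._defaults[name]  # raises KeyError if neither is present


params = TinyParams({"n_components": 2})
before = (params.isSet("n_components"), params.isDefined("n_components"))
params.set("n_components", 3)
after = (params.isSet("n_components"), params.getOrDefault("n_components"))
print(before, after)
```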

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() MLReader#

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setA(value: float) P#

Sets the value of a.

setB(value: float) P#

Sets the value of b.

setFeaturesCol(value: Union[str, List[str]]) P#

Sets the value of featuresCol or featuresCols. Used when input vectors are stored in a single column.

setFeaturesCols(value: List[str]) P#

Sets the value of featuresCols. Used when input vectors are stored as multiple feature columns.

setInit(value: str) P#

Sets the value of init.

setLabelCol(value: str) P#

Sets the value of labelCol.

setLearningRate(value: float) P#

Sets the value of learning_rate.

setLocalConnectivity(value: float) P#

Sets the value of local_connectivity.

setMetric(value: str) P#

Sets the value of metric.

setMinDist(value: float) P#

Sets the value of min_dist.

setNComponents(value: int) P#

Sets the value of n_components.

setNEpochs(value: int) P#

Sets the value of n_epochs.

setNNeighbors(value: float) P#

Sets the value of n_neighbors.

setNegativeSampleRate(value: int) P#

Sets the value of negative_sample_rate.

setOutputCol(value: str) P#

Sets the value of outputCol. Contains the embeddings of the input data.

setPrecomputedKNN(value: List[List[float]]) P#

Sets the value of precomputed_knn.

setRandomState(value: int) P#

Sets the value of random_state.

setRepulsionStrength(value: float) P#

Sets the value of repulsion_strength.

setSampleFraction(value: float) P#

Sets the value of sample_fraction.

setSetOpMixRatio(value: float) P#

Sets the value of set_op_mix_ratio.

setSpread(value: float) P#

Sets the value of spread.

setTransformQueueSize(value: float) P#

Sets the value of transform_queue_size.

write() MLWriter#

Attributes Documentation

a = Param(parent='undefined', name='a', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#
b = Param(parent='undefined', name='b', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#
cuml_params#

Returns the dictionary of parameters intended for the underlying cuML class.

featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')#
featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')#
init = Param(parent='undefined', name='init', doc="How to initialize the low dimensional embedding. Options are: 'spectral': use a spectral embedding of the fuzzy 1-skeleton, 'random': assign initial embedding positions at random.")#
labelCol: Param[str] = Param(parent='undefined', name='labelCol', doc='label column name.')#
learning_rate = Param(parent='undefined', name='learning_rate', doc='The initial learning rate for the embedding optimization.')#
local_connectivity = Param(parent='undefined', name='local_connectivity', doc='The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.')#
metric = Param(parent='undefined', name='metric', doc="Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard']. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.")#
min_dist = Param(parent='undefined', name='min_dist', doc='The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the ``spread`` value, which determines the scale at which embedded points will be spread out.')#
n_components = Param(parent='undefined', name='n_components', doc='The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.')#
n_epochs = Param(parent='undefined', name='n_epochs', doc='The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).')#
n_neighbors = Param(parent='undefined', name='n_neighbors', doc='The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.')#
negative_sample_rate = Param(parent='undefined', name='negative_sample_rate', doc='The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.')#
num_workers#

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

outputCol: Param[str] = Param(parent='undefined', name='outputCol', doc='output column name.')#
params#

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

precomputed_knn = Param(parent='undefined', name='precomputed_knn', doc='Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples) or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeedings.')#
random_state = Param(parent='undefined', name='random_state', doc='The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.')#
repulsion_strength = Param(parent='undefined', name='repulsion_strength', doc='Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.')#
sample_fraction = Param(parent='undefined', name='sample_fraction', doc="The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node's memory and execute in a reasonable time. Smaller fractions will result in faster training, but may result in sub-optimal embeddings.")#
set_op_mix_ratio = Param(parent='undefined', name='set_op_mix_ratio', doc='Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.')#
spread = Param(parent='undefined', name='spread', doc='The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.')#
transform_queue_size = Param(parent='undefined', name='transform_queue_size', doc='For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.')#