UMAPModel

class spark_rapids_ml.umap.UMAPModel(embedding_: List[Broadcast], raw_data_: List[Broadcast], n_cols: int, dtype: str)

Methods

clear(param)

Resets a Spark ML Param to its default value, also setting the matching cuML parameter, if one exists.

copy([extra])

cpu()

Returns the equivalent PySpark CPU model.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

getA()

Gets the value of a.

getB()

Gets the value of b.

getFeaturesCol()

Gets the value of featuresCol or featuresCols.

getFeaturesCols()

Gets the value of featuresCols or its default value.

getInit()

Gets the value of init.

getLabelCol()

Gets the value of labelCol or its default value.

getLearningRate()

Gets the value of learning_rate.

getLocalConnectivity()

Gets the value of local_connectivity.

getMetric()

Gets the value of metric.

getMinDist()

Gets the value of min_dist.

getNComponents()

Gets the value of n_components.

getNEpochs()

Gets the value of n_epochs.

getNNeighbors()

Gets the value of n_neighbors.

getNegativeSampleRate()

Gets the value of negative_sample_rate.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets the value of outputCol.

getParam(paramName)

Gets a param by its name.

getPrecomputedKNN()

Gets the value of precomputed_knn.

getRandomState()

Gets the value of random_state.

getRepulsionStrength()

Gets the value of repulsion_strength.

getSampleFraction()

Gets the value of sample_fraction.

getSetOpMixRatio()

Gets the value of set_op_mix_ratio.

getSpread()

Gets the value of spread.

getTransformQueueSize()

Gets the value of transform_queue_size.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

save(path)

Saves this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setA(value)

Sets the value of a.

setB(value)

Sets the value of b.

setFeaturesCol(value)

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value)

Sets the value of featuresCols.

setInit(value)

Sets the value of init.

setLabelCol(value)

Sets the value of labelCol.

setLearningRate(value)

Sets the value of learning_rate.

setLocalConnectivity(value)

Sets the value of local_connectivity.

setMetric(value)

Sets the value of metric.

setMinDist(value)

Sets the value of min_dist.

setNComponents(value)

Sets the value of n_components.

setNEpochs(value)

Sets the value of n_epochs.

setNNeighbors(value)

Sets the value of n_neighbors.

setNegativeSampleRate(value)

Sets the value of negative_sample_rate.

setOutputCol(value)

Sets the value of outputCol.

setPrecomputedKNN(value)

Sets the value of precomputed_knn.

setRandomState(value)

Sets the value of random_state.

setRepulsionStrength(value)

Sets the value of repulsion_strength.

setSampleFraction(value)

Sets the value of sample_fraction.

setSetOpMixRatio(value)

Sets the value of set_op_mix_ratio.

setSpread(value)

Sets the value of spread.

setTransformQueueSize(value)

Sets the value of transform_queue_size.

transform(dataset[, params])

Transforms the input dataset with optional parameters.

write()

Attributes

a

b

cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

embedding

featuresCol

featuresCols

init

labelCol

learning_rate

local_connectivity

metric

min_dist

n_components

n_epochs

n_neighbors

negative_sample_rate

num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

outputCol

params

Returns all params ordered by name.

precomputed_knn

random_state

raw_data

repulsion_strength

sample_fraction

set_op_mix_ratio

spread

transform_queue_size

Methods Documentation

clear(param: Param) → None

Resets a Spark ML Param to its default value, also setting the matching cuML parameter, if one exists.

copy(extra: Optional[ParamMap] = None) → P

cpu() → Model

Returns the equivalent PySpark CPU model.

explainParam(param: Union[str, Param]) → str

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) → ParamMap

Extracts the embedded default param values and user-supplied values, then merges them with the extra values from the input into a flat param map. On conflict the later value wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:
extra : dict, optional

extra param values

Returns:
dict

merged param map
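
The precedence rule for extractParamMap can be illustrated with plain dictionaries. This is only a sketch of the ordering, not the PySpark implementation; the parameter values below are made up for illustration.

```python
# Illustration of the merge ordering: defaults < user-supplied < extra.
# Plain dicts stand in for param maps; all values here are hypothetical.
defaults = {"n_neighbors": 15, "min_dist": 0.1, "spread": 1.0}
user_supplied = {"n_neighbors": 30}
extra = {"n_neighbors": 50, "min_dist": 0.25}

# Later mappings override earlier ones when keys conflict.
merged = {**defaults, **user_supplied, **extra}
print(merged)
```

Here `n_neighbors` and `min_dist` come from `extra`, while `spread` falls back to its default because nothing overrides it.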

getA() → float

Gets the value of a.

getB() → float

Gets the value of b.

getFeaturesCol() → Union[str, List[str]]

Gets the value of featuresCol or featuresCols.

getFeaturesCols() → List[str]

Gets the value of featuresCols or its default value.

getInit() → str

Gets the value of init.

getLabelCol() → str

Gets the value of labelCol or its default value.

getLearningRate() → float

Gets the value of learning_rate.

getLocalConnectivity() → float

Gets the value of local_connectivity.

getMetric() → str

Gets the value of metric.

getMinDist() → float

Gets the value of min_dist.

getNComponents() → int

Gets the value of n_components.

getNEpochs() → int

Gets the value of n_epochs.

getNNeighbors() → float

Gets the value of n_neighbors.

getNegativeSampleRate() → int

Gets the value of negative_sample_rate.

getOrDefault(param: Union[str, Param[T]]) → Union[Any, T]

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol() → str

Gets the value of outputCol. Contains the embeddings of the input data.

getParam(paramName: str) → Param

Gets a param by its name.

getPrecomputedKNN() → List[List[float]]

Gets the value of precomputed_knn.

getRandomState() → int

Gets the value of random_state.

getRepulsionStrength() → float

Gets the value of repulsion_strength.

getSampleFraction() → float

Gets the value of sample_fraction.

getSetOpMixRatio() → float

Gets the value of set_op_mix_ratio.

getSpread() → float

Gets the value of spread.

getTransformQueueSize() → float

Gets the value of transform_queue_size.

hasDefault(param: Union[str, Param[Any]]) → bool

Checks whether a param has a default value.

hasParam(paramName: str) → bool

Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, Param[Any]]) → bool

Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, Param[Any]]) → bool

Checks whether a param is explicitly set by user.
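
How isSet, hasDefault, isDefined and getOrDefault relate to one another may be easier to see in a toy model. The `ToyParams` class below is a hypothetical stand-in written for this illustration, not the PySpark or spark_rapids_ml class. It keeps defaults and user-supplied values in two separate dictionaries, and raising `KeyError` is an assumption, since the text above only says that an error is raised.

```python
class ToyParams:
    """Hypothetical stand-in for a Params object (illustration only)."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default param values
        self._user = {}                  # values explicitly set by the user

    def set(self, name, value):
        self._user[name] = value

    def clear(self, name):
        # Drop the user-supplied value so the default (if any) applies again.
        self._user.pop(name, None)

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        # Defined means explicitly set OR backed by a default.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


p = ToyParams({"spread": 1.0})
p.set("min_dist", 0.25)
```

In this sketch `spread` is defined but not set, while `min_dist` is set but has no default, so clearing it leaves it undefined.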

classmethod load(path: str) → RL

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() → MLReader

save(path: str) → None

Saves this ML instance to the given path, a shortcut of ‘write().save(path)’.
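
A save and load round trip might look like the following sketch. The helper name `save_and_reload` and its `path` argument are hypothetical, and calling the function requires a Spark session with spark_rapids_ml and a GPU available. The import is placed inside the function so that the snippet can at least be defined without those dependencies.

```python
def save_and_reload(model, path):
    """Persist a fitted UMAPModel and read it back (sketch, not verified on a cluster)."""
    # Lazy import: spark_rapids_ml is only needed when the function is called.
    from spark_rapids_ml.umap import UMAPModel

    model.save(path)             # shortcut for model.write().save(path)
    return UMAPModel.load(path)  # shortcut for UMAPModel.read().load(path)
```

In PySpark, saving to an existing path normally fails unless the writer is told to overwrite, for example via `model.write().overwrite().save(path)`; whether that applies unchanged to this model is an assumption here.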

set(param: Param, value: Any) → None

Sets a parameter in the embedded param map.

setA(value: float) → P

Sets the value of a.

setB(value: float) → P

Sets the value of b.

setFeaturesCol(value: Union[str, List[str]]) → P

Sets the value of featuresCol or featuresCols. Used when input vectors are stored in a single column.

setFeaturesCols(value: List[str]) → P

Sets the value of featuresCols. Used when input vectors are stored as multiple feature columns.

setInit(value: str) → P

Sets the value of init.

setLabelCol(value: str) → P

Sets the value of labelCol.

setLearningRate(value: float) → P

Sets the value of learning_rate.

setLocalConnectivity(value: float) → P

Sets the value of local_connectivity.

setMetric(value: str) → P

Sets the value of metric.

setMinDist(value: float) → P

Sets the value of min_dist.

setNComponents(value: int) → P

Sets the value of n_components.

setNEpochs(value: int) → P

Sets the value of n_epochs.

setNNeighbors(value: float) → P

Sets the value of n_neighbors.

setNegativeSampleRate(value: int) → P

Sets the value of negative_sample_rate.

setOutputCol(value: str) → P

Sets the value of outputCol. Contains the embeddings of the input data.

setPrecomputedKNN(value: List[List[float]]) → P

Sets the value of precomputed_knn.

setRandomState(value: int) → P

Sets the value of random_state.

setRepulsionStrength(value: float) → P

Sets the value of repulsion_strength.

setSampleFraction(value: float) → P

Sets the value of sample_fraction.

setSetOpMixRatio(value: float) → P

Sets the value of set_op_mix_ratio.

setSpread(value: float) → P

Sets the value of spread.

setTransformQueueSize(value: float) → P

Sets the value of transform_queue_size.

transform(dataset: DataFrame, params: Optional[ParamMap] = None) → DataFrame

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
dataset : pyspark.sql.DataFrame

input dataset

params : dict, optional

an optional param map that overrides embedded params.

Returns:
pyspark.sql.DataFrame

transformed dataset
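
A typical way to obtain a UMAPModel and call transform is sketched below. It assumes the companion `UMAP` estimator in `spark_rapids_ml.umap`, which this section does not document, and assumes it accepts the cuML-style keyword arguments shown. The helper name `embed`, the column names and the parameter values are hypothetical. Running it requires a Spark session with GPU resources, so the import sits inside the function.

```python
def embed(df, features_col="features", output_col="embedding"):
    """Fit UMAP on df and return the transformed DataFrame (sketch, not verified on a cluster)."""
    # Lazy import: spark_rapids_ml is only needed when the function is called.
    from spark_rapids_ml.umap import UMAP

    umap = (
        UMAP(n_components=2, n_neighbors=15, random_state=42)  # assumed keyword arguments
        .setFeaturesCol(features_col)  # single column holding the input vectors
        .setOutputCol(output_col)      # column that will hold the embeddings
    )
    model = umap.fit(df)         # expected to return a UMAPModel
    return model.transform(df)   # the output column contains the embeddings
```

If the features are spread over several numeric columns, setFeaturesCols with a list of column names would replace the setFeaturesCol call.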

write() → MLWriter

Attributes Documentation

a = Param(parent='undefined', name='a', doc='More specific parameters controlling the embedding. If None, these values are set automatically as determined by ``min_dist`` and ``spread``.')
b = Param(parent='undefined', name='b', doc='More specific parameters controlling the embedding. If None, these values are set automatically as determined by ``min_dist`` and ``spread``.')
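
As background on how a and b relate to min_dist and spread: in the reference umap-learn implementation, the two values are found by fitting the smooth curve 1 / (1 + a * x^(2b)) to a target that equals 1 up to min_dist and decays exponentially afterwards. The sketch below assumes the GPU implementation follows the same idea. It only defines the two curves, omits the fitting step, and uses hypothetical function names.

```python
import math


def target_membership(x, min_dist, spread):
    """Target curve: 1 up to min_dist, then exponential decay scaled by spread."""
    if x <= min_dist:
        return 1.0
    return math.exp(-(x - min_dist) / spread)


def low_dim_membership(x, a, b):
    """Smooth family 1 / (1 + a * x^(2b)); a and b are chosen to match the target."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))
```

Setting a and b explicitly bypasses this fit, which is why they are described as more specific controls than min_dist and spread.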
cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

embedding
featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')
featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')
init = Param(parent='undefined', name='init', doc="How to initialize the low dimensional embedding. Options are: 'spectral': use a spectral embedding of the fuzzy 1-skeleton, 'random': assign initial embedding positions at random.")
labelCol: Param[str] = Param(parent='undefined', name='labelCol', doc='label column name.')
learning_rate = Param(parent='undefined', name='learning_rate', doc='The initial learning rate for the embedding optimization.')
local_connectivity = Param(parent='undefined', name='local_connectivity', doc='The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.')
metric = Param(parent='undefined', name='metric', doc="Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard']. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.")
min_dist = Param(parent='undefined', name='min_dist', doc='The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the ``spread`` value, which determines the scale at which embedded points will be spread out.')
n_components = Param(parent='undefined', name='n_components', doc='The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.')
n_epochs = Param(parent='undefined', name='n_epochs', doc='The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).')
n_neighbors = Param(parent='undefined', name='n_neighbors', doc='The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.')
negative_sample_rate = Param(parent='undefined', name='negative_sample_rate', doc='The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.')
num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

outputCol: Param[str] = Param(parent='undefined', name='outputCol', doc='output column name.')
params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

precomputed_knn = Param(parent='undefined', name='precomputed_knn', doc='Either one of a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a pairwise distances dense array of shape (n_samples, n_samples) or a KNN graph sparse array (preferably CSR/COO). This feature allows the precomputation of the KNN outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings.')
random_state = Param(parent='undefined', name='random_state', doc='The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.')
raw_data
repulsion_strength = Param(parent='undefined', name='repulsion_strength', doc='Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.')
sample_fraction = Param(parent='undefined', name='sample_fraction', doc="The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node's memory and execute in a reasonable time. Smaller fractions will result in faster training, but may result in sub-optimal embeddings.")
set_op_mix_ratio = Param(parent='undefined', name='set_op_mix_ratio', doc='Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.')
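
The interpolation controlled by set_op_mix_ratio can be written out for a single pair of directed membership strengths. The sketch below follows the formula used in the reference umap-learn implementation and assumes the GPU implementation behaves the same way. The function name `mix_memberships` and its argument names are hypothetical.

```python
def mix_memberships(w_ij, w_ji, set_op_mix_ratio=1.0):
    """Combine two directed fuzzy memberships into one symmetric weight."""
    union = w_ij + w_ji - w_ij * w_ji  # fuzzy union paired with the product t-norm
    intersection = w_ij * w_ji         # fuzzy intersection (product t-norm)
    # 1.0 gives the pure union, 0.0 the pure intersection.
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection


print(mix_memberships(0.5, 0.5, 1.0))
print(mix_memberships(0.5, 0.5, 0.0))
```

With both memberships at 0.5, the pure union yields 0.75 and the pure intersection 0.25, so intermediate ratios fall between those two values.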
spread = Param(parent='undefined', name='spread', doc='The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.')
transform_queue_size = Param(parent='undefined', name='transform_queue_size', doc='For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.')