UMAPModel#

class spark_rapids_ml.umap.UMAPModel(embedding_: ndarray, raw_data_: Union[ndarray, csr_matrix], sparse_fit: bool, n_cols: int, dtype: str)#

Methods

`clear`(param)	Reset a Spark ML Param to its default value, setting matching cuML parameter, if exists.
`copy`([extra])	Create a copy of the current instance, including its parameters and cuml_params.
`cpu`()	Returns the equivalent PySpark CPU model.
`explainParam`(param)	Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
`explainParams`()	Returns the documentation of all params with their optional default values and user-supplied values.
`extractParamMap`([extra])	Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflicts the later source wins, i.e., with ordering: default param values < user-supplied values < extra.
`getA`()	Gets the value of a.
`getB`()	Gets the value of b.
`getBuildAlgo`()	Gets the value of build_algo.
`getBuildKwds`()	Gets the value of build_kwds.
`getFeaturesCol`()	Gets the value of `featuresCol` or `featuresCols`.
`getFeaturesCols`()	Gets the value of featuresCols or its default value.
`getInit`()	Gets the value of init.
`getLabelCol`()	Gets the value of labelCol or its default value.
`getLearningRate`()	Gets the value of learning_rate.
`getLocalConnectivity`()	Gets the value of local_connectivity.
`getMetric`()	Gets the value of metric.
`getMetricKwds`()	Gets the value of metric_kwds.
`getMinDist`()	Gets the value of min_dist.
`getNComponents`()	Gets the value of n_components.
`getNEpochs`()	Gets the value of n_epochs.
`getNNeighbors`()	Gets the value of n_neighbors.
`getNegativeSampleRate`()	Gets the value of negative_sample_rate.
`getOrDefault`(param)	Gets the value of a param in the user-supplied param map or its default value.
`getOutputCol`()	Gets the value of `outputCol`.
`getParam`(paramName)	Gets a param by its name.
`getRandomState`()	Gets the value of random_state.
`getRepulsionStrength`()	Gets the value of repulsion_strength.
`getSampleFraction`()	Gets the value of sample_fraction.
`getSetOpMixRatio`()	Gets the value of set_op_mix_ratio.
`getSpread`()	Gets the value of spread.
`getTransformQueueSize`()	Gets the value of transform_queue_size.
`hasDefault`(param)	Checks whether a param has a default value.
`hasParam`(paramName)	Tests whether this instance contains a param with a given (string) name.
`isDefined`(param)	Checks whether a param is explicitly set by user or has a default value.
`isSet`(param)	Checks whether a param is explicitly set by user.
`load`(path)	Reads an ML instance from the input path, a shortcut of read().load(path).
`read`()	Returns an MLReader instance for this class.
`save`(path)	Save this ML instance to the given path, a shortcut of 'write().save(path)'.
`set`(param, value)	Sets a parameter in the embedded param map.
`setA`(value)	Sets the value of a.
`setB`(value)	Sets the value of b.
`setBuildAlgo`(value)	Sets the value of build_algo.
`setBuildKwds`(value)	Sets the value of build_kwds.
`setFeaturesCol`(value)	Sets the value of `featuresCol` or `featuresCols`.
`setFeaturesCols`(value)	Sets the value of `featuresCols`.
`setInit`(value)	Sets the value of init.
`setLabelCol`(value)	Sets the value of `labelCol`.
`setLearningRate`(value)	Sets the value of learning_rate.
`setLocalConnectivity`(value)	Sets the value of local_connectivity.
`setMetric`(value)	Sets the value of metric.
`setMetricKwds`(value)	Sets the value of metric_kwds.
`setMinDist`(value)	Sets the value of min_dist.
`setNComponents`(value)	Sets the value of n_components.
`setNEpochs`(value)	Sets the value of n_epochs.
`setNNeighbors`(value)	Sets the value of n_neighbors.
`setNegativeSampleRate`(value)	Sets the value of negative_sample_rate.
`setOutputCol`(value)	Sets the value of `outputCol`.
`setRandomState`(value)	Sets the value of random_state.
`setRepulsionStrength`(value)	Sets the value of repulsion_strength.
`setSampleFraction`(value)	Sets the value of sample_fraction.
`setSetOpMixRatio`(value)	Sets the value of set_op_mix_ratio.
`setSpread`(value)	Sets the value of spread.
`setTransformQueueSize`(value)	Sets the value of transform_queue_size.
`transform`(dataset[, params])	Transforms the input dataset with optional parameters.
`write`()	Returns an MLWriter instance for this ML instance.

Attributes

`a`
`b`
`build_algo`
`build_kwds`
`cuml_params`	Returns the dictionary of parameters intended for the underlying cuML class.
`embedding`	Returns the model embeddings.
`enable_sparse_data_optim`
`featuresCol`
`featuresCols`
`init`
`labelCol`
`learning_rate`
`local_connectivity`
`metric`
`metric_kwds`
`min_dist`
`n_components`
`n_epochs`
`n_neighbors`
`negative_sample_rate`
`num_workers`	Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.
`outputCol`
`params`	Returns all params ordered by name.
`random_state`
`rawData`	Returns the raw data used to fit the model.
`repulsion_strength`
`sample_fraction`
`set_op_mix_ratio`
`spread`
`transform_queue_size`
`verbose`

Methods Documentation

clear(param: Param) → None#: Reset a Spark ML Param to its default value, setting matching cuML parameter, if exists.

copy(extra: Optional[ParamMap] = None) → P#

Create a copy of the current instance, including its parameters and cuml_params.

This function extends the default copy() method to ensure the cuml_params variable is also copied. The default super().copy() method only handles _paramMap and _defaultParamMap.

Parameters:

extra (Optional[ParamMap]): A dictionary or ParamMap containing additional parameters to set in the copied instance. Note that ParamMap = Dict[pyspark.ml.param.Param, Any].

Returns:

P: A new instance of the same type as the current object, with parameters and cuml_params copied.

Raises:

TypeError: If any key in the extra dictionary is not an instance of pyspark.ml.param.Param.

cpu() → Model#: Returns the equivalent PySpark CPU model.

explainParam(param: Union[str, Param]) → str#: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str#: Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) → ParamMap#

Extracts the embedded default param values and user-supplied values, then merges them with extra values from the input into a flat param map. On conflicts the later source wins, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional): extra param values

Returns:

dict: merged param map
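
The precedence ordering can be illustrated with plain dictionaries. This is only a sketch of the merge rule, not the library's implementation: the helper name `merge_param_maps` and the values below are made up for illustration, and real param maps are keyed by Param objects rather than strings.

```python
# Sketch of the precedence rule: defaults < user-supplied < extra.
# Keys are plain strings here for readability; real param maps use Param objects.
defaults = {"n_neighbors": 15, "min_dist": 0.1, "spread": 1.0}  # hypothetical defaults
user_supplied = {"n_neighbors": 30}
extra = {"n_neighbors": 50, "min_dist": 0.25}


def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three maps so that later sources win on conflicting keys."""
    merged = dict(defaults)          # start from the defaults
    merged.update(user_supplied)     # user-supplied values override defaults
    merged.update(extra or {})       # extra values override everything else
    return merged


merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here `n_neighbors` is taken from `extra`, `min_dist` from `extra`, and `spread` falls back to the default because nothing overrides it.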

getA() → float#: Gets the value of a.

getB() → float#: Gets the value of b.

getBuildAlgo() → str#: Gets the value of build_algo.

getBuildKwds() → Optional[Dict[str, Any]]#: Gets the value of build_kwds.

getFeaturesCol() → Union[str, List[str]]#: Gets the value of featuresCol or featuresCols.

getFeaturesCols() → List[str]#: Gets the value of featuresCols or its default value.

getInit() → str#: Gets the value of init.

getLabelCol() → str#: Gets the value of labelCol or its default value.

getLearningRate() → float#: Gets the value of learning_rate.

getLocalConnectivity() → float#: Gets the value of local_connectivity.

getMetric() → str#: Gets the value of metric.

getMetricKwds() → Optional[Dict[str, Any]]#: Gets the value of metric_kwds.

getMinDist() → float#: Gets the value of min_dist.

getNComponents() → int#: Gets the value of n_components.

getNEpochs() → int#: Gets the value of n_epochs.

getNNeighbors() → float#: Gets the value of n_neighbors.

getNegativeSampleRate() → int#: Gets the value of negative_sample_rate.

getOrDefault(param: Union[str, Param[T]]) → Union[Any, T]#: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol() → str#: Gets the value of outputCol, the column that contains the embeddings of the input data.

getParam(paramName: str) → Param#: Gets a param by its name.

getRandomState() → int#: Gets the value of random_state.

getRepulsionStrength() → float#: Gets the value of repulsion_strength.

getSampleFraction() → float#: Gets the value of sample_fraction.

getSetOpMixRatio() → float#: Gets the value of set_op_mix_ratio.

getSpread() → float#: Gets the value of spread.

getTransformQueueSize() → float#: Gets the value of transform_queue_size.

hasDefault(param: Union[str, Param[Any]]) → bool#: Checks whether a param has a default value.

hasParam(paramName: str) → bool#: Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, Param[Any]]) → bool#: Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, Param[Any]]) → bool#: Checks whether a param is explicitly set by user.
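
The difference between hasDefault, isSet, isDefined, and getOrDefault comes down to which map each one consults. The sketch below models that with a small stand-in class; `ParamState` is hypothetical and uses string keys, so it mirrors the documented behavior rather than the PySpark implementation.

```python
class ParamState:
    """Hypothetical stand-in that tracks defaults and user-set values separately."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def hasDefault(self, name):
        # True only if a default value exists.
        return name in self._defaults

    def isSet(self, name):
        # True only if the user explicitly set the value.
        return name in self._user

    def isDefined(self, name):
        # True if either a user-supplied value or a default exists.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        # User-supplied values take priority over defaults.
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # the documented method also raises when neither is set


state = ParamState({"spread": 1.0})  # illustrative default
state.set("random_state", 42)
```

With this state, `spread` has a default but is not set, `random_state` is set but has no default, and both count as defined.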

classmethod load(path: str) → RL#: Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() → MLReader#: Returns an MLReader instance for this class.

save(path: str) → None#: Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) → None#: Sets a parameter in the embedded param map.

setA(value: float) → P#: Sets the value of a.

setB(value: float) → P#: Sets the value of b.

setBuildAlgo(value: str) → P#: Sets the value of build_algo.

setBuildKwds(value: Dict[str, Any]) → P#: Sets the value of build_kwds.

setFeaturesCol(value: Union[str, List[str]]) → P#: Sets the value of featuresCol or featuresCols. Used when input vectors are stored in a single column.

setFeaturesCols(value: List[str]) → P#: Sets the value of featuresCols. Used when input vectors are stored as multiple feature columns.

setInit(value: str) → P#: Sets the value of init.

setLabelCol(value: str) → P#: Sets the value of labelCol.

setLearningRate(value: float) → P#: Sets the value of learning_rate.

setLocalConnectivity(value: float) → P#: Sets the value of local_connectivity.

setMetric(value: str) → P#: Sets the value of metric.

setMetricKwds(value: Dict[str, Any]) → P#: Sets the value of metric_kwds.

setMinDist(value: float) → P#: Sets the value of min_dist.

setNComponents(value: int) → P#: Sets the value of n_components.

setNEpochs(value: int) → P#: Sets the value of n_epochs.

setNNeighbors(value: float) → P#: Sets the value of n_neighbors.

setNegativeSampleRate(value: int) → P#: Sets the value of negative_sample_rate.

setOutputCol(value: str) → P#: Sets the value of outputCol, the column that contains the embeddings of the input data.

setRandomState(value: int) → P#: Sets the value of random_state.

setRepulsionStrength(value: float) → P#: Sets the value of repulsion_strength.

setSampleFraction(value: float) → P#: Sets the value of sample_fraction.

setSetOpMixRatio(value: float) → P#: Sets the value of set_op_mix_ratio.

setSpread(value: float) → P#: Sets the value of spread.

setTransformQueueSize(value: float) → P#: Sets the value of transform_queue_size.

transform(dataset: DataFrame, params: Optional[ParamMap] = None) → DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:

dataset (pyspark.sql.DataFrame): input dataset
params (dict, optional): an optional param map that overrides embedded params.

Returns:

pyspark.sql.DataFrame: transformed dataset
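
A minimal usage sketch, assuming `model` is an already fitted UMAPModel and `dataset` is a DataFrame holding the feature column(s) the model was configured with. The helper name `embed_with_fitted_model` is hypothetical; it relies only on the setOutputCol and transform methods documented on this page.

```python
def embed_with_fitted_model(model, dataset, output_col):
    """Run a fitted model over a DataFrame and return the transformed DataFrame.

    Assumes `model` exposes setOutputCol() and transform() as documented here.
    Note that setOutputCol() changes the model in place.
    """
    model.setOutputCol(output_col)   # name of the column that will hold the embeddings
    return model.transform(dataset)  # returns a new DataFrame; `dataset` is not modified
```

A call might look like `embedded_df = embed_with_fitted_model(model, df, "embedding")`, where the column name is the caller's choice.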

write() → MLWriter#: Returns an MLWriter instance for this ML instance.

Attributes Documentation

a = Param(parent='undefined', name='a', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#

b = Param(parent='undefined', name='b', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#
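
To give a feel for how `a` and `b` relate to `min_dist` and `spread`, the sketch below follows the formulation used by the reference UMAP algorithm, where the smooth curve 1 / (1 + a * x^(2b)) is chosen to approximate a piecewise exponential defined by `min_dist` and `spread`. Whether the cuML backend uses exactly this procedure is an assumption here, and all function names are illustrative. No fitting is performed; the code only compares two candidate pairs.

```python
import math


def target_curve(x, min_dist, spread):
    """Piecewise curve: 1 within min_dist, then exponential decay scaled by spread."""
    if x <= min_dist:
        return 1.0
    return math.exp(-(x - min_dist) / spread)


def smooth_curve(x, a, b):
    """Differentiable family 1 / (1 + a * x^(2b)) controlled by a and b."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))


def fit_error(a, b, min_dist=0.1, spread=1.0, n_points=300):
    """Sum of squared differences between the two curves on [0, 3 * spread]."""
    step = 3.0 * spread / (n_points - 1)
    xs = [i * step for i in range(n_points)]
    return sum(
        (smooth_curve(x, a, b) - target_curve(x, min_dist, spread)) ** 2 for x in xs
    )


# (1.577, 0.895) is roughly the pair commonly quoted for min_dist=0.1, spread=1.0
# in the reference implementation; (1.0, 1.0) is an arbitrary comparison point.
print(fit_error(1.577, 0.895), fit_error(1.0, 1.0))
```

The first pair should give the smaller error, which is the sense in which `a` and `b` are "determined by" `min_dist` and `spread`.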

build_algo = Param(parent='undefined', name='build_algo', doc="How to build the knn graph. Supported build algorithms are ['auto', 'brute_force_knn', 'nn_descent']. 'auto' chooses to run with brute force knn if number of data rows is smaller than or equal to 50K. Otherwise, runs with nn descent.")#
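
The 'auto' rule described in the doc string can be written as a small function. This is a sketch of the stated rule only: the name `resolve_build_algo` is hypothetical, and "50K" is read as exactly 50,000 rows.

```python
SUPPORTED_BUILD_ALGOS = ("auto", "brute_force_knn", "nn_descent")


def resolve_build_algo(build_algo: str, n_rows: int) -> str:
    """Return the concrete kNN build algorithm for a given row count."""
    if build_algo not in SUPPORTED_BUILD_ALGOS:
        raise ValueError(f"unsupported build_algo: {build_algo!r}")
    if build_algo != "auto":
        return build_algo  # an explicit choice is used as given
    # 'auto': brute force for up to 50K rows (assumed to mean 50,000), NN Descent otherwise.
    return "brute_force_knn" if n_rows <= 50_000 else "nn_descent"
```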

build_kwds = Param(parent='undefined', name='build_kwds', doc="Build algorithm argument {'nnd_graph_degree': 64, 'nnd_intermediate_graph_degree': 128, 'nnd_max_iterations': 20, 'nnd_termination_threshold': 0.0001, 'nnd_return_distances': True, 'nnd_n_clusters': 1} Note that nnd_n_clusters > 1 will result in batch-building with NN Descent.")#

cuml_params#: Returns the dictionary of parameters intended for the underlying cuML class.

embedding#: Returns the model embeddings.

enable_sparse_data_optim = Param(parent='undefined', name='enable_sparse_data_optim', doc='This param activates sparse data optimization for VectorUDT features column. If the param is not included in an Estimator class, Spark rapids ml always converts VectorUDT features column into dense arrays when calling cuml backend. If included, Spark rapids ml will determine whether to create sparse arrays based on the param value: (1) If None, create dense arrays if the first VectorUDT of a dataframe is DenseVector. Create sparse arrays if it is SparseVector. (2) If False, create dense arrays. This is favorable if the majority of vectors are DenseVector. (3) If True, create sparse arrays. This is favorable if the majority of the VectorUDT vectors are SparseVector.')#
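
The three cases in the doc string reduce to a short decision function. The sketch below encodes only that stated rule; `should_use_sparse` and its arguments are hypothetical names, not part of the library.

```python
from typing import Optional


def should_use_sparse(
    enable_sparse_data_optim: Optional[bool], first_vector_is_sparse: bool
) -> bool:
    """Decide whether sparse arrays would be created, per the three documented cases."""
    if enable_sparse_data_optim is None:
        # Case 1: follow the type of the first vector in the DataFrame.
        return first_vector_is_sparse
    # Cases 2 and 3: False forces dense arrays, True forces sparse arrays.
    return bool(enable_sparse_data_optim)
```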

featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')#

featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')#

init = Param(parent='undefined', name='init', doc="How to initialize the low dimensional embedding. Options are: 'spectral': use a spectral embedding of the fuzzy 1-skeleton, 'random': assign initial embedding positions at random.")#

labelCol: Param[str] = Param(parent='undefined', name='labelCol', doc='label column name.')#

learning_rate = Param(parent='undefined', name='learning_rate', doc='The initial learning rate for the embedding optimization.')#

local_connectivity = Param(parent='undefined', name='local_connectivity', doc='The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.')#

metric = Param(parent='undefined', name='metric', doc="Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard'] Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The 'jaccard' distance metric is only supported for sparse inputs.")#

metric_kwds = Param(parent='undefined', name='metric_kwds', doc='Additional keyword arguments for the metric function. If the metric function takes additional arguments, they should be passed in this dictionary.')#
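
As an illustration of a metric that takes an argument, the sketch below computes a Minkowski distance in plain Python and passes the exponent through a dictionary, the way `metric_kwds` carries extra arguments. The key name `p` is an assumption based on the usual Minkowski convention, and the `minkowski` function is illustrative, not the backend's implementation.

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


metric = "minkowski"
metric_kwds = {"p": 3.0}  # assumed key name for the Minkowski exponent

# The dictionary is unpacked into the metric function as keyword arguments.
distance = minkowski([0.0, 0.0], [3.0, 4.0], **metric_kwds)
print(distance)
```

With p=1 the same function reduces to the Manhattan distance and with p=2 to the Euclidean distance.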

min_dist = Param(parent='undefined', name='min_dist', doc='The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the ``spread`` value, which determines the scale at which embedded points will be spread out.')#

n_components = Param(parent='undefined', name='n_components', doc='The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.')#

n_epochs = Param(parent='undefined', name='n_epochs', doc='The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).')#

n_neighbors = Param(parent='undefined', name='n_neighbors', doc='The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.')#

negative_sample_rate = Param(parent='undefined', name='negative_sample_rate', doc='The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.')#

num_workers#: Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

outputCol: Param[str] = Param(parent='undefined', name='outputCol', doc='output column name.')#

params#: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

random_state = Param(parent='undefined', name='random_state', doc='The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.')#
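
The remark about non-deterministic ordering rests on the fact that floating-point addition is not associative. The snippet below shows this with three ordinary Python floats; it is a general illustration and does not involve the GPU code path.

```python
# The same three numbers summed with different groupings give different results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # a tiny, but nonzero, difference
```

When many such additions run in parallel, the grouping can change from run to run, which is why results may differ slightly even with a fixed seed.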

rawData#: Returns the raw data used to fit the model. If the input data was sparse, this will be a scipy csr matrix.

repulsion_strength = Param(parent='undefined', name='repulsion_strength', doc='Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.')#

sample_fraction = Param(parent='undefined', name='sample_fraction', doc="The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node's memory and execute in a reasonable time. Smaller fractions will result in faster training, but may result in sub-optimal embeddings.")#

set_op_mix_ratio = Param(parent='undefined', name='set_op_mix_ratio', doc='Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.')#
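
For a single pair of directed membership strengths, the interpolation can be written out directly. This sketch follows the formulation in the reference UMAP algorithm, using the product t-norm for the intersection and its dual for the union; the function name `mix_memberships` is hypothetical, and the backend applies the equivalent operation to a whole graph rather than one pair.

```python
def mix_memberships(w_ij: float, w_ji: float, set_op_mix_ratio: float = 1.0) -> float:
    """Blend fuzzy union and fuzzy intersection of two membership strengths."""
    if not 0.0 <= set_op_mix_ratio <= 1.0:
        raise ValueError("set_op_mix_ratio must be between 0.0 and 1.0")
    intersection = w_ij * w_ji          # product t-norm
    union = w_ij + w_ji - intersection  # probabilistic sum, dual to the product t-norm
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection


print(mix_memberships(0.8, 0.5, 1.0))  # pure fuzzy union
print(mix_memberships(0.8, 0.5, 0.0))  # pure fuzzy intersection
print(mix_memberships(0.8, 0.5, 0.5))  # halfway between the two
```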

spread = Param(parent='undefined', name='spread', doc='The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.')#

transform_queue_size = Param(parent='undefined', name='transform_queue_size', doc='For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.')#

verbose: Param[Union[int, bool]] = Param(parent='undefined', name='verbose', doc='cuml verbosity level (False, True or an integer between 0 and 6).')#