UMAP
- class spark_rapids_ml.umap.UMAP(*, n_neighbors: Optional[float] = 15, n_components: Optional[int] = 15, metric: str = 'euclidean', metric_kwds: Optional[Dict[str, Any]] = None, n_epochs: Optional[int] = None, learning_rate: Optional[float] = 1.0, init: Optional[str] = 'spectral', min_dist: Optional[float] = 0.1, spread: Optional[float] = 1.0, set_op_mix_ratio: Optional[float] = 1.0, local_connectivity: Optional[float] = 1.0, repulsion_strength: Optional[float] = 1.0, negative_sample_rate: Optional[int] = 5, transform_queue_size: Optional[float] = 1.0, a: Optional[float] = None, b: Optional[float] = None, precomputed_knn: Optional[Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]], Tuple[Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]], Union[_SupportsArray[dtype[Any]], _NestedSequence[_SupportsArray[dtype[Any]]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]]]]]] = None, random_state: Optional[int] = None, build_algo: Optional[str] = 'auto', build_kwds: Optional[Dict[str, Any]] = None, sample_fraction: Optional[float] = 1.0, featuresCol: Optional[Union[str, List[str]]] = None, labelCol: Optional[str] = None, outputCol: Optional[str] = None, num_workers: Optional[int] = None, enable_sparse_data_optim: Optional[bool] = None, **kwargs: Any)#
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique used for low-dimensional data visualization and general non-linear dimension reduction. The algorithm finds a low-dimensional embedding of the data that approximates an underlying manifold. The fit() method constructs a KNN-graph representation of an input dataset and then optimizes a low-dimensional embedding; it runs on a single node. The transform() method maps an input dataset into the optimized embedding space; it runs in a distributed fashion.
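As a rough illustration of the first step of fit(), the sketch below builds a brute-force k-nearest-neighbor graph in plain Python. The helper name knn_graph and the toy points are hypothetical, and each point is excluded from its own neighbor list for simplicity. The library delegates this step to cuML on the GPU, so this is only a conceptual sketch and not its implementation.

```python
import math


def knn_graph(points, n_neighbors):
    """Hypothetical helper: for each point, return (indices, distances)
    of its n_neighbors nearest other points under Euclidean distance."""
    graph = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with its index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        nearest = dists[:n_neighbors]
        graph.append(([j for _, j in nearest], [d for d, _ in nearest]))
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, n_neighbors=2)
for i, (indices, distances) in enumerate(graph):
    print(i, indices, distances)
```

Larger n_neighbors values connect each point to more distant samples, which is why they give a more global view of the manifold.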
- Parameters:
- n_neighbors: float (optional, default=15)
The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
- n_components: int (optional, default=2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.
- metric: str (optional, default=’euclidean’)
Distance metric to use. Supported distances are [‘l1’, ‘cityblock’, ‘taxicab’, ‘manhattan’, ‘euclidean’, ‘l2’, ‘sqeuclidean’, ‘canberra’, ‘minkowski’, ‘chebyshev’, ‘linf’, ‘cosine’, ‘correlation’, ‘hellinger’, ‘hamming’, ‘jaccard’]. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.
- metric_kwds: dict (optional, default=None)
Additional keyword arguments for the metric function. If the metric function takes additional arguments, they should be passed in this dictionary.
- n_epochs: int (optional, default=None)
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
- learning_rate: float (optional, default=1.0)
The initial learning rate for the embedding optimization.
- init: str (optional, default=’spectral’)
How to initialize the low dimensional embedding. Options are:
- ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton.
- ‘random’: assign initial embedding positions at random.
- min_dist: float (optional, default=0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
- spread: float (optional, default=1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
- set_op_mix_ratio: float (optional, default=1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
- local_connectivity: int (optional, default=1)
The local connectivity required, i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value, the more connected the manifold becomes locally. In practice this should be no more than the local intrinsic dimension of the manifold.
- repulsion_strength: float (optional, default=1.0)
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
- negative_sample_rate: int (optional, default=5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
- transform_queue_size: float (optional, default=4.0)
For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
- a: float (optional, default=None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
- b: float (optional, default=None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
- precomputed_knn: array / sparse array / tuple - device or host (optional, default=None)
One of: a tuple (indices, distances) of arrays of shape (n_samples, n_neighbors), a dense pairwise-distance array of shape (n_samples, n_samples), or a sparse KNN graph array (preferably CSR/COO). This feature allows the KNN to be precomputed outside of UMAP and also allows the use of a custom distance function. This function should match the metric used to train the UMAP embeddings.
- random_state: int, RandomState instance (optional, default=None)
The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.
- verbose
Logging level:
- 0: Disables all log messages.
- 1: Enables only critical messages.
- 2: Enables all messages up to and including errors.
- 3: Enables all messages up to and including warnings.
- 4 or False: Enables all messages up to and including information messages.
- 5 or True: Enables all messages up to and including debug messages.
- 6: Enables all messages up to and including trace messages.
- build_algo: str (optional, default=’auto’)
How to build the knn graph. Supported build algorithms are [‘auto’, ‘brute_force_knn’, ‘nn_descent’]. ‘auto’ chooses to run with brute force knn if number of data rows is smaller than or equal to 50K. Otherwise, runs with nn descent.
- build_kwds: dict (optional, default=None)
Arguments for the build algorithm, e.g. {‘nnd_graph_degree’: 64, ‘nnd_intermediate_graph_degree’: 128, ‘nnd_max_iterations’: 20, ‘nnd_termination_threshold’: 0.0001, ‘nnd_return_distances’: True, ‘nnd_n_clusters’: 1}. Note that nnd_n_clusters > 1 will result in batch-building with NN Descent.
- sample_fraction: float (optional, default=1.0)
The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node’s memory. Smaller fractions will result in faster training, but may decrease embedding quality. Note: this is not guaranteed to provide exactly the fraction specified of the total count of the given DataFrame.
- featuresCol: str or List[str]
The feature column name or names. spark-rapids-ml supports vector, array, and columnar input.
When the value is a string, the feature columns must be assembled into 1 column with vector or array type.
When the value is a list of strings, the feature columns must be numeric types.
- labelCol: str (optional)
The name of the column that contains labels. If provided, supervised fitting will be performed, where labels will be taken into account when optimizing the embedding.
- outputCol: str (optional)
The name of the column that contains embeddings. If not provided, the default name of “embedding” will be used.
- num_workers: int (optional)
Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU. If not set, spark-rapids-ml tries to infer the number of cuML workers (i.e. GPUs in cluster) from the Spark environment.
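To make the relationship between min_dist, spread, a, and b more concrete, the sketch below compares two curves. It assumes the convention of the reference umap-learn implementation, where a and b are chosen so that 1 / (1 + a * d^(2b)) approximates a target that equals 1 up to min_dist and then decays as exp(-(d - min_dist) / spread). The values a ≈ 1.577 and b ≈ 0.895 are those commonly reported by umap-learn for min_dist=0.1 and spread=1.0; they are assumptions here, not values read from this library, and the function names are hypothetical.

```python
import math


def target_curve(d, min_dist=0.1, spread=1.0):
    # Piecewise target: flat at 1 up to min_dist, then exponential decay
    # whose rate is set by spread.
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)


def low_dim_similarity(d, a, b):
    # Smooth family controlled by a and b.
    return 1.0 / (1.0 + a * d ** (2.0 * b))


# Assumed values (as reported by umap-learn for min_dist=0.1, spread=1.0).
a, b = 1.577, 0.895

# Compare the two curves on a grid of distances from 0 to 3 * spread.
grid = [i * 0.05 for i in range(61)]
max_gap = max(abs(target_curve(d) - low_dim_similarity(d, a, b)) for d in grid)
print(f"largest gap between the curves: {max_gap:.3f}")
```

A small gap suggests that leaving a and b as None and tuning min_dist and spread is usually enough; setting a and b directly bypasses that fit.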
Examples
>>> import numpy as np
>>> import cupy as cp
>>> from cuml.datasets import make_blobs
>>> from pyspark.sql.functions import array
>>> from spark_rapids_ml.umap import UMAP
>>> X, _ = make_blobs(500, 5, centers=42, cluster_std=0.1, dtype=np.float32, random_state=10)
>>> feature_cols = [f"c{i}" for i in range(X.shape[1])]
>>> schema = [f"{c} {'float'}" for c in feature_cols]
>>> df = spark.createDataFrame(X.tolist(), ",".join(schema))
>>> df = df.withColumn("features", array(*feature_cols)).drop(*feature_cols)
>>> df.show(10, False)
features
[1.5578103, -9.300072, 9.220654, 4.5838223, -3.2613218]
[9.295866, 1.3326015, -4.6483326, 4.43685, 6.906736]
[1.1148645, 0.9800974, -9.67569, -8.020592, -3.748023]
[-4.6454153, -8.095899, -4.9839406, 7.954683, -8.15784]
[-6.5075264, -5.538241, -6.740191, 3.0490158, 4.1693997]
[7.9449835, 4.142317, 6.207676, 3.202615, 7.1319785]
[-0.3837125, 6.826891, -4.35618, -9.582829, -1.5456663]
[2.5012932, 4.2080708, 3.5172815, 2.5741744, -6.291008]
[9.317718, 1.3419528, -4.832837, 4.5362573, 6.9357944]
[-6.65039, -5.438729, -6.858565, 2.9733503, 3.99863]
only showing top 10 rows
>>> umap_estimator = UMAP(sample_fraction=0.5, num_workers=3).setFeaturesCol("features")
>>> umap_model = umap_estimator.fit(df)
>>> output = umap_model.transform(df).toPandas()
>>> embedding = cp.asarray(output["embedding"].to_list())
>>> print("First 10 embeddings:")
>>> print(embedding[:10])
First 10 embeddings:
[[ 5.378397 6.504756 ]
 [ 12.531521 13.946098 ]
 [ 11.990916 6.049594 ]
 [-14.175631 7.4849815]
 [ 7.065363 -16.75355 ]
 [ 1.8876278 1.0889664]
 [ 0.6557462 17.965862 ]
 [-16.220764 -6.4817486]
 [ 12.476492 13.80965 ]
 [ 6.823325 -16.71719 ]]
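The example above fits on sample_fraction=0.5, and the parameter description notes that the exact fraction is not guaranteed. One plausible reason, assuming rows are kept independently at random (as with PySpark's DataFrame.sample), is sketched below in plain Python. The helper bernoulli_sample is hypothetical and only simulates that behavior; it does not call the library.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    # Keep each row independently with probability `fraction`.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(500))
# Sample sizes for a few different seeds: close to 250, but not fixed.
counts = [len(bernoulli_sample(rows, 0.5, seed)) for seed in range(5)]
print(counts)
```

Under this assumption, the number of rows used for fitting varies from run to run around fraction * row count, so downstream code should not rely on an exact sample size.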
Methods
- clear(param): Resets a Spark ML Param to its default value and sets the matching cuML parameter, if one exists.
- copy([extra]): Create a copy of the current instance, including its parameters and cuml_params.
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- fit(dataset[, params]): Fits a model to the input dataset with optional parameters.
- fitMultiple(dataset, paramMaps): Fits multiple models to the input dataset for all param maps in a single pass.
- getA(): Gets the value of a.
- getB(): Gets the value of b.
- getBuildAlgo(): Gets the value of build_algo.
- getBuildKwds(): Gets the value of build_kwds.
- getFeaturesCol(): Gets the value of featuresCol or featuresCols.
- getFeaturesCols(): Gets the value of featuresCols or its default value.
- getInit(): Gets the value of init.
- getLabelCol(): Gets the value of labelCol or its default value.
- getLearningRate(): Gets the value of learning_rate.
- getLocalConnectivity(): Gets the value of local_connectivity.
- getMetric(): Gets the value of metric.
- getMetricKwds(): Gets the value of metric_kwds.
- getMinDist(): Gets the value of min_dist.
- getNComponents(): Gets the value of n_components.
- getNEpochs(): Gets the value of n_epochs.
- getNNeighbors(): Gets the value of n_neighbors.
- getNegativeSampleRate(): Gets the value of negative_sample_rate.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.
- getOutputCol(): Gets the value of outputCol.
- getParam(paramName): Gets a param by its name.
- getRandomState(): Gets the value of random_state.
- getRepulsionStrength(): Gets the value of repulsion_strength.
- getSampleFraction(): Gets the value of sample_fraction.
- getSetOpMixRatio(): Gets the value of set_op_mix_ratio.
- getSpread(): Gets the value of spread.
- getTransformQueueSize(): Gets the value of transform_queue_size.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by user or has a default value.
- isSet(param): Checks whether a param is explicitly set by user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- read()
- save(path): Save this ML instance to the given path, a shortcut of 'write().save(path)'.
- set(param, value): Sets a parameter in the embedded param map.
- setA(value): Sets the value of a.
- setB(value): Sets the value of b.
- setBuildAlgo(value): Sets the value of build_algo.
- setBuildKwds(value): Sets the value of build_kwds.
- setFeaturesCol(value): Sets the value of featuresCol or featuresCols.
- setFeaturesCols(value): Sets the value of featuresCols.
- setInit(value): Sets the value of init.
- setLabelCol(value): Sets the value of labelCol.
- setLearningRate(value): Sets the value of learning_rate.
- setLocalConnectivity(value): Sets the value of local_connectivity.
- setMetric(value): Sets the value of metric.
- setMetricKwds(value): Sets the value of metric_kwds.
- setMinDist(value): Sets the value of min_dist.
- setNComponents(value): Sets the value of n_components.
- setNEpochs(value): Sets the value of n_epochs.
- setNNeighbors(value): Sets the value of n_neighbors.
- setNegativeSampleRate(value): Sets the value of negative_sample_rate.
- setOutputCol(value): Sets the value of outputCol.
- setRandomState(value): Sets the value of random_state.
- setRepulsionStrength(value): Sets the value of repulsion_strength.
- setSampleFraction(value): Sets the value of sample_fraction.
- setSetOpMixRatio(value): Sets the value of set_op_mix_ratio.
- setSpread(value): Sets the value of spread.
- setTransformQueueSize(value): Sets the value of transform_queue_size.
- write()
Attributes
- cuml_params: Returns the dictionary of parameters intended for the underlying cuML class.
- num_workers: Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.
- params: Returns all params ordered by name.
Methods Documentation
- clear(param: Param) → None
Resets a Spark ML Param to its default value and sets the matching cuML parameter, if one exists.
- copy(extra: Optional[ParamMap] = None) → P
Create a copy of the current instance, including its parameters and cuml_params.
This function extends the default copy() method to ensure the cuml_params variable is also copied. The default super().copy() method only handles _paramMap and _defaultParamMap.
- Parameters:
- extra: Optional[ParamMap]
A dictionary or ParamMap containing additional parameters to set in the copied instance. Note ParamMap = Dict[pyspark.ml.param.Param, Any].
- Returns:
- P
A new instance of the same type as the current object, with parameters and cuml_params copied.
- Raises:
- TypeError
If any key in the extra dictionary is not an instance of pyspark.ml.param.Param.
- explainParam(param: Union[str, Param]) → str
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() → str
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: Optional[ParamMap] = None) → ParamMap
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
- extra: dict, optional
extra param values
- Returns:
- dict
merged param map
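The precedence described for extractParamMap (default param values < user-supplied values < extra) can be pictured with ordinary dictionaries. This is a minimal sketch, not the library's code: merge_param_maps is a hypothetical helper, and plain string keys stand in for the pyspark.ml.param.Param keys that a real ParamMap uses.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Hypothetical helper mirroring the documented precedence."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged


defaults = {"n_neighbors": 15, "min_dist": 0.1, "spread": 1.0}
user_supplied = {"n_neighbors": 30}
extra = {"n_neighbors": 50, "min_dist": 0.25}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'n_neighbors': 50, 'min_dist': 0.25, 'spread': 1.0}
```

Values passed through extra therefore win over anything set on the estimator, which is also how the params argument of fit() overrides embedded params.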
- fit(dataset: DataFrame, params: Optional[Union[ParamMap, List[ParamMap], Tuple[ParamMap]]] = None) → Union[M, List[M]]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
- dataset: pyspark.sql.DataFrame
input dataset.
- params: dict or list or tuple, optional
an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
- Transformer or a list of Transformer
fitted model(s)
- fitMultiple(dataset: DataFrame, paramMaps: Sequence[ParamMap]) → Iterator[Tuple[int, _CumlModel]]
Fits multiple models to the input dataset for all param maps in a single pass.
- Parameters:
- dataset: pyspark.sql.DataFrame
input dataset.
- paramMaps: collections.abc.Sequence
A Sequence of param maps.
- Returns:
- _FitMultipleIterator
A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
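Because the index values returned by fitMultiple may arrive out of order, callers typically place each model by its index rather than by arrival order. The sketch below shows that consumption pattern with a hypothetical stand-in generator, fake_fit_multiple, which yields strings instead of fitted models; it does not call Spark or the library.

```python
def fake_fit_multiple(param_maps):
    """Hypothetical stand-in for the iterator returned by fitMultiple:
    yields (index, model) pairs in a non-sequential order."""
    for index in reversed(range(len(param_maps))):
        yield index, f"model fitted with {param_maps[index]}"


param_maps = [{"n_neighbors": 5}, {"n_neighbors": 15}, {"n_neighbors": 50}]

# Pre-size the result list and fill each slot by index, not arrival order.
models = [None] * len(param_maps)
for index, model in fake_fit_multiple(param_maps):
    models[index] = model

for param_map, model in zip(param_maps, models):
    print(param_map, "->", model)
```

After the loop, models[i] corresponds to param_maps[i] regardless of the order in which the iterator produced them.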
- getA() → float
Gets the value of a.
- getB() → float
Gets the value of b.
- getBuildAlgo() → str
Gets the value of build_algo.
- getBuildKwds() → Optional[Dict[str, Any]]
Gets the value of build_kwds.
- getFeaturesCol() → Union[str, List[str]]
Gets the value of featuresCol or featuresCols.
- getFeaturesCols() → List[str]
Gets the value of featuresCols or its default value.
- getInit() → str
Gets the value of init.
- getLabelCol() → str
Gets the value of labelCol or its default value.
- getLearningRate() → float
Gets the value of learning_rate.
- getLocalConnectivity() → float
Gets the value of local_connectivity.
- getMetric() → str
Gets the value of metric.
- getMetricKwds() → Optional[Dict[str, Any]]
Gets the value of metric_kwds.
- getMinDist() → float
Gets the value of min_dist.
- getNComponents() → int
Gets the value of n_components.
- getNEpochs() → int
Gets the value of n_epochs.
- getNNeighbors() → float
Gets the value of n_neighbors.
- getNegativeSampleRate() → int
Gets the value of negative_sample_rate.
- getOrDefault(param: Union[str, Param[T]]) → Union[Any, T]
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getRandomState() → int
Gets the value of random_state.
- getRepulsionStrength() → float
Gets the value of repulsion_strength.
- getSampleFraction() → float
Gets the value of sample_fraction.
- getSetOpMixRatio() → float
Gets the value of set_op_mix_ratio.
- getSpread() → float
Gets the value of spread.
- getTransformQueueSize() → float
Gets the value of transform_queue_size.
- hasParam(paramName: str) → bool
Tests whether this instance contains a param with a given (string) name.
- isDefined(param: Union[str, Param[Any]]) → bool
Checks whether a param is explicitly set by user or has a default value.
- classmethod load(path: str) → RL
Reads an ML instance from the input path, a shortcut of read().load(path).
- save(path: str) → None
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- setA(value: float) → P
Sets the value of a.
- setB(value: float) → P
Sets the value of b.
- setBuildAlgo(value: str) → P
Sets the value of build_algo.
- setBuildKwds(value: Dict[str, Any]) → P
Sets the value of build_kwds.
- setFeaturesCol(value: Union[str, List[str]]) → P
Sets the value of featuresCol or featuresCols. Used when input vectors are stored in a single column.
- setFeaturesCols(value: List[str]) → P
Sets the value of featuresCols. Used when input vectors are stored as multiple feature columns.
- setInit(value: str) → P
Sets the value of init.
- setLearningRate(value: float) → P
Sets the value of learning_rate.
- setLocalConnectivity(value: float) → P
Sets the value of local_connectivity.
- setMetric(value: str) → P
Sets the value of metric.
- setMetricKwds(value: Dict[str, Any]) → P
Sets the value of metric_kwds.
- setMinDist(value: float) → P
Sets the value of min_dist.
- setNComponents(value: int) → P
Sets the value of n_components.
- setNEpochs(value: int) → P
Sets the value of n_epochs.
- setNNeighbors(value: float) → P
Sets the value of n_neighbors.
- setNegativeSampleRate(value: int) → P
Sets the value of negative_sample_rate.
- setOutputCol(value: str) → P
Sets the value of outputCol. Contains the embeddings of the input data.
- setRandomState(value: int) → P
Sets the value of random_state.
- setRepulsionStrength(value: float) → P
Sets the value of repulsion_strength.
- setSampleFraction(value: float) → P
Sets the value of sample_fraction.
- setSetOpMixRatio(value: float) → P
Sets the value of set_op_mix_ratio.
- setSpread(value: float) → P
Sets the value of spread.
- setTransformQueueSize(value: float) → P
Sets the value of transform_queue_size.
Attributes Documentation
- a = Param(parent='undefined', name='a', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#
- b = Param(parent='undefined', name='b', doc='More specific parameters controlling the embedding. If None these values are set automatically as determined by ``min_dist`` and ``spread``.')#
- build_algo = Param(parent='undefined', name='build_algo', doc="How to build the knn graph. Supported build algorithms are ['auto', 'brute_force_knn', 'nn_descent']. 'auto' chooses to run with brute force knn if number of data rows is smaller than or equal to 50K. Otherwise, runs with nn descent.")#
- build_kwds = Param(parent='undefined', name='build_kwds', doc="Build algorithm argument {'nnd_graph_degree': 64, 'nnd_intermediate_graph_degree': 128, 'nnd_max_iterations': 20, 'nnd_termination_threshold': 0.0001, 'nnd_return_distances': True, 'nnd_n_clusters': 1} Note that nnd_n_clusters > 1 will result in batch-building with NN Descent.")#
- cuml_params
Returns the dictionary of parameters intended for the underlying cuML class.
- enable_sparse_data_optim = Param(parent='undefined', name='enable_sparse_data_optim', doc='This param activates sparse data optimization for VectorUDT features column. If the param is not included in an Estimator class, Spark rapids ml always converts VectorUDT features column into dense arrays when calling cuml backend. If included, Spark rapids ml will determine whether to create sparse arrays based on the param value: (1) If None, create dense arrays if the first VectorUDT of a dataframe is DenseVector. Create sparse arrays if it is SparseVector.(2) If False, create dense arrays. This is favorable if the majority of vectors are DenseVector.(3) If True, create sparse arrays. This is favorable if the majority of the VectorUDT vectors are SparseVector.')#
- featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')#
- featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')#
- init = Param(parent='undefined', name='init', doc="How to initialize the low dimensional embedding. Options are: 'spectral': use a spectral embedding of the fuzzy 1-skeleton, 'random': assign initial embedding positions at random.")#
- labelCol: Param[str] = Param(parent='undefined', name='labelCol', doc='label column name.')#
- learning_rate = Param(parent='undefined', name='learning_rate', doc='The initial learning rate for the embedding optimization.')#
- local_connectivity = Param(parent='undefined', name='local_connectivity', doc='The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.')#
- metric = Param(parent='undefined', name='metric', doc="Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard'] Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary. Note: The 'jaccard' distance metric is only supported for sparse inputs.")#
- metric_kwds = Param(parent='undefined', name='metric_kwds', doc='Additional keyword arguments for the metric function. If the metric function takes additional arguments, they should be passed in this dictionary.')#
- min_dist = Param(parent='undefined', name='min_dist', doc='The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the ``spread`` value, which determines the scale at which embedded points will be spread out.')#
- n_components = Param(parent='undefined', name='n_components', doc='The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.')#
- n_epochs = Param(parent='undefined', name='n_epochs', doc='The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).')#
- n_neighbors = Param(parent='undefined', name='n_neighbors', doc='The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.')#
- negative_sample_rate = Param(parent='undefined', name='negative_sample_rate', doc='The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.')#
- num_workers
Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.
- outputCol: Param[str] = Param(parent='undefined', name='outputCol', doc='output column name.')#
- params
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
- random_state = Param(parent='undefined', name='random_state', doc='The seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of training time and memory usage.')#
- repulsion_strength = Param(parent='undefined', name='repulsion_strength', doc='Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.')#
- sample_fraction = Param(parent='undefined', name='sample_fraction', doc="The fraction of the dataset to be used for fitting the model. Since fitting is done on a single node, very large datasets must be subsampled to fit within the node's memory and execute in a reasonable time. Smaller fractions will result in faster training, but may result in sub-optimal embeddings.")#
- set_op_mix_ratio = Param(parent='undefined', name='set_op_mix_ratio', doc='Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.')#
- spread = Param(parent='undefined', name='spread', doc='The effective scale of embedded points. In combination with ``min_dist`` this determines how clustered/clumped the embedded points are.')#
- transform_queue_size = Param(parent='undefined', name='transform_queue_size', doc='For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.')#
- verbose: Param[Union[int, bool]] = Param(parent='undefined', name='verbose', doc='cuml verbosity level (False, True or an integer between 0 and 6).')#
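The set_op_mix_ratio description above can be illustrated on a tiny directed weight matrix. The sketch assumes the formulation used by the reference umap-learn implementation: with the product t-norm, the fuzzy intersection of a weight w and its transpose entry w_t is w * w_t, the fuzzy union is w + w_t - w * w_t, and the parameter linearly blends the two. The helper combine_fuzzy_sets is hypothetical and operates on nested lists rather than the sparse GPU arrays a real implementation would use.

```python
def combine_fuzzy_sets(weights, set_op_mix_ratio=1.0):
    """Hypothetical helper: blend the fuzzy union and intersection of a
    directed weight matrix with its transpose (product t-norm)."""
    n = len(weights)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            w, w_t = weights[i][j], weights[j][i]
            intersection = w * w_t
            union = w + w_t - intersection
            result[i][j] = (
                set_op_mix_ratio * union
                + (1.0 - set_op_mix_ratio) * intersection
            )
    return result


# Point 0 considers point 1 a strong neighbor (0.8); the reverse is weaker (0.5).
weights = [[0.0, 0.8], [0.5, 0.0]]

pure_union = combine_fuzzy_sets(weights, set_op_mix_ratio=1.0)
pure_intersection = combine_fuzzy_sets(weights, set_op_mix_ratio=0.0)
print(pure_union[0][1], pure_intersection[0][1])
```

With a ratio of 1.0 an edge survives if either point regards the other as a neighbor, while a ratio of 0.0 keeps it strong only when both do, so lower ratios tend to produce sparser, more conservative graphs.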