KMeans

class spark_rapids_ml.clustering.KMeans(*, featuresCol: Union[str, List[str]] = 'features', predictionCol: str = 'prediction', k: int = 2, initMode: str = 'k-means||', tol: float = 0.0001, maxIter: int = 20, seed: Optional[int] = None, num_workers: Optional[int] = None, verbose: Union[int, bool] = False, **kwargs: Any)

The KMeans algorithm partitions data points into a fixed number of clusters, denoted k. It initializes a set of k centers (chosen according to initMode) and then iterates: in each iteration, it assigns every point to its nearest center and then recomputes the k centers from those assignments. Because KMeans is often applied to large datasets, this class provides GPU acceleration for distributed KMeans in PySpark.
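
The iteration described above can be sketched in plain Python. This is a minimal single-machine illustration, not the GPU implementation behind this class. The names lloyd_kmeans and nearest_center are hypothetical, and initialization here simply picks k input points at random instead of using k-means||.

```python
import math
import random


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def lloyd_kmeans(points, k, max_iter=20, tol=1e-4, seed=None):
    """Illustrative Lloyd iteration; a hypothetical helper, not the library's code."""
    rng = random.Random(seed)
    # Simple "random" initialization: use k distinct input points as centers.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest center.
        labels = [nearest_center(p, centers) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # empty cluster: keep the old center
        # Early stopping: stop once no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    # Final assignment against the centers that are returned.
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
centers, labels = lloyd_kmeans(points, k=2, max_iter=10, seed=0)
print(sorted(centers))  # [[0.5, 0.5], [8.5, 8.5]]
```

The centers are sorted before printing because the order of the clusters depends on which points were drawn as initial centers.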

Parameters:
k: int (default = 2)

the number of centers (clusters) that KMeans learns from the input vectors.

initMode: str (default = “k-means||”)

the algorithm to select initial centroids. It can be “k-means||” or “random”.

maxIter: int (default = 20)

the maximum number of iterations the algorithm runs to learn the k centers. More iterations can yield more accurate centers at the cost of longer run time.

seed: int (default = None)

the random seed used by the algorithm to initialize a set of k random centers to start with.

tol: float (default = 1e-4)

the convergence tolerance used for early stopping: the algorithm stops before maxIter is reached if the centers change very little after an iteration.

featuresCol: str or List[str]

The name or names of the feature columns. spark-rapids-ml supports vector, array, and columnar input.

  • When the value is a string, the feature columns must be assembled into a single column of vector or array type.

  • When the value is a list of strings, each named feature column must be of a numeric type.

predictionCol: str

the name of the column that stores cluster indices of input vectors. predictionCol should be set when users expect to apply the transform function of a learned model.

num_workers:

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU. If not set, spark-rapids-ml tries to infer the number of cuML workers (i.e. GPUs in cluster) from the Spark environment.

verbose:

Logging level.

  • 0 - Disables all log messages.

  • 1 - Enables only critical messages.

  • 2 - Enables all messages up to and including errors.

  • 3 - Enables all messages up to and including warnings.

  • 4 or False - Enables all messages up to and including information messages.

  • 5 or True - Enables all messages up to and including debug messages.

  • 6 - Enables all messages up to and including trace messages.
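
As a small illustration of the list above, the boolean and integer forms of verbose can be mapped to one numeric level. The helper name normalize_verbose is hypothetical and only mirrors the documented mapping (False is 4, True is 5); how the library handles this internally may differ.

```python
def normalize_verbose(verbose):
    """Map a bool or int `verbose` setting to a numeric level from 0 to 6."""
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(verbose, bool):
        return 5 if verbose else 4
    if not 0 <= verbose <= 6:
        raise ValueError(f"verbose must be a bool or an int in [0, 6], got {verbose}")
    return verbose


print(normalize_verbose(False))  # 4
print(normalize_verbose(True))   # 5
print(normalize_verbose(2))      # 2
```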

Examples

>>> from spark_rapids_ml.clustering import KMeans
>>> data = [([0.0, 0.0],),
...         ([1.0, 1.0],),
...         ([9.0, 8.0],),
...         ([8.0, 9.0],),]
>>> df = spark.createDataFrame(data, ["features"])
>>> df.show()
+----------+
|  features|
+----------+
|[0.0, 0.0]|
|[1.0, 1.0]|
|[9.0, 8.0]|
|[8.0, 9.0]|
+----------+
>>> gpu_kmeans = KMeans(k=2).setFeaturesCol("features")
>>> gpu_kmeans.setMaxIter(10)
KMeans_5606dff6b4fa
>>> gpu_model = gpu_kmeans.fit(df)
>>> gpu_model.setPredictionCol("prediction")
>>> gpu_model.clusterCenters()
[[0.5, 0.5], [8.5, 8.5]]
>>> transformed = gpu_model.transform(df)
>>> transformed.show()
+----------+----------+
|  features|prediction|
+----------+----------+
|[0.0, 0.0]|         0|
|[1.0, 1.0]|         0|
|[9.0, 8.0]|         1|
|[8.0, 9.0]|         1|
+----------+----------+
>>> gpu_kmeans.save("/tmp/kmeans")
>>> gpu_model.save("/tmp/kmeans_model")
>>> # vector column input
>>> from spark_rapids_ml.clustering import KMeans
>>> from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.dense([0.0, 0.0]),),
...         (Vectors.dense([1.0, 1.0]),),
...         (Vectors.dense([9.0, 8.0]),),
...         (Vectors.dense([8.0, 9.0]),),]
>>> df = spark.createDataFrame(data, ["features"])
>>> gpu_kmeans = KMeans(k=2).setFeaturesCol("features")
>>> gpu_kmeans.getFeaturesCol()
'features'
>>> gpu_model = gpu_kmeans.fit(df)
>>> # multi-column input
>>> data = [(0.0, 0.0),
...         (1.0, 1.0),
...         (9.0, 8.0),
...         (8.0, 9.0),]
>>> df = spark.createDataFrame(data, ["f1", "f2"])
>>> gpu_kmeans = KMeans(k=2).setFeaturesCols(["f1", "f2"])
>>> gpu_kmeans.getFeaturesCols()
['f1', 'f2']
>>> gpu_model = gpu_kmeans.fit(df)

Methods

clear(param)

Resets a Spark ML Param to its default value, also setting the matching cuML parameter if one exists.

copy([extra])

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits multiple models to the input dataset for all param maps in a single pass.

getFeaturesCol()

Gets the value of featuresCol or featuresCols.

getFeaturesCols()

Gets the value of featuresCols or its default value.

getInitMode()

Gets the value of initMode.

getK()

Gets the value of k.

getMaxIter()

Gets the value of maxIter or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getParam(paramName)

Gets a param by its name.

getPredictionCol()

Gets the value of predictionCol or its default value.

getSeed()

Gets the value of seed or its default value.

getTol()

Gets the value of tol or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

save(path)

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param, value)

Sets a parameter in the embedded param map.

setFeaturesCol(value)

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value)

Sets the value of featuresCols.

setK(value)

Sets the value of k.

setMaxIter(value)

Sets the value of maxIter.

setPredictionCol(value)

Sets the value of predictionCol.

setSeed(value)

Sets the value of seed.

setTol(value)

Sets the value of tol.

write()

Attributes

cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

featuresCol

featuresCols

initMode

k

maxIter

num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

params

Returns all params ordered by name.

predictionCol

seed

tol

Methods Documentation

clear(param: Param) -> None

Resets a Spark ML Param to its default value, also setting the matching cuML parameter if one exists.

copy(extra: Optional[ParamMap] = None) -> P

explainParam(param: Union[str, Param]) -> str

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() -> str

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) -> ParamMap

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:
extra: dict, optional

extra param values

Returns:
dict

merged param map
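
The precedence described above (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is only a sketch of the merge order: the function merge_param_maps and the string keys are assumptions for illustration, whereas the real method works with Param objects and returns a ParamMap.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three maps so that later sources override earlier ones."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


defaults = {"k": 2, "maxIter": 20, "tol": 1e-4}
user_supplied = {"k": 3}
merged = merge_param_maps(defaults, user_supplied, extra={"k": 5, "maxIter": 10})
print(merged)  # {'k': 5, 'maxIter': 10, 'tol': 0.0001}
```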

fit(dataset: DataFrame, params: Optional[Union[ParamMap, List[ParamMap], Tuple[ParamMap]]] = None) -> Union[M, List[M]]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
dataset: pyspark.sql.DataFrame

input dataset.

params: dict or list or tuple, optional

an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:
Transformer or a list of Transformer

fitted model(s)

fitMultiple(dataset: DataFrame, paramMaps: Sequence[ParamMap]) -> Iterator[Tuple[int, _CumlModel]]

Fits multiple models to the input dataset for all param maps in a single pass.

Parameters:
dataset: pyspark.sql.DataFrame

input dataset.

paramMaps: collections.abc.Sequence

A Sequence of param maps.

Returns:
_FitMultipleIterator

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
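
To make the returned iterator's contract concrete, here is a minimal sketch of a thread-safe iterator that yields (index, model) pairs. It is modeled only on the description above and is not the library's _FitMultipleIterator. The class name FitMultipleIteratorSketch, the fit_single_model callable, and the string "models" are all hypothetical stand-ins.

```python
import threading


class FitMultipleIteratorSketch:
    """Hypothetical thread-safe iterator yielding (index, model) pairs."""

    def __init__(self, fit_single_model, num_models):
        self._fit_single_model = fit_single_model  # callable: index -> model
        self._num_models = num_models
        self._next_index = 0
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only claiming an index is guarded by the lock, so several threads
        # can call next() concurrently without receiving the same index.
        with self._lock:
            index = self._next_index
            if index >= self._num_models:
                raise StopIteration
            self._next_index += 1
        return index, self._fit_single_model(index)


param_maps = [{"k": 2}, {"k": 3}, {"k": 4}]  # stand-ins for real ParamMaps
models = [None] * len(param_maps)
iterator = FitMultipleIteratorSketch(
    lambda i: f"model(k={param_maps[i]['k']})", len(param_maps)
)
for index, model in iterator:
    # Store by index, since callers should not rely on sequential order.
    models[index] = model
print(models)
```

Storing each result at models[index] is the pattern that stays correct when index values arrive out of order.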

getFeaturesCol() -> Union[str, List[str]]

Gets the value of featuresCol or featuresCols.

getFeaturesCols() -> List[str]

Gets the value of featuresCols or its default value.

getInitMode() -> str

Gets the value of initMode.

New in version 1.5.0.

getK() -> int

Gets the value of k.

New in version 1.5.0.

getMaxIter() -> int

Gets the value of maxIter or its default value.

getOrDefault(param: Union[str, Param[T]]) -> Union[Any, T]

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName: str) -> Param

Gets a param by its name.

getPredictionCol() -> str

Gets the value of predictionCol or its default value.

getSeed() -> int

Gets the value of seed or its default value.

getTol() -> float

Gets the value of tol or its default value.

hasDefault(param: Union[str, Param[Any]]) -> bool

Checks whether a param has a default value.

hasParam(paramName: str) -> bool

Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, Param[Any]]) -> bool

Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, Param[Any]]) -> bool

Checks whether a param is explicitly set by user.
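
The difference between hasDefault, isSet, isDefined, and getOrDefault can be shown with a toy class that keeps defaults and user-supplied values in two dictionaries. This sketch assumes that simplified storage model; the class ParamsSketch and its snake_case method names are hypothetical and are not part of spark-rapids-ml or PySpark.

```python
class ParamsSketch:
    """Toy model of param lookup with separate default and user-supplied maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user = {}

    def set(self, name, value):
        self._user[name] = value

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        # True only for values the user supplied explicitly.
        return name in self._user

    def is_defined(self, name):
        # True if the param is explicitly set or has a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        # The user-supplied value wins; otherwise fall back to the default.
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


p = ParamsSketch({"maxIter": 20})
p.set("k", 3)
print(p.is_set("maxIter"), p.is_defined("maxIter"))  # False True
print(p.is_set("k"), p.has_default("k"))             # True False
print(p.get_or_default("maxIter"), p.get_or_default("k"))  # 20 3
```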

classmethod load(path: str) -> RL

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() -> MLReader

save(path: str) -> None

Saves this ML instance to the given path, a shortcut of write().save(path).

set(param: Param, value: Any) -> None

Sets a parameter in the embedded param map.

setFeaturesCol(value: Union[str, List[str]]) -> P

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value: List[str]) -> P

Sets the value of featuresCols. Used when input vectors are stored as multiple feature columns.

setK(value: int) -> KMeans

Sets the value of k.

setMaxIter(value: int) -> KMeans

Sets the value of maxIter.

setPredictionCol(value: str) -> P

Sets the value of predictionCol.

setSeed(value: int) -> KMeans

Sets the value of seed.

setTol(value: float) -> KMeans

Sets the value of tol.

write() -> MLWriter

Attributes Documentation

cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')

featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')

initMode: Param[str] = Param(parent='undefined', name='initMode', doc='The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++')

k: Param[int] = Param(parent='undefined', name='k', doc='The number of clusters to create. Must be > 1.')

maxIter: Param[int] = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')

num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

predictionCol: Param[str] = Param(parent='undefined', name='predictionCol', doc='prediction column name.')

seed: Param[int] = Param(parent='undefined', name='seed', doc='random seed.')

tol: Param[float] = Param(parent='undefined', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).')