Spark Rapids ML

Feature

`PCA`(*[, k, inputCol, outputCol, ...])	PCA algorithm learns principal component vectors to project high-dimensional vectors into low-dimensional vectors, while preserving the similarity of the vectors.
`PCAModel`(mean_, components_, ...)	Applies dimensionality reduction on an input DataFrame.

Classification

`LogisticRegression`(*[, featuresCol, ...])	LogisticRegression is a machine learning model where the response y is modeled by the sigmoid (or softmax for more than 2 classes) function applied to a linear combination of the features in X.
`LogisticRegressionModel`(coef_, intercept_, ...)	Model fitted by `LogisticRegression`.
`RandomForestClassifier`(*[, featuresCol, ...])	RandomForestClassifier implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.
`RandomForestClassificationModel`(n_cols, ...)	Model fitted by `RandomForestClassifier`.

Clustering

`DBSCAN`(*[, featuresCol, predictionCol, eps, ...])	The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a non-parametric data clustering algorithm based on data density.
`DBSCANModel`(n_cols, dtype)
`KMeans`(*[, featuresCol, predictionCol, k, ...])	KMeans algorithm partitions data points into a fixed number (denoted as k) of clusters.
`KMeansModel`(cluster_centers_, n_cols, dtype)	KMeans GPU model for clustering input vectors to learned k centers.

Regression

`LinearRegression`(*[, featuresCol, labelCol, ...])	LinearRegression is a machine learning model where the response y is modeled by a linear combination of the predictors in X.
`LinearRegressionModel`(coef_, intercept_, ...)	Model fitted by `LinearRegression`.
`RandomForestRegressor`(*[, featuresCol, ...])	RandomForestRegressor implements a Random Forest regressor model which fits multiple decision tree regressors in an ensemble.
`RandomForestRegressionModel`(n_cols, dtype, ...)	Model fitted by `RandomForestRegressor`.

Nearest Neighbors

`ApproximateNearestNeighbors`(*[, k, ...])	ApproximateNearestNeighbors retrieves k approximate nearest neighbors (ANNs) in item vectors for each query.
`ApproximateNearestNeighborsModel`(item_df_withid)
`NearestNeighbors`(*[, k, inputCol, idCol, ...])	NearestNeighbors retrieves the exact k nearest neighbors in item vectors for each query vector.
`NearestNeighborsModel`(item_df_withid, ...)

Tuning

`CrossValidator`(*[, estimator, ...])	K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
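
The k=3 case described above can be sketched in plain Python. This is only an illustration of the fold arithmetic, not how `CrossValidator` is implemented: the helper name `k_fold_pairs` is hypothetical, and it assumes equal-sized folds built by shuffling and striding, whereas a Spark-based splitter may assign rows to folds differently.

```python
import random


def k_fold_pairs(rows, k, seed=0):
    """Return k (training, test) pairs built from non-overlapping random folds.

    Hypothetical helper for illustration only; not part of any library.
    """
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and len(rows)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th row starting at offset i, so folds never overlap.
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        # Training data is every fold except the one held out for testing.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs


pairs = k_fold_pairs(list(range(12)), k=3)
for train, test in pairs:
    print(len(train), len(test))  # 8 4 on each of the 3 lines
```

With 12 rows and k=3, each pair trains on 8 rows (2/3) and tests on 4 rows (1/3), and every row appears in exactly one test fold.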

UMAP

`UMAP`(*[, n_neighbors, n_components, metric, ...])	Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique used for low-dimensional data visualization and general non-linear dimension reduction.
`UMAPModel`(embedding_, raw_data_, sparse_fit, ...)