Spark Rapids ML

Feature

PCA(*[, k, inputCol, outputCol, ...])

The PCA algorithm learns principal component vectors that project high-dimensional vectors into a lower-dimensional space while preserving the similarity of the vectors.

PCAModel(mean_, components_, ...)

Applies dimensionality reduction on an input DataFrame.
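
As a rough illustration of what the `mean_` and `components_` values in the `PCAModel` signature represent, the following plain-Python sketch centers a tiny dataset, estimates its first principal component, and projects each row onto it. It uses power iteration purely for brevity, which is an assumption of this sketch and not necessarily how the library computes components. The helper names `first_principal_component` and `project` are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the mean and the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary non-zero starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, component):
    """Project one row onto a single component after centering."""
    return sum((x - m) * c for x, m, c in zip(row, mean, component))

# Toy data lying exactly on the line y = 2x.
rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, component = first_principal_component(rows)
reduced = [project(r, mean, component) for r in rows]  # 2-D rows become 1-D values
```

Because the toy data lies on a single line, one component is enough to describe it, and the projected values keep the original spacing between points.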

Classification

LogisticRegression(*[, featuresCol, ...])

LogisticRegression is a machine learning model where the response y is modeled by the sigmoid (or softmax for more than 2 classes) function applied to a linear combination of the features in X.

LogisticRegressionModel(coef_, intercept_, ...)

Model fitted by LogisticRegression.
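
The LogisticRegression description above can be sketched in a few lines of plain Python: a linear combination of the features is passed through the sigmoid for two classes, or through softmax for more than two. The weights and biases below are made-up values for illustration, and the helper names are hypothetical; this is not the library's API.

```python
import math

def linear(weights, bias, x):
    """Linear combination of the features plus an intercept."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map one score per class to probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]

# Binary case: one weight vector, sigmoid on the linear score (here 2*1 - 1*2 = 0).
binary_prob = sigmoid(linear([2.0, -1.0], 0.0, x))

# Multiclass case: one weight vector per class, softmax over the scores.
class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
class_biases = [0.0, 0.0, 0.0]
class_probs = softmax([linear(w, b, x) for w, b in zip(class_weights, class_biases)])
predicted_class = class_probs.index(max(class_probs))
```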

RandomForestClassifier(*[, featuresCol, ...])

RandomForestClassifier implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

RandomForestClassificationModel(n_cols, ...)

Model fitted by RandomForestClassifier.
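
To make the ensemble idea concrete, here is a minimal sketch of how several decision tree classifiers can be combined by majority vote. It covers only the prediction step; training each tree is omitted. The three hand-written "trees" and the `forest_predict` helper are hypothetical, and the library may combine tree outputs differently (for example by averaging class probabilities).

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the class predicted by the most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for fitted decision trees: each maps features to a class.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.2 else 0,
]

prediction_a = forest_predict(trees, (0.9, 0.2))  # only the first tree votes for class 1
prediction_b = forest_predict(trees, (0.9, 0.8))  # all three trees vote for class 1
```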

Clustering

DBSCAN(*[, featuresCol, predictionCol, eps, ...])

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a non-parametric clustering algorithm based on data density.

DBSCANModel(n_cols, dtype, verbose)
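
The density-based idea behind DBSCAN can be illustrated with a small brute-force sketch in plain Python. A point with at least `min_samples` neighbors within distance `eps` starts or extends a cluster, and points that no dense region reaches are labeled as noise. The `eps` name follows the signature above; `min_samples`, the `dbscan` function, and the `NOISE` constant are assumptions of this sketch, not the library's implementation.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if no dense region reaches it."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query; the point itself counts as a neighbor.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep expanding the cluster
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

Note that the number of clusters is not specified up front; it falls out of the density parameters.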

KMeans(*[, featuresCol, predictionCol, k, ...])

The KMeans algorithm partitions data points into a fixed number of clusters (denoted k).

KMeansModel(cluster_centers_, n_cols, dtype)

KMeans GPU model that assigns input vectors to the k learned cluster centers.
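
The two KMeans entries above correspond to two simple operations, sketched here in plain Python: assigning each vector to its nearest center, and recomputing each center as the mean of its assigned vectors. The function names and toy data are hypothetical, and this single refinement step is only an illustration, not the library's GPU implementation.

```python
import math

def assign_to_centers(vectors, centers):
    """Return, for each vector, the index of the closest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
            for v in vectors]

def update_centers(vectors, labels, k):
    """Recompute each center as the mean of the vectors assigned to it."""
    centers = []
    for c in range(k):
        members = [v for v, label in zip(vectors, labels) if label == c]
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers

vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
initial_centers = [(0.0, 0.0), (10.0, 10.0)]  # k = 2 starting guesses

labels = assign_to_centers(vectors, initial_centers)
refined_centers = update_centers(vectors, labels, k=2)
```

This sketch assumes every center receives at least one vector; a fuller implementation would need to handle empty clusters.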

Regression

LinearRegression(*[, featuresCol, labelCol, ...])

LinearRegression is a machine learning model where the response y is modeled by a linear combination of the predictors in X.

LinearRegressionModel(coef_, intercept_, ...)

Model fitted by LinearRegression.
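
As a small worked example of "a linear combination of the predictors", the sketch below fits a one-feature least-squares line in plain Python and uses the resulting coefficient and intercept to predict. The names `fit_line` and `predict` are hypothetical, and the closed-form single-feature fit is an assumption chosen for brevity, not a description of the library's solver.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (coef, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    coef = sxy / sxx
    intercept = mean_y - coef * mean_x
    return coef, intercept

def predict(coef, intercept, x):
    """Model the response as coef * x + intercept."""
    return coef * x + intercept

# Toy data generated from y = 3x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]

coef, intercept = fit_line(xs, ys)
prediction = predict(coef, intercept, 4.0)
```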

RandomForestRegressor(*[, featuresCol, ...])

RandomForestRegressor implements a Random Forest regressor model which fits multiple decision tree regressors in an ensemble.

RandomForestRegressionModel(n_cols, dtype, ...)

Model fitted by RandomForestRegressor.

Nearest Neighbors

ApproximateNearestNeighbors(*[, k, ...])

ApproximateNearestNeighbors retrieves the k approximate nearest neighbors (ANNs) in item vectors for each query vector.

ApproximateNearestNeighborsModel(item_df_withid)

NearestNeighbors(*[, k, inputCol, idCol, ...])

NearestNeighbors retrieves the exact k nearest neighbors in item vectors for each query vector.

NearestNeighborsModel(item_df_withid, ...)
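
For exact search, the result for each query is simply the k items with the smallest distance to it. A brute-force sketch in plain Python follows, assuming Euclidean distance. The `k_nearest` helper and the dictionary of item ids to vectors are hypothetical and stand in for the item DataFrame; the library's actual interface is not shown here.

```python
import math

def k_nearest(items, query, k):
    """Return the k (item_id, distance) pairs closest to the query vector."""
    ranked = sorted(items.items(), key=lambda kv: math.dist(kv[1], query))
    return [(item_id, math.dist(vec, query)) for item_id, vec in ranked[:k]]

# Item vectors keyed by a hypothetical id.
items = {
    0: (0.0, 0.0),
    1: (1.0, 1.0),
    2: (5.0, 5.0),
    3: (0.5, 0.0),
}

neighbors = k_nearest(items, query=(0.0, 0.0), k=2)
neighbor_ids = [item_id for item_id, _ in neighbors]
```

An approximate method trades some of this exactness for speed by not comparing the query against every item.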

Tuning

CrossValidator(*[, estimator, ...])

K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds that are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
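
The fold arithmetic described above can be checked with a short plain-Python sketch. It shuffles the rows, deals them into k non-overlapping folds, and builds one (training, test) pair per fold. The `k_fold_pairs` helper and its `seed` parameter are hypothetical names used for illustration; this is not the CrossValidator implementation.

```python
import random

def k_fold_pairs(rows, k, seed=0):
    """Split rows into k random, non-overlapping folds and return (train, test) pairs."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows round-robin so the folds are disjoint.
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs

# 9 rows with k=3: each pair trains on 6 rows (2/3) and tests on 3 rows (1/3).
pairs = k_fold_pairs(range(9), k=3)
sizes = [(len(train), len(test)) for train, test in pairs]
```

Each row appears in exactly one test set, so every row is used for evaluation once and for training k-1 times.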

UMAP

UMAP(*[, n_neighbors, n_components, metric, ...])

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique used for low-dimensional data visualization and general non-linear dimension reduction.

UMAPModel(embedding_, raw_data_, n_cols, dtype)