RandomForestClassificationModel

class spark_rapids_ml.classification.RandomForestClassificationModel(n_cols: int, dtype: str, treelite_model: Union[str, List[str]], model_json: Union[List[str], List[List[str]]], num_classes: int)

Model fitted by RandomForestClassifier.

Methods

clear(param)

Reset a Spark ML Param to its default value and set the matching cuML parameter, if one exists.

copy([extra])

Create a copy of the current instance, including its parameters and cuml_params.

cpu()

Return the PySpark ML RandomForestClassificationModel.

evaluate(dataset)

Evaluates the model on a test dataset.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

getBootstrap()

Gets the value of bootstrap or its default value.

getFeatureSubsetStrategy()

Gets the value of featureSubsetStrategy or its default value.

getFeaturesCol()

Gets the value of featuresCol or featuresCols.

getFeaturesCols()

Gets the value of featuresCols or its default value.

getImpurity()

Gets the value of impurity or its default value.

getLabelCol()

Gets the value of labelCol or its default value.

getMaxBins()

Gets the value of maxBins or its default value.

getMaxDepth()

Gets the value of maxDepth or its default value.

getMinInstancesPerNode()

Gets the value of minInstancesPerNode or its default value.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getParam(paramName)

Gets a param by its name.

getPredictionCol()

Gets the value of predictionCol or its default value.

getProbabilityCol()

Gets the value of probabilityCol or its default value.

getRawPredictionCol()

Gets the value of rawPredictionCol or its default value.

getSeed()

Gets the value of seed or its default value.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

predict(value)

Predict label for the given features.

predictLeaf(value)

Predict the indices of the leaves corresponding to the feature vector.

predictProbability(value)

Predict the probability of each class given the features.

predictRaw(value)

Raw prediction for each possible label.

read()

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setFeaturesCol(value)

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value)

Sets the value of featuresCols.

setLabelCol(value)

Sets the value of labelCol.

setPredictionCol(value)

Sets the value of predictionCol.

setProbabilityCol(value)

Sets the value of probabilityCol.

setRawPredictionCol(value)

Sets the value of rawPredictionCol.

transform(dataset[, params])

Transforms the input dataset with optional parameters.

write()

Attributes

bootstrap

cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

featureImportances

Estimate the importance of each feature.

featureSubsetStrategy

featuresCol

featuresCols

getNumTrees

Number of trees in ensemble.

hasSummary

Indicates whether a training summary exists for this model instance.

impurity

labelCol

maxBins

maxDepth

max_batch_size

max_leaves

max_samples

minInstancesPerNode

min_impurity_decrease

min_samples_split

n_streams

numClasses

Number of classes (values which the label can take).

numFeatures

Returns the number of features the model was trained on.

numTrees

num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

params

Returns all params ordered by name.

predictionCol

probabilityCol

rawPredictionCol

seed

supportedFeatureSubsetStrategies

supportedImpurities

toDebugString

Full description of model.

totalNumNodes

Total number of nodes, summed over all trees in the ensemble.

treeWeights

Return the weights for each tree.

trees

Trees in this ensemble.

verbose

Methods Documentation

clear(param: Param) → None

Reset a Spark ML Param to its default value and set the matching cuML parameter, if one exists.

copy(extra: Optional[ParamMap] = None) → P

Create a copy of the current instance, including its parameters and cuml_params.

This function extends the default copy() method to ensure the cuml_params variable is also copied. The default super().copy() method only handles _paramMap and _defaultParamMap.

Parameters:
extra : Optional[ParamMap]

A dictionary or ParamMap containing additional parameters to set in the copied instance. Note ParamMap = Dict[pyspark.ml.param.Param, Any].

Returns:
P

A new instance of the same type as the current object, with parameters and cuml_params copied.

Raises:
TypeError

If any key in the extra dictionary is not an instance of pyspark.ml.param.Param.

cpu() → RandomForestClassificationModel

Return the PySpark ML RandomForestClassificationModel.

evaluate(dataset: DataFrame) → Union[BinaryRandomForestClassificationSummary, RandomForestClassificationSummary]

Evaluates the model on a test dataset.

Parameters:
dataset : pyspark.sql.DataFrame

Test dataset to evaluate model on.

explainParam(param: Union[str, Param]) → str

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) → ParamMap

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:
extra : dict, optional

extra param values

Returns:
dict

merged param map
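The precedence described for extractParamMap can be sketched with plain dictionaries. This is only an illustration: real ParamMap keys are Param objects rather than strings, and the values below are hypothetical.

```python
# Plain dicts stand in for Spark's ParamMap; string keys are used for brevity.
# All values here are made up for illustration.
defaults = {"maxDepth": 5, "maxBins": 32, "seed": 0}
user_supplied = {"maxDepth": 8}
extra = {"maxDepth": 10, "seed": 42}

# Later dicts win on conflicts: defaults < user-supplied < extra.
merged = {**defaults, **user_supplied, **extra}
print(merged)
```

Here maxDepth comes from extra, maxBins from the defaults, and seed from extra, which matches the stated ordering.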

getBootstrap() → bool

Gets the value of bootstrap or its default value.

New in version 3.0.0.

getFeatureSubsetStrategy() → str

Gets the value of featureSubsetStrategy or its default value.

New in version 1.4.0.

getFeaturesCol() → Union[str, List[str]]

Gets the value of featuresCol or featuresCols.

getFeaturesCols() → List[str]

Gets the value of featuresCols or its default value.

getImpurity() → str

Gets the value of impurity or its default value.

New in version 1.6.0.

getLabelCol() → str

Gets the value of labelCol or its default value.

getMaxBins() → int

Gets the value of maxBins or its default value.

getMaxDepth() → int

Gets the value of maxDepth or its default value.

getMinInstancesPerNode() → int

Gets the value of minInstancesPerNode or its default value.

getOrDefault(param: Union[str, Param[T]]) → Union[Any, T]

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName: str) → Param

Gets a param by its name.

getPredictionCol() → str

Gets the value of predictionCol or its default value.

getProbabilityCol() → str

Gets the value of probabilityCol or its default value.

getRawPredictionCol() → str

Gets the value of rawPredictionCol or its default value.

getSeed() → int

Gets the value of seed or its default value.

hasDefault(param: Union[str, Param[Any]]) → bool

Checks whether a param has a default value.

hasParam(paramName: str) → bool

Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, Param[Any]]) → bool

Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, Param[Any]]) → bool

Checks whether a param is explicitly set by user.
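The difference between hasDefault, isSet, isDefined and getOrDefault is easiest to see on a toy object that keeps two maps. ToyParams below is a hypothetical stand-in written for this page, not the PySpark class, and raising KeyError is this sketch's choice of error type.

```python
class ToyParams:
    """Hypothetical stand-in that keeps a default map and a user-supplied map."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default values
        self._user = {}                  # values set explicitly by the user

    def set(self, name, value):
        self._user[name] = value

    def hasDefault(self, name):
        return name in self._defaults

    def isSet(self, name):
        return name in self._user

    def isDefined(self, name):
        # Defined means: set by the user OR has a default.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        # The user-supplied value wins; otherwise fall back to the default.
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


p = ToyParams({"maxDepth": 5})
p.set("labelCol", "y")
print(p.isSet("maxDepth"), p.isDefined("maxDepth"), p.getOrDefault("maxDepth"))
```

A param with only a default is defined but not set; a param set by the user is both.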

classmethod load(path: str) → RL

Reads an ML instance from the input path, a shortcut of read().load(path).

predict(value: Vector) → float

Predict label for the given features.

predictLeaf(value: Vector) → float

Predict the indices of the leaves corresponding to the feature vector.

predictProbability(value: Vector) → Vector

Predict the probability of each class given the features.

predictRaw(value: Vector) → Vector

Raw prediction for each possible label.
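How the raw prediction, the probability vector and the predicted label relate can be sketched as follows. This assumes the convention used by Spark MLlib random forests, where each tree contributes its leaf's class-probability vector, the raw prediction is the per-class sum, and the probability is the raw vector normalized to sum to 1. This page does not state that the GPU model follows exactly this rule, and the helper names and vote values are hypothetical.

```python
def predict_raw(per_tree_class_probs):
    """Sum each tree's class-probability vector into one raw score per class."""
    num_classes = len(per_tree_class_probs[0])
    return [sum(tree[c] for tree in per_tree_class_probs) for c in range(num_classes)]


def raw_to_probability(raw):
    """Normalize raw scores so they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def raw_to_label(raw):
    """Pick the class with the largest raw score, returned as a float label."""
    return float(max(range(len(raw)), key=lambda c: raw[c]))


# Four hypothetical trees voting on a two-class problem.
votes = [[1.0, 0.0], [0.5, 0.5], [0.75, 0.25], [0.25, 0.75]]
raw = predict_raw(votes)
probability = raw_to_probability(raw)
label = raw_to_label(raw)
print(raw, probability, label)
```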

classmethod read() → MLReader
save(path: str) → None

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param: Param, value: Any) → None

Sets a parameter in the embedded param map.

setFeaturesCol(value: Union[str, List[str]]) → P

Sets the value of featuresCol or featuresCols.

setFeaturesCols(value: List[str]) → P

Sets the value of featuresCols.

setLabelCol(value: str) → P

Sets the value of labelCol.

setPredictionCol(value: str) → P

Sets the value of predictionCol.

setProbabilityCol(value: str) → _RFClassifierParams

Sets the value of probabilityCol.

setRawPredictionCol(value: str) → _RFClassifierParams

Sets the value of rawPredictionCol.

transform(dataset: DataFrame, params: Optional[ParamMap] = None) → DataFrame

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
dataset : pyspark.sql.DataFrame

input dataset

params : dict, optional

an optional param map that overrides embedded params.

Returns:
pyspark.sql.DataFrame

transformed dataset
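A typical sequence of calls on a fitted model (transform, save, load, cpu) might look like the sketch below. The function name and its arguments are hypothetical: model is assumed to be a fitted RandomForestClassificationModel, test_df a Spark DataFrame containing the model's feature column(s), and path a location that does not exist yet.

```python
def score_and_roundtrip(model, test_df, path):
    """Score a DataFrame, persist the model, reload it, and get the CPU model.

    All three arguments are supplied by the caller; nothing here creates a
    Spark session or trains a model.
    """
    scored = model.transform(test_df)   # returns the transformed DataFrame
    model.save(path)                    # shortcut for write().save(path)
    reloaded = type(model).load(path)   # load is a classmethod on the model class
    cpu_model = reloaded.cpu()          # PySpark ML RandomForestClassificationModel
    return scored, cpu_model
```

Calling load through type(model) keeps the sketch free of imports; in application code the class would normally be imported from spark_rapids_ml.classification and called directly.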

write() → MLWriter

Attributes Documentation

bootstrap: Param[bool] = Param(parent='undefined', name='bootstrap', doc='Whether bootstrap samples are used when building trees.')
cuml_params

Returns the dictionary of parameters intended for the underlying cuML class.

featureImportances

Estimate the importance of each feature.

featureSubsetStrategy: Param[str] = Param(parent='undefined', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'")
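The featureSubsetStrategy options can be read as a mapping from a strategy string to a number of candidate features per split. The helper below is a hypothetical sketch of that mapping as the doc string describes it. Rounding fractional results up is an assumption borrowed from Spark MLlib and may differ from what the cuML backend does.

```python
import math


def features_per_split(strategy, num_features, num_trees, is_classification=True):
    """Hypothetical helper: number of features considered at each split."""
    if strategy == "auto":
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    # Otherwise the strategy is a number 'n' passed as a string.
    n = float(strategy)
    if 0.0 < n <= 1.0:
        return math.ceil(n * num_features)  # fraction of the features
    if 1.0 < n < num_features:
        return int(n)                       # absolute number of features
    raise ValueError(f"unsupported featureSubsetStrategy: {strategy!r}")


print(features_per_split("auto", 16, num_trees=10))  # classification forest
print(features_per_split("0.5", 16, num_trees=10))   # half of the features
```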
featuresCol: Param[str] = Param(parent='undefined', name='featuresCol', doc='features column name.')
featuresCols = Param(parent='undefined', name='featuresCols', doc='features column names for multi-column input.')
getNumTrees

Number of trees in ensemble.

hasSummary

Indicates whether a training summary exists for this model instance.

impurity: Param[str] = Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini')
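For reference, the two supported impurity criteria can be computed from the class counts at a node as in this sketch. These are the textbook definitions; the base-2 logarithm for entropy is an assumption (a different base only rescales the values), and the function names are hypothetical.

```python
import math


def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits: -sum of p * log2(p) over non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


print(gini([5, 5]), entropy([5, 5]))    # evenly mixed node
print(gini([10, 0]), entropy([10, 0]))  # pure node
```

Both criteria are zero for a pure node and largest for an evenly mixed one.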
labelCol: Param[str] = Param(parent='undefined', name='labelCol', doc='label column name.')
maxBins: Param[int] = Param(parent='undefined', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.')
maxDepth: Param[int] = Param(parent='undefined', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].')
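The depth examples in the maxDepth doc string generalize to any depth for a full binary tree, which gives an upper bound on tree size. The helper name below is hypothetical.

```python
def max_tree_size(depth):
    """Return (internal_nodes, leaf_nodes) for a full binary tree of this depth."""
    leaves = 2 ** depth
    internal = leaves - 1
    return internal, leaves


for d in (0, 1, 2):
    print(d, max_tree_size(d))
```

Depth 0 gives one leaf and depth 1 gives one internal node plus two leaves, as the doc string states. A trained tree can be smaller, because a branch stops growing when no valid split is found.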
max_batch_size = Param(parent='undefined', name='max_batch_size', doc='The max_batch_size parameter to use for cuml.')
max_leaves = Param(parent='undefined', name='max_leaves', doc='The max_leaves parameter to use for cuml.')
max_samples = Param(parent='undefined', name='max_samples', doc='The max_samples parameter to use for cuml.')
minInstancesPerNode: Param[int] = Param(parent='undefined', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.')
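The validity rule in the minInstancesPerNode doc string amounts to a simple check on the two child sizes. The function below is a hypothetical illustration of that rule, not code from the library.

```python
def split_is_valid(left_count, right_count, min_instances_per_node=1):
    """A split is kept only if both children have enough instances."""
    return (left_count >= min_instances_per_node
            and right_count >= min_instances_per_node)


print(split_is_valid(40, 3, min_instances_per_node=5))   # right child too small
print(split_is_valid(40, 12, min_instances_per_node=5))  # both children large enough
```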
min_impurity_decrease = Param(parent='undefined', name='min_impurity_decrease', doc='The min_impurity_decrease parameter to use for cuml.')
min_samples_split = Param(parent='undefined', name='min_samples_split', doc='The min_sample_split parameter to use for cuml.')
n_streams = Param(parent='undefined', name='n_streams', doc='The n_streams parameter to use for cuml.')
numClasses

Number of classes (values which the label can take).

numFeatures

Returns the number of features the model was trained on. If unknown, returns -1.

numTrees: Param[int] = Param(parent='undefined', name='numTrees', doc='Number of trees to train (>= 1).')
num_workers

Number of cuML workers, where each cuML worker corresponds to one Spark task running on one GPU.

params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

predictionCol: Param[str] = Param(parent='undefined', name='predictionCol', doc='prediction column name.')
probabilityCol: Param[str] = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')
rawPredictionCol: Param[str] = Param(parent='undefined', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.')
seed: Param[int] = Param(parent='undefined', name='seed', doc='random seed.')
supportedFeatureSubsetStrategies: List[str] = ['auto', 'all', 'onethird', 'sqrt', 'log2']
supportedImpurities: List[str] = ['entropy', 'gini']
toDebugString

Full description of model.

totalNumNodes

Total number of nodes, summed over all trees in the ensemble.

treeWeights

Return the weights for each tree.

trees

Trees in this ensemble. Warning: These have null parent Estimators.

verbose: Param[Union[int, bool]] = Param(parent='undefined', name='verbose', doc='cuml verbosity level (False, True or an integer between 0 and 6).')