The Forest-based Classification and Regression tool creates models and generates predictions using an adaptation of Leo Breiman's random forest algorithm, which is a supervised machine learning method. Predictions can be performed for both categorical variables (classification) and continuous variables (regression). Explanatory variables take the form of fields in the attribute table of the training features. In addition to validating model performance based on the training data, the tool can make predictions for another feature layer.
Workflow diagram
Analysis using GeoAnalytics Tools
Analysis using GeoAnalytics Tools is run using distributed processing across multiple ArcGIS GeoAnalytics Server machines and cores. GeoAnalytics Tools and standard feature analysis tools in ArcGIS Enterprise have different parameters and capabilities. To learn more about these differences, see Feature analysis tool differences.
Examples
- Given data on the occurrence of seagrass, environmental explanatory variables, and distances to upstream factories and major ports, future seagrass occurrence can be predicted based on projections for those same explanatory variables.
- Housing values can be predicted based on the prices of houses sold in the current year. The sale prices of sold homes, along with information about the number of bedrooms, distance to schools, proximity to major highways, average income, and crime counts, can be used to predict the sale prices of similar homes.
- Given information on the blood lead levels of children and the tax parcel ID of their homes, combined with parcel-level attributes such as age of home, census-level data such as income and education levels, and national datasets reflecting toxic release of lead and lead compounds, the risk of lead exposure for parcels without blood lead level data can be predicted. These risk predictions can inform policies and education programs in the area.
Usage notes
This tool creates hundreds of trees, called an ensemble of decision trees, to create a model that can then be used for prediction. Each decision tree is created using randomly generated portions of the original (training) data. Each tree generates its own prediction and votes on an outcome. The forest model considers votes from all decision trees to predict or classify the outcome of an unknown sample. This is important, as individual trees may have issues with overfitting a model; however, combining multiple trees in a forest for prediction addresses the overfitting problem associated with a single tree.
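The ensemble idea described above can be illustrated outside the tool. The following is a minimal, illustrative sketch using scikit-learn's RandomForestRegressor with made-up data; it is not the GeoAnalytics implementation, but it shows the same idea of many bootstrapped trees combined into one prediction, and the same out-of-bag error concept reported in the tool's summary messages.
# Illustrative sketch only: scikit-learn, not the GeoAnalytics implementation.
# Many trees are each fit on a bootstrap sample of the data and their predictions
# are combined; out-of-bag (OOB) samples give an error estimate similar to the
# out-of-bag errors in the tool's summary messages.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 35, size=(500, 2))                          # made-up explanatory variables
y = 10 + 3 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 2, 500)   # made-up variable to predict

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-bag R^2:", forest.oob_score_)
print("Prediction for one new sample:", forest.predict([[25.0, 3.0]]))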
This tool can be used in two operation modes. The Train a model to assess model performance option can be used to evaluate the performance of different models as you explore different explanatory variables and tool settings. Once a good model has been found, you can use the Fit a model and predict values option. This is a data-driven tool and performs best on large datasets. The tool should be trained on at least several hundred features for best results. It is not an appropriate tool for very small datasets.
The Input Training Features can be tables or point, line, or area features. This tool does not work with multipart data.
This tool produces a variety of outputs. The outputs produced vary depending on the operation mode as follows:
- Train a model to assess model performance produces the following two outputs:
- Output trained features—Contains all of the Input Training Features and all of the explanatory variables used in the model. It also contains predictions for all of the features used to train the model, which can be helpful in assessing the performance of the model created.
- Tool summary messages—Messages to help you understand the performance of the model created. The messages include information on the model characteristics, out-of-bag errors, variable importance, and validation diagnostics. To access the summary of your results, click Show Results under the resulting layer in Map Viewer. The summary information is also added to the item details page.
- Fit a model and predict values produces the following three outputs:
- Output trained features—Contains all of the Input Training Features and all of the explanatory variables used in the model. It also contains predictions for all of the features used to train the model, which can be helpful in assessing the performance of the model created.
- Output predicted features—A layer of predicted results. Predictions are made for the layer specified in Choose the layer to predict values for using the model generated from the training layer.
- Tool summary messages—Messages to help you understand the performance of the model created. The messages include information on the model characteristics, out-of-bag errors, variable importance, and validation diagnostics. To access the summary of your results, click Show Results under the resulting layer in Map Viewer. The summary information is also added to the item details page.
You can use the Output Variable Importance Table parameter to create a table of variable importance that can be displayed as a chart for evaluation. The top 20 variable importance values are also reported in the messages window. The chart can be accessed directly below the layer in the Contents pane.
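If you read that table into a pandas DataFrame, a bar chart of importance values can also be drawn directly. The sketch below is hypothetical; the column names and values are assumptions for illustration, not the tool's documented schema.
# Hypothetical sketch: chart variable importance from a DataFrame. The column
# names ("variable", "importance") and the values are assumptions for illustration.
import pandas as pd
import matplotlib.pyplot as plt

importance = pd.DataFrame({
    "variable": ["Temperature", "DistanceToBeach", "Weekend", "Holiday"],
    "importance": [48.2, 31.7, 14.5, 5.6],
})

importance.sort_values("importance").plot.barh(x="variable", y="importance", legend=False)
plt.xlabel("Importance")
plt.tight_layout()
plt.show()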
Explanatory variables can come from fields and should contain a variety of values. If the explanatory variable is categorical, the Categorical check box should be checked (variables of type string will automatically be checked). Categorical explanatory variables are limited to 60 unique values, though a smaller number of categories will improve model performance. For a given data size, the more categories a variable contains, the more likely it is that it will dominate the model and lead to less effective prediction results.
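Before marking a field as categorical, it can help to count its unique values. The sketch below assumes training_layer is an arcgis FeatureLayer for the training features and that the field names are placeholders.
# Sketch, assuming training_layer is an arcgis.features.FeatureLayer and the field
# names are placeholders: check that candidate categorical fields stay under the
# 60-unique-value limit (fewer categories generally produce a better model).
df = training_layer.query().sdf            # attribute table as a pandas DataFrame
for field in ["Weekend", "Holiday", "StoreType"]:
    n_unique = df[field].nunique()
    status = "OK as categorical" if n_unique <= 60 else "too many categories"
    print(f"{field}: {n_unique} unique values -> {status}")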
When matching explanatory variables, the Training Field and Prediction Field must have fields that are the same type (a double field in Training Field must be matched to a double field in Prediction Field for example).
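A quick way to spot mismatches is to compare the field types of the two layers before running the tool; the sketch below assumes training_layer and prediction_layer are arcgis FeatureLayer objects.
# Sketch, assuming training_layer and prediction_layer are arcgis FeatureLayer
# objects: report fields whose types differ between the training and prediction layers.
train_types = {f.name: f.type for f in training_layer.properties.fields}
pred_types = {f.name: f.type for f in prediction_layer.properties.fields}

for name, ftype in train_types.items():
    if name in pred_types and pred_types[name] != ftype:
        print(f"Type mismatch for {name}: {ftype} (training) vs {pred_types[name]} (prediction)")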
Forest-based models do not extrapolate; they can only classify or predict to a value that the model was trained on. Train the model with training features and explanatory variables that are within the range of your target features and variables. The tool will fail if categories exist in the prediction explanatory variables that were not present in the training features.
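One way to catch this before running the tool is to compare ranges and categories between the training and prediction attribute tables. The sketch below assumes train_df and predict_df are pandas DataFrames (for example, from FeatureLayer.query().sdf) and the field names are placeholders.
# Sketch, assuming train_df and predict_df are pandas DataFrames of the training and
# prediction attributes and the field names are placeholders: flag values the model
# was never trained on, since forest-based models do not extrapolate.
for field in ["Temperature", "DistanceToBeach"]:                  # continuous variables
    if predict_df[field].min() < train_df[field].min() or predict_df[field].max() > train_df[field].max():
        print(f"{field}: prediction values fall outside the training range")

for field in ["Weekend", "Holiday"]:                              # categorical variables
    unseen = set(predict_df[field].unique()) - set(train_df[field].unique())
    if unseen:
        print(f"{field}: categories not present in the training data: {unseen}")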
The default value for the Number of Trees parameter is 100. Increasing the number of trees in the forest model will result in more accurate model prediction, but the model will take longer to calculate.
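The trade-off can be seen with any random forest implementation. The sketch below uses scikit-learn with made-up data (not the GeoAnalytics tool) to show out-of-bag accuracy improving and then leveling off as the number of trees grows, while run time keeps increasing.
# Illustrative sketch only: scikit-learn with made-up data, showing how out-of-bag
# accuracy typically improves and then plateaus as the number of trees increases.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 35, size=(1000, 3))
y = 5 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 2, 1000)

for n_trees in (25, 50, 100, 250, 500):
    forest = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=0)
    forest.fit(X, y)
    print(f"{n_trees:>4} trees: out-of-bag R^2 = {forest.oob_score_:.3f}")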
To learn more about how this tool works, and the ArcGIS Pro geoprocessing tool on which this implementation is based, see How Forest-based Classification and Regression works.
Limitations
The GeoAnalytics implementation of Forest-based Classification and Regression has the following limitations:
- Feature datasets (point, line, and area features) and tables are supported as input. Rasters are not supported.
- A single layer for training and a single layer for prediction are supported. To combine multiple datasets into one, use the Build Multi-Variable Grid and Enrich from Multi-Variable Grid tools to generate input data.
ArcGIS API for Python example
The Forest-based Classification and Regression tool is available through ArcGIS API for Python.
This example trains a model to predict ice cream sales and assess model performance.
# Import the required ArcGIS API for Python modules
import arcgis
from arcgis.gis import GIS

# Connect to your ArcGIS Enterprise portal and check that GeoAnalytics is supported
portal = GIS("https://myportal.domain.com/portal", "gis_publisher", "my_password", verify_cert=False)
if not portal.geoanalytics.is_supported():
    print("Quitting, GeoAnalytics is not supported")
    exit(1)

# Find the big data file share dataset you're interested in using for analysis
search_result = portal.content.search("", "Big Data File Share")

# Look through the search results for a big data file share with the matching name
bd_file = next(x for x in search_result if x.title == "bigDataFileShares_IceCreamSales")

# Run the Forest-based Classification and Regression tool to train a model
forest_model = arcgis.geoanalytics.analyze_patterns.forest_based_regression(prediction_type = "Train",
                                        input_layer = bd_file,
                                        variable_predict = {"fieldName":"Amount", "categorical":True},
                                        explanatory_variables = [{"fieldName":"Weekend", "categorical":True}, {"fieldName":"Temperature", "categorical":False}, {"fieldName":"Holiday", "categorical":True}, {"fieldName":"DistanceToBeach", "categorical":False}],
                                        sample_size = 50,
                                        output_name = "ice_cream_prediction")
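Once you are satisfied with the trained model, a second run can fit the model and write predictions for another layer. The following is a sketch only: it reuses the parameter names from the example above, and the "TrainAndPredict" value, the features_to_predict parameter, and bd_file.layers[0] are assumptions; confirm the exact names in the ArcGIS API for Python reference.
# Hypothetical follow-up sketch: fit a model and predict values for another layer.
# The "TrainAndPredict" value, the features_to_predict parameter, and bd_file.layers[0]
# are assumptions; check the API reference for the exact names.
prediction_result = arcgis.geoanalytics.analyze_patterns.forest_based_regression(prediction_type = "TrainAndPredict",
                                        input_layer = bd_file,
                                        features_to_predict = bd_file.layers[0],
                                        variable_predict = {"fieldName":"Amount", "categorical":True},
                                        explanatory_variables = [{"fieldName":"Weekend", "categorical":True}, {"fieldName":"Temperature", "categorical":False}],
                                        output_name = "ice_cream_full_prediction")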
Similar tools
Use the ArcGIS GeoAnalytics Server Forest-based Classification and Regression tool to create models and generate predictions using an adaptation of Leo Breiman's random forest algorithm. Other tools may be useful in solving similar but slightly different problems.
Map Viewer analysis tools
Create models and predictions using the ArcGIS GeoAnalytics Server Generalized Linear Regression tool.
ArcGIS Desktop analysis tools
Perform similar regression operations in ArcGIS Pro with the Forest-based Classification and Regression geoprocessing tool as part of the Spatial Statistics toolbox.
Perform Generalized Linear Regression (GLR) to generate predictions or to model a dependent variable in terms of its relationship to a set of explanatory variables in ArcGIS Pro with the Generalized Linear Regression geoprocessing tool in the Spatial Statistics toolbox.
Perform Geographically Weighted Regression (GWR) in ArcGIS Pro with the Geographically Weighted Regression geoprocessing tool in the Spatial Statistics toolbox.