Chapter 7 Local Random Forest
Standard machine learning algorithms like Random Forest (RF) lack spatial calibration, which prevents them from capturing the spatial heterogeneity in the relationship between a dependent variable and a set of independent variables. To account for spatial heterogeneity (i.e., non-stationarity) when modeling landslide spatial patterns as a function of geographical features, in this chapter we explore the local importance of the independent predisposing variables for the spatial distribution of landslides in canton Vaud (Switzerland).
We adopted Geographical Random Forest (GRF), a spatial analysis method based on a local version of the RF algorithm (Georganos et al. 2022). It works by fitting a sub-model for each observation in space, taking into account the neighboring observations. GRF thus couples non-stationarity with a non-linear model (RF), which, compared to linear models, tends not to overfit thanks to its bootstrapping nature. In addition, RF is well suited for datasets with numerous predictor variables.
Essentially, GRF was designed to be a bridge between machine learning and geographical models, combining inferential and explanatory power.
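The fitting strategy can be illustrated with a minimal conceptual sketch (this is not the actual package implementation; the data frame df, the coordinate table coords and the response LSregr are assumed names). For each observation, the bw nearest neighbours are selected and an RF sub-model is fitted on them:
library(ranger) # fast Random Forest implementation
# Conceptual GRF with an adaptive kernel: one sub-model per observation,
# each trained on the bw nearest neighbours of that observation
local_rf_sketch <- function(df, coords, bw = 40) {
  lapply(seq_len(nrow(df)), function(i) {
    d  <- sqrt((coords$x - coords$x[i])^2 + (coords$y - coords$y[i])^2) # Euclidean distances
    nb <- order(d)[seq_len(bw)]                                         # bw closest observations
    ranger(LSregr ~ ., data = df[nb, ], num.trees = 500)                # local RF sub-model
  })
}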
7.1 Computing lab: GRF
We will use the latest development of Geographical Random Forest (GRF) [4]. The grf() function has been implemented for regression problems, so we need to transform our binary response variable (i.e., presence == 1 / absence == 0) into a numeric value ranging from zero to one.
## 'data.frame': 5188 obs. of 11 variables:
## $ distRoad : num 160.1 55.9 106.1 70.7 125 ...
## $ DEM : num 443 451 460 458 463 ...
## $ landCover: Factor w/ 7 levels "11","15","21",..: 3 1 2 2 3 1 3 3 3 1 ...
## $ TWI : num 7.4 8.72 9.08 8.55 8.33 ...
## $ planCurv : num -0.000143 0.003633 0.005981 0.006165 0.000217 ...
## $ profCurv : num 0.013297 -0.006447 0.000381 -0.001675 -0.005383 ...
## $ slope : num 20.86 2.91 2.04 3.46 4.32 ...
## $ geology : Factor w/ 11 levels "0","1","2","7",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ x : num 567887 568087 567912 567962 567812 ...
## $ y : num 2e+05 2e+05 2e+05 2e+05 2e+05 ...
## $ LS : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
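The chunk that builds the training and test sets is not shown; a minimal sketch of what it could look like is given below (the object name LS_data and the 80/20 split are assumptions inferred from the sample sizes in the outputs):
set.seed(123) # reproducible split (assumption)
idx <- sample(nrow(LS_data), size = round(0.8 * nrow(LS_data))) # ~80/20 train/test split
LS_train <- LS_data[idx, ]
LS_test  <- LS_data[-idx, ]
# grf() expects a numeric response: recode the presence/absence factor LS as 0/1
LS_train$LSregr <- as.numeric(as.character(LS_train$LS))
LS_test$LSregr  <- as.numeric(as.character(LS_test$LS))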
## 'data.frame': 4150 obs. of 12 variables:
## $ distRoad : num 182 25 90.1 212.1 134.6 ...
## $ DEM : num 917 719 662 498 810 ...
## $ landCover: Factor w/ 7 levels "11","15","21",..: 5 5 1 3 3 3 3 2 5 3 ...
## $ TWI : num 8.15 8.47 8.73 15.52 8.98 ...
## $ planCurv : num 0.000789 0.003464 0.000962 -0.00004 0.000359 ...
## $ profCurv : num 7.89e-04 -1.17e-02 -7.98e-04 -3.99e-05 -4.42e-04 ...
## $ slope : num 28.31 33.15 2.88 1.94 4.49 ...
## $ geology : Factor w/ 11 levels "0","1","2","7",..: 7 4 8 2 2 8 8 2 2 8 ...
## $ x : num 569837 561612 540062 504862 548312 ...
## $ y : num 133062 139887 154237 140712 169212 ...
## $ LS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
## $ LSregr : num 1 1 1 0 0 1 1 1 0 0 ...
## 'data.frame': 1038 obs. of 12 variables:
## $ distRoad : num 90.1 956.9 55.9 70.7 25 ...
## $ DEM : num 490 1299 850 480 1338 ...
## $ landCover: Factor w/ 7 levels "11","15","21",..: 3 5 5 1 3 3 2 3 1 4 ...
## $ TWI : num 10.14 9.19 9.74 10.78 10.51 ...
## $ planCurv : num 0.005263 0.002523 -0.005231 -0.000883 0.001165 ...
## $ profCurv : num -0.010737 -0.003077 0.008209 0.000877 0.000685 ...
## $ slope : num 12.51 1.82 18.31 5.2 12.48 ...
## $ geology : Factor w/ 11 levels "0","1","2","7",..: 8 3 8 2 2 8 2 2 8 5 ...
## $ x : num 529962 501762 552912 545262 572112 ...
## $ y : num 160712 149912 148212 189712 135962 ...
## $ LS : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 1 ...
## $ LSregr : num 1 0 1 0 1 0 1 0 1 0 ...
library(SpatialML) # provides the grf() function
Coords <- LS_train[, 9:10] # x/y coordinates of the training observations
# Run GRF: with kernel = "adaptive", bw = 40 means each local sub-model is fitted
# on the 40 nearest observations; mtry = 3 variables are tried at each split
set.seed(123) # initialize the random number generator for reproducibility
gwRF_LS <- grf(LSregr ~ distRoad + DEM + landCover + TWI + planCurv + profCurv + slope + geology,
               LS_train, bw = 40, mtry = 3, kernel = "adaptive", coords = Coords)
## Ranger result
##
## Call:
## ranger(LSregr ~ distRoad + DEM + landCover + TWI + planCurv + profCurv + slope + geology, data = LS_train, num.trees = 500, mtry = 3, importance = "impurity", num.threads = NULL)
##
## Type: Regression
## Number of trees: 500
## Sample size: 4150
## Number of independent variables: 8
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.117128
## R squared (OOB): 0.5315825
## distRoad DEM landCover TWI planCurv profCurv slope geology
## 181.93889 113.22987 33.98818 101.59075 83.15042 123.31246 302.83349 39.09378
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.03162 0.00000 -0.01882 0.03284 1.00000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3981333 -0.0030000 0.0000000 -0.0003047 0.0042917 0.2957333
## Min Max Mean StD
## distRoad 0 4.846175 0.6640442 0.8686210
## DEM 0 4.442619 0.7486061 0.7680691
## landCover 0 4.301103 0.1936414 0.3141822
## TWI 0 4.284581 0.6287907 0.6790622
## planCurv 0 3.126327 0.4565076 0.4888202
## profCurv 0 4.048531 0.5832407 0.6300842
## slope 0 4.913916 0.9080469 0.9518115
## geology 0 3.669751 0.2213336 0.4038733
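The summary above is printed automatically when grf() runs. The held-out test set (LS_test) is not used further in this lab; if the installed SpatialML version exposes a predict() method for grf objects, the model could be evaluated on it along these lines (a sketch only; the argument names and the equal weighting of local and global predictions are assumptions to check against the package documentation):
# Predict on the test set, blending local and global models (weights are an assumption)
pred_test <- predict.grf(gwRF_LS, LS_test,
                         x.var.name = "x", y.var.name = "y",
                         local.w = 0.5, global.w = 0.5)
mean((pred_test - LS_test$LSregr)^2) # test mean squared error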
7.1.1 Feature importance
7.1.1.1 Global variable importance plot
Based on the results of the GRF, we can plot the variable importance ranking for illustrative purposes. The values come from the "Global ML Model Summary" -> "Importance" output above.
# Create a data frame with variable names and importance values
variable_importance <- data.frame (
Variable = c("distRoad", "DEM", "landCover", "TWI", "planCurv", "profCurv", "slope", "geology"),
Importance = c(181.18490, 114.32444, 34.23643, 101.51863, 84.81667, 125.93651, 297.74411, 39.22721 ) # Importance - Global ML
)
# Assign different colors to the top three important variables
variable_importance$Color <- ifelse(variable_importance$Importance >= sort(variable_importance$Importance, decreasing = TRUE)[3], "orange", "skyblue")
# Create a bar plot for variable importance with different colors for the top three variables
ggplot(data = variable_importance, aes(x = Variable, y = Importance, fill = Color)) +
geom_bar(stat = "identity") +
scale_fill_identity() +
labs(title = "Variable Importance Plot", x = "Variable", y = "Importance") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
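Hard-coding the importance values works for a one-off plot, but they can also be extracted directly from the fitted object. Assuming the global ranger model is stored in gwRF_LS$Global.Model (as in recent SpatialML versions), a sketch is:
library(ranger) # for the importance() extractor
glob_imp <- ranger::importance(gwRF_LS$Global.Model) # named vector of impurity importances
variable_importance <- data.frame(Variable = names(glob_imp),
                                  Importance = as.numeric(glob_imp))
# The colour assignment and the ggplot code above can then be reused unchanged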
7.1.1.2 Local feature importance
We can plot the local feature importance (FI) scores for the two globally most important variables, slope and distance to roads, with the values mapped over geographic space.
# Create a data frame with the values of the local variables importance and the coordinates for each location
gwRF_LS_var<-gwRF_LS$Local.Variable.Importance
gwRF_LS_var_XY<-cbind(gwRF_LS_var,LS_train$x,LS_train$y ) # add coordinates
colnames(gwRF_LS_var_XY)[9]<- "X" #rename column X-coordinate
colnames(gwRF_LS_var_XY)[10]<- "Y" #rename column Y-coordinate
str(gwRF_LS_var_XY)
## 'data.frame': 4150 obs. of 10 variables:
## $ distRoad : num 0.53312 0.00351 0.31458 0.07548 2.46608 ...
## $ DEM : num 1.11112 0.00548 0.13488 0.79269 0.66155 ...
## $ landCover: num 0.1857 0.0072 0.0856 0.6757 0.0124 ...
## $ TWI : num 0.2571 0.0425 1.1416 0.2105 0.392 ...
## $ planCurv : num 0.20406 0.00112 0.12462 0.84817 1.17041 ...
## $ profCurv : num 0.1011 0.0032 0.2125 1.6097 2.5097 ...
## $ slope : num 0.3679 0.0269 0.6325 0.6656 0.9508 ...
## $ geology : num 0.078629 0.000606 0.09991 0.572837 0.01066 ...
## $ X : num 569837 561612 540062 504862 548312 ...
## $ Y : num 133062 139887 154237 140712 169212 ...
library(terra) # for vect(), used to read the shapefile
library(sf)    # for spatial data operations
# Read the canton Vaud borders and convert them to an sf (simple feature) object
Vaud <- vect("data/RF/Vaud_CH.shp")
Vaud_sf <- st_as_sf(Vaud)
# Extract the local importance values of slope as a vector
pred.vect <- as.vector(gwRF_LS_var_XY$slope)
library(classInt) #for classification
brk<-(classIntervals(pred.vect, n=5, style = "fisher"))
brkInt<-round(brk$brks, digits=2)
print(brkInt)
## [1] 0.00 0.39 1.00 1.81 2.99 4.91
#natural breaks (fisher)
ggplot() +
geom_point(data = gwRF_LS_var_XY, aes(x = X, y = Y, colour = slope), size = 1) +
scale_color_gradientn(colors = c("white", "orange", "red", "blue"),
                      breaks = brkInt,   # use the Fisher breaks computed above
                      labels = brkInt) +
labs( x = "X Coordinate", y = "Y Coordinate")+
ggtitle("Local average FI - Slope")+
geom_sf(data = Vaud_sf, fill = "transparent", color = "black", size=2) #overlap borders
# Extract the local importance values of distance to roads as a vector
pred.vect <- as.vector(gwRF_LS$Local.Variable.Importance$distRoad)
brk<-(classIntervals(pred.vect, n=5, style = "fisher"))
brkInt<-round(brk$brks, digits=2)
print(brkInt) # print breaks
## [1] 0.00 0.36 0.98 2.04 3.33 4.85
#natural breaks (fisher)
ggplot() +
geom_point(data = gwRF_LS_var_XY, aes(x = X, y = Y, colour = distRoad), size = 1) +
scale_color_gradientn(colors = c("white", "orange", "red", "blue"),
                      breaks = brkInt,   # use the Fisher breaks computed above
                      labels = brkInt) +
labs( x = "X Coordinate", y = "Y Coordinate")+
ggtitle("Local average FI - Distance to roads")+
geom_sf(data = Vaud_sf, fill = "transparent", color = "black", size=2) #overlap borders
7.1.2 Local R squared
The local R-squared value measures how well each local sub-model fits its neighborhood. Values close to 1 indicate a strong local fit, while values near zero (or even negative, as visible in the output below) indicate that the local model explains little of the local variability.
Rsq<-gwRF_LS$LGofFit$LM_Rsq100
Rsq_XY<-as.data.frame(cbind(Rsq,LS_train$x,LS_train$y)) # add coordinates
colnames(Rsq_XY)[2]<- "X"
colnames(Rsq_XY)[3]<- "Y"
str(Rsq_XY)
## 'data.frame': 4150 obs. of 3 variables:
## $ Rsq: num -0.2341 -0.0148 -0.2604 0.2742 0.5882 ...
## $ X : num 569837 561612 540062 504862 548312 ...
## $ Y : num 133062 139887 154237 140712 169212 ...
ggplot () +
geom_point(data = Rsq_XY, aes(x = X, y = Y, colour = Rsq), size = 1)+
scale_color_gradientn(colors = c("white", "yellow","red", "blue"))+
labs(title = "Rsq", x = "X Coordinate", y = "Y Coordinate")+
ggtitle("Local R2")+
geom_sf(data = Vaud_sf, fill = "transparent", color = "black", size=2) #overlap borders
7.2 Conclusions and further analyses
In this exercise GRF has been used as a purely exploratory tool to estimate the spatial variation of the relationship between landslides in canton Vaud (Switzerland) and the influencing factors. It allowed us to produce maps delineating the local average importance of the most influential features and to visualize the local fitting performance (local R2) on a map.
To make sure that everything is clear, we suggest you answer the following questions:
Evaluate the following algorithms in terms of their interpretability and explainability: Support Vector Machines, linear regression, deep learning models, decision trees, K-Nearest Neighbors, neural networks, Random Forests, logistic regression.
Which are the three most important variables of your model (based on the global variable importance)?
Which slope value (or range of values) gives the highest probability of landslide occurrence? And for geology, which classes are the most important?
Evaluate the spatial variation of the relationship between landslides and slope / distance to roads in your study area by visually inspecting the local average importance of these features.
You can replicate this code (some chunks of it) to evaluate the local average importance of the third most important variable, as well as to map the local mean squared error.
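As a starting point, here is a sketch for the third globally most important variable, profCurv (profile curvature), reusing the objects created earlier:
# Map the local average importance of profile curvature (profCurv)
pred.vect <- as.vector(gwRF_LS_var_XY$profCurv)
brkInt <- round(classIntervals(pred.vect, n = 5, style = "fisher")$brks, digits = 2)
ggplot() +
  geom_point(data = gwRF_LS_var_XY, aes(x = X, y = Y, colour = profCurv), size = 1) +
  scale_color_gradientn(colors = c("white", "orange", "red", "blue"),
                        breaks = brkInt, labels = brkInt) +
  labs(x = "X Coordinate", y = "Y Coordinate") +
  ggtitle("Local average FI - Profile curvature") +
  geom_sf(data = Vaud_sf, fill = "transparent", color = "black", size = 2) # canton borders
# The local mean squared error can be mapped in the same way, using the corresponding
# column of gwRF_LS$LGofFit (inspect the available columns with names(gwRF_LS$LGofFit))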