A walkthough of submitting a slighty more complex model in R

submission
guide
R

Here we show the steps for submitting a slighty more complex model in R

Authors

Gert Stulp

Lisa Sivak

Published

March 30, 2024

Here we describe how to prepare and make a slighty more complex submission in R than this one. The sole purpose of this script is to make the submission process more clear by showing some of the errors that you might bump into, not to present a model that is any good.


Our aim is to run a penalized regression via the package glmnet. We’re only doing a penalized regression with three variables, birthyear_bg, gender_bg, and oplmet_2020.

Working on submission.R

We are going to use the packages dplyr, tidyr, and glmnet so we’re going to put these at the top of submission.R. Then we’ll start by selecting the variables, processing these variables, and then turn the dataset into a matrix, needed for the function glmnet. We’ll use the function model.matrix to conveniently turn our factors into dummies.

First part of submission.R:

library(dplyr)
library(tidyr)
library(glmnet)
Warning: package 'glmnet' was built under R version 4.3.3
clean_df <- function(df, background_df = NULL){
  
  ## Selecting variables
  keepcols = c('nomem_encr', # ID variable required for predictions,
               'birthyear_bg', # birthyear of respondents
               'gender_bg', # gender of respondents, factor
               'oplmet_2020') # highest educational level in 2020
  
  ## Keeping data with variables selected
  df <- df |> select(all_of(keepcols))
  
  df <- df |>
    # standardise continuous variable, create factor for categorical variables
    # needed for glmnet
    mutate(
      birthyear_bg = as.numeric(scale(birthyear_bg)), # z-scores, as.numeric to remove attributes
      gender_bg = factor(gender_bg),
      oplmet_2020 = factor(oplmet_2020)
    )
  
  # turn factors into dummy variables, required for glmnet
  df <- model.matrix(~ ., df)
  
  return(df)
  
}

Testing clean_df

Let’s test if our clean_df function works. Let’s read in the PreFer_train_data.csv (which you hopefully have in a different folder than your local repository; see step 1 here {target=“_blank”}). Let’s also read-in the outcome for the training data and the fake training data, which we’ll both use.

library(data.table)
train <- data.table::fread("../../../PreFer_data/PreFer_train_data.csv", 
                           keepLeadingZeros = TRUE, # if FALSE adds zeroes to some dates
                           data.table = FALSE) # returns a data.frame object rather than data.table 
outcome <- data.table::fread("../../../PreFer_data/PreFer_train_outcome.csv", 
                           keepLeadingZeros = TRUE, data.table = FALSE)
fake <- data.table::fread("../../../PreFer_data/PreFer_fake_data.csv", 
                          keepLeadingZeros = TRUE, data.table = FALSE)

clean <- clean_df(train)
str(clean)
 num [1:2939, 1:12] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2939] "2" "3" "5" "9" ...
  ..$ : chr [1:12] "(Intercept)" "nomem_encr" "birthyear_bg" "gender_bg2" ...
 - attr(*, "assign")= int [1:12] 0 1 2 3 4 4 4 4 4 4 ...
 - attr(*, "contrasts")=List of 2
  ..$ gender_bg  : chr "contr.treatment"
  ..$ oplmet_2020: chr "contr.treatment"

A matrix with 2939 cases and 12 variables.

Using training.R

We’ll use this dataset (matrix) for training, using the train_save_model function in training.R. We want to do a penalized regression. We’ll first do a 10-fold cross-validation, determine the optimal penalty (lambda), and run another penalized regression using this lambda [nested cross-validation may be more appropriate here, but we’ll go with it].

training.R:

train_save_model <- function(cleaned_df, outcome_df) {
  set.seed(1) # useful here because penalized regression not deterministic
  
  # Combine cleaned_df and outcome_df to match on ID
  model_df <- merge(cleaned_df, outcome_df, by = "nomem_encr")
  
  # glmnet requires matrix, and also features (X) separately from outcome (y)
  model_df <- as.matrix(model_df) # merged filed into matrix
  
  # features without outcome and identifier
  X <- model_df[ , !(colnames(model_df) %in% c("nomem_encr", "new_child"))]
  y <- model_df[ , colnames(model_df) == "new_child"] # outcome only
  
  ## LASSO regression ##
  # hyperparameter tuning: 10 fold cross-validation to retrieve optimal lambda
  CV <- cv.glmnet(x = X,  y = y, 
                  family = "binomial", nfolds = 10, standardize = FALSE)
  optimal_lambda <- CV$lambda.min
  
  # Run model with optimal lambda
  model <- glmnet(x = X, y = y, 
                  family = "binomial", 
                  lambda = optimal_lambda, standardize = FALSE )
  
  # Save the model
  saveRDS(model, "model.rds")
}

Will it work?!

train_save_model(clean, outcome)
Error in if (!all(o)) {: missing value where TRUE/FALSE needed

Some error about missing values. This is actually about missing values in the outcome variable, which glmnet does not except.

table(outcome$new_child, useNA = "always")

   0    1 <NA> 
 775  212 5431 

Back to submission.R

Annoyingly, there are some missing values in the outcome values, and this means that we want to exclude all cases with missing outcome values (or impute the outcomes). We have created the variable outcome_available for this reason, which has a 1 if the outcome is available. Let’s update our clean_df function.

Updated first part of submission.R:

clean_df <- function(df, background_df = NULL){
  
  # glmnet requires that outcome is available for all cases
  df <- df |> filter(outcome_available == 1)
  
  ## Selecting variables, we don't need outcome_available!
  keepcols = c('nomem_encr', # ID variable required for predictions,
               'birthyear_bg', # birthyear of respondents
               'gender_bg', # gender of respondents, factor
               'oplmet_2020') # highest educational level in 2020
  
  ## Keeping data with variables selected
  df <- df |> select(all_of(keepcols))
  
  df <- df |>
    # standardise continuous variable, create factor for categorical variables
    # needed for glmnet
    mutate(
      birthyear_bg = as.numeric(scale(birthyear_bg)), # z-scores, as.numeric to remove attributes
      gender_bg = factor(gender_bg),
      oplmet_2020 = factor(oplmet_2020)
    )
  
  # turn factors into dummy variables, required for glmnet
  df <- model.matrix(~ ., df)
  
  return(df)
}

Let’s try again with our updated clean_df.

clean <- clean_df(train)
train_save_model(clean, outcome)

Score! The function worked, and a model.rds was created.

Working on submission.R: predict_outcomes

Only one more step: trying to see if we can make predictions via predict_outcomes in submission.R.

predict_outcomes <- function(df, background_df = NULL, model_path = "./model.rds"){
  
  if( !("nomem_encr" %in% colnames(df)) ) {
    warning("The identifier variable 'nomem_encr' should be in the dataset")
  }
  
  # Load the model
  model <- readRDS(model_path)
  
  # Preprocess the fake / holdout data
  df <- clean_df(df) # is matrix
  
  # Exclude id because not used in model
  X_pred <- df[ , !(colnames(df) %in% c("nomem_encr"))]
  
  # Generate predictions from model
  predictions <- predict(model, 
                         X_pred, 
                         type = "response") 
  predictions <- ifelse(predictions > 0.5, 1, 0)  
  
  # Output file should be data.frame with two columns, nomem_encr and predictions
  df_predict <- data.frame("nomem_encr" = df[ , colnames(df) == "nomem_encr"], 
                           "prediction" = predictions)
  # Force columnnames (overrides names that may be given by `predict`)
  names(df_predict) <- c("nomem_encr", "prediction") 
  
  # Return only dataset with predictions and identifier
  return( df_predict )
  
}

Let’s check. It works on the training data itself!

predict_outcomes(train) |> head() # only first couple of rows
  nomem_encr prediction
1     715619          0
2     716711          0
3     717188          0
4     712090          0
5     709537          0
6     701934          0

Let’s check on fake data:

predict_outcomes(fake) |> head() # only first couple of rows
  nomem_encr prediction
1     700001          0
2     700002          0
3     700003          0
4     700004          0
5     700005          0
6     700006          0

Success.

Adding packages to packages.R

We have used the packages dplyr, tidyr, and glmnet, which means you will also put these packages into packages.R.

settings.json

Make sure you have written {"dockerfile": "r.Dockerfile"} in settings.json

Testing everything

Even though we have now tested everything manually, it’s still good to do a check via the terminal with Rscript run.R PreFer_fake_data.csv PreFer_fake_background_data.csv.

Here you can see it in action and that it passes the check.

Photo by Fotis Fotopoulos on Unsplash | Photo by Kelli McClintock on Unsplash