---
title: "Data Preparation for using devMSMs with Longitudinal Data"
output: html_document
date: "2024-05-17"
---
The following notebook provides a guide to assist users in preparing their data for use with *devMSMs*.  

Please review the accompanying manuscript for a full conceptual and practical introduction to MSMs in the context of developmental data. Please also see the vignettes on the *devMSMs* website for step-by-step guidance on the use of this code: https://istallworthy.github.io/devMSMs/index.html. 

Headings denote accompanying website sections and steps. We suggest using the interactive outline tool (located above the Console) for ease of navigation.

We advise users to implement the data preparation steps appropriate to their starting data, with the goal of assigning to `data` one of the following wide data formats (see Figure 1) for use in the package:  

* a single data frame of data in wide format with no missing data  

* a mids object (output from mice::mice()) of data imputed in wide format  

* a list of data imputed in wide format as data frames.  
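
The three accepted formats can be told apart by their class. The chunk below is a minimal sketch of such a check; `describe_data_format()` and the `toy_*` objects are hypothetical names made up for illustration and are not part of *devMSMs* or *devMSMsHelpers*.

```{r}
# hypothetical helper (not part of devMSMs): report which format an object is in
describe_data_format <- function(x) {
  if (inherits(x, "mids")) {
    "mids object"
  } else if (is.data.frame(x)) {
    if (anyNA(x)) "data frame with missing values" else "complete data frame"
  } else if (is.list(x) && length(x) > 0 &&
             all(vapply(x, is.data.frame, logical(1)))) {
    "list of data frames"
  } else {
    "unsupported format"
  }
}

# toy objects for illustration only
toy_complete <- data.frame(ID = 1:3, A.1 = c(0.2, 1.1, 0.5))
toy_missing <- data.frame(ID = 1:3, A.1 = c(0.2, NA, 0.5))
toy_list <- list(toy_complete, toy_complete)

describe_data_format(toy_complete)
describe_data_format(toy_missing)
describe_data_format(toy_list)
```

A data frame reported as having missing values would need imputation before use.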


Users have several options for reading in data. They can begin this workflow with the following options:  

* long data with missingness can be formatted (D1a) and converted to wide data (D1b) for imputation (D3a)

* wide data with missingness can be formatted (D2a) before imputing (D3a)

* data already imputed in wide format can be read in as a list (D3b)

* data with no missingness can be formatted (D1 or D2) and used without imputation

# Installation
Install the *devMSMsHelpers* helper functions and *devMSMs* itself from GitHub.

```{r}
install.packages("devtools", quiet = TRUE)
library(devtools)

install_github("istallworthy/devMSMsHelpers", quiet = TRUE)
library(devMSMsHelpers)

install_github("istallworthy/devMSMs", quiet = TRUE)
library(devMSMs, include.only = "initMSM")
```


# Specify Core Inputs

```{r}
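# required: exposure variable at each time point (wide format, "variable.time")

exposure <- c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58")

# required: outcome variable (wide format, "variable.time")

outcome <- "StrDif_Tot.58"


# required: time-invariant confounders

ti_conf <- c("state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",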
             "RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
             "peri_health", "caregiv_health", "gov_assist")

# time-varying confounders (wide format, "variable.time")

tv_conf = c("SAAmylase.6", "SAAmylase.15", "SAAmylase.24", 
            "MDI.6", "MDI.15",                                            
            "RHasSO.6", "RHasSO.15", "RHasSO.24","RHasSO.35", "RHasSO.58",                                         
            "WndNbrhood.6","WndNbrhood.24", "WndNbrhood.35", "WndNbrhood.58",                                       
            "IBRAttn.6", "IBRAttn.15", "IBRAttn.24",                                   
            "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",                                           
            "HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", "HOMEETA1.58",                               
            "InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35", "InRatioCor.58",                         
            "CORTB.6", "CORTB.15", "CORTB.24",                                                                  
            "EARS_TJo.24", "EARS_TJo.35",                                        
            "LESMnPos.24", "LESMnPos.35",                                  
            "LESMnNeg.24", "LESMnNeg.35",       
            "StrDif_Tot.35", "StrDif_Tot.58",    
            "fscore.35", "fscore.58")

# directory used for saved output; change this to a folder on your machine

home_dir <- '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa/'

```
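
The inputs above follow a "variable.time" naming convention, with the time point after the final period. As a small sketch of how such names split into a variable stem and a time point, using only base R (the `toy_*` names are illustrative and not used elsewhere in this notebook):

```{r}
# illustrative wide-format names following the "variable.time" convention
toy_names <- c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58")

# text before the final ".time" suffix
toy_stem <- sub("\\.[0-9]+$", "", toy_names)

# digits after the final "."
toy_time <- as.numeric(sub("^.*\\.", "", toy_names))

unique(toy_stem)
toy_time
```

Checking names this way can catch typos in the time suffix before the inputs are passed to any helper function.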


# D1. Starting with a Single Long Data Frame
This section is for users beginning with a single data frame in long format (with or without missingness).

## D1a. Format Long Data
Users whose data columns are not yet labeled correctly can format their long data using the `formatLongData()` helper function. 

Read in long data and specify any factors and integers
```{r}
# add path to your long data file here if you want to begin with long data

data_long <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_long_missing_unformatted.csv", 
                      header = TRUE)

# list factor variables in long format here

factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                        "RHasSO") 


# list any integer variables in long format here

integer_confounders <- c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", 
                         "peri_health", "caregiv_health" , 
                         "gov_assist", "B18Raw", "EARS_TJo", "MDI")
```

Format your long data
```{r}
data_long_f <- formatLongData(data = data_long, 
                              exposure = exposure, 
                              outcome = outcome, 
                              sep = "\\.",
                              time_var = "Tage", # list original time variable here if it's not "WAVE"
                              id_var = "S_ID", # list original id variable here if it's not "ID"
                              missing = -9999, # list missing value here
                              factor_confounders = factor_confounders, # list factor variables in long format here
                              integer_confounders = integer_confounders, # list any integer variables in long format here
                              home_dir = home_dir, save.out = TRUE) 
head(data_long_f)
```

Make sure to inspect your data at each step of the way to ensure formatting looks correct. 
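
For intuition about what formatting involves, the chunk below performs comparable steps by hand on toy data using base R: renaming the id and time columns, recoding a missing-value code to `NA`, and converting a categorical variable to a factor. This is a sketch under the assumption that the target column names are "ID" and "WAVE" (as the comments in the chunk above suggest); it is not the implementation of `formatLongData()`, and the `toy_long` data are made up.

```{r}
# toy long data with nonstandard id/time names and -9999 as the missing code
toy_long <- data.frame(S_ID = c(1, 1, 2, 2),
                       Tage = c(6, 15, 6, 15),
                       ESETA1 = c(0.4, -9999, 1.2, 0.8),
                       state = c(1, 1, 0, 0))

# rename the id and time columns to "ID" and "WAVE"
names(toy_long)[names(toy_long) == "S_ID"] <- "ID"
names(toy_long)[names(toy_long) == "Tage"] <- "WAVE"

# recode the missing-value code to NA
toy_long[toy_long == -9999] <- NA

# convert a categorical variable to a factor
toy_long$state <- as.factor(toy_long$state)

str(toy_long)
```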


## D1b. Transform Formatted Long Data to Wide
Users with correctly formatted data in long format can use the following code to transform their data into wide format, and then either proceed to using the package (if there is no missing data) or impute (if less than 20% of the data are missing and missingness is plausibly at random, MAR). The reshape step may produce warnings about *NA* values.    

```{r}
# identify time-varying confounders 

sep <- "\\." # time delimiter
v <- sapply(strsplit(tv_conf[!grepl("\\:", tv_conf)], sep), head, 1)
v <- c(v[!duplicated(v)], sapply(strsplit(exposure[1], sep), head, 1))


# reshape data 

data_wide_f <- stats::reshape(data = data_long_f, 
                              idvar = "ID", # list ID variable in your dataset
                              v.names = v, 
                              timevar = "WAVE", # list time variable in your long dataset
                              times = c(6, 15, 24, 35, 58), # list all time points in your dataset
                              direction = "wide")

# drop columns that are entirely NA (e.g., variables not collected at a given time point)

data_wide_f <- data_wide_f[, colSums(is.na(data_wide_f)) < nrow(data_wide_f)]

head(data_wide_f)
```
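
The same reshape pattern can be tried on a small toy data frame to see how long rows become "variable.time" columns. The data below are made up for illustration; `MDI` is deliberately left unmeasured at the last wave so that the all-`NA` column is dropped.

```{r}
# toy long data: two individuals, three waves
toy_long_f <- data.frame(ID = rep(1:2, each = 3),
                         WAVE = rep(c(6, 15, 24), times = 2),
                         state = rep(c(1, 0), each = 3),     # time-invariant
                         ESETA1 = c(0.1, 0.3, 0.2, 1.0, 0.9, 1.4),
                         MDI = c(90, 95, NA, 100, 102, NA))  # not measured at wave 24

toy_wide <- stats::reshape(data = toy_long_f,
                           idvar = "ID",
                           v.names = c("ESETA1", "MDI"),
                           timevar = "WAVE",
                           direction = "wide")

# drop columns that are entirely NA
toy_wide <- toy_wide[, colSums(is.na(toy_wide)) < nrow(toy_wide)]

names(toy_wide)
```

The result has one row per individual, and time-invariant variables such as `state` keep their original names.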


# D2. Starting with a Single Wide Data Frame
Alternatively, we could start with a single data frame of wide data (with or without missingness).  

## D2a. Format Wide Data
Users can draw on the `formatWideData()` helper function to format their wide data.   

Read in wide data and specify any factors and integers
```{r}

# add path to your wide data file here 

data_wide <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/Tutorial paper/merged_tutorial_filtered.csv", 
                      header = TRUE)


# list factor variables in wide format here

factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
                        "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                        "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58") 


# list any integer variables in wide format here

integer_confounders = c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", "peri_health", "caregiv_health" , 
                        "gov_assist", "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58", 
                        "EARS_TJo.24", "EARS_TJo.35", "MDI.6", "MDI.15") 
```

Format wide data
```{r}
data_wide_f <- formatWideData(data = data_wide, 
                              exposure = exposure,
                              outcome = outcome, 
                              sep = "\\.",
                              id_var = "ID", # list original id variable here if it's not "ID"
                              missing = NA, # list missing value here
                              factor_confounders = factor_confounders,
                              integer_confounders = integer_confounders,
                              home_dir = home_dir, save.out = TRUE) 

head(data_wide_f)
```
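
Before or after formatting, it can help to confirm that every variable named in the core inputs is actually a column in the wide data. A minimal sketch with toy data follows; all `toy_*` names are illustrative, and "income" is deliberately absent from the data to show what a mismatch looks like.

```{r}
# toy wide data and toy inputs (names are illustrative)
toy_wide_data <- data.frame(ID = 1:3,
                            state = c(1, 0, 1),
                            A.1 = c(0.2, 0.4, 0.1),
                            A.2 = c(0.3, NA, 0.5),
                            Y.2 = c(10, 12, 9))
toy_exposure <- c("A.1", "A.2")
toy_outcome <- "Y.2"
toy_ti_conf <- c("state", "income") # "income" is not in the data

# variables expected in the data but not found among its columns
toy_expected <- unique(c("ID", toy_exposure, toy_outcome, toy_ti_conf))
toy_absent <- setdiff(toy_expected, names(toy_wide_data))
toy_absent
```

An empty result means all listed variables were found.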


# D3. Starting with Formatted Wide Data with Missingness
Most data collected from humans will have some degree of missing data.  

## D3a. Multiply Impute Formatted, Wide Data Frame Using Mice
Users have the option of imputing their wide, formatted data using the *mice* package, for use with *devMSMs*.    
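
Before imputing, it may be useful to quantify how much data are missing, since this notebook suggests imputation when less than 20% of the data are missing. The sketch below computes the percentage missing per column and overall on a made-up data frame (`toy_miss` is illustrative only).

```{r}
# toy wide data with some missing values
toy_miss <- data.frame(ID = 1:5,
                       A.1 = c(0.2, NA, 0.1, 0.7, 0.3),
                       A.2 = c(NA, NA, 0.5, 0.6, 0.2),
                       Y.2 = c(10, 12, 9, 11, 8))

# percent missing in each column
toy_pct_col <- round(100 * colMeans(is.na(toy_miss)), 1)
toy_pct_col

# percent missing across all cells
toy_pct_all <- round(100 * mean(is.na(toy_miss)), 1)
toy_pct_all
```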

Specify imputation parameters
```{r}
# optional seed for reproducibility 

s <- NA 
s <- 1234 # empirical example


# optional; number of imputations (default is 5)

m <- NA
m <- 5 # empirical example


# optional; imputation method: "pmm", "midastouch", "sample", "cart", or "rf" (default)

method <- NA
method <- "rf" # empirical example


# optional maximum iterations for imputation (default is 5)
 
maxit <- NA
maxit <- 5 # empirical example
```

Impute data
```{r}
set.seed(s)

imputed_data <- imputeData(data = data_wide_f, 
                           exposure = exposure, 
                           outcome = outcome, 
                           sep = "\\.",
                           m = m, 
                           method = method, 
                           maxit = maxit, 
                           para_proc = TRUE, 
                           seed = s, 
                           read_imps_from_file = FALSE, 
                           home_dir = home_dir, 
                           save.out = TRUE)
```

To use the package with imputed data
```{r}

data <- imputed_data


# optional: extract first imputed dataset for inspection

library(mice)
summary(mice::complete(data, 1))
```
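
To inspect every imputed dataset rather than only the first, `mice::complete()` can return all completed datasets as a list. The sketch below is independent of the workflow objects: it uses the `nhanes` example data shipped with *mice* and small, arbitrary settings (`m = 2`, `maxit = 2`, `method = "pmm"`) chosen only to keep the example fast. It runs only if *mice* is installed.

```{r}
# runs only if mice is installed; uses mice's built-in nhanes example data
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  toy_imp <- mice(nhanes, m = 2, method = "pmm", maxit = 2,
                  seed = 1234, printFlag = FALSE)

  # list with one completed data frame per imputation
  toy_completed <- mice::complete(toy_imp, action = "all")

  # TRUE for any imputed dataset that still contains NAs
  toy_has_na <- vapply(toy_completed, anyNA, logical(1))
  toy_has_na
}
```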


Alternatively, users could read in a saved mids object for use with *devMSMs*.    
```{r}
# change this to the path of your saved .rds file

imputed_data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_wide_imputed_mids.rds") 


data <- imputed_data

# optional: extract first imputed dataset for inspection

library(mice)
summary(mice::complete(data, 1))
anyNA(mice::complete(data, 1)) # make sure this is FALSE, meaning no NAs remain after imputation
```


## D3b. Read in a List of Wide Imputed Data Saved Locally 
Alternatively, if a user has imputed datasets already created using a program other than *mice*, they can read in, as a list, files saved locally as .csv files (labeled “1”:m) in a single folder. 

```{r}
# change this to match your local folder

folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"


# make sure pattern matches suffix of your data ("$" anchors the match to the end of the file name)

files <- list.files(folder, full.names = TRUE, pattern = "\\.csv$") 

```
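
One thing to watch when files are labeled by number: `list.files()` returns names sorted as text, so with ten or more imputations "10.csv" comes before "2.csv". If the order of the imputations matters to you, the files can be reordered by their numeric label. The sketch below demonstrates this on toy files written to a temporary folder; the folder and file names are made up and assume the label is the entire file name before the extension.

```{r}
# write 12 tiny toy files named 1.csv ... 12.csv to a temporary folder
toy_folder <- file.path(tempdir(), "toy_imputations")
dir.create(toy_folder, showWarnings = FALSE)
for (i in 1:12) {
  write.csv(data.frame(ID = 1:2, imp = i),
            file.path(toy_folder, paste0(i, ".csv")), row.names = FALSE)
}

toy_files <- list.files(toy_folder, full.names = TRUE, pattern = "\\.csv$")

# order by the numeric label rather than alphabetically
toy_label <- as.numeric(tools::file_path_sans_ext(basename(toy_files)))
toy_files <- toy_files[order(toy_label)]

basename(toy_files)
```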

To use the package with a list of imputed data from above
```{r}
data <- lapply(files, function(file) {
  imp_data <- read.csv(file)
  imp_data
})


# check for character variables

any(as.logical(unlist(lapply(data, function(x) { # if this is TRUE, run next lines
  any(sapply(x, class) == "character") }))))
names(data[[1]])[sapply(data[[1]], class) == "character"] # find names of any character variables

# run this for each variable that needs char -> factor

data <- lapply(data, function(x) {
  x[, "state"] <- factor(x[, "state"], labels = c(1, 0))
  x # return the modified data frame, not just the new column
})


# make factors; list factor variables here

factor_covars <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",       
                   "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                   "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")

data <- lapply(data, function(x) {
  x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
  x })
```
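
After reading and converting a list of imputed datasets, a quick consistency check can confirm that every data frame has the same columns and classes and that no missing values remain. The chunk below is a sketch on toy data; `toy_imps` and the other `toy_*` names are illustrative and not part of the workflow.

```{r}
# toy list of two imputed datasets
toy_imps <- list(data.frame(ID = 1:2, state = factor(c(1, 0)), A.1 = c(0.2, 0.4)),
                 data.frame(ID = 1:2, state = factor(c(1, 0)), A.1 = c(0.3, 0.4)))

# same column names, in the same order, in every imputed dataset
toy_same_names <- all(vapply(toy_imps, function(x) {
  identical(names(x), names(toy_imps[[1]]))
}, logical(1)))

# same column classes in every imputed dataset
toy_classes <- lapply(toy_imps, function(x) {
  vapply(x, function(col) class(col)[1], character(1))
})
toy_same_classes <- all(vapply(toy_classes, identical, logical(1), toy_classes[[1]]))

# no remaining missing values in any imputed dataset
toy_no_na <- !any(vapply(toy_imps, anyNA, logical(1)))

c(names = toy_same_names, classes = toy_same_classes, complete = toy_no_na)
```

Any `FALSE` in the result points to a dataset that differs from the first or still contains missing values.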

Having read in their data, users should now have assigned to `data` wide, complete data in one of three forms: a data frame, a mids object, or a list of imputed data frames, ready for use with the *devMSMs* functions.