-
Notifications
You must be signed in to change notification settings - Fork 2
/
ExampleDataPrep.Rmd
329 lines (218 loc) · 12.1 KB
/
ExampleDataPrep.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
---
title: "Data Preparation for using devMSMs with Longitudinal Data"
output: html_document
date: "2024-05-17"
---
The following notebook provides a guide to assist the user in preparingn their data for use with *devMSMs*.
Please review the accompanying manuscript for a full conceptual and practical introduction to MSMs in the context of developmental data. Please also see the vignettes on the *devMSMs* website for step-by-step guidance on the use of this code: https://istallworthy.github.io/devMSMs/index.html.
Headings denote accompanying website sections and steps. We suggest using the interactive outline tool (located above the Console) for ease of navigation.
We advise users implement the appropriate data prepeartion steps in accordance with your starting data, with the goal of assigning to `data` one of the following wide data formats (see Figure 1) for use in the package:
* a single data frame of data in wide format with no missing data
* a mids object (output from mice::mice()) of data imputed in wide format
* a list of data imputed in wide format as data frames.
Users have several options for reading in data. They can begin this workflow with the following options:
* long data with missingness can can be formatted and converted to wide data (P1a) for imputation (P2)
* wide with missingness can be formatted (P1b) before imputing (P2)
* data already imputed in wide format can be read in as a list (P2).
* data with no missingness (P3)
## Installation
You will always need to install the *devMSMs helper* functions from Github (*devMSMsHelpers*).
```{r}
install.packages("devtools", quiet = TRUE)
library(devtools)
install_github("istallworthy/devMSMsHelpers", quiet = TRUE)
library(devMSMsHelpers)
install_github("istallworthy/devMSMs", quiet = TRUE)
library(devMSMs, include.only = "initMSM")
```
# Specify Core Inputs
```{r}
# required
exposure = c("ESETA1.6", "ESETA1.15", "ESETA1.24", "ESETA1.35", "ESETA1.58") # wide format
# outcome
outcome <- "StrDif_Tot.58"
# required
ti_conf = c("state", "BioDadInHH2", "PmAge2", "PmBlac2", "TcBlac2", "PmMrSt2", "PmEd2", "KFASTScr",
"RMomAgeU", "RHealth", "HomeOwnd", "SWghtLB", "SurpPreg", "SmokTotl", "DrnkFreq",
"peri_health", "caregiv_health", "gov_assist")
tv_conf = c("SAAmylase.6","SAAmylase.15", "SAAmylase.24",
"MDI.6", "MDI.15",
"RHasSO.6", "RHasSO.15", "RHasSO.24","RHasSO.35", "RHasSO.58",
"WndNbrhood.6","WndNbrhood.24", "WndNbrhood.35", "WndNbrhood.58",
"IBRAttn.6", "IBRAttn.15", "IBRAttn.24",
"B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
"HOMEETA1.6", "HOMEETA1.15", "HOMEETA1.24", "HOMEETA1.35", "HOMEETA1.58",
"InRatioCor.6", "InRatioCor.15", "InRatioCor.24", "InRatioCor.35", "InRatioCor.58",
"CORTB.6", "CORTB.15", "CORTB.24",
"EARS_TJo.24", "EARS_TJo.35",
"LESMnPos.24", "LESMnPos.35",
"LESMnNeg.24", "LESMnNeg.35",
"StrDif_Tot.35", "StrDif_Tot.58",
"fscore.35", "fscore.58")
home_dir = '/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/isa/'
```
# D1. Starting with a Single Long Data Frame
Users beginning with a single data frame in long format (with or without missingness).
## D1a. Format Long Data
Users have the option of formatting their long data if their data columns are not yet labeled correctly using the `formatLongData()` helper function.
Read in long data and specify any factors and integers
```{r}
# add path to your long data file here if you want to begin with long data
data_long <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_long_missing_unformatted.csv",
header = TRUE)
# list factor variables in long format here
factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO")
# list any integer variables in long format here
integer_confounders <- c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB",
"peri_health", "caregiv_health" ,
"gov_assist", "B18Raw", "EARS_TJo", "MDI")
```
Format your long data
```{r}
data_long_f <- formatLongData(data = data_long,
exposure = exposure,
outcome = outcome,
sep = "\\.",
time_var = "Tage", # list original time variable here if it's not "WAVE"
id_var = "S_ID", # list original id variable here if it's not "ID"
missing = -9999, # list missing value here
factor_confounders = factor_confounders, # list factor variables in long format here
integer_confounders = integer_confounders, # list any integer variables in long format here
home_dir = home_dir, save.out = TRUE)
head(data_long_f)
```
Make sure to inspect your data at each step of the way to ensure formatting looks correct.
## D1b. Tranform Formatted Long Data to Wide
Users with correctly formatted data in long format have the option of using the following code to transform their data into wide format, to proceed to using the package (if there is no missing data) or imputing (with < 20% missing data MAR). They may see some warnings about *NA* values.
```{r}
# identify time-varying confounders
sep <- "\\." # time delimiter
v <- sapply(strsplit(tv_conf[!grepl("\\:", tv_conf)], sep), head, 1)
v <- c(v[!duplicated(v)], sapply(strsplit(exposure[1], sep), head, 1))
# reshape data
library(stats)
data_wide_f <- stats::reshape(data = data_long_f,
idvar = "ID", #list ID variable in your dataset
v.names = v,
timevar = "WAVE", # list time variable in your long dataset
times = c(6, 15, 24, 35, 58), # list all time points in your dataset
direction = "wide")
data_wide_f <- data_wide_f[, colSums(is.na(data_wide_f)) < nrow(data_wide_f)]
head(data_wide_f)
```
# D2. Starting with a Single Wide Data Frame
Alternatively, we could start with a single data frame of wide data (with or without missingness).
## D2a. Format Wide Data
Users can draw on the `formatWideData()` helper function to format their wide data.
Read in long data and specify any factors and integers
```{r}
# add path to your wide, formatted data file here
data_wide <- read.csv("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/Tutorial paper/merged_tutorial_filtered.csv",
header = TRUE)
# list factor variables in wide format here
factor_confounders <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
# list any integer variables in wide format here
integer_confounders = c("KFASTScr", "PmEd2", "RMomAgeU", "SWghtLB", "peri_health", "caregiv_health" ,
"gov_assist", "B18Raw.6", "B18Raw.15", "B18Raw.24", "B18Raw.58",
"EARS_TJo.24", "EARS_TJo.35", "MDI.6", "MDI.15")
```
Format wide data
```{r}
data_wide_f <- formatWideData(data = data_wide,
exposure = exposure,
outcome = outcome,
sep = "\\.",
id_var = "ID", # list original id variable here if it's not "ID"
missing = NA, # list missing value here
factor_confounders = factor_confounders,
integer_confounders = integer_confounders,
home_dir = home_dir, save.out = TRUE)
head(data_wide_f)
```
# D3. Starting with Formatted Wide Data with Missingness
Most data collected from humans will have some degree of missing data.
## D3a. Multiply Impute Formatted, Wide Data Frame Using Mice
Users have the option of imputing their wide, formatted data using the *mice* package, for use with *devMSMs*.
Specify imputation parameters
```{r}
# optional seed for reproducibility
s <- NA
s <- 1234 # empirical example
# optional; number of imputations (default is 5)
m <- NA
m <- 5 # empirical example
# optional; provide an imputation method pmm, midastouch, sample, cart , rf (default)
method <- NA
method <- "rf" # empirical example
# optional maximum iterations for imputation (default is 5)
maxit <- NA
maxit <- 5 # empirical example
```
Impute data
```{r}
set.seed(s)
imputed_data <- imputeData(data = data_wide_f,
exposure = exposure,
outcome = outcome,
sep = "\\.",
m = m,
method = method,
maxit = maxit,
para_proc = TRUE,
seed = s,
read_imps_from_file = FALSE,
home_dir = home_dir,
save.out = TRUE)
```
To use the package with imputed data
```{r}
data <- imputed_data
# optional: extract first imputed dataset for inspection
library(mice)
summary(mice::complete(data, 1))
```
Alternatively, users could read in a saved mids object for use with *devMSMS*.
```{r}
# place your .rds file in your home directory and change the name of file here
imputed_data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/FLP_wide_imputed_mids.rds")
data <- imputed_data
# optional: extract first imputed dataset for inspection
library(mice)
summary(mice::complete(imputed_data, 1))
anyNA(data) # make sure this is false, meaning no NAs
```
## D3b. Read in a List of Wide Imputed Data Saved Locally
Alternatively, if a user has imputed datasets already created using a program other than *mice*, they can read in, as a list, files saved locally as .csv files (labeled “1”:m) in a single folder.
```{r}
# change this to match your local folder
folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"
# make sure pattern matches suffix of your data
files <- list.files(folder, full.names = TRUE, pattern = "\\.csv")
```
To use the package with a list of imputed data from above
```{r}
data <- lapply(files, function(file) {
imp_data <- read.csv(file)
imp_data
})
# check for character variables
any(as.logical(unlist(lapply(data, function(x) { # if this is TRUE, run next lines
any(sapply(x, class) == "character") }))))
names(data[[1]])[sapply(data[[1]], class) == "character"] # find names of any character variables
# run this for each variable that needs char -> factor
data <- lapply(data, function(x){
x[, "state"] <- factor(x[, "state"], labels = c(1, 0))
})
# make factors; list factor variables here
factor_covars <- c("state", "TcBlac2","BioDadInHH2","HomeOwnd", "PmBlac2",
"PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
"RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
data <- lapply(data, function(x) {
x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
x })
```
Having read in their data, users should now have have assigned to `data` wide, complete data as: a data frame, mice object, or list of imputed data frames for use with the *devMSMs* functions.