
How to impute missing data in a data matrix using SVD

Elisa Guma edited this page Jan 13, 2020 · 1 revision

Missing data arise in many experiments for a variety of reasons. This can be a problem depending on the statistical test you would like to perform, because not all tests can handle missing data.

Thankfully there are a few simple methods we can use to impute these missing values. Imputation in this case means replacing missing values with substituted values. One way to do this is with Singular Value Decomposition.

The SVDmiss function in the SpatioTemporal R package (documented as SVD.miss in version 0.9.2 of the package: https://www.rdocumentation.org/packages/SpatioTemporal/versions/0.9.2/topics/SVD.miss) completes a data matrix using iterative SVD. It computes the SVD of the matrix and replaces the missing values by linear regression of the columns onto the first SVD components. The function will fail if entire rows and/or columns are missing from the data matrix.
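To get a feel for what "iterative SVD" means, here is a minimal base R sketch. It is a simplified variant, not the SpatioTemporal implementation: it fills missing entries with the column means and then repeatedly overwrites them with the values from a low-rank SVD reconstruction, rather than using the column-wise regression described above. The function name impute_svd_sketch, its arguments, and the toy matrix are all made up for illustration.

```r
# Simplified iterative SVD imputation (illustrative only; NOT the package's algorithm)
impute_svd_sketch <- function(X, ncomp = 2, niter = 100, tol = 1e-3) {
  miss <- is.na(X)
  # Initial guess: replace each missing entry with the mean of its column
  col_means <- colMeans(X, na.rm = TRUE)
  Xfill <- X
  Xfill[miss] <- col_means[col(X)[miss]]
  for (i in seq_len(niter)) {
    s <- svd(Xfill)
    k <- seq_len(min(ncomp, length(s$d)))
    # Low-rank reconstruction from the leading components
    Xhat <- s$u[, k, drop = FALSE] %*%
      diag(s$d[k], nrow = length(k)) %*%
      t(s$v[, k, drop = FALSE])
    # Only the missing entries are updated; observed values are never changed
    Xnew <- Xfill
    Xnew[miss] <- Xhat[miss]
    # Relative change between two consecutive iterations
    reldiff <- sqrt(sum((Xnew - Xfill)^2)) / sqrt(sum(Xfill^2))
    Xfill <- Xnew
    if (reldiff < tol) break
  }
  Xfill
}

# Toy example: a rank-1 matrix with one value removed
X <- outer(1:5, c(1, 2, 3))
X[2, 3] <- NA
filled <- impute_svd_sketch(X, ncomp = 1)
print(filled)
```

For real data, use SVDmiss itself; the sketch is only meant to show the fill, decompose, refill loop and the role of the convergence threshold.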

To use this in R, you will need to load the following libraries:

library(RMINC) 
library(SpatioTemporal)

Next, you can load in your dataframe, with all missing data coded as "NA". Here's an example:

filtered.data=read.csv("/path_to/data_for_imputation.csv",na.strings = c("NA",""))
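If you want to confirm that both "NA" strings and blank cells are read as missing, you can try the same na.strings setting on a small inline example and count the missing values per column. The CSV text and column names below are made up for illustration.

```r
# Toy CSV text (made-up values) standing in for data_for_imputation.csv
toy.csv <- "id,roi1,roi2\ns1,1.5,\ns2,NA,3.2\ns3,2.1,4.0"
toy.data <- read.csv(text = toy.csv, na.strings = c("NA", ""))

# Number of missing values in each column
missing.per.column <- colSums(is.na(toy.data))
print(missing.per.column)
```

Running the same colSums(is.na(...)) line on your own dataframe is a quick way to see how much imputation you are asking for.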

The SVDmiss function only works on a numeric matrix, so you need to separate the demographics information from the numeric columns of your dataframe. You then convert the numeric dataframe to a matrix. Here is an example (adjust the column ranges to match your own data):

filtered.data.only = filtered.data[,22:109] ##columns containing data
subject.info = filtered.data[,1:21] ##columns containing demographics information (i.e. letters)
filtered.data.matrix=as.matrix(filtered.data.only) ##converting dataframe to a matrix
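Because the function fails when an entire row or column is missing, it can help to check for that before running it. A minimal sketch, using a made-up toy matrix in place of filtered.data.matrix:

```r
# Toy stand-in for filtered.data.matrix (hypothetical values)
m <- matrix(c(1, NA, 3,
              NA, NA, NA,
              7, 8, 9), nrow = 3)

# The matrix must be numeric; as.matrix() on mixed columns gives character instead
stopifnot(is.numeric(m))

# Rows and columns that contain no observed values at all
empty.rows <- which(rowSums(!is.na(m)) == 0)
empty.cols <- which(colSums(!is.na(m)) == 0)
print(empty.rows)
print(empty.cols)

# One option is to drop them before imputation
m.clean <- m[rowSums(!is.na(m)) > 0, colSums(!is.na(m)) > 0, drop = FALSE]
```

If you drop rows this way, remember to drop the same rows from subject.info so the two stay aligned.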

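Now you are ready to run the SVDmiss function as follows. The conv.reldiff and ncomp values below should match the package defaults, but niter = Inf may not (check ?SVDmiss for your installed version). You can adjust all three based on what works best for your type of data: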

imputed.data=SVDmiss(filtered.data.matrix, niter = Inf, conv.reldiff = 0.001, ncomp = 4)

filtered.data.matrix is the data matrix (with missing values marked by NA).

niter is the maximum number of iterations run before exiting. Setting it to Inf will run it until there is convergence.

conv.reldiff is the convergence threshold. The iterative procedure is considered to have converged when the relative difference between two consecutive iterations is less than the value assigned here.

ncomp is the number of SVD components to use in the reconstruction.

To get a quick summary of the completed matrix you can run:

summary(imputed.data$Xfill)

Xfill is the completed data matrix, with the missing values replaced by fitted values. It can now be saved as a new dataframe on which you can run statistical tests.
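One way to save the result is to bind the completed matrix back onto the demographics columns and write it out as a CSV. The sketch below uses small made-up stand-ins for subject.info and imputed.data$Xfill, and writes to a temporary file, so the object names ending in .toy and the output path are assumptions.

```r
# Hypothetical stand-ins for subject.info and imputed.data$Xfill
subject.info.toy <- data.frame(id = c("s1", "s2"), sex = c("F", "M"))
xfill.toy <- matrix(c(1.0, 2.0, 3.0, 4.0), nrow = 2,
                    dimnames = list(NULL, c("roi1", "roi2")))

# Recombine demographics and imputed values (row order must be unchanged)
imputed.df <- cbind(subject.info.toy, as.data.frame(xfill.toy))

# Write to a temporary file here; use your own path in practice
out.file <- tempfile(fileext = ".csv")
write.csv(imputed.df, out.file, row.names = FALSE)
```

If the completed matrix comes back without column names, you can restore them with colnames(imputed.data$Xfill) = colnames(filtered.data.matrix) before binding.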

Other outputs you may be interested in:

svd is the result of SVD on the new data matrix, i.e. svd(Xfill)
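The singular values in an SVD result can give a rough idea of how many components carry most of the signal, which may help when choosing ncomp. The sketch below uses base R's svd on a made-up matrix; assuming the svd output described above has the same structure (d, u, v), the same lines should apply to it.

```r
# Toy matrix (made-up values) standing in for the completed data matrix
toy <- outer(1:4, 1:3) + diag(0.1, 4, 3)
toy.svd <- svd(toy)

# Share of the total sum of squares captured by each component
share <- toy.svd$d^2 / sum(toy.svd$d^2)
print(round(share, 3))
```

Note that this is computed on uncentered data, so it is a share of the total sum of squares rather than variance explained in the PCA sense.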
