Missing Value Imputation with Data Augmentation in R

Incomplete data is a problem that Data Scientists face every day. The most common practices range from deleting observations with missing values, to substituting a fixed value, to imputing a simple statistic such as the mean or median. Because these approaches are limited in how well they capture the structure of the data, scientists have developed more sophisticated methods.
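For context, here is a minimal sketch of the two simplest approaches mentioned above, listwise deletion and mean imputation, applied to a hypothetical data frame (the names df, x and y are illustrative only):

# Hypothetical toy data with missing values
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))

# Complete-case analysis: drop every row that contains a missing value
complete_cases <- na.omit(df)

# Mean imputation: replace each NA with the column mean
df_mean <- df
df_mean[] <- lapply(df, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})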

Data Augmentation

The Data Augmentation (DA) algorithm was developed during the 1980s and 1990s and became one of the most popular methods in the missing-data literature. The idea of DA builds on the Expectation Maximization (EM) approach for dealing with intractable likelihood functions. When the observed-data likelihood is difficult to optimize, the solution consists of adding unobserved (latent) variables so that Maximum Likelihood Estimation becomes feasible. Specifically, if L(θ|Y) is the likelihood of the observed vector of variables Y, we can introduce a latent variable Z so that the likelihood L(θ|Y,Z) becomes easy to optimize.
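To make the latent-variable idea concrete, here is a minimal sketch of EM for a two-component normal mixture: the observed-data likelihood is awkward to maximize directly, but once the latent class indicators Z are brought in, the complete-data estimates reduce to simple weighted averages. The data and starting values below are invented purely for illustration.

# Toy data from a two-component mixture (illustrative only)
set.seed(1)
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

# Starting values for theta = (pi, mu1, mu2, sd1, sd2)
p <- 0.5; mu <- c(-1, 5); s <- c(1, 1)

for (iter in 1:50) {
  # E-step: expected latent memberships Z given the current theta
  d1 <- p       * dnorm(y, mu[1], s[1])
  d2 <- (1 - p) * dnorm(y, mu[2], s[2])
  z  <- d1 / (d1 + d2)

  # M-step: with Z treated as known, the complete-data MLEs are weighted averages
  p     <- mean(z)
  mu[1] <- sum(z * y) / sum(z)
  mu[2] <- sum((1 - z) * y) / sum(1 - z)
  s[1]  <- sqrt(sum(z * (y - mu[1])^2) / sum(z))
  s[2]  <- sqrt(sum((1 - z) * (y - mu[2])^2) / sum(1 - z))
}
round(c(pi = p, mu1 = mu[1], mu2 = mu[2]), 2)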

Numerical Integration

With the development of Bayesian computational methods, numerical integration started to play an important role in obtaining posterior moments and marginal distributions of parameters from the joint posterior. From this work, the Markov Chain Monte Carlo (MCMC) idea emerged as a way to simulate from the distributions of θ and Z. Data Augmentation introduces an auxiliary variable Z and builds an MCMC scheme that iteratively samples θ and Z from their conditional distributions. Specifically:

  1. Choose a starting point θ(0).
  2. Draw a value of Z from its conditional distribution (the Imputation step):

    Z(t+1) ~ P(Z | Y, θ(t))

  3. Draw a new value of θ from its conditional posterior distribution (the Posterior step):

    θ(t+1) ~ P(θ | Y, Z(t+1))

Iterating these two steps creates a Markov chain {(θ(t), Z(t)): t = 1, 2, …} whose subsequences converge to P(θ | Y) and P(Z | Y), respectively. The utility of the DA algorithm lies in the flexibility to choose an appropriate P(Z | θ) such that sampling from P(θ | Y, Z) and P(Z | Y, θ) is easier than sampling directly from P(θ | Y).
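As an illustration of these two steps, here is a minimal sketch of the DA sampler for a univariate normal sample with missing entries, under the standard noninformative prior p(μ, σ²) ∝ 1/σ². The toy data and variable names are assumptions made for this illustration and are not part of the 'norm' package used below.

# Toy data: a normal sample with some values made missing (illustrative only)
set.seed(42)
y <- rnorm(50, mean = 10, sd = 2)
y[1:10] <- NA
obs <- !is.na(y)
n <- length(y)

# Starting point theta(0) from the observed values
mu <- mean(y[obs]); sig2 <- var(y[obs])

draws <- matrix(NA, nrow = 2000, ncol = 2, dimnames = list(NULL, c("mu", "sig2")))

for (t in 1:2000) {
  # I-step: Z(t+1) ~ P(Z | Y, theta(t)) -- impute the missing values
  y[!obs] <- rnorm(sum(!obs), mean = mu, sd = sqrt(sig2))

  # P-step: theta(t+1) ~ P(theta | Y, Z(t+1)) -- draw from the complete-data posterior
  sig2 <- sum((y - mean(y))^2) / rchisq(1, df = n - 1)
  mu   <- rnorm(1, mean = mean(y), sd = sqrt(sig2 / n))

  draws[t, ] <- c(mu, sig2)
}

colMeans(draws[-(1:500), ])   # posterior means after discarding a burn-in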

We can implement Data Augmentation in R using the 'norm' package. The following code fills in missing values in a data table by generating values from the Markov chain after it has converged to the conditional distributions.

The Code

# Data Augmentation using the 'norm' package
rm(list = ls())
library(norm)
library(graphics)

# Read the data and coerce it to a numeric matrix
Data <- data.matrix(read.csv(file = "test.csv"))

# Prepare the data and find an MLE-based starting value with a few EM steps
transfData <- prelim.norm(Data)
initialtheta <- em.norm(transfData, maxits = 2, criterion = 0.01)
## Iterations of EM:
## 1...2...

# Store the packed parameter vector of the chain after 1, 2, ..., 1000 steps
y <- matrix(0, nrow = length(initialtheta), ncol = 1000)

for (i in 1:1000) {
  rngseed(1234)   # reset the seed so each run retraces the same chain
  theta <- da.norm(transfData, initialtheta, steps = i,
                   showits = FALSE, return.ymis = TRUE)   # take i steps
  y[, i] <- theta$parameter
}

# Fill in the missing values using the parameters from the 1000-step run
Data1 <- imp.norm(transfData, theta$parameter, Data)

# Compare the parameter vector after 1 iteration and after 1000 iterations
plot(1:nrow(y), y[, 1], col = "red", ylim = range(y[, c(1, 1000)]),
     xlab = "parameter index", ylab = "parameter value")
points(1:nrow(y), y[, 1000], col = "blue")

# Heatmaps of the original data (with NAs) and the imputed data
heatmap(Data, Rowv = NULL, Colv = NULL, col = rainbow(10), hclustfun = hclust,
        symm = FALSE, na.rm = TRUE, labRow = NULL, labCol = NULL,
        main = NULL, xlab = NULL, ylab = NULL)

heatmap(Data1, Rowv = NULL, Colv = NULL, col = rainbow(10), hclustfun = hclust,
        symm = FALSE, na.rm = TRUE, labRow = NULL, labCol = NULL,
        main = NULL, xlab = NULL, ylab = NULL)
