Title: | Maximum Likelihood Multiple Imputation |
---|---|
Description: | Implements so called Maximum Likelihood Multiple Imputation as described by von Hippel and Bartlett (2021) <doi:10.1214/20-STS793>. A number of different imputations are available, by utilising the 'norm', 'cat' and 'mix' packages. Inferences can be performed either using combination rules similar to Rubin's or using a likelihood score based approach based on theory by Wang and Robins (1998) <doi:10.1093/biomet/85.4.935>. |
Authors: | Jonathan Bartlett |
Maintainer: | Jonathan Bartlett <[email protected]> |
License: | GPL-3 |
Version: | 1.1.2 |
Built: | 2024-10-25 03:16:31 UTC |
Source: | https://github.com/jwb133/mlmi |
This function performs multiple imputation under a log-linear model
as described by Schafer (1997), using his cat
package, either with
or without posterior draws.
catImp( obsData, M = 10, pd = FALSE, type = 1, margins = NULL, steps = 100, rseed )
catImp( obsData, M = 10, pd = FALSE, type = 1, margins = NULL, steps = 100, rseed )
obsData |
The data frame to be imputed. Variables must be coded such that they take consecutive positive integer values, i.e. 1,2,3,... |
M |
Number of imputations to generate. |
pd |
Specify whether to use posterior draws ( |
type |
An integer specifying what type of log-linear model to impute using.
|
margins |
An optional argument that can be used instead of |
steps |
If |
rseed |
The value to set the |
By default catImp
will impute using a log-linear model allowing for all two-way
associations, but not higher order associations. This can be modified through
use of the type
and margins
arguments.
With pd=FALSE
, all imputed datasets are generated conditional on the MLE
of the model parameter, referred to as maximum likelihood multiple imputation
by von Hippel and Bartlett (2021).
With pd=TRUE
, regular 'proper' multiple imputation
is used, where each imputation is drawn from a distinct value of the model
parameter. Specifically, for each imputation, a single MCMC chain is run,
iterating for steps
iterations.
Imputed datasets can be analysed using withinBetween
,
scoreBased
, or for example the
bootImpute package.
A list of imputed datasets, or if M=1
, just the imputed data frame.
Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a partially observed categorical dataset set.seed(1234) n <- 100 #for simplicity we simulate completely independent variables temp <- data.frame(x1=ceiling(3*runif(n)), x2=ceiling(2*runif(n)), x3=ceiling(2*runif(n))) #make some data missing for (i in 1:3) { temp[(runif(n)<0.25),i] <- NA } #impute using catImp, assuming two-way associations in the log-linear model imps <- catImp(temp, M=10, pd=FALSE, rseed=4423) #impute assuming a saturated log-linear model imps <- catImp(temp, M=10, pd=FALSE, type=3, rseed=4423)
#simulate a partially observed categorical dataset set.seed(1234) n <- 100 #for simplicity we simulate completely independent variables temp <- data.frame(x1=ceiling(3*runif(n)), x2=ceiling(2*runif(n)), x3=ceiling(2*runif(n))) #make some data missing for (i in 1:3) { temp[(runif(n)<0.25),i] <- NA } #impute using catImp, assuming two-way associations in the log-linear model imps <- catImp(temp, M=10, pd=FALSE, rseed=4423) #impute assuming a saturated log-linear model imps <- catImp(temp, M=10, pd=FALSE, type=3, rseed=4423)
A dataset in the wide form containing simulated data with a repeatedly measured outcome. Some outcome values are missing. The missing data pattern is monotone. There are two baseline covariates.
ctsTrialWide
ctsTrialWide
A data frame with 500 rows and 7 variables:
ID for individual
A numeric 0/1 variable indicating control or active treatment group
A baseline covariate
Baseline measurement of the outcome variable
Outcome measurement at visit 1
Outcome measurement at visit 2
Outcome measurement at visit 3
This function performs multiple imputation under a general location model
as described by Schafer (1997), using the mix
package. Imputation can
either be performed using posterior draws (pd=TRUE
) or conditonal on the maximum likelihood
estimate of the model parameters (pd=FALSE
), referred to as maximum likelihood
multiple imputation by von Hippel and Bartlett (2021).
mixImp( obsData, nCat, M = 10, pd = FALSE, marginsType = 1, margins = NULL, designType = 1, design = NULL, steps = 100, rseed )
mixImp( obsData, nCat, M = 10, pd = FALSE, marginsType = 1, margins = NULL, designType = 1, design = NULL, steps = 100, rseed )
obsData |
The data frame to be imputed. The categorical variables must be
in the first |
nCat |
The number of categorical variables in |
M |
Number of imputations to generate. |
pd |
Specify whether to use posterior draws ( |
marginsType |
An integer specifying what type of log-linear model to use for the
categorical variables. |
margins |
If |
designType |
An integer specifying how the continuous variables' means should depend
on the categorical variables. |
design |
If |
steps |
If |
rseed |
The value to set the |
See the descriptions for marginsType
, margins
, designType
, design
and the documentation
in ecm.mix
for details about how to specify the model.
Imputed datasets can be analysed using withinBetween
,
scoreBased
, or for example the
bootImpute package.
A list of imputed datasets, or if M=1
, just the imputed data frame.
Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a partially observed dataset with a mixture of categorical and continuous variables set.seed(1234) n <- 100 #for simplicity we simulate completely independent categorical variables x1 <- ceiling(3*runif(n)) x2 <- ceiling(2*runif(n)) x3 <- ceiling(2*runif(n)) y <- 1+0.5*(x1==2)+1.5*(x1==3)+x2+x3+rnorm(n) temp <- data.frame(x1=x1,x2=x2,x3=x3,y=y) #make some data missing in all variables for (i in 1:4) { temp[(runif(n)<0.25),i] <- NA } #impute conditional on MLE, assuming two-way associations in the log-linear model #and main effects of categorical variables on continuous one (the default) imps <- mixImp(temp, nCat=3, M=10, pd=FALSE, rseed=4423)
#simulate a partially observed dataset with a mixture of categorical and continuous variables set.seed(1234) n <- 100 #for simplicity we simulate completely independent categorical variables x1 <- ceiling(3*runif(n)) x2 <- ceiling(2*runif(n)) x3 <- ceiling(2*runif(n)) y <- 1+0.5*(x1==2)+1.5*(x1==3)+x2+x3+rnorm(n) temp <- data.frame(x1=x1,x2=x2,x3=x3,y=y) #make some data missing in all variables for (i in 1:4) { temp[(runif(n)<0.25),i] <- NA } #impute conditional on MLE, assuming two-way associations in the log-linear model #and main effects of categorical variables on continuous one (the default) imps <- mixImp(temp, nCat=3, M=10, pd=FALSE, rseed=4423)
This function performs multiple imputation under a multivariate normal model
as described by Schafer (1997), using his norm
package, either with
or without posterior draws.
normImp(obsData, M = 10, pd = FALSE, steps = 100, rseed)
normImp(obsData, M = 10, pd = FALSE, steps = 100, rseed)
obsData |
The data frame to be imputed. |
M |
Number of imputations to generate. |
pd |
Specify whether to use posterior draws ( |
steps |
If |
rseed |
The value to set the |
This function imputes from a multivariate normal model with unstructured covariance
matrix, as described by Schafer (1997). With pd=FALSE
, all imputed datasets
are generated conditional on the MLE of the model parameter, referred to as maximum
likelihood multiple imputation by von Hippel and Bartlett (2021).
With pd=TRUE
, regular 'proper' multiple imputation
is used, where each imputation is drawn from a distinct value of the model
parameter. Specifically, for each imputation, a single MCMC chain is run,
iterating for steps
iterations.
Imputed datasets can be analysed using withinBetween
,
scoreBased
, or for example the
bootImpute package.
A list of imputed datasets, or if M=1
, just the imputed data frame.
Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a partially observed dataset from multivariate normal distribution set.seed(1234) n <- 100 temp <- MASS::mvrnorm(n=n,mu=rep(0,4),Sigma=diag(4)) #make some values missing for (i in 1:4) { temp[(runif(n)<0.25),i] <- NA } #impute using normImp imps <- normImp(data.frame(temp), M=10, pd=FALSE, rseed=4423)
#simulate a partially observed dataset from multivariate normal distribution set.seed(1234) n <- 100 temp <- MASS::mvrnorm(n=n,mu=rep(0,4),Sigma=diag(4)) #make some values missing for (i in 1:4) { temp[(runif(n)<0.25),i] <- NA } #impute using normImp imps <- normImp(data.frame(temp), M=10, pd=FALSE, rseed=4423)
Performs multiple imputation of a single continuous variable using a normal
linear regression model. The covariates in the imputation model must be fully
observed. By default normUniImp
imputes every dataset using the
maximum likelihood estimates of the imputation model parameters, which here
coincides with the OLS estimates, referred to as maximum likelihood multiple
imputation by von Hippel and Bartlett (2021). If pd=TRUE
is specified, it instead
performs posterior draw Bayesian imputation.
normUniImp(obsData, impFormula, M = 5, pd = FALSE)
normUniImp(obsData, impFormula, M = 5, pd = FALSE)
obsData |
The data frame to be imputed. |
impFormula |
The linear model formula. |
M |
Number of imputations to generate. |
pd |
Specify whether to use posterior draws ( |
Imputed datasets can be analysed using withinBetween
,
scoreBased
, or for example the
bootImpute package.
A list of imputed datasets, or if M=1
, just the imputed data frame.
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a dataset with one partially observed (conditionally) normal variable set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) x[runif(n)<0.25] <- NA temp <- data.frame(x=x,y=y) #impute using normImp imps <- normUniImp(temp, y~x, M=10, pd=FALSE)
#simulate a dataset with one partially observed (conditionally) normal variable set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) x[runif(n)<0.25] <- NA temp <- data.frame(x=x,y=y) #impute using normImp imps <- normUniImp(temp, y~x, M=10, pd=FALSE)
Performs multiple imputation of a repeatedly measured continuous endpoint in a randomised clinical trial using reference based imputation as proposed by doi:10.1080/10543406.2013.834911Carpenter et al (2013). This approach can be used for imputation of missing data in randomised clinical trials.
refBasedCts( obsData, outcomeVarStem, nVisits, trtVar, baselineVars = NULL, baselineVisitInt = TRUE, type = "MAR", M = 5 )
refBasedCts( obsData, outcomeVarStem, nVisits, trtVar, baselineVars = NULL, baselineVisitInt = TRUE, type = "MAR", M = 5 )
obsData |
The data frame to be imputed. |
outcomeVarStem |
String for stem of outcome variable name, e.g. y if y1, y2, y3 are the outcome columns |
nVisits |
The integer number of visits (not including baseline) |
trtVar |
The string variable name of the randomised treatment group variable. The reference arm is assumed
to correspond to |
baselineVars |
A string or vector of strings specfying the baseline variables. Often this will include the baseline measurement of the outcome |
baselineVisitInt |
TRUE/FALSE indicating whether to allow for interactions between each baseline variable and visit. Default is TRUE. |
type |
A string specifying imputation type to use. Valid options are "MAR", "J2R" |
M |
Number of imputations to generate. |
Unlike most implementations of reference based imputation, this implementation imputes conditional on the maximum likelihood estimates of the model parameters, rather than a posterior draw. If one is interested in frequentist valid inferences, this is ok provided the bootstrapping used, for example with using the bootImpute package.
Intermediate missing values are imputed assuming MAR, based on the mixed model fit to that patient's treatment arm. Monotone missing values are imputed using the specified imputation type.
Baseline covariates must be numeric variables. If you have factor variables you must code these into suitable dummy indicators and pass these to the function.
A list of imputed datasets, or if M=1
, just the imputed data frame.
Carpenter JR, Roger JH, Kenward MG. Analysis of Longitudinal Trials with Protocol Deviation: A Framework for Relevant, Accessible Assumptions, and Inference via Multiple Imputation. (2013) 23(6) 1352-1371
von Hippel PT & Bartlett JW (2019) Maximum likelihood multiple imputation: Faster imputations and consistent standard errors without posterior draws arXiv:1210.0870v10.
#take a look at ctsTrialWide data head(ctsTrialWide) #impute the missing outcome values twice assuming MAR imps <- refBasedCts(ctsTrialWide, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="MAR", M=2) #now impute using jump to reference method imps <- refBasedCts(ctsTrialWide, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="J2R", M=2) #for frequentist valid inferences we use bootstrapping from the bootImpute package ## Not run: #bootstrap 10 times using 2 imputations per bootstrap. Note that to do this #we specify nImp=2 to bootImpute by M=1 to the refBasedCts function. #Also, 10 bootstraps is far too small to get reliable inferences. To do this #for real you would want to use a lot more (e.g. at least nBoot=1000). library(bootImpute) bootImps <- bootImpute(ctsTrialWide, refBasedCts, nBoot=10, nImp=2, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="J2R", M=1) #write a small wrapper function to perform an ANCOVA at the final time point ancova <- function(inputData) { coef(lm(y3~v+y0+trt, data=inputData)) } ests <- bootImputeAnalyse(bootImps, ancova) ests ## End(Not run)
#take a look at ctsTrialWide data head(ctsTrialWide) #impute the missing outcome values twice assuming MAR imps <- refBasedCts(ctsTrialWide, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="MAR", M=2) #now impute using jump to reference method imps <- refBasedCts(ctsTrialWide, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="J2R", M=2) #for frequentist valid inferences we use bootstrapping from the bootImpute package ## Not run: #bootstrap 10 times using 2 imputations per bootstrap. Note that to do this #we specify nImp=2 to bootImpute by M=1 to the refBasedCts function. #Also, 10 bootstraps is far too small to get reliable inferences. To do this #for real you would want to use a lot more (e.g. at least nBoot=1000). library(bootImpute) bootImps <- bootImpute(ctsTrialWide, refBasedCts, nBoot=10, nImp=2, outcomeVarStem="y", nVisits=3, trtVar="trt", baselineVars=c("v", "y0"), type="J2R", M=1) #write a small wrapper function to perform an ANCOVA at the final time point ancova <- function(inputData) { coef(lm(y3~v+y0+trt, data=inputData)) } ests <- bootImputeAnalyse(bootImps, ancova) ests ## End(Not run)
This function implements the score based variance estimation approach described by von Hippel and Bartlett (2021), which is based on earlier work by Wang and Robins (1998).
scoreBased(imps, analysisFun, scoreFun, pd = NULL, dfComplete = NULL, ...)
scoreBased(imps, analysisFun, scoreFun, pd = NULL, dfComplete = NULL, ...)
imps |
A list of imputed datasets produced by one of the imputation functions
in |
analysisFun |
A function to analyse the imputed datasets that when applied to
a dataset returns a list containing a vector |
scoreFun |
A function whose first argument is a dataset and whose second argument is a vector of parameter values. It should return a matrix of subject level scores evaluated at the parameter value passed to it. |
pd |
If |
dfComplete |
The complete data degrees of freedom. If |
... |
Other parameters that are to be passed through to |
A list containing the overall parameter estimates, its corresponding covariance matrix, and degrees of freedom for each parameter.
Wang N., Robins J.M. (1998) Large-sample theory for parametric multiple imputation procedures. Biometrika 85(4): 935-948. doi:10.1093/biomet/85.4.935.
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a partially observed dataset set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) y[1:50] <- NA temp <- data.frame(x,y) #impute using normUniImp, without posterior draws imps <- normUniImp(temp, y~x, M=10, pd=FALSE) #define a function which performs our desired analysis on a dataset, returning #the parameter estimates yonx <- function(inputData) { fitmod <- lm(y~x, data=inputData) list(est=c(fitmod$coef,sigma(fitmod)^2)) } #define a function which when passed a dataset and parameter #vector, calculates the likelihood score vector myScore <- function(inputData, parm) { beta0 <- parm[1] beta1 <- parm[2] sigmasq <- parm[3] res <- inputData$y - beta0 - beta1*inputData$x cbind(res/sigmasq, (res*inputData$x)/sigmasq, res^2/(2*sigmasq^2)-1/(2*sigmasq)) } #call scoreBased to perform variance estimation scoreBased(imps, analysisFun=yonx, scoreFun=myScore)
#simulate a partially observed dataset set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) y[1:50] <- NA temp <- data.frame(x,y) #impute using normUniImp, without posterior draws imps <- normUniImp(temp, y~x, M=10, pd=FALSE) #define a function which performs our desired analysis on a dataset, returning #the parameter estimates yonx <- function(inputData) { fitmod <- lm(y~x, data=inputData) list(est=c(fitmod$coef,sigma(fitmod)^2)) } #define a function which when passed a dataset and parameter #vector, calculates the likelihood score vector myScore <- function(inputData, parm) { beta0 <- parm[1] beta1 <- parm[2] sigmasq <- parm[3] res <- inputData$y - beta0 - beta1*inputData$x cbind(res/sigmasq, (res*inputData$x)/sigmasq, res^2/(2*sigmasq^2)-1/(2*sigmasq)) } #call scoreBased to perform variance estimation scoreBased(imps, analysisFun=yonx, scoreFun=myScore)
This function implements the within-between variance estimation approach. If the imputations were generated using posterior draws, it implements the approach proposed by Barnard & Rubin (1999). If posterior draws were not used, it implements the WB approach described by von Hippel and Bartlett (2021).
withinBetween(imps, analysisFun, pd = NULL, dfComplete = NULL, ...)
withinBetween(imps, analysisFun, pd = NULL, dfComplete = NULL, ...)
imps |
A list of imputed datasets produced by one of the imputation functions
in |
analysisFun |
A function to analyse the imputed datasets that when applied to
a dataset returns a list containing a vector |
pd |
If |
dfComplete |
The complete data degrees of freedom. If |
... |
Other parameters that are to be passed through to |
A list containing the overall parameter estimates, its corresponding covariance matrix, and degrees of freedom for each parameter.
Barnard J, Rubin DB. Miscellanea. Small-sample degrees of freedom with multiple imputation. Biometrika 1999; 86(4): 948-955. doi:10.1093/biomet/86.4.948
von Hippel P.T. and Bartlett J.W. Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science 2021; 36(3) 400-420 doi:10.1214/20-STS793.
#simulate a partially observed dataset set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) y[1:50] <- NA temp <- data.frame(x,y) #impute using normImp imps <- normImp(temp, M=100, pd=TRUE, rseed=4423) #define a function which analyses a dataset using our desired #analysis model, returning the estimated parameters and their #corresponding variance covariance matrix analysisFun <- function(inputData) { mod <- lm(y~x, data=inputData) list(est=coef(mod), var=vcov(mod)) } withinBetween(imps,analysisFun, dfComplete=c(n-2,n-2))
#simulate a partially observed dataset set.seed(1234) n <- 100 x <- rnorm(n) y <- x+rnorm(n) y[1:50] <- NA temp <- data.frame(x,y) #impute using normImp imps <- normImp(temp, M=100, pd=TRUE, rseed=4423) #define a function which analyses a dataset using our desired #analysis model, returning the estimated parameters and their #corresponding variance covariance matrix analysisFun <- function(inputData) { mod <- lm(y~x, data=inputData) list(est=coef(mod), var=vcov(mod)) } withinBetween(imps,analysisFun, dfComplete=c(n-2,n-2))