European Survey Research Association
 
ESRA2009

Warsaw 2009: Presentations and short courses


When Normality Fails: a Multiple Imputation Technique Based on the Cumulative Distribution

Session: Strategies for Nonresponse Adjustments (II)

Author:

  • Bryce Weaver; FORS - Swiss Foundation for Research in Social Sciences, Switzerland

Abstract:

Objectives. One drawback of multiple imputation is that it generally relies on assumptions about the form of the distribution (e.g., normality), which may often be invalid. In our work, we develop a multiple imputation technique that makes no a priori assumptions about the form of the distribution. The tool is intended for imputing data under the standard assumption that values are missing at random (missingness may depend on other variables, but not on the value of the variable to be imputed). It is designed to reproduce the cumulative distribution of any variable, continuous or dichotomous. The conceptual framework is implemented as a simple tool in SAS. Empirically, we apply the new method to wave 8 (2007) of the Swiss Household Panel (SHP) and compare the resulting cumulative distribution with those obtained from the standard PROC MI procedures in SAS.

Method. The entire sample (missing and non-missing cases) is divided into imputation classes based on up to five categorical variables; classes can be grouped to maintain a sufficient number of non-missing observations. The researcher selects these variables as important indicators of the values of the variable(s) to be imputed. First, within each imputation class, a cumulative distribution of the variable to be imputed is built from the non-missing cases by linearly interpolating between the observed points, which allows for any form of distribution. Then, for each missing observation, a number between zero and one is drawn at random from a uniform distribution, and the imputed value is the value at which the cumulative distribution equals that number. This last step is repeated multiple times. The SAS program is flexible: the user chooses (i) the variables to be imputed, (ii) the categorical variables used to create the imputation classes, (iii) the number of imputed files to be produced, and (iv) whether the variable(s) to be imputed are continuous or discrete.
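The steps in the Method paragraph can be sketched in Python as follows. This is a minimal illustration, not the authors' SAS program: the function name `impute_from_cdf`, its parameters, and the use of NumPy are all assumptions made for the example. In particular, the abstract does not say which probabilities are attached to the observed points, so the sketch assumes the sorted observed values are spaced evenly between 0 and 1, and it assumes that discrete variables are drawn from the observed values without interpolation.

```python
import numpy as np

def impute_from_cdf(values, classes, n_imputations=5, discrete=False, seed=0):
    """Return a list of `n_imputations` completed copies of `values`.

    values  : 1-D array of floats, with np.nan marking missing entries
    classes : 1-D array of imputation-class labels, same length as `values`
              (several categorical variables can be combined into one label)
    """
    values = np.asarray(values, dtype=float)
    classes = np.asarray(classes)
    rng = np.random.default_rng(seed)
    completed = []
    for _ in range(n_imputations):
        filled = values.copy()
        for label in np.unique(classes):
            in_class = classes == label
            missing = in_class & np.isnan(values)
            if not missing.any():
                continue
            observed = np.sort(values[in_class & ~np.isnan(values)])
            if observed.size == 0:
                raise ValueError(f"class {label!r} has no observed values")
            # One uniform draw on [0, 1) per missing observation.
            u = rng.uniform(0.0, 1.0, size=missing.sum())
            if discrete or observed.size == 1:
                # Assumed handling of discrete variables: step CDF, so every
                # imputed value is one of the observed values.
                idx = np.minimum((u * observed.size).astype(int),
                                 observed.size - 1)
                filled[missing] = observed[idx]
            else:
                # Assumed plotting positions: sorted observed values evenly
                # spaced on [0, 1]; np.interp inverts the interpolated CDF.
                probs = np.linspace(0.0, 1.0, observed.size)
                filled[missing] = np.interp(u, probs, observed)
        completed.append(filled)
    return completed
```

Calling `impute_from_cdf(income, class_labels, n_imputations=5)` on hypothetical arrays `income` and `class_labels` would return five completed copies of `income`, each with imputed values that lie within the observed range of the corresponding class.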

Results. Using SHP wave 8 (treated as cross-sectional), which has 5794 observations, income is set to missing for nearly one third of the non-missing cases (n=1815). These values are then imputed using this procedure and some PROC MI procedures. The resulting distributions are compared with the actual distribution of the observations that had been set to missing. The technique described above reproduces the actual form of the distribution of the missing data better than the PROC MI procedures.
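The evaluation design described above (set known values to missing, impute them, and compare with the truth) can be sketched in Python on synthetic data. Everything here is an assumption made for illustration: the lognormal "income" is simulated rather than taken from the SHP, a single imputation class is used, the normal draws are only a stand-in for a normality-based method and are not PROC MI, and the maximum gap between empirical cumulative distributions is one possible comparison measure (the abstract does not state which one was used).

```python
import numpy as np

def max_cdf_gap(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(2009)

# Synthetic, right-skewed stand-in for an income variable (not SHP data).
income = rng.lognormal(mean=0.0, sigma=1.0, size=3000)

# Set roughly one third of the known values to missing, at random.
held_out = rng.random(income.size) < 1 / 3
observed = np.sort(income[~held_out])
truth = income[held_out]

# Imputation from the linearly interpolated CDF of the observed cases.
u = rng.uniform(size=truth.size)
cdf_draws = np.interp(u, np.linspace(0.0, 1.0, observed.size), observed)

# Stand-in for a normality-based imputation: draws from a fitted normal.
normal_draws = rng.normal(observed.mean(), observed.std(ddof=1),
                          size=truth.size)

gap_cdf = max_cdf_gap(cdf_draws, truth)
gap_normal = max_cdf_gap(normal_draws, truth)
print(f"interpolated CDF gap: {gap_cdf:.3f}, normal gap: {gap_normal:.3f}")
```

A smaller gap means the imputed values track the distribution of the held-out values more closely. The printed numbers describe this simulation only and say nothing about the SHP results reported in the abstract.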

Discussion. Advantages of the technique are that it works for any type of missing variable (categorical or continuous), it can reproduce any form of distribution, and it provides relatively accurate standard descriptive statistics. Future work could develop a firmer theoretical basis for producing smoother distributions, in place of the non-smooth distribution that results from linear interpolation.