08.01.2020 / Imputation/Missing Data

Missing values in collected data are not uncommon, and they can occur for several different reasons, some of which are unavoidable.

Performing a well-thought-out imputation can help to obtain reliable statistical results even with high rates of missing values. In this article we try to give a basic understanding of the topic by looking at what imputation means in the first place, what types of missing data exist, what pros and cons different methods have (by way of example), and what multiple imputation is.

As a bottom line, we summarize some advantages of imputation and point out when you could make use of this newly acquired knowledge. So let’s get started …

What is imputation?

To perform an imputation means to fill in missing values with random draws from an imputation model and then to fit an analysis model to the completed data.

Therefore, the point of imputation is not that the imputed values look like observed values. It is rather that the imputed variables should act like the observed variables when used in analyses.

Nevertheless, any kind of imputation introduces some sort of bias, as we cannot know what the ‘real’ data actually looks like. The best we can do is to get as close to the ‘real’ data as possible by minimizing that bias, and for that we need to choose the right methods.

To choose an appropriate method, and indeed to decide whether imputation makes sense at all, the first question to ask is: ‘What kind of missing data are we looking at?’ And yes, missing data does not equal missing data!

Types of missing data

At this point our statisticians would like to come up with lots of formulas and mathematical language, but for now we try to keep it simple. Generally speaking, there are three different types of missing data.

Missing completely at random (MCAR): A variable is missing completely at random when the probability of its ‘missingness’ depends neither on other (observed) variables nor on the variable itself. An example of this would be missing data for respondents whose questionnaire simply got lost in the mail.

Missing at random (MAR): A variable is missing at random when the probability of missing data on that variable is not related to the variable itself but to other measured variables. If, for example, men are less likely to answer a survey about depression, then missing data on the grade of depression will be related to the variable ‘gender’. Within each gender, however, the probability of a missing ‘grade of depression’ does not depend on the grade of depression itself.

Missing not at random (MNAR): A variable is missing not at random when the probability of its ‘missingness’ is related to the variable itself. A typical example of this is the variable ‘income’: the higher the income of a subject, the less likely it is that the related question gets answered.
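To make these mechanisms more tangible, here is a minimal sketch of how they could be simulated on toy data with NumPy and pandas; the variables ‘gender’ and ‘income’, the sample size and all probabilities are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "gender": rng.choice(["m", "f"], size=n),            # illustrative covariate
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),  # illustrative target variable
})

# MCAR: every value has the same 10% chance of being dropped.
income_mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: missingness depends only on the observed variable 'gender'.
p_mar = np.where(df["gender"] == "m", 0.20, 0.05)
income_mar = df["income"].mask(rng.random(n) < p_mar)

# MNAR: missingness depends on the (unobserved) income itself.
p_mnar = np.where(df["income"] > df["income"].quantile(0.8), 0.30, 0.05)
income_mnar = df["income"].mask(rng.random(n) < p_mnar)
```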

Once we know what we are dealing with in the first place, we can start to think about proper imputation methods.

Pros and cons for different methods

When trying to find the best imputation method to be applied to your data, both advantages and disadvantages of each potential method have to be considered.

As a basic example, we take a look at ‘imputation by mean’. This might work well on small datasets with numerical variables as it keeps the mean unbiased but gives poor results on encoded categorical variables. ‘Imputation by most frequent values’ on the other hand works great for categorical variables but may introduce even greater bias.
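As a minimal sketch, this is how the two simple strategies could look with scikit-learn’s SimpleImputer; the toy values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[23.0], [31.0], [np.nan], [45.0]])                        # numerical
colors = np.array([["red"], ["blue"], ["blue"], [np.nan]], dtype=object)   # categorical

# 'Imputation by mean': the gap is filled with 33.0, so the column mean is unchanged.
mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(ages))

# 'Imputation by most frequent value': the gap becomes "blue".
mode_imputer = SimpleImputer(strategy="most_frequent")
print(mode_imputer.fit_transform(colors))
```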

Needless to say, these two methods are rather simple; neither even factors in the correlations between variables, and they are therefore no longer recommended.

Modern imputation methods are (almost always) considerably better because they take more factors into account, which, however, also makes them more complex. ‘Imputation using k-NN (k-Nearest Neighbors)’, for example, can be much more accurate than the methods mentioned above. In this procedure, the k nearest neighbors of an observation with a missing value, identified from the other observed variables, are used for the imputation. The challenge here is to choose k: a low k increases the influence of noise and makes the results less generalizable, while a high k tends to blur the local effects that are exactly what we are looking for.
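A minimal sketch of this idea with scikit-learn’s KNNImputer; the small matrix and the choice of k = 2 are purely illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that feature over the k nearest rows,
# where distances are computed on the features observed in both rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```

Varying n_neighbors directly exposes the trade-off described above: small values follow local structure (and noise), large values smooth it away.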

Further imputation methods include regression imputation, the overarching process of multiple imputation, and methods based on machine learning and/or deep learning, just to name a few approaches.

The more complex a method, the harder its pros and cons are to grasp, at least without a closer introduction, which leads to another important criterion when searching for the best method: complexity itself!

The complexity of imputation methods can be both an advantage and a disadvantage. Even though approaches such as MICE (Multivariate Imputation by Chained Equations) perform extremely well when properly specified, they may be overkill depending on the data we want to impute and the analysis we want to perform. In such cases we can save time and resources by choosing a less complex method at a negligible loss in performance.
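As a rough sketch of such a workflow, scikit-learn’s (still experimental) IterativeImputer, which is inspired by MICE, can be combined with repeated random draws and a simple pooling of the analysis results. The simulated data, the linear analysis model and the choice of five imputations are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X[rng.random(X.shape) < 0.15] = np.nan        # introduce roughly 15% MCAR gaps

coefs = []
for m in range(5):                            # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_imp = imputer.fit_transform(X)          # draws from the imputation model
    coefs.append(LinearRegression().fit(X_imp, y).coef_)  # analysis model per dataset

# Pooled point estimate: the average of the per-imputation coefficients.
print(np.mean(coefs, axis=0))
```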

Although we have only scratched the surface here, it has already become clear that imputation is no easy fix for the missing-data problem.

Conclusion

In the entire process of imputation, statistical/mathematical knowledge and experience with the data to be imputed have to be brought together. This requires close cooperation between statisticians and field experts. The reward is gaining as much reliable knowledge as possible from incomplete data.

Especially for large databases and registries, and for data that has been collected or merged without a specific goal in mind in the first place, imputation can serve as a highly beneficial, hypothesis-generating approach.

If this topic caught your attention or if you are interested in cooperating with us, please don’t hesitate to contact us.