Missing data are a real headache for most researchers because they can undermine
the whole effort invested in collecting large volumes of data. Missing data can
occur because of non-response, loss to follow-up or drop-out, equipment failure,
samples lost in transit, etc.
Types of missing data:
1. Missing completely at random: The reason for missingness is completely
random, i.e. not related to any patient characteristic. Examples: a tube
containing a subject's blood sample is broken by accident, or a questionnaire
is accidentally lost. In this case the observed data remain representative of
the full sample, so even simple techniques for handling missing data, such as
complete (available) case analysis, can give unbiased results, although with
some loss of precision. (See below for the ways of handling missing data.)
2. Missing not at random: The probability that an observation is missing
depends on information that is not observed. For example, when subjects are
asked for their income level, missing data may well be more likely when the
income level is relatively high.
3. Missing at random: The probability that an observation is missing depends
only on information for that subject that is present, i.e. on other observed
patient characteristics such as age or sex. This is the most common category of
missing data encountered in practice.
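The three mechanisms can be made concrete with a small simulation. The sketch
below is only an illustration: the population, the link between age and income,
and the missingness probabilities are all invented for the example. It deletes
income values under each mechanism and compares the mean of what remains with
the true mean.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: age in years, income loosely increasing with age.
n = 20000
ages = [rng.uniform(20, 70) for _ in range(n)]
incomes = [20 + 0.5 * a + rng.gauss(0, 5) for a in ages]

def observed_mean(prob_missing):
    """Mean income among subjects whose income value was not deleted."""
    kept = [inc for a, inc in zip(ages, incomes)
            if rng.random() >= prob_missing(a, inc)]
    return mean(kept)

true_mean = mean(incomes)

# Missing completely at random: the same probability for every subject.
mcar_mean = observed_mean(lambda age, income: 0.3)

# Missing not at random: probability depends on the income value itself.
mnar_mean = observed_mean(lambda age, income: 0.6 if income > 45 else 0.1)

# Missing at random: probability depends only on age, which is always observed.
mar_mean = observed_mean(lambda age, income: 0.6 if age > 50 else 0.1)

print(round(true_mean, 2), round(mcar_mean, 2),
      round(mnar_mean, 2), round(mar_mean, 2))
```

In this setup the mean of the remaining values stays close to the true mean only
in the first case. In the third case the difference can in principle be
corrected using age, because age is recorded for everyone; in the second case
the information needed for a correction is itself missing.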
The prime concern is always whether the available data are biased. One way to
check this is to compare the responders (subjects with the value present) and
the non-responders (subjects with the value missing) on variables that were
recorded for both groups. If the two groups are markedly different, the
missingness is probably related to some variable or characteristic. This gives
some indication of how representative the sample is.
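A minimal sketch of such a comparison, using made-up survey records in which
None marks a missing income value. The field names and the helper
compare_groups are hypothetical, and a real analysis would normally add a
formal test or confidence interval rather than compare two means by eye.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing income value.
records = [
    {"age": 25, "income": 30},
    {"age": 30, "income": 34},
    {"age": 35, "income": 41},
    {"age": 40, "income": 38},
    {"age": 55, "income": None},
    {"age": 60, "income": None},
    {"age": 65, "income": None},
]

def compare_groups(records, target, covariate):
    """Return the mean of `covariate` for responders and non-responders on `target`."""
    responders = [r[covariate] for r in records if r[target] is not None]
    non_responders = [r[covariate] for r in records if r[target] is None]
    return mean(responders), mean(non_responders)

age_responders, age_non_responders = compare_groups(records, "income", "age")
print(age_responders, age_non_responders)
```

Here the non-responders are clearly older than the responders, which suggests
that missingness of income is related to age in this invented data set.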
Missing data are much more common in retrospective studies, in which routinely
collected data are subsequently used for a different purpose.
The main ways of handling missing data in analysis are:
1. Complete case (or available case) analysis
(a) Omitting variables which have many missing values;
(b) Omitting individuals who do not have complete data;
2. Imputation techniques
(a) Hot deck imputation
(b) Mean substitution
(c) Last observation carried forward
(d) Regression imputation techniques
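Omitting individuals without complete data, or omitting a variable with
incomplete information, is known as complete case (or available case) analysis
and is probably the most common approach. Strictly speaking, complete case
analysis keeps only subjects with no missing values at all, whereas available
case analysis uses, for each analysis, every subject who has data on the
variables involved.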
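Both forms of omission can be sketched in a few lines. The patient records, the
variable names and the 50% threshold below are assumptions made for the
example, not recommended values.

```python
def complete_cases(records):
    """Keep only individuals with no missing (None) value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_sparse_variables(records, max_missing_fraction):
    """Drop variables whose fraction of missing values exceeds the threshold."""
    n = len(records)
    keep = [k for k in records[0]
            if sum(r[k] is None for r in records) / n <= max_missing_fraction]
    return [{k: r[k] for k in keep} for r in records]

# Hypothetical data: systolic blood pressure (sbp) and cholesterol (chol).
patients = [
    {"age": 34, "sbp": 120, "chol": None},
    {"age": 51, "sbp": None, "chol": None},
    {"age": 47, "sbp": 135, "chol": 5.2},
    {"age": 62, "sbp": 142, "chol": None},
]

listwise = complete_cases(patients)              # omit individuals only
trimmed = drop_sparse_variables(patients, 0.5)   # omit the sparse variable first
after_trimming = complete_cases(trimmed)         # then omit individuals

print(len(listwise), len(after_trimming))
```

Omitting individuals alone leaves a single patient here, while first omitting
the mostly missing cholesterol variable leaves three. This shows the trade-off
between losing subjects and losing variables.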
Imputation techniques
Hot deck imputation involves replacing missing values of one or more variables
for a non-respondent with observed values from a respondent (the donor) who is
similar to the non-respondent with respect to characteristics observed for both
cases. For example, the values of a randomly chosen 32-year-old black male
respondent would be used in place of the missing values of another 32-year-old
black male who did not respond to a survey.
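A minimal sketch of hot deck imputation, assuming donors must match the
non-respondent exactly on a chosen set of variables (here age and sex). The
function hot_deck_impute and the survey records are invented for illustration;
practical implementations often match on broader classes or on a distance
measure instead.

```python
import random

def hot_deck_impute(records, target, match_on, rng):
    """Fill missing `target` values from a randomly chosen matching donor."""
    imputed = []
    for rec in records:
        new_rec = dict(rec)  # copy so the original records stay untouched
        if rec[target] is None:
            key = tuple(rec[k] for k in match_on)
            donors = [d[target] for d in records
                      if d[target] is not None
                      and tuple(d[k] for k in match_on) == key]
            if donors:  # leave the value missing if no donor matches
                new_rec[target] = rng.choice(donors)
        imputed.append(new_rec)
    return imputed

# Hypothetical survey records; None marks a missing income value.
survey = [
    {"age": 32, "sex": "M", "income": 40},
    {"age": 32, "sex": "M", "income": 44},
    {"age": 32, "sex": "M", "income": None},
    {"age": 58, "sex": "F", "income": 61},
    {"age": 58, "sex": "F", "income": None},
    {"age": 45, "sex": "F", "income": None},  # no matching donor exists
]

filled = hot_deck_impute(survey, "income", ["age", "sex"], random.Random(0))
for row in filled:
    print(row)
```

The last record stays missing because no respondent shares its age and sex. A
real analysis would need a fallback rule for such cases.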
Mean imputation
The missing value is replaced by the mean of the observed values for that
variable. For example, a missing blood pressure value is replaced by the mean of
the blood pressure values of all respondents for whom it was recorded.
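As a small illustration, with invented blood pressure readings and a
hypothetical helper mean_impute:

```python
from statistics import mean

def mean_impute(values):
    """Replace each missing (None) value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical systolic blood pressure readings; None marks a missing value.
blood_pressure = [120, None, 130, 140, None]
imputed_bp = mean_impute(blood_pressure)
print(imputed_bp)
```

Every missing reading receives the same value, so the imputed variable shows
less spread than the real measurements would have had.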
Last observation carried forward
‘Last observation carried forward’ imputes a missing value with the value
recorded immediately before it, typically the same subject's most recent earlier
measurement in a study with repeated measurements.
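A minimal sketch for one subject's repeated measurements, listed in visit order.
The function name locf and the readings are made up for the example.

```python
def locf(values):
    """Carry the last observed value forward over missing (None) entries."""
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)  # stays None until a first value is observed
    return filled

# Hypothetical blood pressure readings for one subject at five visits.
visits = [120, None, None, 130, None]
carried = locf(visits)
print(carried)

# A missing first visit has nothing to carry forward and stays missing.
no_baseline = locf([None, 118, None])
print(no_baseline)
```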
Regression Imputation techniques
A regression model is estimated on the cases with complete information to
predict the variable that has missing values from the other variables. The
fitted values from this model are then used to impute the missing values.
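A minimal sketch using simple linear regression with a single predictor, fitted
by ordinary least squares. The choice of age as the predictor, the data and the
function names are assumptions for illustration; in practice the model would
usually include several predictors.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def regression_impute(xs, ys):
    """Impute missing (None) ys with fitted values from the complete cases."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]

# Hypothetical data: age is fully observed, blood pressure has one gap.
age = [20, 40, 60, 80]
blood_pressure = [110, 120, None, 140]
imputed = regression_impute(age, blood_pressure)
print(imputed)
```

Because the imputed value lies exactly on the fitted line, this approach tends
to overstate how strongly the variables are related unless some random
variation is added to the fitted values.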


