
Friday, 12 April 2013

Missing Data


Missing data are a real headache for most researchers because they can waste the effort invested in collecting large volumes of data. Missing data can occur because of non-response, loss to follow-up or drop-outs, equipment failure, samples lost in transit, etc.
   
Types of missing data:
1. Missing completely at random: The reason for missingness is completely random, i.e. not related to any patient characteristic. For example, a tube containing a subject’s blood sample is broken by accident, or a questionnaire is lost accidentally. In this case the representativeness of the sample is less in doubt, so even simple techniques for handling missing data, such as complete (available) case analysis, can give unbiased results (see below for the ways of handling missing data).
2. Missing not at random: The probability that an observation is missing depends on information that is not observed. For example, when subjects are asked for their income level, missing data may well be more likely when the income is relatively high.
3. Missing at random: The probability that an observation is missing depends on information for that subject that is present, i.e. on other observed patient characteristics such as age or sex. This is the category of missing data we encounter most often.

The prime concern is always whether the available data are biased. One way to check this is to compare the responders (non-missing subjects) and the non-responders (missing subjects) on some variables. If they are markedly different, then the non-response (missingness) is probably related to some variable or characteristic. This gives some pointer to the representativeness of the sample.
Missing data are much more common in retrospective studies, in which routinely collected data are subsequently used for a different purpose.

The main ways of handling missing data in analysis are:
1.  Complete case (or available case) analysis
(a) Omitting variables which have many missing values;
(b) Omitting individuals who do not have complete data;
2. Imputation techniques
(a) Hot deck imputation
(b) Mean substitution
(c) Last observation carried forward
(d) Regression Imputation techniques

Omitting individuals without complete data or omitting the variable with incomplete information is known as complete case (or available case) analysis and is probably the most common approach.


Imputation techniques

Hot deck imputation involves replacing missing values of one or more variables for a non-respondent with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed by both cases.
For example, the values of a randomly chosen 32-year-old black male respondent will replace those of another 32-year-old black male who did not respond to a survey.
Mean imputation
The missing value will be replaced by the mean of the values for that variable. For example a missing blood pressure value will be replaced by the mean of the blood pressure values for all other respondents.
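As a rough illustration of mean imputation, here is a minimal Python sketch. The blood pressure readings and the function name `impute_mean` are made up for this example; `None` is assumed to mark a missing value.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical systolic BP readings in mmHg; None marks a missing value.
systolic_bp = [120, 130, None, 110, None, 140]
imputed_bp = impute_mean(systolic_bp)
print(imputed_bp)
```

Note that every missing subject receives the same value, so the imputed variable shows less spread than the observed values alone.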
Last observation carried forward
‘Last observation carried forward’ uses the cell value immediately prior to the data that are missing to impute the missing value.
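A small Python sketch of the idea, assuming repeated measurements on one subject are stored in visit order and `None` marks a missed visit (the weights and the function name `locf` are hypothetical):

```python
def locf(values):
    """Carry the last observed value forward over None entries."""
    filled = []
    last = None  # nothing observed yet
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical weights (kg) of one subject over five visits.
weights = [62.0, None, None, 64.5, None]
carried = locf(weights)
print(carried)
```

In this sketch a value missing at the very first visit stays missing, because there is nothing earlier to carry forward.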
Regression Imputation techniques    
A regression model is estimated to predict the missing values, and the missing data are imputed from it. In other words, the available information from all other cases is used to fit a regression model, and the fitted values from that model are then used to impute the missing values.
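To make this concrete, here is a minimal Python sketch that imputes a missing blood pressure from age with a simple straight-line regression. The choice of a single predictor, the data and the function names are all assumptions made for illustration; real analyses typically use more predictors and dedicated statistical software.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(predictor, outcome):
    """Fill None outcomes with fitted values from the complete cases."""
    complete = [(x, y) for x, y in zip(predictor, outcome) if y is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, outcome)]


# Hypothetical data: age in years and systolic BP in mmHg (one BP missing).
ages = [30, 40, 50, 60, 45]
bps = [110, 120, 130, 140, None]
imputed = impute_by_regression(ages, bps)
print(imputed)
```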

Thursday, 11 April 2013

Short course on epidemiology, biostatistics and research methodology


We are pleased to announce that registration is open for the following
 "3rd short course on basics of epidemiology and biostatistics"

 "2nd short course on research methodology" 
at dept. of community medicine, IGMCRI, Puducherry, 19-22 May 2013

Sunday, 17 March 2013

Standard deviation (SD) and Standard error (SE)



Two terms commonly reported in scientific articles. We believe that most readers are acquainted with these concepts already; let’s brush them up. What we seldom realize is that SD is one of the most commonly used statistical concepts in day-to-day life. Consciously or not, we imagine a tolerable variation in different biological phenomena: age, weight, height, BP, anger. We understand that there are no hard and fast absolute limits and allow for some variation. This variation, the extent to which individual values deviate from the mean, can be quantified statistically in different ways.
Let us say we have the Hb values of children in a school and their arithmetic mean. Each child’s value deviates from this mean by some amount. The sum of these deviations is always zero, because the positive and negative deviations cancel out. Hence we have two choices. The first is to average the absolute values of these deviations; that gives the mean absolute deviation, which conveys the magnitude of deviation from the mean but no sense of its direction. The second is to square the individual deviations (to get rid of the sign) and calculate their average. This again tells us, on average, how far the individual values are from the mean, and it is what we call the variance. But variance is not what you commonly see in descriptive tables alongside the mean. The reason is that, as we have squared the values, the units are squared too, which makes the variance difficult to interpret. So we take its square root and lo, we have our standard deviation: in the same units as the variable measured and thus easier to interpret.
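The steps above can be followed in a few lines of Python. This is only a sketch with made-up Hb values; it divides by n, as described above, whereas many textbooks (and Python’s `statistics.stdev`) divide by n - 1 when the data are a sample.

```python
from math import sqrt


def describe_spread(values):
    """Mean, deviations from the mean, and the measures built from them."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    variance = sum(d ** 2 for d in deviations) / n  # average squared deviation
    return {
        "mean": mean_value,
        "sum_of_deviations": sum(deviations),
        "mean_absolute_deviation": sum(abs(d) for d in deviations) / n,
        "variance": variance,
        "sd": sqrt(variance),  # back in the original units
    }


# Hypothetical Hb values (g/dL) of five school children.
hb = [10.0, 11.0, 12.0, 13.0, 14.0]
spread = describe_spread(hb)
print(spread)
```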
Next, based on the mean and SD one can derive reference ranges which flank the mean. They can be 95% or 90% or whatever you want to know about the sample; that is, you want to see the limits that contain 95%, 90% etc. of the observations. So when a mother thinks her child weighs less than his peers, she has subconsciously calculated the mean and standard deviation of the sample of children who are his peers and thinks that he is probably below the range in which 95% or 90% of them weigh, whatever reference range she had constructed.
What then are standard errors? Let us zoom out of the sample to see the population from which it came, a population from which an infinite number of such samples could be drawn. Now the sample is to the population what the individual was to the sample. Earlier we thought of individual values distributed around a sample mean. Now we shall imagine sample means distributed around a population mean. This is what you call a sampling distribution. The mean of this distribution is the population mean and the SD of this distribution is the standard error. So an SE tells you, on average, how far the sample means are from the population mean. The mathematical expression of an SE (calculated as the SD of the sampling distribution) comes down to the SD divided by the square root of the sample size. Thus, SE depends on the variability in the sample (SD) and the sample size (n). For a sample mean to be a good estimate of the population mean, it should deviate as little as possible from the population mean. In other words, a small SE means that the sample mean is a better approximation of the population mean. From the above expression one can see how to get a small SE: increase the sample size!
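A quick Python sketch of the expression SE = SD / sqrt(n). The function names and numbers are hypothetical; `statistics.stdev` is used for the sample SD, which divides by n - 1.

```python
from math import sqrt
from statistics import stdev


def standard_error(sample):
    """SE of the mean: sample SD divided by the square root of n."""
    return stdev(sample) / sqrt(len(sample))


def se_from_sd(sd, n):
    """Same formula when the SD and sample size are already known."""
    return sd / sqrt(n)


# Hypothetical Hb values (g/dL).
print(standard_error([10.0, 11.0, 12.0, 13.0, 14.0]))

# With the same SD, a fourfold larger sample halves the SE.
print(se_from_sd(2.0, 25), se_from_sd(2.0, 100))
```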
Now, what is the equivalent of the reference range? The limits that enclose 95% or 90% of the sample means, with the population mean as the centre, are what lead us to confidence intervals. Needless to mention, this is an abstract concept where you imagine you have taken infinite samples (with replacement) from the same population, calculated their means and plotted them as a distribution. In practice we have only one sample, so the interval is constructed around the sample mean using its SE. If you are adamant about wanting a textbook definition of confidence intervals, you may refer to Kirkwood’s Essential Medical Statistics.
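For a feel of the arithmetic, here is a hedged Python sketch of an approximate 95% confidence interval for a mean, computed as sample mean ± 1.96 × SE. The multiplier 1.96 assumes a roughly normal sampling distribution and a reasonably large sample; for small samples a larger multiplier from the t distribution is normally used. The data and the function name are made up.

```python
from math import sqrt
from statistics import mean, stdev


def confidence_interval_95(sample):
    """Approximate 95% CI for the mean: mean +/- 1.96 * SE."""
    centre = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    margin = 1.96 * se
    return centre - margin, centre + margin


# Hypothetical Hb values (g/dL); far too few for real use, shown for arithmetic only.
low, high = confidence_interval_95([10.0, 11.0, 12.0, 13.0, 14.0])
print(low, high)
```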
Why do we need to imagine this very abstract concept? Because that is what an epidemiologist’s job is about: talking about populations while studying just samples! And as is evident from their calculation, SD and SE give different information. SD measures the variability in the sample and SE measures the uncertainty in the sample statistic (mean/proportion/others).

Sunday, 10 March 2013

A peek into the study of studies!



Picking up from where we left off last week, let us peek in a little more detail at the master study designs, systematic reviews: what they are, why we need them, what steps are involved, and what they can and cannot do for us. How to actually do them we shall see in a later post!
What are they? Well, systematic reviews are a systematic review of studies on a particular research question. Oversimplified! Consider narrative reviews, the kind we resort to while doing a literature review for our dissertation(s). They are far from systematic, are not all-inclusive, and two of us may differ widely in what we infer from the same studies. They are subjective and lack thoroughness, though they are easy to do. Hence, enter systematic reviews! These are systematic in every way, in every step of their design, conduct, presentation and interpretation. Now what the textbook says: a systematic review is one that has been prepared using a systematic approach to minimizing biases and random errors, which is documented in a materials and methods section. They are said to present an epidemiology of results. What then are meta-analyses? They are simply the statistical section of a systematic review. A systematic review (SR) may or may not include a meta-analysis (MA), as it may not always be sensible to statistically combine the results of different studies.
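To give a flavour of what "statistically combining" results can mean, here is a minimal Python sketch of one common approach, inverse-variance (fixed-effect) pooling, in which more precise studies get more weight. The post does not name a pooling method, so this choice, the function name and the numbers are assumptions for illustration only.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates and its SE."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Two hypothetical studies reporting the same effect measure with equal precision.
pooled, pooled_se = pool_fixed_effect([0.2, 0.4], [0.1, 0.1])
print(pooled, pooled_se)
```

The pooled SE is smaller than that of either study, which is how several small studies can together give a more precise answer.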
Why do we need them? Are we too lazy to get into the field and collect primary data? Well, the earliest mention of a meta-analysis was made by Pearson, to overcome the problem of small but well-conducted studies that couldn’t manage a significant result on their own. Hence the idea: combine them rationally and see if together they churn up something significant. Another purpose is to synthesize evidence from the large volume of literature available, for the ease of the reader/practitioner. And when we have contradicting results on a research question, it may make sense to combine them systematically to see what they suggest together. Archie Cochrane is the person whose efforts led to what we now see in the Cochrane Collaboration and Library, a collection of systematic reviews and meta-analyses conducted along clearly laid down, rigorous standards.
What can they do for you? You can combine experimental and/or observational studies. The combination of studies of different designs in an SR is called a cross-design synthesis. But remember, it is only the SR/MA of RCTs that figures as Grade I evidence. A criticism of SRs of observational studies is that they sum up the biases and errors of the studies as well, rather than just combining their results. That is open to debate and depends on the quality of the studies you combine.
What are the steps involved? Well, you have an itch in your cortex. That’s how you start: an uncertainty that persists after you have gone through the literature, due to inadequate (not nil) evidence, contradicting evidence or too much of it. So you frame your research question. Be doubly sure that there is no SR/MA, complete or in progress, on the same topic before you put your foot down and set off with an SR. Get a team ready; a topic expert and a methods expert are indispensable. Then lay down your search strategy: what search terms you will use, what databases you will access, which languages you will cover, over what reference period you will search, and how you will capture unpublished work. That is just a glimpse of what goes into setting a search strategy. Then decide on your inclusion and exclusion criteria for the studies that your search yields. Then run a bias check on all the selected studies and grade the extent of bias in each. At this point we have good news and bad news. Bad: each of the above steps has to be done by two members of your review team independently, and any disagreement between them has to be settled by a third. Good: there is software (to be discussed in future posts) for bias checking and grading that reduces your work to less than half.
So you are left with a list of studies, double filtered and screened for biases. Prepare a data extraction sheet to extract all the relevant data from each of the studies. Combine them statistically, or choose not to, as the data inform you. Forest plots, funnel plots and the statistical analyses are just a click away using the software mentioned above. Write up the results and discuss them. There, you are done. How to publish them? Well, if your title/protocol was registered with Cochrane, it goes into Cochrane by default (once you meet their standards).
Sounds like a cakewalk? No, far from being easy, an SR/MA can take quite a toll on the epidemiologist and the statistician. But they are interesting to work on, and there is step-by-step support available from the Cochrane Collaboration. One can check out their website for the support material, most of which is freely accessible. As a beginning, one can download and read the Cochrane handbook on SR and MA. It requires lots of consistency and hard work as a team. It is a symphony and no ONE can whistle it!
PS: If you felt this post was sketchy, that is what a peek would give you. If you are interested in delving into the depths, let us know.

Saturday, 2 March 2013

STUDY DESIGNS IN EPIDEMIOLOGY – 3

In the previous write-up, ‘Study designs in epidemiology – 2’, we discussed various study designs and their advantages and disadvantages. If you recall, cohort studies are very logistics intensive (remember that case control studies are less logistics intensive, but the possibility of bias is high). Generally a large sample size is required, hence more money and manpower. Before you go through this write-up, we advise you to read the previous parts 1 and 2 on study designs.
There is a study design which has the advantages of both the case control study and the cohort study (and the disadvantages of neither). It is called a ‘case control study within a cohort’. It is most applicable where the risk factor (exposure) is a biochemical parameter measured in, say, blood. Here a cohort is followed up. If you remember from ‘Study designs in Epidemiology – 1’, a cohort is a group of people with a common characteristic. Before follow-up starts, the sample (say blood) in which the biochemical parameter (risk factor/exposure) is to be measured is collected and stored at -72 °C. The blood sample is only stored; no measurement of the biochemical parameter is done at this point. With time, some of those in the cohort develop the disease (cases); many do not (some of them are taken as controls). Now the corresponding blood samples of the cases and controls are retrieved and the exposure level is measured (we assume the levels do not change while the samples are stored at -72 °C). Depending on how controls are selected, this design can be of two types. In a nested case control study, an appropriately matched control (one-to-one matching shall be discussed in future write-ups) is selected at the time each case develops. In a case cohort study, controls are selected after a particular time period (end of the study period/follow-up), based on the number of cases. Here the same control can be used against various groups of cases; in other words, the same group may act as the control for different sets of cases. Case control studies within a cohort have the following advantages:
-          Temporality: the exposure has occurred before the onset of disease (it is measured later, in a blood sample stored at baseline)
-          No recall bias in exposure ascertainment, which is a problem in case control studies
-          Less resource intensive, as exposure ascertainment is done only in cases and controls. In a cohort study, exposure is ascertained in the whole study population to divide it into exposed and non-exposed.
There are variations in randomized controlled trials as well. One such design is called the factorial design. Here more than one question can be studied in the same population. Say the study population is divided into four equal quarters. The first quarter receives drug A and drug B; the second, drug A only; the third, drug B only; and the fourth receives neither. Two comparisons are now available for analysis: first and second quarters v/s third and fourth quarters (drug A v/s no drug A), and first and third quarters v/s second and fourth quarters (drug B v/s no drug B). In a factorial design the following assumptions are made:
-          Drug A and B have different mechanisms of action; or in other words, their mechanisms of action are independent of each other
-          Drug A and B have different pharmacokinetics and pharmacodynamics; with no interaction
-          Outcome that is being measured is different for both the drugs
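The four quarters and the two comparisons of a factorial design can be sketched in Python as follows. The arm labels, function names and the simple shuffle-based allocation are assumptions for illustration; actual trials use properly generated and concealed allocation sequences.

```python
import random


def allocate_factorial(participant_ids, seed=0):
    """Randomly split participants into the four arms of a 2 x 2 factorial design."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    ids = list(participant_ids)
    rng.shuffle(ids)
    arms = ["A+B", "A only", "B only", "neither"]
    return {pid: arms[i % 4] for i, pid in enumerate(ids)}


def comparison_groups(allocation):
    """Build the two comparisons: drug A v/s no drug A, drug B v/s no drug B."""
    everyone = set(allocation)
    got_a = {p for p, arm in allocation.items() if arm in ("A+B", "A only")}
    got_b = {p for p, arm in allocation.items() if arm in ("A+B", "B only")}
    return {"A": got_a, "no A": everyone - got_a,
            "B": got_b, "no B": everyone - got_b}


allocation = allocate_factorial(range(8))
groups = comparison_groups(allocation)
print({name: sorted(members) for name, members in groups.items()})
```

Every participant contributes to both comparisons, which is why two questions can be answered with one study population.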
Another variant of an RCT is the planned cross-over study design. Here two groups receive different treatments and the outcome is measured. Then, after a washout period, the treatments are interchanged between the groups and the outcome is measured again. In this manner each group acts as its own control. Here too, certain assumptions are to be met:
-          There should be a washout period, after which the first treatment must not have any residual effect
-          The outcome of the treatment must not be permanent (it must be transient).
Sometimes, in a parallel RCT, unplanned cross-over occurs for various reasons. If this happens, the analysis should be based on Intention to Treat (ITT). In simple words, the analysis should be done as if the randomization were not broken, i.e. every subject is analysed in the originally assigned group. The report should mention what percentage of study subjects underwent unplanned cross-over.
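In code, the idea amounts to grouping by the assigned arm rather than by the treatment actually received. A minimal Python sketch, with a hypothetical record layout and made-up data:

```python
def itt_summary(records):
    """Outcome risk by ASSIGNED arm, plus the percentage of unplanned cross-over."""
    by_arm = {}
    for r in records:
        events, total = by_arm.get(r["assigned"], (0, 0))
        by_arm[r["assigned"]] = (events + r["outcome"], total + 1)
    risk = {arm: events / total for arm, (events, total) in by_arm.items()}
    crossed = sum(1 for r in records if r["assigned"] != r["received"])
    return risk, 100.0 * crossed / len(records)


# Hypothetical trial records; outcome is 1 if the event occurred, else 0.
records = [
    {"assigned": "drug", "received": "drug", "outcome": 0},
    {"assigned": "drug", "received": "placebo", "outcome": 1},  # crossed over
    {"assigned": "placebo", "received": "placebo", "outcome": 1},
    {"assigned": "placebo", "received": "placebo", "outcome": 0},
]
risk, crossover_pct = itt_summary(records)
print(risk, crossover_pct)
```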
If you remember, in experimental studies the investigator assigns the exposure. With this background, let us introduce some commonly used terms: quasi experiments and natural experiments. In quasi experiments, exposure assignment is not under the investigator’s control (in the strict sense, these are observational). Many a time, experimental studies with non-random allocation are also called quasi experiments. The term natural experiment is used in the setting of a natural event. These are also called experiments of opportunity. An example is studying the effect of a nuclear accident by comparing the disease rate at the accident site and at a non-accident site.
All the studies that we have mentioned till now involve ascertainment of exposure or outcome at the individual level. When the same is done at the population level, the studies are called ecological studies. An example is comparing the incidence of breast cancer and the average fat consumption in a particular country with those of other countries. If cancer incidence increases with average fat consumption across countries, one may be tempted to say that the two are associated. This interpretation must be made with caution, as there could be a fallacy here. It is possible that the high fat consumption is among those who are not developing breast cancer. This is called the ecological fallacy (a fallacy in individual-level interpretation when data are collected at the population level).
In the next write-up we’ll discuss systematic reviews and meta-analyses. These study designs are at the top of the hierarchy when it comes to strength of evidence. Stay tuned!!!

Sunday, 24 February 2013

STUDY DESIGNS IN EPIDEMIOLOGY – 2



To summarize what we discussed in the previous write up, please refer to the figure given below
[Grimes DA, Schulz KF (2002). An overview of clinical research: the lay of the land. The Lancet, 359:57-61].


Each study design has its own advantages and disadvantages. In a cross sectional study, the outcome and the exposure are measured at the same time. Hence, one is not able to ascertain temporality. Take the example of a cross sectional study planned with the following research question: “Does obesity (exposure) increase the risk of osteoarthritis (outcome)?” Here, information pertaining to obesity and osteoarthritis is collected at the same time. Hence the opposite of the research hypothesis (osteoarthritis causing obesity) is also a possibility which one can’t rule out.
Case control studies solve this problem. How do they do it? First we select cases and choose appropriate controls (there are selection criteria for controls). Then we go back in time and collect information pertaining to exposure. Here, we will select a set of patients with osteoarthritis (cases) and appropriate controls without osteoarthritis. We check their medical records from, say, 5 years back (when they didn’t have osteoarthritis), determine their BMI, and make our comparison. Temporality isn’t an issue here, but case control studies have other problems (we would like to introduce the term bias here; it will be discussed in future). The most common is a type of information bias called recall bias. There are other information biases and selection biases as well, which one has to understand. Hence, case control studies are more efficient, but bias is a common problem (confounding as well).
The next analytical study is the cohort study, where one starts with the exposed and follows them up for the outcome (cases). Obviously, there will be a comparison group of unexposed, as it is an analytical study (refer to study designs – 1). Here, there is no bias in ascertaining exposure. Calculation of incidence is possible among the exposed and the unexposed. There is a possibility of outcome ascertainment bias, though. Despite these advantages, cohort studies are very logistics intensive (remember that case control studies are less logistics intensive, but the possibility of bias is high). Generally a large sample size is required, hence more money and manpower.
In randomized clinical/community trials, exposure assignment (not ascertainment) is done by the investigator, but who gets which intervention is not under the investigator’s control. The allocation is random and very often concealed. This process ensures that the groups are comparable except for the presence of the intervention/exposure in the study group. Hence a difference in outcome between the groups, if any, can be explained only by the presence/absence of the exposure. Randomized clinical/community trials therefore have the advantage of minimal confounding (it can’t be ruled out completely). Confounding will be discussed in future. Many a time, the study participant and/or the outcome assessor and/or the statistician who analyzes the study is blinded (information on who is in the study or control group is not provided or is hidden). This decreases the chance of bias in the study. Temporality is obviously not an issue, as the study (intervention) and control (no intervention) groups are followed up for occurrence of the outcome. In RCTs, selection criteria are strictly enforced. Because of these selection criteria, the population to which we can generalize our results gets restricted. Hence, external validity is an issue here.
In non-randomized experimental trials, exposure assignment is done by the investigator, but the allocation is not random.
Systematic reviews and meta-analyses are a step ahead of RCTs in the hierarchy of study designs. They will be discussed in the next write-up. Stay tuned!!


PS:-
Bias is a systematic error (non-random error) during any phase of the study. It results in a spurious association.
A confounder is an exposure [C] (other than the exposure of interest in your study) which is associated with both the exposure [E] and the outcome [O] of interest in your study, is an independent risk factor for O, and is not present in the causal pathway between E and O. It results in an indirect association. Both shall be discussed in detail in future write-ups.

Sunday, 17 February 2013

STUDY DESIGNS IN EPIDEMIOLOGY - I

In Clinical research, we often tend to ponder over the study design. Many a time we are clear about our methodology, yet we are not able to pin point the specific study design employed. Let us help you here…
Broadly, epidemiological studies are divided into observational and experimental studies. Now what’s the criterion for this bifurcation? It depends on whether the investigator assigns the exposure or not. If the investigator doesn’t assign the exposure, all that he or she is doing is observing events as they happen. These studies are called observational studies. If the investigator assigns the exposure (randomly or non-randomly), then the study becomes an experimental study. An example is giving aspirin to one group and placebo to the other, and following them over a period of time to measure the outcome.
Let us spend some time on experimental studies. As mentioned before, the assignment of exposure could be random or non-random. If it is random, they are called randomized experimental trials; if non-random, they are called non-randomized experimental trials. I am sure you have heard of the abbreviation RCT. RCT means randomized clinical trial. If done in a community setting, the same may be called a randomized community trial. Some prefer to call them randomized controlled trials, where control refers to the presence of a comparison group. CONSORT 2010 guidelines may be followed to report RCTs.
Coming back to observational studies: here the investigator has no active role. He or she is a passive observer and does not intervene. Observational studies are further classified into analytical and descriptive studies. This division is based on whether there is a comparison group in the study. If yes, it’s an analytical study; if no, it’s a descriptive study. In a descriptive study, one only describes the disease (one can describe the exposure also). For example: the prevalence of diabetes in Puducherry is 11%. Or: among the 100 malnourished children studied, 60 were from low socioeconomic status and 30 from slums. Now this description doesn’t tell us anything about any exposure-outcome relationship. You might be tempted to say that the majority of malnourished children are from low socioeconomic status, hence the exposure is a risk factor. Wait, this conclusion may be premature. What if the same prevalence of low socioeconomic status is found among non-malnourished children (60% in the control/comparison group)? Got it?
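The point can be put in numbers with a tiny Python sketch. The figure for the malnourished children comes from the example above; the comparison group is the hypothetical "what if" scenario, and the function name is made up.

```python
def proportion_exposed(exposed, total):
    """Share of a group that has the exposure (here, low socioeconomic status)."""
    return exposed / total


malnourished = proportion_exposed(60, 100)  # from the descriptive example
comparison = proportion_exposed(60, 100)    # hypothetical non-malnourished group
print(malnourished, comparison, malnourished - comparison)
```

With no difference between the two proportions, the descriptive figure of 60% on its own says nothing about whether low socioeconomic status is a risk factor.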
Analytical studies will give you the answer that you are looking for (they have a comparison group!!!). Analytical studies are further divided into cross-sectional, case control and cohort studies, depending upon when information on exposure and outcome is collected. If information on both exposure and outcome is collected/determined at one point of time, it is called a cross sectional study. If you start with cases of interest and appropriate controls, then go back in time and look for exposure, the study is a case control study. If you start with exposed and non-exposed, and move forward in time to look for the outcome, the study is called a cohort study.

Now, it’s the right time to clarify certain terminologies. First, descriptive v/s analytical studies: the difference has been discussed above. What about cross-sectional v/s longitudinal studies? In cross sectional studies, a measurement is done only once in each subject (please also refer to the previous explanation of cross sectional analytical studies). In longitudinal studies, the same measurement or variable is collected more than once in each subject. Now various combinations are possible.
·         Cross-sectional descriptive study: Study on prevalence of Diabetes in a community
·         Cross sectional analytical study: Study on risk factors in diabetes and non diabetes (diabetes and risk factors information / ascertainment collected/done at the same time)
·         Longitudinal descriptive study: Following up a cohort of children born in a hospital in 2012 for their weight and height every 2 months till they attain the age of 1 year.
·         Longitudinal analytical study: Add a comparison group to the above example. One may compare the growth pattern of term babies with that of preterm babies. Here term/preterm is the exposure and the outcome is the growth pattern over the first year of life.
(PS:- A cohort is a group of people sharing a common characteristic who are generally followed over a period of time. A cohort study is a longitudinal analytical study)

In the next write up we’ll discuss the advantages and disadvantages of each of the above mentioned study designs. Also, we shall mention / discuss each study design in detail. There are other study designs as well which we’ll go through in the future write ups. Stay tuned!!!
 