Errors in Statistical Data

Introduction
The accuracy of a survey estimate refers to the closeness of the estimate to the true population value. Where there is a discrepancy between the value of the survey estimate and true population value, the difference between the two is referred to as the error of the survey estimate. The total error of the survey estimate results from the two types of error:

sampling error, which arises when only a part of the population is used to represent the whole population; and
non-sampling error, which can occur at any stage of a sample survey and can also occur in censuses.

Sampling error can be measured mathematically, whereas measuring non-sampling error can be difficult.

It is important for a researcher to be aware of these errors, in particular non-sampling error, so that they can be either minimised or eliminated from the survey. An introduction to measuring sampling error and the effects of non-sampling error is provided in the following sections.

Sampling Error
Sampling error reflects the difference between an estimate derived from a sample survey and the "true value" that would be obtained if the whole survey population were enumerated. It could be calculated exactly from the population values, but as these are unknown (otherwise there would be no need for a survey), it is estimated from the sample data instead. It is important to consider sampling error when publishing survey results, as it gives an indication of the accuracy of the estimate and therefore of the weight that can be placed on interpretations. If sampling principles are applied carefully within the constraints of available resources, sampling error can be accurately measured and kept to a minimum.

Factors Affecting Sampling Error
Sampling error is affected by a number of factors including sample size, sample design, the sampling fraction and the variability within the population. In general, larger sample sizes decrease the sampling error, but the decrease is not directly proportional: the sampling error falls roughly with the square root of the sample size. As a rough rule of thumb, you need to increase the sample size fourfold to halve the sampling error. Of much lesser influence is the sampling fraction (the proportion of the population that is included in the sample), but as the sample size increases as a fraction of the population, the sampling error should decrease.

The population variability also affects the sampling error. More variable populations give rise to larger errors, as the estimates calculated from different samples are likely to vary more. The effect of the variability within the population can be reduced by increasing the sample size to make it more representative of the survey population. Various sample design options also affect the size of the sampling error. For example, stratification reduces sampling error whereas cluster sampling tends to increase it (these designs are discussed in Sample Design).
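The effect of sample size, sampling fraction and population variability can be illustrated with a minimal sketch. It assumes simple random sampling without replacement and uses the textbook formula se = sqrt((1 - n/N) * S² / n) for the standard error of a sample mean; the function name and all figures are illustrative and do not come from any particular survey.

```python
import math

def se_of_mean(pop_sd, n, pop_size=None):
    """Standard error of a sample mean under simple random sampling.

    pop_sd   -- population standard deviation (the variability)
    n        -- sample size
    pop_size -- population size; when supplied, the finite population
                correction (1 - n/N) reflects the sampling fraction
    """
    variance = pop_sd ** 2 / n
    if pop_size is not None:
        variance *= 1 - n / pop_size
    return math.sqrt(variance)

# Hypothetical values chosen only to show the pattern.
base = se_of_mean(pop_sd=20, n=100)            # 2.0
quadrupled = se_of_mean(pop_sd=20, n=400)      # 1.0: four times the sample halves the error
more_variable = se_of_mean(pop_sd=40, n=100)   # 4.0: a more variable population doubles it
with_fraction = se_of_mean(pop_sd=20, n=100, pop_size=10_000)  # slightly below 2.0
```

With a sampling fraction of only 1%, the correction changes the standard error very little, which is consistent with the sampling fraction being of much lesser influence than the sample size.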

Standard Error
The most commonly used measure of sampling error is called the standard error (SE). The standard error is a measure of the spread of estimates around the "true value". In practice, only one estimate is available, so the standard error cannot be calculated directly. However, if the population variance is known, the standard error can be derived mathematically. Even if the population variance is unknown, as happens in practice, the standard error can be estimated by using the variance of the sample units. Any estimate derived from a probability-based sample survey has a standard error associated with it (called the standard error of the estimate, written se(y) where y is the estimate of the variable of interest). Note that:

The standard error is an indication of how close the sample survey estimate is to the result that would have been obtained from a census under the same operating conditions (an equal complete coverage).
The standard error only gives a measure of the variation in values obtained from repeated samples. It does not measure the precision of the particular sample from which it is estimated.
A small standard error indicates that the variation in values from repeated samples is small and therefore the chance of a 'bad' sample is small - hence there is more likelihood that the sample estimate will be close to the result of an equal complete coverage.
Standard errors can be used to work out upper and lower limits ('confidence interval'), which will include the result from an equal complete coverage with a certain probability.
Estimates of the standard error can be obtained from any one of the possible random samples.
The standard error calculated from a sample is itself an estimate (and is also subject to sampling error).
When publishing the results of any survey, statements about the standard error of the estimates should be made.
When comparing survey estimates, the standard errors must be taken into account.
The term 'sampling variance' refers to the square of the standard error.

For more information on how to calculate estimates and their standard errors, please refer to Analysis.
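The note above that standard errors must be taken into account when comparing survey estimates can be sketched as follows. The sketch assumes the two estimates come from independent samples, in which case the standard error of their difference is the square root of the sum of the squared standard errors; the function name and figures are hypothetical.

```python
import math

def difference_with_se(est_a, se_a, est_b, se_b):
    """Difference between two independent estimates and its standard error."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff

# Hypothetical estimates from two independent surveys.
diff, se_diff = difference_with_se(est_a=50, se_a=3, est_b=42, se_b=4)

# The difference (8) is smaller than two standard errors of the difference (10),
# so it could plausibly be due to sampling error alone.
is_clear_difference = abs(diff) > 2 * se_diff
```

If the two estimates are not independent (for example, they share sample units), this formula does not apply as written.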

Variance
The variance is another measure of sampling error, which is simply the square of the standard error: Var(y) = se(y)²
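As a rough illustration of how the standard error and variance of an estimate might be obtained from sample data, the sketch below estimates a mean from a small made-up sample. It assumes simple random sampling with a negligible sampling fraction, so that Var(y) is estimated as the sample variance divided by the sample size; the data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample values (for illustration only).
sample = [12, 15, 9, 14, 10, 13, 11, 12]

n = len(sample)
y = statistics.mean(sample)                    # the estimate (sample mean)
sample_variance = statistics.variance(sample)  # variance of the sample units (n - 1 divisor)

var_y = sample_variance / n   # estimated sampling variance of the estimate
se_y = math.sqrt(var_y)       # estimated standard error, so var_y equals se_y squared
```

For more complex sample designs the variance calculation differs, so this should be read only as the simplest case.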

Relative Standard Error
Another way of measuring sampling error is the relative standard error (RSE), where the standard error is expressed as a percentage of the estimate. Because it is a relative measure, the RSE can be interpreted without referring to the size of the estimate, and is useful when comparing the variability of population estimates with different means. The relative standard error is calculated as follows (where y is the estimate of the variable of interest):

RSE(y) = 100 * {se(y) / y}
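The formula above translates directly into code. In this minimal sketch the function name and the example figures are hypothetical; the check for a zero estimate is an added safeguard, since the RSE is undefined in that case.

```python
def relative_standard_error(estimate, se):
    """RSE as a percentage of the estimate: 100 * se(y) / y."""
    if estimate == 0:
        raise ValueError("RSE is undefined when the estimate is zero")
    return 100 * se / estimate

# Hypothetical estimates of very different size.
rse_small_estimate = relative_standard_error(estimate=50, se=10)      # 20.0
rse_large_estimate = relative_standard_error(estimate=5000, se=100)   # 2.0
```

Although the second estimate has the larger standard error in absolute terms, its RSE is much smaller, which is why the RSE is convenient for comparing estimates of different magnitude.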

Confidence Interval
Assuming that the target population is distributed normally for the characteristic being measured (or, if estimating the mean, that the sample is large enough for the sample mean to be treated as normally distributed), an interval that is likely to contain the true value can be calculated as one, two, or three standard errors above and below the survey estimate. This interval is usually referred to as a confidence interval.

Figure: Normal Curve

There is approximately a 95% chance that the confidence interval which extends two standard errors on either side of the estimate contains the "true value" (the exact multiplier for a normal distribution is 1.96). This interval is called the 95% confidence interval and is the most commonly used confidence interval. The 95% confidence interval is written as follows:

95% CI(y) = [y - {2*se(y)} , y + {2*se(y)}]

This is expressed: "We are 95% confident that the true value of the variable of interest lies within the interval [y - {2*se(y)} , y + {2*se(y)}]".

Other confidence intervals are the 68% confidence interval (which extends one standard error on either side of the estimate and has a 68% chance of containing the "true value") and the 99% confidence interval (which extends three standard errors on either side of the survey estimate and has at least a 99% chance of containing the "true value").

For example, suppose a survey estimate is 50 with a standard error of 10. The confidence interval 40 to 60 has a 68% chance of containing the "true value", the interval 30 to 70 has a 95% chance of containing the "true value" and the interval 20 to 80 has a 99% chance of containing the "true value".
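The worked example above can be reproduced with a short sketch. It uses the rule-of-thumb multipliers of one, two and three standard errors given in the text and assumes the estimate is approximately normally distributed; the function name is hypothetical.

```python
def confidence_interval(estimate, se, multiplier=2):
    """Interval extending `multiplier` standard errors either side of the estimate."""
    return (estimate - multiplier * se, estimate + multiplier * se)

# Survey estimate of 50 with a standard error of 10, as in the example above.
intervals = {
    "68%": confidence_interval(50, 10, multiplier=1),  # (40, 60)
    "95%": confidence_interval(50, 10, multiplier=2),  # (30, 70)
    "99%": confidence_interval(50, 10, multiplier=3),  # (20, 80)
}
```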

NON-SAMPLING ERROR
Non-sampling error comprises all other errors in the estimate. Some examples of causes of non-sampling error are non-response, a badly designed questionnaire, respondent bias and processing errors.

Non-sampling errors can occur at any stage of the process. They can happen in censuses and sample surveys. Non-sampling errors can be grouped into two main types: systematic and variable.

Systematic error (called bias) makes survey results unrepresentative of the target population by distorting the survey estimates in one direction. For example, if the target population is the population of Australia but the survey population is just males then the survey results will not be representative of the target population due to systematic bias in the survey frame.

Variable error can distort the results on any given occasion but tends to balance out on average. Some of the types of non-sampling error are outlined below:

Failure to Identify Target Population / Inadequate Survey Population
The target population may not be clearly defined through the use of imprecise definitions or concepts. The survey population may not reflect the target population due to an inadequate sampling frame and poor coverage rules. Problems with the frame include missing units, deaths, out-of-scope units and duplicates. These are discussed in detail in Frames and Population.

Non-Response Bias
Non-respondents may differ from respondents in relation to the attributes/variables being measured. Non-response can be total (none of the questions answered) or partial (some questions may be unanswered owing to memory problems, inability to answer, etc.). To improve response rates, care should be taken in designing the questionnaires, training interviewers, assuring the respondent of confidentiality, motivating them to co-operate, and calling back at different times if there are difficulties contacting the respondent. "Call-backs" are successful in reducing non-response but can be expensive for personal interviews. Non-response is covered in more detail in Non-Response.

Questionnaire Problems
The content and wording of the questionnaire may be misleading and the layout of the questionnaire may make it difficult to accurately record responses. Questions should not be loaded, double-barrelled, misleading or ambiguous, and should be directly relevant to the objectives of the survey.

It is essential that questionnaires are tested on a sample of respondents before they are finalised, to identify questionnaire flow and question wording problems, and that sufficient time is allowed for improvements to be made to the questionnaire. The questionnaire should then be re-tested to ensure changes made do not introduce other problems. This is discussed in more detail in Questionnaire Design.

Respondent Bias
Refusals to answer questions, memory biases and inaccurate information given by respondents who believe they are protecting their personal interests and integrity may lead to bias in the estimates. The way the respondent interprets the questionnaire and the wording of the answer the respondent gives can also cause inaccuracies. When designing the survey you should remember that uppermost in the respondent's mind will be protecting their own personal privacy, integrity and interests. Careful questionnaire design and effective questionnaire testing can overcome these problems to some extent.
Respondent bias is covered in more detail below.

Processing Errors
There are four stages in the processing of the data where errors may occur: data grooming, data capture, editing and estimation. Data grooming involves preliminary checking before entering the data onto the processing system in the capture stage. Inadequate checking and quality management at this stage can introduce data loss (where data is not entered into the system) and data duplication (where the same data is entered into the system more than once). Inappropriate edit checks and inaccurate weights in the estimation procedure can also introduce errors to the data. To minimise these errors, processing staff should be given adequate training and realistic workloads.

Misinterpretation of Results
This can occur if the researcher is not aware of certain factors that influence the characteristics under investigation. A researcher or any other user not involved in the collection stage of the data gathering may be unaware of trends built into the data due to the nature of the collection, such as its scope. For example, a survey which collected income as a data item, with a coverage and scope of all adult persons (i.e. 18 years or older), would be expected to produce a different estimate from that produced by the ABS Survey of Average Weekly Earnings (AWE), simply because AWE includes persons aged 16 and 17 years as part of its scope. Researchers should carefully investigate the methodology used in any given survey.

Time Period Bias
This occurs when a survey is conducted during an unrepresentative time period. For example, if a survey aims to collect details on ice-cream sales, but only collects a week's worth of data during the hottest part of summer, it is unlikely to represent the average weekly sales of ice-cream for the year.

Minimising Non-Sampling Error
Non-sampling error can be difficult to measure accurately, but it can be minimised by:

careful selection of the time the survey is conducted,
using an up-to-date and accurate sampling frame,
planning for follow up of non-respondents,
careful questionnaire design,
providing thorough training for interviewers and processing staff, and
being aware of all the factors affecting the topic under consideration.

RESPONDENT BIAS
No matter how good the questionnaire or the interviewers are, errors can be introduced into a survey either consciously or unconsciously by the respondents. The main sources of error relating to respondents are outlined below.

Sensitivity
If respondents are faced with a question that they find embarrassing, they may refuse to answer, or choose a response which prevents them from having to continue with the questions. For example, if asked the question: "Are you taking any oral contraceptive pills for any reason?", and knowing that if they say "Yes" they will be asked for more details, respondents who are embarrassed by the question are likely to answer "No", even if this is incorrect.

Fatigue
Fatigue can be a problem in surveys which require a high level of commitment from respondents, for example diary surveys where respondents have to record all expenses made in a two week period. In these types of survey, the level of accuracy and detail supplied may decrease as respondents become tired of recording all expenditures.

NON-RESPONSE
Non-response results when data is not collected from respondents. The proportion of these non-respondents in the sample is called the non-response rate. Non-response can be either partial or total. It is important to make all reasonable efforts to maximise the response rate, as non-respondents may have differing characteristics to respondents, which can bias the results.

Partial Non-Response
When a respondent replies to the survey but answers only some of the questions, this is called partial non-response. Partial non-response can arise due to memory problems, inadequate information or an inability to answer a particular question. The respondent may also refuse to answer questions if they:

find questions particularly sensitive, or
have been asked too many questions (the questionnaire is too long).

Total Non-Response
Total non-response can arise if a respondent cannot be contacted (the frame contains inaccurate or out-of-date contact information or the respondent is not at home), is unable to respond (may be due to language difficulties or illness) or refuses to answer any questions.

When conducting surveys it is important to collect information on why a respondent has not responded. For example, when evaluating a program, a respondent may indicate they were not happy with the program and therefore do not wish to be part of the survey. Another respondent may indicate that they simply don't have the time to complete the interview or survey form. If a large number of those not responding indicate dissatisfaction with the program, and this is not indicated in the final report, an obvious bias would be introduced in the results.

Minimising Non-Response
Response rates can be improved through good survey design via short, simple questions, good forms design techniques and explaining survey purposes and uses. Assurances of confidentiality are very important as many respondents are unwilling to respond due to a fear of lack of privacy. Targeted follow-ups on non-contacts or those initially unable to reply can increase response rates significantly.

Following are some hints on how to minimise refusals in a personal or phone contact:

Find out the reasons for refusal and try to talk through them
Use positive language
State how and what you plan to do to help with the questionnaire
Stress the importance of the survey
Explain the importance of their response as a representative of other units
Emphasise the benefits from the survey results, explain how they can obtain results
Give assurance of the confidentiality of the responses

Other measures that can improve respondent cooperation and maximise response include:

Public awareness activities including discussions with key organisations and interest groups, news releases, media interviews and articles. These are aimed at informing the community about the survey, identifying issues of concern and addressing them.
Advice to selected units by letter, giving them advance notice and explaining the purposes of the survey and how the survey is going to be conducted.

In the case of a mail survey, most of the points above can be stated in an introductory letter or through a publicity campaign.

Allowing for Non-Response
Where response rates are still low after all reasonable attempts at follow-up have been made, you can reduce bias by using population benchmarks to post-stratify the sample (covered in Sample Design), by intensive follow-up of a subsample of the non-respondents, or by imputation for item non-response (non-response to a particular question).
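Post-stratification to population benchmarks can be sketched in a few lines. The example below is a simplified illustration only: the stratum names, respondent counts, means and benchmark totals are all hypothetical, and it assumes that respondents within each stratum are broadly representative of that stratum.

```python
def post_stratified_mean(stratum_means, population_counts):
    """Weight each stratum's respondent mean by its known population count."""
    total = sum(population_counts.values())
    weighted_sum = sum(population_counts[s] * stratum_means[s] for s in population_counts)
    return weighted_sum / total

# Hypothetical respondents: large farms are over-represented among those who replied.
respondent_means = {"small": 300, "large": 500}
respondent_counts = {"small": 40, "large": 60}

# Hypothetical population benchmarks: most farms are actually small.
benchmarks = {"small": 800, "large": 200}

unadjusted = (
    sum(respondent_counts[s] * respondent_means[s] for s in respondent_counts)
    / sum(respondent_counts.values())
)  # 420.0, pulled upwards by the over-represented large farms
adjusted = post_stratified_mean(respondent_means, benchmarks)  # 340.0
```

The adjustment corrects the imbalance between strata, but it cannot correct for respondents and non-respondents differing within a stratum.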

The main aim of imputation is to produce consistent data without going back to the respondent for the correct values thus reducing both respondent burden and costs associated with the survey. Broadly speaking the imputation methods fall into three groups:

the imputed value is derived from other information supplied by the unit;
values reported by other units can be used to derive a value for the non-respondent (e.g. an average);
an exact value of another unit (called the donor) is used as a value for the non-respondent (called the recipient).

When deciding on the method of imputation it is desirable to know what effect imputation will have on the final estimates. If a large amount of imputation is performed the results can be misleading, particularly if the imputation used distorts the distribution of data.
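The three groups of imputation methods described above can be illustrated with a small sketch. The records, field names and matching rule (same region) are all hypothetical, and each function is a deliberately simple stand-in for its group rather than a recommended procedure.

```python
from statistics import mean

# Hypothetical responding units.
respondents = [
    {"region": "north", "wages": 40, "other_income": 10, "total_income": 50},
    {"region": "south", "wages": 65, "other_income": 5, "total_income": 70},
    {"region": "south", "wages": 60, "other_income": 0, "total_income": 60},
]

# Hypothetical unit with item non-response on total_income.
recipient = {"region": "south", "wages": 55, "other_income": 10, "total_income": None}

def impute_from_own_data(record):
    """Group 1: derive the value from other information supplied by the unit."""
    return record["wages"] + record["other_income"]

def impute_from_mean(donors):
    """Group 2: derive the value from other units' values (here, their average)."""
    return mean(d["total_income"] for d in donors)

def impute_from_donor(record, donors, match_key="region"):
    """Group 3: copy the exact value of the first donor matching on match_key."""
    for donor in donors:
        if donor[match_key] == record[match_key]:
            return donor["total_income"]
    return None  # no suitable donor found

derived_value = impute_from_own_data(recipient)          # 65
mean_value = impute_from_mean(respondents)               # 60
donor_value = impute_from_donor(recipient, respondents)  # 70
```

The three methods give three different values for the same unit, which is one reason the effect of the chosen method on the final estimates should be assessed.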

If at the planning stage it is believed that there is likely to be a high non-response rate, then the sample size could be increased to allow for this. However, the non-response bias will not be overcome by just increasing the sample size, particularly if the non-responding units have different characteristics to the responding units. Post-stratification and imputation also fail to totally eliminate non-response bias from the results.
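One simple way to allow for expected non-response at the planning stage is to divide the number of responses needed by the anticipated response rate. The sketch below assumes that rate can be guessed in advance (for example, from a similar past survey); the function name and figures are hypothetical.

```python
import math

def inflated_sample_size(target_responses, expected_response_rate):
    """Number of units to select so that, on average, target_responses reply."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in the range (0, 1]")
    return math.ceil(target_responses / expected_response_rate)

# Hypothetical plan: 1000 responses needed, half of selected units expected to reply.
units_to_select = inflated_sample_size(target_responses=1000, expected_response_rate=0.5)
```

This protects the precision of the estimate only; as noted above, it does nothing to remove bias if non-respondents differ from respondents.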

Example: Effect of Non-Response
Suppose a postal survey of 3421 fruit growers was run to estimate the average number of fruit trees on a farm. There was an initial period for response and following low response rates, two series of follow up reminders were sent out. The response and results were as follows:

	Responses	Average No. of Trees
Initial Response	300	456
Added after 1 follow up reminder	543	382
Added after 2 follow up reminders	434	340
Total Response	1277

After two follow up reminders there was still only a 37% response rate. From other information it was known that the overall average was 329. The result based on this survey would have been:

	Cumulative Response	Combined Average
Initial Response	300	456
Added after 1 follow up reminder	843	408
Added after 2 follow up reminders	1277	385

If results had been published without any follow-up, the average number of trees would have been too high, as farms with a greater number of trees appear to have responded more readily. With follow-up, more of the smaller farms sent back survey forms and the estimate moved closer to the true value.
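The cumulative figures in this example can be recomputed from the per-wave responses. The sketch below uses only the numbers given in the tables above; because the per-wave averages are themselves rounded, the combined averages are rounded to whole trees.

```python
forms_sent = 3421

# (stage, responses added at this stage, average number of trees among them)
waves = [
    ("Initial Response", 300, 456),
    ("Added after 1 follow up reminder", 543, 382),
    ("Added after 2 follow up reminders", 434, 340),
]

cumulative_responses = 0
cumulative_trees = 0
results = []
for stage, responses, average_trees in waves:
    cumulative_responses += responses
    cumulative_trees += responses * average_trees
    combined_average = round(cumulative_trees / cumulative_responses)
    results.append((stage, cumulative_responses, combined_average))

response_rate = 100 * cumulative_responses / forms_sent  # about 37%

for stage, total, combined in results:
    print(f"{stage}: {total} responses, combined average {combined}")
```

Each follow-up moves the combined average closer to the known overall average of 329, although it remains above it even after two reminders.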
