DATA QUALITY
VALIDATION
The NCI method has been used as programmed by the NCI for the Australian Health Survey: Usual Nutrient Intakes, 2011-12. In using this established model, the ABS, in partnership with FSANZ:
- Checked that the input (2011-12 NNPAS) data was suitable for use in the method and conformed satisfactorily to assumptions of the model. This included checking that the method was robust to the presence of outliers in the data set, and determining that multi-modality was not apparent in the data selected for use.
- Selected a model implementation suitable for the input data and purpose. For more information on the implementation selected see Model Implementation: data used and model specification.
- Checked that the output was plausible and within expected ranges. This included comparing the output distributions against the input distributions, and comparing the results of the model to comparable results of other models, where possible.
To streamline the processing of a large number of nutrients through the model, to produce graphical summaries for error checking, and to produce variances (estimates of sampling error) via the group jackknife, the ABS also programmed SAS macros to complement the NCI method macros. These complementary macros did not alter the functions of the NCI method.
SOURCES OF ERROR
There are four sources of error associated with the usual nutrient intake distribution statistics, each detailed below:
1) Sampling error – the error associated with using a sample, rather than conducting a census (complete enumeration) of the population. It refers to the difference between an estimate for a population based on sample data and the ‘true’ value for that population which would result if a census were taken. Sampling error can be measured and controlled in random samples where each unit has a chance of selection, and that chance can be calculated.
2) Non-sampling error – the error associated with factors other than those related to sample selection, and which cause the data values to not accurately reflect the ‘true’ value for the population. Inaccuracies of this kind can occur in any enumeration, whether it is a census or a sample. Every effort is made to reduce non-sampling error to a minimum by careful design of questionnaires, intensive training and supervision of interviewers, and efficient procedures. Non-sampling error can include (but is not limited to):
- Coverage error: this occurs when a unit in the sample is incorrectly excluded or included, or is duplicated in the sample (e.g. a field interviewer fails to interview a selected household or some people in a household).
- Non-response error: this refers to the failure to obtain a response from some unit because of absence, non-contact, refusal, or some other reason. Non-response can be complete non-response (i.e. no data has been obtained at all from a selected unit) or partial non-response (i.e. the answers to some questions have not been provided by a selected unit).
- Response error: this refers to a type of error caused by respondents intentionally or accidentally providing inaccurate responses. This occurs when concepts, questions or instructions are not clearly understood by the respondent; when there are high levels of respondent or memory recall burden; and because some questions can result in a tendency to answer in a socially desirable way (giving a response which they feel is more acceptable rather than being an accurate response).
Analysis of the 2011-12 NNPAS suggests that, like other nutrition surveys, there has been some under-reporting of food intake by participants in this survey. Given the association of under-reporting with overweight/obesity and consciousness of socially acceptable/desirable dietary patterns, under-reporting is unlikely to affect all foods and nutrients equally. No respondents were excluded from the sample on the basis of low total reported energy intakes (low energy reporters were included in the input data set for usual nutrient intakes). For more information on under-reporting see Under-reporting in Nutrition Surveys.
- Interviewer error: Lack of uniformity in interviewing standards may also result in non-sampling errors. Training and retraining programs, and checking of interviewers’ work, were the methods employed to achieve and maintain uniform interviewing practices and a high level of accuracy in recording answers on the survey questionnaire (see the Interviews section of Data Collection). The operation of the Computer Assisted Instrument (CAI) itself, and the built-in checks within it, ensure that data recording standards are maintained. Respondent perception of the personal characteristics of the interviewer can also be a source of error, as the age, sex, appearance or manner of the interviewer may influence the answers obtained.
- Processing error: this refers to errors that occur in the process of data collection, data entry, coding, editing and output.
3) Prediction error – the variability attributable to the statistical accuracy of the model predictions, given that the NCI framework does not directly survey each person’s usual intake but models it instead. This can also include any model bias introduced by a misspecification of the NCI model. This may result when the choice of the model form is incorrect; a key explanatory variable is left out; or an inappropriate explanatory variable is included. Model bias cannot be explicitly captured; however, every effort was made to ensure that an appropriate model specification was used, through external literature research and statistical testing.
While model bias cannot be explicitly captured, the following should be noted:
- True zero intakes: Based on the data collected, it is not possible to distinguish between individuals who never consume a nutrient or food, and those who sometimes consume that nutrient or food but did not do so on the days they were surveyed (e.g. between people who did not consume alcohol on the day they were surveyed, versus those who avoid all alcohol). In the NCI method, all simulated intakes will have a non-zero value, even if exceedingly small, because the logistic regression used to model the probability of consumption does not predict a zero value.1 For nutrients where there may be true zero usual intakes in the population, this will result in model bias in the extreme lower tail of the distribution.
- Lambda selection: The Box-Cox transformation is a function that transforms data to a near-normal distribution using a parameter called lambda. This parameter affects the strength of the transformation, so that the transformation can be adjusted by the NCI method to suit the characteristics of the input data set. Note that a minimum lambda bound of 0.01 has been set when using this function in the NCI method.2 This minimum lambda value was selected by the method for the pro-vitamin A (beta-carotene equivalents) and caffeine intakes of children under nine years. Selection of the minimum permitted lambda suggests that modelling bias is likely to be greater for these results, because the optimum lambda for the Box-Cox transformation may lie outside the permitted lambda bounds. The usual nutrient intake distributions generated nevertheless remained within plausible ranges.
- Use of uncorrelated versus correlated model type: Use of the uncorrelated model for the alcohol intakes of females 19 years and over has likely resulted in some model bias, particularly towards lower estimated intakes of alcohol at the 90th and 95th percentiles of intake (in comparison with the correlated model). Although there was evidence of correlation between probability of consumption of alcohol and amount consumed, the uncorrelated model was used because the correlated form could not run (failed to converge) for certain replicate weight groups. Estimates, but not sampling errors, could be produced for the correlated model for this group. Comparison of the main weight estimates from both model types shows that the upper end of the distribution was more affected by the model selection. For the 90th and 95th percentiles, the uncorrelated model produced estimates in the order of one gram to nine grams lower than the correlated model (approximately 5% to 15% lower). However, for the 5th to 75th percentiles the estimates from both models were more similar (the uncorrelated model was less than two grams higher than the correlated model for each percentile).
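The effect of the minimum lambda bound described under lambda selection can be illustrated with a minimal Python sketch. It assumes the standard Box-Cox form (r^lambda - 1) / lambda; the function names, the example intake of 250 and the way the bound is enforced are illustrative assumptions, not the NCI SAS macros.

```python
import math

LAMBDA_MIN = 0.01  # minimum lambda bound stated in the text


def box_cox(r: float, lam: float) -> float:
    """Standard Box-Cox transform of a positive intake r.

    Assumption: the bound is applied by flooring lambda at LAMBDA_MIN,
    so the log transform (the lambda = 0 limit) is never used.
    """
    lam = max(lam, LAMBDA_MIN)
    return (r ** lam - 1.0) / lam


def inverse_box_cox(z: float, lam: float) -> float:
    """Back-transform a value produced by box_cox with the same lambda."""
    lam = max(lam, LAMBDA_MIN)
    return (lam * z + 1.0) ** (1.0 / lam)


if __name__ == "__main__":
    intake = 250.0  # hypothetical single-day intake
    # A requested lambda below the bound is replaced by 0.01, which is
    # close to, but not the same as, the natural logarithm.
    print(box_cox(intake, 0.0), math.log(intake))
    # Round trip on the original scale for an interior lambda.
    print(inverse_box_cox(box_cox(intake, 0.3), 0.3))
```

When the data would be best served by a lambda below 0.01, the floored transform is slightly weaker than the optimum, which is consistent with the note that modelling bias is likely to be greater for those results.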
Where comparisons with guideline values (Nutrient Reference Values or NRVs) have been made, any results outside of these guideline values need to be considered along with how the guideline values were established in order to appropriately interpret the quality of the resulting estimates. The NRVs are a set of recommendations made by the Australian National Health and Medical Research Council and the New Zealand Ministry of Health for nutritional intake, based on currently available scientific knowledge. More information on the methods used to derive the NRVs for each nutrient is available here (https://www.nrv.gov.au/home/introduction).
4) Simulation variance – the variability due to simulating different random effects, in order to generate usual intake distributions.
ESTIMATION OF SAMPLING ERROR USING GROUP JACKKNIFE
The relative standard errors and margins of error published alongside the usual nutrient intake statistics are measures of accuracy, and are calculated using the group jackknife variance estimation method.
The group jackknife is a replicate method for estimating the sampling variance associated with the usual intake distribution statistics. It partitions the sample into random replicate subgroups of equal size, forms subsamples by leaving out each of these subgroups in turn, then calculates intake distribution statistics for each of these replicate subsamples (forming a set of replicate estimates). The variability associated with these replicate estimates is then used to calculate the relative standard errors associated with the usual intake distribution statistics. The number of replicates used for the statistics published for the 2011-12 NNPAS is 60, which is typical for social surveys.
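The partition, leave-one-group-out and recombine steps described above can be sketched in Python. This is a simplified illustration, not the ABS implementation: it is unweighted (the survey uses replicate weights), it applies a plain statistic rather than the NCI model, it ignores the area-based clustering of the sample, and the variance factor (G - 1) / G is a commonly used delete-a-group form assumed here. The function names and the simulated data are hypothetical.

```python
import math
import random
import statistics


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def group_jackknife(values, statistic, n_groups=60, seed=0):
    """Return (estimate, standard error, relative standard error in %)."""
    rng = random.Random(seed)
    # Randomly partition the records into n_groups of (nearly) equal size.
    order = list(range(len(values)))
    rng.shuffle(order)
    group_of = [0] * len(values)
    for position, index in enumerate(order):
        group_of[index] = position % n_groups

    estimate = statistic(values)
    # One replicate estimate per group, computed with that group left out.
    replicates = []
    for g in range(n_groups):
        subsample = [v for v, grp in zip(values, group_of) if grp != g]
        replicates.append(statistic(subsample))

    # Assumed delete-a-group jackknife variance formula.
    variance = (n_groups - 1) / n_groups * sum(
        (rep - estimate) ** 2 for rep in replicates
    )
    se = math.sqrt(variance)
    rse = 100 * se / estimate if estimate != 0 else float("nan")
    return estimate, se, rse


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical right-skewed intakes standing in for model output.
    intakes = [rng.lognormvariate(4.0, 0.5) for _ in range(1200)]
    est, se, rse = group_jackknife(intakes, lambda v: percentile(v, 50))
    print(f"median={est:.1f} SE={se:.2f} RSE={rse:.1f}%")
```

With 1,200 records and 60 groups, each replicate drops 20 records; keeping the groups reasonably large is what distinguishes this from the drop-one jackknife, which performs poorly for order statistics such as the median.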
The group jackknife variance estimation method captures sampling error only; the other sources of variance discussed above are not captured. Therefore the sampling errors presented in the data cubes are likely to underestimate the total variance associated with these statistics. However, every effort has been made to ensure the modelling and simulation aspects of the NCI method are sound.
When using replicate weights, there are several approaches for estimating the variance of percentiles, none of which have been universally adopted.3 Variance estimation for order statistics (such as the median) can be difficult, and most estimators have problems associated with them, whether it is from small sample sizes, the large number of iterations required, or stability issues due to limited degrees of freedom. While jackknife variance estimation is known to underestimate sampling variance for order statistics when each replicate subgroup only contains a single record, as in the drop-one jackknife,4 the group jackknife should produce asymptotically unbiased variance estimates provided that the number of sample records within each replicate subgroup is sufficiently large.
The group jackknife variance estimation method was chosen over other variance estimation methods due to the following advantages:
- It takes into account the clustering aspect that is present in area based samples.
- It is able to capture the variance improvements from the skip selections design of the survey vehicle, which other variance estimators fail to do adequately.
- Its use is well established and quality assured in the ABS’s corporate systems.
Caution should also be exercised when using estimates in the tails of the usual intake distributions, as they are likely to be subject to greater error and volatility.
ENDNOTES
1 Tooze, JA et al. 2006, ‘A new statistical method for estimating the usual intake of episodically consumed foods with application to their distribution’, Journal of the American Dietetic Association, vol. 106, pp. 1575-1587.
2 In the NCI method, the lambda is selected such that the within-person errors are normally distributed around mean zero on the transformed scale. For more information see endnote 1. The Box-Cox function is $g(r; \lambda) = (r^{\lambda} - 1)/\lambda$, where r is the input nutrient intake. A minimum lambda bound of 0.01 has been set when using the Box-Cox function. Therefore, although in a Box-Cox transform the limiting case for $\lambda = 0$ is formally defined as the natural logarithm, it has not been employed in the usual nutrient intakes publication.
3 Wolter, Kirk, 1985, ‘Introduction to variance estimation’, Statistics for Social and Behavioural Sciences, Volume XIV, 2nd Edition.
4 Shao, J. & Tu, D., 1995, ‘The Jackknife and Bootstrap’, Springer-Verlag, New York.