Automated industry coding application for new business registrations in Australia
The Australian Bureau of Statistics (ABS) uses the Australian Business Register (ABR) as its primary source of information for identifying new businesses. Information about new businesses flows through to the ABS Business Register, which serves as the frame for business surveys run by the ABS.
During the process of registering with the Australian Taxation Office (ATO), businesses are required to identify their main industry from a pick list and to describe their main business activity in a free-text response. The pick list of industries is based on the industry divisions defined by the Australian and New Zealand Standard Industrial Classification (ANZSIC).
The ABS developed the automated coding process (the Autocoder), in consultation with the ATO, to improve the quality of ANZSIC coding, reduce manual coding and shorten the time required to process information received from clients. The Autocoder was implemented for ANZSIC coding of employing businesses in 2004 and for non-employing businesses in 2006.
The Autocoder uses an ‘exact word’ matching algorithm that looks up activity descriptions in an index file of codes, and the index file is regularly updated. This approach gives better control over the links between codes and descriptions. The ABS is nevertheless continuing to investigate information retrieval technology for searching as a way to improve match rates, since significant savings can be made at an organisational level.
The Autocoder uses both the main activity and main industry information from the Australian Business Number (ABN) registration form. It matches the activity description (or part of it) against an index file of activity descriptions and assigns a four-digit ANZSIC code.
The index file contains word string entries of activity descriptions and is based on the primary activity descriptions in the ANZSIC publication. To increase coverage rates, entries are added based on knowledge gained about how businesses describe their activities.
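The index lookup described above can be illustrated with a minimal sketch. It assumes that each index entry is a set of basic words and that an entry matches when all of its words appear in the activity description. The `INDEX` table, the `match_codes` function and the four-digit codes are hypothetical placeholders, not the ABS index file or real ANZSIC codes.

```python
# Hypothetical index file: each entry is a set of basic words mapped to a
# four-digit code. The codes are placeholders, not real ANZSIC codes.
INDEX = {
    frozenset({"house", "construction"}): "0001",
    frozenset({"cafe"}): "0002",
    frozenset({"hairdressing"}): "0003",
}


def match_codes(description: str) -> set[str]:
    """Return the codes of every index entry whose words all occur in the description."""
    words = set(description.lower().split())
    # An entry matches only if each of its words appears exactly in the description.
    return {code for entry, code in INDEX.items() if entry <= words}
```

Under these assumptions, `match_codes("House construction services")` returns a single code, a description with no indexed words returns an empty set, and a description such as "cafe and hairdressing" returns two codes, which is the multi-match situation discussed later.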
The first stage of the automated coding process runs the activity description through a set of parsing rules. These rules convert the description into a form that is more usable by the Autocoder, improving the probability that the entry will be matched to an ANZSIC code.
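The article does not list the actual parsing rules, so the following is only a sketch of what rules of this kind might look like: lower-casing, stripping punctuation, replacing variant words with a standard form, and dropping words that carry no industry information. The rule tables `STOP_WORDS` and `REPLACEMENTS` and the function `parse_description` are assumptions made for illustration.

```python
import re

# Hypothetical rule tables; the real parsing rules are not given in the text.
STOP_WORDS = {"and", "of", "the", "a", "services", "business"}
REPLACEMENTS = {"hairdresser": "hairdressing", "mfg": "manufacturing"}


def parse_description(raw: str) -> list[str]:
    """Reduce a free-text activity description to a list of basic words."""
    # Lower-case the text and replace anything that is not a letter, digit or space.
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    words: list[str] = []
    for word in text.split():
        word = REPLACEMENTS.get(word, word)  # standardise variants
        if word not in STOP_WORDS and word not in words:  # drop noise and repeats
            words.append(word)
    return words
```

With these illustrative rules, "The Hairdresser & Beauty Services!" is reduced to the basic words `hairdressing` and `beauty`, which are more likely to match an index entry exactly than the raw response.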
During the matching process, multiple matches can occur when the Autocoder matches two or more basic words in an activity description with entries in the index file with different ANZSIC codes. Metadata are used to avoid common multi-matches by identifying certain basic words (or combinations of basic words) and prioritising them to enable common combinations of activity descriptions to be coded.
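One way to picture the use of metadata for multi-matches is a priority table consulted only when the basic words point to more than one code. The sketch below is an assumption about how such a step could work, not the ABS implementation. `WORD_INDEX`, `PRIORITY_RULES`, `resolve` and the codes are hypothetical, and returning `None` for an unresolved multi-match (presumably leaving it for manual coding) is also an assumption.

```python
from typing import Optional

# Hypothetical index of basic words with placeholder four-digit codes.
WORD_INDEX = {"cafe": "0002", "bakery": "0004", "catering": "0005"}

# Hypothetical metadata: when all words of a combination are present,
# the associated code takes priority over the competing matches.
PRIORITY_RULES = [
    (frozenset({"cafe", "bakery"}), "0004"),
]


def resolve(words: list[str]) -> Optional[str]:
    """Return one code for the basic words, or None if no single code can be chosen."""
    word_set = set(words)
    candidates = {WORD_INDEX[w] for w in word_set if w in WORD_INDEX}
    if len(candidates) <= 1:
        # Zero matches gives None; exactly one match gives that code.
        return next(iter(candidates), None)
    # Multi-match: look for a prioritised combination of basic words.
    for combination, code in PRIORITY_RULES:
        if combination <= word_set:
            return code
    return None  # unresolved multi-match (assumed to be left uncoded)
```

Here `resolve(["cafe", "bakery"])` is coded by the priority rule, while `resolve(["cafe", "catering"])` has no applicable rule and stays uncoded.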
Checks on the quality and accuracy of the index and associated coding files are regularly performed. In 2007, the Autocoder coded approximately 50 per cent of new business registrations to a quality level of approximately 97 per cent.
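The two figures quoted above can be read as simple ratios. The sketch below assumes that the coverage rate is the share of registrations that received an automatic code, and that the quality level is the share of checked automatic codes found to be correct. The article does not define either measure, so both definitions and the function names are assumptions.

```python
def coverage_rate(total_registrations: int, autocoded: int) -> float:
    """Share of registrations that the Autocoder assigned a code to (assumed definition)."""
    return autocoded / total_registrations


def quality_level(codes_checked: int, codes_correct: int) -> float:
    """Share of checked automatic codes that were judged correct (assumed definition)."""
    return codes_correct / codes_checked
```

As an illustration with made-up counts, 500 automatically coded registrations out of 1,000 give a coverage rate of 0.5, and 194 correct codes in a checked sample of 200 give a quality level of 0.97.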
The ABS seeks to maintain high-quality ANZSIC coding from the Autocoder through 12-monthly data updates. Regular updates to the Autocoder files are necessary to reflect current ANZSIC coding methodology, increase coverage rates, allow for efficient coding and ensure that the quality of the coding does not deteriorate.
For more information, please contact Grant M Arnold on (02) 6252 6262.