Producing Official Statistics from Linked Data

Experimental estimates and inference based on historical Business Characteristics Surveys

Released
20/08/2024

Introduction

The Australian Bureau of Statistics (ABS) collects data across Australia’s public and business sectors. This article seeks to derive further insights from this data, insights that can inform decisions across the economy and offer additional benefits to the Australian public.

Investigations into the Business Characteristics Survey (BCS) were initiated to find a method for relating individual data items to observable business outcomes. The outcome considered was change in profitability over time, represented by a profitability score. The intention was to identify characteristics common to lower-score businesses in one sample, and to estimate the performance of the resulting statistical model by testing it against another sample.

To do this, a random forest classifier was employed: a machine learning (ML) method used here to classify businesses into lower and higher profitability groups. The random forest classifier uses decision trees to iteratively distinguish the two profitability groups from each other based on other characteristics of those businesses. Each decision of a component decision tree identifies the feature that best splits the two classes. The random forest classifier reflects the majority decision across all of its decision trees, estimating the splitting approach that best distinguishes businesses belonging to the two profitability classes.
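
As an illustration, the following minimal sketch (in Python, with synthetic data rather than BCS data) shows how a random forest’s prediction relates to the votes of its component trees. Note that scikit-learn’s implementation averages the trees’ class probability estimates, which for fully grown trees behaves like a majority vote.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # 200 businesses, 5 characteristics
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 0 = lower, 1 = higher profitability

    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Each component tree casts a vote on the first business's class...
    votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
    # ...and the forest's prediction reflects the majority decision.
    print(votes.mean(), forest.predict(X[:1]))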

This classifier was tasked with identifying BCS features associated with businesses belonging to either a “higher profitability score” group or a distinct “lower profitability score” group. The trained classifier, once developed to make such distinctions reliably, was scrutinised to understand its inner workings and uncover which aspects of the data contribute to these decisions.

This article provides a concise overview of the findings documented in the Technical Report linked here: Producing Official Statistics from Linked Data - Technical Report.

Methodology overview

Data sourced from five consecutive BCSs from 2015-16 to 2019-20 were considered for this investigation, creating a wide panel containing all variables of interest.

The classes (lower profitability and higher profitability) were derived from a profitability score, itself sourced from the profitability comparison responses from 2017-18 to 2019-20. This reduced range was required as otherwise the sample size would have been too small. For each of these cycles, each response option (decreased/stayed the same/increased) was given a value (1, 2, 3, respectively). These values were added together to form the profitability score, ranging from 3 (consecutive decreases) to 9 (consecutive increases). The profitability classes then needed to be projected onto these scores in such a way that the classes were neither too granular nor too imbalanced in size. Profitability scores from 3 to 6 were labelled as “lower profitability”, and those greater than 6 as “higher profitability”.
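
As a minimal sketch, the score and class construction described above can be expressed as follows (response codes as given in the text; the function name is illustrative only):

    def profitability_class(responses):
        """responses: three values in {1, 2, 3}, one per cycle from 2017-18 to 2019-20."""
        score = sum(responses)  # ranges from 3 (consecutive decreases) to 9
        return "lower profitability" if score <= 6 else "higher profitability"

    print(profitability_class([1, 2, 2]))  # score 5 -> lower profitability
    print(profitability_class([3, 2, 3]))  # score 8 -> higher profitability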

In total, 154 data items from across the five survey cycles were considered as predictors of the profitability class. These were ultimately realised as 735 features for the model.
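
The wide panel might be assembled along the following lines. This is a hedged sketch with hypothetical column names (the actual BCS items are described in the Technical Report), pivoting long-form responses so that each item-cycle pair becomes one model feature:

    import pandas as pd

    # Long form: one row per business, cycle and data item (names hypothetical).
    long_form = pd.DataFrame({
        "business_id": [1, 1, 2, 2],
        "cycle": ["2015-16", "2016-17", "2015-16", "2016-17"],
        "item": ["sales_change"] * 4,
        "response": [3, 2, 1, 1],
    })
    # Wide form: one row per business, one column per item-cycle pair.
    wide = long_form.pivot_table(index="business_id",
                                 columns=["item", "cycle"],
                                 values="response")
    print(wide)  # the 154 data items were realised as 735 such columns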

The businesses informing the model were those that participated in all five survey cycles and had a valid profitability score (that is, no missing response elements). This amounted to 1,070 businesses.

The Python programming language was used to implement the random forest classifier available in the scikit-learn package. The sktime framework was also used.

The data was split into a training set and a testing set containing 80% and 20% of the data, respectively. The training set was used to develop the model itself. A splitting metric, entropy, was used to quantify what the “best” decision was at each stage of the trees’ development. This metric measures the purity of each split: a high-entropy split contains a mix of businesses from both classes, while a low-entropy split contains businesses belonging largely to one of the profitability classes. Low-entropy splits are the desired result. Each feature’s contribution to these splits was expressed as a normalised mean feature importance (FI) which, together with a confidence interval (a measure of the model’s reliability), was used to retain only the most informative features.
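
The following sketch illustrates these steps on synthetic stand-in data (the real unit record data are not public). The entropy calculation and the 80/20 split follow the text; attaching confidence intervals to the FIs via the spread across component trees is an assumption made here for illustration, with the project’s exact procedure documented in the Technical Report.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def entropy(class_proportions):
        # H = -sum(p * log2(p)): 0 for a pure split, 1 for a 50/50 mix.
        p = np.asarray(class_proportions, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    print(entropy([0.5, 0.5]))  # 1.0: impure, uninformative split
    print(entropy([1.0]))       # 0.0: pure split, the desired result

    # Synthetic stand-in: 1,070 businesses by 735 features, with the class
    # linked to two features so the selection step has something to find.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 4, size=(1070, 735)).astype(float)
    y = (X[:, 0] + X[:, 1] > 3).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = RandomForestClassifier(criterion="entropy", random_state=0)
    model.fit(X_train, y_train)

    # feature_importances_ is already normalised to sum to one. Assumed here:
    # a confidence interval from the spread across the component trees.
    per_tree = np.stack([t.feature_importances_ for t in model.estimators_])
    ci = 1.96 * per_tree.std(axis=0) / np.sqrt(len(model.estimators_))
    informative = model.feature_importances_ - ci > 0
    print(informative.sum(), "features retained")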

The test set, kept independent of the model’s development, was used to estimate the model’s performance on new data. The features of the test set businesses were input to the model to predict the profitability class. As each business in the test set belongs to a known profitability class, these true classifications can be compared with those predicted.
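
Continuing the sketch above, the held-out test set is scored and the predicted classes compared with the true ones via a confusion matrix:

    from sklearn.metrics import accuracy_score, confusion_matrix

    y_pred = model.predict(X_test)
    print(accuracy_score(y_test, y_pred))    # share of correct predictions
    print(confusion_matrix(y_test, y_pred))  # rows: true class; columns: predicted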

Results

To provide a baseline comparison, a naïve approach to classification was applied, creating a dummy classifier. The training of this classifier ignores all features, assigning the training set’s most common classification (lower profitability) to all inputs. Applying this classifier to the test dataset yielded an accuracy of 57.01%.
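
Continuing the sketch above, this baseline corresponds to scikit-learn’s DummyClassifier with the “most_frequent” strategy, which ignores all features and always predicts the training set’s majority class:

    from sklearn.dummy import DummyClassifier

    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)         # learns only the majority class
    print(baseline.score(X_test, y_test))  # the text reports 57.01% for the BCS data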

The feature selection process described earlier produced a dataset of seventeen features, with the final random forest composed of 463 decision trees. A more detailed breakdown of this process, with examples, is available in the Technical Report.
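
Continuing the sketch, a final forest can be refitted on the retained features. Setting n_estimators to 463 matches the forest size reported above; how that value was tuned in the project is covered in the Technical Report.

    final_model = RandomForestClassifier(n_estimators=463, criterion="entropy",
                                         random_state=0)
    final_model.fit(X_train[:, informative], y_train)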

Figure 1: Confusion matrix presented as a table for the trained random forest classifier applied to the test set data.

Test result                                                     Count   % of test dataset
Lower profitability correctly classified (True Negative)         107               50.00
Lower profitability incorrectly classified (False Positive)       15                7.01
Higher profitability correctly classified (True Positive)         57               26.64
Higher profitability incorrectly classified (False Negative)      35               16.36

Applying this classifier to the training set yielded a classification accuracy of 81.78%, and applying it to the test set yielded an accuracy of 79.44%. The classifier correctly identified 107 lower profitability businesses out of 122 (87.70%), a stronger result than the 63 out of 92 higher profitability businesses identified (68.48%). This suggests that the features found by the classifier may be better at predicting lower profitability than higher profitability. Additional discussion and an investigation into creating additional sample classification rules can be found in the Technical Report.

The features with the largest positive mean normalised FIs are shown in Figure 2.

The largest mean normalised FIs in Figure 2 relate to features from multiple survey cycles, but each compares a business attribute to the previous year (productivity and sales income). As with the breakdown of the Profitability variable discussed earlier, responses include Not applicable [0], Decreased [1], Stayed the same [2] and Increased [3]. The remaining smaller mean normalised FIs relate to features with a 0 or 1 response. 
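
Continuing the earlier sketch, a ranking like that behind Figure 2 can be produced by sorting the retained features by mean normalised FI; in practice each column index would be mapped back to its BCS item and survey cycle.

    order = np.argsort(final_model.feature_importances_)[::-1]
    for rank, idx in enumerate(order[:5], start=1):
        print(rank, idx, round(final_model.feature_importances_[idx], 4))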

Economic Interpretation

The features with the highest importance scores are strongly correlated with the variable being classified: the profitability group. Given the results shown in Figure 1, the most reliable interpretation of these values is in terms of the negative “lower profitability” result.

Grouping like features, a firm is more likely to be in the lower profitability group if: 

  • income from goods and services is flat or falling; 
  • the total number of jobs is flat or falling; 
  • productivity is flat or falling. 

Generally, these results are consistent with high-level intuition – higher productivity in particular should generally result in higher returns to factors of production (labour and capital, with capital benefiting the most in the short run).  

Sales of goods and services and the number of jobs are less direct signals, but still in line with expectations. The relationship found by the classifier suggests two conclusions. Firstly, respondents are interpreting the survey’s profitability question as total (i.e. dollar) profitability, rather than a profitability rate. Secondly, increasing the scale of the business (sales and jobs) increases profitability. While neither conclusion is surprising, they are not guaranteed by the structure of the survey, and the confirmation from the classifier provides a useful check on the internal consistency of businesses’ responses.

Conclusions and future considerations

This pilot project has demonstrated a methodology that uses ML classification of BCS data to better understand the implications of the responses businesses provide to the survey. This capability is of enormous value, as it means that existing survey responses can be used to create new statistical insights.

Three areas of further work could be considered by the ABS and the broader research community: 

  • Future iterations of the BCS (and similar surveys) could consider this use in their design. This could involve asking questions or processing responses in a way that lends itself more naturally to this type of analysis (see the Technical Report for further discussion). 
  • With more time series data, time series classification approaches could be used, allowing stronger causal conclusions to be drawn about the classifier results. In the existing analysis it is hard to determine the direction of causation. 
  • Finally, with further refinement, these techniques could allow for the construction of new headline statistics based on the classifier approaches. This would build on the existing work done in the Digital Intensity Index and Innovation Index in the 2021-22 and 2020-21 BCS publications. 

Acknowledgments

The author thanks Eugene Schon (Methodology Division, ABS) for reading an early draft of the Technical Report and providing useful comments. Gratitude also goes to Franz Király for several useful discussions about sktime features over the course of the project. The author would also like to acknowledge the work of the Technology, Innovation and Business Characteristics Statistics section for their assistance preparing this document, with special thanks to Adam Hill and Rocco Borino.

Dr Jason Whyte
Senior Statistical Analyst
Business Statistics Production and Futures Branch

References

Whyte, Jason M. Producing Official Statistics from Linked Data - Pilot Project: Technical Report. Technology, Innovation and Business Characteristics Statistics, Australian Bureau of Statistics. 2024. 

Python Software Foundation. Python Language Reference, version 3.10.9. Available at https://www.python.org. 

Pedregosa, F., et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 2011, Vol. 12, pp. 2825-2830. 

Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J. and Király, F. sktime: A unified framework for machine learning with time series. Workshop on Systems for ML at NeurIPS 2019, Vancouver, Canada, 2019. 

Király, F., et al. sktime v0.30.2. Zenodo. [Online] 2024. https://zenodo.org/records/12653146. 

