Producing Official Statistics from Linked Data — Technical Report

Investigating if the Business Characteristics Survey data could relate to the profitability of individual businesses

Released
20/08/2024

Executive summary

This project investigated one thread of the general proposition that Australian Bureau of Statistics’ Business Characteristics Survey (BCS) data could relate to outcomes for individual businesses. This investigation used BCS data for financial years (FY) 2015-16 to 2019-20.  

Owing to what seemed a relatively direct relationship between a BCS data item and an outcome, we used compprof_bcs to consider the relative profitability of 1070 “in-scope” businesses over the three FYs 2017-18 to 2019-20. Profitability values permit the assignment of each business to either a “high profitability score” group, or a “lower profitability score” group. This casts the investigation as a binary classification problem, followed by an interrogation of how classifications were made. The existence of any features having certain values for only one group of businesses (that is, features with a high “feature importance”, FI) indicates a systematic difference between groups. 

Random forests were used for classification. Given the dataset features, “raw” FIs were determined using the “permutation” method. Individual trees in a random forest can produce different raw FI values for a given feature. Hence, we calculated the mean and standard deviation of raw FIs associated with individual features. To aid interpretability, raw FIs for all features were used to produce “normalised” mean FIs that sum to 1.  

The dataset contained several highly correlated features, which can distort results. Accordingly, a systematic process was used to remove such features. Features were also removed following the calculation of normalised FIs if their 95% confidence interval was not completely above zero. The random forest was then refit to the reduced training set. 

Following several rounds of this process, we arrived at a reduced dataset of 17 features. A random forest classifier applied to a training set and test set obtained from this reduced dataset showed adequate performance (classification accuracies of approximately 82% and 79%, respectively).  

Proceeding to an inspection of FIs showed that the five largest mean normalised FIs (signalling the most influential features) relate to two different “relative change” features for time points in the last three years of the dataset. Placed in decreasing order, compprod_bcs (“Productivity”) accounts for the first, fourth, and fifth entries. The second and third-largest entries relate to compsogs_bcs (“Income from the sales of goods and services”). Most of the remaining features have far smaller mean normalised FIs, indicating a smaller influence on classifier performance. However, these may still make some contribution to understanding how classification occurs, and feature in the decision rules that conclude the report.

1. Background and preliminaries

This technical report accompanies the summary of outcomes of a project that investigated one thread of the general proposition that Business Characteristics Survey (BCS) data can relate to business outcomes¹​. We have specifically used BCS data (accessible through BLADE² ) to investigate the relationship between BCS variables (or “features”) and a derived quantity relating to business profitability. We propose that a machine learning (ML) classifier can recognise those features of BCS data that contribute to a business being regarded as either in a “high profitability score” group, or a distinct “lower profitability score” group. 

Our proposed groupings inform the “labels” for businesses in our classification problem. We form these labels with reference to the BCS variable “compprof_bcs”. This variable holds each business's assessment of its current year's profit relative to the previous year's. Aside from certain troublesome values (causing the removal of some businesses from the dataset), valid responses are that profit has 1. decreased, 2. stayed the same, 3. increased. 

Initial investigations considered the performance of various classifiers, and approaches to preparing data for use with these. The task for each “pipeline” combining data processing and a classifier was to train the classifier on part of the (appropriately processed) data (the training set), reserving a portion of the data for testing (the test set, not seen in training). A standard way to judge a pipeline’s value is to consider the “classification accuracy” it can produce. We can validate predictions and compute classification accuracy here as we can compare labels predicted by the classifier with the true labels across all businesses considered. More specifically, we determine whether a pipeline that showed acceptable accuracy in classifying training-set instances can also produce an acceptable classification accuracy for the test set.  We may consider a classifier to be useful if its accuracy can substantially outperform naïve methods similarly applied to the data. 

Given the aim of the project and time constraints, attention was confined to classifiers able to directly produce “feature importances” (FIs): a measure of the influence individual features have on the classification. Within the time available for experimentation, a random forest classifier showed the best test-set accuracy. As this result was considered acceptable, we could proceed to interrogate the FIs. 

The remainder of this document is organised as follows. Section 2 provides an overview of the processing steps applied to BCS data that yielded the dataset considered in this study. Limitations of the data are also noted. Section 3 provides an overview of classification results obtained with a baseline method, and a brief description of the use of a random forest classifier and associated results. Section 4 presents the FIs that result from the trained random forest classifier found in Section 3. The body of the paper concludes with some discussion of results in Section 5.  

Appendix A summarises the data processing used to replace missing values in certain features where these are a consequence of the path of a respondent through the survey. In such cases, a negative response to one question permits us to replace a missing value code in data items for related questions with a negative response. This type of process has substantially reduced the amount of missingness in some features, and either reduced the amount of infilling necessary, or led to features being retained that would otherwise have been excluded due to an unacceptably large number of missing values.

2. Business Characteristics Survey data and its treatment

2.1 Defining the in-scope dataset

BCS data has many variables that are common to each of the financial years (FYs) 2015-16 to 2019-20, making this a suitable period for investigation. 

The in-scope population of businesses in the BCS data are those which: 

  A. Have their identifier (BCS feature “id”) appear in data for each of the FYs from 2015-16 to 2019-20. 
  B. Have compprof_bcs for each of the FYs 2017-18, 2018-19, and 2019-20 taking an informative value (i.e. in the range from one to three inclusive). 

These conditions led to an in-scope dataset of 1070 businesses. 

The decision to use BCS data over the selected five-year period was informed by preliminary inspection. This prompted us to make certain decisions about which features to include in the study dataset. Selecting features that are available for each year of the chosen dataset (that is, not missing for every business in the in-scope sample due to BCS changes) avoids systematic missingness and permits the study to be framed as a time-series classification problem. However, many BCS features are not present in all FYs of the dataset.  

It was decided to limit the data range to the FYs 2015-16 to 2019-20³ ⁴ ⁵ ⁶ ⁷ so that a substantial number of features were potentially available for exploratory analysis. A substantial number of features appearing during this time were available for the entire dataset. However, to broaden the features considered, it was decided to retain some variables that were available for every FY except 2019-20.   

Following these choices, it was decided to retain only those variables which had less than 20% of values missing. This cutoff was applied with reference to whether a feature was available for only four years, or for all five years in the BCS data. (See Section 2.2 for more on how systematic missingness was treated before we applied this judgement.) The exceptions were: 

  1. id (this was required for all businesses in the in-scope dataset), 
  2. compprof_bcs (see the discussion of labels above). 

The result was an in-scope dataset of 154 features, with 119 available over the five FYs and 35 systematically missing from the 2019-20 data. Not included in these features are “id” and the FY (which serve as indices in the dataset), or compprof_bcs, which is used to produce our labels. 

2.2 Imputation of missing values

There are two particular cases where infilling of missing values is necessary. In the first case, BCS data has various instances where a specific reason causes a BCS feature to have a substantial number of missing values. In such cases, we can use the logic of the BCS to confidently replace a missing value with another value that is logically consistent with the respondent’s other answers. This has substantially reduced the missingness for various features, permitting the retention of many that would otherwise be omitted. The specific processes used to address missingness for certain features are described in Appendix A. 

The second case where infilling is required is when data is “missing at random”. Of the 154 features retained, 138 are categorical (possibly ordinal) and typically take values from a small range of integers (examples of “low-cardinality features”). Missing values for such a feature at a given time point were replaced by the mode of the feature at that time point. The remaining 16 features (examples include busopyr_bcs, busownyr_bcs, empoth_bcs, empprop_bcs, empsaldr_bcs, and emptotal_bcs) are numerical, and have a larger range (“high-cardinality features”). Aside from busopyr_bcs (business years of operation, regardless of ownership) and busownyr_bcs (number of years operating under current ownership), which were handled separately, missing values in a numerical feature at a given time point were replaced by the median of that feature at that time point.  
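The infilling described above amounts to a per-year mode or median replacement. A minimal sketch of this step is given below, assuming a pandas DataFrame in “long” format with one row per business and financial year; the frame and column names are illustrative only.

    import pandas as pd

    # Illustrative missing-value marker; the BCS codes missing values as 999999999.
    df = df.replace(999999999, pd.NA)

    for col in categorical_features:
        # Low-cardinality features: replace with the mode at each time point.
        df[col] = df.groupby("fy")[col].transform(
            lambda s: s.fillna(s.mode().iloc[0]))

    for col in numerical_features:
        # High-cardinality features (excluding busopyr_bcs and busownyr_bcs,
        # which were handled separately): replace with the yearly median.
        df[col] = df.groupby("fy")[col].transform(lambda s: s.fillna(s.median()))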

2.3 Assignment of instance labels

To obtain the label for a business in the in-scope population, their most recent three years (that is, from FYs 2017-18, 2018-19, and 2019-20) of compprof_bcs values were summed to yield an overall “profitability” score. Recall from the introduction that useful compprof_bcs values for this study are: 1 (decreased), 2 (stayed the same), and 3 (increased). As such, the profitability score can range from three to nine. The distribution of scores across the in-scope business population is shown in Figure 1. 

To aid classification, it is appropriate to avoid: 

  • subdividing the dataset too finely (creating many potential categories, where some have few instances), and  
  • producing categories of very different sizes (“unbalanced” classes).  

Mindful of these points, we used the profitability score to divide businesses into two categories of similar sizes. A business having a score in the range 7-9 defines that business as belonging to the “higher profit score” category (label “1” for implementation). Alternatively, a score in the range 3-6 assigns a business to the “lower profit score” category (label “0”). 

Following this assignment, 461 businesses have label 1, and 609 businesses have label 0. 
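As a concrete sketch of the score and label assignment (assuming the long-format frame from Section 2.2, with illustrative names):

    # One row per (id, fy); compprof_bcs is in {1, 2, 3} for in-scope rows.
    recent = df[df["fy"].isin(["2017-18", "2018-19", "2019-20"])]
    score = recent.groupby("id")["compprof_bcs"].sum()   # ranges over 3..9

    # Scores 7-9 give the "higher profit score" label (1); 3-6 give "lower" (0).
    label = (score >= 7).astype(int)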

The intent of using an aggregate score is that it permits the formation of categories able to capture general trends in business profitability. That is, the total can allow a business to have one particularly good (or bad) year on the compprof_bcs scale, and still be included in the lower (or higher) score category due to its overall performance.  

We can demonstrate how the scoring system allows for some variability in performance by considering the membership of the higher-score category in detail. The category is comprised of businesses with reported profitability resulting in: 

  • score 9: profit increased for each of the three years, 
  • score 8: profit increased for two years in the range and unchanged profit for one year, 
  • score 7: profit increased for two years and decreased for one year, or, profit increased for one year and unchanged for two years. 

Notably the component with a score of seven may admit businesses with an unprofitable year to the category. We can justify this possibility through a thought experiment considering businesses with a strong commitment to matters like workforce training, innovation or diversifying their markets. We would expect such commitment to support profitability over time. However, such businesses could still have a decrease in profitability in one isolated year due to a generally unfavourable business climate that year. As such, we choose to include such a business in the high-score category, rather than include it in a category of businesses which have a profit decrease more regularly. 

An extension of the above argument is that businesses with a generally low profitability should not be confused with other businesses due to one uncharacteristically good year. In a similar manner to the above, consider the lower-score category, composed of businesses with profitability reported with the result: 

  • score 6: three years of unchanged profitability, or, one year each of unchanged, increased, and decreased profitability, 
  • score 5: one year of decreased profitability and two years unchanged, or, two years of decreased profitability and one year of increase,   
  • score 4: two years of decreased profitability, one year unchanged, 
  • score 3: three years of decreased profitability. 

As before, there is some variability in this category as the components with a score of five or six can admit businesses that have one profit increase to the group. However, such businesses do not share the characteristic of higher-score businesses of having more profit increases than decreases. 

This study was subject to certain limitations which are outlined in the next subsection. 

2.4. Data-related study limitations

This pilot study was subject to certain constraints, such as the time available, and was limited to BCS data. Other constraints followed from the nature of the choices made or the data itself. We shall document some study limitations below, expecting that this may be useful for any further work.  

2.4.1 In-scope dataset

Recall Section 2.1’s discussion of how the in-scope dataset was defined. Conditions A and B employed there led to the exclusion of certain businesses from the dataset. We shall consider these exclusions, and any consequences, below.  

Condition A acts to exclude businesses from the in-scope dataset that are not present in the BCS data for each FY from 2015-16 to 2019-20. As such, we have systematically excluded any business which: 

  • ceases trading in any year of the study range, or, 
  • commences trading after the 2015-16 FY but before the end of the 2019-20 FY. 

These conditions are potentially significant to this study. Consider the first condition. Suppose that some businesses were wound up at some point during the FYs from 2015-16 to 2019-20 due to being inadequately profitable. (That is, such businesses have outcomes similar to those of our lower profit score group.) Further suppose that such businesses had certain characteristic features. Then, the exclusion of such businesses from our in-scope dataset has potentially limited our discovery of how BCS features are associated with unprofitable businesses.   

In a similar manner, the second condition renders us unable to scrutinise features of new businesses, which may have different patterns in their BCS data compared to those seen for older businesses.  

Condition B was required in defining the in-scope dataset due to the range of possible values for compprof_bcs. In addition to the values suitable for use in forming our profit score, compprof_bcs could also be assigned: 

  • 0 – “Not applicable”, which may apply for e.g., businesses in their first year of operation. 
  • 7777777 (seven 7s) – “Ticked more than one box”. Lacking access to the raw data, it was not possible to determine which of the multiple responses to include in a dataset. 
  • 999999999 (nine 9s) – “Missing”. 

As it is not possible to use the last two codes in our profit score, businesses showing either of these at any point in FYs 2017-18 to 2019-20 cannot be added to the in-scope dataset. The zero-value response is also problematic in the current scoring system. A business with one or more incidences of compprof_bcs =0 may not belong to the group that commenced trading after the 2015-16 FY, but must also be omitted from the dataset. This further limits our ability to consider features of the BCS data for newer businesses.  

Adding the currently excluded businesses to the in-scope data would require a means of managing particular patterns of missing data. This matter was beyond the scope of this pilot study. 

However, future work may include such businesses by replacing the total score used here with another metric. One possibility is to assign each business the average of its non-zero compprof_bcs values in the studied range. However, unlike our total score (recall the discussion in Section 2.3), such a metric would not base its judgements on three consecutive years for all businesses. A further study may be required to judge the suitability of other means of using compprof_bcs scores to produce business categories.  

2.4.2 Feature compprof_bcs, and others permitting the “stayed the same” response

Recall the discussion of compprof_bcs (relating to profit in the current year compared to that of the previous year) in Section 2.3. In particular, recall three of its possible values: 1 (Decreased), 2 (Stayed the same), and 3 (Increased).  

ABS domain experts have advised that the survey question which produces compprof_bcs is seeking a “subjective measure” of business profitability, and there is no guidance provided to respondents on how to answer the question. 

Possibilities “1” and “3” are unambiguous if a business is consistently profitable, and if profit is calculated using the same method each FY. 

However, we may wonder how a respondent could record relative profitability in the event of any years when their business made a loss. For example, would a loss followed by a larger loss receive “1”? Or, would a loss followed by a smaller loss receive “3”? 

Beyond this, and more critically, we may query the “2” response. It is unlikely that an active business will produce exactly the same profit for two consecutive years. As such, a “2” response could potentially result from a respondent’s subjective interpretation of a year that delivered a small reduction in profit or a modest increase, where such a difference is not considered large enough to justify a “1” or “3” response. That is, many of the collection of “2” responses should be “1” or “3”, making the “2” response less than ideal for a project such as this one. 

Notably, 16 other features seeking to measure a change from year to year also permit a “2” to represent “Stayed the same”. These are listed in the same documentation block as compprof_bcs, and include: 

  • compsogs_bcs: “Income from the sales of goods and services”, and 
  • compprod_bcs: “Productivity”. 

As such, these features may suffer from the limitation noted above for compprof_bcs. 

Data analysis projects could benefit from removing the type of response ambiguity described above. Such ambiguity may be lessened by providing respondents with guidance. For example, in the case where a profit was recorded in the survey year and the previous year, respondents may be advised to respond with “2” if the current year’s profit is within (say) two percent of last year’s profit (whether positive or negative), and otherwise to recognise a sufficiently large change in profitability to respond with “1” or “3” as appropriate.

2.4.3 BCS features with missingness not treated due to uncertainty or time constraints 

Although it was possible to recognise and remediate various situations where a feature was subject to systematic missingness (see Appendix A), there may have been other opportunities for similar remediation. 

The features considered below all have the “missed due to sequencing” option (eight 8s) in the in-scope dataset, and each is available for most, if not all, years of the dataset. 

  1. innocoll_bcs: “Business collaboration for innovation (No/Yes)” – this feature had a substantial number of 88888888 codes. The reason for this was unknown at the time of the analysis, and remains unknown.  

It is possible that a “No” response associated with internet_bcs (“Internet use (No/Yes)”) caused the following features to exhibit 88888888 codes:  

  1. record_bcs: “Receive orders via the internet (No/Yes)”, 
  2. plorder_bcs: “Place orders via the internet (No/Yes)”, 
  3. socmedpr_bcs: “Social media presence (No/Yes)”, 
  4. wepbres_bcs: “Web presence (No/Yes)”. 

Awareness of systematic missingness, and the development of an approach to remediate this, occurred some way into this pilot study. Any further study should benefit from addressing this aspect of a prospective dataset prior to data analysis. 

3. Classification methodology and results

3.1 Overview

A preliminary study compared the performance of various classification pipelines on a slightly smaller version of the in-scope dataset (one obtained before the correction of systematic missingness). In this initial stage the data was in a “panel” format. That is, we could consider the data for each business id as a matrix with rows corresponding to the five FYs, and columns corresponding to the features under investigation. Much of the original Python⁸ code, intended to investigate time-series classification approaches (as in the module sktime⁹ ¹⁰), was written under the expectation that panel data would be used. 

However, the short time series available encouraged the consideration of more standard (“tabular”) methods, as seen in Python module scikit-learn¹¹​. This further encouraged a change to the data format, permitting the inclusion of 35 previously ignored variables that were systematically missing 2019-20 FY values.  

The result was to transform the data into a “wide” format. That is, each row represents a business id as before, but now each column is the value of a feature at a particular year in the studied range. This format assisted the removal of the feature-year combinations that were systematically missing. 

The new data format was suitable for study with a pipeline containing a random forest classifier (RFC). An RFC provides a flexible approach that can also produce feature importances. The time-series classification code written for the initial stage of the project was able to accommodate the new data format by using sktime’s “ColumnConcatenator” to combine all columns of data for an instance into one long column for that instance. The pipeline also included “RandomForestClassifier”, a standard tabular classification method implemented in scikit-learn. 
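A minimal sketch of a pipeline of this shape follows, using sktime's SklearnClassifierPipeline (the class named in the legend of Figure 3); the exact construction used in the project code may differ.

    from sklearn.ensemble import RandomForestClassifier
    from sktime.classification.compose import SklearnClassifierPipeline
    from sktime.transformations.panel.compose import ColumnConcatenator

    # Concatenate all columns of an instance into one long column, then pass
    # the result to a standard tabular random forest classifier.
    pipeline = SklearnClassifierPipeline(
        classifier=RandomForestClassifier(),
        transformers=[ColumnConcatenator()],
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)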

A disadvantage of changing the data format was that the data became incompatible with certain naïve classifier approaches (e.g. Naïve Bayes) that were used earlier in the study. Owing to the limited time available for this project, we did not consider how to resolve this problem. It may be possible to achieve this in any further study. 

3.2 Classification methodology

Our reserved test set data was 20% of the dataset (214 instances), leaving 856 instances for the training set. The train-test split was “stratified” to ensure that each of the train and test sets had approximately the same proportion of instances with a “0” label. 
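In scikit-learn terms, this split can be sketched as follows (X and y denoting the feature table and labels):

    from sklearn.model_selection import train_test_split

    # Reserve 20% for testing; stratify on y so the train and test sets keep
    # approximately the same proportion of "0" labels.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)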

The dataset was prepared for classification by converting the 154 features over five time points into a total of 770 features (feature-year combinations), which was subsequently reduced to 735 features (neglecting the 35 features that were not available for 2019-20). In the discussion to follow, we show the time point associated with a feature by appending a value from 0 (indicating FY 2015-16) to 4 (FY 2019-20) to the feature name. 
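A sketch of the wide-format conversion, assuming a long-format frame with the year index t already coded 0-4 (names illustrative):

    # 'long_df' has columns: id, t (0 = FY 2015-16, ..., 4 = FY 2019-20), features...
    wide = long_df.pivot(index="id", columns="t")

    # Flatten the (feature, year) column pairs into names such as
    # "compprof_bcs2" for compprof_bcs in FY 2017-18.
    wide.columns = [f"{feature}{t}" for feature, t in wide.columns]

    # Drop the 35 feature-year combinations systematically missing for 2019-20.
    wide = wide.dropna(axis="columns", how="all")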

Naïve classification methods provide classification results quickly. They also provide a baseline classification accuracy against which predictions from other classifiers (which can be more complicated and time consuming to apply) can be compared. We considered one such naïve method, described below. We also describe our use of a random forest classifier pipeline. 

3.2.1 A “baseline” classifier

The sktime “DummyClassifier” ignores all features, assigning the training set’s most common label to all instances. As such, it is not possible to tune this method in search of better results.  
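This baseline is equivalent to the scikit-learn call sketched below (sktime's DummyClassifier wraps the same behaviour):

    from sklearn.dummy import DummyClassifier

    # Predict the training set's most common label for every test instance.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    baseline.score(X_test, y_test)   # 0.5701 here: 122 of 214 test labels are "0"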

3.2.2 Random forest classifier methodology

A preliminary inspection of data characteristics suggested that highly correlated features could conspire to conceal useful information.  That is, some feature importance that should be attributed to a particular feature can be spread around a number of highly correlated features, making all appear relatively unimportant. As such, it was appropriate to undertake some processing of the dataset before applying a classifier. 

Given the dataset’s 735 features, it is not feasible to show the correlations of pairs of features here. However, an inspection showed that there are a substantial number of highly correlated features. A recommended approach to managing this situation is to consider pairwise correlations, compare a measure of these against a user-set threshold value, form clusters of comparable features, and then to retain only one feature per cluster¹².  

Earlier in this project, experimentation with the above process of omitting features followed by refitting a random forest classifier showed that this can produce higher values for the largest FIs compared to results from the original fit. However, there may be an associated decrease in the classification accuracy obtained for this reduced data.  

To explore this trade-off, some experimentation with removing features was undertaken. This required trialling thresholds that were used to produce feature clusters. We settled on a threshold that produced a reduced dataset of 328 features. The correlations between distinct features in this dataset ranged between (approximately) -0.75 and +0.75. 
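A condensed sketch of this clustering process, following the scikit-learn example cited above¹²; the threshold value shown is illustrative, standing in for the value we trialled:

    import numpy as np
    from scipy.cluster import hierarchy
    from scipy.spatial.distance import squareform
    from scipy.stats import spearmanr

    # Pairwise Spearman correlations between the training-set features.
    corr = spearmanr(X_train).correlation
    corr = (corr + corr.T) / 2        # enforce symmetry
    np.fill_diagonal(corr, 1.0)

    # Convert correlations to distances and cluster the features hierarchically.
    distance = squareform(1 - np.abs(corr), checks=False)
    linkage = hierarchy.ward(distance)

    # Cut the dendrogram at a trialled threshold; keep one feature per cluster.
    cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion="distance")
    keep = [np.flatnonzero(cluster_ids == c)[0] for c in np.unique(cluster_ids)]
    X_train_reduced = X_train.iloc[:, keep]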

A randomised grid-search method was used in hyper-parameter tuning of our classifier pipeline. This process aimed to find an adequate fit to training data and included a five-fold cross validation process to manage overfitting.  

The randomised grid search applied to the reduced-data training set drew 400 samples from the following grid: 

    ccp_alpha: stats.uniform(0,2) 

    criterion: ['gini','log_loss','entropy'] 

    max_depth: stats.randint(3,50) 

    max_leaf_nodes: stats.randint(2,40) 

    max_features: ['sqrt','log2', None] 

    max_samples: stats.uniform(0.001,1.0) 

    min_samples_leaf: stats.randint(1,10) 

    min_samples_split: stats.randint(2,10)  

    n_estimators: stats.randint(5,500) 

    and had bootstrap = True and oob_score = False. 
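Assembled in code, the search has the following shape. For brevity the sketch tunes RandomForestClassifier directly; in the pipeline of Section 3.1, each parameter name would carry the usual step prefix. Note that stats.uniform(loc, scale) spans [loc, loc + scale], so the max_samples distribution can occasionally draw values just above 1.0; such candidates fail to fit and are scored nan under RandomizedSearchCV's default error handling.

    from scipy import stats
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_distributions = {
        "ccp_alpha": stats.uniform(0, 2),
        "criterion": ["gini", "log_loss", "entropy"],
        "max_depth": stats.randint(3, 50),
        "max_leaf_nodes": stats.randint(2, 40),
        "max_features": ["sqrt", "log2", None],
        "max_samples": stats.uniform(0.001, 1.0),
        "min_samples_leaf": stats.randint(1, 10),
        "min_samples_split": stats.randint(2, 10),
        "n_estimators": stats.randint(5, 500),
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(bootstrap=True, oob_score=False),
        param_distributions,
        n_iter=400,   # 400 samples drawn from the grid
        cv=5,         # five-fold cross validation to manage overfitting
        n_jobs=-1,
    )
    search.fit(X_train, y_train)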

3.2.3 Feature importances methodology 

Scikit-learn/sktime documentation most often describes two approaches to producing “raw” FIs from a fitted random forest classifier.  

Mean raw FIs are produced directly by RandomForestClassifier. These relate to “the mean decrease in impurity” (MDI) associated with features. The FI value for a given feature is calculated for each tree in the random forest. It is also possible to calculate the standard deviation of the FIs for each feature.   

A weakness of the MDI method of calculating FIs is that features with a high cardinality (some are present in this study) can unduly influence results. An alternative approach is to use the “permutation” method. In this method, values of features are shuffled between instances in (e.g.) the test set, and the trained classifier is used to predict labels for the test set. This process is repeated for a user-specified number of shuffling trials. If a feature is relatively unimportant, then changing its relationship to the labels will not have a substantial effect on the classifier’s performance. However, if there is an important relationship between a feature and labels, then the disruption caused by shuffling will reduce the classifier’s ability to predict labels correctly. 

We used the permutation method with the trained classifier and 20 shuffling trials to produce “raw” FIs for the test set. 

To aid interpretability, we also normalised the raw FIs for all features so that they sum to 1. As a result, we can interpret each positive normalised feature importance as the percentage contribution of that feature to improving the performance of the classifier. 

We note that when an FI is negative, this suggests that the corresponding feature adversely influences the performance of the classifier. As such, it was appropriate to remove some features from our dataset and to refit the classifier. In order to judge which features should be retained, we formed 95% confidence intervals for the mean normalised feature importances. We retained only those features which had a confidence interval that was entirely above zero. In this project it was necessary to undertake multiple rounds of classifier fitting and discarding features to arrive at a set of features that did not show any large negative values for mean normalised FIs. 
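A sketch of this procedure follows. The report does not fix a particular confidence-interval construction; the normal approximation over the 20 shuffling trials used below is our assumption.

    import numpy as np
    from sklearn.inspection import permutation_importance

    # Raw FIs from 20 shuffling trials on the (unseen) test set.
    result = permutation_importance(
        fitted_pipeline, X_test, y_test, n_repeats=20, random_state=0)

    # Normalise mean raw FIs (and their standard deviations) to sum to 1.
    total = result.importances_mean.sum()
    fi_mean = result.importances_mean / total
    fi_std = result.importances_std / total

    # Retain only features whose 95% CI sits entirely above zero.
    half_width = 1.96 * fi_std / np.sqrt(20)
    keep = (fi_mean - half_width) > 0
    X_train_kept = X_train.loc[:, keep]   # refit the forest to this reduced set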

The next section considers the results obtained from a random forest classifier pipeline. We present some results for different versions of the dataset to show the effect of the feature reduction described here and in Section 3.2.2. 

4. Results

Classification results are shown in Section 4.1, followed by a discussion of feature importances in Section 4.2. 

4.1 Classification results

Throughout this document we report classification accuracy to two decimal places. 

Applying the baseline classifier (Section 3.2.1) to our test-set data yielded an accuracy of 57.01%.  

In our pipeline, the RandomForestClassifier() has default hyper-parameters: 

     bootstrap: True, 

     ccp_alpha: 0.0, 

     class_weight: None, 

     criterion: 'gini', 

     max_depth: None, 

     max_features: 'sqrt', 

     max_leaf_nodes: None, 

     max_samples: None, 

     min_impurity_decrease: 0.0, 

     min_samples_leaf: 1, 

     min_samples_split: 2, 

     min_weight_fraction_leaf: 0.0, 

     n_estimators: 100, 

     n_jobs: None, 

     oob_score: False, 

     random_state: None, 

     verbose: 0, 

     warm_start: False 

Fitting the pipeline with this default classifier to the entire training set for 20 random states produced 100% classification accuracy on each occasion. Under the same conditions, the best test-set accuracy was 77.57%. The gap between training and test-set accuracy suggests that the classifier is overfitted to the training data, and hence it is appropriate to use an approach to classifier training that can control overfitting. Across the trial of random states, the mean test-set accuracy was 72.62%, with a range of 8.41 percentage points between the best and worst results. It is also appropriate to be mindful of such variability in later model evaluation.  

Recall (see Section 3.2.2) that we removed a large number of highly correlated features from the original dataset to produce a “reduced” dataset. Applying the randomised grid search to this reduced dataset (328 features) found the best fit, associated with hyper-parameters: 

    ccp_alpha: 0.06861698171455322, 

    criterion: 'log_loss', 

    max_depth: 3, 

    max_features: None, 

    max_leaf_nodes: 9, 

    max_samples: 0.1792330248103885, 

    min_samples_leaf: 5, 

    min_samples_split: 8, 

    n_estimators: 478

The trained classifier produced training-set accuracy of 78.27% and test-set accuracy 76.64%. The confusion matrix for test-set classifications is shown in Figure 2.

Figure 2: Confusion matrix presented as a table, obtained by applying the trained classifier to the “reduced” test set.

Test result | Count | % of test dataset
Lower profitability correctly classified (True Negative) | 107 | 50.00
Lower profitability incorrectly classified (False Positive) | 15 | 7.01
Higher profitability correctly classified (True Positive) | 57 | 26.64
Higher profitability incorrectly classified (False Negative) | 35 | 16.36

Classification results obtained for the reduced dataset showed only a minor reduction in classification accuracy compared to that obtained for the full dataset. 

We proceeded to undertake rounds of removing features from the dataset and refitting the classifier to the training set obtained from this modified data (as described in Section 3.2.3). The final iteration of this process reduced a dataset of 33 features to 17. The randomised grid search used in fitting the random forest classifier to the training set derived from this data gave its best result with hyper-parameters: 

    ccp_alpha: 0.00680739735905278, 

    criterion: 'entropy', 

    max_depth: 8, 

    max_features: 'sqrt', 

    max_leaf_nodes: 28, 

    max_samples: 0.6416751819212193, 

    min_samples_leaf: 6, 

    min_samples_split: 7, 

    n_estimators: 463 

Associated with this fitted classifier is a training-set accuracy of 81.78%. The best test-set accuracy, found by calculating accuracy across a list of random states, was 79.44%, with a range of 2.34 percentage points. The results suggest that the classifier is not overfitted to the training set. Also, the random forest classifier clearly outperforms the baseline classifier. Notably, the accuracies obtained are better than those obtained for the dataset of 328 features, noted above. 

Further views of classifier performance are shown in Figures 3 and 4. Receiver Operating Characteristic (ROC) curves for the selected classifier applied to the training and test sets are shown in Figure 3, with a comparison of Area Under the Curve (AUC) on training (blue) and test set (orange) data for the best grid search random forest applied to a reduced feature set.

Figure 3: Classifier performance for the data set of 17 selected features.

A line graph plotting the false positive rate along the x-axis and the true positive rate along the y-axis. The blue line (representing the SklearnClassifierPipeline applied to the training set, AUC = 0.90) has a higher true positive rate than the orange line (the same pipeline applied to the test set, AUC = 0.85).


We can obtain further detail on our selected classifier's performance by considering results for classifying the higher profitability (label “1”) and lower profitability (label “0”) businesses. The confusion matrix for the application of our classifier to the 17-feature test set is presented as a table in Figure 4. We note from the top row that the classifier correctly classified 107 “0” instances out of 122 (87.70%). Classifier performance was not as convincing for the “1” instances, with only 63 instances out of 92 classified correctly (68.48%).

Figure 4: Confusion matrix presented as a table for the trained classifier applied to the test set data of 17 features.

Test result | Count | % of test dataset
Lower profitability correctly classified (True Negative) | 107 | 50.00
Lower profitability incorrectly classified (False Positive) | 15 | 7.01
Higher profitability correctly classified (True Positive) | 63 | 29.44
Higher profitability incorrectly classified (False Negative) | 29 | 13.55

The results obtained for the reduced dataset suggest that lower-profitability score businesses have some characteristic(s) that enables a classifier to recognise these businesses with a degree of specificity. As such, it may be possible to extract information from the classifier that can show a systematic difference between our groups of businesses. Towards this, we shall consider the FIs obtained from the trained classifier in the next subsection. 

4.2 Feature importances and commentary

The mean normalised FIs taking a value of at least 1% are shown in Figure 5. Two features made a very small negative contribution to the total FI. Approximately 76% of classifier performance is due to the five features with largest mean normalised FIs. In decreasing order of size, these are: 

compprod_bcs (2017-18) > compsogs_bcs (2019-20) > compsogs_bcs (2017-18) > compprod_bcs (2019-20) > compprod_bcs (2018-19). 

Figure 6 shows 95% confidence intervals for the mean normalised FIs shown in Figure 5. (Figures of this type informed decisions on which features to retain when experimenting with the dataset.)

Figure 5: Largest mean normalised feature importances obtained for the test set of 12 features. Each black line shows an interval of +/- one standard deviation of the FIs from the mean.


Comparing the largest mean normalised feature importances obtained for the test set of 12 features. Each bar shows a black line representing an interval of plus or minus one standard deviation of the feature importance from the mean. The variable compprod_bcs2 has the highest mean accuracy decrease, followed by compsogs_bcs4 and compsogs_bcs2. The highest standard deviation is for compprod_bcs3, followed by compprod_bcs2 and compprod_bcs4. 

Figure 6: 95% confidence intervals for the largest mean normalised feature importances obtained for the test set.


Comparing the largest mean normalised feature importances obtained for the test set of 13 features, with a greater focus on the confidence intervals. Each point shows a blue line representing the 95% confidence interval of the feature importance.

In the next section we delve into features of the results obtained for the reduced dataset. 

5. Towards simple classification rules

Recalling the confusion table of Figure 4, the classifier used has the greatest skill in recognising the low-profitability-score businesses. Accordingly, we can consider feature values commonly associated with true negatives and examine how this pattern differs from the feature values seen for other groups. If certain characteristics are only common to lower-score businesses, this points us towards a hypothesis of how BCS features may influence the consistency of profit growth from year to year.

5.1 Test-set features

Some notable characteristics of the true negative (TN) group are: 

  a. it is very uncommon to have compprod_bcs3 =3, and  
  b. extremely uncommon to have compprod_bcs3 = compprod_bcs2 =3. 

Transposing condition b to the false negative (FN) group, we do not see any instance of compprod_bcs3 = compprod_bcs2 =3. That is, it is quite common for businesses classified as having label “0” (correctly (TN) or incorrectly (FN)) to take the value of two or less for both compprod_bcs3 and compprod_bcs2.  

Let us now consider characteristics of the True Positive (TP) group. Some distributions and associations of the values of the most important features found earlier are shown in Figure 7. Each cell of the 2d-histograms (shown below the leading diagonal of column graphs) is associated with a particular value of each feature, as shown on the horizontal and vertical scales. Darker cell colours indicate larger counts. 

Consider the facet of Figure 7 showing the column graph of compprod_bcs2 values, shown at the top of the third column. This shows that it is very common to have compprod_bcs2 =3, values of 0 or 2 are far less common, and 1 does not occur. 

Similarly, the column graph for compprod_bcs3 (top of the fourth column of Figure 7) shows this distribution, like that for compprod_bcs2, to take the value 3 quite often. However, 2 values are more common here than was seen for compprod_bcs2, and there is a small number of 0 or 1 responses. 

Given the individual occurrences of 3 values across compprod_bcs2 and compprod_bcs3, let us consider the joint distribution, shown in the facet of column 3, row 2 of Figure 7. Unsurprisingly, it is quite common to have compprod_bcs2 =3 occur with compprod_bcs3 equal to 2 or 3 compared to the other possibilities. Further, we do not see certain combinations, such as compprod_bcs2 = compprod_bcs3 =1. Recalling the discussion above, we conclude that these characteristics may assist us in distinguishing businesses in the TP group from the TN and FN groups. 

Figure 7: For the True Positive group, a graphical summary of distributions of, or pairwise associations for, selected features from the reduced test set.


A graphical summary for the True Positive group of pairwise associations for selected features from the reduced test set. The graphical summary has four rows and five columns. Each row corresponds to a feature (compprod_bcs4, compprod_bcs3, compprod_bcs2 and compsogs_bcs4), while each column also corresponds to a feature (compsogs_bcs2, compsogs_bcs4, compprod_bcs2, compprod_bcs3, compprod_bcs4). Each pair of features has its own histogram, with values of 0-3 inclusive running along both the x-axis and the y-axis. Each cell of the histograms is associated with a particular value of each feature. Cells where a value occurs are shaded blue: the darker the shade, the larger the count. At the top of each column of histograms, descending in a diagonal from left to right, are five column graphs summarising the distribution of values for each feature in the column. For example, it can be seen from the column graphs that features compprod_bcs2 and compprod_bcs3 are more likely to have a value of 3 compared to other values.

This suggests that the occurrence of values below 3 for both compprod_bcs3 and compprod_bcs2 is a substantial contributor to why a classifier assigns a business to the lower-profitability group. Also, judging from Figure 4, this judgement is often correct. 

We may wonder about features of the data that impede classifier performance. Inspection of the test-set data shows that TNs and FNs have a far higher incidence of zero values (denoting “Not applicable”) for compprod_bcs2, compprod_bcs3, and compprod_bcs4 than is seen for the FP and TP groups. The presence of such zeros may have disrupted the classifier's ability to associate classification rules with some of the most important features for classifier performance. Zero values can also limit the practicality of certain decision rules, such as comparing a sum (say of the three compprod values) against a threshold. In this example, a low sum (quite possible in the FN case) may not indicate a “0” business, limiting a classifier's ability to discriminate between businesses. 

Similarly, the ambiguity of the “2” response for certain features (which could contain some mix of relative increases and decreases) may have impeded the classifier's ability to learn patterns in data. Recall the ordering of mean normalised feature importances from Section 4.2: 

compprod_bcs2 > compsogs_bcs4 > compsogs_bcs2 > compprod_bcs4 > compprod_bcs3.  

Consider the ratios of “2”s to other responses for these features in Figure 7. We can consider two groups. The first has the smallest proportion of 2s: compsogs_bcs2 does not have any, while compsogs_bcs4 and compprod_bcs2 each have quite a small proportion. The second group, containing compprod_bcs3 and compprod_bcs4, shows a notably larger proportion of “2”s than the first group. Also, the second group's features have lower mean normalised FIs than those of the first group. Although further formal analysis is required, based on the TP breakdown for features with larger normalised mean FIs, there may be an inverse relationship between the size of normalised mean FIs and the associated proportion of “2” responses for a feature. It is reasonable to hypothesise that BCS features having a high proportion of “2” responses (recall Section 2.4.2) are somewhat unhelpful for a classifier study such as this one.  

Other types of data visualisation can also offer insights into characteristics of “0” and “1” businesses that we may use in formulating simple classification rules. Figure 8 shows column graphs of values taken by certain features (some high FI, some from a lower-importance group) across the False Positive, True Positive, True Negative, and False Negative groups, as judged by application of the classifier to the test set. Whilst we cannot consider pairwise distributions here, we can make various comparisons between groups for single features. For example, the TP group is far more likely to have compjobs_bcs2 =3 than any other group.  

Figure 8: A sample plot for showing differences in the values of features for groups found by application of the trained classifier to the test set. Features finassub_bcs and hampopro_bcs only have No/Yes (0/1) responses.


The sample plot shows the differences in the values of features for groups found by application of the trained classifier to the test set. The plot contains 20 bar graphs with four rows and five columns. Each row represents a particular group (False Positive, True Positive, True Negative and False Negative). Each column represents a particular feature (compsogs_bcs2, compsogs_bcs4, compjobs_bcs2, finassub_bcs4 and hampopro_bcs4). Each individual graph has “Count” on the y-axis at intervals of 20 from 0-80, and “Value” on the x-axis at intervals of 1 from 0-3. The plot allows for various comparisons between groups for single features. For example, the TP group is far more likely to have compjobs_bcs2=3 than any other group.

5.2 Candidate decision rules

Informed by plots such as Figures 7 and 8, we trialled some decision rules on the test set and retained those which led to finding relatively homogeneous groups of businesses. In essence this resulted in a partial decision tree, where only paths leading to “leaf” nodes (shown with a thick, black border) of low heterogeneity were retained. Given further exploration, it may be possible to improve on these rules. Results are summarised in Figure 9, with the rules provided below this Figure. 


Figure 9: A partial decision tree applied to the test set, informed by features recognised as having the highest mean normalised feature importance. The shorthand #0 (#1) represents the number of low-score (high-score) businesses in a subset of the data. Decision paths that did not lead to a high proportion of either low-score or high-score businesses are omitted. Orange (blue) shading approximates the proportion of low-score (high-score) businesses in a subset of the test set.

A partial decision tree applied to the test set, informed by features recognised as having the highest mean normalised feature importance (see Section 5). The shorthand #0 represents the number of low-score businesses in a subset of the data, while the shorthand #1 represents the number of high-score businesses in a subset of data. Decision paths that did not lead to a high proportion of either low-score or high-score businesses are omitted. Orange shading approximates the proportion of low-score businesses in a subset of the test set, while blue shading approximates the proportion of high-score businesses in a subset. The decision tree has five levels. Branching decision points are labelled as numbered rules.

Level 1: 

Rule 1: compsogs (2019-20) < 3 OR compsogs (2017-18) = 1 OR hampopro (2019-20) = 1. 

The complement of the set defined by Rule 1 is found by Rule 2: 

Rule 2: compsogs (2019-20) = 3 AND compsogs (2017-18) ≠ 1 AND hampopro (2019-20) = 0. 

Level 2:  

Rule 1.1: compprod (2019-20) < 3 OR compjobs (2017-18) = 1 

Rule 2.1: compprod_bcs summed over 2017-18 to 2019-20 ≥ 7. 

Path to leaf: Using the sequence of restrictions formed by Rule 2 AND Rule 2.1, we can define an overall rule H1 aimed at recognising higher-score businesses: 

    Step 1. compsogs (2019-20) = 3 AND compsogs (2017-18) ≠ 1 AND hampopro (2019-20) = 0, 

    Step 2. compprod_bcs summed over 2017-18 to 2019-20 ≥ 7. 

Discussion: Step 1 retains those businesses which did not decrease their sales of goods and services in 2017-18, increased these sales in 2019-20, and did not have to lower their profit margins to remain competitive in 2019-20. 

Of these businesses, Step 2 retains those which reported a decrease in productivity at most once in the period 2017-18 to 2019-20.  

Leaf result: Applying the overall rule H1 to the test set leads us to correctly recognise 29/92 (≈31.5%) high-profit score businesses. The associated misclassification of low-score businesses is 3/122 (≈2.5%). This is quite a discriminating rule. 
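Expressed as a boolean filter on the wide-format test set (using the year-suffix naming of Section 3.2, so that hampopro_bcs4 is the 2019-20 value), rule H1 is:

    # Step 1 (Rule 2): sales increased in 2019-20, did not decrease in 2017-18,
    # and profit margins were not lowered to remain competitive in 2019-20.
    step1 = ((test["compsogs_bcs4"] == 3)
             & (test["compsogs_bcs2"] != 1)
             & (test["hampopro_bcs4"] == 0))

    # Step 2 (Rule 2.1): at most one productivity decrease over 2017-18 to 2019-20.
    step2 = test[["compprod_bcs2", "compprod_bcs3",
                  "compprod_bcs4"]].sum(axis=1) >= 7

    rule_h1 = step1 & step2   # True predicts the "higher profit score" label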

Level 3:  

Rule 1.1.1: 0 < compprod_bcs (2019-20) < 3. 

Path to leaf: Using the sequence of restrictions formed by Rules 1, 1.1, and 1.1.1, we can define an overall rule (L1) aimed at recognising lower-score businesses: 

    Step 1. compsogs (2019-20) < 3 OR compsogs (2017-18) = 1 OR hampopro (2019-20) = 1, 

    Step 2. compprod (2019-20) < 3 OR compjobs (2017-18) = 1,  

    Step 3. 0 < compprod_bcs (2019-20) < 3. 

Discussion: Step 1 of L1 excludes businesses that increased income from sales of goods and services in 2019-20, did not decrease these sales in 2017-18, and did not lower their profit margins to remain competitive in 2019-20 (the same set retained by Step 1 of H1). The condition does allow for variability in income from sales over time. 

Similarly, Step 2 further excludes businesses that both increased profitability in 2019-20 and had the total number of jobs and positions in 2017-18 stay steady or increase. 

Step 3 omits businesses that increased productivity in 2019-20 (as well as the indeterminate case of a “not applicable” response). 

Leaf result: This overall rule leads us to correctly recognise 72/122 (≈59%) of lower-score businesses. The associated misclassification of higher-score businesses is 7/92 (≈7.6%). This rule has some ability to discriminate between groups of businesses. 

Rule 1.1.2 finds the complement of the set defined by Rule 1.1.1. 

Rule 1.1.2: compprod_bcs (2019-20) equals 0 or 3. 

Level 4:  

Rule 1.1.2.1: sum compprod from 2017-18 to 2019-20 ≤ 6. 

Path to leaf: Using the sequence of restrictions formed by Rules 1, 1.1, 1.1.2, and 1.1.2.1 we define a second overall rule (L2, which shares its first two steps with L1) aimed at recognising one type of low-score businesses: 

    Step 1. compsogs (2019-20) < 3 OR compsogs (2017-18) = 1 OR hampopro (2019-20) = 1, 

    Step 2. compprod (2019-20) < 3 OR compjobs (2017-18) = 1,  

    Step 3. compprod (2019-20) equals 0 or 3, 

    Step 4. sum compprod from 2017-18 to 2019-20 ≤ 6. 

Leaf result: This overall rule leads us to correctly recognise 17/122 (≈14%) lower-score businesses. The associated misclassification of higher-score businesses is 4/92 (≈ 4.3%). This rule has some ability to discriminate between businesses. 

Rule 1.1.2.2 finds the complement of the set defined by Rule 1.1.2.1. 

Rule 1.1.2.2:  sum compprod_bcs from 2017-18 to 2019-20 > 6. 

Level 5:  

Rule 1.1.2.2.1:  

For 2019-20: finassub_bcs + hampopro_bcs + skuscbus_bcs + compcont_bcs ≥ 2, OR 

compjobs_bcs (2017-18) ≥ 2 AND compsogs_bcs (2019-20) ≥ 2. 

Path to leaf: Using the sequence of restrictions formed by Rules 1, 1.1, 1.1.2, 1.1.2.2, and 1.1.2.2.1, we can define a second overall rule (H2) aimed at recognising higher-score businesses: 

    Step 1. compsogs (2019-20) < 3 OR compsogs (2017-18) = 1 OR hampopro (2019-20) = 1, 

    Step 2. compprod (2019-20) < 3 OR compjobs (2017-18) = 1, 

    Step 3. compprod (2019-20) equals 0 or 3,  

    Step 4. sum compprod from 2017-18 to 2019-20 > 6, 

    Step 5. For 2019-20: finassub + hampopro + skuscbus + compcont ≥ 2, OR 
                compjobs (2017-18) ≥ 2 AND compsogs (2019-20) ≥ 2. 

Discussion: Notably, Steps 1 and 2 of H2 (subsetting the test set down to level 2 of Figure 9) favour a higher proportion of lower-profit businesses. However, many are excluded by later steps to produce a clear majority of higher-profit businesses later on the path. Step 3 is useful as it retains the compprod (2019-20) = 0 businesses that cannot be categorised with rules seeking high values of features or sums of values. Step 4 works to largely separate the lower and higher-profit businesses at level 4 of the decision tree into different nodes. Step 5 shows that, as we have already made most simple and decisive splits at higher levels of the tree, further splits may need to become more complicated. The first condition of Step 5 considers some lower-importance features, trialling a condition where a business may have two or more positive responses to a No/Yes question for four individual features, without specifying which. The second condition of Step 5 is included as an attempt to discriminate against the low values seen for certain features in low-score businesses. 

Leaf result: Using the overall rule H2, 16/92 (≈17.4%) higher-score businesses are classified correctly. The associated misclassification of lower-score businesses is 4/122 (≈3.3%). This rule has some ability to discriminate between businesses. 

Combining rules L1 and L2: 89/122 (≈73%) lower-score businesses are recognised correctly, with 11/92 (≈12%) misclassified higher-score businesses. 

Combining rules H1 and H2:  45/92 (≈48.9%) higher-score businesses are recognised correctly with 7/122 (≈5.7%) misclassified lower-score businesses. 

Together, the four rules correctly recognise 134/214 (≈62.6%) businesses, with 18/214 (≈8.4%) businesses incorrectly categorised. Considering only those 152 businesses screened by these rules (≈71% of the total in the test set), 134/152 (≈88.2%) were correctly recognised. 

6. Discussion and conclusions

The feature importances (FIs) with the highest scores are correlated strongly with the variable being classified, which is profitability group. Given the results shown in Figure 4, the strongest interpretation of these values is to think of them in terms of the negative ‘low profitability’ result.

Grouping like features, a firm is more likely to be in the lower profitability group if:

  • income from goods and services is flat or falling; 
  • total number of jobs is flat or falling; and/or 
  • productivity is flat or falling. 

Generally these results are consistent with high-level intuition – higher productivity in particular should generally result in higher returns to factors of production (labour and capital, with capital benefiting the most in the short run). 

Sales of goods and services and number of jobs are less direct signals, but still in line with expectations. The relationship found by the classifier suggests two conclusions. Firstly, profitability is being interpreted in the question asked in the survey as total (i.e. dollar) profitability, rather than a profitability rate. And secondly, increasing the scale of the business (sales and jobs) increases profitability.  While neither conclusion is surprising, they aren’t guaranteed from the structure of the survey, and the confirmation provided by the classifier provides a useful check on the internal consistency of the businesses responding.

This project successfully piloted a methodology that uses ML classification of BCS data to arrive at relatively simple rules relating business features to business outcomes. Despite their simplicity, these rules (outlined in Section 5.2) show some ability to discriminate between those businesses which generally do not increase their profit from year to year, and those which do. This result suggests that it is possible for the ABS or data users to obtain additional insights from ABS data assets, subject to certain qualifications.  

Notably, the process used to collect a dataset such as the BCS was not designed to produce data for a classification exercise.  As such, any extension of this project for ABS datasets will benefit from some consideration of the nature of that data. For example, it may be appropriate to review the range of possibilities available for certain survey questions, and how these are coded in a dataset, so that the dataset is more amenable to ML methods.  

The development stage of this project considered “time-series classification” approaches. These were abandoned as the requirement of more than five years of annual data for each business led to an unhelpfully small dataset. The relatively late abandonment of time-series methods left little time to incorporate features that were not present for each of the FY from 2015-16 to 2019-20 into the modelling approach. There may be some benefit in considering how such features can contribute to classification accuracy.  

As this was a pilot project, much effort was spent on data wrangling and experimenting with classification methods in search of one that could achieve a satisfactory accuracy. Considerably less time was available for the development of a method for interrogating the trained random forest, which may contain many complex rules. Further development should consider two key areas. The first is to expand the approach for recognising dependencies between features. The second is to consider how to include a larger range of possible conditions in composing grouping guidelines. Together, these areas could lead to more discriminating versions of the rules presented in Section 5.2. 

7. Acknowledgements

The author thanks Eugene Schon (Methodology Division, ABS) for reading an early draft of this report and providing useful comments. Gratitude also goes to Franz Király for several useful discussions about sktime features over the course of the project.

Dr Jason Whyte
Senior Statistical Analyst
Business Statistics Production and Futures Branch

8. References

1. Whyte, Jason M. Producing official statistics from linked data. Technology, Innovation and Business Characteristics Statistics, Australian Bureau of Statistics. 2024. 

2. Australian Bureau of Statistics. Business Longitudinal Analysis Data Environment (BLADE). 

3. —. Microdata: Business Characteristics Survey (2015-16). 2017. 

4. —. Microdata: Business Characteristics Survey (2016-17). 2018. 

5. —. Microdata: Business Characteristics Survey (2017-18). 2019. 

6. —. Microdata: Business Characteristics Survey (2018-19). 2020. 

7. —. Microdata: Business Characteristics Survey (2019-20). 2021. 

8. Python Software Foundation. Python Language Reference, version 3.10.9. Available at https://www.python.org. 

9. Löning, Markus, Anthony Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines and Franz Király. sktime: A unified framework for machine learning with time series. Vancouver, Canada, 2019. 

10. Király, F, et al. sktime v0.30.2. zenodo.org. [Online] 2024. https://zenodo.org/records/12653146. 

11. Pedregosa, F, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011, Vol. 12. 

12. Scikit-learn developers. Permutation Importance with Multicollinear or Correlated Features. scikit-learn.org. [Online] https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html. 

13. Australian Bureau of Statistics, Technology, Innovation and Business Statistics. SMURF-MURF Contents, Historical to 2022. [Excel spreadsheet, unpublished] 

9. Appendix

A. Data processing to address systematic missingness when the outcome is quite predictable or certain

Glossary

Data aspects

Classifier terminology
