This is the first part of a two-part series covering all things categorisation. Part one will cover the what, why and some of the how with part two going deeper on the technical nitty-gritty of how we approach categorisation at Credit Kudos.
Let’s start by defining what we mean by categorisation. The term categorisation is used to describe the classification of financial transactions, labelling each transaction with a pre-defined category label.
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
The immediate benefit of categorisation is in deriving meaning from a set of transactional data. For example, knowing that I spend 35% of my monthly income on restaurants might prompt me to cut down. Similarly, I might wish to label larger, recurring expenditure such as rent in order to track my fixed monthly costs, or even track this information for credit scoring purposes. Categorisation is also applied to income, for example breaking it down into salary, benefits, and other tertiary income such as part-time or freelance work.
This is by no means a new problem. Personal Financial Management (PFM) tools that help consumers aggregate and manage personal finances aren’t new; Intuit’s Quicken and QuickBooks launched in the early ’80s. In recent years, the passing of the revised Payment Services Directive (PSD2) and the introduction of Open Banking standards in the UK have prompted a renewed focus on the categorisation problem.
Introduced by the CMA as part of the retail banking market investigation, Open Banking is designed to increase competition by allowing users to easily compare, apply, switch and save. This goes far beyond PFM, using the data as a basis for validation and origination use cases (for example, applying for credit).
Sadly, Open Banking doesn’t provide categorisation out of the box. In fact, although Merchant Category Codes (MCCs) are included in the official specification, there is no requirement for banks to provide the field if it is not present in their online banking interface. This leaves any business wishing to tap into the Open Banking APIs needing to obtain categories elsewhere. In addition, the stakes are raised, as labelling a transaction incorrectly could be the difference between a loan being accepted or declined.
So, how do you measure your efficacy and be sure you’re getting it right? Before looking at measurements, we need to think about how we’re going to stress test our classifier with “real” data (bearing in mind that this is a moving target — merchants and payees come and go all the time).
The first thing we can look at is measuring accuracy — what percentage of labelled transactions were correct. The way we can test this is by applying our classifier to a set of ring-fenced transactions (i.e. not used in the training of the model).
Example of a categorisation model applied to a set of test transactions.
The problem with taking this measure at face value is that it is entirely dependent on the size and breadth of the test set. In the example below, we achieve a 90% accuracy with a completely useless model that labels all transactions as supermarket:
This model yields 90% accuracy despite labelling everything as “SUPERMARKET”.
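The same effect can be reproduced in a few lines of Python. This is a minimal sketch with made-up data: the ten labels and the `always_supermarket` function are hypothetical, chosen so that nine of the ten test transactions really are supermarket spend.

```python
# Hypothetical test set: 9 supermarket transactions and 1 restaurant transaction.
true_labels = ["SUPERMARKET"] * 9 + ["RESTAURANT"]


def always_supermarket(description: str) -> str:
    """A deliberately useless 'model' that ignores its input."""
    return "SUPERMARKET"


# The description is irrelevant to this model, so an empty string is passed.
predicted = [always_supermarket("") for _ in true_labels]

correct = sum(p == t for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)
print(f"accuracy = {accuracy:.0%}")  # prints "accuracy = 90%"
```

The score says more about how skewed the test set is than about the model: the single restaurant transaction is mislabelled, yet the headline number still looks healthy.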
Measuring accuracy alone also means we overlook the coverage, or recall, of our model. For example, if my classifier only labels transactions it is certain of (say, “Sainsbury’s” as supermarket), we might end up with 100% accuracy but only a small subset of transactions receiving a label, which is not very useful.
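To make the gap between accuracy and coverage concrete, here is a small sketch of a classifier that abstains unless it recognises the merchant exactly. The lookup table, the merchant names other than Sainsbury's, and the category labels are all invented for illustration.

```python
from typing import Optional

# Hypothetical lookup of merchants the classifier is certain about.
KNOWN_MERCHANTS = {"SAINSBURY'S": "SUPERMARKET"}


def cautious_classifier(description: str) -> Optional[str]:
    """Return a label only for an exact known-merchant match; otherwise abstain."""
    return KNOWN_MERCHANTS.get(description.upper())


# Hypothetical (description, true label) pairs.
test_set = [
    ("Sainsbury's", "SUPERMARKET"),
    ("Corner Grocer", "SUPERMARKET"),
    ("Pizza Place", "RESTAURANT"),
    ("Acme Lettings", "RENT"),
]

predictions = [(cautious_classifier(desc), truth) for desc, truth in test_set]
labelled = [(pred, truth) for pred, truth in predictions if pred is not None]

# Accuracy is measured only over the transactions that received a label.
accuracy_on_labelled = sum(pred == truth for pred, truth in labelled) / len(labelled)
# Coverage is the share of all test transactions that received any label.
coverage = len(labelled) / len(test_set)

print(f"accuracy on labelled = {accuracy_on_labelled:.0%}, coverage = {coverage:.0%}")
```

On this toy data the classifier is right every time it speaks, but it only speaks for one transaction in four.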
Clearly, there is a trade-off between precision (accuracy) and recall. The more transactions of the overall set we label (greater recall), the greater the chance of getting one wrong (lower precision). We need a way to capture this trade-off and measure the overall efficacy of the model. For this, we can use the F1 score. The F1 score (also known as F-score or F-measure) is the harmonic mean of precision and recall, so it is only high when both are high. An F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.
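As a quick illustration, the standard F1 formula, 2 × precision × recall / (precision + recall), can be written as a small helper. The function name and the example figures below are illustrative only, not taken from any real model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative figures: perfect precision but only 25% recall.
narrow = f1_score(1.0, 0.25)
# Illustrative figures: slightly lower precision, much higher recall.
balanced = f1_score(0.9, 0.8)

print(round(narrow, 2))    # prints 0.4
print(round(balanced, 2))  # prints 0.85
```

Because the harmonic mean is dragged towards the smaller of the two inputs, a model cannot score well by maximising precision and ignoring recall, or the other way round.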
The F1 score still falls down without a sufficiently large and representative test set, so be sure to check what data are being used to test any product before taking the results at face value (better yet, supply your own data for the test). Don’t settle for “our categorisation is 95% accurate” without drilling into the detail. All good technology companies should be happy to discuss their approach to testing with you.
So, we know what categorisation is, why it’s important, and some basic measurements that can be applied to a classifier. In the next part of this series, we’ll discuss how we approach categorisation at Credit Kudos. Since first aggregating transactions in 2015, we’ve developed a number of sophisticated approaches specifically tailored to credit decisioning. Stay tuned!