The views expressed in this article reflect those of the author and not those of his employer or colleagues.
Money laundering remains a significant global problem as criminals of all stripes have strong financial incentives to innovate and adapt to the changing detection methods employed by financial institutions, law enforcement, and government actors. According to Deloitte’s Anti-Money Laundering Preparedness Survey Report 2020, the amount of money laundered in one year is estimated at between 2% and 5% of global GDP ($800B – $2.0T). In a December 2018 joint statement, the US federal banking agencies encouraged innovative industry approaches to addressing Anti-Money Laundering (AML) and Combating the Financing of Terrorism (CFT). Banks have responded by exploring applications of new technologies to enhance onboarding, due diligence, and transaction monitoring. Some of these innovations have made their way into production and are making a difference in how we, as a collective force, combat the use of the financial system by criminals.
This article, the first of three planned pieces addressing behavior detection (BD) in AML/CFT, fraud, and cybercrime, will focus on how banks align their organizations and processes to better leverage advances in data and behavior classification techniques in their transaction monitoring programs. In particular, it will explore several ways that the current alignment of these functions could be optimized to improve the responsiveness and effectiveness of BD programs.
At this point, it is important to acknowledge my good fortune to work with many very bright and talented individuals in the AML modeling, investigations, technology, and validation space. The innovations discussed herein are a distillation of the ideas to which I have been exposed and are being offered as a means to foster broader consideration of these topics. Unlike many aspects of the financial services industry, AML is a team sport. We all share a common goal of reducing the exploitation of the financial services industry by criminals. Sharing ideas in this manner is one way to advance our common objective.
Traditional AML Behavior Detection
Broadly speaking, BD has two primary origination points: human and automated. Human detection involves an employee reporting behavior that seems unusual for the customer or circumstances. It is critical because the frontline staff of financial institutions are in a unique position to directly observe behaviors that warrant a closer look. Conversely, automated BD reflects the application of screening and classification algorithms that can leverage transaction data, and increasingly non-transaction data, to identify customers and accounts that are exhibiting potentially interesting behavior. Traditionally, BD has been organized around particular types of behaviors, customer types, and products. The simplest forms of BD, for instance, consist of grouping customers and products into more homogeneous sets and applying filters that screen for particular behavior types.
For example, a scenario focused on detecting structuring will likely look for cash deposits over a short period that, in aggregate, exceed currency transaction reporting limits but consist of multiple cash deposits that are each under the filing limit. A scenario looking for changes in behavior patterns may compare the total amount of transactions over different periods and classify those with large changes as potentially interesting.
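The structuring scenario described above can be illustrated with a minimal sketch. It is not a production rule: the 10,000 limit, the three-day window, the tuple layout, and the function name `flag_structuring` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

CTR_LIMIT = 10_000   # assumed reporting limit, for illustration only
WINDOW_DAYS = 3      # assumed length of the "short period"

def flag_structuring(deposits, limit=CTR_LIMIT, window_days=WINDOW_DAYS):
    """Return accounts whose sub-limit cash deposits exceed the limit in aggregate.

    deposits: iterable of (account_id, deposit_date, amount) tuples.
    """
    by_account = defaultdict(list)
    for account_id, day, amount in deposits:
        if 0 < amount < limit:  # keep only deposits individually under the limit
            by_account[account_id].append((day, amount))

    flagged = set()
    for account_id, items in by_account.items():
        items.sort()            # chronological order
        start, running = 0, 0.0
        for day, amount in items:
            running += amount
            # Drop deposits that have fallen out of the rolling window.
            while (day - items[start][0]).days >= window_days:
                running -= items[start][1]
                start += 1
            if running > limit:
                flagged.add(account_id)
                break
    return flagged

deposits = [
    ("A", date(2024, 1, 2), 4000), ("A", date(2024, 1, 3), 3500),
    ("A", date(2024, 1, 4), 3000),   # 10,500 within three days
    ("B", date(2024, 1, 2), 4000), ("B", date(2024, 1, 20), 4000),
    ("B", date(2024, 2, 10), 4000),  # same amounts, spread over weeks
    ("C", date(2024, 1, 2), 12000),  # single deposit above the limit
]
flagged = flag_structuring(deposits)
print(flagged)  # {'A'}
```

Account C is ignored here because a single deposit above the limit is not the pattern this particular filter looks for.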
In the main, these types of scenarios can detect activity worthy of investigation. However, due to their simplicity, they are often very inefficient. For example, a low-activity account that was opened by a criminal six months ago for the purpose of laundering proceeds today will probably be flagged by a simple change in behavior scenario when it is used. However, a non-criminal who transfers funds from an account at a brokerage firm to their bank checking account in order to make a college tuition payment may also be flagged as interesting. This is an example of a “false positive alert”, in which a scenario classifies normal behavior as interesting, and it is an all too common occurrence in some BD scenarios. Even if the scenario is effective at finding a high percentage of all interesting activity, which is not guaranteed, it may do so very inefficiently.
Another option that improves efficiency and effectiveness is the use of decision tree logic to classify behavior. Decision trees expand upon traditional filters by requiring that behaviors meet multiple conditions before being classified as interesting. For example, the aforementioned change in behavior scenario could be improved by conditioning on both the change in total transaction amount and the transaction count. In this example, let’s say the criminal’s account has a relatively small number of total transactions (say 10) but the college parent’s account has a more typical number for a family (say 50). Then, for a given dollar increase in total transaction amount, the scenario could be designed to alert only if the transaction count is less than 15. Many other combinations of behaviors can, of course, be tested for their effectiveness at properly classifying activity.
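As a rough sketch of the two-condition rule in this example, the function below alerts only when the increase in total amount is large and the transaction count is below 15. The count cut-off comes from the example above. The dollar threshold, the account figures, and the function name are hypothetical.

```python
def change_in_behavior_alert(prior_total, current_total, txn_count,
                             min_increase=25_000, max_txn_count=15):
    """Two-level decision rule: large amount increase AND few transactions.

    min_increase is a hypothetical dollar threshold; max_txn_count=15
    follows the example in the text.
    """
    large_increase = (current_total - prior_total) >= min_increase
    few_transactions = txn_count < max_txn_count
    return large_increase and few_transactions

# Dormant account suddenly used: 10 transactions in the period.
criminal_alert = change_in_behavior_alert(500, 40_000, txn_count=10)
# Family account with a one-off tuition transfer: 50 transactions.
parent_alert = change_in_behavior_alert(3_000, 40_000, txn_count=50)
print(criminal_alert, parent_alert)  # True False
```

A simple amount-only filter would have alerted on both accounts. The second condition is what removes the false positive.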
In the model development world, the behaviors noted above are converted into mathematical or logical expressions and are referred to as variables or features. A large number of features increases the chances of finding combinations that are effective at flagging bad actors for investigation while ignoring the average customer. This stems from the fact that diverse features increase the likelihood that one, or a combination of several, will be effective at finding hidden patterns. Feature variety is especially important when considering the application of machine learning, which is well suited to evaluating very large data sets. The importance of this discipline will be examined further later in this article.
The development of decision trees is often a painstaking process in which a model developer uses statistical data and their own experience to find combinations that achieve their effectiveness and efficiency objectives. While this process can be very effective for relatively focused applications, the obvious constraint is the time and effort required for a person to test combinations of features. In contrast, machine learning algorithms enable a model developer to test the effectiveness of huge numbers of different feature combinations in relatively short periods of time. They do this by using optimization algorithms to test millions of combinations of hundreds of features and in doing so allow a developer to efficiently compare the relative effectiveness of different algorithms. This capability allows for the detection of hidden relationships between features that are effective at isolating the good actors from the bad actors.
A discussion of the best machine learning algorithms for AML/CFT is well beyond the scope of this article but gradient boosted trees and random forests have been shown to be effective.
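To make the idea of automated combination testing concrete, the toy sketch below brute-forces every pair of thresholds on two features and keeps the pair with the best F1 score on labeled examples. This is a plain grid search, not a gradient boosted tree or random forest, and it stands in only for the general notion of letting the machine test combinations. The data, grids, and function names are invented for illustration.

```python
from itertools import product

def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def search_rules(rows, labels, amount_grid, count_grid):
    """Try every (min_increase, max_count) pair and return the best one."""
    best_rule, best_score = None, -1.0
    for min_increase, max_count in product(amount_grid, count_grid):
        predicted = [
            r["increase"] >= min_increase and r["txn_count"] < max_count
            for r in rows
        ]
        score = f1_score(predicted, labels)
        if score > best_score:  # ties keep the first rule found
            best_rule, best_score = (min_increase, max_count), score
    return best_rule, best_score

# Hypothetical labeled examples (True = dispositioned as interesting).
rows = [
    {"increase": 40_000, "txn_count": 10},
    {"increase": 38_000, "txn_count": 50},
    {"increase": 30_000, "txn_count": 8},
    {"increase": 2_000, "txn_count": 12},
    {"increase": 1_000, "txn_count": 60},
]
labels = [True, False, True, False, False]

best_rule, best_score = search_rules(
    rows, labels, amount_grid=[5_000, 25_000, 50_000], count_grid=[15, 100]
)
print(best_rule, best_score)
```

With six candidate rules this is trivial to do by hand. The point is that the same loop scales to far more features and thresholds than a person could test manually, which is where purpose-built learning algorithms take over.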
Two points should be becoming clear. First, features are critical to BD. They are the “oil” of behavior detection, to re-use a tired analogy. Second, machine learning is very effective at finding useful combinations of features. These two facts point to opportunities to re-tool the behavior detection model manufacturing process. In particular, they suggest that we could rearrange the development process and consider moving from episodic scenario development to a more continuous development and updating process.
From Baking the Cake to Food Engineering
In my grandparents’ youth, baking involved the use of ingredients that were available at the local store (or could be made at home) and an oven to transform them into a cake.
Today, food engineering involves the careful construction of ingredients tailored to please our palate that are combined using a variety of technologies (ovens, microwaves, etc.) to create an appealing product at a low cost. Furthermore, the process for creating ingredients has evolved to be separate from their specific use. Some food engineers design new ingredients while others find the best combination of those ingredients to make desirable products.
The construction of behavior detection models lends itself to this approach. Rather than the traditional process where a model developer selects from readily available features and perhaps creates a few more for the job at hand, we could transform the feature creation into a continuous process that steadily adds features to a “feature store”. BD would likely be strengthened by transforming feature development into a distinct discipline within the AML/CFT behavior detection world.
Specifically, a team comprised of experts in criminal behavior and data science could be established to focus exclusively on the creation and deployment of features. The experts in criminal behavior could be borrowed from the firm’s surveillance operations group and tasked with identifying emergent behaviors. They may derive their inspiration from reading the bank’s investigative summaries, reading government agency bulletins, or networking with others in the field. Regardless of the source, this team will be on the lookout for patterns that identify bad or good actors.
The data scientists on the team would work closely with the typology experts to understand the behaviors and convert them into mathematical representations that can be used by model developers. For example, an investigator may determine that accounts flagged as interesting are often one hop removed from accounts that were previously identified as interesting. The data scientist may then recommend that a graph query be used to tag all accounts that are one hop away on a network graph from all accounts previously flagged as interesting.
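The one-hop observation above could be turned into a feature along the following lines. This is a sketch that assumes the network is available as a simple list of account-to-account edges treated as undirected. A real implementation would more likely run as a query against a graph database, and the function and variable names are hypothetical.

```python
def one_hop_neighbors(edges, flagged):
    """Return accounts directly linked to a flagged account.

    edges: iterable of (account_a, account_b) pairs, treated as undirected.
    flagged: accounts previously identified as interesting.
    The flagged accounts themselves are excluded from the result.
    """
    flagged = set(flagged)
    neighbors = set()
    for a, b in edges:
        if a in flagged:
            neighbors.add(b)
        if b in flagged:
            neighbors.add(a)
    return neighbors - flagged

edges = [("A", "B"), ("B", "C"), ("D", "A"), ("E", "F")]
previously_flagged = {"A"}

tagged = one_hop_neighbors(edges, previously_flagged)
print(sorted(tagged))  # ['B', 'D']

# Expose the result as a boolean feature per account for model developers.
accounts = ["A", "B", "C", "D", "E", "F"]
one_hop_feature = {acct: acct in tagged for acct in accounts}
```

Account C is two hops from A and is therefore not tagged. Extending the feature to two or more hops would be a separate design decision.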
In doing so, the team will have converted an experience-based observation into data that can be used in the model development process. Through a continual focus on creating new features, the feature store will be refreshed with new and relevant features that can be used by model developers.
Meanwhile, the model developers would be hard at work creating algorithms that are effective at separating the good from the not-so-good actors in the portfolio. For example, if the bank decides it needs to implement a scenario to better target a particular form of human trafficking, the model developers would take the lead on developing samples, working with the operations team to identify clear examples of the typology, selecting potential features from the feature store, selecting the best algorithms, conducting hold-out testing, gaining approval from the business, and managing the model validation process.
One Model to Rule Them All?
The advent of machine learning has opened up the possibility of shifting from heavy reliance upon multiple behavior-focused scenarios towards a framework centered around a primary detection engine (PDE). Specifically, since machine learning algorithms are effective at finding subtle relationships between features, there is evidence to support the idea that we can create a PDE that estimates a customer’s AML risk, or at least one that rank-orders the relative AML risk of all customers. Peripheral scenarios would likely be needed for edge cases where a bank lacks sufficient data to train its model to detect a particular activity. However, with appropriate sampling strategies and the thoughtful creation of features, the development of a PDE is a possibility worth exploring.
Two benefits of a PDE are worth calling out. First, it is easier to update a single machine learning model with new data and features than 10 or 20 scenarios. As will be discussed later, the process for building a PDE needs to be carefully considered to reap the full benefits of this approach. Second, it enables more efficient use of sampling resources in that the number of samples needed to develop and monitor a single model may be lower than that required to support a collection of scenarios covering the same risks.
The March Towards Continuous Improvement
The process of building new behavior detection models is, by definition, episodic. The business determines that a new layer of detection is required and then directs its experts to tackle the problem. The process of updating existing scenarios follows a similar path in that scenarios are typically revised in response to performance concerns or due to changes in products, customer composition, and geographic exposure. These discrete event-driven changes typically require the development team to adjust each affected model individually. The time required to review samples, fit models, obtain appropriate approvals, and deploy the final application can run to several months or more. There are ways to potentially expedite this, however.
The first approach is to institute an ongoing monitoring sampling process. Samples are critical for the development of scenarios as they provide labeled examples of interesting and normal activities that can be used to train a model. Traditional development samples are typically episodic, large, and time-consuming to disposition. In contrast, ongoing monitoring sampling involves using a statistically appropriate method for selecting samples from the population of customers each time the scenarios execute. These samples are then dispositioned by the surveillance operations team along with the regular production alerts generated by the scenarios. The benefits of this process are many.
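One way to picture a single run of this process is sketched below. It uses simple random sampling from the customers who did not alert, which is only one of several statistically defensible designs (stratified sampling is a common alternative), and then mixes the samples into the review queue with the production alerts. The function names, sample size, and seeds are assumptions.

```python
import random

def draw_monitoring_sample(customer_ids, alerted_ids, sample_size, seed):
    """Simple random sample of customers that did NOT alert in this run."""
    alerted = set(alerted_ids)
    pool = sorted(c for c in customer_ids if c not in alerted)
    rng = random.Random(seed)  # seeded so the draw can be reproduced later
    return rng.sample(pool, min(sample_size, len(pool)))

def build_review_queue(alerted_ids, monitoring_sample, seed):
    """Shuffle alerts and samples together so reviewers cannot tell them apart."""
    queue = list(alerted_ids) + list(monitoring_sample)
    random.Random(seed).shuffle(queue)
    return queue

customers = [f"C{i:03d}" for i in range(100)]
alerted = ["C003", "C017", "C042"]

sample = draw_monitoring_sample(customers, alerted, sample_size=5, seed=42)
queue = build_review_queue(alerted, sample, seed=7)
print(len(queue))  # 8
```

Recording the seed and the run date alongside each draw would let the sample be reconstructed for validation or audit purposes.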
First, if the operations team dispositions both true alerts and monitoring samples, reviewer bias is reduced because the reviewer does not know which items are monitoring samples, which have a lower probability of being interesting. Second, as samples are accumulated run by run, a store of samples becomes available for the developers to leverage. For example, samples can be used to measure the type 2 error (false negative rate) of the model, provide information to the feature development team, and support model redevelopment.
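The type 2 error measurement mentioned above can be approximated as follows, assuming the monitoring samples were drawn by simple random sampling from the customers that did not alert. The function name and all of the figures in the example are hypothetical.

```python
def estimate_type2_error(sample_dispositions, non_alerted_population,
                         true_positive_alerts):
    """Estimate the false negative rate from dispositioned monitoring samples.

    sample_dispositions: list of booleans, True if the sampled (non-alerted)
        customer was dispositioned as interesting.
    non_alerted_population: number of customers that did not alert.
    true_positive_alerts: production alerts dispositioned as interesting.
    """
    n = len(sample_dispositions)
    if n == 0:
        raise ValueError("at least one dispositioned sample is required")
    miss_rate = sum(sample_dispositions) / n           # share of samples missed
    est_missed = miss_rate * non_alerted_population    # scaled to the population
    total_interesting = est_missed + true_positive_alerts
    fnr = est_missed / total_interesting if total_interesting else 0.0
    return miss_rate, est_missed, fnr

# 200 samples reviewed, 2 found interesting; 10,000 customers did not alert;
# 300 production alerts were confirmed as interesting.
dispositions = [True] * 2 + [False] * 198
miss_rate, est_missed, fnr = estimate_type2_error(dispositions, 10_000, 300)
print(round(miss_rate, 4), round(est_missed, 1), round(fnr, 4))  # 0.01 100.0 0.25
```

With only a handful of interesting samples the estimate is noisy, so in practice it would be read together with a confidence interval and accumulated over several runs.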
Finally, as with the feature store, the sample store can be used to rapidly redevelop models. Rather than relying upon large discrete samples to collect data, developers can rely solely upon monitoring samples or, if they need a closer look at certain regions, they can gather smaller development samples. Either way, the development process is accelerated.
Regular Updates and Releases to Models
The move to machine learning can also expedite the updating of models. For example, updating a data set and incorporating features from the firm’s feature store into a machine learning model can be done in weeks. With the use of automated machine learning tools, the process can even be expanded to test alternative algorithms to determine which best fit the updated data set. A quarterly or semi-annual update frequency for certain scenarios, and more importantly for a PDE, can help to keep models up to date with changes in customer behaviors. Of course, all of this requires a well-choreographed set of procedures and clear decision criteria for making a go/no-go decision. Heavy reliance upon validation and hold-out testing to evidence model performance, combined with rigorous ongoing testing, is critical to enabling frequent model updates.
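A go/no-go decision of this kind could be encoded as pre-agreed criteria that compare the updated model against the one in production on the same hold-out data. The sketch below is purely illustrative: the two metrics, the tolerances, and the dictionary layout are assumptions, and a real policy would be set with the business and model risk management.

```python
def go_no_go(champion, challenger, min_recall_gain=0.0, max_alert_increase=0.10):
    """Compare hold-out results for the production model and its update.

    champion / challenger: dicts with 'recall' (share of known interesting
    cases detected) and 'alerts' (alert volume) on the same hold-out set.
    """
    recall_ok = challenger["recall"] - champion["recall"] >= min_recall_gain
    alerts_ok = challenger["alerts"] <= champion["alerts"] * (1 + max_alert_increase)
    return "go" if recall_ok and alerts_ok else "no-go"

production = {"recall": 0.70, "alerts": 1000}
update_a = {"recall": 0.76, "alerts": 1050}   # better recall, modest volume growth
update_b = {"recall": 0.80, "alerts": 1500}   # better recall, too many alerts

decision_a = go_no_go(production, update_a)
decision_b = go_no_go(production, update_b)
print(decision_a, decision_b)  # go no-go
```

Fixing the criteria before the update is fitted is what allows the decision to be made quickly and evidenced consistently from release to release.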
Documentation remains the bane of the developer’s existence. However, it is no less vital than the model itself, for without it the model does not exist in the eyes of management or regulators.
Given the enormous effort involved in completing documentation for complex models, this process needs to be carefully considered at the same time the development process is re-engineered. Each model change has a documentation cost and therefore must be considered in light of the creative latitude afforded developers. For example, standardized reporting on data quality, model results, approvals, and other evidence will streamline the documentation process.
However, such standardization may also limit the ability of developers to explore alternative approaches, so minimizing exceptions to the development process will often be at odds with innovation. Even so, the movement towards standardization would likely result in a net increase in the velocity with which the overall BD infrastructure can be improved. It also makes agreement with the model risk management group on the best approach to validating frequently updated models vital.
Model risk management is a vital part of ensuring that only properly constructed models which align with a firm’s risk tolerances are deployed into production. However, model risk management can be a time-consuming process as reviewers need weeks or months, depending upon the model, to review and challenge the submitted documentation. If a bank wishes to continuously update its models, rules need to be established governing the evidence required to demonstrate the soundness of a model. The model developers would need to create standardized development procedures that can be approved ex-ante by the model risk management team.
Model validation teams would need to become comfortable with approving the deployment of models developed using these procedures and relying upon periodic testing to evidence compliance with the procedures. The use of periodic compliance testing with an agreed-upon development process, as opposed to deep dives into each model release, could accelerate the responsiveness of BD programs to changes in coverage needs.
The stakeholders in the fight against money laundering and terrorist financing are exploring how best to leverage improvements in data and technology to stay abreast of the ever-changing criminal landscape. Machine learning and changes in model development processes offer the tantalizing promise that institutions will be able to improve both the effectiveness of their BD programs and the speed with which they can deploy enhancements.
I am hopeful that this article will stimulate further discussion between the various stakeholders in this fight so that we may collectively find ways to help law enforcement in identifying suspicious activity.