Reinforcement learning, supervised learning, contextual bandits, active learning – how do they all fit together to make better customer outcomes?
The general idea of next-best-action recommendation is very simple. As it says on the box – if you have a set of actions you could choose between for a customer, which one should you choose? A deeper question is: how do you make better decisions over time?
In this post we explain the relationship between various approaches to machine learning and experimentation as they apply to next-best-action recommendation. Clearly, there are many different aspects to each of the machine learning approaches and different factors are important in different business settings. Here we are focussing on the key similarities and differences relating to exploring customer behaviours in order to collect knowledge about them, and in how that knowledge is used to generate better customer outcomes.
Picking a winner
We will start with the simplest sort of next-best-action recommendation system - one where the desire is to pick a single winner out of a set of actions, and eventually to always use that best action for every customer, regardless of their characteristics. This is the province of A/B testing, and typical applications are choosing between website design layouts, or picking the best advertising banner to show. When you have new actions to try, you know very little about how they will behave. This means that you need to explore how customers react to each action, then choose a winner to exploit the knowledge that you have gained while exploring.
This tradeoff between exploring to gain new knowledge and exploiting the knowledge you have already gained is key to understanding how next-best-action recommendation systems work. The basic question is: do you explore to learn and then exploit your knowledge, or do you explore and exploit at the same time, using what you know so far?
Explore then exploit
For picking a winner, the simplest approach is to run a randomised experiment in an A/B test. In this case, an action is randomly selected and executed for a customer, and the reward (that is the outcome of the action – it might be zero or even negative!) is recorded. At the end of the experiment, a winning action is declared and that action is then used for every customer.
In this case, the knowledge exploration and knowledge exploitation phases are separated.
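A minimal sketch of the explore-then-exploit pattern is shown here, as a simulation rather than a production test. The conversion rates in `true_rates` are hypothetical and would be unknown in practice, the function name `run_ab_test` is ours, and the sketch leaves out the significance test that a real A/B test would use before declaring a winner.

```python
import random

def run_ab_test(true_rates, n_explore, rng):
    """Explore: assign each customer to a uniformly random action, record a 0/1 reward."""
    n_actions = len(true_rates)
    counts = [0] * n_actions
    rewards = [0] * n_actions
    for _ in range(n_explore):
        action = rng.randrange(n_actions)  # random assignment, ignoring results so far
        reward = 1 if rng.random() < true_rates[action] else 0
        counts[action] += 1
        rewards[action] += reward
    # Observed mean reward per action (0.0 if an action was never tried).
    means = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    winner = max(range(n_actions), key=means.__getitem__)
    return winner, means, counts

rng = random.Random(0)
true_rates = [0.05, 0.10, 0.20]  # hypothetical conversion rates for three actions
winner, means, counts = run_ab_test(true_rates, n_explore=6000, rng=rng)

# Exploit: from here on, every customer receives the winning action.
print("winner:", winner, "observed means:", means)
```

Note that roughly two thirds of the customers in the exploration phase receive an action that turns out not to be the winner. That is the cost of keeping the two phases separate.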
Explore and exploit at the same time
A more efficient approach to picking a winner is to use algorithms that solve what is known as the “multi-arm bandit” problem (see this blog for more details). These algorithms dynamically measure which of the actions gets the best results (like picking which slot machine is best to play…), and incrementally allocate more and more traffic to that action until they are confident that they have found the winner. This way you optimise the overall reward you receive during the test by minimising the number of times you choose a bad action.
Typically, people don’t think of A/B testing as a type of next-best-action recommendation, as they are largely thinking about personalised recommendations driven by machine learning. That is, given some information that I know about a customer (the context), what is the next-best-action for that customer? However, multi-arm bandit algorithms and A/B testing are important methods because:
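One common way to solve the bandit problem is Thompson sampling, which we use here purely as an illustration; the text above does not prescribe a particular algorithm. The sketch below keeps a Beta posterior over each action's conversion rate, draws one sample per action for each customer, and plays the action with the highest sample. The rates in the example call are hypothetical.

```python
import random

def thompson_sampling(true_rates, n_rounds, rng):
    """Beta-Bernoulli Thompson sampling; returns how often each action was played."""
    n_actions = len(true_rates)
    successes = [0] * n_actions
    failures = [0] * n_actions
    for _ in range(n_rounds):
        # Draw a plausible conversion rate for each action from its Beta posterior
        # (a uniform Beta(1, 1) prior plus the outcomes observed so far).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_actions)
        ]
        action = max(range(n_actions), key=samples.__getitem__)
        if rng.random() < true_rates[action]:
            successes[action] += 1
        else:
            failures[action] += 1
    return [s + f for s, f in zip(successes, failures)]

rng = random.Random(0)
pulls = thompson_sampling([0.05, 0.10, 0.20], n_rounds=5000, rng=rng)
print("times each action was played:", pulls)
```

Actions with few observations have wide posteriors, so they still win the draw occasionally and keep being explored. As evidence accumulates, the best action wins almost every draw and receives most of the traffic.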
- sometimes it is not worth doing personalisation - the cost of personalisation outweighs the benefit of giving a personalised treatment
- in the absence of data that is predictive of the customer’s preference, personalisation cannot give better results than an A/B test
- every next-best-action recommendation system has to solve the explore / exploit trade-off, and the basic model of separate phases (à la A/B testing) or simultaneous optimisation (à la multi-arm bandits) applies.
There are three types of machine learning approaches that can be used for next-best-action recommendation that we explore here:
- supervised learning - we are considering this from the explore/exploit tradeoff, not which learning algorithm should be chosen
- reinforcement learning - initially we focus on solutions to the contextual bandit problem, which is a special case of reinforcement learning
- active learning - which we include because of the way it makes exploration for new knowledge more efficient.
Note that some unsupervised machine learning methods are also used for action recommendation - matrix factorisation for product recommendation, for instance. In this context, these algorithms are very similar in usage to supervised learning approaches, where the matrix factors are analogous to the models that have been learnt from historical data.
Learn from the past
Supervised learning is a type of machine learning that can be applied to next-best-action recommendation. It takes data relating to customer context in the form of machine learning features, and outcomes in the form of the rewards from actions that were taken, and builds predictive models that can predict rewards for actions for customers that have not been seen yet. In this way, it learns from the history of interactions with past customers (the exploration part is implicit in the collection of training data) and then makes predictions for new customers (the knowledge exploitation part). Product recommender systems on eCommerce sites that tell you what “other people who bought this also bought” are often built this way. These systems are periodically retrained to update the knowledge of customer behaviour encapsulated in the model.
These systems have two limitations with respect to next-best-action recommendation:
- when a new action is introduced they have to go back to an exploration phase
- when customer behaviours change the systems adapt at the rate that the model is retrained.
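The learn-then-predict pattern, and the first limitation above, can be sketched with a deliberately tiny "model": the average historical reward per customer segment and action. This lookup table stands in for whatever supervised learner you would really use, and the segment names, action names and history rows are all made up for illustration.

```python
from collections import defaultdict

def fit_reward_model(history):
    """Learn the mean reward for each (segment, action) pair from logged interactions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for segment, action, reward in history:
        totals[(segment, action)] += reward
        counts[(segment, action)] += 1
    return {key: totals[key] / counts[key] for key in counts}

def recommend(model, segment, actions):
    """Pick the action with the highest predicted reward for this segment."""
    # Pairs never seen in training are predicted as 0.0 in this sketch.
    return max(actions, key=lambda action: model.get((segment, action), 0.0))

# Hypothetical history of (segment, action, reward) rows.
history = [
    ("new", "discount", 1), ("new", "discount", 0),
    ("new", "upgrade", 0), ("new", "upgrade", 0),
    ("loyal", "discount", 0), ("loyal", "discount", 0),
    ("loyal", "upgrade", 1), ("loyal", "upgrade", 1),
]
model = fit_reward_model(history)

best_for_new = recommend(model, "new", ["discount", "upgrade"])
best_for_loyal = recommend(model, "loyal", ["discount", "upgrade"])
# A newly introduced action has no history, so this model never picks it.
with_new_action = recommend(model, "loyal", ["discount", "upgrade", "bundle"])
print(best_for_new, best_for_loyal, with_new_action)
```

Because the hypothetical "bundle" action has no training rows, it can never be recommended until someone deliberately collects data on it, which is the return to an exploration phase described above.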
It is worth noting that in modern MLOps frameworks (e.g. Seldon Core), A/B testing and multi-arm bandits are used to choose between supervised learning algorithms in production.
Continuous learning from outcomes
As noted above, supervised learning systems are retrained periodically. What happens if you retrain them after every interaction? Also, what happens if you do a little exploration during your application of a supervised learning model? If you do these things, you are now solving a contextual bandit problem - making a prediction of a next best action, exploiting the contextual information that you know, and efficiently exploring for new knowledge. Systems that do this are used every day in product recommendation by Amazon, show recommendation by Netflix, and a variety of call center allocation systems.
How these systems work is explored in detail in this blog post.
The key point to understand is that contextual bandit problems are typically solved using supervised learning models that can quantify their uncertainty. Actions where there is a lot of uncertainty in the outcomes get more exploration. Actions where the algorithms are confident that the reward is high get more exploitation.
A contextual bandit algorithm that uses context and models that are not predictive of the outcomes of the actions will reduce to a multi-arm bandit algorithm.
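To make the role of uncertainty concrete, here is a minimal sketch of a contextual bandit in which the context is a single customer segment and the "model" is a separate Beta posterior for every (segment, action) pair. This is a simplifying assumption of ours: real systems typically use richer features and models. The class name, segments, actions and `TRUE_RATES` are hypothetical.

```python
import random
from collections import defaultdict

class SegmentThompsonBandit:
    """Thompson sampling with one Beta posterior per (segment, action) pair."""

    def __init__(self, actions, rng):
        self.actions = list(actions)
        self.rng = rng
        # [successes, failures] for each (segment, action); starts at zero.
        self.stats = defaultdict(lambda: [0, 0])

    def choose(self, segment):
        def sample(action):
            successes, failures = self.stats[(segment, action)]
            return self.rng.betavariate(successes + 1, failures + 1)
        # Uncertain pairs have wide posteriors and therefore get explored more often.
        return max(self.actions, key=sample)

    def update(self, segment, action, reward):
        self.stats[(segment, action)][0 if reward else 1] += 1

# Hypothetical ground truth: each segment prefers a different action.
TRUE_RATES = {
    ("new", "discount"): 0.30, ("new", "upgrade"): 0.05,
    ("loyal", "discount"): 0.05, ("loyal", "upgrade"): 0.30,
}

rng = random.Random(1)
bandit = SegmentThompsonBandit(["discount", "upgrade"], rng)
chosen = defaultdict(int)
for _ in range(4000):
    segment = rng.choice(["new", "loyal"])  # the context arrives; we do not control it
    action = bandit.choose(segment)
    reward = 1 if rng.random() < TRUE_RATES[(segment, action)] else 0
    bandit.update(segment, action, reward)  # learn after every interaction
    chosen[(segment, action)] += 1
print(dict(chosen))
```

If the segment carried no information about the reward, the posteriors for a given action would look the same in both segments, and the behaviour would match that of a plain multi-arm bandit that simply splits its data in two.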
Off-policy evaluation is a key concept for testing new contextual bandit algorithms on historical sequences of actions - it is analogous to the training-validation loop of supervised learning. Supervised learning is used in some important off-policy evaluation methods.
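One of the simplest estimators of this kind is inverse propensity scoring (IPS), shown below as an illustration; it is not the only method, and it is not one of the methods that relies on supervised learning. The sketch assumes that the log records the probability with which the historical policy chose each action, and that the new policy is deterministic. The log rows and policy functions are hypothetical.

```python
def ips_value(logged, target_policy):
    """Estimate the average reward of target_policy from logged interactions.

    logged: list of (context, action, reward, propensity) rows, where propensity
    is the probability with which the logging policy chose that action.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        # Only rows where the new policy agrees with the logged action contribute,
        # re-weighted by how unlikely the logging policy was to take that action.
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logged)

# Hypothetical log from a policy that picked each of two actions with probability 0.5.
logged = [
    ("new", "discount", 1, 0.5),
    ("new", "upgrade", 0, 0.5),
    ("loyal", "discount", 0, 0.5),
    ("loyal", "upgrade", 1, 0.5),
]

def personalised(context):
    return "discount" if context == "new" else "upgrade"

def always_discount(context):
    return "discount"

value_personalised = ips_value(logged, personalised)
value_always_discount = ips_value(logged, always_discount)
print(value_personalised, value_always_discount)
```

On this toy log the personalised policy is estimated to earn more per customer than always sending the discount, without either policy having been run live.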
Choosing who to learn from
A slightly less well-known branch of machine learning is called active learning. Active learning algorithms are used when you want to minimise the cost of exploration. Suppose you wanted to send out renewal offers to existing customers, but you didn’t know which offers to send to which customers. Making the wrong choices can be expensive. So you want to build a model to predict the best offers to send to each customer, but with the least cost. In this case you want to choose which customers get the first offers very carefully - you want to choose those that teach you the most, that maximise your knowledge through exploration, so that you are making better decisions sooner.
To solve this offers problem with active learning, an algorithm says which of your customers it is most uncertain about and sends an offer to them to learn as much as possible from that test. This is closely related to the continuous learning problem where you want to explore your space efficiently (but don’t have control over who you are making recommendations for). The most general approaches are those that combine active learning and contextual bandits, but that is still an area of active research.
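A common form of this idea is uncertainty sampling, sketched below under the assumption that a model already gives a predicted acceptance probability for each customer. Customers whose prediction is closest to 0.5 are the ones the model is least sure about, so they are contacted first. The customer ids, probabilities and the `budget` parameter are hypothetical.

```python
def most_uncertain(predicted_probs, budget):
    """Return the ids of the customers whose predicted probability is closest to 0.5."""
    ranked = sorted(
        predicted_probs,
        key=lambda customer_id: abs(predicted_probs[customer_id] - 0.5),
    )
    return ranked[:budget]

# Hypothetical model output: probability that each customer accepts the offer.
predicted_probs = {"c1": 0.95, "c2": 0.52, "c3": 0.10, "c4": 0.45, "c5": 0.70}

to_contact = most_uncertain(predicted_probs, budget=2)
print(to_contact)
```

After the chosen customers respond, the model would be retrained and the ranking recomputed, so each round of offers is spent where it is expected to teach the most.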
Next-best-journeys and reinforcement learning
Most readers of this blog will have heard of AlphaGo and AlphaZero, the game-playing machines from Google DeepMind. These are based on reinforcement learning, where the algorithm plans sequences of actions (moves in the games) and learns from the eventual results (did it win the game or not). To do this, they must simulate many moves in advance and predict the eventual outcomes from the states that they simulate.
These algorithms can also be applied to next-best-action recommendation. Indeed, algorithms that solve the contextual bandit problem are simple types of reinforcement learning algorithms, but they only have to predict a single next-best-action from the user context rather than plan a sequence of actions.
Full reinforcement learning can be applied to a more general concept of next-best-action which we call next-best-journeys. When you have a user journey that can go through multiple paths - what is the optimal path for a given person and what is the next action that leads to or along that path? Reinforcement learning algorithms can plan and optimise through the states of the user journey to reach an eventual desired target.
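The difference between optimising the next action and optimising the journey can be sketched with value iteration on a toy journey. Value iteration is a basic planning method that assumes the transition probabilities are known; full reinforcement learning would have to learn them from interaction. The states, actions, probabilities and rewards below are invented for illustration.

```python
# Hypothetical journey: state -> action -> list of (probability, next_state, reward).
# "purchased" and "left" are terminal states and so have no entry of their own.
JOURNEY = {
    "browsing": {
        "offer_trial": [(0.6, "trial", 0.0), (0.4, "left", 0.0)],
        "hard_sell": [(0.1, "purchased", 1.0), (0.9, "left", 0.0)],
    },
    "trial": {
        "send_tips": [(0.5, "purchased", 1.0), (0.5, "left", 0.0)],
        "do_nothing": [(0.2, "purchased", 1.0), (0.8, "left", 0.0)],
    },
}

def value_iteration(mdp, gamma=0.9, tol=1e-9):
    """Return the optimal state values and the greedy policy for a small known MDP."""
    values = {state: 0.0 for state in mdp}

    def q(state, action):
        # Expected immediate reward plus discounted value of where we land.
        # Terminal states are absent from `values` and count as 0.0.
        return sum(
            prob * (reward + gamma * values.get(next_state, 0.0))
            for prob, next_state, reward in mdp[state][action]
        )

    while True:
        delta = 0.0
        for state in mdp:
            best = max(q(state, action) for action in mdp[state])
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    policy = {s: max(mdp[s], key=lambda a: q(s, a)) for s in mdp}
    return values, policy

values, policy = value_iteration(JOURNEY)
print(policy, values)
```

In this toy journey, the hard sell has the higher immediate expected reward for a browsing customer, but planning over the whole journey prefers offering the trial, because the trial state leads to a purchase more often.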
We’ve introduced the relationships between the important machine learning concepts in next-best-action recommendation, and differentiated them based on how they solve the knowledge exploration and exploitation trade-off. They can be summarised in the following figure, which we hope will be a handy guide for the community. A high-quality PDF is available here.
Have a look at our posts describing
- the differences between A/B testing and contextual bandits
- the relationships between continuous intelligence and contextual bandits
If you have a next-best-action recommendation problem and would like some assistance in solving it scientifically, please contact us through our website or email firstname.lastname@example.org. We specialise in next-best-action recommendation, uplift modelling, and experimentation to prove return-on-data.