
Data Mining Problems in Retail


Retail is one of the most important business domains for data science and data mining applications because of its prolific data and numerous optimization problems such as optimal prices, discounts, recommendations, and stock levels that can be solved using data analysis methods. The rise of omni-channel retail that integrates marketing, customer relationship management, and inventory management across all online and offline channels has produced a plethora of correlated data which increases both the importance and capabilities of data-driven decisions.

Although there are many books on data mining in general and its applications to marketing and customer relationship management in particular [BE11, AS14, PR13 etc.], most of them are structured as data scientist manuals focusing on algorithms and methodologies and assume that human decisions play a central role in transforming analytical findings into business actions. In this article we are trying to take a more rigorous approach and provide a systematic view of econometric models and objective functions that can leverage data analysis to make more automated decisions. With this paper, we want to describe a hypothetical revenue management platform that consumes a retailer’s data and controls different aspects of the retailer’s strategy such as pricing, marketing, and inventory:

[Figure: a hypothetical revenue management platform that consumes the retailer's data and controls pricing, marketing, and inventory]

There are two major reasons why this study focuses on a combination of economic frameworks and data mining methods:

  • Hundreds of economic models relevant to retail can be found in economic textbooks and articles because markets, discounts, competition etc. have been a subject of intensive research over the last century, if not longer. However, many of these models are highly parametric (i.e. defined by rigid equations with a finite number of parameters) and not flexible enough to model real life with sufficient accuracy. Data mining offers a variety of techniques for nonparametric modeling that help to create flexible and practical models. Many articles and case studies published during the last decade successfully achieve the balance between abstract models and machine learning.
  • Fast data circulation in modern retail enables retailers to make accurate forecasts using relatively simple models because small incremental predictions are generally simpler than big decisions. For instance, it might be difficult to calculate the optimal price for a new disruptive product because its perceived value is not known, but it can be relatively easy to automatically adjust promotion prices in real time depending on demand and inventory levels. Some commercially successful solutions for price optimization discard most economic modeling, simply moving prices up and down based on closed-loop feedback from points of sale [JL11].

These two considerations suggest a high potential for automated decision making and dynamic optimization in retail, so we were keen to study this subject. Most of this article represents an overview of the results published by retailers and researchers who built practical decision making and optimization systems combining abstract economic models with data mining methods. More specifically, the article was inspired by three major case studies from Albert Heijn [KOK07], the largest supermarket chain in the Netherlands, Zara [CA12], an international apparel retailer, and RueLaLa [JH14], an innovative online fashion retailer. We also incorporate results from Amazon, Netflix, LinkedIn and many independent researchers and commercial projects. At the same time, we avoid academic results with little or no empirical support.

The study focuses mainly on optimization problems related to the revenue management discipline, which includes marketing and pricing questions. More specialized data mining applications like supply chain optimization and fraud detection are out of scope, as are the implementation details of the data mining process (such as evaluation of model quality).

The rest of the article is organized as follows:

  • We first introduce a simple framework that ties together a retailer’s actions, profits and data. This framework will later be used to describe analytical problems in a more uniform way.
  • The main body of the article represents a catalog of optimization problems relevant to retail. We describe the problems one by one in separate sections. Each section provides a brief problem statement, a list of business use cases and applications, and a detailed description of how the problem can be decomposed into econometric models and data mining tasks that help to solve the business problem by means of numerical optimization.
  • We next provide a section that discusses the economic benefits that can be expected in practice.
  • Finally, we conclude the article with a discussion of dependencies between the considered problems to figure out common principles and important cross-cuts.

[This article is also available in PDF – see the Whitepapers page.]

The Optimization Framework

This article describes six major optimization problems related to marketing and pricing that can be solved leveraging data mining techniques. Although these problems are very different, we are trying to establish a common framework that helps to design optimization and data mining tasks required for solutions.

The basic idea of the framework is to use an economic metric such as gross margin as the optimization objective and consider it a function of possible retailer’s actions such as marketing campaigns or assortment adjustments. The econometric objective is also a function of data in the sense that econometric models should be parameterized by properties of a particular retailer to produce a numerical value, such as gross margin, at its output. For instance, consider a retailer planning a marketing mailing campaign. The space of possible actions can be defined as a set of send/no-send decisions with regard to individual customers and the gross margin of the campaign depends both on actions (who will receive the incentive and who will not) and data such as expected revenue from a given customer and mailing costs. This approach can be expressed in a more formal way by the following equation:

A_0=\underset{A}{argmax}\; G\left(A,d\right)\quad (1)

where d  is the data available for analysis, A is the space of a retailer’s actions and decisions, G(\cdot) is an econometric model defined as a function of actions and data, and  A_0 is the optimal strategy. This framework resembles the approach suggested in [JK98].

The design of the model G heavily depends on the problem. In most cases it is reasonable to model and optimize gross margin, but, as we will discuss in the next section dedicated to response modeling, other objectives are also possible. It is also important to keep in mind that the optimization problem (1) as a whole is somewhat dependent on time because of environmental changes (new products appear on the market, competitors make their moves etc.) and the retailer’s own actions. The most typical approach for handling this dependency is to use a stateless G, treating it as a mathematical function, but to include historical data among its arguments to account for memory effects.

The role of data mining in the optimization problem (1) is crucial because econometric models G are typically complex and have to be learned from data by means of regression and other data mining techniques. In some cases the model G cannot be completely specified either because of high complexity (e.g. user behavior cannot be precisely predicted) or because it’s impossible to extrapolate the existing data to the case of interest (e.g. the action is to introduce a completely new service). A/B testing and panel surveys are used in such cases to get additional data points that improve the precision of the model.
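
To make the framework more tangible, the following minimal Python sketch implements (1) for a discrete action space. The gross margin model G, the action space (discount levels), and the data values are hypothetical placeholders, not a real retailer's model:

# A minimal sketch of the framework (1): enumerate candidate actions
# and pick the one that maximizes the econometric model G(A, d).
def optimize(actions, data, G):
    return max(actions, key=lambda A: G(A, data))

# Hypothetical model G: gross margin of a promotion as a function of
# the discount level (the action) and the retailer's data.
def G(discount, d):
    price = d["base_price"] * (1 - discount)
    # linear demand response: relative price change times elasticity
    demand = d["base_demand"] * (1 + d["elasticity"] * (-discount))
    return demand * (price - d["unit_cost"])

d = {"base_price": 10.0, "base_demand": 100.0,
     "elasticity": -5.0, "unit_cost": 6.0}
best_action = optimize([0.0, 0.05, 0.10, 0.15, 0.20], d, G)  # 0.10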

Problem 1 : Response Modeling

Problem Statement

Some resource such as an advertisement or a special offer will be distributed to a group of customers. Each unit of the resource is associated with a monetary cost such as the mailing cost of a printed catalog, or some negative effect (such as causing a customer to unsubscribe from irrelevant email notifications). At the same time, the resource can influence customers’ decisions urging them to make more purchases, buy promoted products etc. The goal is to find a set of the most promising candidates who should receive the resource in order to maximize the overall performance of the targeted group of customers.

The resource can be homogeneous (i.e. all participating customers will get the same incentive) or personalized. In the latter case, a retailer has a set of different incentives such as discount coupons on different products, and the goal is to offer each customer a unique subset of incentives (or none) to maximize the overall performance.

Applications

Response modeling is widely applicable in marketing and customer relationship management:

  • Targeting specific discounts, coupons, and special offers requires the identification of customers who are likely to respond to the offer.
  • Targeted mailing campaigns and special gifts (e.g. free sunglasses from a car dealer) often require the identification of the most valuable customers to reduce the marketing costs.
  • Customer retention programs can require the identification of customers who are likely to stop the relationship with a retailer but can change their minds under the influence of incentives. For instance, an online retailer can send a special offer to customers who had abandoned their online carts or search sessions before the checkout.
  • Online catalogs and search results can be rearranged depending on a customer’s likelihood of responding to particular items.
  • Response modeling helps to optimize email campaigns to avoid unnecessary spamming which can cause customers to unsubscribe from email notifications.

Solution

As the discussion above suggests, the problem of resource distribution is an optimization problem that should be driven by an objective function. One of the most basic approaches is to model the overall profit of the campaign in terms of probability of response and the expected net value for a customer. Let us denote the entire population of customers as P and the subset of customers reached in the scope of the campaign as U \subseteq P. The expected gross profit of the campaign can then be modeled as follows:

G=\sum_{u \in U} Pr(R|u;I)\cdot (g(u|R) - c) + (1 - Pr(R|u;I))\cdot (-c)\quad(1.1)

where Pr(R|u;I) is the probability of the response to the incentive I from the customer u, g(u|R) is the response net value for the customer u, and c is the cost of the incentive resource. The first term corresponds to the expected gain from a responding customer and the second term corresponds to the expected loss of sending an incentive to which there’s no response. The objective is to maximize G by finding a subset of customers that are likely to respond in the most profitable way. Since the equation (1.1) can be reduced as follows

Pr(R | u;I) \cdot (g(u|R)-c) + (1-Pr(R|u;I))\cdot (-c) = Pr(R|u;I)\cdot g(u|R) -c =\\= E\left\{g|u;I \right\}-c

where E\left\{g|u;I\right\} denotes the mathematical expectation of the gross margin for a given user assuming that the user will receive the incentive, the customer selection criteria boils down to the following condition

Pr(R|u;I)\cdot g(u|R)>c

and the optimal subset of customers U can be determined as a subset that maximizes the gross margin:

\underset{U\subseteq P}{argmax}\;G = \underset{U\subseteq P}{argmax}\; \sum_{u\in U}\left(E\left\{g|u;I\right\}-c\right)\quad (1.2)

This approach can also be considered the maximization of targeted net value compared to random resource distribution. To see this, let us compare these two options assuming a fixed number of customers |U| participating in a campaign. First, let us extend the equation (1.2) to explicitly include the expected gross margin from a campaign that distributes incentives among |U| customers selected at random:

\underset{U\subseteq P}{argmax} \sum_{u\in U}\left(E\left\{g|u;I\right\}-c\right)-|U|(E\left\{g|I\right\}-c)=\\=\underset{U\subseteq P}{argmax} \sum_{u\in U}\left(E\left\{g|u;I\right\}-E\left\{g|I\right\}\right)=\\=\underset{U\subseteq P}{argmax}\sum_{u\in U}E\left\{g|u;I\right\}\quad,\quad |U|=const \quad (1.3)

where E\left\{g|I\right\} is the average net value per customer over the population. This average net value is constant, hence it can be omitted assuming the fixed cardinality |U|. The equation (1.2) can also be reduced in the case of fixed |U|, yielding the same result as (1.3):

\underset{U\subseteq P}{argmax} \sum_{u\in U}\left(E\left\{g|u;I\right\}-c\right)=\underset{U\subseteq P}{argmax}\sum_{u\in U}E\left\{g|u;I\right\}\quad,\quad |U|=const
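
In practice, the criterion Pr(R|u;I)\cdot g(u|R)>c can be implemented by training a response classifier on data from past campaigns. The sketch below uses synthetic data; the features, the constant response value g, and the choice of logistic regression are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic history: X holds customer features (e.g. recency, frequency,
# monetary value), y marks whether the customer responded to an incentive.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)

model = LogisticRegression().fit(X, y)

g = 100.0  # response net value g(u|R), assumed constant across customers
c = 1.0    # cost of one incentive

p_response = model.predict_proba(X)[:, 1]     # estimate of Pr(R|u;I)
expected_gain = p_response * g - c            # E{g|u;I} - c
targeted = np.flatnonzero(expected_gain > 0)  # criterion Pr*g > c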

However, it can be argued [VL02] that this model is imperfect because it favors customers who are likely to respond to an incentive, but does not take into account customers who are likely to respond anyway, generating the same profit even without incentives. To address this shortcoming, let us separately calculate the gross margin for the set of customers U in the following four cases:

  • G_1 – select U according to the equation (1.2) and send incentives to everyone in U
  • G_2 – select U randomly and send incentives to everyone in U
  • G_3 – select U according to the equation (1.2) but do not send incentives at all
  • G_4 – select U randomly but do not send incentives at all

The equation (1.2) maximizes the difference G_1-G_2 i.e. the lift of targeting compared to the random distribution. The alternative approach is to maximize (G_1-G_2 )-(G_3-G_4) which measures not only the lift compared to the random distribution but also the lift compared to the no-action baseline on the same set of customers. In that case, the equation (1.2) transforms into the following:

\underset{U\subseteq P}{argmax} \sum_{u\in U} E\left\{g|u;I\right\} - c - E\left\{g|u;\bar{I}\right\} \quad (1.4)

where the last term corresponds to the expected net value for customers who were not provided with the incentive. This approach is known as differential response analysis or uplift modeling [BE09].
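
The uplift term in (1.4) is commonly estimated with a two-model approach: one response model fitted on customers who received the incentive and another on a control group that did not, scoring each customer by the difference. A sketch on synthetic data (features, models, and constants are assumptions):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
treated = rng.integers(0, 2, size=2000).astype(bool)
# Synthetic responses: the incentive helps only when feature 0 is high.
p = 0.05 + 0.10 * treated * (X[:, 0] > 0)
y = rng.random(2000) < p

m_t = GradientBoostingClassifier().fit(X[treated], y[treated])    # treated
m_c = GradientBoostingClassifier().fit(X[~treated], y[~treated])  # control

g, c = 100.0, 1.0
uplift = g * (m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1])
targeted = np.flatnonzero(uplift > c)  # E{g|u;I} - E{g|u;no incentive} > c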

It is worth noting that the expressions (1.2) and (1.4) are not necessarily optimized by maximizing marketing budgets. Consider the situation when the response net profit is $100 per customer and the incentive cost is $1. If a group of 1 million customers contains 0.5% potential responders, the most expensive marketing campaign that reaches every customer will result in a loss of $500K (the total response net value of $500K minus the campaign cost of $1M). At the same time, a data model that identifies the ten thousand customers most likely to respond, with a response probability of 5% (a 10x lift), will produce a profit of $40,000 (a total response value of $50,000 minus the campaign cost of $10,000).
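
The arithmetic of this example is easy to reproduce:

# Reproducing the campaign arithmetic from the example above.
population, base_rate = 1_000_000, 0.005
g, c = 100.0, 1.0
blanket_profit = population * base_rate * g - population * c  # -500,000
targeted_profit = 10_000 * 0.05 * g - 10_000 * c              # +40,000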

The equation (1.4) is especially important for different types of price discounts (coupons, temporary price discounts, and special offers). Consider the following question: “Should a retailer offer a discount coupon on apples to a person who buys apples every day?” This question would most likely be answered in the affirmative according to the equation (1.2) because the person is likely to redeem a coupon. However, it is more probable that the customer would just buy the same amount of apples for a lower price, basically decreasing the retailer’s profit. The equation (1.4) alleviates this problem by incorporating default customer behavior. We continue to discuss price discrimination in the next sections because it is a complex topic that goes far beyond the equation (1.4).

The mathematical expectations of the net revenue in the equations (1.2) and (1.4) can be estimated by means of classification and regression models trained on historical data for customers who have received incentives in the past and those who did not. This problem can be very challenging, especially when the incentive under evaluation is somewhat dissimilar to everything used in the past; in this case, the incentives may require testing on a customer panel before running a full-scale campaign. Moreover, gross margin is not the only performance metric that is important for retailers. The gross margin metric, in the sense it is used in the equations (1.2) and (1.4), is concerned with the immediate return from the first purchase which is a very simplistic view of customer relationship management. A retailer might be concerned with a variety of other metrics and this variety is so huge that there is a separate econometric discipline – propensity modeling [SG09, LE13] – that develops different models that predict customers’ future behaviors. The most important propensity models include:

  • Predicted lifetime value. The lifetime value model is one of the most important models that estimates the amount of revenue or profit a customer will generate over his or her lifetime. This metric is especially important for campaigns that aim to acquire new customers.
  • Predicted share of wallet. The share of wallet model estimates how much a customer spends at a given retailer compared to how much he or she spends at competitors for some category of products such as groceries or apparel. This metric reveals customers with high revenue potential, hence it can be used in loyalty programs and usage expansion campaigns.
  • Propensity to category expansion. This model estimates the likelihood of first-time spending in certain product categories e.g. switching from casual to luxury products. This model helps to design targeted usage expansion campaigns.
  • Propensity to churn. This model estimates the likelihood that a customer stops buying from a given retailer permanently and switches to competitors. Customers with a high propensity to churn can be targeted in retention campaigns. For instance, a retailer can identify customers who abandoned their online shopping carts or search sessions but are likely to proceed to order placement if offered a discount or gift.
  • Propensity to change shopping habits. Each customer has shopping habits that eventually determine a customer’s value for a retailer – how often the customer buys, what products, from what categories etc. These habits are generally stable over time, and once a retailer manages to change a customer’s level of engagement, this level tends to last. Consequently, retailers are generally interested in finding customers who are open to changing their habits – people who moved from one city to another, graduated from school or university, recently married and so on. The canonical example of such modeling is Target’s attempt to predict customers’ pregnancies in the early stages [DG12] because births obviously change the way customers shop.

The models above can be embedded into a framework similar to the equation (1.4) as alternatives to the gross margin objective. We will take a closer look at propensity modeling in a later section dedicated to price discrimination where we will model the propensity to respond to a discount. More details on propensity modeling can be found in dedicated studies and books like [FX06] and [SG09].

The framework can also be extended to select an optimal incentive among multiple alternatives. For instance, a retailer can estimate the expected performance of two incentives A and B (e.g. chocolate ice-cream versus vanilla ice-cream) and then select the optimal option for a given user according to the following criteria [WE07]:

A\to u:\begin{cases}E\left\{g|u;A\right\}-c_A>E\left\{g|u;B\right\}-c_B\\E\left\{g|u;A\right\}-E\left\{g|u;\bar{A};\bar{B}\right\}>c_A\end{cases} \quad (1.5)
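
In code, the criterion (1.5) amounts to two comparisons over model estimates; the following sketch uses hypothetical argument names:

# A sketch of the incentive selection criterion (1.5): offer incentive A
# only if it beats alternative B on expected net value and also beats the
# no-incentive baseline by more than its cost.
def offer_A(e_g_A, e_g_B, e_g_none, c_A, c_B):
    # e_g_A = E{g|u;A}, e_g_B = E{g|u;B}, e_g_none = E{g|u; no incentive}
    return (e_g_A - c_A > e_g_B - c_B) and (e_g_A - e_g_none > c_A)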

Finally, it is worth noting that the problem of response modeling is tightly coupled with customer segmentation:

  • Response modeling can be used to validate the feasibility of customer segments discovered by clustering. A segment that consistently responds to a certain marketing program is actionable and solid.
  • Propensity models are regression and classification models trained on customer data. The analysis of principal regressors can suggest customer segments. On the other hand, clustering can suggest suitable propensity models.

Problem 2 : Recommendations

Problem Statement

There is a set of incentives where each incentive corresponds to a product or some other catalog item. Incentives are not associated with direct monetary costs, but only a limited number of incentives can be shown to a customer. In that respect, each incentive has a cost in terms of screen space or a customer’s attention span, so the negative effect of providing a customer with a given incentive can be measured as lost opportunity costs. The goal is to suggest a personalized subset of incentives (e.g. recommendations on a website) to each customer to maximize the purchasing performance of the population.

Applications

The most typical use cases for this problem are recommender systems, personalized search results ranking, and targeted ads. However, there are a number of other important applications:

  • Manufacturer-sponsored discounts can fall into this category because a retailer is not concerned about the cost of the incentives (covered by the manufacturer), only about efficient targeting. Manufacturer-sponsored campaigns are widely used in many retail subdomains such as grocery or department stores because manufacturers heavily rely on promotions to increase their market shares.
  • Cross-sell marketing can benefit from recommendation models because some recommendation techniques are able to reveal implicit dimensions in a customer’s profile, like lifestyle. This ability is especially useful as cross-category recommendations, such as furniture or kitchenware, can be based on a customer’s purchases in the apparel department.
  • Recommender systems are able to generalize a customer’s purchase and browsing histories into a psychographic profile in the sense that behavioral patterns such as grunge dressing or sporty lifestyle can be quantitatively measured. The same techniques can be used to profile competitors by what they sell, similarly to how customers are profiled by what they buy. This can help to reveal tendencies in a competitor’s assortment planning and other strategic moves.
  • Some recommendation algorithms are able to classify textual descriptions of products in terms of psychographic dimensions like lifestyle, so merchandisers can leverage them to assess product descriptions and get suggestions on proper wording for preferred product positioning.

It is critical to note that although recommendations are typically considered a relatively specialized online service, the principles and techniques developed for recommender systems are fundamentally important for many aspects of retail because they aim to reveal the hidden mapping between customers and the products they are interested in, which is a principal task for any retailer.

Solution

Recommender systems have been a subject of extremely intensive research during the last two decades, and books [JZ10, RR10] were written to provide a systematic view of dozens of recommendation algorithms and techniques suggested in numerous articles, presentations and whitepapers. To a certain extent, such a high diversity of recommendation techniques is attributed to several implementation challenges like the sparsity of customer ratings, computational scalability, and lack of information on new items and customers. Clearly, we cannot review even a fraction of these methods and algorithms in this section, and it does not make a lot of sense to try to do so because many overviews of all kinds are widely available. Instead we will focus on the objectives and utility functions that drive the design of recommender systems and mainly bypass the algorithmic and technical sides of the problem.

From an econometric perspective, the problem of recommendations is closely related to the rapid expansion of assortment in many retail sectors expedited by ecommerce and omni-channel commerce. A large assortment increases the number of slow-moving items, each of which has a small sales volume and contributes little to revenues, but the overall contribution of this “long tail” is significant:

[Figure: revenue by item rank, illustrating the long tail of slow-moving items]

Traditional recommendation techniques like advertising best sellers become insufficient to leverage the potential of slow-moving items, and more sophisticated recommendation methods are needed to guide the customer through the millions of items that he or she will never explore completely without suggestions.

Since we are mainly concerned with the models that describe customer preferences with regard to products, let us walk through the most widely used recommendation techniques, ordering them by the complexity of their utility functions, starting from the relatively simple and moving towards the more advanced. We will use the hierarchy of recommendation techniques depicted in the figure below. This hierarchy resembles the commonly used classification of recommender systems, although it is not exactly the same:

[Figure: hierarchy of recommendation techniques, from single-objective methods (content filtering, collaborative filtering, latent factor models) to multiple-objective solutions]

Multiple-objective solutions are especially interesting in the context of this study because they can incorporate economic goals alongside simple relevancy.

Single objective. Let us start with a basic definition of the single-objective recommendation task that is widely used in the literature on recommender systems. A retailer sells line items J=\left\{j_1,...,j_n\right\} to the population of users U=\left\{u_1,...,u_m\right\}. The rating function R: U \times J \to \mathbb{R} expresses the opinion of a user about an item and ranges from negative (“don’t like”) to positive (“like”) within a certain numerical scale. The rating score for some pairs of users and items can be estimated from explicitly set user ratings or by analyzing purchase histories, access logs on a website and so on. The recommendation task then can be defined as the prediction of a rating score \tilde{r}_{u,j} for a given user-item pair (u,j).

There are two natural ways to approach the problem of rating prediction: to estimate a rating score for each user independently by looking for items that are similar to what this particular user liked in the past; and to estimate the rating scores by averaging the ratings from users similar to a given one. These two approaches are known as content filtering and collaborative filtering, respectively.

Content filtering. The main idea of content filtering is to predict ratings by comparing past user preferences, behavior and purchases with the product items. Although different interpretations of content filtering are possible, we choose to treat it as a classification problem [PZ07] to highlight usage of data mining methods:

  • Each user can be considered a regression model that predicts ratings for items. A particular case of such a model is a binary classifier that classifies items into two categories – “like” and “don’t like”.
  • A user profile is an instance of the regression model introduced above. This model is trained using the known ratings for the user (explicit ratings, purchase history etc.)
  • A list of recommended items for a given user is obtained by predicting the ratings for the catalog items using the user profile regression model and selecting a subset with the highest estimated ratings.

Although the process above seems to be relatively straightforward, it is very challenging because users and items are fundamentally different entities and it is difficult to find a representation for catalog items that can be directly transformed into such a subtle thing as user preferences by means of a regression model.

The main problem is that standard inventory attributes such as a brand, product name, or price are typically insufficient to measure the utility of an item to a user. Although some customers can be satisfactorily characterized by loyalty to a certain brand or price category, more subtle and informal dimensions like lifestyle or temperament are required to describe observed patterns and commonalities. These implicit dimensions are crucial for products like movies, books, music, and even for more tangible goods like apparel. A retailer can leverage standard classification techniques to label items with implicit dimensions as follows [GH02]:

  • Merchandisers manually rate a subset of items in terms of intangible properties. For instance, garments can be characterized by criteria like trendy, conservative, sporty etc.
  • Standard product attributes like textual descriptions are designed to deliver a particular marketing message to customers, hence they implicitly contain some intangible attributes. Thus, the subset of manually rated items is used to build classification models that map product attributes to implicit dimensions. For instance, one can use the Bayesian classification approach to estimate the probabilities Pr(word in item description | implicit attribute value) that describe how each dictionary word connotes implicit characteristics.
  • The rest of the catalog then can be automatically classified by calculating the a posteriori probabilities Pr(implicit attribute value | item description).

Content filtering in general, and item modeling in particular, is an information retrieval task, so many text mining and search techniques (for example, see [MA08] for an overview) are typically leveraged to build a recommender. We omit these details here because they are not so important from the econometric perspective.
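
As an illustration of the Bayesian classification step above, the sketch below labels catalog items with an implicit “style” attribute based on their textual descriptions. The training examples and labels are invented; in practice they would come from merchandisers:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A few manually rated items (descriptions and implicit labels invented).
descriptions = ["slim fit neon running jacket",
                "classic wool tailored overcoat",
                "breathable mesh training shorts",
                "elegant silk evening scarf"]
labels = ["sporty", "conservative", "sporty", "conservative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(descriptions, labels)

# Posterior Pr(implicit attribute value | item description) for new items.
print(clf.predict_proba(["lightweight jogging tights"]))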

Collaborative filtering. The problem of implicit dimensions noted in the previous section has crucial implications that lead us to the second family of recommendation techniques. This problem naturally arises in connection with the fundamental inability to rigorously model human tastes and judgments. Collaborative filtering represents a natural, and probably the only possible, solution that does not require manual training of the system – the need for a “human factor” in recommendation decisions is satisfied by using the feedback from other users.

The basic collaborative filtering model [RE94, BR98] is straightforwardly defined based on the similarity metric between users:

\tilde{r}_{u,j}=\bar{r}_{u}+\lambda\sum_{v\in U} sim(u,v) (r_{v,j}-\bar{r}_v)\quad(2.1)

where r_{u,j} is a known rating for item j set by user u, U is the set of all users or a heuristically selected neighborhood around a given user, \lambda is a normalization coefficient, sim(u,v) is a measure of similarity between two users, and \bar{r}_u is the average rating for a given user

\bar{r}_u=\frac{1}{|J_u|} \sum_{j\in J_u}r_{u,j}

assuming that J_u is the set of items rated by the user. The equation (2.1) uses the concept of average user ratings to model the fact that some users tend to give higher or lower ratings than others because they are more or less demanding. Although not absolutely required, this correction is very important in practice and has been widely used since the very first implementations of collaborative filtering.

The similarity function is typically calculated as a cosine distance or Pearson correlation coefficient between the rating vectors for J_u and J_v. In addition, this basic similarity measure can be adjusted in multiple ways [BR98, SU09] to improve its performance in practice.
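
The model (2.1) can be sketched directly on a small rating matrix, with cosine similarity computed over co-rated items; zeros denote missing ratings and the data is synthetic:

import numpy as np

R = np.array([[5, 4, 0, 1],  # rows: users, columns: items
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def predict(R, u, j):
    """Estimate the rating of item j by user u according to (2.1)."""
    mask = R > 0
    means = np.array([row[m].mean() for row, m in zip(R, mask)])
    sims, devs = [], []
    for v in range(len(R)):
        if v == u or not mask[v, j]:
            continue
        common = mask[u] & mask[v]           # items rated by both users
        if not common.any():
            continue
        a, b = R[u, common], R[v, common]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        devs.append(R[v, j] - means[v])      # deviation from v's average
    if not sims:
        return means[u]
    lam = 1.0 / np.sum(np.abs(sims))         # normalization coefficient
    return means[u] + lam * np.dot(sims, devs)

print(predict(R, u=0, j=2))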

The model (2.1) has significant drawbacks because its computational complexity grows proportionally to both the number of items and the number of users, and because of the sparsity of user ratings. Sparsity of ratings means that each user rates only a small fraction of the available products, so the vectors J_u and J_v used to compute the similarity metric often have no elements in common, which reduces the quality of recommendations. For instance, it is known that Amazon [SA01] and Netflix [YK08] are missing 99% of the possible ratings in U\times J. To overcome this limitation, the user-based model (2.1) is often replaced by the conceptually similar item-based model [SA01, YK08]:

\tilde{r}_{u,j}=\bar{r}_{u,j}+\lambda \sum_{i\in J_u} sim(j,i)(r_{u,i}-\bar{r}_{u,i})\quad (2.2)

where the similarity metric between the items is based on the ratings from all users they have in common, and the baseline rating \bar{r}_{u,j} incorporates both user bias (the user’s average rating compared to the overall average) and item bias (the item’s average rating compared to the overall average). It is worth noting that a simplistic way to implement (2.2) is to find frequent itemsets (items that are frequently bought together) and compute the similarity based on co-occurrence in the same itemsets. This approach is also known as the “poor man’s recommendation engine” due to its simplicity [RE03].
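
That co-occurrence shortcut can be sketched in a few lines, assuming transaction baskets are available (the baskets below are invented):

from collections import Counter
from itertools import combinations

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"milk", "cereal"},
           {"bread", "milk", "cereal"}]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

def recommend(item, k=2):
    scores = {(a if b == item else b): n
              for (a, b), n in pair_counts.items() if item in (a, b)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bread"))  # items most often bought together with bread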

The models (2.1) and (2.2) belong to the family of so-called neighborhood models that estimate rating by analyzing the neighborhood of most similar users or items. This family also includes a huge variety of techniques [SU09] that replace computationally expensive inspection of the neighborhood by more compact probabilistic models or other approximations.

Although neighborhood models are a proven recommendation technique known to be used by leading retailers like Amazon, they still suffer from the fundamental problem of the implicit dimensions we discussed in the context of content filtering. The user-to-user and item-to-item similarity metrics considered above are limited in their ability to reveal a complicated relationship between users and items. A similar problem appears in information retrieval when text documents are looked up by a search query – synonymy (multiple ways to express the same idea or concept using different words and linguistic structures) and polysemy (multiple meanings of the same word or expression) make it very challenging to reveal the actual intent of a person who composed the query and to adequately translate this intent into a similarity measure between the document and query. To cope with this problem, a technique called Latent Semantic Analysis was proposed in [DR90]. Ten years later, this method was utilized in the design of recommender systems [SA00], creating a new family of latent factor models.

The main idea of latent factor models can be described as follows: the rating function R can be represented as an m\times n matrix (m is the number of users and n is the number of products) whose elements are the rating values. Each user’s rating vector can be considered a point in a linear space of n dimensions, and the recommendation task can be restated as computing a user rating vector as a linear combination of other rating vectors. Indeed, the equation (2.1) is naturally a linear combination of ratings with weights defined by the similarity function. However, the problem is that the rating matrix is sparse due to missing ratings, generally noisy because of biases and randomness, and based on item-wise dimensions, which limits its ability to reveal user tastes that are generally related to item groups instead of individual items. In other words, the signal is scattered across this huge low-density matrix and mixed with noise to such an extent that it can be revealed only by studying hidden patterns. The idea of the latent factor model is to approximate this large linear space using a basis of a smaller dimensionality. It helps to achieve the following goals:

  • Smaller dimensionality helps to concentrate the energy of the signal, so each basis vector significantly contributes to the rating estimation. It reduces the noise by discarding minor fluctuations that simply do not fit the smaller basis.
  • The basis computation process can be designed to produce basis vectors with minimal interdependencies, effectively revealing major tendencies in users’ tastes, each of which corresponds to a basis vector. For instance, when this approach was used by Netflix to predict movie ratings [YK08, YK09], the system produced dimensions that clearly corresponded to axes like drama versus comedy, men versus women target audience etc.

The goals above can be achieved by means of dimensionality reduction because of correlations in the original data representation R. As an illustration, consider the following example of two-dimensional data:

[Figure: two-dimensional data points with large coordinates in the original basis (R_1, R_2) that are well described by a single coordinate in a rotated basis (B_1, B_2)]

Each point in the set has substantially large coordinates along both the R_1 and R_2 dimensions, suggesting a complex and irregular structure of the data. However, another coordinate system B reveals that the data can be efficiently described by the coordinate along dimension B_1, while dimension B_2 does not matter much, suggesting a one-dimensional latent factor model.

To a certain extent, the latent factor model can be compared to how the discrete cosine transform (DCT) is used in image compression algorithms like JPEG to approximate the image by a few harmonics.

This chain of thinking leads us to the following formal model of the latent factors. One first selects the number of dimensions b\ll n,m and models each user and each item as a vector in the space of this dimensionality. Let us denote the vector for user u as p_u \in \mathbb{R}^b and the vector for item j as q_j \in \mathbb{R}^b. The vectors are computed from the rating matrix R  in such a way that each of b vector components corresponds to a latent dimension as described above. So, both users and items are now encoded in the same terms and the rating can be calculated as a product of these two vectors i.e. pairwise correlation between the corresponding dimensions:

\tilde{r}_{u,j}=q_j^T\cdot p_u = \sum_{i=1}^{b}q_{ij}p_{ui}\quad (2.3)

There are different ways to compute the latent factor vectors p_u and q_j for users and items. The most straightforward way is factorization of the rating matrix R using singular value decomposition (SVD); however, iterative gradient descent optimization methods [YK09] are typically used in practice because of computational stability and complexity concerns.
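
A minimal sketch of learning the factors in (2.3) by stochastic gradient descent over the observed ratings only; biases are omitted for brevity and the data is synthetic:

import numpy as np

rng = np.random.default_rng(2)
m, n, b = 20, 30, 3                     # users, items, latent dimensions
P = rng.normal(scale=0.1, size=(m, b))  # user factors p_u
Q = rng.normal(scale=0.1, size=(n, b))  # item factors q_j

# Observed ratings as (user, item, rating) triples.
ratings = [(rng.integers(m), rng.integers(n), rng.integers(1, 6))
           for _ in range(200)]

lr, reg = 0.01, 0.05                    # learning rate, regularization
for epoch in range(50):
    for u, j, r in ratings:
        err = r - Q[j] @ P[u]           # prediction error r - q_j^T p_u
        pu = P[u].copy()
        P[u] += lr * (err * Q[j] - reg * P[u])
        Q[j] += lr * (err * pu - reg * Q[j])

u, j, _ = ratings[0]
print(Q[j] @ P[u])                      # predicted rating for a known pair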

The sketch below illustrates the difference between the convolutions (2.1) and (2.3). On the left side, the sparse rating vector for a given item is convolved with the sparse similarity vector for a given user producing the estimate. On the right side, the rating is estimated by convolving two vectors of reduced dimensionality and higher energy density.

[Figure: convolution of sparse rating and similarity vectors in (2.1) versus convolution of two dense low-dimensional latent factor vectors in (2.3)]

Multiple objectives. All recommendation methods discussed above are driven essentially by a single objective – to provide the best semantic match or predicted preference score. However, recommendation accuracy might not be the only concern of the recommender system design – a retailer might be interested in incorporating multiple competing objectives into the recommendations offered to the customers. For instance, grocers might be interested in boosting perishables with a shorter shelf life, fashion stores might want to promote sponsored brands or seasonal collections, and a wide range of retailers can benefit from recommending products with a higher margin or from taking into account product stock levels to avoid stockouts.

A recommender system with multiple objectives was suggested in [JW10] and then developed and tested in practice at a large scale by LinkedIn [RP12]. In the case of LinkedIn, the primary objective was to recommend candidates who semantically match a job description and, as a secondary objective, also display job-seeking behavior. The method described in [RP12] defines the recommendation task as the following optimization problem:

\underset{R}{\max}\; E\left\{ g\left(R(\bar{r},\bar{f})\right)\right\}\quad (2.4)

s.t.\; E\left\{ dist\left(\bar{r}_{1..K},R(\bar{r},\bar{f})_{1..K}\right)\right\}\le c

where

  • \bar{r} is the original recommendation vector produced by the underlying recommendation systems based on semantic match and relevance, so the j-th element of \bar{r} represents a relevance score (rank) for j-th product.
  • \bar{f} is a vector of secondary feature scores, so the j-th element of \bar{f} corresponds to the score of the j-th product according to the secondary objective. For instance, this vector can contain product gross margins.
  • R(\cdot) is a composite ranking function that combines \bar{r} and \bar{f} into a new item recommendation rank that balances the two objectives.
  • g(\cdot) represents the overall utility function that measures the performance of the recommender system.
  • E\left\{\cdot\right\} denotes averaging over all recommendation realizations.
  • (\cdot)_{1..K} denotes the first K elements with maximal score, where K is the number of recommended items presented to the customer. For instance, if \bar{r} contains the recommendation scores for all n products in the catalog, then \bar{r}_{1..K} corresponds to the K most recommended products.
  • dist(\cdot) is the distance function that measures the discrepancy between two recommendation vectors, and c is a threshold that limits this discrepancy. According to [RP12], a reasonable practical choice for the distance measure is the sum of squared errors between two histograms of score vectors (b is the number of buckets in a histogram):

dist(\bar{x},\bar{y})=SSE(hist(\bar{x}),hist(\bar{y}))=\sum_{i=1}^{b} (hist(\bar{x})_{i}-hist(\bar{y})_{i})^2

The main idea of the optimization problem above is to increase the utility of the hybrid recommendations that mix relevance scores with the secondary objective, but penalize the difference between the original relevancy-based recommendations and hybrid recommendations to make sure that relevance will not be completely sacrificed in pursuit of a secondary objective. The design of the function R(\cdot)  should include tunable parameters that control the trade-off between two objectives and will be the subject of optimization. This approach can be straightforwardly extended to incorporate more than two objectives.

We can illustrate how the optimization model above can be adapted to the practical problems using a couple of examples. First, consider the case of a retailer who wants to incorporate the revenue objective into the recommendation scores. The overall utility function can be defined as the expected gross margin, assuming that m(p)\in [0,1]  is a normalized gross margin of item p and the probability of purchase is modeled as a reciprocal to the ranking position (i.e. the lower the item in the list of recommendations, the lower the probability of conversion):

g\left(\overline{r}\right)=\frac{1}{K}\sum^n_{i=1}{m(i)\frac{\alpha}{r_i}}\cdot \delta(r_i,K),\quad \delta(r_i,K)=\begin{cases}1, & r_i\le K \\ 0, & \mbox{otherwise}\end{cases}

where \alpha is the probability normalization constant. The composite ranking function can be defined as

R\left(\left\{r_1,r_2,\dots,r_n\right\},\left\{m(1),m(2),\dots,m(n)\right\}\right)= \left\{r_1\cdot m(1)^\beta,\dots,r_n \cdot m(n)^\beta\right\}

where \beta is a parameter that controls the trade-off between the relevance and pitching of high-margin products. This parameter will be the subject of optimization in the problem (2.4).
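
The sketch below wires these pieces together: it grid-searches \beta, scoring each composite ranking by a simplified utility (the mean margin of the top-K items rather than the rank-weighted margin above) and rejecting candidates whose histogram distance from the relevance-only ranking exceeds the threshold. All scores are synthetic:

import numpy as np

rng = np.random.default_rng(3)
n, K = 100, 10
r = rng.random(n)            # relevance scores from the base recommender
m = rng.random(n)            # normalized gross margins m(p) in [0, 1]

def top_k(scores, k=K):
    return np.argsort(scores)[::-1][:k]

def dist(x, y, buckets=10):  # SSE between histograms of two score vectors
    hx, _ = np.histogram(x, bins=buckets, range=(0, 1))
    hy, _ = np.histogram(y, bins=buckets, range=(0, 1))
    return float(((hx - hy) ** 2).sum())

def utility(scores):         # simplified g: mean margin of recommended items
    return m[top_k(scores)].mean()

c = 20.0                     # discrepancy threshold from (2.4)
base = r[top_k(r)]           # top-K scores of the relevance-only ranking
best_beta, best_u = 0.0, utility(r)
for beta in np.linspace(0, 2, 21):
    s = r * m ** beta        # composite ranking R: r_i * m(i)^beta
    if dist(base, s[top_k(s)]) <= c and utility(s) > best_u:
        best_beta, best_u = beta, utility(s)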

The second example of re-ranking according to a secondary objective is boosting featured items such as on-sale products or perishables. The utility function can be specified as the average number of featured products in the short list of K recommendations:

g(\bar{r})=\frac{1}{K}\sum_{i=1}^{n}F(i)\cdot \delta(r_i,K)

where F(i) is a feature label that equals 1 if the item is featured and 0 otherwise. The composite ranking function combines the relevance score and feature labels with a trade-off parameter \beta which is the subject of optimization:

R\left(\left\{r_1,r_2,\dots,r_n\right\},\left\{F(1),F(2),\dots,F(n)\right\}\right)= \left\{r_1\cdot \beta^{F(1)},\dots,r_n \cdot \beta^{F(n)}\right\}

The ranking function above can be straightforwardly extended to incorporate multiple separate features each of which contributes to the final ranking score according to its own trade-off parameter (all parameters will be optimized jointly): r_i\cdot\beta^{F_1(i)}\cdot\gamma^{F_2(i)}\cdot\dots

More details on numerical optimization algorithms for the problem (2.4) can be found in [RP12].

Problem 3 : Demand Prediction

Problem Statement

A retailer offers a category of products to its customers. The demand for a given product depends on many factors including the product’s own properties such as price or brand, prices of competing products in the category, sales events, and even the weather. The goal is to build a demand model that incorporates these factors and allows one to perform what-if analysis to forecast the response to price changes, assortment extensions and reductions, compute optimal stock levels, and allocate shelf-space units.

Applications

In this section we discuss the core problem of demand prediction. It can be considered a building block that is required to model actions that affect the demand or are constrained by stock levels:

  • Price optimization, sales event planning, and discount targeting.
  • Category management and assortment planning.
  • Stock level optimization.
  • Demand prediction models are generally useful in marketing campaign design because they explain the impact of regressors on demand. For instance, a demand prediction model can reveal that the price sensitivity (the measure of how much the demand changes when the price changes) on a given product strongly correlates with the package size and demographic properties of the neighborhood, suggesting the use of different prices at different stores and setting different per-unit margins for different package sizes.

We will use the demand prediction model in later sections dedicated to price optimization and assortment planning.

Solution

Demand prediction can be considered a relatively straightforward data mining problem that boils down to building a regression model and evaluating it over historical data. However, the design of the regression model is not so straightforward because the demand is influenced by many factors with complex dependencies. In this section we study a regression model suggested and evaluated in [KOK07] for Albert Heijn, a supermarket chain in the Netherlands. This model is based on earlier marketing studies such as [BG92], and fashion retailers like RueLaLa [JH14] and Zara [CA12] have also reported using similar models in practice. However, it is important to understand that different optimization problems require different demand prediction models, and it is hardly possible to build a universal demand model that incorporates the wide variety of factors that influence demand.

We start with the following model of the demand for a given product j:

D_j=V\cdot Pr(purchase | visit)\cdot Pr(j | purchase) \cdot E\left\{Q | j; purchase\right\}\quad (3.1)

where

  • V is the number of consumers visiting the store in a given time frame e.g. during the day
  • Pr(purchase | visit) is the probability that a consumer purchases any product from the category during her visit to the store
  • Pr(j | purchase) is the probability that a consumer chooses the product j among other alternatives when the purchase takes place
  • E\left\{Q | j; purchase\right\} is the mathematical expectation of the quantity (number of units) purchased by the consumer given the product j had been chosen and purchased

All factors in the equation (3.1) can be estimated from the transactional data from the stores. The demand generally depends on date (day of the week, holidays etc.) and store (size, neighborhood demographics etc.), so we introduce subscripts t and h to denote date and store, respectively, and estimate demand as a function of these parameters. Alternatively, store properties such as size, location and average consumer’s income can be incorporated into the model as regressors. According to [KOK07], the number of store visitors can be modeled as follows:

\ln(V_{ht})=\alpha_1+\alpha_2T_t+\alpha_3W_t+\sum_{i=1}^7\alpha_{3+i}B_{ti}+\sum_{i=1}^H\alpha_{10+i}E_{ti}

where T_t is the weather temperature, W_t is the weather comfort index (humidity, cloudiness etc.), B_{ti} and E_{ti} are 0/1 dummy variables for a day of the week and public holidays, respectively, H is the number of public holidays, and alphas are regression coefficients.

The purchase incidence is a binary variable (purchase/no purchase), so we can use a standard modeling approach – express the purchase probability as a sigmoid function and estimate its argument x from the data by regression:

Pr(purchase|visit)=\frac{1}{1+e^{-x}}\quad\Longleftrightarrow\quad x\stackrel{def}{=}\ln\left(\frac{Pr(purchase|visit)}{1-Pr(purchase|visit)}\right)

The regression model for x will be

x_{ht}=\beta_1+\beta_2 T_t+\beta_3 W_t + \beta_4 \sum_j A_{jht}\frac{1}{N_h}+\sum_{i=1}^7 \beta_{4+i}B_{ti}+\sum_{i=1}^H \beta_{11+i}E_{ti}

where A_{jht} is a dummy variable equal to 1 if product j is on promotion and 0 otherwise, and N_h is the total number of products in the category, i.e. the term weighted by \beta_4 is the fraction of promoted products in the category.

Estimation of Pr(j|purchase) is a little bit more tricky. Consumer choice modeling is a fundamental econometric problem studied by a dedicated economic discipline, choice modeling theory, which justifies the notion that the multinomial logit model (MNL) is an efficient way to model the probability of choice among alternatives [FAD74, CN10]:

Pr(j|purchase)=\frac{\exp(y_j)}{\sum_i \exp(y_i)}

where i iterates over all products in a category and y_j is a parameter variable. Similarly to the probability of the purchase incidence, we build a regression model for a parameter variable y_j:

y_{jht}=\gamma_j+\gamma_{N+1}(R_{jht} - \bar{R}_{ht})+\gamma_{N+2}(A_{jht} - \bar{A}_{ht}),\quad j\in1\dots N

where the coefficients \gamma_{N+1} and \gamma_{N+2} are shared for all products, R_{jht} and \bar{R} are the product price and the average price in the category, respectively, and A_{jht} and \bar{A} are promotion dummy variables and the average promotion rate, as described above for the purchase incidence regression model.

Finally, the average number of units sold can be modeled as follows:

E\left\{Q|j;purchase\right\}=\lambda_j+\lambda_{N+1}A_{jht}+\lambda_{N+2}W_t+\sum_{i=1}^H \lambda_{N+2+i}E_{ti}

By substituting the models above into the root expression (3.1), one obtains a fully specified demand prediction model. This model can be adjusted to a retailer’s business use cases by adding more explanatory variables such as marketing events.
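
A sketch of the composed model follows; every coefficient below is invented for illustration and would normally be fitted to transactional data by the regressions above:

import numpy as np

def visitors(temp, weather, day_dummies, holiday_dummies, a):
    # ln(V) regression: intercept, temperature, weather index, dummies
    return np.exp(a[0] + a[1] * temp + a[2] * weather
                  + a[3:10] @ day_dummies + a[10:] @ holiday_dummies)

def purchase_prob(x):
    return 1.0 / (1.0 + np.exp(-x))  # sigmoid of the regression score x

def choice_prob(y):
    e = np.exp(y - y.max())          # multinomial logit over the category
    return e / e.sum()

def demand(j, V, x, y, q):           # the demand model (3.1)
    return V * purchase_prob(x) * choice_prob(y)[j] * q[j]

a = np.concatenate(([3.0, 0.02, 0.1], np.full(7, 0.05), [0.2]))
V = visitors(temp=21.0, weather=0.8,
             day_dummies=np.eye(7)[2], holiday_dummies=np.array([0.0]), a=a)
y = np.array([0.4, 0.1, -0.2])       # MNL scores y_j for three products
q = np.array([1.8, 1.2, 1.5])        # E{Q | j; purchase} per product
print(demand(0, V, x=-1.0, y=y, q=q))  # expected units of product 0 sold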

Problem 4 : Price Discrimination

Problem Statement

A retailer offers a category of products to customers. The goal is to assign an individual price to each customer in order to maximize the overall revenue. Alternatively, the problem can be restated as targeting discounts that change prices compared to the common baseline.

Applications

Price discrimination is widely used in retail and there are many explicit and implicit forms of it:

  • Coupons, store-level price zones, and discounts are all examples of price discrimination.
  • Price discrimination relates to up-sell via quantity discounts.
  • Dynamic pricing can use price discrimination principles and models to incrementally adjust prices.

Although we have stated the problem in a way that suggests fine-grained individual prices, it is an extreme case and the more typical approach is to set prices for larger customer segments.

Solution

Price discrimination is one of the most fundamental problems in economics and marketing [SM11], so it makes sense to start with some background on classic economics. Broadly speaking, a retailer, as well as any other commercial enterprise, can be modeled using the following basic equation:

G=Q\cdot(P-V)-C\quad(4.1)

where G is profit, Q is quantity sold, P is unit price, V is variable cost per unit, which roughly corresponds to wholesale price in the case of a retailer, and C is fixed costs like general management. Price P and quantity Q on the right-hand side of the equation (4.1) are interdependent because the demand typically decreases as the price increases and vice versa. The relationship between the price and quantity is often approximated by a linear function with a coefficient that is widely known as the elasticity of demand:

e\stackrel{def}{=} \frac{\Delta Q / Q}{\Delta P / P}\quad (4.2)

In other words, the elasticity of demand is a ratio between the percentage change in quantity demanded and the percentage change in price. The equations (4.1) and (4.2) can be visualized as follows:

[Figure: linear demand curve in price-quantity coordinates; profit equals (P_0 - V)·Q_0 at the chosen price P_0]

The demand curve is a line with a slope defined by the elasticity of demand and a retailer’s profit is the difference between the revenue and variable costs, numerically equal to (P_0 - V)\cdot Q_0 where P_0 is the unit price set by a retailer. On the one hand, the profit tends to zero when the price approaches variable costs, high sales volumes notwithstanding. On the other hand, too high a price drives the sales volume down and, consequently, profit to its minimum. It basically means that the price is a subject of numerical optimization and a retailer can use statistical techniques to estimate the elasticity of demand and find the optimal price that maximizes the equation (4.1). This approach, known as economic price optimization, has limited practical applicability [SM11] because the model expressed by the equation (4.1) oversimplifies market behavior and discards important factors that impact price-quantity relationship in a competitive market. For instance, significant price drops are likely to trigger a symmetric response from competitors decreasing prices industry-wide, so all market players eventually find themselves in a situation with status quo sale volumes and shares but less profit.
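
Despite these caveats, the mechanics of economic price optimization are easy to sketch for the linear demand model implied by (4.1)-(4.2); the parameters below are illustrative and the competitive response discussed above is ignored:

import numpy as np

P0, Q0, e = 10.0, 100.0, -1.5  # reference price and volume, elasticity
V, C = 6.0, 50.0               # variable cost per unit, fixed costs

def profit(P):                 # the model (4.1) with linear demand (4.2)
    Q = Q0 * (1 + e * (P - P0) / P0)
    return Q * (P - V) - C

prices = np.linspace(V, 16.0, 201)
best_price = prices[np.argmax(profit(prices))]  # about 11.3 here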

Despite the limitations of economic price optimization, equations (4.1-4.2) shed some light on the nature of price discrimination. Any single price P_0, no matter how optimal, represents a tradeoff because some customers do not buy a product considering it too expensive, although they would be willing to buy it at a lower price between P_0 and V, still contributing positively to the profit. Moreover, some customers will tolerate prices higher than P_0, although the sales volume they generate is relatively small. In both cases, a retailer fails to capture additional profits that lie in the triangle between the demand curve and the variable costs line. Price discrimination is a natural way to overcome the limitations of a single regular price by segmenting customers according to their willingness to pay and offering different prices to different segments. Consider a particular case of this strategy where the regular price from the previous chart is complemented by a higher premium price (note how the profit area increases compared to the single price strategy):

[Figure: the same demand curve with a regular price and a higher premium price; the captured profit area grows compared to the single-price strategy]

This consideration leads to the challenging question of how a retailer can sell the same product to different customers at different prices. Broadly speaking, it requires setting fences between customers with different willingness to pay in such a way that customers with higher willingness will not be able to pay a lower price intended for the lower segments. Retailers have a number of fencing mechanisms at their disposal including the following:

  • Store zone. Stores in the retail chain are typically located in different neighborhoods with different demographic and competitive factors such as the average household income, average family size, distance to the nearest competitive store etc. It naturally separates customers by levels of price sensitivity and ability or willingness to look for an alternative supplier. It enables a retailer to set store-level prices that differ in different zones.
  • Package size. Fast-moving consumer goods (FMCG) such as soft drinks or toiletries have high turnover rates and consumers naturally have a choice between buying small amounts of product frequently and stockpiling larger amounts. This tradeoff is also impacted by demographic factors such as household size. It creates fences by willingness to buy large or small packages and enables setting different per-unit margins for different package sizes. Buy-one-get-one (BOGO) offers also relate to this category.
  • Sale events. Customers can be differentiated by their willingness to wait for a lower price versus willingness to buy immediately at the regular price. This type of segmentation is widely used in the apparel domain where seasonal sales are one of the main marketing mechanisms.
  • Coupons. Many customers might not be willing to buy a given product at the regular price but might consider buying it at a discounted price. Hence a retailer can benefit from a discount because it attracts additional customers, even though the margin on their purchases is lower than on purchases made at the regular price. On the other hand, it might be harmful to offer a discount to an excessively wide audience because it will be used even by customers who would be willing to pay the regular price (in the absence of the discount). The response modeling techniques discussed in one of the previous sections help to solve this problem. However, there is a traditional solution in use since the 19th century – couponing. A coupon represents a price discount that requires some effort to earn or redeem (e.g. a customer has to find it in a newspaper, cut it out, and present it at a store), fencing customers by their willingness to spend time and effort getting a discount.
  • Sale channels. Sale channels naturally represent fences because customers select channels by criteria that strongly correlate with their willingness to pay. For instance, price sensitivity of liquor store shoppers is consistently lower compared to customers who buy the same wine in grocery stores [CU13].

Although all these techniques have long been in use, the problem of building an integrated discount optimization model is very challenging and, to the best of our knowledge, all existing models are limited in one way or another. In the rest of this section we consider two price discrimination models that were designed and evaluated using data from US supermarkets, in particular Safeway’s subsidiary in Chicago.

Discrimination by quantity and location. The model developed in [KJ05] aims to jointly optimize quantity discounts based on package sizes and store-level price zones. This model is quite similar to the model we studied in the section dedicated to demand prediction; however, it elaborates more on package size and discount parameters.

Let us consider the case of a retailer that operates multiple stores and sells a few substitutable brands of a product that comes in several sizes, e.g. 2-, 4-, and 6-packs of Coca-Cola and Pepsi. The goal is to optimize prices for each size assuming that the price per unit can vary depending on the package size and that price settings can vary across the stores as well. We start with the standard multinomial logit (MNL) model for demand prediction that we discussed in a previous section:

Pr(j|u,t,s)=\frac{\exp(x_{juts})}{1+\sum_{k=1}^{J} \exp(x_{kuts})}\quad (4.3)

which denotes the probability of purchase of product variant j by customer u at time t in store s, where J is the number of product variants (the total number of sizes across all brands; the constant term in the denominator corresponds to the no-purchase option). Time is measured in relatively large time intervals such as weeks. The parameter variable x can be estimated using the following regression model:

x_{juts} = \alpha_{1j}+\alpha_2 z_j + \alpha_3 p_{jts} + \alpha_4 d_{jts} + \alpha_5 p_{jts}\cdot d_{jts} + \alpha_6 r_{uts} + \alpha_7 f_{jts}

where z_j is the package size, p_{jts} is the price, d_{jts} is the discount depth in dollars, r_{uts} incorporates competition effects such as the distance to the nearest competing store, and f_{jts} incorporates environmental shifters such as weather. Consequently, the regression parameters \alpha_k correspond to a customer's brand bias, preference for size, price sensitivity, responsiveness to discounts, the impact of discounts on price sensitivity, the impact of competition, and the sensitivity to shifting effects, respectively. It is argued in [KJ05] that the regression model for price sensitivity in the case of promotions should be more complex than a single regressor in order to capture the fact that past promotions can increase the current price sensitivity because customers can stockpile the product. This aspect is modeled by decomposing the corresponding regressor into two subcomponents as follows:

\alpha_3 = \bar{\alpha}_3 + {\alpha'}_3 \sum_{w=1}^{W}\frac{1}{w}\left(p_{j,s,t-w}^0 - p_{j,s,t-w}\right)

where \bar{\alpha}_3 is the mean sensitivity and the second term represents the memory effect. The history depth W denotes the number of weeks in the past,  p^0 denotes the regular price, and p is the actual discounted price. In a similar way, it is argued that the sensitivity to a promotion depends on recent promotions:

\alpha_4=\bar{\alpha}_4 + \alpha_w \cdot \ln(w_j)

where w_j is the number of weeks elapsed since the previous promotion. This dependency on w_j basically models the assumption that the longer the period between promotions, the higher the response to them.
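
To make the model concrete, here is a minimal Python sketch of the choice probabilities (4.3) and the memory-adjusted price sensitivity; the utilities and coefficient values are placeholders that would come from the fitted regression above:

```python
import numpy as np

def mnl_probabilities(x):
    # Equation (4.3): choice probabilities over J product variants; the
    # "1 +" in the denominator is the no-purchase option with utility 0,
    # so the returned probabilities sum to less than 1.
    e = np.exp(x)
    return e / (1.0 + e.sum())

def price_sensitivity(alpha3_mean, alpha3_mem, regular_prices, actual_prices):
    # Memory effect: past discounts (regular minus actual price) increase
    # the current price sensitivity with a 1/w decay, where w is the number
    # of weeks in the past (arrays are ordered most recent week first).
    w = np.arange(1, len(regular_prices) + 1)
    return alpha3_mean + alpha3_mem * np.sum((regular_prices - actual_prices) / w)
```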

Equation (4.3) allows one to predict sales volumes, so the price optimization problem can then be stated based on equation (4.1), independently for each time period:

\underset{p}{\max}\sum_j (p_{jt}-c_{jt})\cdot Q_{jt}\quad (4.4)

s.t.\; \sum_j p_{jt}\cdot \sigma_{jt} \le \sum_j p_{jt}^0\cdot \sigma_{jt}

where c_{jt} denotes the wholesale price and Q_{jt} denotes the predicted sales volume. The role of the optimization constraint here is to avoid sharp changes and skews in price that can trigger major changes in market competition or customer behavior. This particular constraint, proposed in [KJ05], requires the share-weighted average price (\sigma_{jt} denotes the market share of product j) not to exceed the share-weighted average price before the optimization (p_{jt}^0 denotes the original prices). The optimization problem (4.4) can be solved at the store level, which implies discrimination both by quantity and location, or, alternatively, at the chain level, which implies quantity discounts only.
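
Because the candidate price sets are small in practice, problem (4.4) can be solved by exhaustive search for a handful of products; in this sketch, demand_fn is a placeholder for a fitted model (such as the MNL above) that maps a candidate price vector to predicted volumes Q_j:

```python
import itertools
import numpy as np

def optimize_prices(price_grid, p0, cost, shares, demand_fn):
    # price_grid: one list of candidate prices per product; p0: original
    # prices; shares: market shares sigma_j used in the constraint of (4.4).
    budget = np.dot(p0, shares)  # share-weighted average price before optimization
    best, best_profit = None, -np.inf
    for p in itertools.product(*price_grid):
        p = np.asarray(p)
        if np.dot(p, shares) > budget:  # keep the weighted average price level
            continue
        profit = np.dot(p - cost, demand_fn(p))  # objective of (4.4)
        if profit > best_profit:
            best, best_profit = p, profit
    return best, best_profit
```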

Personalized discounts and coupons. Although the previous model allows for store-level prices, which implies customer-level price discrimination, it is not designed to optimize discounts for individual customers. The second model we consider [JT13] was designed specifically to optimize personalized discounts and coupons. The main advantage of this model is that it optimizes not only the depth of the discount, but also tries to find the optimal time to offer a discount to a given user and its optimal duration. The idea of optimizing the temporal properties comes from the assumption that a customer's probability to purchase is not uniform but varies over time, so there is an optimal discount time window for each user. The major shortcoming of this model is that it deals with a single product instead of a category, hence it can be used to optimize the performance of a particular brand (e.g. for manufacturer promotions), but it cannot be used for category management.

In order to model the temporal properties of a discount, we decompose the probability of a purchase of product j by customer u at time t, assuming discount depth d, into a product of the product-choice probability and the probability of making a purchase at time t:

p_{jtud}=p(product=j|u;d)\cdot p(time=t|u;d)

The probability of purchasing a given product can be estimated using the MNL model from equation (4.3). The probability density function of a purchase at time t is modeled in [JT13] as an Erlang distribution:

p(time=t|u;d)=y_u^2\cdot t \cdot \exp(-y_u t)

where the parameter variable y_u can be estimated by means of a regression model that, similarly to the model for the parameter variable x in equation (4.3), includes the discount depth as a regressor, so it can later be subject to optimization.

The probability of a purchase defined above enables us to model the sales volume for a given customer Q_u as a function of the discount depth in dollars d, discount start time t, and discount duration T:

Q_u(d,t,T)=\int_t^{t+T} p_{jtud} dt

This leads us to the following optimization problem for gross margin:

\underset{d,t,T}{\max}\; \sum_u m\cdot \left(Q_u(0,0,t)+Q_u(d,t,T)+Q_u(0,t+T,\infty)\right) - d \cdot Q_u(d,t,T)

where m is the margin at the regular price. The first term in the equation above corresponds to the gross margin, which in turn consists of three components – margin earned before the promotion, during the promotion, and after the promotion – and the second term corresponds to promotional costs. The following chart illustrates this optimization problem:

coupon-optimization

The top plot shows the probability density of a purchase by customer u, where the expected sales volume for a given product at the regular price corresponds to the area S_0. A flat permanent discount will lift this volume by adding the area S_1, so the total revenue and promotional costs (shown in the middle plot) will both be proportional to S_0+S_1. A time-optimized promotion will make the revenue proportional to S_0+S_2, while its costs will be proportional to S_{02}+S_2 (the bottom plot). This difference between the flat promotion and the optimized promotion shows the potential advantage of temporal optimization, depending on the quantitative properties of the probability density functions.
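
The Erlang density above integrates in closed form, which makes the objective cheap to evaluate; in this sketch, y_of(d) and p_choice_of(d) stand for hypothetical fitted regressions that map the discount depth to the timing parameter y_u and to the product-choice probability, respectively:

```python
import numpy as np

def erlang_mass(y, a, b):
    # P(a <= purchase time <= b) under the Erlang density y^2 * t * exp(-y*t),
    # using its closed-form antiderivative F(t) = -(1 + y*t) * exp(-y*t).
    F = lambda s: 0.0 if np.isinf(s) else -(1.0 + y * s) * np.exp(-y * s)
    return F(b) - F(a)

def expected_profit(m, d, t, T, y_of, p_choice_of):
    # Objective from the text for a single customer: margin m on purchases
    # before, during, and after the promotion window [t, t+T), minus the
    # discount d paid out on purchases inside the window.
    before = p_choice_of(0.0) * erlang_mass(y_of(0.0), 0.0, t)
    during = p_choice_of(d) * erlang_mass(y_of(d), t, t + T)
    after = p_choice_of(0.0) * erlang_mass(y_of(0.0), t + T, np.inf)
    return m * (before + during + after) - d * during
```

A coarse grid search over (d, t, T) then locates the optimal discount depth and time window for each customer.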

Problem 5 : Sales Event Planning

Problem Statement

A retailer prepares a sales event – a limited-time discount on a particular product or group of products. Event planning requires the estimation of the following interdependent values:

  1. What product stock levels are needed to avoid stockouts before the end of the event?
  2. What price maximizes revenue? The price can be considered a constant value or a function of time on the interval from the beginning of the event to its end.

We will consider a case where the stock level is predefined and a retailer is trying to calculate optimal prices. This problem statement is typical for fashion retailers who deal with seasonal clearance sales and collection renewals [JH14, CA12]. The problem can be stated in many different ways both to study demand forecasting and price optimization as separate problems and to optimize stock levels and prices simultaneously in order to achieve maximal revenue.

Applications

Sales event planning has several applications in retail:

  • Clearance and seasonal sales are one of the main vehicles in fashion retailing.
  • Some business models like flash retailing (also known as pop-up retailing) use sales events as the only way to sell.
  • Retailers of fast-moving consumer goods and perishables can use event planning techniques to align the sales pace with product shelf life.

Solution

Dynamic demand prediction and price optimization are fundamental problems studied in the economic discipline called revenue management. The theory of revenue management is well-developed and systematically described in books like [TA05]. The most advanced and efficient examples of revenue management automation are found in the service industries that deal with reservations – flight tickets, stadium seats, hotel rooms, rental cars etc. To understand how such techniques can be leveraged in the retail space, we will consider a methodology recently developed by RueLaLa, a fashion retailer [JH14].

Let us assume that a retailer plans to provide a discount on N products or product groups where all products within a group have the same price (e.g. yogurts with different flavors or t-shirts of different colors). Let P be the set of prices that includes all prices that can be assigned to each of the products. In practice, P is often a relatively small set composed according to business rules. For instance, the lower bound on the price can be defined by a retailer's profitability level as $29.90 per item, the maximum price can be determined from an analysis of competitive offerings as $49.90, and intermediate price values can have an increment of $5.00 based on psychological pricing considerations, giving P = {$29.90, $34.90, $39.90, $44.90, $49.90}.

It is assumed that all products or product groups in the sales event have something in common, e.g. belong to the same category like “Women’s Shoes” or “Christmas Eve Foods”; hence, the demand for one product potentially depends on the prices of the other products, which can serve as substitutes. This assumption can be incorporated into the optimization model by introducing a variable S equal to the sum of the prices of all competing products (product groups) that participate in the event and estimating the expected demand for a given product as a mathematical expectation E\left\{Q|i,p_j,S\right\}, where Q is a random variable that represents the demand quantity, i=1,\dots,N is the product index, and p_j,\; j=1,\dots,|P| is the price of an individual product or product group. Since Q depends both on the product price and on S, it implicitly incorporates the ratio between the price of the product and the average price of possible substitutes, which influences the demand and its elasticity. We are now all set to define the optimization problem under the assumption that S is fixed, and then solve it for all possible values of S [JH14]:

\max \sum_i \sum_j p_j \cdot E\left\{Q|i,p_j,S\right\}\cdot \delta_{i,j}

s.t. \sum_j \delta_{i,j}=1,\quad \forall i

\sum_i \sum_j p_j \delta_{i,j}=S

The binary variable \delta_{i,j} \in \{0,1\} equals 1 if product i is assigned price p_j and 0 otherwise. The objective function in the optimization problem above naturally represents the revenue of the sales event. The first constraint ensures that exactly one price is assigned to each product, and the second constraint ensures that all prices sum up to S. Additional constraints, such as stock levels, can be included in the optimization problem as well.
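
For the small N and |P| typical of a single event, the problem can even be solved by enumerating price assignments grouped by their sum S ([JH14] solves it as an integer program at scale); expected_demand(i, p, S) below is a placeholder for a fitted demand model:

```python
import itertools

def optimize_event_prices(n_products, price_set, expected_demand):
    # Assign one price from price_set to each product so that event revenue
    # is maximal; expected_demand(i, p, S) plays the role of E{Q | i, p, S}.
    best, best_revenue = None, float("-inf")
    for assignment in itertools.product(price_set, repeat=n_products):
        S = sum(assignment)  # sum of all event prices, fixed for this candidate
        revenue = sum(p * expected_demand(i, p, S)
                      for i, p in enumerate(assignment))
        if revenue > best_revenue:
            best, best_revenue = assignment, revenue
    return best, best_revenue
```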

The optimization problem above requires an estimate of the demand E\left\{Q|i,p_j,S\right\}, which can be obtained using the techniques considered in the previous sections dedicated to demand prediction and price segmentation. However, one should pay close attention to the fact that stockouts are typical (and almost desirable) in a sales event, so the historical demand data is censored for many products. As suggested in [JH14], one can work around this issue by building multiple demand profiles for different categories of products using the data for items that did not run out of stock during previous events, and then using these profiles to adjust the demand curves in the corresponding categories.

Problem 6 : Category Management

Problem Statement

A retailer sells products by categories. A category represents a relatively cohesive set of products that have a lot in common (examples of categories are “desserts”, “women’s jeans” etc.), so it is generally possible that customers might be willing to substitute one product with another if the product of their choice is not available for some reason. The main reasons for product unavailability are permanent assortment reductions (e.g. because of limited shelf space) and temporary stockouts. The goal is to calculate a subset of products that meets physical constraints such as available shelf space and maximizes the gross margin by taking advantage of the substitution effect in the optimal way.

Applications

Category management is a relatively specialized task, but it deals with substitution effects that come up in other problems like promotion optimization, where the goal is to optimize the overall performance of a product category as opposed to the performance of a single product. Retailers are typically much more interested in overall category performance than in the performance of individual products, so the methods described in this section can be useful in many other applications to achieve truly optimal solutions. The model studied in this section can be directly applied to the following flavors of category management:

  • Optimization of product stock levels at a warehouse. One particularly important application is inventory management for perishable products taking into account shelf life and potential losses caused by expired products.
  • Optimization of shelf layouts to adjust relative product shares on a shelf.
  • Assortment planning (what products can be introduced to or removed from the assortment).

Solution

From an econometric perspective, the problem of category management arises from the law of diminishing returns or, more specifically, the fact that revenues and costs depend on the category size in different ways. The general tendency is that consumer buying capacity saturates at some point, while costs continue to grow because of the increasing selling area and other operational expenses:

category-mgmt

This tendency leads to the category optimization problem. It is a very challenging problem because it requires the modeling of an entire category accounting for interdependencies between the products in it. However, despite these challenges, a practically feasible assortment optimization model has been developed in [KOK07] and applied at Albert Heijn, a supermarket chain in the Netherlands. To study their approach, let us first introduce the following notation:

  • N=\{1,2,\dots,J\} – the maximal set of products in a category that a retailer offers to its customers, i.e. the full assortment.
  • f_j \in \{0,1,2,\dots\} – the stock level for product j. A retailer optimizes its assortment by choosing f_j to be zero (the product is not present in the assortment) or non-zero.
  • F_0 – the total inventory capacity measured in the same units as the stock levels. It is assumed that the sum of the stock levels of all products cannot exceed F_0. The total capacity can be constrained by the warehouse or by the available shelf space in a store.
  • N_h \subset N – the assortment in store h, a subset of the full assortment.
  • d_j – the original demand rate for product j (the number of customers who would select product j if presented with the full assortment N).
  • D_j – the observed demand rate for product j (the actual number of customers per day who selected product j because of their original intention or as a substitution). The observed demand for a given product depends on the demand for and availability of the other products because of the substitution effect, i.e. it can be thought of as the following function: D_j\left(\left\{f_1,\dots,f_J\right\},\left\{d_1,\dots,d_J\right\} \right)

Using the above notation, the assortment optimization problem can be specified as follows:

\underset{f_j}{\max}\; \sum_{j \in N} G_j \left(f_j, D_j\left(\left\{f_1,\dots,f_J\right\},\left\{d_1,\dots,d_J\right\} \right)\right) \quad (6.1)

s.t. \sum_j f_j \le F_0

where G_j is a function that describes the gross margin for a given product at the corresponding observed demand. This function heavily depends on a retailer’s business model, but we can outline a few generic templates that can be customized for practical usage:

G_j(f_j,D_j)=m_j\cdot D_j\quad (6.2)

G_j(f_j,D_j)=m_j\cdot \min(D_j,f_j)\quad (6.3)

G_j(f_j,D_j)=m_j\cdot \min(D_j,f_j) - L_j \cdot (f_j - \min(D_j,f_j))\quad (6.4)

The equation (6.2) represents the simplest way to model gross profit by multiplying the observed demand by margin m. It implicitly assumes perfect replenishment and the absence of stockouts. This might be the case for fast moving consumer goods like groceries, but other retail domains such as apparel probably should take stockouts into account using equations like (6.3). Retailers of perishable goods should also take into account the losses due to disposed inventory that are modeled in the equation (6.4) by introducing a per-unit disposal loss L. For the sake of brevity, we hereafter assume that all products are perfectly replenished, so stockouts are not possible or are negligible. It allows us to treat f_j \in \{0,1\} as a binary variable that indicates the presence of a product in the assortment. The more complex model with stockouts can be found in [KOK07].
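
A minimal sketch of these templates for a single product (m, D, f, and L as defined above; the mode names are ours):

```python
def gross_margin(m, D, f, L=0.0, mode="replenished"):
    # Templates (6.2)-(6.4): m is the unit margin, D the observed demand,
    # f the stock level, and L the per-unit loss on disposed inventory.
    sold = min(D, f)
    if mode == "replenished":          # (6.2): perfect replenishment, no stockouts
        return m * D
    if mode == "stockouts":            # (6.3): sales capped by the stock level
        return m * sold
    return m * sold - L * (f - sold)   # (6.4): perishables with disposal losses
```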

To solve the optimization problem (6.1), one needs to define the observed demand function. Under the no-stockouts assumption made above, the demand function can be modeled as follows:

D_j\left(\left\{f_1,\dots,f_J\right\},\left\{d_1,\dots,d_J\right\} \right)=d_j + \sum_{k:f_k=0}\alpha_{k \rightarrow j}\cdot d_k\quad (6.5)

where \alpha_{k\rightarrow j} is the probability of substitution of product k by product j. The formula above is relatively straightforward: the first term is the original demand and the second term corresponds to the cumulative substitution effect from all products that are evicted from the assortment set.
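
Under these assumptions, equation (6.5) and a brute-force search over problem (6.1) with the margin template (6.2) fit in a few lines of Python; the substitution probabilities are passed as a matrix alpha, which the single-parameter model introduced below reduces to a constant alpha[k, j] = delta / |N|. This sketch is workable for small categories only; [KOK07] gives efficient numerical methods for the general case:

```python
import itertools
import numpy as np

def observed_demand(f, d, alpha):
    # Equation (6.5): each stocked product j receives its original demand d[j]
    # plus the spill sum over evicted k of alpha[k, j] * d[k].
    f = np.asarray(f, dtype=bool)
    d = np.asarray(d, dtype=float)
    spill = d[~f] @ alpha[~f]  # cumulative substitution effect, one value per j
    return np.where(f, d + spill, 0.0)

def best_assortment(d, margins, alpha, capacity):
    # Exhaustive search for (6.1) with margin model (6.2) and binary stock
    # levels f_j, subject to the capacity constraint sum(f) <= F_0.
    n, best, best_g = len(d), None, -np.inf
    for f in itertools.product([0, 1], repeat=n):
        if sum(f) > capacity:
            continue
        g = float(np.dot(margins, observed_demand(f, d, alpha)))
        if g > best_g:
            best, best_g = f, g
    return best, best_g
```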

The equation (6.5) requires the estimation of the substitution probabilities \alpha_{k\rightarrow j} and original demand rates d_j. In order to do this estimation, let us assume that the following variables are known (we already discussed demand prediction in one of the previous sections of this article):

  • Q_{jh},\quad j \in N_h – the demand for product j per customer at store h. Assuming that K_h is the number of customers who visit store h during the day, D_j=K_h \cdot Q_{jh}
  • Q_{jh}^0,\quad j\in N – the demand for product j per customer at store h with a full assortment (let us assume that stores with full assortments exist). Q_{jh}^0 corresponds to the original demand since no substitution happens at stores with the full assortment.

The substitution rates \alpha_{k\rightarrow j} are challenging to estimate because up to J^2 different rates can exist for an assortment of J products. However, [KOK07] found that the following simplistic model of customer behavior is sufficiently accurate in practice and requires the estimation of just one variable instead of J^2: if product k is not available, the customer either selects his or her second-choice product j as a substitute with probability \delta, which is the same for all products in a category, or no purchase takes place with probability (1-\delta). This model leads to the following simple equation for the substitution rate:

\alpha_{k\rightarrow j}=\delta \frac{1}{|N|}\quad (6.6)

In order to estimate \delta, let us define the total demand at a given store as a sum of Q_{jh} values that can be estimated from the data:

\mathbb{Q}_h \stackrel{def}{=} \sum_{j\in N_h} Q_{jh}\quad (6.7)

On the other hand, the same value can be estimated according to the expression (6.5) as follows:

\hat{\mathbb{Q}}_h(\delta)=\sum_{j\in N_h} \left(Q_{jh}^0 + \sum_{k\in N\setminus N_h}\alpha_{k\rightarrow j}Q_{kh}^0\right)=\sum_{j\in N_h}Q_{jh}^0 + \sum_{j\in N_h} \sum_{k \in N\setminus N_h} \frac{\delta}{|N|} Q_{kh}^0\quad (6.8)

Now \delta can be estimated by solving the following optimization problem that minimizes the discrepancy between the observed and predicted values of the total demand:

\delta_0\stackrel{def}{=} \underset{0\le\delta\le 1}{\mathrm{argmin}}\sum_h \left(\hat{\mathbb{Q}}_h(\delta)-\mathbb{Q}_h\right)^2\quad (6.9)
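
Since (6.8) is linear in \delta, the fit (6.9) is a one-dimensional bounded least-squares problem; a sketch assuming the per-store sums of the full-assortment rates have been precomputed:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_delta(Q_obs, Q0_in, Q0_out, n_carried, n_full):
    # Arrays indexed by store h: Q_obs is the observed total demand (6.7);
    # Q0_in / Q0_out sum the full-assortment rates Q0_jh over carried /
    # dropped products; n_carried is |N_h| and n_full is |N|.
    def sse(delta):
        Q_hat = Q0_in + delta * (n_carried / n_full) * Q0_out  # equation (6.8)
        return np.sum((Q_hat - Q_obs) ** 2)                    # objective (6.9)
    return minimize_scalar(sse, bounds=(0.0, 1.0), method="bounded").x
```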

The next step in solving the optimization problem (6.1) is to compute the original demand rates that are used in the equation (6.5). We first note that the total demand for all products in N at store h can be computed as follows:

T_h=V_h\cdot \sum_{j\in N} Q_{jh}^0 \cdot \frac{\mathbb{Q}_h}{\hat{\mathbb{Q}}_h(\delta_0)}\quad (6.10)

where V_h is the total number of customers visiting store h per day. In the equation (6.10), the sum of all Q_{jh}^0 multiplied by V_h represents the total demand given a full assortment. However, the values Q_{jh}^0 are estimated for stores with a full assortment, so specifics of the given store h (e.g. location, store size in square feet, etc.) are not modeled. This is compensated for by scaling by the ratio of estimated category demand from equation (6.7) to the predicted demand from equation (6.8).

In a store with a restricted assortment, the total demand T_h is the sum of two components: the demand that comes from the products included in the assortment of a given store and the demand for other products in N. The ratio between these two components can be expressed via Q_{jh}^0 as follows:

r_h\stackrel{def}{=}\left(\sum_{j\in N_h}Q_{jh}^0\right)/\left(\sum_{j\in N} Q_{jh}^0\right)\quad(6.11)

Consequently, T_h\cdot r_h represents the fraction of the demand attributed to the products in the assortment, and T_h \cdot (1-r_h) represents the remaining fraction attributed to the products that are not in the assortment. Finally, we compute the demand for a single product as a fraction of the total demand proportional to the estimated per-product demand:

d_{jh}=\begin{cases}T_h\cdot r_h\cdot\frac{Q_{jh}}{\sum_{j\in N_h} Q_{jh}},\quad & \mbox{if } j\in N_h\\T_h\cdot (1-r_h)\cdot\frac{Q_{jh}^0}{\sum_{j\in N\setminus N_h} Q_{jh}^0},\quad & \mbox{if } j \notin N_h\end{cases}\quad (6.12)

All coefficients in equations (6.12) and (6.9) can be estimated from the data, so we can roll up all formulas to the original optimization problem (6.1) that can be solved using numerical methods proposed in [KOK07].
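
Equations (6.10)-(6.12) are mechanical to apply once \delta_0 is known; a sketch for a single store, assuming the store drops at least one product from the full assortment:

```python
import numpy as np

def original_demand_rates(V_h, Q, Q0, carried, Q_obs, Q_hat):
    # Per-product arrays over the full assortment N for one store h:
    # Q  - per-customer demand, available for carried products only;
    # Q0 - per-customer demand under the full assortment;
    # carried - boolean mask for N_h; Q_obs, Q_hat - totals (6.7) and (6.8).
    T_h = V_h * Q0.sum() * (Q_obs / Q_hat)  # total demand, equation (6.10)
    r_h = Q0[carried].sum() / Q0.sum()      # in-assortment share, equation (6.11)
    d = np.empty_like(Q0)
    d[carried] = T_h * r_h * Q[carried] / Q[carried].sum()
    d[~carried] = T_h * (1 - r_h) * Q0[~carried] / Q0[~carried].sum()  # (6.12)
    return d
```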

Solving the optimization problem (6.1) will produce a set of presumably optimal stock levels f_j for all products. These levels can be used to adjust inventory and optimize shelf layout. It is important to note that the model enables a retailer to perform what-if analysis to evaluate how changes in assortment and stock levels might impact the gross margin. In particular, a retailer can plot curves that show the expected gross margin as a function of the stock level for a given product or a group of products. Such curves are especially descriptive for perishable products because the gross margin is a concave function that is zero when the stock level is zero, drops back to zero when the stock level is too high (causing losses from expired products), and has a maximum in between these two extremes.

Financial Impact

Our overview of optimization methods and the corresponding data problems would be incomplete without data about the financial performance of the discussed methods. Although such data is available, it should be considered with caution because of its specific dependency on a retailer’s business model and the fact that we cannot isolate the impact of the optimization from other environmental factors such as market growth or competitors’ moves. Besides that, the numbers can vary greatly depending on many factors, so our goal here is just to provide a few benchmarks that give a sense of the magnitude of potential improvements:

  • Response modeling is extensively used throughout marketing, from retail to presidential campaigns [EP13]. It is often reported that response modeling can increase the profitability of a campaign by 20-30% compared to random targeting, and uplift modeling can play an important role by adding a substantial performance boost of about 15% or by making profits in challenging cases where alternative approaches do not work [PS08].
  • The sales event optimization suggested in [JH14] was thoroughly evaluated in practice at RueLaLa with the general conclusion that comprehensive optimization models can outperform previously used heuristics by about 10% in terms of revenue increase. The event optimization model described in [CA12] and tested by Zara claims to provide about a 5.8% increase in revenues.
  • The category management framework has been evaluated at Albert Heijn to optimize the assortment in 25 subcategories at 37 stores. It was found that among these 25×37=1295 cases, 701 were suboptimal, and optimization could increase the gross margin by about 6.2% compared to the traditional methods used by the company [KOK07].

Finally, it is worth noting that most of the described optimization methods do not significantly impact a retailer’s costs, so the increase in revenues is likely to contribute directly to net profits.

Conclusions

In the previous sections we reviewed a number of econometric problems relevant to retail, described their applications and use cases, and outlined the data analysis tasks and optimization models that can be used to solve them. In this final section we connect the dots between the discussed models and provide general conclusions to capture the whole picture.

Connecting the Dots

The major goal of this article is to sketch a decision automation framework that completely relies on data mining and numerical optimization under the hood. Hence, it is reasonable to visualize this framework as a pipeline that consumes data and produces executable actions and decisions. Reviewing the solutions studied in the previous sections, we can also conclude that this pipeline has several internal stages or tiers.

First, we can put data exploration and knowledge discovery processes into a separate tier that uses mainly unsupervised learning algorithms and relies significantly on a human factor to evaluate data mining results such as customer clusters or frequently purchased item sets. Although these processes are highly important in practice, their ability to integrate with automated optimization is limited because discovered patterns typically require manual post-processing and are more useful for strategic decisions rather than incremental optimizations. Outputs of this tier can be used to configure downstream processes. For instance, a newly discovered customer cluster can be used to define a new propensity model or to introduce and optimize a special discount.

The next two tiers relate to modeling and optimization, respectively. Broadly speaking, the fundamental goal of the modeling tier is to provide a comprehensive model of a consumer that quantitatively describes his or her price sensitivity, propensity to respond to offers and discounts, willingness to substitute one product with another or to accept a recommendation, etc. It is extremely hard to build such a model in practice, so multiple specialized models for different applications are used instead. However, it is critical to note that this imaginary consumer model underlies all types of optimization, hence acquiring comprehensive data about all aspects of customer behavior is crucial. The main challenge of the optimization tier is the joint optimization of multiple objectives. Joint optimization represents a serious computational challenge and, most importantly, is constrained by the capabilities of the underlying predictive model, so almost all optimization techniques deal with no more than one or two objectives.

We put together these tiers in the figure below. There are many possible dependencies and interactions between the components, so we show just one sample flow related to response modeling to prevent cluttering the diagram.

connection-the-dots

Importance of Pricing

Among this diversity of problems and goals, we should emphasize the importance of pricing decisions and all the optimizations directly or indirectly related to pricing. Let us consider a classic example that illustrates the importance of pricing decisions. Recall the basic equation for enterprise profit:

G=Q\cdot(P-V)-C

where Q is quantity sold, P is price, V denotes variable costs, and C denotes fixed costs. Consider an imaginary apparel retailer that sells 100,000 garments monthly at $40 per item, assuming a wholesale price of $25 per item and fixed costs of $500,000 per month. Let us calculate how a one percent change in sales volume, price, variable and fixed costs will impact profit:

pricing-table
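
The table above is an image, but the underlying arithmetic is easy to reproduce (a minimal Python check of the four scenarios):

```python
def profit(Q=100_000, P=40.0, V=25.0, C=500_000.0):
    # G = Q * (P - V) - C; the baseline profit is $1,000,000 per month.
    return Q * (P - V) - C

base = profit()
print((profit(P=40.4) - base) / base)     # +1% price          -> +4.0% profit
print((profit(Q=101_000) - base) / base)  # +1% sales volume   -> +1.5%
print((profit(V=24.75) - base) / base)    # -1% variable costs -> +2.5%
print((profit(C=495_000) - base) / base)  # -1% fixed costs    -> +0.5%
```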

In this example, one can see that pricing impacts the profits much more seriously than any other variable. Although it is an oversimplified and arbitrary example, this pattern prevails in a huge variety of enterprises across many industries. This leads us to the conclusion that retailers should pay special attention to the optimization methods related to pricing (discounts, personalized prices, dynamic pricing etc.) and the supporting data mining processes.

We also note that omni-channel retail can deliver new opportunities for automated price optimization. Since price discrimination is one of the most powerful pricing techniques, the ideal environment for price optimization is an environment where each customer is provided with a personalized price, explicitly or implicitly by means of discounts, and all these prices can be adjusted dynamically. Digital channels provide exactly these conditions where each customer has his or her own isolated and dynamic view of a retailer.

Importance of Implicit Dimensions

As we already mentioned, most optimization problems in retail are internally dependent on customer behavior models. The ability to build such models at the level of individual customers is one of the most important benefits of data mining and a key enabler of one-to-one marketing. The most sophisticated examples of customer modeling can be found in recommender systems, which often use the concept of implicit dimensions to capture psychographic features of customers and products. This concept is so fundamental that it probably goes far beyond recommender systems although, to the best of our knowledge, it is not as widely used in other applications as one might expect. This leads us to the conclusion that integrated optimization systems can benefit from adopting state-of-the-art techniques from the well-developed recommendations domain in less common applications.

Outlook

The problem of completely automated decision making in the retail environment is extremely ambitious. It can even be argued that it is almost impossible to measure the performance of optimization methods in practice because the observed improvements can coincide with market trends, competitors’ actions, changes in customer tastes, and myriad other factors. This problem, referred to as the endogeneity problem in economic texts, represents a huge challenge for developers and adopters of data-driven optimization techniques and compromises even seemingly successful attempts. However, during the last decade major retailers have been looking for integrated solutions that combine data mining with numerical optimization. Such advanced systems are naturally the next step in the evolution of enterprise data management, following the wide adoption of data warehousing and data science.

References

  • [AG13] Retail Supply Chain Management, Narendra Agrawal and Stephen A. Smith, 2009
  • [AS14] A Practical Guide to Data Mining for Business and Industry, A. Ahlemeyer-Stubbe and S. Coleman, 2014
  • [BE09] Differential Response or Uplift Modeling, M. Berry, 2009
  • [BE11] Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, G. Linoff and M. J. A. Berry, 2011
  • [BG92] Brand choice, purchase incidence, and segmentation: An integrated modeling approach, R. Bucklin and S. Gupta, 1992
  • [BR98] Empirical Analysis of Predictive Algorithms for Collaborative Filtering, J. S. Breese, D. Heckerman, and C. Kadie, 1998
  • [CA12] Clearance Pricing Optimization for a Fast-Fashion Retailer, F. Caro, J. Gallien, 2012
  • [CN10] Market-Share Analysis, L.G. Cooper, M. Nakanishi, 2010
  • [CU13] Retail Channel Price Discrimination, S. Cuellar, M. Brunamonti, 2013
  • [DG12] How Companies Learn Your Secrets, C. Duhigg, 2012
  • [DR90] Indexing by Latent Semantic Analysis, S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, 1990
  • [EP13] How uplift modeling helped Obama’s campaign – and can aid marketers, E. Preslar, 2013
  • [FAD74] Conditional logit analysis of qualitative choice behavior, D. McFadden, 1974
  • [FX06] Predicting Retail Customers’ Share-of-Wallet Using Shopper Loyalty Card Data, E. Fox and J. Thomas, 2006
  • [GH02] Using Text Mining to Infer Semantic Attributes for Retail Data Mining, Rayid Ghani and Andrew E. Fano, 2002
  • [JH14] Analytics for an Online Retailer: Demand Forecasting and Price Optimization. K. Johnson, B.H.A. Lee, D. Simchi-Levi, 2014
  • [JK98] A Microeconomic View of Data Mining, J. Kleinberg, C. Papadimitriou, P. Raghavan, 1998
  • [JL11] Dynamic Price Optimisation for The Retail Market, J. Lippert, 2011
  • [JT13] To Whom, When, and How Much to Discount? A Constrained Optimization of Customized Temporal Discounts, J. Johnson, G. Tellis, E. Ip, 2013
  • [JW10] Optimizing multiple objectives in collaborative filtering, T. Jambor and J. Wang, 2010
  • [JZ10] Recommender Systems: An Introduction, D. Jannach, M. Zanker, A. Felfernig, G. Friedrich, 2010
  • [KJ05] An Empirical Analysis of Price Discrimination Mechanisms and Retailer Profitability, R. Khan and D. Jain, 2005
  • [KOK07] Demand Estimation and Assortment Optimization Under Substitution: Methodology and Application, A. J. Kok, M. Fisher, 2007
  • [LE13] The Definitive Guide to Predictive Analytics Models for Marketing, D. Levin, 2013
  • [MA04] Collaborative Filtering: A Machine Learning Perspective, B. Marlin, 2004
  • [MA08] Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze, 2008
  • [PR13] Data Science for Business: What you need to know about data mining and data-analytic thinking, F. Provost, T. Fawcett, 2013
  • [PS08] Optimal Targeting through Uplift Modeling, Portrait Software, 2008 [http://www.crmxchange.com/uploadedFiles/White_Papers/PDF/Optimal_Targeting_with_Uplift_Modeling_white_paper.pdf]
  • [PY99] Data Preparation for Data Mining, D. Pyle, 1999
  • [PZ07] Content-based Recommendation Systems, M. Pazzani, D. Billsus, 2007
  • [RE03] A SAS Market Basket Analysis Macro: The “Poor Man’s Recommendation Engine”, M. Redlon, E. Prairie, 2003
  • [RE94] Grouplens: an open architecture for collaborative filtering of netnews, P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, 1994
  • [RP12] Multiple Objective Optimization in Recommender Systems, M. Rodriguez, C. Posse, E. Zhang, 2012
  • [RR10] Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, P. B. Kantor, 2010
  • [RU00] Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management, O. Rud, 2000
  • [SA00] Application of Dimensionality Reduction in Recommender System – A Case Study, B. M. Sarwar, G. Karypis, J. A. Konstan, J. T. Riedl, 2000
  • [SA01] Item-Based Collaborative Filtering Recommendation Algorithms, B. Sarwar, G. Karypis, J. Konstan, and John Riedl, 2001
  • [SB09] The Numerati, Stephen Baker, 2009
  • [SG09] Propensity Score Analysis: Statistical Methods and Applications, S. Guo, M. Fraser, 2009
  • [SM11] Pricing Strategy: Setting Price Levels, Managing Price Discounts and Establishing Price Structures, T. Smith, 2011
  • [SU09] A Survey of Collaborative Filtering Techniques, X. Su, T. Khoshgoftaar, 2009
  • [TA05] The Theory and Practice of Revenue Management, K. Talluri, G. J. van Ryzin, 2005
  • [VL02] The True Lift Model – A Novel Data Mining Approach to Response Modeling in Database Marketing, V. Lo, 2002
  • [WE07] Evolving Classifiers – Evolutionary Algorithms in Data Mining, T. Weise, S. Achler, M. Gob, C. Voigtmann, and M. Zapf, 2007
  • [YK08] Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Y. Koren, 2008
  • [YK09] Matrix Factorization Techniques for Recommender Systems, Y. Koren, R. Bell, C. Volinsky, 2009

