Sunday, October 26, 2014

Combating Multicollinearity by Asking the Right Questions and Uncovering Latent Features

Overview. When responding to questions about brand perceptions or product feature satisfaction, consumers construct a rating by relying on their overall satisfaction with the brand or product plus some general category knowledge of how difficult it is to deliver each feature. In order to get past such halo effects, we need to ask questions that require respondents to relive their prior experiences and access memories of actual occurrences. Then, we must find a statistical model to analyze the high-dimensional and sparse data matrix produced when so many detailed probes return no, never, or none responses. The R package NMF (nonnegative matrix factorization) provides a convenient entry into analysis of latent features for those familiar with factor analysis and mixture models.

Revisiting Stated versus Derived Importance

It has become common practice in survey research to ask participants to act as self-informants. After all, who knows more about the reasons for your behavior than yourself? So why not simply ask why or some variation of that question and be done with it? For example, exit polling wants to know how you voted and then the reasons for your vote. The same motivation drives consumer researchers who are not satisfied with purchase intent alone, so they drill down into the causes with direct questions, either open-ended or lists of possible reasons.

All is well as long as the respondent is able and willing to provide a response that can be used to improve the marketing of products or candidates. Unfortunately, "know thyself" is no easier for us than it was for the ancient Greeks. The introspection illusion has been well documented. Simply put, we feel that those reasons that are easiest to provide when asked why must be the motivations for our behavior. The response to the exit poll may be nothing more than a playback of something previously heard or read. Yet, it is so easy to repeat that it must be the true reason. The questions almost write themselves, and the responses are tabulated and tracked over time without much effort at all. You have seen the headlines: "Top 10 Reasons for This or That" or "More Doing This for That Reason." So, what is the alternative?

Marketing research faced a similar situation with the debate over stated versus derived importance. Stated importance, as you might have inferred from the name, is a self-report by the respondent concerning the contribution of a feature or benefit to a purchase decision. The wording is typically nonspecific, such as "how important is price" without any actual pricing information. Respondents supply their own contexts, presumably derived from variations in price that they commonly experience, so that in the end we have no idea of the price range they are considering. Regrettably, findings do not generalize well, for what is not important in the abstract can become very important in the marketplace. The devil is in the details, and actual buying and selling is filled with details.

Derived importance, on the other hand, is the result of a statistical analysis. The experimental version is conjoint analysis or choice modeling. By systematically varying the product description, one estimates the impact of manipulating each attribute or feature. With observational data, one must rely on natural variation and perform a regression analysis predicting purchase intent for a specific brand from other ratings of the same brand.

In both cases we are looking for leverage, specifically, a regression coefficient derived from regressing purchase interest on feature levels or feature ratings. If the goal is predicting how consumers will respond to changing product features, then conjoint seems to be the winner once you are satisfied that the entire process is not so intrusive that the results cannot be generalized to the market. Yet, varying attributes in an experiment focuses the consumer's attention on aspects that would not be noticed in the marketplace. In the end, the need for multiple ratings or choices from each respondent can create rather than measure demand.

On the other hand, causal inferences are not possible from observation alone. All we know from the regression analysis are comparisons of the perceptual rating patterns of consumers with different purchase intent. We do not know the directionality or if we have a feedback loop. Do we change the features to impact the perceptions in order to increase purchase intent? Or, do we encourage purchase by discounting price or adding incentives so that product trial will alter perceptions? Both of these approaches might be successful if the associative relationship between perception and intent results from a homeostatic process of mutual feedback and reinforcement.

Generalized Perceptions Contaminated with Overall Satisfaction

Many believe that "good value for the money" is a product perception and not another measure of purchase intent. Those that see value as a feature interpret its high correlation with likelihood to buy as an indication of its derived importance. Although it is possible to think of situations where one is forced to repurchase a product that is not a good value for the money, in general, both items are measuring the same underlying positive affect toward the product. Specifically, the memories that are retrieved to answer the purchase question are the same memories that are retrieved to respond to the values inquiry. Most of what we call "perceptions" are not concrete features or services asked about within a specific usage context that tap different memories. Consequently, we tend to see predictors in the regression equation with substantial multicollinearity from halo effects because we only ask our respondents to recall the "gist" of their interactions and not the details.

Our goal is to collect informative survey data that measures more than a single approach-avoidance evaluative dimension (semantic memory). The multicollinearity among our predictors that continually plagues our regression analyses stems from the lack of specificity in our rating items. Questions that probe episodic memories of features or services used or provided will reduce the halo effect. Unfortunately, specificity creates its own set of problems trying to analyze high-dimensional and sparse data. Different needs generate diverse usage experiences resulting in substantial consumer heterogeneity. Moreover, the infrequently occurring event or the seldom used feature can have a major impact and must be included in order for remedial action to be taken. Some type of regularization is one approach (e.g., the R package glmnet), but I prefer an alternative that attempts to reduce the large number of questions to a smaller set of interpretable latent features.
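To make the halo argument concrete, here is a small simulation sketch in base R. Everything in it is made up for illustration: the variable names (halo, ratings, intent), the sample size, and the noise levels are my assumptions, not estimates from any survey. Five "perception" ratings all borrow from a single overall-satisfaction score, and the variance inflation factor for each rating is computed by hand as 1/(1 - R^2) from regressing that rating on the others.

```r
# Simulated illustration only (not real survey data): five ratings that
# all draw on one overall-satisfaction "halo" plus a little item noise.
set.seed(42)
n <- 500
halo <- rnorm(n)                          # hypothetical overall satisfaction
ratings <- sapply(1:5, function(j) halo + rnorm(n, sd = 0.5))
colnames(ratings) <- paste0("item", 1:5)
intent <- halo + rnorm(n, sd = 0.5)       # purchase intent fed by the same halo

# Correlations among the predictors are uniformly high
round(cor(ratings), 2)

# Variance inflation factor for each predictor: 1 / (1 - R^2), where R^2
# comes from regressing that predictor on all of the other predictors
vif <- sapply(colnames(ratings), function(v) {
  others <- setdiff(colnames(ratings), v)
  r2 <- summary(lm(ratings[, v] ~ ratings[, others]))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# Regression of intent on the five collinear ratings
round(summary(lm(intent ~ ratings))$coefficients, 3)
```

With this setup no single item has any special leverage, so whatever differences appear among the five regression coefficients are noise, which is the risk when "derived importance" is read off ratings that all tap the same gist.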

An Example to Make the Discussion Less Abstract

If we were hired by a cable provider to assess customer satisfaction, we might start with recognizing that not everyone subscribes to all the services offered (e.g., TV, internet, phone and security). Moreover, usage is also likely to make a difference in their satisfaction judgments, varying by the ages and interests of household members. This is what is meant by consumers residing in separate subspaces, for parents who use their security system to monitor their children when they are at work have very different experiences from a retired couple without internet access. Do I need to mention teens in the family? Now, I will ask you to list all the good and bad experiences that a customer might have using all possible services provided by the cable company. It is a long list, but probably not any longer than a comprehensive medical inventory. The space formed by all these items is high-dimensional and sparse.

This is a small section from our data matrix with every customer surveyed as a row and experiences that can be probed and reliably remembered as the columns. The numbers are measures of intensity, such as counts or ratings. The last two respondents did not have any interaction with the six features represented by these columns. The entire data matrix is just more of the same with large patches of zeros indicating that individuals with limited interactions will repeatedly respond no, never, or none.


In practice, we tend to compromise since we are seeking only actionable experiences that are frequent or important enough to make a difference and that can be remediated. Yet, even given such restrictions, we are still tapping episodic or autobiographical memories that are relatively free of excessive halo effects because the respondent must "relive" the experience in order to provide a response.

Our data matrix is not random but reflects an underlying dynamics that creates blocks of related rows and columns. In order to simplify this discussion we can restrict ourselves to feature usage. For example, sports fans must watch live events in high definition. One's fanaticism is measured by the breadth and frequency of events watched. It is easy to imagine a block in our data matrix with sports fans as the rows, sporting events as the columns and frequency as the cell entries. Kids in the household along with children's programming generate another block, and so on. To be clear, we co-cluster or bicluster the rows and columns simultaneously, for it is their interaction that creates clusters.
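The block structure just described can be sketched with a toy matrix in base R. The row labels, column labels, and the Poisson rate are hypothetical choices of mine; the only point is to show what "blocks of related rows and columns" surrounded by zeros looks like.

```r
# Toy usage matrix (hypothetical labels and rates, for illustration only)
set.seed(1)
n_per <- 4   # respondents per household type
usage <- matrix(0, nrow = 3 * n_per, ncol = 6,
                dimnames = list(paste0("r", 1:(3 * n_per)),
                                c("sports1", "sports2", "sports3",
                                  "kids1", "kids2", "kids3")))

# Sports fans (rows 1-4) watch only sporting events
usage[1:4, 1:3] <- rpois(12, lambda = 4)
# Households with kids (rows 5-8) watch only children's programming
usage[5:8, 4:6] <- rpois(12, lambda = 4)
# Rows 9-12 never interact with any of these six features

usage
mean(usage == 0)   # share of cells that are zero
```

Only the cells inside the two blocks can be nonzero, so at least two-thirds of this small matrix is empty, and the proportion of zeros grows quickly as more specialized features are added as columns.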

The underlying dynamics responsible for the co-clustering of the rows and the columns can be called a latent feature. It is latent because it is not directly observed, and like factor analysis, we will name the latent construct using coefficients or loadings reflecting its relationships to the observed columns. "Feature" was chosen due to the sparsity of the coefficients with only a few sizeable values and the remaining close to zero. As a result, we tend to speak of co-clustering rows and columns so that "latent feature" seems more appropriate than latent variable.

You can find an example analysis of a feature usage inventory using the R package NMF in a previous post. In addition, all the R code needed to run such an analysis can be found in a separate post. In fact, much of my writing over the last several months has focused on NMF, so you may wish to browse. There are other alternatives for biclustering in R, but nonnegative matrix factorization is such an easy transition from principal component analysis and mixture modeling that most should have little trouble performing and interpreting the analysis.
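For readers who want to see the factorization itself before turning to the package, the following is a minimal sketch of nonnegative matrix factorization in base R using the multiplicative updates of Lee and Seung for squared error. It is a stand-in written for this illustration, not the NMF package's implementation; the function name nmf_sketch, the toy data, and the fixed iteration count are all my assumptions.

```r
# Minimal multiplicative-update NMF (squared-error version), base R only.
# V is approximated by W %*% H with every entry of W and H nonnegative.
nmf_sketch <- function(V, rank, iters = 500, eps = 1e-9) {
  W <- matrix(runif(nrow(V) * rank), nrow(V), rank)   # respondent weights
  H <- matrix(runif(rank * ncol(V)), rank, ncol(V))   # feature "loadings"
  for (i in seq_len(iters)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

set.seed(123)
# Toy data: 10 sports-fan rows and 10 kids-household rows, 6 features
V <- matrix(0, 20, 6,
            dimnames = list(NULL, c("sports1", "sports2", "sports3",
                                    "kids1", "kids2", "kids3")))
V[1:10, 1:3]  <- rpois(30, 5) + 1
V[11:20, 4:6] <- rpois(30, 5) + 1

fit <- nmf_sketch(V, rank = 2)
colnames(fit$H) <- colnames(V)
round(fit$H, 2)   # which columns define each latent feature
round(fit$W, 2)   # how much of each latent feature every row uses

# Relative reconstruction error of the rank-2 approximation
err <- sqrt(sum((V - fit$W %*% fit$H)^2)) / sqrt(sum(V^2))
err
```

The rows of H play the role of factor loadings and the rows of W play the role of mixture weights, which is why the transition from factor analysis and mixture modeling is a short one. In practice the NMF package adds multiple random starts, alternative algorithms, and the heatmap displays.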

Tuesday, October 21, 2014

Modeling Plenitude and Speciation by Jointly Segmenting Consumers and their Preferences

In 1993, when music was sold in retail stores, it may have been informative to ask about preference across a handful of music genre. Today, now that the consumer has seized control and the music industry has responded, the market has exploded into more than a thousand different fragmented pairings of artists and their audiences. Grant McCracken, the cultural anthropologist, refers to such proliferation as speciation and the resulting commotion as plenitude. As with movies, genre become microgenre forcing recommender systems to deal with more choices and narrower segments.

This mapping from the website Every Noise at Once is constantly changing. As the website explains, there is a generating algorithm with some additional adjustments in order to make it all readable, and it all seems to work as an enjoyable learning interface. One clicks on the label to play a music sample. Then, you can continue to a list of artists associated with the category and hear additional samples from each artist. Although the map seems to have interpretable dimensions and reflects similarity among the microgenre, it does not appear to be a statistical model in its present form.

At any given point in time, we are stepping into a dynamic process of artists searching for differentiation and social media seeking to create new communities who share at least some common preferences. Word of mouth is most effective when consumers expect new entries and when spreading the word is its own reward. It is no longer enough for a brand to have a good story if customers do not enjoy telling that story to others. Clearly, this process is common to all product categories even if they span a much smaller scale. Thus, we are looking for a scalable statistical model that captures the dynamics through which buyers and sellers come to a common understanding. 

Borrowing a form of matrix factorization from recommender systems, I have argued in previous posts for implementing this kind of joint clustering of the rows and columns of a data matrix as a replacement for traditional forms of market segmentation. We can try it with a music preference dataset from the R package prefmod. Since I intend to compare my finding with another analysis of the same 1993 music preference data using the new R package RCA and reported in the American Journal of Sociology, we will begin by duplicating the few data modifications that were made in that paper (see the R code at the end of this post). 

In previous attempts to account for music preferences, psychologists have focused on the individual and turned to personality theory for an explanation. For the sociologist, there is always the social network. As marketing researchers, we will add the invisible hand of the market. What is available? How do consumers learn about the product category and obtain recommendations? Where is it purchased? When and where is it consumed? Are others involved (public vs private consumption)?

The Internet opens new purchase pathways, encourages new entities, increases choice and transfers control to the consumer. The resulting postmodern market with its plenitude of products, services, and features cannot be contained within a handful of segments. Speciation and micro-segmentation demand a model that reflects the joint evolution where new products and features are introduced to meet the needs of specific audiences and consumers organize their attention around those microgenre. Nonnegative matrix factorization (NMF) represents this process with a single set of latent variables describing both the rows and the columns at the same time.

After attaching the music dataset, NMF will produce a cluster heatmap summarizing the "loadings" of the 17 music genre (columns below) on the five latent features (rows below): Blues/Jazz, Heavy Metal/Rap, Country/Bluegrass, Opera/Classical, and Rock. The dendrogram at the top displays the results of a hierarchical clustering. Although there are five latent features, we could use the dendrogram to extract more than five music genre clusters. For example, Big Band and Folk music seem to be grouped together, possibly as a link from classical to country. In addition, Gospel may play a unique role linking country and Blues/Jazz. Whatever we observe in the columns will need to be verified by examining the rows. That is, one might expect to find a segment drawn to country and jazz/blues who also like gospel.


We would have seen more of the lighter colors with coefficients closer to zero had we found greater separation. Yet, this is not unexpected given the coarseness of music genre. As we get more specific, the columns become increasingly separated by consumers who only listen to or are aware of a subset of the available alternatives. These finer distinctions define today's market for just about everything. In addition, the use of a liking scale forces us to recode missing values to a neutral liking. We would have preferred an intensity scale with missing values coded as zeros because they indicate no interaction with the genre. Recoding missing to zero is not an issue when zero is the value given to "never heard of" or unaware.

Now, a joint segmentation means that listeners in the rows can be profiled using the same latent features accounting for covariation among the columns. Based on the above coefficient map, we expect those who like opera to also like classical music so that we do not require two separate scores for opera and classical but only one latent feature score. At least this is what we found with this data matrix. A second heatmap enables us to take a closer look at over 1500 respondents at the same time.


We already know how to interpret this heatmap because we have had practice with the coefficients. These colors indicate the values of the mixing weights for each respondent. Thus, in the middle of the heatmap you can find a dark red rectangle for latent feature #3, which we have already determined to represent country/bluegrass. These individuals give the lowest possible rating to everything except for the genre loading on this latent feature. We do not observe that much yellow or lighter colors in this heatmap because less than 13% of the responses fell into the lowest box labeled "dislike very much." However, most of the lighter regions are where you might expect them to be, for example, heavy metal/rap (#2), although we do uncover a heavy metal segment at the bottom of the figure.

Measuring Attraction and Ignoring Repulsion

We often think of liking as a bipolar scale, although what determines attraction can be different from what drives repulsion. Music is one of those product categories where satisfiers and dissatisfiers tend to be different. Negative responses can become extreme so that preference is defined by what one dislikes rather than what one likes. In fact, it is being forced to listen to music that we do not like that may be responsible for the lowest scores (e.g., being dragged to the opera or loud music from a nearby car). So, what would we find if we collapsed the bottom three categories and measured only attraction on a 3-point scale with 0=neutral, dislike or dislike very much, 1=like, and 2=like very much?
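The recoding is simple enough to show on a handful of made-up responses. This sketch assumes the original coding runs from 1 = like very much to 5 = dislike very much, which is what the scale reversal in the R code at the end of this post implies; the vector raw is hypothetical.

```r
# Hypothetical responses on the original 5-point scale
# (assumed coding: 1 = like very much ... 5 = dislike very much)
raw <- c(1, 2, 3, 4, 5, NA)

# Missing goes to the neutral midpoint, then the scale is reversed
# so that 0 = dislike very much and 4 = like very much
liking <- 5 - ifelse(is.na(raw), 3, raw)

# Collapse the bottom three boxes: 0 = neutral or any dislike,
# 1 = like, 2 = like very much
attraction <- pmax(liking - 2, 0)

data.frame(raw, liking, attraction)
```

Every dislike, neutral, and missing response lands on zero, so the recoded matrix contains many more zeros than the original.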

NMF thrives on sparsity, so increasing the number of zeros in the data matrix does not stress the computational algorithm. Indeed, the latent features become more separated as we can see in the coefficient heatmap. Gospel stands alone as its own latent feature. Country and bluegrass remain, as does opera/classical, blues/jazz, and rock. When we "remove" dislike for heavy metal and rap, heavy metal moves into rock and rap floats with reggae between jazz and rock. The same is true for folk and easy mood music, only now both are attractive to country and classical listeners.

More importantly, we can now interpret the mixture weights for individual respondents as additive attractors so that the first few rows are those with interest in all the musical genre. In addition, we can easily identify listeners with specific interests. As we continue to work our way down the heatmap, we find jazz/blues(#4), followed by rock(#5) and a combination of jazz and rock. Continuing, we see country(#2) plus rock and country alone, after which is a variety of gospel (#1) plus some other genre. We end with opera and classical music, by itself and in combination with jazz.

Comparison with the Cultural Omnivore Hypothesis

As mentioned earlier, we can compare our findings to a published study testing whether inclusiveness rules tastes in music (the eclectic omnivore) or whether cultural distinctions between highbrow and lowbrow still dominate. Interestingly, the cluster analysis is approached as a graph-partitioning problem where the affinity matrix is defined as similarity in the score pattern regardless of mean level. Not everyone agrees with this calculation, and we have a pair of dueling R packages using different definitions of similarity (the RCA vs. the CCA).

None of this is news for those of us who perform cluster analysis using the affinity propagation R package apcluster, which enables several different similarity measures including correlations (signed and unsigned). If you wish to learn more, I would suggest starting with the Orange County R User webinar for apcluster. The quality and breadth of the documentation will flatten your learning curve.

Both of the dueling R packages argue that preference similarity ought to be defined by the highs and lows in the score profiles ignoring the mean ratings for different individuals. This is a problem for marketing since consumers who do not like anything ought to be treated differently from consumers who like everything. One is a prime target and the other is probably not much of a user at all.
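A two-respondent example shows what is lost when mean level is ignored. I use the ordinary Pearson correlation here as a simple stand-in for pattern similarity; it is not the measure computed by either of the two packages, and the ratings are invented.

```r
# Two hypothetical respondents with the same highs and lows
# but very different mean ratings (larger = more liking)
likes_everything <- c(opera = 5, country = 4, rock = 5, rap = 4)
likes_nothing    <- likes_everything - 3

# Pattern similarity that ignores level: the profiles look identical
pattern_sim <- cor(likes_everything, likes_nothing)
pattern_sim

# A level-sensitive measure keeps them apart
level_dist <- sqrt(sum((likes_everything - likes_nothing)^2))
level_dist

# The difference a marketer cares about
c(mean(likes_everything), mean(likes_nothing))
```

Any similarity defined only on the shape of the profile will place these two respondents in the same cluster, even though one rates every genre at the top of the scale and the other rates every genre near the bottom.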

Actually, if I were interested in testing the cultural omnivore hypothesis, I would be better served by collecting familiarity data on a broader range of more specific music genre, perhaps not as detailed as the above map but more revealing than the current broad categories. The earliest signs of preference can be seen in what draws our attention. Recognition tends to be a less obtrusive measure than preference, and we can learn a great deal knowing who visits each region in the music genre map and how long they stayed.

NMF identifies a sizable audience who are familiar with the same subset of music genre. These are the latent features, the building blocks as we have seen in the coefficient heatmaps. The lowbrow and the highbrow each confine themselves to separate latent features, residing in gated communities within the music genre map and knowing little of the other's world. The omnivore travels freely across these borders. Such class distinctions may be even more established in the cosmetics product category (e.g., women's makeup). Replacing genre with brand, you can read how this was handled in a prior post using NMF to analyze brand involvement.

R code to perform all the analyses reported in this post
library(prefmod)
data(music)
 
# keep only the 17 genre used
# in the AJS paper (see post)
prefer<-music[,c(1:11,13:18)]
 
# calculate number of missing values for each
# respondent and keep only those with no more
# than 6 missing values
miss<-apply(prefer,1,function(x) sum(is.na(x)))
prefer<-prefer[miss<7,]
 
# run frequency tables for all the variables
apply(prefer,2,function(x) table(x,useNA="always"))
# recode missing to the middle of the 5-point scale
prefer[is.na(prefer)]<-3
# reverse the scale so that larger values are
# associated with more liking and zero is
# the lowest value
prefer<-5-prefer
 
# longer names are easier to interpret
names(prefer)<-c("BigBand",
"Bluegrass",
"Country",
"Blues",
"Musicals",
"Classical",
"Folk",
"Gospel",
"Jazz",
"Latin",
"MoodEasy",
"Opera",
"Rap",
"Reggae",
"ConRock",
"OldRock",
"HvyMetal")
 
library(NMF)
fit<-nmf(prefer, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)
 
# recode bottom three boxes to zero
# and rerun NMF
prefer2<-prefer-2
prefer2[prefer2<0]<-0
# need to remove respondents with all zeros
total<-apply(prefer2,1,sum)
table(total)
prefer2<-prefer2[total>0,]
 
fit<-nmf(prefer2, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)

Wednesday, October 15, 2014

Beware Graphical Networks from Rating Scales without Concrete Referents

We think of latent variables as hidden causes for the correlations among observed measures and rely on factor analysis to reveal the underlying structure. In a previous post, I borrowed an alternative metaphor from the R package qgraph and produced the following correlation network. Instead of depression as a disease entity represented as a factor, this figure displays depression as a set of mutually reinforcing ratings located toward the bottom of the graph.


I selected the bfi dataset from the psych R package so that readers could reproduce the analysis and so that one could compare the factor structure and the correlation network. However, I was thinking in terms of actual behaviors and not agreement ratings for items from a personality inventory. This distinction was discussed in an earlier post introducing item response theory. The node "Mood Swings" should be measured by a series of concrete behaviors in actual situations. This is the goal of the patient outcome measurement and the call for context-aware measurement. Moreover, one sees the same focus on behaviors or symptoms in the work of Borsboom and his associates, including the author of the R package qgraph that generated the above graphical network.

In an excellent tutorial on network analysis of personality data in R, Sacha Epskamp and others present another example along with all the necessary R code. Correlation networks are produced with qgraph along with partial correlation and LASSO networks, the latter with the help of the R package parcor. This paper ("State of the aRt personality research") outlines all the steps to generate graphical models and interpret the indices that describe the network structure. This is not social network analysis, for the nodes are variables and the links are different measures of relationship.
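The difference between a correlation network and a partial correlation network can be seen in a three-variable sketch using only base R. The variables and the chain that generates them are simulated and entirely hypothetical; the partial correlations are obtained in the standard way, by rescaling the inverse of the correlation matrix and flipping the sign of the off-diagonal entries.

```r
# Simulated chain (made-up variables): irritable -> angry -> panic
set.seed(7)
n <- 1000
irritable <- rnorm(n)
angry     <- 0.7 * irritable + rnorm(n)
panic     <- 0.7 * angry + rnorm(n)
X <- cbind(irritable, angry, panic)

# Edges of a correlation network
R <- cor(X)
round(R, 2)

# Edges of a partial correlation network: each pair is adjusted
# for all of the remaining variables
P <- solve(R)           # inverse of the correlation matrix
pcor <- -cov2cor(P)     # rescale and flip the sign of the off-diagonals
diag(pcor) <- 1
round(pcor, 2)
```

In the correlation matrix all three nodes are linked, while the partial correlation between irritable and panic is close to zero because angry carries the connection between them. The packages named above handle the estimation and the plotting for realistic numbers of nodes.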

The data comes from a personality inventory with a list of 60 statements and a five-point agreement scale. The scoring key lists the six constructs, abbreviated HEXACO, and their associated items. The first in the list is Sincerity, one of the 24 nodes in the network maps, measured by the following three statements:
  • I wouldn't use flattery to get a raise or promotion at work, even if I thought it would succeed.
  • If I want something from someone, I will laugh at that person's worst jokes. [scale reversed]
  • I wouldn't pretend to like someone just to get that person to do favors for me.
I understand that we share a common conceptual space embedded in our language in which the endorsement or rejection of these items might provide some information about self-presentation. Yet, I expect that someone who has never worked could answer the first question because it has nothing to do with actual experience. All that I am being asked is whether I view myself as the type of person depicted in the statement. Similarly, I can respond to the second statement even if I never laugh at anyone's bad jokes. In fact, I would answer the same regardless of any propensity to laugh or not laugh at others' jokes.

The HEXACO model of personality structure is but one of a number of different approaches based on the lexical hypothesis that personality gets coded in language. There is a meeting of the minds over the distinctions that are made and what it might mean to position ourselves at different locations within this landscape. In order to communicate with others, we must come to some agreement about the meanings of the statements used in personality inventories. It is the talk and not the behavior that is responsible for the factor structure or the positioning of nodes in the network.

Where are the feedback loops or mutually reinforcing nodes with such measures? It makes sense to talk about a network when the nodes are behaviors, as in the lower portion of our above network map. I get irritated, so I am more likely to get angry. In this agitated state I panic more easily and experience mood swings, all of which makes me feel blue. You can download the 60-item self-report form and decide for yourself if the statements are linked by anything more than a shared conception and way of talking about personality traits.

Thursday, October 2, 2014

Consumer Preference Driven by Benefits and Affordances, Yet Management Sees Only Products and Features

Return on Investment (ROI) is management's bottom line. Consequently, everything must be separated and assigned a row with associated costs and profits. Will we make more by adding another product to our line? Will we lose sales by limiting the features or services included with the product?

The assumption is that consumers see and value the same products and features that management lists as line items on their balance sheets. It simply makes data collection and analysis so easy that the most popular techniques never question this assumption. For example, in my last post about TURF Analysis, I discussed the ice cream flavors problem. How many and what flavors of ice cream should you offer given limited freezer space?

A typical data collection would present each flavor separately and ask about purchase intent, either a binary buy or no buy or an ordered rating scale that is split into a buy-or-not-buy dichotomy using a cutoff score. Even if we assume that our client only sells ice cream in grocery stores, we still do not know anything about the context triggering this purchase. Was it bought for an individual or household? Will adults or children or both be eating it for snacks or after dinner? How will the ice cream be served (e.g., cones, bowls, or with something else like pie or cake)?

Had we started with a list of usage occasions, we could have asked about flavor choices for each occasion. In addition, we could have obtained some proportional allocation of how much each occasion contributed to total ice cream consumption. Obviously, we have multiplied the number of observations from every respondent since we ask about flavor selection for every usage occasion. Much of the data matrix will be empty since individuals are likely to buy only a few flavors over a limited set of occasions.

The typical TURF Analysis, on the other hand, strips away context. By removing the "why" for the purchase, we have induced a bias toward focusing on the flavor without any context. Technically, this was the goal of the research design in the first place. Management knows the costs associated with offering the flavor. It needs to know the profit, but that is exactly what it has failed to measure. In fact, it is unclear what is being measured. Does the respondent provide their own context by thinking of the most common purchase occasion, or do they report personal preferences as they might in any social gathering when asked about their favorite flavor of ice cream? Nonetheless, we still cannot calculate profit, for that would require a weighted average of selections over purchase occasions with the weights reflecting volume.
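The weighted average over purchase occasions is easy to write down once the occasion-level data exist. The numbers below are invented for illustration: choice holds the share of each flavor selected on each occasion, and volume holds the share of total consumption attributed to each occasion.

```r
# Hypothetical flavor shares within each usage occasion (rows sum to 1)
choice <- rbind(
  after_dinner = c(vanilla = 0.6, chocolate = 0.3, strawberry = 0.1),
  late_snack   = c(vanilla = 0.2, chocolate = 0.7, strawberry = 0.1),
  guests       = c(vanilla = 0.5, chocolate = 0.1, strawberry = 0.4)
)

# Hypothetical share of total ice cream volume by occasion (sums to 1)
volume <- c(after_dinner = 0.5, late_snack = 0.4, guests = 0.1)

# Volume-weighted flavor shares across all occasions
weighted <- drop(volume %*% choice)
weighted

# Treating every occasion as equally important gives a different picture
unweighted <- colMeans(choice)
round(unweighted, 3)
```

In this made-up case chocolate edges out vanilla once volume is taken into account, while the unweighted average puts vanilla first. A context-free purchase intent question cannot tell us which of these, if either, the respondent had in mind.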

Contextualized measurement yields high-dimensional sparse data that create problems for most optimization routines. Yet, we can analyze such data by searching for low-dimensional subspaces defined by benefits delivered and affordances provided. Purchases are made to deliver benefits. Flavors are but affordances. Someone in the household likes chocolate, so the ice cream must contain some minimal level of chocolate. Flavor has an underlying structure, and the substitution pattern reflects that structure. However, chocolate may not be desirable when the ice cream is served with cake or pie. Moreover, those "buy a second at a discount" sales change everything, as do special occasions when guests are invited and ice cream is served. Customers are likely to be acquired or lost at the margins, that is, in less common usage occasions where habit does not prevail. These will never be measured when we ask for preference "out of context" because they are simply not remembered without a specific purchase occasion probe.

Deconstructing Consumption

We start by identifying the situations where ice cream is the answer. Preference construction is triggered by situational need, and the consumer relies on situational constraints to assist in the purchase process. Situations tend to be separated by time and place (e.g., after dinner in the kitchen or dining area and late night snack in front of TV) and consequently can be modeled as additive effects. Each consumer can be profiled as some weighted combination of these recurring situations.

Moreover, we make sense of individual consumption by grouping together others displaying similar patterns. We can think of this as a type of collaborative filtering. Here again, we see additive effects where the total market can be decomposed into clusters of consumers with similar preferences. In order to capture such additive effects, I have suggested the use of nonnegative matrix factorization (NMF) in a previous post. The nonnegative restrictions help uncover additive effects, in this case, the additive effects of situations within consumers and decomposition of the total market into additive consumer segments.

You can find the details covering how to use and interpret the R package NMF in a series of posts on this blog published in July, August and September 2014. R provides an easy-to-use interface to NMF, and the output is no more difficult to understand than that produced by factor and cluster analyses. In this post I have focused on one specific application in order to make explicit the correspondence between a matrix factorization and the decomposition of a product category into its components reflecting both situational variation and consumer heterogeneity.

Bradley Efron partitions the history of statistics into three centuries with each defined by the problems that occupied its attention. The 21st century focuses on large data sets and complex questions (e.g., gene expression or data mining). Such high-dimensional data present special problems that must be faced by both statistics and people engaging in everyday life. Modeling consumption from this new perspective, we hope to achieve some insight into the purchase process and measures that will reflect what the consumer will and will not buy when they actually go shopping.

Monday, September 29, 2014

TURF Analysis: A Bad Answer to the Wrong Question

Now that R has a package performing Total Unduplicated Reach and Frequency (TURF) Analysis, it might be a good time to issue a warning to all R users. DON'T DO IT!

The technique itself is straight out of media buying from the 1950s. Given n alternative advertising options (e.g., magazines), which set of size k will reach the most readers and be seen the most often? Unduplicated reach is the primary goal because we want everyone in the target audience to see the ad. In addition, it was believed that seeing the ad more than once would make the ad more effective (that is, until wearout), which is why frequency is a component. When TURF is used to create product lines (e.g., flavors of ice cream to carry given limited freezer space), frequency tends to be downplayed and the focus placed on reaching the largest percentage of potential customers. All this seems simple enough until one looks carefully at the details, and then one realizes that we are interpreting random variation.
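To make the two quantities concrete, here is a minimal base R sketch of reach and frequency for every pair of items. The 0/1 purchase matrix is invented for illustration, the calculation is unweighted, and the definitions (reach as the share buying at least one item in the set, frequency as the mean number of items bought from the set) are my working assumptions rather than a description of how any particular package computes them.

```r
# Hypothetical 0/1 purchase matrix: 6 respondents (rows) by 4 items (columns).
# These values are invented for illustration only.
buy <- matrix(c(1, 0, 0, 1,
                0, 1, 0, 0,
                0, 0, 0, 0,
                1, 1, 0, 0,
                0, 0, 1, 0,
                0, 1, 1, 0),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("resp", 1:6), paste0("item", 1:4)))

k <- 2
combos <- combn(ncol(buy), k)   # each column is one set of k items

# For each set: reach = share buying at least one item in the set,
# frequency = mean number of items in the set bought per respondent.
stats <- apply(combos, 2, function(items) {
  hits <- rowSums(buy[, items, drop = FALSE])
  c(reach = mean(hits > 0), freq = mean(hits))
})

result <- data.frame(set = apply(combos, 2, paste, collapse = "+"),
                     t(stats))
result[order(-result$reach, -result$freq), ]
```

Even in this tiny made-up example, several different pairs tie for the highest reach, which is the kind of pattern discussed in the rest of this post.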

The R package turfR includes an example showing how to use its turf() function by setting n to 10 and letting k range from 3 to 6.

library(turfR)
data(turf_ex_data)
ex1 <- turf(turf_ex_data, 10, 3:6)
ex1

This code produces a considerable amount of output. I will show only the 10 best triplets from the 120 possible sets of three that can be formed from 10 alternatives. The rchX column gives the weighted proportion of the 180 individuals in the dataset who would buy at least one of the products in the set, and the columns labeled with integers from 1 to 10 flag which of the 10 products are included. Thus, according to the first row, 99.9% would buy something if Items 8, 9, and 10 were offered for sale.

   combo     rchX     frqX  1  2  3  4  5  6  7  8  9 10
1    120 0.998673 2.448993  0  0  0  0  0  0  0  1  1  1
2    119 0.998673 2.431064  0  0  0  0  0  0  1  0  1  1
3     99 0.995773 1.984364  0  0  0  1  0  0  0  1  0  1
4    110 0.992894 2.185398  0  0  0  0  1  0  0  0  1  1
5     64 0.991567 1.898693  0  1  0  0  0  0  0  0  1  1
6    109 0.990983 2.106944  0  0  0  0  1  0  0  1  0  1
7     97 0.99085  1.966436  0  0  0  1  0  0  1  0  0  1
8    116 0.989552 2.341179  0  0  0  0  0  1  0  0  1  1
9     85 0.989552 2.042792  0  0  1  0  0  0  0  0  1  1
10    36 0.989552 1.800407  1  0  0  0  0  0  0  0  1  1

The sales pitch for TURF depends on showing only the "best" solution for each set size from 3 through 6. Once we look down the list, we find that there are lots of equally good combinations with different products (e.g., the combination in the 7th position yields 99.1% reach with products 4, 7 and 10). With a sample size of 180, I do not need to run a bootstrap to know that the drop from 99.9% to 99.1% reflects random variation or error.
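A back-of-the-envelope check makes the same point. The sketch below uses the ordinary binomial standard error of a proportion with n = 180. It is only a rough approximation under my own simplifying assumptions: it ignores the survey weights and the fact that both reach figures come from the same respondents.

```r
n <- 180                      # sample size reported for turf_ex_data
reach_top <- 0.998673         # rchX for the top-ranked triplet
reach_7th <- 0.99085          # rchX for the triplet in 7th position

# Binomial standard error of a proportion, ignoring the survey weights
# and the dependence between sets scored on the same respondents.
se_7th <- sqrt(reach_7th * (1 - reach_7th) / n)

reach_drop <- reach_top - reach_7th
c(drop = reach_drop, se = se_7th, drop_in_se_units = reach_drop / se_7th)
```

Under these assumptions the drop in reach is roughly one standard error, well short of anything one would want to act on.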

Of course, the data from turfR are simulated, but I have worked with many clients and many different datasets across a range of categories, and I have never found anything but random differences among the top solutions. I have seen solutions where the top several hundred combinations cannot be distinguished based on reach, which is reasonable given that the number of combinations increases rapidly with n and k (e.g., the R function choose(30,5) indicates that there are 142,506 possible combinations of 30 things in sets of 5). You can find an example of what I see over and over again by visiting the TURF website for XLSTAT software.
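The growth in the number of candidate sets is easy to verify with choose(). The short sketch below reproduces the two counts mentioned in this post and, as an added illustration of my own, tabulates the count for n = 30 as k runs from 3 to 6.

```r
choose(10, 3)    # 120 triplets from 10 items, as in the turfR example
choose(30, 5)    # 142506 sets of 5 from 30 items

# Number of candidate sets for n = 30 as the set size k grows
k <- 3:6
n_sets <- choose(30, k)
data.frame(k = k, n_sets = n_sets)
```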

Obviously, there is no single best item combination that dominates all others. It could have been otherwise. For example, it is possible that the market consists of distinct segments with each wanting one and only one item.

With no overlapping in this Venn diagram, it is clear that vanilla is the best single item, followed by vanilla and chocolate as the best pair, and so on had there been more flavors separated in this manner.

However, consumer segments are seldom defined by individual offerings in the market. You do not stop buying toothpaste because your brand has been discontinued. TURF asks the wrong question because consumer segmentation is not item-based.

As a quick example, we can think about credit card reward programs with their categories covering airlines, cash back, gas rebates, hotel, points, shopping and travel. Each category could contain multiple reward offers. A TURF analysis would seek the best individual rewards, ignoring the categories. Yet, comparison websites use categories to organize searches because consumer segments are structured around the benefits offered by each category.

The TURF Analysis procedure from XLSTAT allows you to download an Excel file with purchase intention ratings for 27 items from 185 respondents. A TURF analysis would require that we set a cutoff score to transform the 1 through 5 ratings into a 0/1 binary measure. I prefer to maintain the 5-point scale and treat purchase intent as an intensity score after subtracting one so that the scale now ranges from 0=not at all to 4=quite sure. A nonnegative matrix factorization (NMF) reveals that the 27 items in the columns fall into 8 separable categories shown as rows, with red indicating a high probability of membership and yellow, with values close to zero, marking the categories where the product does not belong.

The above heatmap displays the coefficients for each of the 27 products, as the original Excel file names them. Unfortunately, we have only the numbers and no description of the 27 products. Still, it is clear that interest has an underlying structure and that perhaps we ought to consider grouping the products based on shared features, benefits or usages. For example, what do Products 5, 6 and 17 clustered together at the end of this heatmap have in common? Understand, we are looking for stable effects that can be found in the data and in the market where purchases are actually made.

The right question asks about consumer heterogeneity and whether it supports product differentiation. Different product offerings are only needed when the market contains segments seeking different benefits. Those advocating TURF analysis often use ice cream flavors as their example, as I did in the above Venn diagram. What if the benefit driving sales of less common flavors was not the flavor itself but the variety associated with a new flavor or a special occasion when one wants to deviate from their norm? A segmentation, whether NMF or another clustering procedure, would uncover a group interested in less typical flavors (probably many such flavors). This is what I found from the purchase history of whiskey drinkers, a number of segments each buying one of the major brands and a special occasion or variety seeking segment buying many niche brands. All of this is missed by a TURF analysis that gives us instead a bad answer to the wrong question.

Appendix with R Code needed to generate the heatmap:

First, download the Excel file, convert it to csv format, and set the working directory to the location of the data file.

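# Purchase intent ratings (1-5) exported from the XLSTAT Excel file
test <- read.csv("demoTurf.csv")
library(NMF)
# Drop the first column (assumed to be the respondent ID) and shift ratings to 0-4
fit <- nmf(test[, -1] - 1, 8, method = "lee", nrun = 20)
# Heatmap of the coefficient matrix: 8 latent features by 27 products
coefmap(fit)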
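The difference between the two ways of preparing the ratings, a cutoff for TURF versus an intensity score for NMF, can be sketched on a small invented ratings matrix. The values and the top-two-box cutoff of 4 are assumptions chosen only for illustration.

```r
# Hypothetical 1-5 purchase intent ratings: 4 respondents by 3 items (invented values).
ratings <- matrix(c(5, 1, 3,
                    4, 2, 1,
                    1, 5, 4,
                    2, 4, 5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("P1", "P2", "P3")))

# TURF-style input: a cutoff turns ratings into 0/1 (top-two-box is an assumed cutoff).
binary <- (ratings >= 4) * 1

# NMF-style input: shift the scale so it runs from 0 (not at all) to 4 (quite sure).
intensity <- ratings - 1

binary
intensity
```

The binary version discards the distinction between a 1 and a 3, or a 4 and a 5, while the shifted scale keeps the full range and stays nonnegative as NMF requires.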


Saturday, September 27, 2014

Recognizing Patterns in the Purchase Process by Following the Pathways Marked By Others

Herbert Simon's "ant on the beach" does not search for food in a straight line because the environment is not uniform with pebbles, pools and rough terrain. At least the ant's decision making is confined to the 3-dimensional space defining the beach. Consumers, on the other hand, roam around a much higher dimensional space in their search for just the right product to buy.

Do you search online or shop retail? Do you go directly to the manufacturer's website or do you seek out professional reviews or user ratings? Does YouTube or social media hold the key? Similar decisions must be made for physical searches of local retailers and superstores. Of course, embedded within each of these decision points are more choices concerning features, servicing and price.

Yet, we do not observe all possible paths in the consumer purchase journey. Like the terrain of the beach, the marketplace makes some types of searches easier than others. In addition, like the ant, the first consumers leave trails that later consumers can follow. This can be direct word of mouth or indirect effects such as internet searches where the websites shown first depend on the number of previous visits. But it can also be marketing messaging and expert reviews, that is, markers along the trail telling us what to look for and where to look. We are social creatures, and it is fascinating to see how quickly all the possible paths through the product offerings are narrowed down to several well-worn trails that we all follow. Culture impacts what and how we buy, and statistical modeling that incorporates what others are doing may be our best hope of discovering those pathways.

In order to capture everyone in the product market and all possible sources of information, we require a wide net with fine webbing. Our data matrix will contain heterogeneous rows of consumers with distinctive needs who are seeking very different benefits. Moreover, our columns must be equally diverse to span everywhere that a consumer can search for product information. As a result, we can expect our data matrix to be sparse for we have included many more columns of information sources than any one consumer would access.

To make sense of such a data matrix, we will require a statistical model or algorithm that reflects this construction process, by which I mean the social and cultural grouping of consumers who share a common understanding of what is important to know and where one should seek such information. For example, someone looking for a new credit card could search and apply solely online, but not every consumer would, for some do not shop on the internet or feel insecure without the presence of a physical building close to home. Those wanting to apply in person may wait for a credit card offer to be inserted in their monthly bank statement or they may see an advertisement in the local newspaper.

Modeling the Joint Separation of Consumers and Their Information Sources

Nonnegative matrix factorization (NMF) decomposes the nonnegative data matrix into the product of two other nonnegative matrices, one for consumers and the other for information sources. The goal is dimension reduction. Before NMF, we needed all p columns of the data matrix to describe the consumer. Now, we can get by with only the r latent features, where r is much smaller than p. What are these latent features? They are defined in the same manner as the factors in factor analysis. Our second matrix from the nonnegative factorization contains coefficients that can be interpreted as one would factor loadings. We look for the information sources with the largest weights to name the latent feature.

Returning to our credit card example, the data matrix includes rows for consumers banking online and in-person plus columns for online search along with columns for direct mail and newspaper ads. Online banking customers use online information sources, while in-person banking customers can be found looking for information in a different cluster of columns. We have separation with online rows and columns forming one block and in-person rows and columns coming together in a separate block.

The nonnegativity of the two product matrices enables such a "parts-based" representation with the simultaneous clustering of both rows and columns. We start with the observed data matrix. It is nonnegative so that zero indicates none and a larger positive value suggests more of whatever is being measured. Counts or frequencies of occurrence would work. Actually, the data matrix can contain any intensity measure. Hopefully, you can visualize that the data matrix will be more sparse (more zeros) with greater separation between the row-column blocks, and in turn, this sparsity will be associated with corresponding sparsity in the two product matrices.

A toy example might help with this explanation.

    V1 V2 V3 V4
S1   6  3  0  0
S2   4  2  0  0
S3   2  1  0  0
S4   0  0  6  3
S5   0  0  4  2
S6   0  0  2  1

The above data matrix shows the intensity of search scores from 0 (no search) to 6 (intense search) for six consumers across four different information sources. What might have produced such a pattern? The following could be responsible:
  • Online sources in the first two columns with V1 more popular than V2,
  • Offline sources in the last two columns with V3 more popular than V4,
  • Online customers in the first three rows with individual search intensity S1 > S2 > S3, and
  • Offline customers in the last three rows with individual search intensity S4 > S5 > S6.
The pattern might seem familiar as row and column effects from an analysis of variance. The columns form a two-level repeated measures factor with V1 and V2 nested in the first level (online) and V3 and V4 in the second level (offline). Similarly, the rows fall into two levels of a between-subject factor with the first three rows nested in level one (online) and the last three rows in level two (offline). Biclustering algorithms approach the problem in this manner (e.g., the R package biclust). Matrix factorization achieves a similar outcome by factoring the data matrix into the product of two new matrices with one representing row effects and the other column effects.

The NMF R package decomposes the data matrix into the two components that are believed to have generated the data in the first place. In fact, I created the data matrix as a matrix product and then used NMF to retrieve the generating matrices. The R code is given at the end of this post. The matrices W and H, below, reflect the above four bullet points. When these two matrices are multiplied, their product W x H is the above data matrix (e.g., the first entry in the data matrix is 3x2+0x0=6).

W   R1 R2        H   V1 V2 V3 V4
S1   3  0        R1   2  1  0  0
S2   2  0        R2   0  0  2  1
S3   1  0
S4   0  3
S5   0  2
S6   0  1

As expected, when we run the nmf() function with rank r=2 on this data matrix, we get these two matrices back again with W as the basis and H as the coefficient matrix. Actually, because W and H are multiplied, we might find that every element in W is divided by 2 and every element in H is multiplied by 2, which would yield the same product. Looking at the weights in H, one concludes that R1 taps online information sources, leaving R2 as the offline latent feature. If you wished to standardize the weights, all the coefficients in a row could be transformed to range from 0 to 1 by dividing by the maximum value in that row.
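Both points, the scale indeterminacy and the row standardization, can be checked in base R without running NMF at all. The sketch below rebuilds W and H from the toy example; the row and column labels and the name H_std are my own additions for readability.

```r
# W and H as given in the toy example
W <- matrix(c(3, 2, 1, 0, 0, 0,
              0, 0, 0, 3, 2, 1), nrow = 6,
            dimnames = list(paste0("S", 1:6), c("R1", "R2")))
H <- matrix(c(2, 0, 1, 0, 0, 2, 0, 1), nrow = 2,
            dimnames = list(c("R1", "R2"), paste0("V", 1:4)))
V <- W %*% H

# Rescaling the two factors in opposite directions leaves the product unchanged
all.equal((W / 2) %*% (H * 2), V)

# Standardize each row of H to a 0-1 range by dividing by its row maximum
H_std <- sweep(H, 1, apply(H, 1, max), "/")
H_std
```

Because only the product is identified, it is the relative sizes of the weights within a row of H that matter when naming a latent feature, which is what the standardized version makes easier to read.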

Decompositions such as NMF are common in statistical modeling. Regression analysis in R using the lm() function is performed as a QR decomposition. The singular value decomposition (SVD) underlies much of principal component analysis. Nothing unusual here, except for the ability of NMF to thrive when the data are sparse.
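For readers who want to see these two familiar decompositions at work, here is a small sketch using the cars dataset that ships with R. The dataset is simply a convenient stand-in of my choosing; nothing about it is specific to consumer data.

```r
# Built-in 'cars' data (speed, dist) stands in for any regression problem
fit_lm <- lm(dist ~ speed, data = cars)

# Same least-squares coefficients from an explicit QR decomposition
X <- cbind(1, cars$speed)
beta_qr <- qr.coef(qr(X), cars$dist)
all.equal(unname(coef(fit_lm)), unname(beta_qr))

# Principal component standard deviations from the SVD of the centered data
Xc <- scale(cars, center = TRUE, scale = FALSE)
sd_svd <- svd(Xc)$d / sqrt(nrow(Xc) - 1)
all.equal(prcomp(cars)$sdev, sd_svd)
```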

To be clear, sparsity is achieved when we ask about the details of consumer information search. Such details enable management to make precise changes in their marketing efforts. As important, detailed probes are more likely to retrieve episodic memories of specific experiences. It is better to ask about the details of price comparison (e.g., visit competitor website or side-by-side price comparison on Amazon or some similar site) than just inquire if they considered price during the purchase process.

Although we are not tracking ants, we have spread sensors out all over the beach, a wide network of fine mesh. Our beach, of course, is the high-dimensional space defined by all possible information sources. This space can be huge, over a billion combinations when we have only 30 information sources measured as yes or no. Still, as long as consumers confine their searches to low-dimensional subspaces, the data matrix will have the sparsity needed by the decompositional algorithm. That is, NMF will be successful as long as consumers adopt one of several established search pathways clearly marked by repeated consumer usage and marketing signage.
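The arithmetic behind "over a billion" and the idea of a low-dimensional subspace can both be checked in a few lines. The sketch below counts the yes/no profiles for 30 sources and then, as an illustration of my own, looks at the linear rank and the share of zeros in the toy search matrix from the example earlier in this post.

```r
n_profiles <- 2^30      # number of distinct yes/no profiles across 30 sources
n_profiles > 1e9

# Toy search matrix from the example: 6 consumers by 4 sources
V <- matrix(c(6, 4, 2, 0, 0, 0,
              3, 2, 1, 0, 0, 0,
              0, 0, 0, 6, 4, 2,
              0, 0, 0, 3, 2, 1), nrow = 6)
v_rank <- qr(V)$rank        # linear rank of the toy matrix
zero_share <- mean(V == 0)  # share of zero cells
c(rank = v_rank, zero_share = zero_share)
```

Four columns of search data collapse onto two dimensions here, which is the small-scale version of consumers confining themselves to a few established pathways.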

R code to create the V=WH data matrix and run the NMF package:

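# Generating matrices: W (6 consumers x 2 latent features), H (2 latent features x 4 sources)
W <- matrix(c(3,2,1,0,0,0,0,0,0,3,2,1), nrow=6)
H <- matrix(c(2,0,1,0,0,2,0,1), nrow=2)
V <- W %*% H
W; H; V

library(NMF)
# Rank-2 factorization using the Lee-Seung algorithm, keeping the best of 20 random starts
fit <- nmf(V, 2, method="lee", nrun=20)
fit
round(basis(fit), 3)            # estimated W
round(coef(fit))                # estimated H
round(basis(fit) %*% coef(fit)) # reconstructed V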
