Tuesday, July 15, 2014

Taking Inventory: Analyzing Data When Most Answer No, Never, or None

Consumer inventories, as the name implies, are tallies of things that consumers buy, use or do. Product inventories, for example, present consumers with rather long lists of all the offerings in a category and ask which or how many or how often they buy each one. Inventories, of course, are not limited to product listings. A tourist survey might inquire about all the different activities that one might have enjoyed on their last trip (see Dolnicar et al. for an example using the R package biclust). Customer satisfaction studies catalog all the possible problems that one could experience with their car, their airline, their bank, their kitchen appliances and a growing assortment of product categories. User experience research gathers frequency data for all product features and services. Music recommender systems seek to know what you have listened to and how often. Google Analytics keeps track of every click. Physicians inventory medical symptoms.

For most inventories the list is long, and the resulting data are sparse. The attempt to be comprehensive and exhaustive produces lists with many more items than any one consumer could possibly experience. Now, we must analyze a data matrix where no, never, or none is the dominant response. These data matrices can contain counts of the number of times in some time period (e.g., purchases), frequencies of occurrences (e.g., daily, weekly, monthly), or assessments of severity and intensity (e.g., a medical symptoms inventory). The entries are all nonnegative values. Presence and absence are coded zero and one, but counts, frequencies and intensities include other positive values to measure magnitude.

An actual case study would help, however, my example of a feature usage inventory relies on proprietary data that must remain confidential. This would be a severe limitation except that almost every customer inventory analysis will yield similar results under comparable conditions. Specifically, feature usage is not random or haphazard, but organized by wants and needs and structured by situation and task. There are latent components underlying all product and service usage. We use what we want and need, and our wants and needs flow from who we are and the limitations imposed by our circumstances.

In this study a sizable sample of customers were asked how often they used a list of 72 different features. Never was the most frequent response, although several features were used daily or several times a week. As you might expect, some features were used together to accomplish the same tasks, and tasks tended to be grouped into organized patterns for users with similar needs. That is, one would not be surprised to discover a smaller number of latent components controlling the observed frequencies of feature usage.

The R package NMF (nonnegative matrix factorization) searches for this underlying latent structure and displays it in a coefficient heatmap using the function coefmap(object), where object is the name of list return by the nmf function. If you are looking for detailed R code for running nmf, you can find it in two previous posts demonstrating how to identify pathways in the consumer purchase journey and how to uncover the structure underlying partial rankings of only the most important features (top of the heap).

The following plot contains 72 columns, one for each feature. The number of rows are supplied to the function by setting the rank. Here the rank was set to ten. In the same way as one decides on the best number of factors in factor analysis or the best number of clusters in cluster analysis, one can repeat the nmf with different ranks. Ten works as an illustration for our purposes. We start by naming those latent components in the rows. Rows 3 and 8 have many reddish rectangles side-by-side suggesting that several features are accessed together as a unit (e.g., all the features needed to take, view, and share pictures with your smartphone). Rows 1, 2, 4 and 5, on the other hand, have one defining feature with some possible support features (e.g., 4G cellular connectivity for your tablet).
The dendrogram at the top summarizes the clustering of features. The right hand side indicates the presence of two large clusters spanning most of the features. Both rows 3 and 8 pull together a sizable number of features. However, these blocks are not of uniform color hinting that some features may not be used as frequently as others of the same type. Rows 6, 7, 9 and 10 have a more uniform color, although the rectangles are smaller consisting of combinations of only 2, 3 or 4 features. The remaining rows seem to be defined by a single feature each. It is in the manner that one talks about NMF as a feature clustering technique.

You can see that NMF has been utilized as a rank-reduction technique. Those 4 blocks of features in rows 6, 7, 9 and 10 appear to function as units, that is, if one feature in the block is used, then all the features in the block are used, although to different degrees as shown by the varying colors of the adjacent rectangles. It is not uncommon to see a gate-keeping feature with a very high coefficient anchoring the component with support features that are used less frequently in the task. Moreover, features with mixture coefficients across different components imply that the same feature may serve different functions. For example, you can see in row 8 a grouping of features near the middle of the row with mixing coefficients in the 0.3 to 0.6 range for both rows 3 and 8. We can see the same pattern for a rectangle of features a little more left mixing rows 3 and 6. At least some of the features serve more than one purpose.

I would like to offer a little more detail so that you can begin to develop an intuitive understanding of what is meant by matrix factorization with nonnegativity constraints. There are no negative coefficients in H, so that nothing can be undone. Consequently, the components can be thought of as building blocks for each contain the minimal feature pattern that act together as a unit. Suppose that a segment only used their smartphones to make and receive calls so that their feature usage matrix had zeroes everywhere except for everyday use of the calling features. Would we not want a component to represent this usage pattern? And what if they also used their phone as a camera but only sometimes? Since there is probably not a camera-only segment, we would not expect to see camera-related features as a standalone component. We might find, instead, a single component with larger coefficients in H for calling features and smaller coefficients in the same row of H for the camera features.

Recalling What We Are Trying to Do

It always seems to help to recall that we are trying to factor our data matrix. We start with an inventory containing the usage frequency for some 72 features (columns) for all the individual users (rows). Can we still reproduce our data matrix using fewer columns? That is, can we find fewer than 72 component scores for individual respondents that will still reproduce approximately the scores for all 72 features? Knowing only the component scores for each individual in our matrix W, we will need a coefficient matrix H that takes the component scores and calculates feature scores. Then our data matrix V is approximated by W x H (see Wikipedia for a review).

We have seen H (feature coefficients), now let's look at W (latent component scores). Once again, NMF displays usage patterns for all the respondents with a heatmap. The columns are our components, which were defined earlier in terms of the features. Now, what about individual users? The components or columns constitute building blocks. Each user can decide to use only one of the components or some combination of several components. For example, one could choose to use only the calling features or seldom make calls and text almost everything or some mixture of these two components. This property is often referred to in the NMF literature as additivity (e.g., learning the parts of objects).

So, how should one interpret the above heatmap? Do we have 10 segments, one for each component? Such a segmentation could be achieved by simply classifying each respondent as belonging to the component with the highest score. We start with fuzzy membership and force it to be all or none. For example, the first block of users at the top of column 7 can be classified as Component #7 users, where Component #7 has been named based on the features in H with the largest coefficients. As an alternative, the clustered heatmap takes the additional step of running a hierarchical cluster analysis using distances based on all 10 components. By treating the 10 components as mixing coefficients, one could select any clustering procedure to form the segments. A food consumption study referenced in an earlier post reports on a k-means in the NMF-derived latent space.

Regardless of what you do next, the heatmap provides the overall picture and thus is a good place to start. Heatmaps can produce checkerboard patterns when different user groups are defined by their usage of completely different sets of features (e.g., a mall with distinct specialty stores attracting customers with diverse backgrounds). However, this is not what we see in this heatmap. Instead, Component #7 acts almost as continuous usage intensity factor: the more ways you use your smartphone, the more you use your smartphone (e.g., business and personal usage). The most frequent flyers fly for both business and pleasure. Cars with the most mileage both commute and go on vacation. Continuing with examples will only distract from the point that NMF has enabled us to uncover structure from a large and largely sparse data matrix. Whether heterogeneity takes a continuous or discrete form, we must be able to describe it before we can respond to it.

Thursday, July 10, 2014

How Much Can We Learn from Top Rankings using Nonnegative Matrix Factorization?

Purchases are choices from available alternatives. Post-purchase, we know what is the most preferred, but all the other options score the same. Regardless of differences in appeal, all the remaining items received the same score of not chosen. A second choice tells us more, as would the alternative selected as third most preferred. As we add top rankings from first to second to the kth choice, we seem to gain more and more information about preferences. Yet, what if we concentrated only on the top performers, what might be called the "pick of the litter" or the "top of the heap" (e.g., top k from J alternatives)? How much can be learn from such partial rankings?

Jan de Leeuw shows us what can be done with a complete ranking. What if we were to take de Leeuw breakfast food dataset and keep only the top-3 rankings so that all we know is what each respondent selected as their first, second and third choices? Everything that you would need to know is contained in the Journal of Statistical Software article by de Leeuw and Mair (see section 6.2). The data come in a matrix with 42 individuals and 15 breakfast foods. I have reproduce his plot below to make the discussion easier to follow. Please note that all the R code can be found at the end of this post.

The numbers running from 1 to 42 represent the location of each individual ranking the 15 different breakfast foods. That is, rows are individuals, columns are foods, and the cells are rankings from 1 to 15 for each row. What would you like for breakfast? Here are 15 breakfast foods, please order them in terms of your preference with "1" being your most preferred food and "15" indicating your least preferred.

The unfolding model locates each respondent's ideal and measures preference as distance from that ideal point. Thus, both rows (individuals) and columns (foods) are points that are positioned in the same space such that the distances between any given row number and the columns have the same ordering as the original data for that row. As a result, you can reproduce (approximately) an individual's preference ordering from the position of their ideal point relative to the foods. Who likes muffins? If you answered, #23 or #34 or #33 or anyone else nearby, then you understand the unfolding map.

Now, suppose that only the top-3 rankings were provided by each respondent. We will keep the rankings for first, second and third and recode everything else to zero. Now, what values should be assigned to the first, second and third picks? Although ranks are not counts, it is customary to simply reverse the ranks so that the weight for first is 3, second is 2, and third is 1. As a result, the rows are no longer unique values of 1 to 15, but instead contain one each of 1, 2 and 3 plus 12 zeroes. We have wiped out 80% of our data. Although there are other approaches for working with partial rankings, I will turn to nonnegative matrix factorization because I want a technique that works well with sparsity, for example, top 3 of 50 foods or top 5 of 100 foods. Specifically, we are seeking a general approach for dealing with any partial ranking that generates sparse data matrices. Nonnegative matrix factorization seems to be up for the task, as demonstrated in a large food consumption study.

We are now ready for the nmf R package as soon as we specify the number of latent variables. I will try to keep it simple. The data matrix is 42 x 15 with each row having 12 zeroes and three entries that are 1, 2 and 3 with 3 as the best (ranking reversed). Everything would be simpler if the observed breakfast food rankings resulted from a few latent consumption types (e.g., sweet-lovers tempted by pastries, the donuts-for-breakfast crowd, the muffin-eaters and the bread-slicers). Then, observed rankings could be accounted for by some combination of these latent types. "Pure Breads" select only toast or hard roll. "Pure Muffins" pick only the three varieties of muffin, though corn muffin may not be considered a real muffin by everyone. Coffee cakes may be its own latent type, and I have idea how nmf will deal with cinnamon toast (remember that the data is at least 40 years old). From these musings one might reasonably try three or four latent variables.

The nonnegative matrix factorization (nmf) was run with four latent variables. The first function argument is the data matrix, followed by the rank or number of latent variables, with the method next, and a number indicating the number of times you want the analysis rerun with different starting points. This last nrun argument works in the same manner as the nstart argument in kmeans. Local minimum can be a problem, so why not restart the nmf function with several different initialization and pick the best solution? The number 10 seemed to work with this data matrix, by which I mean that I obtained similar results each time I reran the function with nrun=10. You will note that I did not set the seed, so that you can try it yourself and see if you get a similar solution.

The coefficient matrix is shown below. The entries have been rescaled to fall along a scale from 0 to 100 for no other reason than it is relative value that is important and marketing research often uses such a thermometer scale. Because I will be interpreting these coefficients as if they were factor loadings, I borrowed the fa.sort() function from the psych R package. Hopefully, this sorting make it easier to see the underlying pattern.

Obviously, these coefficients are not factor loadings, which are correlations between the observed and latent variables. You might want to think of them as if they were coefficients from a principal component analysis. What are these coefficients? You might wish to recall that we are factoring our data matrix into two parts: this coefficient matrix and what is called a basis matrix. The coefficient matrix enables us to name the latent variables by seeing the association between the observed and latent variables. The basis matrix includes a row for every respondent indicating the contribution of each latent variable to their top 3 rankings. I promise that all this will become clearer as we work through this example.

Coffee Cake Muffin Pastry Bread
cofcake 70 1 0 0
cornmuff 2 0 0 2
engmuff 0 38 0 4
bluemuff 2 36 5 0
cintoast 0 7 0 3
danpastry 1 0 100 0
jdonut 0 0 25 0
gdonut 8 0 20 0
cinbun 0 6 20 0
toastmarm 0 0 12 10
toast 0 0 2 0
butoast 0 3 0 51
hrolls 0 0 2 22
toastmarg 0 1 0 14
butoastj 2 0 7 10

These coefficients indicate the relative contribution of each food. The columns are named as one would name a factor or a principal component or any other latent variable. That is, we know what a danish is and a glazed or jelly donut, but we know nothing about the third column except that interest in these three breakfast foods seem to covary together. Pastry seemed like a good, although not especially creative, name. These column names seem to correspond to the different regions in the joint configuration plot derived from the complete rankings. In fact, I borrowed de Leeuw's cluster names from the top of his page 20.

And what about the 42 rows in the basis matrix? The nmf package relies on a heatmap to display the relationship between the individuals and the latent variables.

Interpretation is made easier by the clustering of the respondents along the left side of the heatmap. We are looking for blocks of solid color in each column, for example, the last 11 rows or the 4 rows just above the last 11 rows. The largest block falls toward the middle of the third column associated with pastries, and the first several rows tend to have their largest values in the first column. although most have membership in more than one column. The legend tells us that lighter yellows indicate the lowest association with the column and the darkest reds or browns identify the strongest connection. The dendrogram divides the 42 individuals into the same groupings if you cut the tree at 4 clusters.

The dendrogram also illustrates that some of the rows are combinations of more than one type. The whole, meaning the 42 individuals, can be separated into four "pure" types. A pure type is an individual whose basis vector contains one value very near one and the remaining values very near zero. Everyone is a combination of the pure types or latent variables. Some are all pure types, and some are mixtures of different types. The last 4 rows are a good example of a mixture of muffins and breads (columns 4 and 2).

Finally, I have not compared the location of the respondents on the configuration plot with their color spectrum in the heatmap. There is a correspondence, for example, #37 is near the breads on the plot and in the bread column on the heatmap. And we could continue with #19 into pastries and #33 eating muffins, but we will not since one does not expect complete agreement when the heatmap has collapsed the lower 80% of the rankings. We have our answer to the initial question raised in the title. We can learn a great deal about attraction using only the top rankings. However, we have lost any avoidance information contained in the complete rankings.

So, What Is Nonnegative Matrix Factorization?

I answered this question at the end of a previous post, and it might be helpful for you to review another example. I show in some detail the equation and how the coefficient matrix and the basis matrix combine to yield approximations of the observed data.

What do you want for breakfast? Is it something light and quick, or are you hungry and want something filling? We communicate in food types. A hotel might advertise that their price includes a continental breakfast. Continental breakfast is a food type. Bacon and eggs are not included. This is the structure shaping human behavior that nonnegative matrix factorization attempts to uncover. There were enough respondents who wanted only the foods from each of the four columns that we were able to extract four breakfast food types. These latent variables are additive so that a respondent can select according to their own individual proportions how much they want the foods from each column.

Nonnegative matrix factorization will succeed to the extent that preferences are organized as additive groupings of observed choices. I would argue that a good deal of consumption is structured by goals and that these latent variables reflect goal-derived categories. We observe the selections made by individuals and infer their motivation. Those inferences are the columns of our coefficient matrix, and the rows of the heatmap tell us how much each respondent relies on those inferred latent constructs when making their selections.

R code needed to recreate all the tables and plots:

res <- smacofRect(breakfast)
plot(res, plot.type = "confplot")
apply(breakfast, 2, table)
apply(partial_rank, 2, table)
fit<-nmf(partial_rank, 4, "lee", nrun=10)

Created by Pretty R at inside-R.org

Tuesday, July 8, 2014

Are Consumer Preferences Deep or Shallow?

John Hauser, because no one questions his expertise, is an excellent spokesperson for the viewpoint that consumer preferences are real, as presented in his article "Self-Reflection and Articulated Consumer Preferences." Simply stated, preferences are enduring when formed over time and after careful consideration of actual products. As a consequence, accurate measurement requires us to encourage self-reflection within realistic contexts. "Here true preferences mean the preferences consumers use to make decisions after a serious evaluation of the products that are available on the market."

However, serious evaluation takes some time and effort, in fact, a series of separate online tasks including revealed preference plus self-reports of both attribute-level preferences and decision-making strategies. We end up with a lot of data from each respondent enabling the estimation of a number of statistical models (e.g., a hierarchical Bayes choice-based conjoint that could be fit using the bayesm R package). All this data is deemed necessary in order for individuals to learn their "true" preferences. Underlying Hauser's approach is a sense of inevitably that a decision maker will arrive at the same resolution regardless of their path as long as they begin with self-reflection.

A more constructivist alternative can be found in my post on "The Mind is Flat!" where it is argued that we lack the cognitive machinery to generate, store and retrieve the extensive array of enduring preferences demanded by utility theory. Although we might remember our previous choice and simply repeat it as a heuristic simplification strategy, working our way through the choice processes anew will likely result in a different set of preferences. Borrowing a phrase from Stephen Jay Gould, replaying the "purchase process tape" will not yield the same outcome unless there are substantial situational constraints forcing the same resolution.
Do preferences control information search, or are preferences evoked by the context? Why would we not expect decision making to be adaptive and responsive to the situation? Enduring preferences may be too rigid for our constantly changing marketplaces. Serendipity has its advantages. After the fact, it is easy to believe that whatever happened had to be. Consider the case study from Hauser's article, and ask what if there had not been an Audi dealership near Maria? Might she been just as satisfied or perhaps even more happy with her second choice? It all works out for the best because we are inventive storytellers and cognitive dissonance will have its way. Isn't this the lesson from choice blindness?

Still, most of marketing research continues to believe in true and enduring preferences that can be articulated by the reflective consumer even when confronted by overwhelming evidence that the human brain is simply not capable of such feats. We recognize patterns, even when they are not there, and we have extensive episodic memory for stuff that we have experienced. We remember faces and places, odors and tastes, and almost every tune we have ever heard, but we are not as proficient when it comes to pin numbers and passwords or dates or even simple calculations. Purchases are tasks that are solved not by looking inside for deep and enduring preferences. Instead, we exploit the situation or task structure and engage in fast thinking with whatever preferences are elicited by the specifics of the setting. Consequently, preferences are shallow and contextual.

As long as pre-existing preferences were in control, we were free to present as many alternatives and feature-levels as we wished. The top-down process would search for what it preferred and the rest would be ignored. However, as noted above, context does matter in human judgment and choice. Instead of deciding what you feel like eating (top down), you look at the menu and see what looks good to you (bottom up). Optimal experimental designs that systematically manipulate every possible attribute must be replaced by attempts to mimic the purchase context as closely as possible, not just the checkout but the entire consumer decision journey. Purchase remains the outcome of primary interest, but along the way attention becomes the dependent variable for "a wealth of information creates a poverty of attention" (Herbert A. Simon).

Future data collection will have us following consumers around real or replicated marketplaces and noting what information was accessed and what was done. Our statistical model will then be forced to deal with the sparsity resulting from consumers who concentrate their efforts on only a very few of the many touchpoints available to them. My earlier post on identifying the pathways in the consumer decision journey will provide some idea of what such an analysis might look like. In particular, I show how the R package nmf is able to uncover the underlying structure when the data matrix is sparse. More will follow in subsequent posts.

Wednesday, July 2, 2014

Using Biplots to Map Cluster Solutions

FactoMineR is a quick and easy R package for generating biplots, such as the following plot showing the columns as arrows with the rows to be added later as points. As you might recall from a previous post, a biplot maps a data matrix by plotting both the rows and columns in the same figure. Here the columns (variables) are arrows and the rows (individuals) will be points. By default, FactoMineR avoids cluttered maps by separating the variables and individuals factor maps into two plots. The variables factor map appears below, and the individuals factor map will be shown later in this post.
The dataset comes from David Wishart's book Whiskey Classified, Choosing Single Malts by Flavor. Some 86 whiskies from different regions of Scotland were rated on 12 aromas and flavors from "not present" (a rating of 0) to "pronounced" (a rating of 4). Luba Gloukhov ran a cluster analysis with this data and plotted the location where each whisky was distilled on a map of Scotland. The dataset can be retrieved as a csv file using the R function read.csv("clipboard'). All you need to do is go to the web site, select and copy the header and the data, and run the R function read.csv pointing to the clipboard. All the R code is presented at the end of this post.

Each arrow in the above plot represents one of the 12 ratings. FactoMineR takes the 86 x 12 matrix and performs a principal component analysis. The first principal component is labeled as Dim 1 and accounts for almost 27% of the total variation. Dim 2 is the second principal component with an additional 16% of the variation. One can read the component loadings for any rating by noting the perpendicular projection of the arrow head onto each dimension. Thus, Medicinal and Smoky have high loadings on the first principal component with Sweetness, Floral and Fruity anchoring the negative end. One could continue in the same manner with the second principal component, however, at some point we might notice the semi-circle that runs from Floral, Sweetness and Fruity through Nutty, Winey and Spicy to Smoky, Tobacco and Medicinal. That is, the features sweep out a one-dimensional arc, not unlike a multidimensional scaling of color perceptions (see Figure 1).
Now, we will add the 86 points representing the different whiskies. But first we will run a cluster analysis so that when we plot the whiskies, different colors will indicate cluster membership. I have included the R code to run both a finite mixture model using the R package mclust and a k-means. Both procedures yield four-cluster solutions that classify over 90% of the whiskies into the same clusters. Luba Gloukhov also extracted four clusters by looking for an "elbow" in the plot of the within-cluster sum-of-squares from two through nine clusters. By default, Mclust will test one through nine clusters and select the best model using the BIC as the selection criteria. The cluster profiles from mclust are presented below.

Black Red Green Blue Total
27 36 6 17 86
31% 42% 7% 20% 100%
Body 2.7 1.4 3.7 1.9 2.1
Sweetness 2.4 2.5 1.5 2.1 2.3
Smoky 1.5 1.0 3.7 1.9 1.5
Medicinal 0.0 0.2 3.3 1.0 0.5
Tobacco 0.0 0.0 0.7 0.3 0.1
Honey 1.9 1.1 0.2 1.0 1.3
Spicy 1.6 1.1 1.7 1.6 1.4
Winey 1.9 0.5 0.5 0.8 1.0
Nutty 1.9 1.3 1.2 1.4 1.5
Malty 2.1 1.7 1.3 1.7 1.8
Fruity 2.1 1.9 1.2 1.3 1.8
Floral 1.6 2.1 0.2 1.4 1.7

Finally, we are ready to look at the biplot with the rows represented as points and the color of each point indicating cluster membership, as shown below in what FactoMineR calls the individuals factor map. To begin, we can see clear separation by color suggesting that differences among the cluster reside in the first two dimensions of this biplot. It is important to remember that the cluster analysis does not use the principal component scores. There is no data reduction prior to the clustering.
The Green cluster contains only 6 whiskies and falls toward the right of the biplot. This is the same direction as the arrows for Medicinal, Tobacco and Smoky. Moreover, the Green cluster received the highest scores on these features. Although the arrow for Body does not point in that direction, you should be able to see that the perpendicular projection of the Green points will be higher than that for any other cluster. The arrow for Body is pointed upward because a second and larger cluster, the Black, also receives a relatively high rating. This is not the case for other three ratings. Green is the only cluster with high ratings on Smoky or Medicinal. Similarly, though none of the whiskies score high on Tobacco, the six Green whiskies do get the highest ratings.

You can test your ability to interpret biplots by asking on what features the Red cluster should score the highest. Look back up to the vector map, and identify the arrows pointing in the same direction as the Red cluster or pointing in a direction so that the Red points will project toward the high end of the arrow. Do you see at least Floral and Sweetness? The process continues in the same manner for the Black cluster, but the Blue cluster, like its points, fall in the middle without any distinguishing features.

Hopefully, you have not been troubled by my relaxed and anthropomorphic writing style. Vectors do not reposition themselves so that all the whiskies earning high scores will project themselves toward its high end, and points do not move around looking for that one location that best reproduces all their ratings. However, principal component analysis does use a singular value decomposition to factor data matrices into row and column components that reproduce the original data as closely as possible. Thus, there is some justification for such talk. Nevertheless, it helps with the interpretation to let these vectors and points come alive and have their own intentions.

What Did We Do and Why Did We Do It?

We began trying to understand a cluster analysis derived from a data matrix containing the ratings for 86 whiskies across 12 aroma and taste features. Although not a large data matrix, one still has some difficulty uncovering any underlying structure by looking one variable/column at a time. The biplot helps by creating a low-dimensional graphic display with ratings as vectors and whiskies as points. The ratings appeared to be arrayed along an arc from floral to medicinal, and the 86 whiskies were located as points in this same space.

Now, we are ready to project the cluster solution onto this biplot. By using separate ratings, the finite mixture model worked in the 12-dimensional rating space and not in the two-dimensional world of the biplot. Yet, we see relatively coherent clusters occupying different regions of the map. In fact, except for the Blue cluster falling in the middle, the clusters move along the arc from a Red floral to a Black malty/honey/nutty/winey to a Green medicinal. The relationships among the four clusters are revealed by their color coding on the biplot. They are no longer four qualitatively distinct entries, but a continuum of locally adjacent groupings arrayed along a nonlinear dimension from floral to medicinal.

R code needed to run all the analysis in this post.

# read data from external site
# after copied into the clipboard
data <- read.csv("clipboard")
# runs finite mixture model
# compares with k-means solution
kcl<-kmeans(ratings, 4, nstart=25)
table(fmm$classification, kcl$cluster)
# creates biplots
plot(pca, choix=c("ind"), label="none", col.ind=fmm$classification)

Created by Pretty R at inside-R.org

Saturday, June 21, 2014

Separating Statistical Models of "What Is Learned" from "How It Is Learned"

Something triggers our interest. Possibly it's an ad, a review or just word of mouth. We want to know more about the movie, the device, the software, or the service. Because we come with different preferences and needs, our searches vary in intensity. For some it is one and done, but others expend some effort and seek out many sources. My last post on the consumer decision journey laid out the argument for using nonnegative matrix factorization and the R package nmf to identify the different pathways taken in the search for product information.

Information Search ≠ Knowledge Acquired

It is easy to confuse the learning process with what is learned. The internet gives consumers control over their information search, and they are free to improvise as they wish. However, what is learned remains determined by the competition in the product category. What knowledge do we acquire as we search online or in person? Careful, there is no exam, therefore we are not required to be objective or thorough. Andy Clark reminds us that "...minds evolved to make things happen." So we learn what is available and what we want because we intend to make a purchase. We learn only what we need to know to make a choice.

The marketer and the consumer join forces to simplify the purchase process so that only a limited amount of information search and knowledge acquisition is needed to reach a satisfying decision. When the choice is hard, only a few buy. The simplification is a one-dimensional array of features and benefits running from the basic to the premium product, from the least to the most expensive. Every product category offers alternatives that are good, better, and best. Learning this is not difficult since everyone is ready and willing to help. The marketing department, the experts who review and recommend, and even other users will let you know what features differentiate the good from the better and the better from the best. One cannot search long for product information without learning what features are basic, what features are added to create the next quality level, and finally what features indicate a premium product.

In the end, we require one statistical model for analyzing how well the brand is doing and a different statistical model for investigating the pathways taken in the consumer decision journey. As we have already seen, R provides a method for uncovering the learning pathways with matrix factorization packages such as nmf. Brand performance or achievement (what is learned) can be modeled using latent-trait or item response theory (see the section "Thinking Like an Item Response Theorist"). I have provided more detail in previous posts showing how to analyze both checklists and rating scales.

Marketers have resource allocation decisions to make. They need to put their money where consumers are looking. When marketing is successful, the brand will do well. Since process and product are distinct systems governed by different rules, each demands its own statistical model and associated measurement procedures.

Friday, June 13, 2014

Identifying Pathways in the Consumer Decision Journey: Nonnegative Matrix Factorization

The Internet has freed us from the shackles of the yellow page directory, the trip to the nearby store to learn what is available, and the forced choice among a limited set of alternatives. The consumer is in control of their purchase journey and can take any path they wish. But do they? It's a lot of work for our machete-wielding consumer cutting their way through the product jungle. The consumer decision journey is not an itinerary, but neither is it aimless meandering. Perhaps we do not wish to follow the well-worn trail laid out by some marketing department. The consumer is free to improvise, not by going where no one has gone before, but by creating personal variation using a common set of journey components shared with others.

Even with all the different ways to learn about products and services, we find constraints on the purchase process with some touchpoint combinations more likely than others. For example, one could generate a long list of all the possible touchpoints that might trigger interest, provide information, make recommendations, and enable purchase. Yet, we would expect any individual consumer to encounter only a small proportion of this long list. A common journey might be no more than a seeing an ad followed by a trip to a store. For frequently purchased products, the entire discovery-learning-comparing-purchase process could collapse into a single point-of-sale (PoS) touchpoint, such as product packaging on a retail shelf.

The figure below comes from a touchpoint management article discussing the new challenges of online marketing. This example was selected because it illustrates how easy it is to generate touchpoints as we think of all the ways that a consumer interacts with or learns about a product. Moreover, we could have been much more detailed because episodic memory allows us to relive the product experience (e.g., the specific ads seen, the packaging information attended to, the pages of the website visited). The touchpoint list quickly gets lengthy, and the data matrix becomes sparser because an individual consumer is not likely to engage intensively with many products. The resulting checklist dataset is a high-dimensional consumer-by-touchpoint matrix with lots of columns and cells containing some ones but mostly zeroes.

It seems natural to subdivide the columns into separate modes of interaction as shown by the coloring in the above figure (e.g., POS, One-to-One, Indirect, and Mass). It seems natural because different consumers rely on different modes to learn and interact with product categories. Do you buy by going to the store and selecting the best available product, or do you search and order online without any physical contact with people or product? Like a Rubik's cube, we might be able to sort rows and columns simultaneously so that the reordered matrix would appear to be block diagonal with most of the ones within the blocks and most of the zeroes outside. You can find an illustration in a previous post on the reorderable data matrix. As we shall see later, nonnegative matrix factorization "reorders" indirectly by excluding negative entries in the data matrix and its factors. A more direct approach to reordering would use the R packages for biclustering or seriation. Both of these links offer different perspectives on how to cluster or order rows and columns simultaneously.

Nonnegative Matrix Factorization (NMF) with Simulated Data

I intend to rely on the R package NMF and a simulated data set based on the above figure. I will keep it simple and assume only two pathways: an online journey through the 10 touchpoints marked with an "@" in the above figure and an offline journey through the remaining 20 touchpoints. Clearly, consumers are more likely to encounter some touchpoints more often than others, so I have made some reasonable but arbitrary choices. The R code at the end of this post reveals the choices that were made and how I generated the data using the sim.rasch function from the psych R package. Actually, all you need to know is that the dataset contains 400 consumers, 200 interacting more frequently online and 200 with greater offline contact. I have sorted the 30 touchpoints from the above figure so that the first 10 are online (e.g., search engine, website, user forum) and the last 20 are offline (e.g., packaging information, ad in magazine, information display). Although the patterns within each set of online and offline touchpoints are similar, the result is two clearly different pathways as shown by the following plot.

It should be noted that the 400 x 30 data matrix contained mostly zeroes with only 11.2% of the 12,000 cells indicating any contact. Seven of the respondents did not indicate any interaction at all and were removed from the analysis. The mode was 3 touchpoints per consumer, and no one reported more than 11 interactions (although the verb "reported" might not be appropriate to describe simulated data).

If all I had was the 400 respondents, how would I identify the two pathways? Actually, k-means often does quite well, but not in this case with so many infrequent binary variables. Although using the earlier mentioned biclustering approach in R, Dolnicar and her colleagues will help us understand the problems encounters when conducting market segmentation with high-dimensional data. When asked to separate the 400 into two groups, k-means clustering was able to identify correctly only 55.5% of the respondents. Before we overgeneralize, let me note that k-means performed much better when the proportions were higher (e.g., raise both lines so that they peak above 0.5 instead of below 0.4), although that is not much help with high-dimensional scare data.

And, what about NMF? I will start with the results so that you will be motivated to remain for the explanation in the next section. Overall, NMF placed correctly 81.4% of the respondents, 85.9% of the offline segment and 76.9% of the online segment. In addition, NMF extracted two latent variables that separated the 30 touchpoints into the two sets of 10 online and 20 offline interactions.

So, what is nonnegative matrix factorization?

Have you run or interpreted a factor analysis? Factor analysis is matrix factorization where the correlation matrix R is factored into factor loadings: R = FF'. Structural equation modeling is another example of matrix factorization, where we add direct and indirect paths between the latent variables to the factor model connecting observed and latent variables. However, unlike the two previous models that factor the correlation or variance-covariance matrix among the observed variables, NMF attempts to decompose the actual data matrix.

Wikipedia uses the following diagram to show this decomposition or factorization:

The matrix V is our data matrix with 400 respondents by 30 touchpoints. A factorization simplifies V by reducing the number of columns from the 30 observed touchpoints to some smaller number of latent or hidden variables (e.g., two in our case since we have two pathways). We need to rotate the H matrix by 90 degrees so that it is easier to read, that is, 2x30 to 30x2. We do this by taking the transpose, which in R code is t(H).

Search engine
Price comparison
Hint from Expert
User forum
Banner or Pop-up
E-mail request
Packaging information
PoS promotion
Recommendation friends
Show window
Information at counter
Advertising entrance
Editorial newspaper
Consumer magazine
Ad in magazine
Personal advice
Information screen
Information display
Customer magazine
Catalog loyalty program
Offer loyalty card
Service hotline

As shown above, I have labeled the columns to reflect their largest coefficients in the same way that one would name a factor in terms of its largest loadings. To continue with the analogy to factor analysis, the touchpoints in V are observed, but the columns of W and the rows of H are latent and named using their relationship to the touchpoints. Can we call these latent variables "parts," as Seung and Lee did in their 1999 article "Learning the Parts of Objects by NMF"? The answer depends on how much overlap between the columns you are willing to accept. When each row of H contains only one large positive value and the remaining columns for that row are zero (e.g., Website in the third row), we can speak of latent parts in the sense that adding columns does not change the impact of previous columns but simply adds something new to the V approximation.

So in what sense is online or offline behavior a component or a part? There are 30 touchpoints. Why are there not 30 components? In this context, a component is a collection of touchpoints that vary together as a unit. We simulated the data using two different likelihood profiles. The argument called d in the sim.rasch function (see the R code at the end of this post) contains 30 values controlling the likelihood that the 30 touchpoints will be assigned a one. Smaller values of d result in higher probabilities that the touchpoint interaction will occur. The coefficients in each latent variable of H reflect those d values and constitute a component because the touchpoints vary together for 200 individuals. Put another way, the whole with 400 respondents contains two parts of 200 respondents each and each with its own response generation process.

The one remaining matrix, W, must be of size 400x2 (# respondents times # latent variables). So, we have 800 entries in W and 60 cells in H compared to the 12,000 observed values in V. W has one row for each respondent. Here are the rows of W for the 200th and 201st respondents, which is the dividing line between the two segments:
200 0.00015 0.00546
201 0.01218 0.00038
The numbers are small because we are factoring a data matrix of zeroes and ones. But the ratios of these two numbers are sizeable. The 200th respondent has an offline latent score (0.00546) more than 36 times its online latent score (0.00015), and the ratio for the 201st respondent is more than 32 in the other direction with online dominating. Finally, in order to visualize the entire W matrix for all respondents, the NMF package will produce heatmaps like the following with the R code basismap(fit, Rowv=NA).
As before, the first column represent online and the second points to offline. The first 200 rows are offline respondents or our original Segment 1 (labeled basis 2), and the last 200 or our original Segment 2 were generated using the online response pattern (labeled basis 1). This type of relabeling or renumbering occurs over and over again in cluster analysis, so we must learn to live with it. To avoid confusion, I will repeat myself and be explicit.

Basis 2 is our original Segment 1 (Offliners).
Basis 1 is our original Segment 2 (Onliners).

As mentioned earlier, Segment 1 offline respondents had a higher classification accuracy (85.9% vs. 76.9%). This is shown by the more solid and darker red lines for the first 200 offline respondents in the second basis 2 column.

Consumer Improvisation Might Be Somewhat More Complicated

Introducing only two segments with predominantly online or offline product interactions was a simplification necessary to guide the reader through an illustrative example. Obviously, the consumer has many more components that they can piece together on their journey. However, the building blocks are not individual touchpoints but set of touchpoints that are linked together and operate as a unit. For example, visiting a brand website creates opportunities for many different micro-journeys over many possible links on each page. Recurring website micro-journeys experienced by several consumers would be identified as a latent components in our NMF analysis. At least, this what I have found using NMF with touchpoint checklists from marketing research questionnaires.

R Code to Reproduce All the Analysis in this Post
offline<-sim.rasch(nvar=30, n=200, mu=-0.5, sd=0,
online<-sim.rasch(nvar=30, n=200,  mu=-0.5, sd=0,
names(tp)<-c("Search engine",
             "Price comparison",
             "Hint from Expert",
             "User forum",
             "Banner or Pop-up",
             "E-mail request",
             "Packaging information",
             "PoS promotion",
             "Recommendation friends",
             "Show window",
             "Information at counter",
             "Advertising entrance",
             "Editorial newspaper",
             "Consumer magazine",
             "Ad in magazine",
             "Personal advice",
             "Information screen",
             "Information display",
             "Customer magazine",
             "Catalog loyalty program",
             "Offer loyalty card",
             "Service hotline")
seg_profile<-t(aggregate(tp, by=list(segment), FUN=mean))
    max(seg_profile[2:30,])), type="n",
    xlab="Touchpoints (First 10 Online/Last 20 Offline)", 
    ylab="Proportion Experiencing Touchpoint")
lines(seg_profile[2:30,1], col="blue", lwd=2.5)
lines(seg_profile[2:30,2], col="red", lwd=2.5)
       c("Offline","Online"), lty=c(1,1),
       lwd=c(2.5,2.5), col=c("blue","red"))
tp_cluster<-kmeans(tp[rows>0,], 2, nstart=25)
fit<-nmf(tp[rows>0,], 2, "frobenius")

Created by Pretty R at inside-R.org