Tuesday, July 29, 2014

Variable Selection in Market Segmentation: Clustering or Biclustering?

Will you have that segmentation with one or two modes?

The data matrix for market segmentation comes to us with two modes, the rows are consumers and the columns are variables. Clustering uses all the columns to transform the two-mode data matrix (row and columns are different) into a one-mode distance matrix (rows and columns are the same) either directly as in hierarchical clustering or indirectly as in k-means. The burden falls on the analyst to be judicious in variable selection since too few will miss distinctions that ought to be made and too many will blur those distinctions that need to be made.

This point is worth repeating. Most data matrices have two modes with the rows and columns referring to different types. In contrast, correlation and distance matrices have only one mode with both the rows and columns referring to the same entities. Although a market segmentation places its data into a two-way matrix with the rows as consumers and the columns as the variables, the intent is to cluster the rows and ignore the columns once they have been used to define the distances between the rows. All the columns enter that distance calculation, which is why variable selection becomes so important in market segmentation.

Such an all-or-none variable selection may seem too restrictive given the diversity of products in many categories. We would not be surprised to discover that what differentiates one end of the product category is not what separates consumers at the other end. Examples are easy to find. Credit card usage can be divided into those who pay their bill in full each month and those who pay interest on outstanding balances. The two groups seek different credit card features, although there is some overlap. Business travelers are willing to pay for benefits that would not interest those on vacation, yet again one finds at least some commonality. Additional examples can be generated without difficulty for many product categories contain substantial heterogeneity in both their users and their offerings.

Biclustering offers a two-mode alternative that allows different clusters to be defined by different variables. The "bi" refers to the joint clustering of both rows and columns at the same time. All the variables remain available for selection by any cluster, but each cluster exercises its own option to incorporate or ignore. So why not refer to biclustering as two-mode clustering or simultaneous clustering or co-clustering? They do. Names proliferate when the same technique is rediscovered by diverse disciplines or when different models and algorithms are used. As you might expect, R offers many options (including biclust for distance-based biclustering). However, I will focus on NMF for factorization-based biclustering, a package that has proven useful with a number of my marketing research projects.

Revisiting K-means as Matrix Factorization

For many of us k-means clustering is closely associated with clouds of points in two-dimensional space. I ask that you forget that scatterplot of points and replace it with the following picture of matrix multiplication from Wikipedia:
What does this have to do with k-means? The unlabeled matrix in the lower right is the data matrix with rows as individuals and columns as variables (4 people with 3 measures). The green circle is the score for the third person on the third variable. It is calculated as a(1,1)*b(1,3)+a(3,2)*b(2,3). In a k-means A is the membership matrix with K columns as clusters indicators. Each row has a single entry with a one indicating its cluster membership and K-1 zeros for the other clusters. If Person #3 has been classified as cluster #1, then his or her third variable reduces to b(1,3) since a(1,1)=1 and a(3,2)=0. In fact, the entire row for this individual is simply a copy of the first row for B, which contains the centroid for the first cluster.

With two clusters k-means gives you one of two score profiles. You give me the cluster number by telling me which column of A is a one, and I will read out the scores you ought to get from the cluster profile in the correct row of B. The same process is repeated when there are more clusters, and the reproduced data matrix contains only K unique patterns. Cluster membership means losing your identity and adopting the cluster centroid as your data profile.

With a hard all-or-none cluster membership, everyone in the same cluster ought to have the same pattern of scores except for random variation. This would not be the case with soft cluster membership, that is, the rows of the membership matrix A would still sum to one but the entries would be probabilities of cluster membership varying from zero to one. Similarly, B does not need to be the cluster centroid. The rows of B could represent an archetype, an extreme or unusual pattern of scores. Archetypal analysis adapts such an approach and so does nonnegative matrix factorization (NMF), although the rows of B have different constraints. Both techniques are summarized in Section 14.6 of the online book Elements of Statistical Learning.

Given the title for this post, you might wish to know what any of this has to do with variable selection. The nonnegative in NMF restricts all the values of all three matrices to be either zero or a positive number: the data matrix contains counts or quantities, cluster membership is often transformed to look like a probability varying between 0 and 1, and the clusters are defined by always adding variables or excluding them entirely with a zero coefficient. As a result, we find in real applications that many of the coefficients in B are zero indicating that the associated variable has been excluded. The same is true for the A matrix suggesting a simultaneous co-clustering of the columns and rows forming high-density sub-matrices by rearranging these rows and columns in order to appear as homogeneous blocks. You can see this in the heatmaps from the NMF R package with lots of low-density regions and only a few high-density blocks.

Expanding the Definition of a Cluster

Biclustering shifts our attention away from the scatterplot and concentrates it directly on the data matrix, specifically, how it might be decomposed into components and then recomposed one building block at a time. Clusters are no longer patterns discovered in cloud of points plotted in some high-dimensional space and observed piecemeal 2 or 3 dimensions at a time. Nor are they sorts into groups or partitions into mixtures of different distributions. Clusters have become components that are combined together additively to reproduce the original data matrix.

In the above figure, the rows of B define the clusters in terms of the observed variables in the columns of B, and A shows how much each cluster contributes to each row of the data matrix. For example, a consumer tells us what information they seek when selecting a hotel. Biclustering sees that row of the data matrix as a linear combination of a small set of information search strategies. Consumers can hold partial membership in more than one cluster, and what we mean by belonging to a cluster is adopting a particular information search strategy. A purist relies on only one strategy so that their row in the cluster membership matrix will have one value close to one. Other consumers will adopt a mixture of strategies with membership spread across two or more row entities.

previous post provides more details, and there will be more to come.

Wednesday, July 23, 2014

Uncovering the Preferences Shaping Consumer Data: Matrix Factorization

How do you limit your search when looking for a hotel? Those trying to save money begin with price. Members of hotel reward programs focus on their brand. At other times, location is first to narrow our consideration set. What does hotel search reveal about hotel preference?

What do consumer really want in a hotel? I could simply provide a list of features and ask you to rate the importance of each. Or, I could force a trade-off by repeatedly giving you a small set of features and having you tell me which was the most and least important in each feature set. But self-report has its flaws requiring that consumers know what they want and that they are willing and able to articulate those desires. Besides, hotels offer lots of features, often very specific features that can have a major impact on choice (e.g., hours when the pool or restaurant are open, parking availability and cost, check-out times, pet policy, and many more). Is there a less demanding route to learning consumer preferences?

Who won the World Series last year, or the Academy Award for best director, or the Nobel Prize for Economics? You would know the answer if you were a baseball fan, a movie buff, or an econometrician. What you know reflects your preferences. Moreover, familiarity with budget hotels is informative and suggests some degree of price sensitivity. One's behavior on a hotel search engine would also tell us a great deal about preference. With a little thought and ingenuity, we could identify many more sources of consumer data that would be preference-revealing had we the analytic means to uncover the preferences shaping such data matrices.

All these data matrices have a common format. Consumers are the rows, and the columns could be either features or brands. If we asked about hotel familiarity or knowledge, the columns would be a long list of possible hotels and the cells would contain the familiarity score with most of those values equal to zero indicating no awareness or familiarity at all. Substituting a different measure in the cells would not change the format or the analysis. For example, the cell entries could be some measure of depth of search for each hotel (e.g., number of inquiries or amount of time). Again, most of the entries for any one consumer would be zero.

In both cases, the measurements are outcomes of the purchase process and are not constructed in response to being asked a question. That is, the hotel search process is observed, unobtrusively, and familiarity is a straightforward recall question with minimal inference required from the consumer. Familiarity is measured as a sequence of achievements: one does not recognize the name of the hotel, one has some sense of familiarity but no knowledge, one has heard something about the hotel, or one has stayed there themselves. Preference has already shaped these measures. That which is preferred becomes familiar over time through awareness, consideration and usage.

Consumer Preference as Value Proposition and Not Separate Utilities

Can I simply tell you what I am trying to accomplish? I want to perform a matrix factorization that takes as input the type of data matrix that we have been discussing with consumers as the rows and brands or features as the columns. My goal is to factor or decompose that data matrix into two parts. The first part will bring together the separate brands or features into a value proposition, and the second part will tells us the appeal of each value proposition for every respondent.

Purchase strategies are not scalable. Choice modeling might work for a few alternatives and a small number of features, but it will not help us find the hotel we want. What we want can be described by the customer value proposition and recovered by matrix factorization of any data matrix shaped by consumer preferences. If it helps, one can think of the value proposition as the ideal product or service and the purchase process as attempting to get as close to that ideal as possible. Of course, choice is contextual for the hotel that one seeks for a business meeting or conference is not the hotel that one would select for a romantic weekend getaway. We make a serious mistake when we ignore context for the consumer books hotel rooms only when some purpose is served.

In a previous post I showed how nonnegative matrix factorization (NMF) can identify pathways in the consumer decision journey. Hotel search is merely another application, although this time the columns will be features and not information sources. NMF handles the sparse data matrix resulting from hotel search engines that provide so much information on so many different hotels and consumers who have the time and interest to view only a small fraction of all that is available. Moreover, the R package NMF brings the analysis and the interpretation within reach of any researcher comfortable with factor loadings and factor scores. You can find the details in the previous post from the above link, or you can go to another example in a second post.

Much of what you have learned running factor analyses can be applied to NMF. Instead of factor loadings, NMF uses a coefficient matrix to link the observed features or brands in the columns to the latent components. This coefficient matrix is interpreted in much the same way as one interprets factor loadings. However, the latent variables are not dimensions. I have called them latent components; others refer to them as latent features. We do not seem to possess the right terminology because we see products and services as feature bundles with preference residing in the feature levels and the overall utility as simply the sum of its feature-level utilities. Utility theory and conjoint analysis assume that we live in the high-dimensional parameter space defined by the degrees of freedom associated with feature levels (e.g., 167 dimensions in the Courtyard by Marriott conjoint analysis).

Matrix factorization takes a somewhat different approach. It begins with the benefits that consumers seek. These benefits define the dimensionality or rank of the data matrix, which is much smaller than the number of columns. The features acquire their value as indicators of the underlying benefit. Only in very restricted settings is the total equal to the sum of its part. As mentioned earlier in this post, choice modeling is not scalable. With more than a few alternatives or a handful of features, humans turn to simplification strategies to handle the information overload. The appeal or beauty of a product design cannot be reduced to its elements. The persuasiveness of a message emerges from its form and not its separate claims. It's "getting a deal" that motivates the price sensitive and not the price itself, which is why behavioral economics is so successful at predicting biases. Finally, Choice architecture works because the whole comes first and the parts are seen only within the context of the initial framing.

Our example of the hotel product category is organized by type and storyline within each type. As an illustration of what I mean by storyline, there are luxury hotels (hotel type) that do not provide luxury experiences (e.g., rude staff, unclean rooms, or uncomfortable beds). We would quickly understand any user comment describing such a hotel since we rely on such stories to organize our experiences and make sense out of massive amounts of information. Story is the appropriate metaphor because each value proposition is a tale of benefits to be delivered. The search for a hotel is the quest for the appealing story delivering your preferred value proposition. These are the latent components of the NMF uncovered because there exists a group of consumers seeking just these features or hotels. That is, a consumer segment that only visits the links for budget hotels or filters their search by low price will generate a budget latent component with large coefficients for only these columns.

This intuitive understanding is essential for interpreting the results of a NMF. We are trying to reproduce the data matrix one section at a time. If you picture Rubik's cube and think about sorting rows and columns until all the consumers whose main concern is money and all the budget hotels or money-saving features have been moved toward the first positions, you should end up with something that looks like this biclustering diagram:
Continuing with the other rows and columns, we would uncover only blocks in the main diagonal if everyone was seeking only one value proposition. But we tend to see both "pure" segments focusing on only one value proposition and "mixed" segments wanting a lot of this one plus some of that one too (e.g., low price with breakfast included).

So far, we have reviewed the coefficient matrix containing the latent component or pure value propositions, which we interpreted based on their association with the observed columns. All we need now is a consumer matrix showing the appeal of each latent component. That is, a consumer who wants more than offered by any one pure value proposition will have a row in the data matrix that cannot be reproduced by any one latent component. For example, a pure budget guest spends a lot of time comparing prices, while the budget-plus-value seeker spends half of their time on price and the other half on getting some extra perks in the package. If we had only two latent components, then the budget shoppers would have weights of 1 and 0, while the other would have something closer to 0.5 and 0.5.

The NMF R package provides the function basismap to generate heatmaps, such as the one below, showing mixture proportions for each row or consumer.
You can test your understanding of the heatmap by verifying that the bottom three rows identified as #7, #2 and #4 are pure third latent component and the next two rows (#17 and #13) require only the first latent component to reproduce their data. Mixtures can be found on the first few rows.

Mining Consumer Data Matrices for Preference Insights

We can learn a lot about consumer preferences by looking more carefully at what they do and what they know. The consumer is not a scientist studying what motivates or drives their purchase behavior. We can ask for the reasons why, and they will respond. However, that response may be little more than a fabrication constructed on the fly to answer your question. Tradeoffs among abstract words with no referents tell us little about how a person will react in a specific situation. Yet, how much can we learn from a single person in one particular purchase context?

Collaborative filtering exploits the underlying structure in a data matrix so that individual behavior is interpreted through the latent components extracted from others. Marketing is social and everything is shared. Consumers share common value propositions learned by telling and retelling happy and upsetting consumption stories in person and in text. Others join the conversation by publishing reviews or writing articles. Of course, the marketing department tries to control it all by spending lots of money. The result is clear and shared constraints on our data matrix. There are a limited number of ways of relating to products and services. Individual consumers are but variations on those common themes.

NMF is one approach for decomposing the data matrix into meaningful components. R provides the interface to that powerful algorithm. The bottleneck is not the statistical model or the R code but our understanding of how preference guides consumer behavior. We mistakenly believe that individual features have value because the final choice is often between two alternatives that differ on only a few features. It is the same error that we make with the last-click attribution model. The real work has been done earlier in the decision process, and this is where we need to concentrate our data mining. Individual features derive their value from their ability to deliver benefits. These are our value propositions uncovered by factoring our data matrices into preference generating components.

Tuesday, July 15, 2014

Taking Inventory: Analyzing Data When Most Answer No, Never, or None

Consumer inventories, as the name implies, are tallies of things that consumers buy, use or do. Product inventories, for example, present consumers with rather long lists of all the offerings in a category and ask which or how many or how often they buy each one. Inventories, of course, are not limited to product listings. A tourist survey might inquire about all the different activities that one might have enjoyed on their last trip (see Dolnicar et al. for an example using the R package biclust). Customer satisfaction studies catalog all the possible problems that one could experience with their car, their airline, their bank, their kitchen appliances and a growing assortment of product categories. User experience research gathers frequency data for all product features and services. Music recommender systems seek to know what you have listened to and how often. Google Analytics keeps track of every click. Physicians inventory medical symptoms.

For most inventories the list is long, and the resulting data are sparse. The attempt to be comprehensive and exhaustive produces lists with many more items than any one consumer could possibly experience. Now, we must analyze a data matrix where no, never, or none is the dominant response. These data matrices can contain counts of the number of times in some time period (e.g., purchases), frequencies of occurrences (e.g., daily, weekly, monthly), or assessments of severity and intensity (e.g., a medical symptoms inventory). The entries are all nonnegative values. Presence and absence are coded zero and one, but counts, frequencies and intensities include other positive values to measure magnitude.

An actual case study would help, however, my example of a feature usage inventory relies on proprietary data that must remain confidential. This would be a severe limitation except that almost every customer inventory analysis will yield similar results under comparable conditions. Specifically, feature usage is not random or haphazard, but organized by wants and needs and structured by situation and task. There are latent components underlying all product and service usage. We use what we want and need, and our wants and needs flow from who we are and the limitations imposed by our circumstances.

In this study a sizable sample of customers were asked how often they used a list of 72 different features. Never was the most frequent response, although several features were used daily or several times a week. As you might expect, some features were used together to accomplish the same tasks, and tasks tended to be grouped into organized patterns for users with similar needs. That is, one would not be surprised to discover a smaller number of latent components controlling the observed frequencies of feature usage.

The R package NMF (nonnegative matrix factorization) searches for this underlying latent structure and displays it in a coefficient heatmap using the function coefmap(object), where object is the name of list return by the nmf function. If you are looking for detailed R code for running nmf, you can find it in two previous posts demonstrating how to identify pathways in the consumer purchase journey and how to uncover the structure underlying partial rankings of only the most important features (top of the heap).

The following plot contains 72 columns, one for each feature. The number of rows are supplied to the function by setting the rank. Here the rank was set to ten. In the same way as one decides on the best number of factors in factor analysis or the best number of clusters in cluster analysis, one can repeat the nmf with different ranks. Ten works as an illustration for our purposes. We start by naming those latent components in the rows. Rows 3 and 8 have many reddish rectangles side-by-side suggesting that several features are accessed together as a unit (e.g., all the features needed to take, view, and share pictures with your smartphone). Rows 1, 2, 4 and 5, on the other hand, have one defining feature with some possible support features (e.g., 4G cellular connectivity for your tablet).
The dendrogram at the top summarizes the clustering of features. The right hand side indicates the presence of two large clusters spanning most of the features. Both rows 3 and 8 pull together a sizable number of features. However, these blocks are not of uniform color hinting that some features may not be used as frequently as others of the same type. Rows 6, 7, 9 and 10 have a more uniform color, although the rectangles are smaller consisting of combinations of only 2, 3 or 4 features. The remaining rows seem to be defined by a single feature each. It is in the manner that one talks about NMF as a feature clustering technique.

You can see that NMF has been utilized as a rank-reduction technique. Those 4 blocks of features in rows 6, 7, 9 and 10 appear to function as units, that is, if one feature in the block is used, then all the features in the block are used, although to different degrees as shown by the varying colors of the adjacent rectangles. It is not uncommon to see a gate-keeping feature with a very high coefficient anchoring the component with support features that are used less frequently in the task. Moreover, features with mixture coefficients across different components imply that the same feature may serve different functions. For example, you can see in row 8 a grouping of features near the middle of the row with mixing coefficients in the 0.3 to 0.6 range for both rows 3 and 8. We can see the same pattern for a rectangle of features a little more left mixing rows 3 and 6. At least some of the features serve more than one purpose.

I would like to offer a little more detail so that you can begin to develop an intuitive understanding of what is meant by matrix factorization with nonnegativity constraints. There are no negative coefficients in H, so that nothing can be undone. Consequently, the components can be thought of as building blocks for each contain the minimal feature pattern that act together as a unit. Suppose that a segment only used their smartphones to make and receive calls so that their feature usage matrix had zeroes everywhere except for everyday use of the calling features. Would we not want a component to represent this usage pattern? And what if they also used their phone as a camera but only sometimes? Since there is probably not a camera-only segment, we would not expect to see camera-related features as a standalone component. We might find, instead, a single component with larger coefficients in H for calling features and smaller coefficients in the same row of H for the camera features.

Recalling What We Are Trying to Do

It always seems to help to recall that we are trying to factor our data matrix. We start with an inventory containing the usage frequency for some 72 features (columns) for all the individual users (rows). Can we still reproduce our data matrix using fewer columns? That is, can we find fewer than 72 component scores for individual respondents that will still reproduce approximately the scores for all 72 features? Knowing only the component scores for each individual in our matrix W, we will need a coefficient matrix H that takes the component scores and calculates feature scores. Then our data matrix V is approximated by W x H (see Wikipedia for a review).

We have seen H (feature coefficients), now let's look at W (latent component scores). Once again, NMF displays usage patterns for all the respondents with a heatmap. The columns are our components, which were defined earlier in terms of the features. Now, what about individual users? The components or columns constitute building blocks. Each user can decide to use only one of the components or some combination of several components. For example, one could choose to use only the calling features or seldom make calls and text almost everything or some mixture of these two components. This property is often referred to in the NMF literature as additivity (e.g., learning the parts of objects).

So, how should one interpret the above heatmap? Do we have 10 segments, one for each component? Such a segmentation could be achieved by simply classifying each respondent as belonging to the component with the highest score. We start with fuzzy membership and force it to be all or none. For example, the first block of users at the top of column 7 can be classified as Component #7 users, where Component #7 has been named based on the features in H with the largest coefficients. As an alternative, the clustered heatmap takes the additional step of running a hierarchical cluster analysis using distances based on all 10 components. By treating the 10 components as mixing coefficients, one could select any clustering procedure to form the segments. A food consumption study referenced in an earlier post reports on a k-means in the NMF-derived latent space.

Regardless of what you do next, the heatmap provides the overall picture and thus is a good place to start. Heatmaps can produce checkerboard patterns when different user groups are defined by their usage of completely different sets of features (e.g., a mall with distinct specialty stores attracting customers with diverse backgrounds). However, this is not what we see in this heatmap. Instead, Component #7 acts almost as continuous usage intensity factor: the more ways you use your smartphone, the more you use your smartphone (e.g., business and personal usage). The most frequent flyers fly for both business and pleasure. Cars with the most mileage both commute and go on vacation. Continuing with examples will only distract from the point that NMF has enabled us to uncover structure from a large and largely sparse data matrix. Whether heterogeneity takes a continuous or discrete form, we must be able to describe it before we can respond to it.



Thursday, July 10, 2014

How Much Can We Learn from Top Rankings using Nonnegative Matrix Factorization?

Purchases are choices from available alternatives. Post-purchase, we know what is the most preferred, but all the other options score the same. Regardless of differences in appeal, all the remaining items received the same score of not chosen. A second choice tells us more, as would the alternative selected as third most preferred. As we add top rankings from first to second to the kth choice, we seem to gain more and more information about preferences. Yet, what if we concentrated only on the top performers, what might be called the "pick of the litter" or the "top of the heap" (e.g., top k from J alternatives)? How much can be learn from such partial rankings?

Jan de Leeuw shows us what can be done with a complete ranking. What if we were to take de Leeuw breakfast food dataset and keep only the top-3 rankings so that all we know is what each respondent selected as their first, second and third choices? Everything that you would need to know is contained in the Journal of Statistical Software article by de Leeuw and Mair (see section 6.2). The data come in a matrix with 42 individuals and 15 breakfast foods. I have reproduce his plot below to make the discussion easier to follow. Please note that all the R code can be found at the end of this post.


The numbers running from 1 to 42 represent the location of each individual ranking the 15 different breakfast foods. That is, rows are individuals, columns are foods, and the cells are rankings from 1 to 15 for each row. What would you like for breakfast? Here are 15 breakfast foods, please order them in terms of your preference with "1" being your most preferred food and "15" indicating your least preferred.

The unfolding model locates each respondent's ideal and measures preference as distance from that ideal point. Thus, both rows (individuals) and columns (foods) are points that are positioned in the same space such that the distances between any given row number and the columns have the same ordering as the original data for that row. As a result, you can reproduce (approximately) an individual's preference ordering from the position of their ideal point relative to the foods. Who likes muffins? If you answered, #23 or #34 or #33 or anyone else nearby, then you understand the unfolding map.

Now, suppose that only the top-3 rankings were provided by each respondent. We will keep the rankings for first, second and third and recode everything else to zero. Now, what values should be assigned to the first, second and third picks? Although ranks are not counts, it is customary to simply reverse the ranks so that the weight for first is 3, second is 2, and third is 1. As a result, the rows are no longer unique values of 1 to 15, but instead contain one each of 1, 2 and 3 plus 12 zeroes. We have wiped out 80% of our data. Although there are other approaches for working with partial rankings, I will turn to nonnegative matrix factorization because I want a technique that works well with sparsity, for example, top 3 of 50 foods or top 5 of 100 foods. Specifically, we are seeking a general approach for dealing with any partial ranking that generates sparse data matrices. Nonnegative matrix factorization seems to be up for the task, as demonstrated in a large food consumption study.

We are now ready for the nmf R package as soon as we specify the number of latent variables. I will try to keep it simple. The data matrix is 42 x 15 with each row having 12 zeroes and three entries that are 1, 2 and 3 with 3 as the best (ranking reversed). Everything would be simpler if the observed breakfast food rankings resulted from a few latent consumption types (e.g., sweet-lovers tempted by pastries, the donuts-for-breakfast crowd, the muffin-eaters and the bread-slicers). Then, observed rankings could be accounted for by some combination of these latent types. "Pure Breads" select only toast or hard roll. "Pure Muffins" pick only the three varieties of muffin, though corn muffin may not be considered a real muffin by everyone. Coffee cakes may be its own latent type, and I have idea how nmf will deal with cinnamon toast (remember that the data is at least 40 years old). From these musings one might reasonably try three or four latent variables.

The nonnegative matrix factorization (nmf) was run with four latent variables. The first function argument is the data matrix, followed by the rank or number of latent variables, with the method next, and a number indicating the number of times you want the analysis rerun with different starting points. This last nrun argument works in the same manner as the nstart argument in kmeans. Local minimum can be a problem, so why not restart the nmf function with several different initialization and pick the best solution? The number 10 seemed to work with this data matrix, by which I mean that I obtained similar results each time I reran the function with nrun=10. You will note that I did not set the seed, so that you can try it yourself and see if you get a similar solution.

The coefficient matrix is shown below. The entries have been rescaled to fall along a scale from 0 to 100 for no other reason than it is relative value that is important and marketing research often uses such a thermometer scale. Because I will be interpreting these coefficients as if they were factor loadings, I borrowed the fa.sort() function from the psych R package. Hopefully, this sorting make it easier to see the underlying pattern.

Obviously, these coefficients are not factor loadings, which are correlations between the observed and latent variables. You might want to think of them as if they were coefficients from a principal component analysis. What are these coefficients? You might wish to recall that we are factoring our data matrix into two parts: this coefficient matrix and what is called a basis matrix. The coefficient matrix enables us to name the latent variables by seeing the association between the observed and latent variables. The basis matrix includes a row for every respondent indicating the contribution of each latent variable to their top 3 rankings. I promise that all this will become clearer as we work through this example.

Coffee Cake Muffin Pastry Bread
cofcake 70 1 0 0
cornmuff 2 0 0 2
engmuff 0 38 0 4
bluemuff 2 36 5 0
cintoast 0 7 0 3
danpastry 1 0 100 0
jdonut 0 0 25 0
gdonut 8 0 20 0
cinbun 0 6 20 0
toastmarm 0 0 12 10
toast 0 0 2 0
butoast 0 3 0 51
hrolls 0 0 2 22
toastmarg 0 1 0 14
butoastj 2 0 7 10

These coefficients indicate the relative contribution of each food. The columns are named as one would name a factor or a principal component or any other latent variable. That is, we know what a danish is and a glazed or jelly donut, but we know nothing about the third column except that interest in these three breakfast foods seem to covary together. Pastry seemed like a good, although not especially creative, name. These column names seem to correspond to the different regions in the joint configuration plot derived from the complete rankings. In fact, I borrowed de Leeuw's cluster names from the top of his page 20.

And what about the 42 rows in the basis matrix? The nmf package relies on a heatmap to display the relationship between the individuals and the latent variables.

Interpretation is made easier by the clustering of the respondents along the left side of the heatmap. We are looking for blocks of solid color in each column, for example, the last 11 rows or the 4 rows just above the last 11 rows. The largest block falls toward the middle of the third column associated with pastries, and the first several rows tend to have their largest values in the first column. although most have membership in more than one column. The legend tells us that lighter yellows indicate the lowest association with the column and the darkest reds or browns identify the strongest connection. The dendrogram divides the 42 individuals into the same groupings if you cut the tree at 4 clusters.

The dendrogram also illustrates that some of the rows are combinations of more than one type. The whole, meaning the 42 individuals, can be separated into four "pure" types. A pure type is an individual whose basis vector contains one value very near one and the remaining values very near zero. Everyone is a combination of the pure types or latent variables. Some are all pure types, and some are mixtures of different types. The last 4 rows are a good example of a mixture of muffins and breads (columns 4 and 2).

Finally, I have not compared the location of the respondents on the configuration plot with their color spectrum in the heatmap. There is a correspondence, for example, #37 is near the breads on the plot and in the bread column on the heatmap. And we could continue with #19 into pastries and #33 eating muffins, but we will not since one does not expect complete agreement when the heatmap has collapsed the lower 80% of the rankings. We have our answer to the initial question raised in the title. We can learn a great deal about attraction using only the top rankings. However, we have lost any avoidance information contained in the complete rankings.

So, What Is Nonnegative Matrix Factorization?

I answered this question at the end of a previous post, and it might be helpful for you to review another example. I show in some detail the equation and how the coefficient matrix and the basis matrix combine to yield approximations of the observed data.

What do you want for breakfast? Is it something light and quick, or are you hungry and want something filling? We communicate in food types. A hotel might advertise that their price includes a continental breakfast. Continental breakfast is a food type. Bacon and eggs are not included. This is the structure shaping human behavior that nonnegative matrix factorization attempts to uncover. There were enough respondents who wanted only the foods from each of the four columns that we were able to extract four breakfast food types. These latent variables are additive so that a respondent can select according to their own individual proportions how much they want the foods from each column.

Nonnegative matrix factorization will succeed to the extent that preferences are organized as additive groupings of observed choices. I would argue that a good deal of consumption is structured by goals and that these latent variables reflect goal-derived categories. We observe the selections made by individuals and infer their motivation. Those inferences are the columns of our coefficient matrix, and the rows of the heatmap tell us how much each respondent relies on those inferred latent constructs when making their selections.


R code needed to recreate all the tables and plots:

library(smacof)
data(breakfast)
breakfast
res <- smacofRect(breakfast)
plot(res, plot.type = "confplot")
 
partial_rank<-4-breakfast
partial_rank[partial_rank<1]<-0
apply(breakfast, 2, table)
apply(partial_rank, 2, table)
partial_rank
 
library(NMF)
fit<-nmf(partial_rank, 4, "lee", nrun=10)
h<-coef(fit)
library(psych)
fa.sort(t(round(h,3)))
w<-basis(fit)
wp<-w/apply(w,1,sum)
fa.sort(round(wp,3))
basismap(fit)

Created by Pretty R at inside-R.org

Tuesday, July 8, 2014

Are Consumer Preferences Deep or Shallow?

John Hauser, because no one questions his expertise, is an excellent spokesperson for the viewpoint that consumer preferences are real, as presented in his article "Self-Reflection and Articulated Consumer Preferences." Simply stated, preferences are enduring when formed over time and after careful consideration of actual products. As a consequence, accurate measurement requires us to encourage self-reflection within realistic contexts. "Here true preferences mean the preferences consumers use to make decisions after a serious evaluation of the products that are available on the market."

However, serious evaluation takes some time and effort, in fact, a series of separate online tasks including revealed preference plus self-reports of both attribute-level preferences and decision-making strategies. We end up with a lot of data from each respondent enabling the estimation of a number of statistical models (e.g., a hierarchical Bayes choice-based conjoint that could be fit using the bayesm R package). All this data is deemed necessary in order for individuals to learn their "true" preferences. Underlying Hauser's approach is a sense of inevitably that a decision maker will arrive at the same resolution regardless of their path as long as they begin with self-reflection.

A more constructivist alternative can be found in my post on "The Mind is Flat!" where it is argued that we lack the cognitive machinery to generate, store and retrieve the extensive array of enduring preferences demanded by utility theory. Although we might remember our previous choice and simply repeat it as a heuristic simplification strategy, working our way through the choice processes anew will likely result in a different set of preferences. Borrowing a phrase from Stephen Jay Gould, replaying the "purchase process tape" will not yield the same outcome unless there are substantial situational constraints forcing the same resolution.
Do preferences control information search, or are preferences evoked by the context? Why would we not expect decision making to be adaptive and responsive to the situation? Enduring preferences may be too rigid for our constantly changing marketplaces. Serendipity has its advantages. After the fact, it is easy to believe that whatever happened had to be. Consider the case study from Hauser's article, and ask what if there had not been an Audi dealership near Maria? Might she been just as satisfied or perhaps even more happy with her second choice? It all works out for the best because we are inventive storytellers and cognitive dissonance will have its way. Isn't this the lesson from choice blindness?

Still, most of marketing research continues to believe in true and enduring preferences that can be articulated by the reflective consumer even when confronted by overwhelming evidence that the human brain is simply not capable of such feats. We recognize patterns, even when they are not there, and we have extensive episodic memory for stuff that we have experienced. We remember faces and places, odors and tastes, and almost every tune we have ever heard, but we are not as proficient when it comes to pin numbers and passwords or dates or even simple calculations. Purchases are tasks that are solved not by looking inside for deep and enduring preferences. Instead, we exploit the situation or task structure and engage in fast thinking with whatever preferences are elicited by the specifics of the setting. Consequently, preferences are shallow and contextual.

As long as pre-existing preferences were in control, we were free to present as many alternatives and feature-levels as we wished. The top-down process would search for what it preferred and the rest would be ignored. However, as noted above, context does matter in human judgment and choice. Instead of deciding what you feel like eating (top down), you look at the menu and see what looks good to you (bottom up). Optimal experimental designs that systematically manipulate every possible attribute must be replaced by attempts to mimic the purchase context as closely as possible, not just the checkout but the entire consumer decision journey. Purchase remains the outcome of primary interest, but along the way attention becomes the dependent variable for "a wealth of information creates a poverty of attention" (Herbert A. Simon).

Future data collection will have us following consumers around real or replicated marketplaces and noting what information was accessed and what was done. Our statistical model will then be forced to deal with the sparsity resulting from consumers who concentrate their efforts on only a very few of the many touchpoints available to them. My earlier post on identifying the pathways in the consumer decision journey will provide some idea of what such an analysis might look like. In particular, I show how the R package nmf is able to uncover the underlying structure when the data matrix is sparse. More will follow in subsequent posts.



Wednesday, July 2, 2014

Using Biplots to Map Cluster Solutions

FactoMineR is a quick and easy R package for generating biplots, such as the following plot showing the columns as arrows with the rows to be added later as points. As you might recall from a previous post, a biplot maps a data matrix by plotting both the rows and columns in the same figure. Here the columns (variables) are arrows and the rows (individuals) will be points. By default, FactoMineR avoids cluttered maps by separating the variables and individuals factor maps into two plots. The variables factor map appears below, and the individuals factor map will be shown later in this post.
The dataset comes from David Wishart's book Whiskey Classified, Choosing Single Malts by Flavor. Some 86 whiskies from different regions of Scotland were rated on 12 aromas and flavors from "not present" (a rating of 0) to "pronounced" (a rating of 4). Luba Gloukhov ran a cluster analysis with this data and plotted the location where each whisky was distilled on a map of Scotland. The dataset can be retrieved as a csv file using the R function read.csv("clipboard'). All you need to do is go to the web site, select and copy the header and the data, and run the R function read.csv pointing to the clipboard. All the R code is presented at the end of this post.

Each arrow in the above plot represents one of the 12 ratings. FactoMineR takes the 86 x 12 matrix and performs a principal component analysis. The first principal component is labeled as Dim 1 and accounts for almost 27% of the total variation. Dim 2 is the second principal component with an additional 16% of the variation. One can read the component loadings for any rating by noting the perpendicular projection of the arrow head onto each dimension. Thus, Medicinal and Smoky have high loadings on the first principal component with Sweetness, Floral and Fruity anchoring the negative end. One could continue in the same manner with the second principal component, however, at some point we might notice the semi-circle that runs from Floral, Sweetness and Fruity through Nutty, Winey and Spicy to Smoky, Tobacco and Medicinal. That is, the features sweep out a one-dimensional arc, not unlike a multidimensional scaling of color perceptions (see Figure 1).
Now, we will add the 86 points representing the different whiskies. But first we will run a cluster analysis so that when we plot the whiskies, different colors will indicate cluster membership. I have included the R code to run both a finite mixture model using the R package mclust and a k-means. Both procedures yield four-cluster solutions that classify over 90% of the whiskies into the same clusters. Luba Gloukhov also extracted four clusters by looking for an "elbow" in the plot of the within-cluster sum-of-squares from two through nine clusters. By default, Mclust will test one through nine clusters and select the best model using the BIC as the selection criteria. The cluster profiles from mclust are presented below.

Black Red Green Blue Total
27 36 6 17 86
31% 42% 7% 20% 100%
Body 2.7 1.4 3.7 1.9 2.1
Sweetness 2.4 2.5 1.5 2.1 2.3
Smoky 1.5 1.0 3.7 1.9 1.5
Medicinal 0.0 0.2 3.3 1.0 0.5
Tobacco 0.0 0.0 0.7 0.3 0.1
Honey 1.9 1.1 0.2 1.0 1.3
Spicy 1.6 1.1 1.7 1.6 1.4
Winey 1.9 0.5 0.5 0.8 1.0
Nutty 1.9 1.3 1.2 1.4 1.5
Malty 2.1 1.7 1.3 1.7 1.8
Fruity 2.1 1.9 1.2 1.3 1.8
Floral 1.6 2.1 0.2 1.4 1.7

Finally, we are ready to look at the biplot with the rows represented as points and the color of each point indicating cluster membership, as shown below in what FactoMineR calls the individuals factor map. To begin, we can see clear separation by color suggesting that differences among the cluster reside in the first two dimensions of this biplot. It is important to remember that the cluster analysis does not use the principal component scores. There is no data reduction prior to the clustering.
The Green cluster contains only 6 whiskies and falls toward the right of the biplot. This is the same direction as the arrows for Medicinal, Tobacco and Smoky. Moreover, the Green cluster received the highest scores on these features. Although the arrow for Body does not point in that direction, you should be able to see that the perpendicular projection of the Green points will be higher than that for any other cluster. The arrow for Body is pointed upward because a second and larger cluster, the Black, also receives a relatively high rating. This is not the case for other three ratings. Green is the only cluster with high ratings on Smoky or Medicinal. Similarly, though none of the whiskies score high on Tobacco, the six Green whiskies do get the highest ratings.

You can test your ability to interpret biplots by asking on what features the Red cluster should score the highest. Look back up to the vector map, and identify the arrows pointing in the same direction as the Red cluster or pointing in a direction so that the Red points will project toward the high end of the arrow. Do you see at least Floral and Sweetness? The process continues in the same manner for the Black cluster, but the Blue cluster, like its points, fall in the middle without any distinguishing features.

Hopefully, you have not been troubled by my relaxed and anthropomorphic writing style. Vectors do not reposition themselves so that all the whiskies earning high scores will project themselves toward its high end, and points do not move around looking for that one location that best reproduces all their ratings. However, principal component analysis does use a singular value decomposition to factor data matrices into row and column components that reproduce the original data as closely as possible. Thus, there is some justification for such talk. Nevertheless, it helps with the interpretation to let these vectors and points come alive and have their own intentions.

What Did We Do and Why Did We Do It?

We began trying to understand a cluster analysis derived from a data matrix containing the ratings for 86 whiskies across 12 aroma and taste features. Although not a large data matrix, one still has some difficulty uncovering any underlying structure by looking one variable/column at a time. The biplot helps by creating a low-dimensional graphic display with ratings as vectors and whiskies as points. The ratings appeared to be arrayed along an arc from floral to medicinal, and the 86 whiskies were located as points in this same space.

Now, we are ready to project the cluster solution onto this biplot. By using separate ratings, the finite mixture model worked in the 12-dimensional rating space and not in the two-dimensional world of the biplot. Yet, we see relatively coherent clusters occupying different regions of the map. In fact, except for the Blue cluster falling in the middle, the clusters move along the arc from a Red floral to a Black malty/honey/nutty/winey to a Green medicinal. The relationships among the four clusters are revealed by their color coding on the biplot. They are no longer four qualitatively distinct entries, but a continuum of locally adjacent groupings arrayed along a nonlinear dimension from floral to medicinal.

R code needed to run all the analysis in this post.

# read data from external site
# after copied into the clipboard
data <- read.csv("clipboard")
ratings<-data[,3:14]
 
# runs finite mixture model
library(mclust)
fmm<-Mclust(ratings)
fmm
table(fmm$classification)
fmm$parameters$mean
 
# compares with k-means solution
kcl<-kmeans(ratings, 4, nstart=25)
table(fmm$classification, kcl$cluster)
 
# creates biplots
library(FactoMineR)
pca<-PCA(ratings)
plot(pca, choix=c("ind"), label="none", col.ind=fmm$classification)

Created by Pretty R at inside-R.org

Saturday, June 21, 2014

Separating Statistical Models of "What Is Learned" from "How It Is Learned"

Something triggers our interest. Possibly it's an ad, a review or just word of mouth. We want to know more about the movie, the device, the software, or the service. Because we come with different preferences and needs, our searches vary in intensity. For some it is one and done, but others expend some effort and seek out many sources. My last post on the consumer decision journey laid out the argument for using nonnegative matrix factorization and the R package nmf to identify the different pathways taken in the search for product information.

Information Search ≠ Knowledge Acquired

It is easy to confuse the learning process with what is learned. The internet gives consumers control over their information search, and they are free to improvise as they wish. However, what is learned remains determined by the competition in the product category. What knowledge do we acquire as we search online or in person? Careful, there is no exam, therefore we are not required to be objective or thorough. Andy Clark reminds us that "...minds evolved to make things happen." So we learn what is available and what we want because we intend to make a purchase. We learn only what we need to know to make a choice.

The marketer and the consumer join forces to simplify the purchase process so that only a limited amount of information search and knowledge acquisition is needed to reach a satisfying decision. When the choice is hard, only a few buy. The simplification is a one-dimensional array of features and benefits running from the basic to the premium product, from the least to the most expensive. Every product category offers alternatives that are good, better, and best. Learning this is not difficult since everyone is ready and willing to help. The marketing department, the experts who review and recommend, and even other users will let you know what features differentiate the good from the better and the better from the best. One cannot search long for product information without learning what features are basic, what features are added to create the next quality level, and finally what features indicate a premium product.

In the end, we require one statistical model for analyzing how well the brand is doing and a different statistical model for investigating the pathways taken in the consumer decision journey. As we have already seen, R provides a method for uncovering the learning pathways with matrix factorization packages such as nmf. Brand performance or achievement (what is learned) can be modeled using latent-trait or item response theory (see the section "Thinking Like an Item Response Theorist"). I have provided more detail in previous posts showing how to analyze both checklists and rating scales.

Marketers have resource allocation decisions to make. They need to put their money where consumers are looking. When marketing is successful, the brand will do well. Since process and product are distinct systems governed by different rules, each demands its own statistical model and associated measurement procedures.