Monday, January 26, 2015

Wine for Breakfast: Consumption Occasion as the Unit of Analysis

If the thought of a nice Chianti with that breakfast croissant is not that appealing, then I have made my point: occasion shapes consumption. Our tastes have been fashioned by culture and shared practice. Yet, we often ignore the context and run our analyses as if consumers were not nested within situations. Contextual effects are attributed to the person, who is treated as both the unit of observation and the unit of analysis.

Obviously, it would be difficult to interview the occasion. We need informants to learn about wine occasions. Thus, we seek out consumers to tell us when and where they drink what kinds of wines by themselves and with others. Even if one knows little about wine etiquette, the situation imposes such strong constraints that it makes sense to treat the consumption occasion as the unit of analysis. The person serves as the measuring instrument, but the focus is on the determining properties of the occasion.

Continuing with our example, there is a broad range of red and white wine varietals that can be purchased in varying containers from a number of different retailers and served in various locations with a diversity of others. The list is long, and it is unlikely that we can ask for the details for more than a couple of consumption occasions before we fatigue our respondents. Yet, it is the specifics that we seek, including the benefits sought and the features considered.

Clearly, there is a self-selection process so that we would expect to find certain types of individuals in each situation. However, the consumption occasion imposes its own rules over and above any selection effect. Therefore, we would anticipate that whatever the reasons for your presence, the occasion will dictate its own norms. In the end, it is reasonable to aggregate the responses of everyone reporting on each consumption occasion and run the analysis with those aggregate responses as the rows. The columns are formed using all the data gathered about the occasion.

And It Isn't Just About Wine and Breakfast (Benefit Structure Analysis)

There are occasions when you use your smartphone to take pictures. If you were thinking about purchasing a new smartphone, you would consider camera ease of use and picture quality, remembering those low-light photos that were out of focus and those sunsets where the sun was a blur. Usage occasion seems to impact almost every purchase. You pick your parents up at the airport, so you need four doors, preferably with easy access to the rear seats. Usage is so important that the website or the salesperson always asks how you intend to use your new acquisition. Context matters whatever you buy (e.g., a washing machine, a garden hose, clothes, cosmetics, sporting equipment, and suitcases).

The goal is to uncover the major sources of variation differentiating among all the consumption occasions. Product differentiation and customer segmentation originate in the usage context. Since opportunities for increased profitability are found in the details, let's pretend we are journalists and ask who, what, where, when, why, and how. These six questions alone can generate a lot of rows. For instance, we obtain some 15,625 possible combinations when we suppose that the answers to each of the six questions can be classified into one of five categories (15,625 = 5x5x5x5x5x5). Of course, most of these rows will be empty because the responses to the six questions are not independent. Yet, 10% is still over 1,500 rows, even if many of those rows will be sparse with zero or very small frequencies. Finally, the columns can contain any information collected about the consumption occasions in the rows, though one would expect inquiries concerning benefits sought and features preferred.
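
To make the bookkeeping concrete, the short R sketch below enumerates the full grid of combinations with expand.grid. The five categories per question are hypothetical placeholders, not a recommended coding scheme.

# hypothetical category labels for the six journalist questions
who   <- c("alone", "partner", "family", "friends", "coworkers")
what  <- c("red", "white", "rose", "sparkling", "dessert wine")
where <- c("home", "restaurant", "bar", "party", "outdoors")
when  <- c("breakfast", "lunch", "dinner", "late night", "weekend afternoon")
why   <- c("relax", "celebrate", "socialize", "pair with food", "gift")
how   <- c("bottle", "glass", "carafe", "box", "single serve")

# all 5^6 = 15,625 possible consumption occasions
occasions <- expand.grid(who, what, where, when, why, how)
nrow(occasions)

# observed occasions will fill only a small fraction of these rows,
# so the resulting occasion-by-benefit matrix will be sparse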

Now, we have a large matrix revealing the linkages between many specific occasions and a wide range of benefits and features. It might be helpful to revisit the work on Benefit Structure Analysis from the 1970s in order to see how others have analyzed such a matrix. In Exhibit 5 from that Journal of Marketing article, we are presented with a matrix of 51 benefits wanted across 21 cleaning tasks. The solution was a simultaneous row and column linkage analysis, which seems similar to the biclustering that one would achieve today with nonnegative matrix factorization (NMF). As noted in the article, when cleaning furniture, the respondents desired products that removed dust, dirt and film without leaving residues or scratches. On the one hand, there appears to be a structure underlying the cleaning tasks revealed by their shared benefits. On the other hand, the benefits are clustered together by their common association with similar cleaning tasks.

Following that line of reasoning, we can simulate a data matrix by specifying a set of common latent features linking the occasions and the benefits. As outlined in a prior post, the data generating process is an additive superpositioning of building blocks formed by the occasion-benefit linkages. We can begin with some product, for example, coffee. When do we drink coffee, and why do we drink it? Even the shortest list would include starting the day (occasion) in order to jump-start the brain (benefit). Is this a building block? If there were a sizable cohort of first-of-the-day kickstarters who did not drink coffee for the same reasons at other occasions, then we would have a building block.

The data matrix tells us what benefits are sought in each occasion. Neither the occasions nor the benefits are independent. There are times and places when specialty coffee replaces our regular cup. What occasions come to mind when you think about iced or frozen blended coffees? To help us understand this process, I have reproduced a figure from an earlier post.

The associations between the ten occasions labeled A to J and the seven benefits numbered 1 to 7 are indicated by filled squares in Section a. The rows and columns are interchanged as we move from Sections b to c until we see the building blocks in Section d. The solid black and white squares do not show the shades of gray indicating the degree to which coffee drinkers demand the benefit in each occasion. Specifically, Benefit 6 is wanted both in Occasions A, C and H and in Occasions D, G, I and E. However, it is likely that drinkers are not equally demanding in the two sets of occasions. For example, coffee that starts the day must energize, but the coffee in the afternoon might be primarily a break or a low-calorie refreshment. In both cases we are seeking stimulation, just not as much in the afternoon as with the first cup of the day.
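
Following the figure, a minimal simulation of this additive superpositioning might look like the sketch below: two hypothetical building blocks, a kickstart block and an afternoon-break block, are laid on top of one another, with Benefit 6 demanded strongly in the first and only weakly in the second. The occasion memberships and benefit weights are invented for illustration.

# rows are the ten occasions A to J, columns are the seven benefits
W <- matrix(0, nrow = 10, ncol = 2,
            dimnames = list(LETTERS[1:10], c("kickstart", "afternoon")))
W[c("A", "C", "H"), "kickstart"] <- 1
W[c("D", "E", "G", "I"), "afternoon"] <- 1

# hypothetical benefit weights for each building block
# (Benefit 6 is demanded strongly in the morning, weakly in the afternoon)
H <- rbind(kickstart = c(3, 0, 2, 0, 0, 3, 0),
           afternoon = c(0, 2, 0, 1, 0, 1, 2))
colnames(H) <- paste0("Benefit", 1:7)

# the data matrix is the additive superposition of the two blocks
V <- W %*% H
V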

Benefit structure analysis remains a critical component in any marketing plan. Opportunity is found in the white spaces where benefits are not delivered by the current offerings. Case studies and qualitative research findings fill the business shelves of online and retail book sellers. Now, advances in statistical modeling enable us to inquire at the deep level of detail that drives consumer product purchases. The R code needed to simultaneously cluster the rows and columns of such data matrices has been provided in a series of previous posts on music, cosmetics, personality inventories, scotch whiskey, feature usage, and the consumer purchase journey.

Sunday, January 11, 2015

Some Applications of Item Response Theory in R

The typical introduction to item response theory (IRT) positions the technique as a form of curve fitting. We believe that a latent continuous variable is responsible for the observed dichotomous or polytomous responses to a set of items (e.g., multiple choice questions on an exam or rating scales from a survey). Literally, once I know your latent score, I can predict your observed responses to all the items. Our task is to estimate that function with one, two or three parameters after determining that the latent trait is unidimensional. In the process of measuring individuals, we gather information about the items. Those one, two or three parameters are assessments of each item's difficulty, discriminability and sensitivity to noise or guessing.

All this has been translated into R by William Revelle, and as a measurement task, our work is done. We have an estimate of each individual's latent position on an underlying continuum defined as whatever determines the item responses. Along the way, we discover which items require more of the latent trait in order to achieve a favorable response (e.g., the difficulty of answering correctly or the extremity of the item and/or the response). We can measure ability with achievement items, political ideology with an opinion survey, and brand perceptions with a list of satisfaction ratings.
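
As a minimal sketch of that curve fitting, the code below simulates dichotomous responses driven by a single latent trait and hands them to irt.fa from Revelle's psych package. The item difficulties are made up, and the estimation relies on the function's default factor analysis of tetrachoric correlations.

library(psych)
set.seed(42)

# one latent trait and five hypothetical items of increasing difficulty
theta <- rnorm(500)
difficulty <- c(-1.5, -0.5, 0, 0.5, 1.5)

# probability of a favorable response falls as item difficulty rises
p <- plogis(outer(theta, difficulty, "-"))
items <- (matrix(runif(500 * 5), 500, 5) < p) * 1
colnames(items) <- paste0("item", 1:5)

# item response analysis via factor analysis of tetrachoric correlations
fit <- irt.fa(items, plot = FALSE)
fit$irt                   # estimated item difficulty and discrimination
plot(fit, type = "ICC")   # item characteristic curves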

To be clear, these scales are meant to differentiate among individuals. For example, the R statistical programming language has an underlying structure that orders the learning process so that the more complex concepts are mastered after the simpler material. In this case, learning is shaped by the difficulty of the subject matter, with the more demanding content reusing or building on what has already been learned. When the constraints are sufficient, individuals and their mastery can be arrayed on a common scale. At one end of the continuum are complex concepts that only the more advanced students master. The easier stuff falls toward the bottom of the scale with topics that almost everyone knows. When you take an R programming achievement test, your score tells me how well you performed relative to others who answered similar questions (see norm-referenced testing).

The same reasoning applies to an IRT analysis of political ideology (e.g., the R package basicspace). Opinions tend to follow a predictable path from liberal to conservative so that only a limited number of all possible configurations are actually observed. As shown below, legislative voting follows such a pattern, with Senators (dark line) and Representatives (light line) separated along the liberal-to-conservative dimension based on their votes in the 113th Congress. Although not shown, all the specific votes can also be placed on this same scale so that Pryor, Landrieu, Baucus and Hagan (in blue) are located toward the right because their votes on various bills and resolutions agreed more often with Republicans (in red). As with achievement testing, an order is imposed on the likely responses so that the response space in p dimensions (where p equals the number of behaviors, items or votes) is reduced to a one-dimensional seriation of both votes and voters on the same scale.

My last example comes from marketing research where brand perceptions tend to be organized as a pattern of strengths and weaknesses defined by the product category. In a previous post, I showed how preference for Subway fast food restaurants is associated with a specific ordering of product and service attribute ratings. Many believe that Subway offers fresh and healthy food. Fewer like the taste or feel it is filling. Fewer still are happy with the ordering or preparation, and even more dislike the menu and the seating arrangements. These perceptions have an order so that if you are satisfied with the menu then you are likely to be satisfied with the taste and the freshness/healthiness of the food. Just as issues can be ordered from liberal to conservative, brand perceptions reflect the strengths and weaknesses promised by the brand's positioning. Subway promises fresh and healthy food, not prepackaged and waiting under the heat lamp for easy bagging. The mean levels of our satisfaction ratings will be consistent with those brand priorities.

We can look at the same data from another perspective. Heatmaps summarize the triangular pattern observed in data matrices that can be modeled by IRT. In a second post analyzing the Subway data, I described the following heatmap showing the results from the 8-item checklist of features associated with the brand. Each row is a different respondent with the blue indicating that the item was checked and red telling us that the item was not checked. As one moves down the heatmap, the overall perceptions become more positive as additional attributes are endorsed. Positive brand perceptions are incremental, but the increments are not more of the same. Tasty and filling gets added to healthy and fresh. That is, greater satisfaction with Subway is reflected in the willingness to endorse additional components of the brand promise. The heatmap is triangular so that those who are happy with the menu are likely to be at least as satisfied with all the attributes to the right.
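
A small simulation may make the triangular picture concrete. The sketch below generates a hypothetical 8-item checklist with a Guttman-like ordering, sorts respondents by the number of attributes endorsed, and draws the heatmap without any reordering of rows or columns. The attribute thresholds are invented; this is not the Subway data.

set.seed(7)
n <- 200

# one latent satisfaction score and eight attributes that become
# easier to endorse as we move from left to right
theta <- rnorm(n)
easiness <- seq(-2, 2, length.out = 8)
checked <- (plogis(outer(theta, easiness, "+")) > runif(n * 8)) * 1
colnames(checked) <- paste0("attribute", 1:8)

# sort respondents by total endorsements so the triangle becomes visible
checked <- checked[order(rowSums(checked)), ]
heatmap(checked, Rowv = NA, Colv = NA, scale = "none",
        col = c("red", "blue"))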

Monday, December 22, 2014

Contextual Measurement Is a Game Changer


Adding a context can change one's frame of reference:

Are you courteous? 
Are you courteous at work? 

Decontextualized questions tend to activate a self-presentation strategy and retrieve memories of past positioning of oneself (impression management). Such personality inventories can be completed without ever thinking about how we actually behave in real situations. The phrase "at work" may disrupt that process if we do not have a prepared statement concerning our workplace demeanor. Yet, a simple "at work" may not be sufficient, and we may be forced to become more concrete and operationally define what we mean by courteous workplace behavior (performance appraisal). Our measures are still self-reports, but the added specificity requires that we relive the events described by the question (episodic memory) rather than providing inferences concerning the possible causes of our behavior.

We have such a data set in R (verbal in the difR package). The data come from a study of verbal aggression triggered by some event: (S1) a bus fails to stop for me, (S2) I miss a train because a clerk gave faulty information, (S3) the grocery store closes just as I am about to enter, or (S4) the operator disconnects me when I have used up my last 10 cents for a call. Obviously, the data were collected during the last millennium when there were still phone booths, but the final item can be updated as "The automated phone support system disconnects me after working my way through the entire menu of options" (which seems even more upsetting than the original wording).

Alright, we are angry. Now, we can respond by shouting, scolding or cursing, and these verbally aggressive behaviors can be real (do) or fantasy (want to). The factorial combination of 4 situations (S1, S2, S3, and S4) by 2 behavioral modes (Want and Do) by 3 actions (Shout, Scold and Curse) yields the 24 items of the contextualized personality questionnaire. Respondents are given each description and asked "yes" or "no" with "perhaps" as an intermediate point on what might be considered an ordinal scale. Our dataset collapses "yes" and "perhaps" to form a dichotomous scale and thus avoids the issue of whether "perhaps" is a true midpoint or another branch of a decision tree.
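
For readers who want to see the bookkeeping, the sketch below rebuilds the 24 item labels from the factorial design and illustrates the collapsing of "perhaps" with "yes." The recoding is purely illustrative since the verbal data in difR already arrive dichotomized.

# the 4 x 2 x 3 factorial design generating the 24 items
design <- expand.grid(situation = c("S1", "S2", "S3", "S4"),
                      mode      = c("Want", "Do"),
                      action    = c("Curse", "Scold", "Shout"))
item_labels <- with(design, paste0(situation, mode, action))
length(item_labels)   # 24

# collapsing the three response categories into a dichotomy
response <- c("no", "perhaps", "yes")
ifelse(response == "no", 0, 1)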

David Magis et al. provide a rather detailed analysis of this scale as a problem in differential item functioning (DIF) solved using the R package difR. However, I would like to suggest an alternative approach using nonnegative matrix factorization (NMF). My primary concern is scalability. I would like to see a more complete inventory of events that trigger verbal aggression and a more comprehensive set of possible actions. For example, we might begin with a much longer list of upsetting situations that are commonly encountered. We would follow up by asking respondents which situations they have experienced and what they did in each one. The result would be a much larger and sparser data matrix that might overburden a DIF analysis but that NMF could easily handle.

Hopefully, you can see the contrast between the two approaches. Here we have four contextual triggering events (bus, train, store, and phone) crossed with 6 different behaviors (want and do by curse, scold and shout). An item response model assumes that responses to each item reflect each individual's position on a continuous latent variable, in this case, verbal aggression as a personality trait. The more aggressive you are, the more likely you are to engage in more aggressive behaviors. Situations may be more or less aggression-evoking, but individuals maintain their relative standing on the aggression trait.

Nonnegative matrix factorization, on the other hand, searches for a decomposition of the observed data matrix under the constraint that all the matrices contain only nonnegative values. These nonnegative restrictions tend to reproduce the original data matrix from additive parts, as if one were layering the components one on top of another. As an illustration, let us say that our sample could be separated into the shouters, the scolders, and those who curse based on their preferred response regardless of the situation. These three components would be the building blocks, and those who shout their curses would have their data rows formed by the overlay of the shout and curse components. The analysis below will illustrate this point.

The NMF R code is presented at the end of this post. You are encouraged to copy and run the analysis after installing difR and NMF. I will limit my discussion to the following coefficient matrix showing the contribution of each of the 24 items after rescaling to fall on a scale from 0 to 1.


Item          Want to and Do Scold  Store Closing  Want to and Do Shout  Want to Curse  Do Curse
S2DoScold                     1.00           0.19                  0.00           0.00      0.00
S4WantScold                   0.96           0.00                  0.00           0.08      0.00
S4DoScold                     0.95           0.00                  0.00           0.00      0.11
S1DoScold                     0.79           0.37                  0.02           0.05      0.15

S3WantScold                   0.00           1.00                  0.00           0.08      0.00
S3DoScold                     0.00           0.79                  0.00           0.00      0.00
S3DoShout                     0.00           0.15                  0.14           0.00      0.00

S2WantShout                   0.00           0.00                  1.00           0.13      0.02
S1WantShout                   0.00           0.05                  0.91           0.17      0.04
S4WantShout                   0.00           0.00                  0.76           0.00      0.00
S1DoShout                     0.00           0.12                  0.74           0.00      0.00
S2DoShout                     0.08           0.00                  0.59           0.00      0.00
S4DoShout                     0.10           0.00                  0.39           0.00      0.00
S3WantShout                   0.00           0.34                  0.36           0.00      0.00

S1wantCurse                   0.13           0.18                  0.03           1.00      0.09
S2WantCurse                   0.34           0.00                  0.08           0.92      0.20
S3WantCurse                   0.00           0.41                  0.00           0.85      0.02
S2WantScold                   0.59           0.00                  0.00           0.73      0.00
S1WantScold                   0.40           0.22                  0.01           0.69      0.00
S4WantCurse                   0.31           0.00                  0.00           0.62      0.48

S1DoCurse                     0.24           0.16                  0.01           0.17      1.00
S2DoCurse                     0.47           0.00                  0.00           0.00      0.99
S4DoCurse                     0.46           0.00                  0.02           0.00      0.95
S3DoCurse                     0.00           0.54                  0.00           0.00      0.69

As you can see, I extracted five latent features (the columns of the above coefficient matrix). Although there are some indices in the NMF package to assist in determining the number of latent features, I followed the common practice of fitting a number of different solutions and picking the "best" of the lot. It is often informative to learn how the solution changes with the rank of the decomposition. In this case, similar structures were uncovered regardless of the number of latent features. References to a more complete discussion of this question can be found in an August 29th comment from a previous post on NMF.
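
For those who want something more formal than comparing a handful of solutions by eye, the NMF package will fit a range of ranks and report quality measures for each. The sketch below follows the pattern shown in the package vignette, applied to the same item matrix (test) built in the code at the end of this post.

library(NMF)

# survey ranks 2 through 6 and compare quality measures across candidates
rank_survey <- nmf(test, 2:6, method = "lee", nrun = 10, seed = 123456)
plot(rank_survey)           # cophenetic correlation, residuals, and so on
consensusmap(rank_survey)   # stability of the clustering at each rank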

Cursing was the preferred option across all the situations, and the last two columns reveal a decomposition of the data matrix with a concentration of respondents who do curse or want to curse regardless of the trigger. It should be noted that Store Closing (S3) tended to generate less cursing, as well as less scolding and shouting. Evidently there was a smaller group that was upset by the store closing, at least enough to scold. This is why the second latent feature is part of the decomposition; we need to layer store closing for those additional individuals who reacted more than the rest. Finally, we have two latent features for those who shout and those who scold across situations. As in principal component analysis, which is also a matrix factorization, one needs to note the size of the coefficients. For example, the middle latent feature reveals a higher contribution for wanting to shout than for actually shouting.

Contextualized Measurement Alters the Response Generation Process

When we describe ourselves or others, we make use of the shared understandings that enable communication (meeting of minds or brain-to-brain transfer). These inferences concerning the causes of our own and others' behavior are always smoothed or fitted with context ignored, forgotten or never noticed. Statistical models of decontextualized self-reports reflect this organization imposed by the communication process. We believe that our behavior is driven by traits, and as a result, our responses can be fit with an item response model assuming latent traits.

Matrix factorization suggests a different model for contextualized self-reports. The possibilities explode with the introduction of context. Relatively small changes in the details create a flurry of new contexts and an accompanying surge in the alternative actions available. For instance, it makes a difference if the person closing the store as you are about to enter has the option of letting one more person in when you plead that it is for a quick purchase. The determining factor may be an emotional affordance, that is, an immediate perception that one is not valued. Moreover, the response to such a trigger will likely be specific to the situation and appropriately selected from a large repertoire of possible behaviors. Leaving the details out of the description only invites the respondents to fill in the blanks themselves.

You should be able to build on my somewhat limited example and extrapolate to a data matrix with many more situations and behaviors. As we saw here, individuals may have preferred responses that generalize over context (e.g., cursing tends to be overused) or perhaps there will be situation-specific sensitivity (e.g., store closings). NMF builds the data matrix from additive components that simultaneously cluster both the columns (situation-action pairings) and the rows (individuals). These components are latent, but they are not traits in the sense of dimensions over which individuals are rank ordered. Instead of differentiating dimensions, we have uncovered the building blocks that are layered to reproduce the data matrix.

Although we are not assuming an underlying dimension, we are open to the possibility. The row heatmap from the NMF may follow a characteristic Guttman scale pattern, but this is only one of many possible outcomes. The process might unfold as follows. One could expect a relationship between the context and response with some situations evoking more aggressive behaviors. We could then array the situations by increasing ability to evoke aggressive actions in the same way that items on an achievement test can be ordered by difficulty. Aggressiveness becomes a dimension when situations accumulate like correct answers on an exam, with those displaying less aggressive behaviors encountering only the less aggression-evoking situations. Individuals become more aggressive by finding themselves in or by actively seeking increasingly more aggression-evoking situations.


R Code for the NMF Analysis of the Verbal Aggression Data Set

# access the verbal data from difR
library(difR)
data(verbal)
 
# extract the 24 items
test<-verbal[,1:24]
apply(test,2,table)
 
# remove rows with all 0s
none<-apply(test,1,sum)
table(none)
test<-test[none>0,]
 
library(NMF)
# set seed for nmf replication
set.seed(1219)
 
# 5 latent features chosen after
# examining several different solutions
fit<-nmf(test, 5, method="lee", nrun=20)
summary(fit)
basismap(fit)
coefmap(fit)
 
# scales coefficients and sorts
library(psych)
h<-coef(fit)
max_h<-apply(h,1,function(x) max(x))
h_scaled<-h/max_h
fa.sort(t(round(h_scaled,3)))
 
# hard clusters based on max value
W<-basis(fit)
W2<-max.col(W)
 
# profile clusters
table(W2)
t(aggregate(test, by=list(W2), mean))


Friday, December 5, 2014

Archetypal Analysis: Similarity Defined by Distances from Contrasting Ideals


Carl Jung was at least partially correct. We do tend to think in terms of the extremes as shown in this archetypal wheel with rulers versus outlaws and heroes versus caregivers at different ends of bipolar dimensions. Happily, we are not required to accept Jung's collective unconscious to explain this tendency. Metaphorical thinking works just fine. For example, why not separate all political players into two camps based on our earliest experiences with powerful others: liberals as caregivers (supportive mothers) and conservatives as heroes (demanding and punishing fathers)?

Political ideology was selected as my example because of its universality and because R offers so many ways of analyzing such data. Probably the quickest introduction is through the voteview blog, which relies on a dimensional representation of our liberal and conservative archetypes (such as the following figure showing the polarization in the U.S. Congress).

Two points define a line, and it is seldom difficult to imagine a continuum between any two bipolar types, in this case between liberals and conservatives. Do we have a dimension or categories? It depends on any separation within the density distribution. Obviously, the distributions in both the House (light blue Democrats and light red Republicans) and the Senate (dark blue and red) are at least bimodal. Thus, we are free to represent the same data as points along the liberal-conservative dimension or as ratios of mixture coefficients for the two clusters (i.e., odds ratio of membership likelihood in the red or blue clusters).

The mclust R code and a more complete discussion can be found in an earlier post using likelihood to recommend as the dimension and promoters versus detractors as the clusters. In order that there is no misunderstanding, the liberal-conservative continuum is a latent variable derived from a series of votes on a range of issues with the R package basicspace. Recommendation, on the other hand, is an observed likelihood rating along an 11-point scale from 0 to 10. In both cases, we are looking for evidence of separation as if we had a mixture of different generative models.

Given the above figure, liberal and conservative archetypes would be located toward the end points of this scale. That is, instead of describing the two clusters using their centroids positioned near the "humps" in the two curves, archetypal analysis attempts to describe political ideology in terms of idealized liberals and conservatives. These are not necessarily the most extreme points, as the archetypes R package makes clear with displays such as the following showing both the convex hull of the most extreme data points in gray and archetypes as the vertices of the internal red triangle. Three archetypes are necessary to locate any data point in the two-dimensional space.
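
As a minimal illustration of the difference between centroids and archetypes, the sketch below fits three archetypes to a simulated two-dimensional point cloud with the archetypes package and marks their positions. The data are random draws that only stand in for ideology scores.

library(archetypes)
set.seed(2014)

# a two-dimensional, roughly bimodal cloud standing in for ideology scores
x <- cbind(dim1 = c(rnorm(150, -1), rnorm(150, 1)), dim2 = rnorm(300))

# three archetypes form the vertices of a triangle enclosing most of the data
fit <- archetypes(x, k = 3)
parameters(fit)   # coordinates of the three archetypes

plot(x, col = "grey", pch = 16)
points(parameters(fit), col = "red", pch = 17, cex = 1.5)
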
Before continuing, we ought to review a few examples so that we understand what we mean by an archetype. If you live in a region that receives snow or just watch a lot of Christmas movies and I told you that it was a perfect winter day, that picture you just imagined is an ideal or archetype. All winter days can be described in terms of their distance from that ideal. The same can be said of spring, fall and summer days. If you are familiar with smoggy days, as was Leo Breiman when he introduced archetypal analysis to describe ozone levels in Los Angeles, then you know what a smog alert feels like. We use the shorthand provided by shared archetypes to summarize succinctly a large amount of information.

As you may have noticed, I have interchanged the words "ideal" and "archetype" in my writing. This was deliberate since archetypes tend to be seen when describing the ideal instantiation of a category rather than the average category member. Thus, when asked to tell you about a specific athlete, such as a basketball center, you are not likely to describe the average center nor the greatest center that ever played the game. Instead, one thinks about the role that the center plays in the game, lists those defensive and offensive contributions, and distinguishes this position from the other players on the team. Manuel Eugster demonstrates how the R package archetypes would uncover such archetypal athletes.

Of course, there is no requirement that forces us to retrieve goal-derived categories and their associated ideals from memory. We could evaluate "on a curve" and think about the average basketball center, as we might if asked to guess the average height of an NBA center. Yet, the center in basketball serves a purpose within a team of other players with other purposes. Not unlike the archetypal wheel that introduced this post, the center is defined in contrast to the other positions on the team. The rules of the game play a role in the clustering of players with similarity measured not by distance from the average but distance from the ideal. Therefore, two centers are similar because they play similar roles in the game, that is, both are close to the ideal center. Moreover, they are seen as even more alike when guards are added into the mix. Similarity is shaped by the context of competing archetypes or ideals.

In one of my first posts, I demonstrated how the R package archetypes would identify feature usage types. Repeatedly, we find that usage intensity has the greatest impact differentiating the light from the heavy user. I have reproduced a figure from that previous post showing both the k-means clusters (the K's) and the position of the archetypes (the A's).


The data are 10 feature usage ratings that are projected onto the plane formed by the first two principal components. The points are respondents, and the arrows represent the features. All the arrows point to the right, indicating that the first principal component reflects usage intensity with heavier usage toward the right. As you know, the angles between arrows reflect their correlations, so that the two bundles of arrows suggest a two-factor solution. We can call such a pattern a bifactor solution: a general factor separating light and heavy users and two specific factors distinguishing between those more involved with each of the two feature bundles. It is worth your time to become familiar with this factor structure because it reappears frequently with usage data, as well as with preference and satisfaction ratings.

Do you see clusters of data points in the above scatterplot? The three centroids from a K-means clustering follow the path of the first principal component with a low usage (K2), a medium usage (K1) and a high usage (K3) segment. Personally, I find it difficult to separate out clusters in this data cloud. I see a fan-spread distribution with the amount of variation on the second dimension dependent on the value of the first dimension, that is, little or no feature usage among light users and increasing separation of the two feature bundles for heavier users. The archetypes reveal this pattern by forming a triangle with vertices at no usage (A3), bundle A1 usage and bundle A2 usage. K-means yields a restatement of usage intensity along the first dimension, while archetypal analysis summarizes the data as contained within the triangle formed by three usage types.
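
To make the contrast concrete, the sketch below simulates fan-spread usage ratings, an overall intensity dimension plus two feature bundles, all invented, and then compares where the k-means centroids and the archetypes end up.

library(archetypes)
set.seed(99)

# simulated usage: intensity separates light from heavy users,
# and a bundle preference splits the heavier users into two groups
n <- 300
intensity <- runif(n)
bundle <- runif(n)
ratings <- cbind(sapply(1:5, function(i) 10 * intensity * bundle + rnorm(n)),
                 sapply(1:5, function(i) 10 * intensity * (1 - bundle) + rnorm(n)))
colnames(ratings) <- paste0("feature", 1:10)

# three k-means centroids
km <- kmeans(ratings, centers = 3, nstart = 25)
round(km$centers, 1)

# three archetypes for comparison with the centroids
aa <- archetypes(ratings, k = 3)
round(parameters(aa), 1)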

Friday, November 14, 2014

In Praise of Substantive Expertise in Data Science

Substantive expertise makes it into the Data Science Venn Diagram from DataCamp's infographic on how to become a data scientist. It's one of the three circles of equal size along with programming and statistics. Regrettably, substantive expertise is never mentioned in the definition of a data scientist as "someone who is better at statistics than any software engineer and better at software engineering than any statistician." And it gets no step. Statistics is the first step, and the remaining steps cover programming in all its varying forms. "Alas, poor Substance! I knew him, DataCamp."

All of this, of course, is to be taken playfully. I have no quarrel with any of DataCamp's 8-step program. I only ask that we recognize that there are three circles of equal value. Some of us come to data science with substantive expertise, seeking new models for old problems. Some even contribute libraries applying those models in their particular areas of substantive expertise. R provides a common language through which we can visit foreign disciplines and see the same statistical models from a different perspective.

John Chambers reminds us in his UseR! 2014 keynote address that R began as a "user-centric scientific software tool" providing "an interface to the very best numerical algorithms." Adding an open platform for user-submitted packages, R also becomes the interface to a diverse range of applications. This is R's unique selling proposition. It is where one goes for new ways of seeing.

Wednesday, November 12, 2014

Building Blocks: A Compelling Image for Clustering with Nonnegative Matrix Factorization (NMF)

Would hierarchical clustering be as popular without the dendrogram? Cannot the same be said of finite mixture modeling with its multidimensional spaces populated by normal distributions? I invite you to move your mouse over the figure on the introductory page of the website for the R package mclust and click through all the graphics that bring mixture modeling to life. So what is the compelling image for nonnegative matrix factorization (NMF)?


Dendrograms are constructed from distance matrices. We have some choice in the distance or divergence metric, but once a variable has been selected, it is included in all the distance calculations. Finite mixtures and k-means avoid such matrices, but still define fit as some variation on the ratio of between-cluster and within-cluster distances, once again computed from all the variables selected for the analysis. These are the clusters pictured in the above links to dendrograms and isodensity curves in low-dimensional spaces derived from the entire set of included variables. 
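
For contrast, here is that familiar pipeline in miniature using a built-in data set: every variable enters a single distance matrix, and the dendrogram summarizes the hierarchy computed from it.

# all variables contribute to one distance matrix
d <- dist(scale(mtcars))
hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Dendrogram from a single all-variable distance matrix")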

However, such representations do not exhaust common ways of talking and thinking about similarity. For example, object substitution in a task or an activity is based on a more limited definition of shared functionality. These are goal-derived categories that I discussed at the end of my post showing how NMF can use top-contender rankings to reveal preference patterns for breakfast foods. Will a Danish pastry be good enough when all the donuts have been eaten? The thought of eating the donut evokes the criteria upon which substitutability and hence similarity will be judged (see Norm Theory: Comparing Reality to Its Alternatives). In the context of toast and other options for breakfast, the donut and the Danish may appear more similar by contrast, yet that is not what comes to mind when hungry for donuts. Similarity can only be defined within a context, as noted by Nelson Goodman.

Similarity Derived from Building Blocks in Localized Additive Representations

What did you do today? I could give you a list of activities and ask you to indicate how frequently you engaged in each activity. Who else is like you? Should we demand complete agreement across all the activities, or is it sufficient that you share some common task? If my list is a complete inventory, it will include many relatively infrequent activities performed only by specific subgroups. For example, caregivers for very young children come in all ages and genders from diverse backgrounds and with other responsibilities, yet they form a product category with an entire aisle of the supermarket dedicated to their shared needs. Situational demands pull together individuals in rows and activities in columns to mold a building block of relational data.

To be clear, a hierarchical clustering of respondents or the rows of our data matrix averages over all the columns. We start with an nxp data matrix, but perform the analysis with the nxn dissimilarity or distance matrix. Still, the data matrix contains relational data. Individuals are associated with the activities they perform. Instead of ignoring these relationships between the rows and the columns, we could seek simultaneous clustering of individuals and activities to form blocks running along the diagonal of the data matrix (a block diagonal matrix). Consequently, we may wish to alter our initial figure at the beginning of this post to be more precise and push the colored blocks out of a straight line to form a diagonal with each block demarcated by the intersection of individuals and their frequent activities.

A Toy Example

We can see how it is all accomplished in the following NMF analysis. We will begin with a blank data matrix and combine two blocks to form the final data matrix in the lower right of the figure below.


The final data matrix represents 6 respondents in the rows and 4 activities in the columns. The cells indicate the frequency of engaging in the activity, ranging from 0=never to 6=daily. Since the rows and columns have been sorted into two 3x2 blocks along the diagonal, we have no problem directly interpreting this small final data matrix. The frequency of the 4 activities is greatest in the first column and least in the third column. The first 3 and last 3 respondents are separated by the first 2 and last 2 activities. It appears that the 6x4 data matrix might be produced by only two latent features.

The following R code creates the final data matrix as the matrix product of respondent mixing weights and activity latent feature coefficients. That is, activities get organized into packets of stuff done by the same respondents, and respondents get clustered based on their activities. If you are familiar with the co-evolution of music genre and listening communities, you will not be surprised by the co- or bi-clustering of rows and columns into these diagonal building blocks. In a larger data matrix, however, we would expect to see both purists with nonzero mixing weights for only one latent feature and hybrids that spread their weights across several latent features. As noted in earlier posts, NMF thrives on sparsity in the data matrix especially when there is clear separation into non-overlapping blocks of rows and columns (e.g., violent action films and romantic comedies appealing to different audiences or luxury stores and discount outlets with customers tending to shop at one or the other).

# enter the data for the respondent mixing weights
MX<-matrix(c(3,2,1,0,0,0,0,0,0,1,1,2), ncol=2)
MX
 
# enter the data for the latent features
LP<-matrix(c(2,1,0,0,0,0,1,2), ncol=4, byrow=TRUE)
LP
 
# observed data is the matrix product
DATA<-MX%*%LP
DATA
 
# load the NMF library
library(NMF)
 
# run with rank=2
fit<-nmf(DATA, 2, "lee", nrun=20)
 
# output the latent feature coefficients
lp<-coef(fit)
round(lp,3)
 
#output the respondent mixing weights
mx<-basis(fit)
round(mx,3)
 
# reproduce the data matrix using NMF results
data<-mx%*%lp
round(data)
# output residuals
round(DATA-data,3)
 
# same but only for 1st latent feature
rank1<-as.matrix(mx[,1])%*%lp[1,]
round(DATA-rank1,3)
 
# same but only for 2nd latent feature
rank2<-as.matrix(mx[,2])%*%lp[2,]
round(DATA-rank2,3)
 
# additive representation 
round(rank1+rank2,3)
round(DATA-rank1-rank2,3)


The extensive comments in this R code reduce the need for additional explanation except to emphasize that you should copy and run the code in R (after installing NMF). I did not set a seed so that the order of the two parts may be switched. The exercise is intended to imprint the building block imagery. In addition, you might wish to think about how NMF deals with differences in respondent and activity intensity. For example, the first three respondents all engage in the first two activities but with decreasing frequency. Moreover, the same latent feature is responsible for the first two activities, yet the first activity is more frequent than the second. 

I would suggest that the answer can be found in the following section of output from the above code. You must, of course, remember your matrix multiplication. The first cell in our data matrix contains a "6" formed by multiplying the first row of mx by the first column of lp or 0.5 x 12 + 0.0 x 0 = 6. Now, it is easy to see that the 0.500, 0.333 and 0.167 in mx reveal the decreasing intensity of the first latent feature. Examining the rest of mx suggests that the last respondent should have higher scores than the previous two, and that is what we discover.

> round(lp,3)
      [,1] [,2] [,3] [,4]
[1,]   12    6    0    0
[2,]    0    0    4    8

> round(mx,3)
      [,1] [,2]
[1,] 0.500 0.00
[2,] 0.333 0.00
[3,] 0.167 0.00
[4,] 0.000 0.25
[5,] 0.000 0.25
[6,] 0.000 0.50

Parting Comments

When you see diagrams, such as the following from Wikipedia, you should take them literally. 


The data matrix V is reproduced approximately by a reduced rank matrix of mixing weights W multiplied by a reduced rank matrix of latent features H. These interpretations of W and H depend on V being a respondents-by-variables data matrix. One needs to be careful because many applications of NMF reverse the rows and columns, changing the meaning of W and H.
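
A quick numerical check of that statement, using the NMF package on a small random nonnegative matrix with arbitrary dimensions, shows how W and H recombine into an approximation of V.

library(NMF)
set.seed(5)

V <- matrix(rpois(100 * 20, 3), 100, 20)   # 100 respondents by 20 variables
fit <- nmf(V, 3, method = "lee", nrun = 5)

W <- basis(fit)   # 100 x 3 matrix of mixing weights
H <- coef(fit)    # 3 x 20 matrix of latent feature coefficients
dim(W); dim(H)

# the reduced-rank reconstruction approximates the original data matrix
max(abs(fitted(fit) - W %*% H))   # fitted(fit) is exactly W %*% H
mean(abs(V - W %*% H))            # average reconstruction error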

The number of columns in W and the number of rows in H can be much smaller than the number of observed variables, which is what is meant by data reduction. The same latent features are responsible for the clustering of respondents and variables. This process of co- or bi-clustering has redefined similarity by computing distances within the building blocks instead of across all the rows and columns. Something had to be done if we wish to include a complete inventory of activities. As the number of activities increases, the data become increasingly sparse and distances become more uniform (see Section 3 The Curse of Dimensionality).

The building block imagery seems to work in this example because different people engage in different activities. The data matrix is sparse due to such joint separation of row and columns. Those building blocks, the latent features, provide a localized additive representation from which we can reproduce the data matrix by stacking the blocks, or stated more accurately, by a convex combination of the latent features.