Sunday, April 3, 2016

When Choice Modeling Paradigms Collide: Features Presented versus Features Perceived

What is the value of a product feature? Within a market-based paradigm, the answer is the difference between revenues with and without the feature. A product can be decomposed into its features, each feature can be assigned a monetary value by including price in the feature list, and the final worth of the product is a function of its feature bundle. The entire procedure is illustrated in an article using the function rhierMnlMixture from the R package bayesm (Economic Valuation of Product Features). Although much of the discussion concentrates on a somewhat technical distinction between willingness-to-pay (WTP) and willingness-to-buy (WTB), I wish to focus instead on the digital camera case study in Section 6 beginning on page 30. If you have question concerning how you might run such an analysis in R, I have two posts that might help: Let's Do Some Hierarchical Bayes Choice Modeling in R and Let's Do Some More Hierarchical Bayes Choice Modeling in R.

As you can see, the study varies seven factors, including price, but the goal is to estimate the economic return from including a swivel screen on the back of the digital camera. Following much the same procedure as that outlined in those two choice modeling posts mentioned in the last paragraph, each respondent saw 16 hypothetical choice sets created using a fractional factorial experimental design. There was a profile associated with each of the four brands, and respondents were asked to first select the one they most preferred and then if they would buy their most preferred brand at a given price.

The term "dual response" has become associated with this approach, and several choice modelers have adopted the technique. If the value of the swivel screen is well-defined, it ought not matter how you ask these questions, and that seems to be confirmed by some in the choice modeling community. However, outside the laboratory and in the field, commitment or stated intention is the first step toward behavior change. Furthermore, the mere-measurement effect in survey research demonstrates that questioning by itself can alter preferences. Within the purchase context, consumers do not waste effort deciding which of the rejected alternatives is the least objectionable by attending to secondary features after failing to achieve consideration on one or more deal breakers (i.e., the best product they would not buy). Actually, dual response originates as a sales technique because encouraging commitment to one of the offerings increases the ultimate purchase likelihood.

We have our first collision. Order effects are everywhere. It is one of the most robust findings in measurement. The political pollster wants to know how big a sales tax increase could be passed in the next election. You get a different answer when you ask about a one-quarter percent increase followed by one-half percent than when you reverse the order. Perceptual contrast is unavoidable so that one-half seems bigger after the one-quarter probe. I do not need to provide a reference because everyone is aware of order as one of the many context effects. The feature presented is not the feature perceived.

Our second collision occurs from the introduction of price as just another feature, as if in the marketplace no one ever asks why one brand is more expensive than another. We ask because price is both a sacrifice with a negative impact and a signal of quality with a positive weight. In fact, as one can see from the pricing literature, there is nothing simple or direct about price perception. Careful framing may be needed (e.g., maintaining package size but reducing the amount without changing price). Otherwise, the reactions can be quite dramatic for price increases can trigger attributions concerning the underlying motivation and can generate a strong emotional response (e.g., price fairness).

At times, the relationship between the feature presented and the feature perceived can be more nuanced. It would be reasonable to vary gasoline prices in terms of cost per unit of measurement (e.g., dollars per gallon or euros per liter). Yet, the SUV driver seems to react in an all-or-none fashion only when some threshold on the cost to fill up their tank has been exceeded. What is determinant is not the posted price but the total cost of the transaction. Thus, price sensitivity is a complex nonlinear function of cost per unit depending on how often one fills up with gasoline and the size of that tank. In addition, the pain at the pump depends on other factors that fail to make it into a choice set. How long will the increases last? Are higher prices seen as fair? What other alternatives are available? Sometimes we have no option but to live with added costs, reducing our dissonance by altering our preferences.

We see none of this reasoning in choice modeling where the alternatives are described as feature bundles outside of any real context. The consumer "plays" the game as presented by the modeler. Repeating the choice exercise with multiple choice sets only serves to induce a "feature-as-presented" bias. Of course, there are occasions when actual purchases look like choice models. We can mimic repetitive purchases from the retail shelf with a choice exercise, and the same applies to online comparison shopping among alternatives described by short feature lists as long as we are careful about specifying the occasion and buyers do not search for user comments.

User comments bring us back to the usage occasion, which tends to be ignored in choice modeling. Reading the comments, we note that one customer reports the breakage of the hinge on the swivel screen after only a few months. Is the swivel screen still an advantage or a potential problem waiting to occur? We are not buying the feature, but the benefit that the feature promises. This is the scene of another paradigm collision. The choice modeler assumes that features have value that can be elicited by merely naming the feature. They simplify the purchase task by stripping out all contextual information. Consequently, the resulting estimates work within the confines of their preference elicitation procedures, but do not generalize to the marketplace.

We have other options in R, as I have suggested in my last two posts. Although the independent variables in a choice model are set by the researcher, we are free to transform them, for instance, compute price as a logarithm or fit low-order polynomials of the original features. We are free to go farther. Perceived features can be much more complex and constructed as nonlinear latent variables from the original data. For example, neural networks enable us to handle a feature-rich description of the alternatives and fit adaptive basis functions with hidden layers.

On the other hand, I have had some success exploiting the natural variation within product categories with many offerings (e.g., recommender systems for music, movies, and online shopping like Amazon). By embedding measurement within the actual purchase occasion, we can learn the when, why, what, how and where of consumption. We might discover the limits of a swivel screen in bright sunlight or when only one hand is free. The feature that appeared so valuable when introduced in the choice model may become a liability after reading users' comments.

Features described in choice sets are not the same features that consumers consider when purchasing and imagining future usage. This more realistic product representation requires that we move from those R packages that restrict the input space (choice modeling) to those R packages that enable the analysis of high-dimensional sparse matrices with adaptive basis functions (neural networks and matrix factorization).

Bottom Line: The data collection process employed to construct and display options when repeated choice sets are presented one after another tends to simplify the purchase task and induce a decision strategy consistent with regression models we find in several R packages (e.g., bayesm, mlogit, and RChoice). However, when the purchase process involves extensive search over many offerings (e.g., music, movies, wines, cheeses, vacations, restaurants, cosmetics, and many more) or multiple usage occasions (e.g., work, home, daily, special events, by oneself, with others, involving children, time of day, and other contextual factors), we need to look elsewhere within R for statistical models that allow for the construction of complex and nonlinear latent variables or hidden layers that serve as the derived input for decision making (e.g., R packages for deep learning or matrix factorization).

Friday, March 25, 2016

Choice Modeling with Features Defined by Consumers and Not Researchers

Choice modeling begins with a researcher "deciding on what attributes or levels fully describe the good or service." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.

As I have argued in the previous post, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?

With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (see slide #2 from Yann LeCun's 2013 tutorial with video here). From this perspective, choice modeling is a form of  more shallow information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (the R package support.CEs aids in this process and you can find R code for hierarchical Bayes using bayesm in this link).

Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.

Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes.

R does offer an interface to deep learning algorithms. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).

Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variation on flakes, crunch or hot cereals). The joint blocks define the hidden layers simultaneously clustering respondents and typing products.

Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in a post on product category representation. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a data matrix reconstructed from parts that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.

We might expect to find something similar to what Jonathan Gutman reported from a qualitative study using a means-end analysis. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the cereals with crunch loading on that hidden feature with all the respondents wanting crunch with higher weights on the same hidden feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.

Monday, March 21, 2016

Understanding Statistical Models Through the Datasets They Seek to Explain: Choice Modeling vs. Neural Networks

R may be the lingua franca, yet many of the packages within the R library seem to be written in different languages. We can follow the R code because we know how to program but still feel that we have missed something in the translation.

R provides an open environment for code from different communities, each with their own set of exemplars, where the term "exemplar" has been borrowed from Thomas Kuhn's work on normal science. You need only to examine the datasets that each R package includes to illustrate its capabilities in order to understand the diversity of paradigms spanned. As an example, the datasets from the Clustering and Finite Mixture Task View demonstrate the dependence of the statistical models on the data to be analyzed. Those seeking to identifying communities in social networks might be using similar terms as those trying to recognize objects in visual images, yet the different referents (exemplars) change the meanings of those terms.

Thinking in Terms of Causes and Effects

Of course, there are exceptions, for instance, regression models can be easily understood across applications as the "pulling of levers" especially for those of us seeking to intervene and change behavior (e.g., marketing research). Increased spending on advertising yields greater awareness and generates more sales, that is, pulling the ad spending lever raises revenue (see the R package CausalImpact). The same reasoning underlies choice modeling with features as levers and purchase as the effect (see the R package bayesm).

The above picture captures this mechanistic "pulling the lever" that dominates much of our thinking about the marketing mix. The exemplar "explains" through analogy. You might prefer "adjusting the dials" as an updated version, but the paradigm remains cause-and-effect with each cause separable and under the control of the marketer. Is this not what we mean by the relative contribution of predictors? Each independent variable in a regression equation has its own unique effect on the outcome. We pull each lever a distance of one standard deviation (the beta weight), sum the changes on the outcome (sometimes theses betas are squared before adding), and then divide by the total.

The Challenge from Neural Networks

So, how do we make sense of neural networks and deep learning? Is the R package neuralnet simply another method for curve fitting or estimating the impact of features? Geoffrey Hinton might think differently. The Intro Video for Coursera's Neural Networks for Machine Learning offers a different exemplar - handwritten digit recognition. If he is curve fitting, the features are not given but extracted so that learning is possible (i.e., the features are not obvious but constructed from the input to solve the task at hand). The first chapter of Michael Nielsen's online book, Using Neural Nets to Recognize Handwritten Digits, provides the details. Isabelle Guyon's pattern recognition course adds an animated gif displaying visual perception as an active process.

On the other hand, a choice model begins with the researcher deciding what features should be varied. The product space is partitioned and presented as structured feature lists. What alternative does the consumer have, except to respond to variations in the feature levels? I attend to price because you keep changing the price. Wider ranges and greater variation only focus my attention. However, in real setting the shelves and the computer screens are filled with competing products waiting for consumers to define their own differentiating features. Smart Watches from Google Shopping provides a clear illustration of the divergence of purchase processes in the real world and in the laboratory.

To be clear, when the choice model and the neural network speak of input, they are referring to two very different things. The exemplars from choice modeling are deciding how best to commute and comparing a few offers for same product or service. This works when you are choosing between two cans of chicken soup by reading the ingredients on their labels. It does not describe how one selects a cheese from the huge assortment found in many stores.

Neural networks take a different view of the task. In less than five minutes Hinton's video provides the exemplar for representation learning. Input enters as it does in real settings. Features that successfully differentiate among the digits are learned over time. We see that learning in the video when the neural net generates its own handwritten digits for the numbers 2 and 8. It is not uncommon to write down a number that later we or others have difficulty reading. Legibility is valued so that we can say that an easier to read "2" is preferred over a "2" that is harder to identify. But what makes one "2" a better two than another "2" takes some training, as machine learning teaches us.

We are all accomplished at number recognition and forget how much time and effort it took to reach this level of understanding (unless we know young children in the middle of the learning process). What year is MCMXCIX? The letters are important, but so are their relative positions (e.g. X=10 and IX=9 in the year 1999). We are not pulling levers any more, at least not until the features have been constructed. What are those features in typical choice situations? What you want to eat for breakfast, lunch or dinner (unless you snack instead) often depends on your location, available time and money, future dining plans, motivation for eating, and who else is present (context-aware recommender systems).

Adopting a different perspective, our choice modeler sees the world as well-defined and decomposable into separate factors that can be varied systematically according to some experimental design. Under such constraints the consumer behaves as the model predicts (a self-fulling prophecy?). Meanwhile, in the real world, consumers struggle to learn a product representation that makes choice possible.

Thinking Outside the Choice Modeling Box

The features we learn may be relative to the competitive set, which is why adding a more expensive alternative makes what is now the mid-priced option appear less expensive. Situation plays an important role for the movie I view when alone is not the movie I watch with my kids. Framing has an impact, which is why advertising tries to convince you that an expensive purchase is a gift that you give to yourself. Moreover, we cannot forget intended usage for that Smartphone is a camera, a GPS, and I believe you get the point. We may have many more potential features than included in our choice design.

It may be the case that the final step before purchase can be described as a tradeoff among a small set of features varying over only a few alternatives in our consideration set. If we can mimic that terminal stage with a choice model, we might have a good chance to learn something about the marketplace. How did the consumer get to that last choice point? Why these features and those alternative products or services? In order to answer such questions, we will need to look outside the choice modeling box.

Friday, January 8, 2016

A Data Science Solution to the Question "What is Data Science?"

As this flowchart from Wikipedia illustrates, data science is about collecting, cleaning, analyzing and reporting data. But is it data science or just or a "sexed up term" for Statistics (see embedded quote by Nate Silver)? It's difficult to separate the two at this level of generality, so perhaps we need to define our terms.

We begin by making a list of all the stuff that a data scientist might do or know. We are playing a game where the answer is "data scientist" and the questions are "Do they do this?" and "Do they know that?". However, the "this" and the "that" are very specific. For example, "Data is Processed" can range from simple downloading to the complex representation of visual or speech input. What precisely does a data scientist do when they process data that a programmer or a statistician does not do?

To be clear, I am constructing a very long questionnaire that I intend to distribute to individuals calling themselves data scientists along with everyone else claiming that they too do data science, although by another name. A checklist will work in our game of Twenty Questions as long as the list is detailed and exhaustive. You are welcome to add suggestions as comments to this post, but we can start by expanding on each of the boxes in the above data science flowchart.

Since I am a marketing researcher, I am inclined to analyze the resulting data matrix as if it were a shopping cart filled with items purchased from a grocery store or an inventory of downloads from a video or music provider. The rows are respondents, and the columns are all the questions that might be asked to distinguish among all the various players. Let's not include sexy as a column.

You may have guessed that I am headed toward some type of matrix factorization. Can we recognize patterns in the columns that reflect different configurations of study and behavior? Are there communities composed of rows clustered together with similar practices and experiences? R provides most of us who have some experience running factor and cluster analyses with a "doable" introduction to non-negative matrix factorization (NMF). You can think of it as simultaneous clustering of the rows and columns in a data matrix. My blog is filled with examples, none of which are easy, but none of which are incomprehensible or beyond your ability to adapt to your own datasets.

What are we likely to find? Will we discover something like anchor words from topic modeling? For instance, it is necessary to work with multiple datasets from different disciplines to be a data scientist? Would I stop calling myself a marketing scientist if I started working with political polling data? Some argue that one becomes a statistician when they begin consulting with others from divergent fields of study.

What about teaching to students with varied backgrounds in universities or industry? Do we call it data science if one writes and distributes software that others can apply with data across diverse domains? Does proving theorems make one a statistician? How many languages must one know before they are a programmer? What role does computation play when making such discriminations?

What will we learn from dissecting the "corpus" (the detailed body of what we do and know summarized by the boxes in the above data science process)? Extending this analogy, I am recommending that the "physician, heal thyself" by applying data science methodology to provide a response to the "What is Data Science?" question. 

Hopefully, we can avoid the hype and the caricature from the popular press (sexiest job of 21st century). Moreover, I suggest that we resist the tendency to think metaphorically in terms of contrasting ideals. The simple act of comparing statisticians and data scientists shapes our perceptions and leads us to see the two as more dissimilar than suggested by their training and behavior. The distinction may be more nuance than substance, reflecting what excites and motivates rather than what is known or done. The basis for separation may reside in how much personal satisfaction is derived from the subject matter or the programming rather than the computational algorithm or the generative model.

Wednesday, December 23, 2015

Modeling How Consumers Simplify the Purchase Process by Copying Others

A Flower That Fits the Bill
Marketing borrows the biological notion of coevolution to explain the progressive "fit" between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.

The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. Is it the smartphone or the lack of fun that's killing the camera?

The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, "cutting the cord" can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. The Diary of a Cord-Cutter in 2015 offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).

So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see Sociocognitive Dynamics in a Product Market for a comprehensive overview). User-centered product design takes an additional step and creates fictional customers or personas in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.

Identifying the Likely Cord-Cutter

We can ask about your subscriptions: cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might get some frequency of usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kid's shows, HBO or other channel subscriptions, and continue until just before respondents become likely to terminate on-line surveys). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVDs, DVRs, media streamers and other stuff.

A form of reverse engineering guides our data collection. Qualitative research and personal experience gives us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and ala carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there are always free broadcast TV captured with an antenna and DVDs that you rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about Sling TV.

Traditional segmentation will not be able to represent this degree of complexity.

Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model is chosen must be open to the possibility that every non-contradictory arrangement is possible. Yet, every combination will not survive for some will be dominated by others and never achieve a sustainable audience.

We could display this attraction between consumers and offerings as a bipartite graph (Figure 2.9 from Barabasi's Network Science).

Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing B or #5 cuts the graph into independent parts.

Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs - simply do as others do (as in observational learning). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.

In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., The Tipping Point). Regardless, over time, the U's and V's in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.

"I'll Have What She's Having"

It has taken some time to outline how consumers are able to simplify complex purchase process by modeling the behavior of others. It is such a common experience, although rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you are seated. You refuse a menu and simply ask for the same dish. "I'll Have What She's Having" works as a decision strategy only when I can identify the "she" and the "what" simultaneously.

If we intend to analyze that data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a latent clustering of a bipartite network in the latentnet package and a nonnegative matrix factorization in the NMF package. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.

Stephen Borgatti is a good place to begin with his two-mode social network analysis of the Davis data. The rows are 18 women, the columns are 14 events, and the cells are zero or one depending on whether or not each woman attended each event. The nature of the events has not been specified, but since I am in marketing, I prefer to think of the events as if they were movies seen or concerts attended (i.e., events requiring the purchase of tickets). You will find a latentnet tutorial covering the analysis of this same data as a bipartite network (section 6.3). Finally, a paper by Michael Brusco called "Analysis of two-mode network data using nonnegative matrix factorization" provides a detailed treatment of the NMF approach.

We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E's are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.

The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower dimensional representation that will reproduce the original data table with 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1 with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, the most of the women load on one basis component or the other with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.

It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called "What Can We Learn from the Apps on Your Smartphone?". Those wishing to learn more might want to review the link listed at the end of that post in a note.

Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.

Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, a NMF can get very specific and include as many columns as needed (e.g., thousands of movie and even more music recordings). Furthermore, sparsity is not a problem and most of the rows can be empty.

There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, NMF forms building blocks by jointly clustering many rows and columns. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.

R Code to Produce Plots
fit<-nmf(data_matrix, 2, "lee", nrun=20)
par(mfrow = c(1, 2))
Created by Pretty R at

Friday, December 18, 2015

BayesiaLab-Like Network Graphs for Free with R

My screen has been filled with ads from BayesiaLab since I downloaded their free book. Just as I began to have regrets, I received an email invitation to try out their demo datasets. I was especially interested in their perfume ratings data. In this monadic product test, each of 1,321 French women was presented with only one of 11 perfumes and asked to evaluate on a 10-point scale a series of fragrance-related adjectives along with a few user-imagery descriptors. I have added the 6-point purchase intent item to the analysis in order to assess its position in this network.

Can we start by looking at the partial correlation network? I will refer you to my post on Driver Analysis vs. Partial Correlation Analysis and will not repeat that more detailed overview.

Each of the nodes is a variable (e.g., purchase intent is located on the far right). An edge drawn between any two nodes shows the partial correlation between those two nodes after controlling for all the other variables in the network. The color indicates the sign of the partial correlation with green for positive and red for negative. The size of the partial correlation is indicated by the thickest of the edge.

Simply scanning the map reveals the underlying structure of global connections among even more strongly joined regions:

  • Northwest - In Love / Romantic / Passionate / Radiant,
  • Southwest - Bold / Active / Character / Fulfilled / Trust / Free, 
  • Mid-South - Classical / Tenacious / Quality / Timeless / High End, 
  • Mid-North - Wooded / Spiced, 
  • Center - Chic / Elegant / Rich / Modern, 
  • Northeast - Sweet / Fruity / Flowery / Fresh, and
  • Southeast - Easy to Wear / Please Others / Pleasure. 

Unlike the Probabilistic Structural Equation Model (PSEM) in Chapter 8 of BayesiaLab's book, my network is undirected because I can find no justification for assigning causality. Yet, the structure appears to be much the same for the two analyses, for example, compare this partial correlation network with BayesiaLab's Figure 8.2.3.

All this looks very familiar to those of us who have analyzed consumer rating scales. First, we expect negative skew and high collinearity because consumers tend to give ratings in the upper end of the scale and their responses often are highly intercorrelated. In fact, the first principal component accounted for 64% of the total variation, and it would have been higher had Wooded and Spiced been excluded from the battery.

A more cautious researcher might stop with extracting a single dimension and simply concluding that the women either liked or disliked the perfumes they tested and rated everything either uniformly higher or lower. They would speak of halo effects and question whether any more than an overall score could be extracted from the data. Nevertheless, as we see from the above partial correlation network, there is an interpretable local structure even when all the variables are highly interrelated.

I have discussed this issue before in a post about separating global from specific factors. The bifactor model outlined in that post provides another view into the structure of the perfume rating data. What if there were a global factor explaining what we might call the "halo effect" (i.e., uniformly high correlations among all the variables) and then additional specific factors accounting for the extra correlation among different subsets of variables (e.g., the regions in the above partial correlation network map)?

The bifactor diagram shown below may not be pretty with so many variables to be arrayed. However, you can see the high factor loadings radiating out from the global factor g and how the specific factors F1* through F6* provide a secondary level of local structure corresponding to the regions identified in the above network.

I will end with a technical note. The 1321 observations were nested within the 11 perfumes with each respondent seeing only one perfume. Although we would not expect the specific perfume rated to alter the correlations (factorial invariance), mean-level differences between the perfumes could inflate the correlations calculated over the entire sample. In order to test this, I reran the analysis with deviation scores by subtracting the corresponding mean perfume score from each respondent's original ratings. The results were essentially the same.

R Code Needed to Import CSV File and Produce Plots

# Set working directory and import data file
setwd("C:/directory where file located")
perfume<-read.csv("Perfume.csv", sep=";")
apply(perfume, 2, function(x) table(x,useNA="always"))
# Calculates Sparse Partial Correlation Matrix
sparse_matrix<-EBICglasso(cor(perfume[,2:48]), n=1321)
qgraph(sparse_matrix, layout="spring", 
       label.scale=FALSE, labels=names(perfume)[2:48],
       label.cex=1, node.width=.5)
# Purchase Intent Not Included
omega(perfume[,3:48], nfactors=6)
Created by Pretty R at

Thursday, December 10, 2015

Attitudes Modeled as Networks

In case you missed it, Jonas Dalege and his colleagues at the PsychoSystems research group have recently published an article in Psychological Review detailing how attitudes can be represented as network graphs. It is all done using R and a dataset that can be downloaded by registering at the ANES data center. You will find the R code under Scripts and Code in a file called ANES 1984 Analyses. With very minor changes to the size of some labeling, I was able to reproduce the above undirected graph with two R packages: IsingFit and qgraph. As usual when downloading others' files, most of the R code is data munging and deals with assigning labels and transforming ratings into dichotomies.

The above graph represents the conditional independence relationships among node pairings. Specifically, edges are drawn between pairs of nodes only if they are still related after controlling for all the other nodes not in that pair. The center nodes in red are assessments of Ronald Reagan's ability, decency and caring. The groupings of the red nodes seem reasonable, for example, the thicker green edges connected knowledgeable, hard-working, decent and moral. Similarly, in touch, understands and cares are also drawn together by stronger relationships. These evaluative judgments are joined by positive green edges to the respondents' feelings of pride and hope (blue nodes). Moreover, they are pushed away by negative red pathways from darker emotional reactions such as fear, anger and disgust (green nodes).

One should not be surprised to learn that it makes a difference whether the attitudes are scored dichotomously (e.g., yes/no, agree/disagree or present/absent) or using some ordinal rating scale. If it helps, you can think of this as you might the distinction between regression (continuous) and classification (discrete) in statistical learning theory. Thus, when I analyzed a set of mobile phone ratings gathered with 10-point scales, I borrowed a graphical lasso model called EBICglasso from the qgraph R package (see Undirected Graphs When the Causality is Mutual). On the other hand, the Ising model from the IsingFit R package was needed when the data came from yes/no checklists (see The Network Underlying Consumer Perceptions of the European Car Market).