Wednesday, May 25, 2016

Using Support Vector Machines as Flower Finders: Name that Iris!

Nature field guides are filled with pictures of plants and animals that teach us what to look for and how to name what we see. For example, a flower finder might display pictures of different iris species, such as the illustrations in the plot below. You have in hand your own specimen from your garden, and you carefully compare it to each of the pictures until you find a good-enough match. The pictures come from Wikipedia, but the data used to create the plot are from the R dataset iris: sepal and petal length and width measured on 150 flowers equally divided across three species.

I have lifted the code directly from the documentation for the svm function in the R package e1071.
library(e1071)
data(iris)
attach(iris)
 
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
 
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y) 
 
print(model)
summary(model)
 
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
 
# Check accuracy:
table(pred, y)
 
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
 
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])

We will focus on the last block of R code that generates the metric multidimensional scaling (MDS) of the distances separating the 150 flowers calculated from sepal and petal length and width (i.e., the dist function applied to the first four columns of the iris data). Species plays no role in the MDS with the flowers positioned in a two-dimensional space in order to reproduce the pairwise Euclidean distances. However, species is projected onto the plot using color, and the observations acting as support vectors are indicated with plus signs (+).

The setosa flowers are represented by black circles and black plus signs. These points are separated along the first dimension from the versicolor species in red and virginica in green. The second dimension, on the other hand, seems to reflect some within-species sources of differences that do not differentiate among the three iris types.

We should recall that the dist function calculates pairwise distances in the original space without any kernel transformation. The support vectors, on the other hand, were identified from the svm function using a radial kernel and then projected back onto the original observation space. Of course, we can change the kernel, which defaults to "radial" as in this example from the R package. A linear kernel may do just as well with the iris data, as you can see by adding kernel="linear" to the svm function in the above code.
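Swapping kernels is a one-argument change. The sketch below, assuming the e1071 package is installed, refits the model with both kernels and compares training accuracy and support vector counts (the exact counts may vary slightly across package versions):

```r
# Refit the iris SVM with radial and linear kernels and compare.
library(e1071)
data(iris)
model_radial <- svm(Species ~ ., data = iris, kernel = "radial")
model_linear <- svm(Species ~ ., data = iris, kernel = "linear")
mean(fitted(model_radial) == iris$Species)  # training accuracy, radial kernel
mean(fitted(model_linear) == iris$Species)  # training accuracy, linear kernel
model_radial$tot.nSV  # number of support vectors, radial
model_linear$tot.nSV  # number of support vectors, linear
```

Both fits classify the training flowers at a similar accuracy, which is why the linear kernel "may do just as well" here.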


It appears that we do not need all 150 flowers in order to identify the iris species. We know this because the svm function correctly classifies over 97% of the flowers with 51 support vectors (also called "landmarks" as noted in my last post Seeing Similarity in More Intricate Dimensions). The majority of the +'s are located between the two species with the greatest overlap. I have included the pictures so that the similarity of the red and green categories is obvious. This is where there will be confusion, and this is where the only misclassifications occur. If your iris is a setosa, your identification task is relatively easy and over quickly. But suppose that your iris resembles those in the cluster of red and green pluses between versicolor and virginica. This is where the finer distinctions are being made.
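To see where the support vectors and the confusion concentrate, we can tabulate them by species — a sketch assuming the e1071 package:

```r
# Which flowers become support vectors, and where do errors occur?
library(e1071)
data(iris)
model <- svm(Species ~ ., data = iris)
table(iris$Species[model$index])    # support vectors by species
table(fitted(model), iris$Species)  # confusion table on the training data
```

The support vectors cluster in versicolor and virginica, the two overlapping species, and the few misclassifications sit on that same boundary.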

By design, this analysis was kept brief to draw an analogy between support vector machines and finder guides that we have all used to identify unknown plants and animals in the wild. Hopefully, it was a useful comparison that will help you understand how we classify new observations by measuring their distances in a kernel metric from a more limited set of support vectors (a type of global positioning with a minimal number of landmarks or exemplars as satellites).

When you are ready with your own data, you can view the videos from Chapter 9 of An Introduction to Statistical Learning with Applications in R to get a more complete outline of all the steps involved. My intent was simply to disrupt the feature mindset that relies on the cumulative contributions of separate attributes (e.g., the relative impact of each independent variable in a prediction equation). As objects become more complex, we stop seeing individual aspects and begin to bundle features into types or categories. We immediately recognize the object by its feature configuration, and these exemplars or landmarks become the new basis for our support vector representation.

Monday, May 23, 2016

The Kernel Trick in Support Vector Machines: Seeing Similarity in More Intricate Dimensions

The "kernel" is the seed or the essence at the heart or the core, and the kernel function measures distance from that center. In the following example from Wikipedia, the kernel is at the origin and the different curves illustrate alternative depictions of what happens as we move away from zero.


At what temperature do you prefer your first cup of coffee? If we center the scale at that temperature, how do we measure the effects of deviations from the ideal level? The uniform kernel function tells us that closeness to the optimum makes little difference as long as the temperature stays within a certain range. You might feel differently; perhaps it is a constant rate of disappointment as you move away from the best temperature in either direction (a triangular kernel function). However, for most of us, satisfaction takes the form of exponential decay, with a Gaussian kernel describing our preferences as we deviate from the very best.
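These three shapes are easy to write down. Below is a base R sketch with hypothetical kernel definitions (the scaling constants follow one common convention; only the shapes matter for the argument):

```r
# Three ways to discount a deviation u from the ideal point at 0.
uniform    <- function(u, h = 1) ifelse(abs(u) <= h, 0.5 / h, 0)  # flat within a range
triangular <- function(u, h = 1) pmax(1 - abs(u) / h, 0) / h      # constant rate of decline
gaussian   <- function(u) dnorm(u)                                # smooth exponential decay
u <- seq(-2, 2, by = 0.5)
round(rbind(uniform = uniform(u), triangular = triangular(u), gaussian = gaussian(u)), 3)
```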

Everitt and Hothorn show how it is done in R for density estimation. Of course, the technique works with any variable, not just preference or distance from the ideal. Moreover, the logic is the same: give greater weight to closer data. And how does one measure closeness? You have many alternatives, as shown above, varying from tolerant to strict. What counts as the same depends on your definition of sameness. With human vision the person retains their identity and our attention as they walk from the shade into the sunlight; my old camera has a different kernel function and fails to keep track or focus correctly. In addition, when the density being estimated is multivariate, you have the option of differential weighting of each variable so that some aspects will count a great deal and others can be ignored.
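Base R's density function makes the kernel choice explicit. A small sketch with simulated two-bump data, holding the bandwidth fixed so that only the kernel shape differs:

```r
# Kernel density estimation: same data, three kernels, fixed bandwidth.
set.seed(1)
x <- c(rnorm(50, -2), rnorm(50, 2))  # a toy sample with two bumps
d_gauss <- density(x, kernel = "gaussian",    bw = 0.5)
d_tri   <- density(x, kernel = "triangular",  bw = 0.5)
d_rect  <- density(x, kernel = "rectangular", bw = 0.5)
# All three estimates integrate to approximately one; they differ in smoothness.
sum(d_gauss$y) * diff(d_gauss$x[1:2])
```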

Now, with the preliminaries over, we can generalize the kernel concept to support vector machines (SVMs). First, we will expand our feature space because the optimal cup of coffee depends on more than its temperature (e.g., preparation method, coffee bean storage, fineness and method of grind, ratio of coffee to water, and don't forget the type of bean and its processing). You tell me the profile of two coffees using all those features that we just enumerated, and I will calculate their pairwise similarity. If their profiles are identical, the two coffees are the same and their distance is zero. But if they are not identical, how important are the differences? Finally, we ought to remember that differences are measured with respect to satisfaction, that is, two equally pleasing coffees may have different profiles, but those differences are not relevant.
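That pairwise similarity calculation can be made concrete with a Gaussian (radial) kernel. The coffee profiles and the gamma value below are made up for illustration:

```r
# Gaussian (radial basis) kernel: identical profiles score 1,
# and the score decays toward 0 as the profiles diverge.
rbf <- function(a, b, gamma = 0.5) exp(-gamma * sum((a - b)^2))
coffee1 <- c(temp = 0.9, strength = 0.7, grind = 0.5)  # hypothetical profiles
coffee2 <- c(temp = 0.9, strength = 0.7, grind = 0.5)
coffee3 <- c(temp = 0.2, strength = 0.9, grind = 0.1)
rbf(coffee1, coffee2)  # identical profiles: similarity of 1
rbf(coffee1, coffee3)  # different profiles: similarity below 1
```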

As the Mad Hatter explained in the last post, SVMs live in the observation space, in this case, among all the different cups of coffees. We will need a data matrix with a bunch of coffees for taste testing in the rows and all those features as columns, plus an additional column with a satisfaction rating or at least a thumbs-up or thumbs-down. Keeping it simple, we will stay with a classification problem distinguishing good from bad coffees. Can I predict your coffee preference from those features? Unfortunately, individual tastes are complex, and that strong coffee may be great for some but only when hot. What of those who don't like strong coffee? It is as if we had multiple configurations of interacting nonlinear features with many more dimensions than can be represented in the original feature space.

Our training data from the taste tests might contain actual coffees near each of these configurations differentiating the good and the bad. These are the support vectors of SVMs, what Andrew Ng calls "landmarks" in his Coursera course and his more advanced class at Stanford. In this case, the support vectors are actual cups of coffee that you can taste and judge as good or bad. Chapter 9 of An Introduction to Statistical Learning will walk you through the steps, including how to run the R code, but you might leave without a good intuitive grasp of the process.

It would help to remember that a logistic regression equation and the coefficients from a discriminant analysis yield a single classification dimension when you have two groupings. What happens when there are multiple ways to succeed or fail? I can name several ways to prepare different types of coffee, and I am fond of them all. Similarly, I can recall many ways to ruin a cup of coffee. Think of each as a support vector from the training set and the classification function as a weighted similarity to instances from this set. If a new test coffee is similar to one called "good" from the training data, we might want to predict "good" for this one too. The same applies to coffees associated with the "bad" label.

The key is the understanding that the features from our data matrix are no longer the dimensions underlying this classification space. We have redefined the basis in terms of landmarks or support vectors. New coffees are placed along dimensions defined by previous training instances. As Pedro Domingos notes (at 33 minutes into the talk), the algorithm relies on analogy, not unlike case-based reasoning. Our new dimensions are more intricate compressed representations of the original features. If this reminds you of archetypal analysis, then you may be on the right track or at least not entirely lost.
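The change of basis can be sketched directly: a new observation's coordinates become its kernel similarities to each landmark. The landmarks below are invented for illustration, not fitted from data:

```r
# Re-express a new observation in the landmark (support vector) basis.
rbf <- function(a, b, gamma = 1) exp(-gamma * sum((a - b)^2))
landmarks <- rbind(good1 = c(0.8, 0.6),   # hypothetical training instances
                   good2 = c(0.7, 0.9),
                   bad1  = c(0.1, 0.2))
new_obs <- c(0.75, 0.65)
# The derived "features" of new_obs are not its raw inputs
# but its similarities to each landmark:
apply(landmarks, 1, function(l) rbf(new_obs, l))
```

A weighted sum of these similarities, not of the raw features, is what drives the classification.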

Monday, May 9, 2016

The Mad Hatter Explains Support Vector Machines

"Hatter?" asked Alice, "Why are support vector machines so hard to understand?" Suddenly, before you can ask yourself why Alice is studying machine learning in the middle of the 19th century, the Hatter disappeared. "Where did he go?" thought Alice as she looked down to see a compass painted on the floor below her. Arrows pointed in every direction with each one associated with a word or phrase. One arrow pointed toward the label "Tea Party." Naturally, Alice associated Tea Party with the Hatter, so she walked in that direction and ultimately found him.

"And now," the Hatter said while taking Alice's hand and walking through the looking glass. Once again, the Hatter was gone. This time there was no compass on the floor. However, the room was filled with characters, some that looked more like Alice and some that seemed a little closer in appearance to the Hatter. With so many blocking her view, Alice could see clearly only those nearest to her. She identified the closest resemblance to the Hatter and moved in that direction. Soon she saw another that might have been his relative. Repeating this process over and over again, she finally found the Mad Hatter.

Alice did not fully comprehend what the Hatter told her next. "The compass works only when the input data separates Hatters from everyone else. When it fails, you go through the looking glass into the observation space where all we have is resemblance or similarity. Those who know me will recognize me and all that resemble me. Try relying on a feature like red hair and you might lose your head to the Red Queen. We should have some tea with Wittgenstein and discuss family resemblance. It's derived from features constructed out of input that gets stretched, accentuated, masked and recombined in the most unusual ways."

The Hatter could tell that Alice was confused. Reassuringly, he added, "It's an acquired taste that takes some time. We know we have two classes that are not the same. We just can't separate them from the data as given. You have to look at it in just the right way. I'll teach you the Kernel Trick." The Mad Hatter could not help but laugh at his last remark - looking at it in just the right way could be the best definition of support vector machines.

Note: Joseph Rickert's post in R Bloggers shows you the R code to run support vector machines (SVMs) along with a number of good references for learning more. My little fantasy was meant to draw some parallels between the linear algebra and human thinking (see Similarity, Kernels, and the Fundamental Constraints on Cognition for more). Besides, Tim Burton will soon be releasing his movie Alice Through the Looking Glass, and the British Library is celebrating 150 years since the publication of Alice in Wonderland. Both Alice and SVMs invite you to go beyond the data as inputted and derive "impossible" features that enable differentiation and action in a world at first unseen.

Sunday, April 3, 2016

When Choice Modeling Paradigms Collide: Features Presented versus Features Perceived

What is the value of a product feature? Within a market-based paradigm, the answer is the difference between revenues with and without the feature. A product can be decomposed into its features, each feature can be assigned a monetary value by including price in the feature list, and the final worth of the product is a function of its feature bundle. The entire procedure is illustrated in an article using the function rhierMnlMixture from the R package bayesm (Economic Valuation of Product Features). Although much of the discussion concentrates on a somewhat technical distinction between willingness-to-pay (WTP) and willingness-to-buy (WTB), I wish to focus instead on the digital camera case study in Section 6 beginning on page 30. If you have questions concerning how you might run such an analysis in R, I have two posts that might help: Let's Do Some Hierarchical Bayes Choice Modeling in R and Let's Do Some More Hierarchical Bayes Choice Modeling in R.

As you can see, the study varies seven factors, including price, but the goal is to estimate the economic return from including a swivel screen on the back of the digital camera. Following much the same procedure as that outlined in those two choice modeling posts mentioned in the last paragraph, each respondent saw 16 hypothetical choice sets created using a fractional factorial experimental design. There was a profile associated with each of the four brands, and respondents were asked to first select the one they most preferred and then if they would buy their most preferred brand at a given price.

The term "dual response" has become associated with this approach, and several choice modelers have adopted the technique. If the value of the swivel screen is well-defined, it ought not matter how you ask these questions, and that seems to be confirmed by some in the choice modeling community. However, outside the laboratory and in the field, commitment or stated intention is the first step toward behavior change. Furthermore, the mere-measurement effect in survey research demonstrates that questioning by itself can alter preferences. Within the purchase context, consumers do not waste effort deciding which of the rejected alternatives is the least objectionable by attending to secondary features after failing to achieve consideration on one or more deal breakers (i.e., the best product they would not buy). Actually, dual response originates as a sales technique because encouraging commitment to one of the offerings increases the ultimate purchase likelihood.

We have our first collision. Order effects are everywhere. It is one of the most robust findings in measurement. The political pollster wants to know how big a sales tax increase could be passed in the next election. You get a different answer when you ask about a one-quarter percent increase followed by one-half percent than when you reverse the order. Perceptual contrast is unavoidable so that one-half seems bigger after the one-quarter probe. I do not need to provide a reference because everyone is aware of order as one of the many context effects. The feature presented is not the feature perceived.

Our second collision occurs from the introduction of price as just another feature, as if in the marketplace no one ever asks why one brand is more expensive than another. We ask because price is both a sacrifice with a negative impact and a signal of quality with a positive weight. In fact, as one can see from the pricing literature, there is nothing simple or direct about price perception. Careful framing may be needed (e.g., maintaining package size but reducing the amount without changing price). Otherwise, the reactions can be quite dramatic, for price increases can trigger attributions concerning the underlying motivation and can generate a strong emotional response (e.g., price fairness).


At times, the relationship between the feature presented and the feature perceived can be more nuanced. It would be reasonable to vary gasoline prices in terms of cost per unit of measurement (e.g., dollars per gallon or euros per liter). Yet, the SUV driver seems to react in an all-or-none fashion only when some threshold on the cost to fill up their tank has been exceeded. What is determinant is not the posted price but the total cost of the transaction. Thus, price sensitivity is a complex nonlinear function of cost per unit depending on how often one fills up with gasoline and the size of that tank. In addition, the pain at the pump depends on other factors that fail to make it into a choice set. How long will the increases last? Are higher prices seen as fair? What other alternatives are available? Sometimes we have no option but to live with added costs, reducing our dissonance by altering our preferences.

We see none of this reasoning in choice modeling where the alternatives are described as feature bundles outside of any real context. The consumer "plays" the game as presented by the modeler. Repeating the choice exercise with multiple choice sets only serves to induce a "feature-as-presented" bias. Of course, there are occasions when actual purchases look like choice models. We can mimic repetitive purchases from the retail shelf with a choice exercise, and the same applies to online comparison shopping among alternatives described by short feature lists as long as we are careful about specifying the occasion and buyers do not search for user comments.

User comments bring us back to the usage occasion, which tends to be ignored in choice modeling. Reading the comments, we note that one customer reports the breakage of the hinge on the swivel screen after only a few months. Is the swivel screen still an advantage or a potential problem waiting to occur? We are not buying the feature, but the benefit that the feature promises. This is the scene of another paradigm collision. The choice modeler assumes that features have value that can be elicited by merely naming the feature. They simplify the purchase task by stripping out all contextual information. Consequently, the resulting estimates work within the confines of their preference elicitation procedures, but do not generalize to the marketplace.

We have other options in R, as I have suggested in my last two posts. Although the independent variables in a choice model are set by the researcher, we are free to transform them, for instance, compute price as a logarithm or fit low-order polynomials of the original features. We are free to go farther. Perceived features can be much more complex and constructed as nonlinear latent variables from the original data. For example, neural networks enable us to handle a feature-rich description of the alternatives and fit adaptive basis functions with hidden layers.
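In base R this basis expansion is a matter of rewriting the model formula. A sketch on simulated data, where the true utility depends on the logarithm of price:

```r
# Expanding the regression basis: log(price) plus a quadratic in quality.
set.seed(2)
price   <- runif(100, 1, 10)
quality <- rnorm(100)
utility <- 2 - 1.5 * log(price) + 0.8 * quality + rnorm(100, sd = 0.3)
fit <- lm(utility ~ log(price) + poly(quality, 2))
coef(fit)  # recovers a negative weight on log(price)
```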

On the other hand, I have had some success exploiting the natural variation within product categories with many offerings (e.g., recommender systems for music, movies, and online shopping like Amazon). By embedding measurement within the actual purchase occasion, we can learn the when, why, what, how and where of consumption. We might discover the limits of a swivel screen in bright sunlight or when only one hand is free. The feature that appeared so valuable when introduced in the choice model may become a liability after reading users' comments.

Features described in choice sets are not the same features that consumers consider when purchasing and imagining future usage. This more realistic product representation requires that we move from those R packages that restrict the input space (choice modeling) to those R packages that enable the analysis of high-dimensional sparse matrices with adaptive basis functions (neural networks and matrix factorization).

Bottom Line: The data collection process employed to construct and display options when repeated choice sets are presented one after another tends to simplify the purchase task and induce a decision strategy consistent with regression models we find in several R packages (e.g., bayesm, mlogit, and Rchoice). However, when the purchase process involves extensive search over many offerings (e.g., music, movies, wines, cheeses, vacations, restaurants, cosmetics, and many more) or multiple usage occasions (e.g., work, home, daily, special events, by oneself, with others, involving children, time of day, and other contextual factors), we need to look elsewhere within R for statistical models that allow for the construction of complex and nonlinear latent variables or hidden layers that serve as the derived input for decision making (e.g., R packages for deep learning or matrix factorization).

Friday, March 25, 2016

Choice Modeling with Features Defined by Consumers and Not Researchers

Choice modeling begins with a researcher "deciding on what attributes or levels fully describe the good or service." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.

As I have argued in the previous post, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?

With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (see slide #2 from Yann LeCun's 2013 tutorial with video here). From this perspective, choice modeling is a shallower form of information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (the R package support.CEs aids in this process and you can find R code for hierarchical Bayes using bayesm in this link).


Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.

Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes.

R does offer an interface to deep learning algorithms. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).

Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variation on flakes, crunch or hot cereals). The joint blocks define the hidden layers simultaneously clustering respondents and typing products.

Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in a post on product category representation. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a data matrix reconstructed from parts that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.
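A minimal sketch of this dual factor analysis, assuming the NMF package is installed and using a toy respondents-by-cereals matrix in place of real interest ratings:

```r
# Nonnegative matrix factorization: one set of hidden features
# with loadings for both rows (respondents) and columns (cereals).
library(NMF)
set.seed(3)
ratings <- matrix(runif(20 * 10, 0.1, 5), nrow = 20)  # toy respondent x cereal matrix
fit <- nmf(ratings, rank = 3)
W <- basis(fit)  # respondent loadings on the hidden features (20 x 3)
H <- coef(fit)   # cereal loadings on the same hidden features (3 x 10)
range(W); range(H)  # all loadings are nonnegative
```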

We might expect to find something similar to what Jonathan Gutman reported from a qualitative study using a means-end analysis. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the crunchy cereals loading on that hidden feature, with the respondents who want crunch showing higher weights on the same feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.


Monday, March 21, 2016

Understanding Statistical Models Through the Datasets They Seek to Explain: Choice Modeling vs. Neural Networks

R may be the lingua franca, yet many of the packages within the R library seem to be written in different languages. We can follow the R code because we know how to program but still feel that we have missed something in the translation.

R provides an open environment for code from different communities, each with their own set of exemplars, where the term "exemplar" has been borrowed from Thomas Kuhn's work on normal science. You need only to examine the datasets that each R package includes to illustrate its capabilities in order to understand the diversity of paradigms spanned. As an example, the datasets from the Clustering and Finite Mixture Task View demonstrate the dependence of the statistical models on the data to be analyzed. Those seeking to identify communities in social networks might be using similar terms as those trying to recognize objects in visual images, yet the different referents (exemplars) change the meanings of those terms.

Thinking in Terms of Causes and Effects

Of course, there are exceptions, for instance, regression models can be easily understood across applications as the "pulling of levers" especially for those of us seeking to intervene and change behavior (e.g., marketing research). Increased spending on advertising yields greater awareness and generates more sales, that is, pulling the ad spending lever raises revenue (see the R package CausalImpact). The same reasoning underlies choice modeling with features as levers and purchase as the effect (see the R package bayesm).


The above picture captures this mechanistic "pulling the lever" that dominates much of our thinking about the marketing mix. The exemplar "explains" through analogy. You might prefer "adjusting the dials" as an updated version, but the paradigm remains cause-and-effect with each cause separable and under the control of the marketer. Is this not what we mean by the relative contribution of predictors? Each independent variable in a regression equation has its own unique effect on the outcome. We pull each lever a distance of one standard deviation (the beta weight), sum the changes on the outcome (sometimes these betas are squared before adding), and then divide by the total.
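That arithmetic is short enough to write out. A sketch on simulated data with two standardized predictors:

```r
# Relative contribution of predictors from standardized betas.
set.seed(4)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(200)
b <- coef(lm(scale(y) ~ scale(x1) + scale(x2)))[-1]  # standardized betas
b^2 / sum(b^2)  # squared betas divided by the total: each lever's share
```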

The Challenge from Neural Networks

So, how do we make sense of neural networks and deep learning? Is the R package neuralnet simply another method for curve fitting or estimating the impact of features? Geoffrey Hinton might think differently. The Intro Video for Coursera's Neural Networks for Machine Learning offers a different exemplar - handwritten digit recognition. If he is curve fitting, the features are not given but extracted so that learning is possible (i.e., the features are not obvious but constructed from the input to solve the task at hand). The first chapter of Michael Nielsen's online book, Using Neural Nets to Recognize Handwritten Digits, provides the details. Isabelle Guyon's pattern recognition course adds an animated gif displaying visual perception as an active process.
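The classic toy illustration of constructed features is XOR, where no single input is informative on its own but a hidden layer can build a feature that is. A sketch assuming the neuralnet package is installed (convergence on so few cases can depend on the random seed):

```r
# A tiny neural net learns XOR, a pattern no single input feature carries.
library(neuralnet)
xor_data <- data.frame(x1 = c(0, 0, 1, 1),
                       x2 = c(0, 1, 0, 1),
                       y  = c(0, 1, 1, 0))
set.seed(5)
net <- neuralnet(y ~ x1 + x2, data = xor_data, hidden = 2,
                 stepmax = 1e5, linear.output = FALSE)
# The hidden units are features constructed from the input, not given with it.
round(compute(net, xor_data[, c("x1", "x2")])$net.result, 2)
```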


On the other hand, a choice model begins with the researcher deciding what features should be varied. The product space is partitioned and presented as structured feature lists. What alternative does the consumer have, except to respond to variations in the feature levels? I attend to price because you keep changing the price. Wider ranges and greater variation only focus my attention. However, in real settings the shelves and the computer screens are filled with competing products waiting for consumers to define their own differentiating features. Smart Watches from Google Shopping provides a clear illustration of the divergence of purchase processes in the real world and in the laboratory.

To be clear, when the choice model and the neural network speak of input, they are referring to two very different things. The exemplars from choice modeling are deciding how best to commute and comparing a few offers for the same product or service. This works when you are choosing between two cans of chicken soup by reading the ingredients on their labels. It does not describe how one selects a cheese from the huge assortment found in many stores.

Neural networks take a different view of the task. In less than five minutes, Hinton's video provides the exemplar for representation learning. Input enters as it does in real settings. Features that successfully differentiate among the digits are learned over time. We see that learning in the video when the neural net generates its own handwritten digits for the numbers 2 and 8. It is not uncommon to write down a number that later we or others have difficulty reading. Legibility is valued, so we can say that an easier-to-read "2" is preferred over a "2" that is harder to identify. But what makes one "2" a better two than another "2" takes some training, as machine learning teaches us.

We are all accomplished at number recognition and forget how much time and effort it took to reach this level of understanding (unless we know young children in the middle of the learning process). What year is MCMXCIX? The letters are important, but so are their relative positions (e.g. X=10 and IX=9 in the year 1999). We are not pulling levers any more, at least not until the features have been constructed. What are those features in typical choice situations? What you want to eat for breakfast, lunch or dinner (unless you snack instead) often depends on your location, available time and money, future dining plans, motivation for eating, and who else is present (context-aware recommender systems).

Adopting a different perspective, our choice modeler sees the world as well-defined and decomposable into separate factors that can be varied systematically according to some experimental design. Under such constraints the consumer behaves as the model predicts (a self-fulfilling prophecy?). Meanwhile, in the real world, consumers struggle to learn a product representation that makes choice possible.

Thinking Outside the Choice Modeling Box

The features we learn may be relative to the competitive set, which is why adding a more expensive alternative makes what is now the mid-priced option appear less expensive. Situation plays an important role, for the movie I view when alone is not the movie I watch with my kids. Framing has an impact, which is why advertising tries to convince you that an expensive purchase is a gift you give to yourself. Moreover, we cannot forget intended usage, for that smartphone is a camera, a GPS, and I believe you get the point. We may have many more potential features than are included in our choice design.

It may be the case that the final step before purchase can be described as a tradeoff among a small set of features varying over only a few alternatives in our consideration set. If we can mimic that terminal stage with a choice model, we might have a good chance to learn something about the marketplace. How did the consumer get to that last choice point? Why these features and those alternative products or services? In order to answer such questions, we will need to look outside the choice modeling box.

Friday, January 8, 2016

A Data Science Solution to the Question "What is Data Science?"

As this flowchart from Wikipedia illustrates, data science is about collecting, cleaning, analyzing and reporting data. But is it data science or just a "sexed up term" for Statistics (see the embedded quote by Nate Silver)? It's difficult to separate the two at this level of generality, so perhaps we need to define our terms.


We begin by making a list of all the stuff that a data scientist might do or know. We are playing a game where the answer is "data scientist" and the questions are "Do they do this?" and "Do they know that?". However, the "this" and the "that" are very specific. For example, "Data is Processed" can range from simple downloading to the complex representation of visual or speech input. What precisely does a data scientist do when they process data that a programmer or a statistician does not do?

To be clear, I am constructing a very long questionnaire that I intend to distribute to individuals calling themselves data scientists along with everyone else claiming that they too do data science, although by another name. A checklist will work in our game of Twenty Questions as long as the list is detailed and exhaustive. You are welcome to add suggestions as comments to this post, but we can start by expanding on each of the boxes in the above data science flowchart.

Since I am a marketing researcher, I am inclined to analyze the resulting data matrix as if it were a shopping cart filled with items purchased from a grocery store or an inventory of downloads from a video or music provider. The rows are respondents, and the columns are all the questions that might be asked to distinguish among all the various players. Let's not include sexy as a column.

You may have guessed that I am headed toward some type of matrix factorization. Can we recognize patterns in the columns that reflect different configurations of study and behavior? Are there communities composed of rows clustered together with similar practices and experiences? R provides most of us who have some experience running factor and cluster analyses with a "doable" introduction to non-negative matrix factorization (NMF). You can think of it as simultaneous clustering of the rows and columns in a data matrix. My blog is filled with examples, none of which are easy, but none of which are incomprehensible or beyond your ability to adapt to your own datasets.
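As a toy illustration (not from any actual questionnaire data), here is a minimal NMF sketch on a synthetic 0/1 checklist matrix. The planted block structure and the rank of 2 are assumptions made purely for the example.

```r
# A minimal NMF sketch: simultaneous clustering of rows (respondents)
# and columns (checklist questions) in a synthetic 0/1 data matrix.
# Assumes install.packages("NMF") has been run.
library(NMF)
set.seed(1)
X <- matrix(rbinom(200, 1, 0.3), nrow = 20)  # 20 respondents, 10 questions
X[1:10, 1:5] <- rbinom(50, 1, 0.9)           # plant a block of shared practices
X <- X[rowSums(X) > 0, ]                     # NMF requires no all-zero rows
fit <- nmf(X, rank = 2)
W <- basis(fit)   # respondent mixing weights (row communities)
H <- coef(fit)    # question loadings (column configurations)
```

The basis matrix W groups respondents with similar response profiles, while the coefficient matrix H reveals which questions travel together, which is the simultaneous row-and-column clustering described above.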

What are we likely to find? Will we discover something like anchor words from topic modeling? For instance, is it necessary to work with multiple datasets from different disciplines to be a data scientist? Would I stop calling myself a marketing scientist if I started working with political polling data? Some argue that one becomes a statistician when they begin consulting with others from divergent fields of study.

What about teaching students with varied backgrounds in universities or industry? Do we call it data science if one writes and distributes software that others can apply with data across diverse domains? Does proving theorems make one a statistician? How many languages must one know before they are a programmer? What role does computation play when making such discriminations?

What will we learn from dissecting the "corpus" (the detailed body of what we do and know summarized by the boxes in the above data science process)? Extending this analogy, I am recommending that the physician heal thyself: apply data science methodology to answer the "What is Data Science?" question.

Hopefully, we can avoid the hype and the caricature from the popular press (sexiest job of 21st century). Moreover, I suggest that we resist the tendency to think metaphorically in terms of contrasting ideals. The simple act of comparing statisticians and data scientists shapes our perceptions and leads us to see the two as more dissimilar than suggested by their training and behavior. The distinction may be more nuance than substance, reflecting what excites and motivates rather than what is known or done. The basis for separation may reside in how much personal satisfaction is derived from the subject matter or the programming rather than the computational algorithm or the generative model.