Sunday, July 8, 2012

Network Visualization of Key Driver Analysis

Whatever happened to those evaluations that your airline asked you to complete after taking a flight? They ask for a number of ratings about buying your ticket, the attributes of the plane, and the service you received, along with whether you were satisfied, whether you would recommend the airline, and whether you would fly with them again.

The airline is certainly concerned about tracking changes in these ratings over time. But they might also be interested in increasing customer loyalty (i.e., satisfaction, recommendation, and repeat purchase). For the latter, the airline might request a "key driver analysis." The term "driver analysis" is used because the airline is looking for a marketing strategy that will increase loyalty. The word "key" is used because the airline wants to find the drivers with the biggest impact. "What is the one thing that we could do to increase customer loyalty?"

Here is one answer -- a network visualization of the correlations among all the ratings. It can be produced using the R statistical programming language in one line of code. I claim that clients find the network engaging. That is, anyone can look at the figure below and quickly see the interrelationships among the ratings and what is driving the different manifestations of customer loyalty. It is an easy picture to understand and helps clients to think strategically. Let's see if I can support that claim.
 
What is it? It is a mapping of the correlations among 15 ratings, with colors added to mark the groups of ratings with the highest mutual intercorrelations. The nodes are the ratings. The lines are the correlations. Here we only show lines for correlations above a specified cutoff value.

The greater the correlation between two variables, the thicker the line. So the aqua or light blue nodes are interconnected by thicker lines because they are all highly correlated with one another. You can think of this as a customer service component. The green circles refer to the aircraft seating and cleanliness. The ticketing process is represented by the red circles, although their lines are thinner, suggesting a less cohesive component than customer service. Finally, the outcomes associated with customer loyalty are shown in purple.
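To make the construction concrete, here is a minimal sketch of the thresholding step (the toy variable names are my own invention, not the survey items): compute the correlation matrix, zero out everything below the cutoff, and what remains defines the edges of the network.

```r
# Minimal sketch of the thresholding behind the network map.
# Toy data: three ratings, two of which are strongly related.
set.seed(42)
service     <- rnorm(200)
courtesy    <- 0.8 * service + rnorm(200, sd = 0.5)  # highly correlated with service
ticket_ease <- rnorm(200)                            # unrelated to the other two

toy <- data.frame(service, courtesy, ticket_ease)
cm  <- cor(toy)

# Drop correlations below the cutoff; the survivors become the edges.
edges <- cm
edges[abs(edges) < 0.50] <- 0
diag(edges) <- 0

edges  # only the service-courtesy edge survives
```

In qgraph, the `minimum = 0.50` argument used below performs this filtering for you, and the width of each surviving line is scaled to the size of the correlation.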

Pressure Points = Driver Analysis

Say the airline asks you how they could increase customer satisfaction. You find the Satisfaction node on the left-hand side, and you look for thick lines leading to it. There are several pathways to increased customer satisfaction. For example, all four customer service ratings have sizeable paths. If we were able to improve customer perceptions of friendliness (apply pressure to the Friendliness node), the effect would spread along its path to Satisfaction. The improvement in perceived Friendliness would also spread to perceptions of Courtesy, Service, and Helpfulness since these are connected as well. In fact, if the airline were to make changes that impacted all four service components at the same time (apply pressure simultaneously to four nodes), they could possibly see even greater improvement. Perhaps we should not be asking, “What is the one thing that will most increase customer loyalty?” but rather, “What is the one area where we should concentrate our efforts?”

Moreover, we can see that the "drivers" of repeat purchase (Fly Again) are different from the drivers of customer satisfaction. Otherwise, Fly Again would be positioned closer to Satisfaction.  In fact, the network visualization makes it obvious that the key drivers change as one moves from Satisfaction to Recommendation to Fly Again.


Comparison to More Traditional Key Driver Analyses

Multiple regression is the most common form of key driver analysis. How did our network map perform relative to regression analysis? Here are the standardized regression coefficients from three separate regressions of Satisfaction, Recommend, and Fly Again on all 12 predictors.

Standardized Regression Weights

                     Sat   Recommend   Fly Again
(Intercept)         0.00        0.00        0.00
Easy Reservation    0.05        0.16        0.12
Preferred Seats     0.04        0.15        0.14
Flight Options      0.05        0.11        0.14
Ticket Prices       0.04        0.06        0.10
Seat Comfort        0.09        0.09        0.03
Seat Roominess      0.07        0.17        0.10
Overhead Storage    0.02        0.19        0.16
Clean Aircraft      0.10        0.15        0.09
Courtesy            0.06        0.00       -0.01
Friendliness        0.15        0.00       -0.01
Helpfulness         0.13       -0.02        0.10
Service             0.14       -0.09        0.05
R-Squared           0.59        0.61        0.63

These coefficients are consistent with what we learned from the network. The largest weights for satisfaction come from the customer service components, while Recommend and Fly Again are influenced more by ticketing and cabin characteristics.
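A quick sanity check on the method, using toy data with hypothetical names: regressing z-scores, as done in the appendix, gives the same answer as rescaling a raw slope by sd(x)/sd(y), which is all "standardized" means here.

```r
# Sketch: a standardized weight is the raw slope rescaled by sd(x)/sd(y).
set.seed(123)
x <- rnorm(500, mean = 7, sd = 1.2)  # a toy rating
y <- 0.6 * x + rnorm(500)            # a toy outcome

raw_beta <- coef(lm(y ~ x))[2]                  # raw slope
std_beta <- coef(lm(scale(y) ~ scale(x)))[2]    # slope on z-scores

# The two routes agree up to numerical error.
abs(std_beta - raw_beta * sd(x) / sd(y))
```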

Warning, it's caveat time. The data are observational. We do not know the causal connections among these ratings. Does friendliness impact satisfaction? Or does satisfaction make it less likely that customers will give lower friendliness ratings?

Appendix




I have used an R package called qgraph to produce the network visualization.  You call the package using the command library(qgraph) and create the network graph using the following code:

gr<-list(1:4,5:8,9:12,13:15)
qgraph(cor(ratings),layout="spring", groups=gr, labels=names(ratings), label.scale=FALSE, minimum=0.50)

The 15 ratings are located in a data frame called ratings.  The map uses the correlation matrix among the 15 ratings as the proximity matrix to create the network.  We have asked for the “spring” layout, which has the effect of placing more highly correlated variables near each other and away from less or negatively correlated variables.

The author is Sacha Epskamp. He works with the PsychoSystems Project. Either of the following two links will take you to web sites that explain qgraph and the network visualization approach in greater depth: http://sachaepskamp.com/ or http://www.psychosystems.org/.

Libraries like qgraph are one of the great strengths of the R programming language. The PsychoSystems Project is a group of programmers and researchers attempting to introduce a new metaphor for understanding the basis of psychological measurement. You should use either link to learn about their work. For us, however, what is important is that Sacha Epskamp has put in a considerable amount of time and effort to create a single line of code that generates exactly the type of graph one would want when showing the interrelationships among a set of ratings.

Finally, the ratings data set was randomly generated using a specific factor model.  It was not essential to our discussion for the reader to know this because the simulated data set mimics the structure that underlies most of the satisfaction data sets one finds in marketing research.  I have seen this structure over and over again from customer satisfaction surveys across markets, including my research with the airlines. However, in order to reproduce the analysis shown in this posting, you will need to run the necessary R code.  I have listed everything you will need below.

R Code to Generate the Simulated Data and Run All Analyses
# The goal is to show all the R code that you would need
# to reproduce everything that has been reported.
# We will use the mvtnorm package in order to randomly
# generate a data set with a given correlation pattern.

# First, we create a matrix of factor loadings.
# This pattern is called bifactor because it has a
# general factor with loadings from all the items
# and specific factors for separate components.
# The outcome variables are also formed as
# combinations of these general and specific factors.

loadings <- matrix(c(
.33, .58, .00, .00,  # Ease of Making Reservation
.35, .55, .00, .00,  # Availability of Preferred Seats
.30, .52, .00, .00,  # Variety of Flight Options
.40, .50, .00, .00,  # Ticket Prices
.50, .00, .55, .00,  # Seat Comfort
.41, .00, .51, .00,  # Roominess of Seat Area
.45, .00, .57, .00,  # Availability of Overhead Storage
.32, .00, .54, .00,  # Cleanliness of Aircraft
.35, .00, .00, .50,  # Courtesy
.38, .00, .00, .57,  # Friendliness
.60, .00, .00, .50,  # Helpfulness
.52, .00, .00, .58,  # Service
.43, .10, .20, .30,  # Overall Satisfaction
.35, .50, .40, .20,  # Purchase Intention
.25, .50, .50, .00), # Willingness to Recommend
nrow=15,ncol=4, byrow=TRUE)

# Matrix multiplication produces the correlation matrix,
# except for the diagonal.
cor_matrix<-loadings %*% t(loadings)
# Diagonal set to ones.
diag(cor_matrix)<-1

library(mvtnorm)
N=1000
set.seed(7654321) #needed in order to reproduce the same data each time
std_ratings<-as.data.frame(rmvnorm(N, sigma=cor_matrix))

# Creates a mixture of two data sets:
# first 50 observations assigned uniformly lower scores.
ratings<-data.frame(matrix(rep(0,15000),nrow=1000))
ratings[1:50,]<-std_ratings[1:50,]*2
ratings[51:1000,]<-std_ratings[51:1000,]*2+7.0

# Ratings given different means
ratings[1]<-ratings[1]+2.2
ratings[2]<-ratings[2]+0.6
ratings[3]<-ratings[3]+0.3
ratings[4]<-ratings[4]+0.0
ratings[5]<-ratings[5]+1.5
ratings[6]<-ratings[6]+1.0
ratings[7]<-ratings[7]+0.5
ratings[8]<-ratings[8]+1.5
ratings[9]<-ratings[9]+2.4
ratings[10]<-ratings[10]+2.2
ratings[11]<-ratings[11]+2.1
ratings[12]<-ratings[12]+2.0
ratings[13]<-ratings[13]+1.5
ratings[14]<-ratings[14]+1.0
ratings[15]<-ratings[15]+0.5

# Truncates Scale to be between 1 and 9
ratings[ratings>9]<-9
ratings[ratings<1]<-1
# Rounds to single digit.
ratings<-round(ratings,0)

# Assigns names to the variables in the data frame called ratings
names(ratings)=c(
"Easy_Reservation",
"Preferred_Seats",
"Flight_Options",
"Ticket_Prices",
"Seat_Comfort",
"Seat_Roominess",
"Overhead_Storage",
"Clean_Aircraft",
"Courtesy",
"Friendliness",
"Helpfulness",
"Service",
"Satisfaction",
"Fly_Again",
"Recommend")

# Calls qgraph package to run Network Map
library(qgraph)
# creates grouping of variables to be assigned different colors.
gr<-list(1:4,5:8,9:12,13:15)
qgraph(cor(ratings),layout="spring", groups=gr, labels=names(ratings), label.scale=FALSE, minimum=0.50)

# Calculates z-scores so that regression analysis will yield
# standardized regression weights
scaled_ratings<-data.frame(scale(ratings))
ols.sat<-lm(Satisfaction~Easy_Reservation + Preferred_Seats +
  Flight_Options + Ticket_Prices + Seat_Comfort + Seat_Roominess +
  Overhead_Storage + Clean_Aircraft + Courtesy + Friendliness +
  Helpfulness + Service, data=scaled_ratings)
summary(ols.sat)

ols.rec<-lm(Recommend ~ Easy_Reservation + Preferred_Seats +
  Flight_Options + Ticket_Prices + Seat_Comfort + Seat_Roominess +
  Overhead_Storage + Clean_Aircraft + Courtesy + Friendliness +
  Helpfulness + Service, data=scaled_ratings)
summary(ols.rec)

ols.fly<-lm(Fly_Again ~ Easy_Reservation + Preferred_Seats +
  Flight_Options + Ticket_Prices + Seat_Comfort + Seat_Roominess +
  Overhead_Storage + Clean_Aircraft + Courtesy + Friendliness +
  Helpfulness + Service, data=scaled_ratings)
summary(ols.fly)

7 comments:

  1. Hi,

    Any possibility / time to start a MOOC for doing "Market Research Using R"

    I have learned a lot from your posts. But that would be a big help to get everything in order.

    Please do think ...


    Cheers !

  2. any package in R programming for brand mapping and data visualization??

    Replies
    1. If by "brand mapping" you mean perceptual mapping using correspondence analysis, then you should search for "Gaston Sanchez correspondence analysis in R" for a complete how-to guide. If that is not what you were looking for, let me know.

  3. Interesting, have you looked at the specific package pcalg in R? What if you have highly correlated data? The network visualization wouldn't show you much?

    Replies
    1. You are correct that partial correlations or regression coefficients become less stable and less informative as all the variables become more correlated. I have argued in several posts that there comes a point when the first principal component becomes too large to believe that we have anything more than a single dimension. That is, we believe that the data cloud resides in a high dimensional space because we have many variables, but the high intercorrelations suggest that the data are confined to a single dimension. One cannot discover a causal structure using pcalg when all the variables are measuring the same underlying construct.

      For those who wish to learn more, Shalizi provides a good introductory chapter on pcalg called Discovering Causal Structure from Observations. Here is the link http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch25.pdf
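To illustrate the point about a dominant first principal component, here is a small simulation (entirely my own toy example, not from the post): six items loading on a single common factor yield a first eigenvalue that accounts for the bulk of the total variance.

```r
# Sketch: one strong common factor makes the first principal component huge.
set.seed(1)
f     <- rnorm(300)                                         # common factor
items <- sapply(1:6, function(i) 0.9 * f + rnorm(300, sd = 0.3))

ev          <- eigen(cor(items))$values  # eigenvalues of the correlation matrix
first_share <- ev[1] / sum(ev)           # proportion of variance in the first PC
first_share  # far above the 1/6 expected for six independent items
```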
