Visualizing the Customer Journey with R and Adobe Analytics Data Feeds

Much has been said about how multi-touch or algorithmic attribution models help you understand your customers’ conversion paths, but looking at numbers in a table doesn’t inspire insight the way a well-constructed visualization can. So, in this post, I’m going to give you two great ways to construct and visualize customer data sequences at scale so that you can expose how your customers move and interact with your brand. We’ll also get a glimpse into which sequences are working for your customers and which aren’t.

Before we begin, you should probably get caught up on my previous posts if you haven’t read them.

For the sake of this post, I’m going to visualize customer movement through various marketing channels – but you don’t have to use marketing channels. This approach can readily work for page names, site sections, app screens, search keywords, or even product views by merely swapping the “campaign” dimension in my examples for any variable of interest. It’s also not strictly required to use a conversion metric, but I think it makes the resulting visualizations more meaningful and actionable.

So, to begin, I’m going to assume you’ve got a data feed loaded into a sparklyr data frame called “data_feed_tbl” with a column indicating conversion. From there, I’ll construct an order sequence column that I can use to build my paths, just as I did in my previous attribution posts:

library(sparklyr)
library(dplyr)

data_feed_tbl = data_feed_tbl %>%
  group_by(visitor_id) %>%
  arrange(hit_time_gmt) %>%
  #flag converting hits, then count the conversions that came before each hit:
  mutate(order_seq = ifelse(conversion > 0, 1, NA)) %>%
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  #the lag leaves the first hit empty, so fill it in explicitly:
  mutate(order_seq = ifelse((row_number() == 1) & (post_event_list %regexp% ",1,"), -1, 
         ifelse(row_number() == 1, 0, order_seq))) %>%
  ungroup()

This bit of code gives me a nice new column that I can use for further group_by operations. Next, I’ll create a new data frame called “channel_stacks” – a non-repetitive concatenation of each distinct value a user saw leading up to a conversion:

channel_stacks = data_feed_tbl %>%
  group_by(visitor_id, order_seq) %>%
 
  #first remove irrelevant hits:
  filter(!is.na(campaign) | conversion>0) %>%
  
  #next remove repeated values with a lag function:
  filter((campaign != lag(campaign, default="1")) | conversion>0) %>%
  
  #now concatenate the sequence into a single row:
  summarize(
    path = concat_ws(">", collect_list(campaign)),
    conversion = sum(conversion)
  ) %>% ungroup() %>%
  
  #next roll up each unique path by count and conversion:
  group_by(path) %>%
  summarize(
    conversion = sum(conversion),
    path_count = n()
  ) %>% ungroup() %>%
  
  #last create a conversion rate column and pull it out of Spark:
  mutate(
    conversion_rate = conversion/path_count
  ) %>%
  collect()

This bit of code produces a table that is very interesting in its own right and will be the starting point for both of the visualizations I’ll show you how to do below. I’ve seen many companies use a “channel stacking” JavaScript plugin to get this type of data as an analytics report:

path                                             conversion  path_count  conversion_rate
Natural_Search>Email                                  2,633      26,597             0.10
Email>Natural_Search>Email                            2,197      16,528             0.13
Natural_Search>Paid_Search                            1,670      12,068             0.14
Email>Paid_Search>Email                                 729      11,624             0.06
Email>Natural_Search>Email>Natural_Search>Email         739      10,446             0.07
etc.                                                   etc.        etc.             etc.

However, as great as the above data table is, putting this information into a table doesn’t do this analysis justice, nor does it help me easily compare all of these sequences to each other.
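To make the channel-stacking logic above concrete without a Spark cluster, here is a minimal base R sketch on a tiny, made-up set of hits. The `hits` data frame and the `collapse_path` helper are hypothetical and not part of the sparklyr pipeline. The sketch groups by visitor only (it ignores order_seq for brevity) and only mimics the "drop repeats, concatenate, then roll up" idea on data that is already sorted by time:

```r
# Toy hits, already sorted by time within each visitor (made-up data)
hits = data.frame(
  visitor_id = c("a", "a", "a", "a", "b", "b", "b", "c", "c"),
  campaign   = c("Email", "Email", NA, "Paid_Search",
                 "Display", "Email", "Email",
                 "Email", "Paid_Search"),
  conversion = c(0, 0, 0, 1, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop missing values and consecutive repeats, then join with ">"
collapse_path = function(x) {
  x = x[!is.na(x)]
  keep = c(TRUE, x[-1] != x[-length(x)])
  paste(x[keep], collapse = ">")
}

# One path (and one conversion total) per visitor
path_by_visitor = tapply(hits$campaign, hits$visitor_id, collapse_path)
conv_by_visitor = tapply(hits$conversion, hits$visitor_id, sum)
paths = data.frame(
  path = as.vector(path_by_visitor),
  conversion = as.vector(conv_by_visitor),
  stringsAsFactors = FALSE
)

# Roll up each unique path by count and conversion
toy_stacks = aggregate(conversion ~ path, data = paths, FUN = sum)
toy_stacks$path_count = as.vector(table(paths$path)[as.character(toy_stacks$path)])
toy_stacks$conversion_rate = toy_stacks$conversion / toy_stacks$path_count
toy_stacks
```

Visitors "a" and "c" end up with the same path, so that path gets a path_count of two and a conversion rate of one half.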

Visualizing Paths with a Scatter Plot

To start, I recommend a scatter plot using one of my favorite interactive visualization libraries, plotly. Plotly has a host of useful options for scatter plots, and most of plotly’s visualizations work with R, Python, and MATLAB, and interoperate with ggplot2 if that’s your jam.

library(plotly)

p = plot_ly(
  channel_stacks, 
  type="scatter",
  mode="markers",
  y=~conversion, 
  x=~path_count,
  color=~conversion_rate, 
  size=~conversion_rate,
  text=~path
) %>% layout(
  xaxis = list(type="log", title="Number of Paths"),
  yaxis = list(type="log", title="Number of Conversions")
) %>% colorbar(
  title = "Rate"
)

This bit of code gives you a great plot (hover over each point to see its stats):

Notice that I’ve plotted both the x-axis and y-axis on a log scale. I do this because some paths get a ton more traffic than others, and without a log scale on the chart, the low end becomes all smashed together and unreadable. I’ve also scaled and colored each point based on its conversion rate which helps call out the paths that are working well so that they’re easier to see.

Allow me to point out some of the insights the scatter plot reveals from this dataset:

  • The vast majority of conversions are coming from single touches – notice that the points towards the top right of the chart are primarily single points of contact. This skewed distribution is pretty typical in my experience.
  • Some of the paths have a really high conversion rate (especially those ending with “Affiliates”), which might mean something fishy is going on or that those paths are super successful. Either way, it’s something I may want to explore further.
  • It’s also clear that “Display” is terrible at converting users in this chart, and “Email” is pretty decent – notice the meager conversion rate of paths ending with the “Display” channel and the healthy conversion rates for paths ending in “Email.”

Before you create a scatter plot on a large scale dataset and pull it into an R data frame, here are a few tips:

  • First, be aware that the larger the volume of data you’re collecting, the larger the number of potential paths. If necessary, you can work around this by filtering your sparklyr data frame to paths that meet a minimum threshold (say, a path_count above 50) before collecting the data frame into R.
  • Next, consider the cardinality of the dimension you’re pathing on. High cardinality can seriously blow out the number of unique paths in your data.
  • Make sure you’re removing repeated values (as I’ve done above), or you’ll end up with a whole lot of paths that look like “Email>Email>Email>Email, etc.” which can also add a lot of useless noise to your results.
  • Finally, even if after using those tips the chart bogs down because of the high number of unique points, I’d recommend trying some of the WebGL functionality plotly offers for large datasets. It’s incredible how many data points it can handle!
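The minimum-threshold tip can be sketched on a small, made-up path table. This uses base R on an already-collected data frame purely for illustration; on real data, the same condition would go into a dplyr filter() on the sparklyr data frame before collect(). The `min_paths` name and the toy values are assumptions:

```r
# Made-up path table standing in for channel_stacks
toy_stacks = data.frame(
  path       = c("Email", "Email>Display", "Display>Email>Display"),
  conversion = c(900, 12, 0),
  path_count = c(10000, 400, 3),
  stringsAsFactors = FALSE
)

min_paths = 50  # hypothetical threshold

# Keep only the paths that were seen often enough to be worth plotting
frequent = toy_stacks[toy_stacks$path_count > min_paths, ]
frequent$conversion_rate = frequent$conversion / frequent$path_count
frequent
```

If the chart is still sluggish after filtering, plotly’s R package provides a toWebGL() function that can be applied to an existing plot object.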

In our next example, I’ll show you how you can go a level deeper to analyze your customers’ journey in a different way.

Visualizing Paths with a Sankey Diagram

While the scatter plot is fantastic for finding paths with a high conversion rate or finding which paths are most popular, typically I like to visualize user paths with a more dynamic and flowing presentation. For our next example, we’ll take the “channel stacking” table from above and convert it into a Plotly Sankey diagram. I confess that the Sankey diagram can be one of the more confusing charts to interpret if not done right, and it is one of the more difficult plotly visualizations to set up (in my opinion), but hopefully, you can power through these obstacles with a little assistance.

First, I’m going to add a column to my channel_stacks table that splits the path string into an indexed list:

channel_stacks$path_list = strsplit(x=channel_stacks$path,split=">")

This indexed list makes life easier for me as you’ll see later. I’m also going to set a variable that allows me to quickly change how far down I want my Sankey diagram to go (i.e., one step, two steps, or three plus steps deep):

depth = 4
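As a quick illustration of what the indexed list looks like, here is a small sketch on three made-up paths (`toy_paths` and `toy_list` are hypothetical names). It shows the same `[j][[1]][i]` indexing pattern used in the code that follows, including what happens when a path is shorter than the step being asked for:

```r
# Three made-up paths of different lengths
toy_paths = c("Email", "Natural_Search>Email", "Email>Paid_Search>Email")
toy_list = strsplit(x = toy_paths, split = ">")

# Number of steps in each path
lengths(toy_list)

# First step of the second path
toy_list[2][[1]][1]

# The first path has no second step, so indexing past its end gives NA
toy_list[1][[1]][2]
```

That NA is what lets the later loops skip paths that end before a given layer.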

From here, we’ll need to construct Sankey node labels for each layer in the Sankey diagram. I tried to make my code as generic as I could (given how much time I have for blog posts), so you could use it on any number of unique path nodes or any dimension above and beyond a marketing channel dimension. I can’t guarantee that it’ll work in all scenarios, but hopefully, this code is enough to point you in the right direction. If you don’t need it to work generically, there are probably more straightforward ways to generate your node_labels and label_lengths:

#Generate node labels and label length vectors
node_labels=rep(list(list()),depth)
label_length = list()
for(i in 1:depth){
  for(j in 1:length(channel_stacks$path)){
    if(!is.na(channel_stacks$path_list[j][[1]][i]))
      node_labels[[i]][j] = channel_stacks$path_list[j][[1]][i]
  }
  node_labels[[i]] = unique(unlist(node_labels[[i]]))
  node_labels[[i]] = node_labels[[i]][order(node_labels[[i]])]
  label_length[[i]] = length(node_labels[[i]])
}
node_labels = unlist(node_labels)
label_length = unlist(label_length)

For my data, this gives me a series of repeated labels. This repetition is necessary because I want each step of the paths I’m visualizing to be separate nodes (as you’ll see below) and because the Sankey diagram doesn’t allow data to flow backward:

> node_labels
[1] "Affiliates" "Display" "Email" "Natural_Search" "Paid_Search" "Social_Media" "Affiliates" "Display" "Email" 
[10] "Natural_Search" "Paid_Search" "Social_Media" "Affiliates" "Display" "Email" "Natural_Search" "Paid_Search" "Social_Media" 
[19] "Affiliates" "Display" "Email" "Natural_Search" "Paid_Search" "Social_Media" 
> label_length
[1] 6 6 6 6

The next part is the trickiest part. The Plotly Sankey diagram uses three (zero indexed) arrays: “source,” “target,” and “value.” If I had a combination of source = 0, target = 7, and value = 10, it would create a Sankey branch from node 0 to node 7 (which maps “Affiliates” in the first layer to “Display” in the second layer as per my node_labels) with a value of “10”. Interpret this as having ten users who saw “Affiliates” (the zeroth node) as their very first step then went to “Display” (the seventh node) as their second step.
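The mapping from a layer and a channel to a node number can be sketched like this. The `node_index` helper and the `toy_` variables are hypothetical; they assume the same six channels in each of four layers, as in my node_labels above:

```r
channels = c("Affiliates", "Display", "Email",
             "Natural_Search", "Paid_Search", "Social_Media")
toy_depth = 4
toy_node_labels = rep(channels, toy_depth)
toy_label_length = rep(length(channels), toy_depth)

# Hypothetical helper: one-indexed node number of a channel within a given layer
node_index = function(layer, channel) {
  # nodes in all earlier layers (seq_len(0) is empty, so this is 0 for layer 1)
  offset = sum(toy_label_length[seq_len(layer - 1)])
  layer_labels = toy_node_labels[offset + seq_len(toy_label_length[layer])]
  offset + which(layer_labels == channel)
}

# Subtract 1 to get the zero-indexed values plotly expects
source_id = node_index(1, "Affiliates") - 1
target_id = node_index(2, "Display") - 1
c(source_id, target_id)
```

For “Affiliates” in the first layer and “Display” in the second, this gives 0 and 7, matching the example above.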

So what I need to do next is create a table to build out every combination of “source,” “target,” and “value” that I can fill in based on my channel_stacks table. Doing this requires some painful indexing using even more painful for loops (someone smarter than I could probably come up with a better way to do this without the loops):

#Build a data frame to fill out with each path view
combos = NULL
for(i in 1:(depth-1)){
  #number of nodes in all earlier layers (seq_len(0) is empty, so this is 0 when i is 1)
  offset = sum(label_length[seq_len(i-1)])
  for(j in (1 + offset):(label_length[i] + offset)){
    for(k in (1 + label_length[i] + offset):(label_length[i+1] + label_length[i] + offset)){
      combos = rbind(combos, c(i,j,k,0))
    }
  }
}
combos = as.data.frame(combos)
names(combos) = c("step","source","target","value")

Then, finally, I fill in the table with the actual values from my channel_stacks data frame:

#Populate the combo table
for(i in 1:(dim(combos)[1])){
  for(j in 1:(dim(channel_stacks)[1])){
    combos$value[i] = sum(combos$value[i], ifelse(
      (node_labels[combos$source[i]] == channel_stacks$path_list[j][[1]][combos$step[i]]) &
      (node_labels[combos$target[i]] == channel_stacks$path_list[j][[1]][combos$step[i]+1]),
      channel_stacks$path_count[j],0), na.rm = TRUE)
  }
}

This code produces a table that looks like this:

step  source  target  value
1     1       7       0
1     1       8       121
1     1       9       3422
1     1       10      2215
1     1       11      604
1     1       12      6
1     2       7       114
1     2       8       0
etc.  etc.    etc.    etc.

In this example, notice how the (one indexed, not zero indexed) source “1” only maps to “7” and above since I only have six distinct channels/nodes in step one. Depending on how many distinct values you have at each step, those values will be different. Also notice that a source never has a nonzero value when pointed at the same channel in the next layer (for example, source 1 mapping to target 7, which would be “Affiliates” mapping to “Affiliates”) since we’ve removed repeated instances. Finally, notice that none of these values are zero-indexed yet like I mentioned before – we’ll get to that in a later step.
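For anyone who would rather avoid the nested loops, one possible loop-free alternative is sketched below on made-up data. It is not the code I used for the chart. It builds one row per step-to-step transition and sums path counts with aggregate(). The `toy_` names are hypothetical, and the result uses channel labels rather than node numbers, so it would still need to be mapped onto the per-layer node offsets:

```r
# Made-up path table standing in for channel_stacks
toy_stacks = data.frame(
  path = c("Email>Display", "Email>Display>Email", "Display>Email"),
  path_count = c(10, 5, 7),
  stringsAsFactors = FALSE
)
toy_list = strsplit(toy_stacks$path, ">")
toy_depth = 3

# One row per (path, step) transition, truncated to the chosen depth
transitions = do.call(rbind, lapply(seq_along(toy_list), function(j) {
  p = toy_list[[j]]
  n = min(length(p), toy_depth)
  if (n < 2) return(NULL)  # single-touch paths have no transitions
  data.frame(
    step = 1:(n - 1),
    from = p[1:(n - 1)],
    to = p[2:n],
    value = toy_stacks$path_count[j],
    stringsAsFactors = FALSE
  )
}))

# Sum path counts for each distinct step/from/to combination
link_values = aggregate(value ~ step + from + to, data = transitions, FUN = sum)
link_values
```

Unlike the combos table, this only contains transitions that actually occur, so zero-valued combinations never show up.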

Almost there! The last thing I need to do is add a “conversion” node at the end so that I can see how many of these paths ended in conversion:

#Add a node to populate with conversion values
uniques = unique(c(combos$source,combos$target))
converts = as.data.frame(list("step"=rep(0,length(uniques)), "source"=uniques, "target"=rep(max(uniques)+1,length(uniques)), 
                              "value"=rep(0,length(uniques))))
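combos = rbind(combos,converts)
#org_node_labels: the sorted channel labels of a single layer
#(assumed here to be the same in every layer, as in this data)
org_node_labels = node_labels[1:label_length[1]]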
for(i in 1:(dim(channel_stacks)[1])){
  stack_depth = min(depth,length(channel_stacks$path_list[i][[1]]))
  index_val = which(combos$step==0 & combos$source==(which(org_node_labels == channel_stacks$path_list[i][[1]][stack_depth]) + 
                                                     ifelse(stack_depth>1, sum(label_length[1:(stack_depth-1)]),0)))
  combos$value[index_val] = combos$value[index_val] + channel_stacks$conversion[i]
}

This code creates some new entries in my combo table allowing every node to flow into the conversion node if users converted after any particular step (I’ve added this as step “0” since that value will work regardless of how deep you make your chart go):

step  source  target  value
0     1       25      6801
0     2       25      61
0     3       25      3677
0     4       25      3313
0     5       25      1342
0     6       25      32
0     7       25      923
etc.  etc.    etc.    etc.

Notice how each added row has “25” as its target – this will be my conversion step. The last thing I’ll do is add a step number to each of the labels (for some reason at the time of this writing the Sankey diagram in plotly doesn’t handle repeated labels very well), as well as a “Conversion” label:

#Prefix each node label with its step number and add a "Conversion" label
display_node_labels = node_labels
for(i in 1:length(label_length)){
  #number of labels in all earlier layers
  offset = ifelse(i==1,0,sum(label_length[1:(i-1)]))
  for(j in 1:label_length[i]){
    display_node_labels[j+offset] = paste0(i,":",node_labels[j+offset])
  }
}
display_node_labels = c(display_node_labels, "Conversion")

Finally, I’m ready to plot (I’ve added some coloring not shown in the code above to make it easier to read).

#Generate Sankey diagram
p <- plot_ly(
    type = "sankey",
    orientation = "v",

    node = list(
      label = display_node_labels,
      color = node_colors, #node colors are built separately (not shown)
      pad = 10,
      thickness = 30,
      line = list(
        color = "black",
        width = 0
      )
    ),
  
    link = list(
      source = combos$source-1, # convert to zero index
      target = combos$target-1, # convert to zero index
      value = combos$value, #size of connection
      color = combos$color #add colors for each link if desired
    )
  ) %>% 
  layout(
    title = "Conversion Flow Diagram",
    font = list(
    size = 10
    )
  )
p

Which finally produces the chart I’m after:

A few things I’d mention about this plot:

  • It’s interactive! You can drag and move the nodes around, as well as hover over each path or node to get more details – I love this about plotly!
  • I hid all of the conversions from the first step because there are so many of them that it drowned out everything else – so what you see here are just the multi-touch paths. I did this by adding a simple filter to the combos data frame to remove the combos from sources 1 to 6.
  • I added some extra transparency to the colors of paths that were not converting paths to highlight which paths were ending in conversion. You can see how “Affiliates” and “Email” are my best contributors to conversion.
  • You can also see how few people make it very far beyond the first one or two touches. Again, just visualizing how frequently or infrequently your customers move from one channel to another can be useful information.
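The filtering and transparency tweaks mentioned above aren’t shown in my code, so here is one hypothetical way they could be done, sketched on a made-up slice of a combos table. The `to_rgba` helper, the color choices, and the opacity values are all assumptions, and I’m reading “sources 1 to 6” as the step “0” rows that flow from the first layer into the conversion node:

```r
# Made-up slice of a combos table: step 0 rows flow into the conversion node (25)
toy_combos = data.frame(
  step   = c(1, 1, 0, 0, 0),
  source = c(1, 3, 1, 3, 7),
  target = c(8, 7, 25, 25, 25),
  value  = c(121, 50, 6801, 3677, 923)
)
first_layer = 1:6       # node numbers of the six first-step channels
conversion_node = 25

# Drop conversions that come straight from the first step
hide = toy_combos$step == 0 & toy_combos$source %in% first_layer
toy_combos = toy_combos[!hide, ]

# Hypothetical helper: turn a named color into an rgba() string with a chosen opacity
to_rgba = function(color, alpha) {
  rgb_vals = col2rgb(color)
  sprintf("rgba(%d,%d,%d,%.1f)", rgb_vals[1, ], rgb_vals[2, ], rgb_vals[3, ], alpha)
}

# Links ending in conversion get a strong color; everything else is mostly transparent
toy_combos$color = ifelse(
  toy_combos$target == conversion_node,
  to_rgba("steelblue", 0.8),
  to_rgba("grey", 0.2)
)
toy_combos
```

A color column built this way is the kind of thing that could be passed to the link color argument in the plot above.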

Some tricks for working with large datasets:

  • If large cardinality was an issue for the scatter plot, multiply that by ten here – visualizing many different combinations of values gets unfeasible fast with this visualization (not to mention useless). If you have high cardinality in your dimension of interest, I’d recommend reducing the unique value count with a classification lookup or just using the top 10 most common values.
  • I love using colors to illustrate the different values, but having more than ten unique values can turn into a mess – so just one more reason to limit the number of distinct values you’re pathing across.

Conclusion

The customer journey is involved, and these are just two of many potential ways you can visualize the different paths your customers take before achieving success. Each visualization has its strengths and weaknesses, but for the scatter and Sankey plots I’ve described here, I’d sum it up this way:

  • The scatter plot helps you find the successful and unsuccessful paths more quickly, but it’s much more difficult to see which specific channels are contributing best.
  • The Sankey diagram is less clear about which paths are most successful or unsuccessful. However, it’s easy to see which channels are contributing and which aren’t. It’s also easier to get a holistic picture of your customers’ journey and to visualize the volume of traffic moving through each channel.

Feel free to follow me on Twitter if you like this content and want to see more. Also, let me know if you have other ideas or approaches to visualizing the customer journey. I’d love to hear them! Best of luck!

Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.