Visitor Level Aggregations Using R and Adobe Analytics Data Feeds

Visitor level aggregations (or as I like to call them, “visitor rollups”) are one of the most useful and meaningful things you can do with an Adobe Analytics data feed.  If you ever want to do cluster analysis to find interesting marketing segments, propensity modeling to find likely converters, or product affinity analysis for cross-selling, you need data at the visitor level.

A visitor rollup is basically a snapshot “profile” of every visitor that has been to your site or app in a given timeframe.  At a high level, a visitor rollup is a table that looks something like this:

Visitor ID    Total Visits    Total Revenue    Unique Days Visited    Visits to Home Page
8605688257    1               $0               1                      0
8959431036    1               $1               1                      0
2852284096    4               $100             3                      2
2852284096    2               $0               1                      2

So, how do you construct a table like this from a bunch of data feed files in R?  Enter the sparklyr library.  If you haven’t set up sparklyr before, read my guide on how to do that here.

The first thing you’ll need to do is load up the sparklyr library, connect to Spark, and read in some data (notice the wildcard “*” in the path, which lets you load a bunch of files all at once):

library(dplyr)
library(sparklyr)

# Recommended to get the latest features of sparklyr:
# devtools::install_github("rstudio/sparklyr")

# Unpack gz files if necessary
# untar("data/01-report.suite_2014-06-01.tsv.gz")

# Create Spark connection and read data
sc = spark_connect(master="local", version="2.1.0")
data_feed_local = spark_read_csv(sc = sc,
                                 name = "data_feed",
                                 path = "data/01-report.suite_2014-*.tsv",
                                 header = FALSE,
                                 delimiter = "\t")

With the data loaded, you can now do a little housekeeping to make your life easier.  The first step is to map the columns in the data feed to friendlier names.  Unfortunately, data feed files don’t include column headers in the file itself, but you can find what each column maps to in the “column_headers.tsv” file that comes with the data feed.  Each entry in that file maps to the corresponding column of your data feed file, in order.
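
To sanity check that mapping from R, here’s a minimal sketch (the exact path to column_headers.tsv will depend on where your feed was delivered, so treat the file name below as an assumption):

# Read the headers file locally and inspect the column order
# (hypothetical path; adjust to where your column_headers.tsv lives)
headers = read.delim("data/column_headers.tsv", header = FALSE, stringsAsFactors = FALSE)
as.character(headers[1, ])  # element 13 tells you what V13 holds, and so on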

You can manually name the columns (and include just the ones you actually want for your analysis) using the select and mutate functions rather than keeping the default “VX” naming.  In this case, you can see that I’m also concatenating the post_visid_high and post_visid_low into a new visitor_id column with the mutate function.

# Assign friendly names and merge (mutate) visid hi and low
# Variable numbers were found from column_headers file

data_feed_tbl = data_feed_local %>%
  mutate(merged_visitor_ids = paste0(V13,V14)) %>%
  select(
     visitor_id = merged_visitor_ids,
     visit_num = V17,
     visit_page_num = V18,
     visitor_browser = V1,
     post_event_list = V3,
     visitor_country = V4,
     visitor_region = V5,
     hit_time_gmt = V6,
     mobile_id = V7,
     post_page_event = V8,
     post_pagename = V9,
     post_product_list = V10,
     report_suite = V12,
     post_evar1 = V19
  )

You’ll notice that when you run the previous code, it returns almost immediately.  That’s because sparklyr hasn’t asked Spark to do anything yet – it’s simply keeping a record of the transformations it will need to do later and only runs those transformations when absolutely necessary.
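
If you’re curious what Spark will eventually be asked to run, you can inspect the translated SQL without kicking off a job, or pull back a tiny preview.  A quick sketch:

# Show the SQL that dplyr/sparklyr has built up so far (no Spark job yet)
show_query(data_feed_tbl)

# Or preview the first few rows (printing this does a small amount of work)
head(data_feed_tbl, 5)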

To double-check that I’ve set everything up correctly, I like to run the following command:

# View names of columns
> tbl_vars(data_feed_tbl)
 [1] "visitor_id"        "visit_num"         "visit_page_num"   
 [4] "visitor_browser"   "post_event_list"   "visitor_country"  
 [7] "visitor_region"    "hit_time_gmt"      "mobile_id"        
[10] "post_page_event"   "post_pagename"     "post_product_list"
[13] "report_suite"      "post_evar1"

Cool.  Now that that’s all set up, we’re ready to do the fun stuff: the visitor level aggregations.  Let’s start with some easy ones: the total number of hits a visitor had, the total number of visits a visitor had in the date range, and the number of lifetime visits a visitor has had across all time.

# Simple roll up with hit count, lifetime visits, and total visits
basic_counts = data_feed_tbl %>%
 group_by(visitor_id) %>%
 summarize(
   hit_count = n(),
   lifetime_visits = max(visit_num),
   visits = n_distinct(visit_num)
 )

A couple of things to point out here.  First, notice that I’ve created a new sparklyr data frame called basic_counts that contains the aggregations I’m interested in; I’ll need it later when I merge all of the aggregations together.  Next, I group_by the visitor_id column I created earlier, which groups everything that follows by visitor.  Finally, I feed that into the summarize function, which does three different aggregations for me:

  1. The n() function is simply a row count function.  It’s going to count the number of rows for each visitor and return that as a column I want in my visitor rollup file.
  2. The max() function allows me to pick the largest value in the visit_num column for a visitor – this is our visitors’ lifetime visits.
  3. The n_distinct() function counts the number of distinct values in a field – in this case the distinct values in the visit_num column which is an effective count of the number of visits the visitor had in my date range.

Again, when I run the above code, it returns immediately because Spark doesn’t need to do any work yet – the basic_counts sparklyr data frame is just a mapping of what Spark will eventually need to do when the time comes.

Next, I’d like to get a visitor rollup column for the number of visits a visitor had to the “homepage” of my site.  To do that, I’m going to create another sparklyr data frame:

# Counting visits where post_pagename contains a certain value
visit_counts = data_feed_tbl %>%
 group_by(visitor_id) %>%
 filter(post_pagename %regexp% "homepage") %>%
 summarize(
   visits_to_homepage = n_distinct(visit_num)
 )

In this example, I’ve applied a filter using a regular expression against the post_pagename column.  Notice that after I filter the hits down to just the “homepage” for a visitor, I can count the distinct number of visits where that post_pagename value occurred using the n_distinct() function like I did before.
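
One small aside: Spark evaluates these patterns with Java regular expression rules, so if your page names aren’t consistently cased, an inline (?i) flag makes the match case-insensitive.  A quick sketch (the page name is just illustrative):

# Case-insensitive version of the same filter
data_feed_tbl %>%
 filter(post_pagename %regexp% "(?i)homepage") %>%
 head(5)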

I can similarly count the number of page views a visitor had to that page by adding the post_page_event to the filter.  A post_page_event value of “0” means that the hit was considered a page view by Adobe Analytics.

# Counting page views to homepage
page_view_counts = data_feed_tbl %>%
 group_by(visitor_id) %>%
 filter(post_pagename %regexp% "homepage" & post_page_event == 0) %>%
 summarize(
   page_views_to_homepage = n()
 )

Again, just using the row count function after filtering gives me an effective way to count homepage views for each visitor.

In the next (slightly more complex) example, I’d like to count the distinct number of days a visitor came to my site.  This is helpful for gauging how engaged a visitor is by looking at the number of days his or her visits were spread across.  To do this one, I’m going to need help from a Spark SQL function: from_unixtime().

# Counting unique dates visited
# Note: use lowercase "yyyy" here; in Spark's date format patterns an
#   uppercase "YYYY" means the week-based year, which can shift dates
#   near the start or end of a year
date_counts = data_feed_tbl %>%
 group_by(visitor_id) %>%
 summarize(
   days_visited = n_distinct(from_unixtime(hit_time_gmt, "yyyy-MM-dd")),
   months_visited = n_distinct(from_unixtime(hit_time_gmt, "yyyy-MM"))
 )

A couple of things to notice here.  First, I’m feeding the hit_time_gmt column (the timestamp of a hit) into the from_unixtime function, which turns it into a human-readable date string.  Then I use the n_distinct() function to count the distinct values of that formatted date, once at the day level and once at the month level, simply by passing a slightly different format string to from_unixtime.  This gives me the number of distinct days and months each visitor showed up in my data.
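
If you want to convince yourself the formatting is doing what you expect, a quick sanity check (just a sketch) is to look at a few raw timestamps next to their formatted day strings:

# Preview a handful of timestamps alongside the formatted date string
data_feed_tbl %>%
 mutate(day = from_unixtime(hit_time_gmt, "yyyy-MM-dd")) %>%
 select(hit_time_gmt, day) %>%
 head(5)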

My last example is the most complex one I’ll illustrate.  I’d like to count the number of times a visitor had an Adobe Analytics success event.  This sounds easy, but it's actually super tricky because all success events appear in a comma delimited list in the post_event_list column.  Not only that, events can be incremented by more than 1 on a single hit using the “=XX” syntax (for example, a single hit can increment event4 six times by showing up as “203=6”).  To get around these issues, I’ll use a clever regular expression mixed with some Spark SQL functionality:

# Counting events from the post_event_list
# Note that "203" maps to custom event4
#   as per the lookup file that comes with
#   the data feed files.
event_to_count = "203"
event_counts = data_feed_tbl %>%
  group_by(visitor_id) %>%
  filter(post_event_list %regexp% paste0(event_to_count)) %>%
  mutate(true_counts = ifelse(
    post_event_list %regexp% paste0(event_to_count,"="), 
    as.numeric(regexp_extract(post_event_list, 
      paste0(".*",event_to_count,"=([0-9]+).*"))),
    1)
  ) %>%
  summarize(
   event4s_triggered = sum(true_counts)
)

This one was a little crazy, but essentially I’m filtering down to just the rows containing the event of interest, 203.  Then I’m using a regular expression to extract any value (where applicable) following a “203=” so I get the proper count for that hit, and storing it in a mutated column called “true_counts”.  Then I feed the “true_counts” column into the sum() function as my final summarize step.  Once done, I have an accurate count of the number of times each visitor saw event4.  Luckily, the code above can be reused for any event by just changing the “203” to the event you care about.  One caveat: because the filter only looks for “203” as a substring, it will also match event IDs that merely contain those digits (like “1203” or “2034”); a stricter, comma-anchored variant is sketched below.
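
Here is that stricter variant as a sketch (the comma-anchored patterns are my own tightening of the filter, not something the data feed format requires):

# Stricter version: anchor the event id on commas or string boundaries so
#   "203" doesn't also match ids like "1203" or "2034"
event_to_count = "203"
strict_event_counts = data_feed_tbl %>%
  group_by(visitor_id) %>%
  filter(post_event_list %regexp% paste0("(^|,)", event_to_count, "(=|,|$)")) %>%
  mutate(true_counts = ifelse(
    post_event_list %regexp% paste0("(^|,)", event_to_count, "="),
    as.numeric(regexp_extract(post_event_list,
      paste0("(^|,)", event_to_count, "=([0-9]+)"), 2)),
    1)
  ) %>%
  summarize(
    event4s_triggered = sum(true_counts)
  )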

Now that I’ve finished all of my aggregations, I need to merge them all into a single data frame.  To do that I’m going to use the following:

# Merge all of the visitor rollup counts into a single data frame
visitor_rollup = list(basic_counts,
    event_counts,visit_counts,
    page_view_counts,date_counts) %>%
 Reduce(function(...) merge(..., all=TRUE, by="visitor_id"), .)

Only now does sparklyr actually run a Spark SQL command.  Using the merge function, I join each of the previous data frames I created into a new data frame called “visitor_rollup”.  With the magic of Spark, this query takes less than 3 minutes on my Mac to roll up nearly 1 million visitors – pretty impressive!
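
If you prefer to stay in dplyr verbs end to end, the same rollup should be reachable with a chain of full_join calls (a sketch, equivalent in spirit to the merge above):

# Equivalent rollup built from full outer joins on visitor_id
visitor_rollup = basic_counts %>%
 full_join(event_counts, by = "visitor_id") %>%
 full_join(visit_counts, by = "visitor_id") %>%
 full_join(page_view_counts, by = "visitor_id") %>%
 full_join(date_counts, by = "visitor_id")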

The output of the merge adds a lot of NAs to the data for visitors that had no data in some of the columns.  To convert those NAs to zeros, just use the following:

# Convert NAs to 0
visitor_rollup = visitor_rollup %>%
 mutate_all(funs(ifelse(is.na(.),0,.)))

Finally, to get the data out of Spark and into a normal R data frame for my visitor analysis, all I have to do is use the collect function:

# Put the data frame into an R object for analysis
local_visitor_rollup = collect(visitor_rollup)

# Run a summary and add up the columns to check accuracy
summary(local_visitor_rollup)
colSums(local_visitor_rollup[,-1])

When you add up the columns of your visitor rollup, they should be pretty dang close to (if not exactly equal to) the totals you find in your Adobe Analytics reporting, and now you have the data in a much more machine-learning-friendly format.  We’ll be going over some of the amazing things you can do with a visitor rollup in the future, so be sure to check back!

Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.

4 thoughts to “Visitor Level Aggregations Using R and Adobe Analytics Data Feeds”

  1. Hi Trevor,

    Thanks for this awesome blog!!! I tried to reproduce some of your steps and got stuck on the task of extracting events such as purchase, which has event ID 1. In the event list this could appear as “1,”, “,1,” or “,1” (beginning, middle, and end of the event list). Can you give an example of how to extract it in the filter statement?

    I am not familiar with the sparklyr syntax but the regex should be something like this: “^1,|,1,|,1$” to avoid matching events such as 11, 101, and so on.

    Best wishes,
    Matthias

  2. Great question! Sparklyr supports regex both as a filter criterion (using %regexp%) and as an extraction method (try the “regexp_extract” function), so you should be able to get the right expression. Notice how I used both when counting event4, or “203”, in my example.

    You can learn all about getting started with dplyr syntax (which is what sparklyr uses) here:
    http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

  3. Thanks for your quick help! 🙂 I think I just have to play around a little more with the functions and the sparklyr documentation – it makes digging into the data much easier.

    Maybe also of interest: the pure SQL statement to count the number of orders:

    library(DBI)
    order_counts <- dbGetQuery(sc,"SELECT count(*) AS number_of_orders FROM data_feed WHERE exclude_hit = 0 AND hit_source NOT IN (5,8,9) AND duplicate_purchase != 1 AND (post_event_list = '1' OR post_event_list LIKE '1,%' OR post_event_list LIKE '%,1,%' OR post_event_list LIKE '%,1')")
