Customer Journey Analytics and R: How to Escape SQL Hell With cjar

In my last post, I illustrated how to use R with the Adobe Experience Platform (AEP) Query Service to query your raw data with analysis-supercharging tools like dbplyr without having to know any SQL. While I’m a big fanboy of R and dbplyr, if I’m honest, there are some things that SQL and dbplyr just can’t do well. Allow me to explain.

In the modern world of user-generated event data (credit to James Pheonix for his helpful article), it’s not enough to have SQL access to raw data. SQL is a nearly 50-year-old language originally designed to work with relational databases and was not designed to work with the hierarchical nature of semi-structured event data. In other words, event data only makes sense in the context of a person’s session on a website, a mobile app, or call center. While many modern analysis acceleration platforms have attempted to retrofit SQL onto modern event data, SQL remains an ill-suited, slow, and usually expensive tool to query event data to answer critical business questions like:

  • Are the product features we’re building improving our holistic customer experience?
  • Is our product adoption growing or shrinking?
  • Which marketing campaigns are driving the most significant conversion rates downstream?
  • How did users find the products they wanted?
  • Which of our most popular content should we promote?

Analyzing event data requires a query engine designed explicitly for hierarchical, semi-structured data, like Adobe’s Customer Journey Analytics (CJA). However, when coupled with R (using the handy new R library “cjar” built by Ben Woodard), you get the best of both worlds. Let’s dive in!

Bringing Event Data Into CJA

When using CJA, we first need to add the event datasets we want to analyze from AEP into a CJA Connection. CJA Connections are event-data oriented and require that your datasets have a timestamp and a person identifier. The required timestamp and person ID are one of several things that makes CJA queries so fast – it uses those fields to pre-sort and pre-shuffle the data across its many reporting nodes.

Another reason CJA Connections are so helpful is that they can combine multiple event datasets into a single union table with a union schema. Shared fields between datasets populate a single field in the union table (like the “event type” field in the video above). When a dataset has unique fields (like “call reason” from call center events), Connections create distinct fields for those. This union view is handy for analyzing how people jump from the web to an app or a call center call. If you’ve ever attempted to bring event datasets together using SQL, you’ll know how painful it can be to FULL OUTER JOIN or UNION ALL five or six times in a single SQL query.

Setting Up CJA Data Views

While Connections are the foundation for bringing event data together into one table, Data Views provide the “metrics layer” for CJA. Data Views are where you define the specific metrics and dimensions and all of the other associated query-time transformations (like data sessionization, timezone shifts, corporate calendars, global filters, etc.) you’ll use for reporting to your organization. Creating metrics and dimensions is as easy as dragging and dropping fields into your Data View:

One of the aspects of metrics and dimensions that are most useful with R is the Component ID. The component ID is how the CJA backend (and its reporting APIs) refer to that metric or dimension you’ve just created. For example – if I create a revenue metric in CJA and want to make that very easy to use in R, I can rename the component ID to simply revenue rather than the more complicated dot notation version of that schema field (_myanalyticsdata.retailBase.revenue):

CJA also provides a host of query-time data transformations you can apply if needed, such as:

  • Value deduplication
  • Include/Exclude value filtering
  • Substring (including regex)
  • Allocation and Attribution
  • Lowercasing strings
  • Value bucketing for numeric dimensions
  • Null value replacement and renaming

Each of these transformations can be very useful to “bake in” to your metric and dimension definition so that you don’t have to do this yourself in R or SQL later on.

Querying Data

Now that we have a Connection and Data View, we’re ready to start querying CJA with R, and the easiest way to do that is with an API wrapper library called “cjar,” written by Ben Woodard. Using cjar takes a minute to set up if you’ve never used the CJA APIs before, but it’s not too difficult to do. You can find a helpful guide here that walks you through the process. Once you’ve installed the cjar library and have it set up, all you have to do is authenticate:

library(cjar)
cja_auth(type="jwt")

You should see a message like Successfully authenticated with JWT: access token valid until <some-date> if you did things correctly.

The cjar library offers a variety of useful functions, but we’ll start with the basics – specifically the cja_freeform_table function which allows you to pull back data tables straight from CJA. To use the function, you’ll need a couple of things:

  1. The ID of the Data View you want to query and
  2. The metric/dimension IDs to query

To get the IDs you need, you can use the convenience functions cja_get_dataviews, cja_get_metrics, and cja_get_dimensions. Each of those functions will return a list of all available Data Views, Metrics, and Dimensions, respectively, with the IDs that you can use to query CJA (such as revenue defined above).

Now we’re ready to query some data! Supply the Data View ID, metrics, and dimensions you want to aggregate, and make your query. Note that CJA doesn’t let you pull back raw events themselves but instead requires that you supply both a dimension and a metric. It’s as if CJA forces a “group by” if I were using SQL or dbplyr. For example,

my_data = cja_freeform_table(
  dataviewId = "my_dataview_id",
  dimensions = "dimension1",
  metrics = c("metric1", "metric2")
)

returns a data table roughly equivalent to this SQL query:

SELECT "dimension1", SUM("metric1") AS "m1", SUM("metric2") AS "m2"
FROM "my_dataview_id"
GROUP BY "dimension1"

It’s tempting to wonder why we need CJA or cjar if we can accomplish the query we need with a few lines of SQL. The truth is that CJA abstracts away a lot of complexity that you’d have to deal with otherwise. Accounting for combining multiple event datasets, making field transformations on nested object arrays (like revenue or content impressions), and deduping repeated values (like duplicate purchase IDs or webpage refreshes) is no simple feat. The SQL above balloons to something like below (which I’m positive is incorrect given my terrible SQL skills):

SELECT 
  "dimension1", 
  SUM ( 
    UNNEST("nestedObjectArray.metric1")
  ) AS "m1", 
  SUM ( 
    UNNEST("nestedObjectArray.metric2")
  ) AS "m2"
FROM (
  SELECT "application",
    CASE 
      WHEN ("dimension1" = "some_value") THEN (
        CASE 
          WHEN ("dimension1" = LEAD("dimension1", 1, NULL) OVER()) THEN (NULL)
          WHEN NOT("dimension1" = LEAD("dimension1", 1, NULL) OVER()) THEN ("channel")
       WHEN NOT("dimension1" = "some_value") THEN (NULL)
    END AS *
  FROM ((((
    SELECT *
    FROM "my_web_data")
  UNION ALL (
    SELECT *
    FROM "my_mobile_app_data"))
  UNION ALL (
    SELECT *
    FROM "my_call_center_data"))
  UNION ALL (
    SELECT *
    FROM "my_POS_data"))
GROUP BY "dimension1"

Given the complexity, I can’t imagine attempting to filter the query to “new users within the last 30 days” or “web sessions that ended in a call center call.” I honestly don’t want to think about writing the SQL or dbplyr code needed to answer those seemingly simple use cases. In my experience, it can take even very sophisticated organizations weeks or months to answer such questions when using SQL in this way.

The cjar code above allows you to abstract away the complexity of these queries by keeping all of the union event tables and field level transformations out of your code – leaving you with a set of metrics and dimensions that match what users will see in CJA’s Analysis Workspace UI.

Dimension Breakdowns

While cjar makes querying event data accessible in R, it also has several notable features that make it easy to bring data into R for machine learning or visualization. The first (and my personal favorite) is the ability to do multi-dimensional breakdowns (equivalent to GROUP BY using multiple columns). In the Analysis Workspace UI, you can only do breakdowns on a single dimension value at a time, so having this capability in R is fantastic:

my_data = cja_freeform_table(
  dataviewId = "my_dataview_id",
  dimensions = c("dimension1", "dimension2"),
  metrics = c("metric1", "metric2"),
  top = c(10, 50)
)

cjar makes multiple API calls behind the scenes (which the CJA API currently requires) and compiles them all into a data table seamlessly for you. The query above yields a breakdown you can’t quickly get in Analysis Workspace or SQL:

dimension1dimension2metric1metric2
value 1value a510
value 1value b48

Filtering Data

Next on my list, cjar allows you to tap into the deep segmentation power that CJA provides. One of the most challenging aspects of using SQL on event data is filtering down to “users who…” or “sessions where…” since SQL doesn’t inherently understand what a “session” or “user” is. Furthermore, returning all of the events belonging to any user who performed a specific action requires complex SQL joins and nested subselects (like I illustrated in my previous post) that are a huge pain to construct correctly (if not impossible).

To accomplish the sophisticated filtering that CJA offers with cjar, use the filter_build and filter_rule functions. In this case, I want to filter my event data down to all the events generated by users who purchased and searched at any point, even where those two actions may or may not have happened on the same event:

library(rjson)

rule1 = filter_rule(
  dimension = "user_action_dimension",
  verb = "streq",
  object = "Search"
)

rule2 = filter_rule(
  dimension = "user_action_dimension",
  verb = "streq",
  object = "Purchase"
)

filt = filter_build(
  dataviewId = "my_dataview_id",
  name = "Users who searched and purchased",
  context = "visitors",
  conjunction = "and",
  rules = list(rule1, rule2),
  create_filter = TRUE
)

filterID = fromJSON(filt)$id

The above code creates a reusable filter for subsequent cja_freeform_table calls, and can even be used within the CJA Analysis Workspace UI. Filters in CJA use a “context” that describes the scope of your filter (e.g., “user” or “session” or “event”), a “conjunction” (e.g., “and” or “or”), and a list of filter rules which contain the specific metrics or dimensions you’d like to filter on with the filter logic you want to apply (e.g., “user_action_dimension” equals “Search”). The cjar package contains a data frame with all of the filter verbs available – for example, “Equals” or “Contains” or “Equals any of.”

Once you have the filterID above, you can add it directly into a cja_freeform_table function using the segmentId parameter:

my_data = cja_freeform_table(
  dataviewId = "my_dataview_id",
  dimensions = c("dimension1", "dimension2"),
  metrics = c("metric1", "metric2"),
  segmentId = filterID,
  top = c(10, 50)
)

This code provides the data aggregation I’m after but filtered down to all events belonging to users who had both a “Search” and a “Purchase.” If I attempt to create equivalent SQL, the complexity grows with every new rule in my filter. With cjar, you can add as many rules as you want to your filter_build function without worrying about subselects and joins and other SQL messiness.

Sequential Segmentation

cjar also provides methods for constructing sequential filters and even nested contextual containers which can become pretty complex, but I’ll provide a fairly straightforward example:

I would like to filter my events down to users who performed a sequence of actions. In this case, I would like to filter my events to those generated by any user who had a single session where the user first did a search, then later made a purchase. This type of filter is common and useful for search algorithm optimization and/or search UX optimization.

I’m not at all sure how that can be done in SQL, but in cjar this can be done fairly easily by defining two rules, one sequence, and one filter:

rule1 = filter_rule(
  dimension = "user_action_dimension",
  verb = "streq",
  object = "Search"
)

rule2 = filter_rule(
  dimension = "user_action_dimension",
  verb = "streq",
  object = "Purchase"
)

seq1 = filter_seq(
  context = "visits",
  rules = list(rule1, rule2),
  sequence = "in_order",
)

filt = filter_build(
  dataviewId = "my_dataview_id",
  name = "cjar sequential filter example",
  description = "All events for any user that searched then purchased within any single session",
  containers = list(seq1),
  context = "visitors",
  create = TRUE
)

filterID = fromJSON(filt)$id

As before, the filterID can then be used by the cja_freeform_table function to run a report based on event data generated only by users that meet that criteria.

If constructing a sequential or complex filter is too difficult to do in R, one other neat trick is to just use the CJA UI to build it which is a bit more user friendly. Here I’ve constructed a filter that gives me only events that occurred on or before a user’s first session where the user 1) performed a search, and 2) within five minutes made a purchase:

Once saved, the filter ID can be retrieved using the cja_get_filters function, or you can find it at the very end of the URL of the filter builder – it looks like “s12345ABCDEF@AdobeOrg_xyz123”.

Summary

Customer Journey Analytics combined with the cjar library offers unique possibilities for analyzing event data in R and is a better alternative to SQL-based approaches for most event analysis applications. Big thanks to Ben Woodard for making the CJA API accessible and easy to use in R.

In subsequent posts, I intend to show you some of the cool ways you can harness machine learning to enrich and accelerate your insights when you have both SQL and cjar at your disposal. Thanks for reading, and feel free to hit me up with questions or connect with me on Twitter or LinkedIn!

Trevor Paulsen

Trevor comes from the world of aerospace engineering and robotics. He studied estimation theory and data mining techniques before leading Adobe’s data science consulting team. Trevor is currently a group product manager for Adobe Analytics and loves doing big data projects and statistical analyses on the side whenever he can. He’s a father of three and a lover of bike rides & music. All views expressed are his own.