; ;
Craft At WillowTree Logo
Content for craftspeople. By the craftspeople at WillowTree.
Analytics

Demystifying GA4 Data: How to Flatten Nested Structures in BigQuery

Ciara Adkins
Senior Analytics Architect
June 23, 2023

Google Analytics 4 (GA4) is an industry-shaking update to Universal Analytics. In addition to GA4’s extensive list of new features, exporting GA4 data to BigQuery is free. However, analysts tasked with doing anything with the GA4 BigQuery export are in for a heartbreaking surprise: all the event data is nested (*big sad*).

But have no fear! I have gone through and done most of the hard work for you. So now, all you have to do is download a few files, follow along, and stick around for the ride.

You can find the full query at the end of this article. I recommend opening the file in another window or screen next to your current one and referencing the file while reading the article.

Caveat: This article assumes you have intermediate familiarity with SQL in BigQuery. I will not be teaching the basics of BigQuery. Now that we have selected for all users WHERE level >= “intermediate,” let’s get into it!


Choosing Variables

Before you start writing any SQL, you should have a plan for the variables you want to include in your analysis. Since jumping directly into BigQuery without a plan will most likely leave you even more flustered and confused than when you started, I suggest creating a Variable Mapping Reference. The columns should include items such as “Desired Data,” which are the data points you need, and “Source Column Name,” which maps the actual columns in the raw GA4 BigQuery export to your desired data points.

Here is what I came up with:

Here’s a link to my template if you want to reuse it.

Alright! Now that you have your Variable Mapping Reference completed, you are one step closer to that beautiful flattened data you’ve been dreaming of.

Let’s write some SQL.


Flattening the Nested Data Structure

Understanding Table Structure

Let’s start by understanding the raw GA4 data export structure in GCP. Below is a simplified illustration of the structure of the export if you were to run a query like the following:

SELECT *
FROM `your GA4 data export table`
Figure 1: The raw GA4 export data structure

As you can see above, the event_params are a nested array. As mentioned before, this nested structure isn’t ideal if you want to do any analysis.


Structuring the Query

First, let’s write some pseudo-code to demonstrate the structure of our query. This query flattens the nested data structure of a GA4 data export using a combination of the UNNEST(), COALESCE(), and MAX() functions and a Common Table Expression (CTE) with two parts:

WITH alpha AS (
	Get data from raw GA4 data export table
	SELECT variables from your variable mapping reference
	UNNEST event_params array
	Compress event values into a single column (COALESCE)
)
, beta AS (
	Get data from alpha
	Pivot event_params variables into columns (MAX)
	Flatten remaining rows (GROUP BY)
)
Get data from beta
Create unique_session_id and any other additional desired columns

Now that we’ve planned our structure, let’s jump into some SQL.

In the following section, I will specifically walk through the trickier parts of the query.


1. Unpack Event Parameters: Using UNNEST()

The first part of the query, alpha, uses UNNEST() to unnest the event_params array column of your GA4 data export table.

UNNEST(event_params) AS ep

This action creates a new row for each element in the event_params array, with columns for your chosen dimensions like event_date, user_pseudo_id, and event_name as shown in the illustration below:

Figure 2: The data structure after unnesting event_params

2. Compress Event Values: Using COALESCE()

Next, we want to consolidate our event_params values so that every event parameter key only has one value. You can use the COALESCE() function to compress the four event value columns into a single event value column. For any given event_params key value, there will only be one non-null event value in the string_value, int_value, double_value, or float_value columns. The COALESCE() function allows us to extract the non-null values from the value columns, and we concurrently CAST all values to strings such that all values in our resulting column event_value will be the same data type.

COALESCE(
    value.string_value,
      CAST(value.int_value AS string),
      CAST(value.double_value AS string),
      CAST(value.float_value AS string)
    ) AS event_value
Figure 3: The data structure after consolidating our event parameter values using COALESCE()

3. Pivot and Flatten: Using MAX() and GROUP BY

The second part of our query, beta, flattens the data by grouping the alpha table by the common columns and pivoting the event key and event_value columns into new columns using the MAX() function. This action creates a single row for each unique combination of your columns (in this instance, event_date, user_pseudo_id, event_name, ga_session_id, campaign_source, campaign_id, and campaign_content).

MAX(IF(event_key = "ga_session_id",event_value,NULL)) AS ga_session_id,
MAX(IF(event_key = "source",event_value,NULL)) AS campaign_source,
MAX(IF(event_key = "campaign",event_value,NULL)) AS campaign_id,
MAX(IF(event_key = "content",event_value,NULL)) AS campaign_content
Figure 4: The data structure after pivoting and flattening the event parameters

4. Add Session ID: Using CONCAT()

Finally, the main query selects the desired columns (event_date, user_pseudo_id, event_name, ga_session_id, source, campaign, content, session_id, event_number) from the flattened beta table. The CONCAT() function concatenates ga_session_id and user_pseudo_id to create a unique_session_id, and ROW_NUMBER() is used to generate an event number within each ga_session_id partitioned and ordered by event_timestamp.

CONCAT(ga_session_id,".",user_pseudo_id) AS unique_session_id,
ROW_NUMBER() OVER(PARTITION BY ga_session_id ORDER BY ga_session_id, event_timestamp ASC) AS event_number
Figure 5: The data structure after adding our additional columns

Quality Assurance

After your query produces a flattened output, ensure it pulls the data as intended. For this result, I suggest creating high-level reports in the GA4 UI and comparing them to the summarized data in your query. Here is the quality assurance format that I came up with:

Here's a link to my template if you want to reuse it.

Input your summarized data into a Google Sheet or Excel Spreadsheet and compare. If the variance looks too large, double-check your query to ensure no issues in your code. However, some data points will likely have a higher variance than others (like “Sessions,” which is nicely explained by Tanelytics in the article here).


Final Thoughts

If you’ve made it this far, you (hopefully) now have some flattened data! Now you’re well on your way to the fun part: analysis.

The query I’ve walked through should enable you to recreate reports from the GA4 UI and empower you to create new, meaningful reports that would be far more complex for the UI alone.

So get creative! Push the boundaries! Happy analyzing!

References and Additional Resources

My Template Variable Mapping Reference, QA Doc, and Query
Google Support | [GA4] BigQuery Export schema
Google Support | [GA4] Default channel group
Tanelytics | Defining and calculating sessions
GPT4SQL.com

Full Sample Query:

WITH alpha AS (
  SELECT
    event_date,
    event_timestamp,
    event_name,
    user_pseudo_id,
    geo.region AS geo_region,
    traffic_source.medium AS campaign_medium,
    key AS event_key,
    COALESCE(
      value.string_value,
      CAST(value.int_value AS string),
      CAST(value.double_value AS string),
      CAST(value.float_value AS string)
    ) AS event_value
  FROM `your GA4 data export table`
  , UNNEST(event_params) AS ep
  WHERE 1=1
    AND event_date = "your date"
-- LIMIT 100 -- I recommend setting a limit while testing
)
, beta AS (
  SELECT 
    event_date,
    event_timestamp,
    TIMESTAMP_MICROS(event_timestamp) event_ts_pretty,
    event_name,
    user_pseudo_id,
    geo_region,
    campaign_medium,
    MAX(IF(event_key = "ga_session_id",event_value,NULL)) AS ga_session_id,
    MAX(IF(event_key = "campaign",event_value,NULL)) AS campaign_id
  FROM alpha
  GROUP BY 1,2,3,4,5,6,7
)

SELECT
    CONCAT(ga_session_id,".",user_pseudo_id) AS unique_session_id,
    event_date,
    event_name,
    ROW_NUMBER() OVER(PARTITION BY ga_session_id ORDER BY ga_session_id, event_timestamp ASC) AS event_number,
    event_timestamp,
   event_ts_pretty,
    geo_region,
    campaign_id,
    campaign_medium
from beta
Ciara Adkins
Senior Analytics Architect

Recent Articles

Engineering
.
Building Scalable White-Label Android Apps
Sean Kenkeremath
Engineering
.
Text Searching with MongoDB
Matthew O'Connell