GA4 Data Redaction – Clean PII in a simple way

Understanding GA4’s data collection process is key to getting insightful user data while protecting privacy. This post looks at how GA4 collects data, how to avoid unintentionally collecting personal information, and a simpler configuration for effective data redaction. Read on to discover how to make your GA4 data collection both powerful and, most importantly, privacy-compliant.

GA4 Data Collection under the Hood

Let’s dissect how GA4’s data collection works. The following diagram illustrates the GA4 data collection pipeline for a typical client-side measurement setup. This pipeline goes through three phases:

1. Loading the tracking library (gtag.js) and the measurement logic, most of the time deployed via Google Tag Manager.
2. Dispatching the events along with their parameters and metadata to Google’s servers.
3. Processing data on Google’s servers, where data from users’ browsers is gathered and aggregated.

A very common problem is that Personally Identifiable Information (PII) about your users gets collected accidentally, mostly via query parameters. Not only is this a violation of a wide array of regulations across the globe, it is also against Google’s T&C for data collection.
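To see how easily this happens, it can help to scan a URL for values that look like email addresses. The following is a minimal sketch, not an exhaustive PII detector: the function name findEmailLikeParams, the example.com URL and the rough email pattern are all assumptions made for illustration.

```javascript
// Hypothetical helper: lists query parameters whose values look like emails.
// The regular expression is a rough heuristic, not a full email validator.
function findEmailLikeParams(url) {
  var emailPattern = /[^\s@]+@[^\s@]+\.[^\s@]+/;
  var suspicious = [];
  // searchParams decodes percent-encoded values such as %40 for "@"
  new URL(url).searchParams.forEach(function(value, name) {
    if (emailPattern.test(value)) {
      suspicious.push(name);
    }
  });
  return suspicious;
}

// Made-up URL, as it might look after a newsletter sign-up
var sample = 'https://www.example.com/thanks?email=jane.doe%40example.com&plan=pro';
console.log(findEmailLikeParams(sample)); // ["email"]
```

Running a check like this over a sample of landing page URLs gives a first list of parameter names worth redacting.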

The usual workaround is to override page_location, all within Google Tag Manager: a Custom JavaScript variable rewrites the URL, and the GA4 Config tag is updated to send that variable’s value as page_location. The JavaScript function below shows such a variable.

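function() {
  // Capture the full URL of the current page
  var url = window.location.toString();
  // Each rule targets one query parameter. ([?&]) anchors the match to the
  // start of a parameter name, so "email" does not also match "useremail";
  // [^&#]* stops at the next parameter or at the #fragment.
  var filter = [
    {
      rx: /([?&])firstnamename=[^&#]*/g,
      replacement: '$1firstnamename=REDACTED_NAME'
    },
    {
      rx: /([?&])email=[^&#]*/g,
      replacement: '$1email=REDACTED_EMAIL'
    }
    // add more parameters to redact
  ];

  // Apply every rule to the URL in turn
  filter.forEach(function(item) {
    url = url.replace(item.rx, item.replacement);
  });
  return url;
}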
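The same logic can be tried outside Google Tag Manager by passing the URL in as an argument. This is a minimal sketch: the function name redactUrl and the example.com URLs are made up for illustration, and the patterns are anchored to [?&] so that a parameter such as useremail is left untouched.

```javascript
// Hypothetical standalone version of the variable, for testing outside GTM
function redactUrl(url) {
  var filter = [
    { rx: /([?&])firstnamename=[^&#]*/g, replacement: '$1firstnamename=REDACTED_NAME' },
    { rx: /([?&])email=[^&#]*/g, replacement: '$1email=REDACTED_EMAIL' }
  ];
  // Apply every rule to the URL in turn
  filter.forEach(function(item) {
    url = url.replace(item.rx, item.replacement);
  });
  return url;
}

console.log(redactUrl('https://www.example.com/thanks?email=jane%40example.com&plan=pro#top'));
// https://www.example.com/thanks?email=REDACTED_EMAIL&plan=pro#top
```

Checking a handful of real URLs from your site this way, including ones with fragments and with similarly named parameters, is a cheap safeguard before publishing the container.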

This workaround is not overwhelmingly complex, but one might argue that a platform like GA4, which prides itself on being “Privacy First”, should offer a more streamlined solution that depends less on workarounds.

There’s a better way.

Fortunately, recent updates have introduced a more straightforward method. The new configuration does exactly what you’d expect: you list the query parameters you want redacted (as you would in the filter array above) and let the native gtag.js library do the URL manipulation for you, as illustrated under GA4 Config and GA4 Test below.

GA4 Config

You can also test it:

GA4 Test

The changes take effect immediately, so you can verify them right away in the network requests that send data to GA4.

network request
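If you prefer to check a captured request programmatically rather than by eye, something along these lines may help. It is only a sketch under stated assumptions: it assumes the page URL travels in the dl query parameter of the request, as GA4 collect requests typically show in the browser’s network tab, and the function name findUnredacted is hypothetical. The placeholder is passed in as an argument; the value used in the example is an assumption, so read the actual replacement text off your own requests.

```javascript
// Hypothetical checker: returns the names of the listed parameters that were
// NOT redacted in the page URL carried by a captured GA4 request.
function findUnredacted(requestUrl, paramNames, placeholder) {
  // Assumption: the page URL is carried in the "dl" query parameter
  var pageUrl = new URL(requestUrl).searchParams.get('dl');
  if (!pageUrl) { return []; }
  var pageParams = new URL(pageUrl).searchParams;
  return paramNames.filter(function(name) {
    var value = pageParams.get(name);
    return value !== null && value !== placeholder;
  });
}

// Made-up request in which only the email parameter has been redacted
var pageUrl = 'https://www.example.com/thanks?email=(redacted)&firstnamename=Jane';
var captured = 'https://www.google-analytics.com/g/collect?v=2&en=page_view&dl=' +
  encodeURIComponent(pageUrl);
console.log(findUnredacted(captured, ['email', 'firstnamename'], '(redacted)'));
// ["firstnamename"]
```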

However, be aware that all of this applies only to client-side data collection, since it relies on the gtag.js library being present to do the redaction. Additionally, it only applies to the following parameters: page_location, page_referrer, page_path, link_url, video_url, and form_destination. Implementations that send offline data via the Measurement Protocol still require you to make sure PII does not reach Google’s servers.
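For Measurement Protocol implementations, one option is to scrub the event payload yourself before it is sent. The sketch below assumes the usual payload shape (a client_id plus an events array whose items carry name and params); the helper names redactQuery and scrubPayload, the REDACTED placeholder and the sample values are all hypothetical.

```javascript
// The URL-type parameters named above
var URL_PARAMS = ['page_location', 'page_referrer', 'page_path',
                  'link_url', 'video_url', 'form_destination'];

// Hypothetical helper: replaces the values of the listed query parameters
function redactQuery(value, namesToRedact) {
  return namesToRedact.reduce(function(result, name) {
    // Escape regex metacharacters in the parameter name
    var safeName = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    var rx = new RegExp('([?&])' + safeName + '=[^&#]*', 'gi');
    return result.replace(rx, '$1' + name + '=REDACTED');
  }, value);
}

// Returns a copy of the payload with the URL parameters redacted
function scrubPayload(payload, namesToRedact) {
  var copy = JSON.parse(JSON.stringify(payload));
  (copy.events || []).forEach(function(event) {
    var params = event.params || {};
    URL_PARAMS.forEach(function(key) {
      if (typeof params[key] === 'string') {
        params[key] = redactQuery(params[key], namesToRedact);
      }
    });
  });
  return copy;
}

// Made-up payload for illustration
var payload = {
  client_id: '123.456',
  events: [{
    name: 'page_view',
    params: { page_location: 'https://www.example.com/thanks?email=jane%40example.com&plan=pro' }
  }]
};
console.log(scrubPayload(payload, ['email']).events[0].params.page_location);
// https://www.example.com/thanks?email=REDACTED&plan=pro
```

Note that this only covers query parameters inside URL-type fields. PII placed directly in other event parameters or user properties would need its own handling.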

Summary

GA4’s data collection is a crucial yet complex three-phase pipeline, often marred by unintended collection of personal data. The new, straightforward redaction configuration in GA4 not only eases the challenge of ensuring data privacy but also frees analysts from time-consuming workarounds. Now we can do more of what brings value: extracting insights from the data.
