
Keyword Consolidation

What happens when you go from 3,000 keywords to 80?

Dave Bates

August 14, 2023

Reading time: 10 minutes

Context

Changes to match types

Over the last few years, Google has changed how it matches user search queries with the keywords in your account.

Exact match became “exact-ish”, and so on.

Does this change mean that you now have multiple keywords in your account that could, in theory, trigger the same query?

Is that good? Is it not? Could your performance improve if you optimised this?

We tested this theory to see what the impact would be (if any). In this article, we will cover how we chose the keywords to remove, our testing framework, how we analysed the experiment, the results, and what we plan to do next.

The account & business

A high-volume and already-scaled account was looking to improve efficiency and ease of management.

The campaign structure was very simple, ad groups had a good logic for splits, and there were many thousands of keywords in each ad group.

Why would we want to consolidate keyword sets?

As Google continually updates & evolves its match type algorithms, simplifying & de-cluttering your keyword set can help ensure volume isn’t unnecessarily fragmented across multiple keywords that all trigger similar search terms.

Although there is nothing inherently wrong with having many keywords (this was best practice in the past), the theories behind reducing these keywords could result in the following benefits:

 

  • A smaller keyword pool is easier to manage and report on.
  • Reducing the number of keywords increases the volume driven through each remaining keyword, enabling keyword-level decisions to be made more quickly and accurately.
  • Theoretically, if we are driving searches through the best-performing keywords, we may see an uptick in efficiency without constraining total volume.
  • Fewer keywords and larger data volumes might improve the performance of bidding strategies.

Our hypotheses

  1. Keyword consolidation results in no tangible reduction in performance (from a volume, efficiency, and cost perspective).
  2. Reduced keywords actually result in an improvement in performance.

Approach

Due to convergence in match type, we designed a process around basic NLP (Natural Language Processing) techniques to understand how and why certain keywords would match with certain queries. This process, coupled with some simple keyword cleaning best practices (i.e. pausing keywords that have never resulted in impressions), was then applied to one of the ad groups in the campaign we wanted to test our hypothesis on.

 

 

This process can be broken down into the following elements:

1. Tokenization & stop word removal

Tokenization refers to the process of breaking a phrase into small chunks, e.g. [“This is a unique keyword”] would become [“This”, “is”, “a”, “unique”, “keyword”]. This then enables us to iteratively remove terms known as stop words, e.g. the above would become [“unique”, “keyword”]. For this process, we use the English stop word list from NLTK’s stopwords corpus (downloaded via nltk.download("stopwords")).

2. Lemmatization

Lemmatization is the process of reducing a word to its root form, e.g. “running” becomes “run” and “shoes” becomes “shoe”. This stage of the process is really important as it enables us to account for things Google is already adjusting for in its match type algorithms.

3. Ordering

Simply put, this is the process of ordering the words left in a phrase alphabetically to ensure we don’t keep identical keywords that differ only in word order.
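
To make this concrete, here is a minimal sketch of how the three steps could be chained together in Python with NLTK. The function name and example keywords are illustrative and not taken from the actual account.

```python
# Minimal sketch of the normalisation pipeline: tokenize, drop stop words,
# lemmatize, then sort the remaining tokens alphabetically.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off downloads of the NLTK resources this sketch relies on.
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalise(keyword: str) -> str:
    tokens = keyword.lower().split()                      # 1. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   #    stop word removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 2. lemmatization
    return " ".join(sorted(tokens))                       # 3. alphabetical ordering

# Illustrative keywords only; keywords sharing a normalised form are
# candidates for consolidation into a single keyword.
keywords = ["running shoes for men", "men running shoe", "shoe for men running"]
groups = {}
for kw in keywords:
    groups.setdefault(normalise(kw), []).append(kw)
print(groups)
```

Keywords that collapse onto the same normalised form can then be reviewed, and a natural choice is to keep the strongest performer in each group.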

This approach recommended that we reduce keywords from 3,000+ down to ~80.

Analysis - did consolidation hurt performance?

There are several ways that test data can be run. In an ideal world, a controlled A/B test is the most statistically robust; however, for this client, this was not appropriate due to the campaign structure and the nature of the test. 

 

As a result, alternative testing techniques & their limitations were summarised as follows:

  1. Pre vs. Post – Comparing the figures before and after a change in the treatment group
    1. Assumes there are no seasonal/market considerations i.e. no control reference.
  2. Difference in Differences (DiD) – Quasi-experimental approach that compares the changes in outcomes over time.
    1. Relies on a control group that is assumed to show a perfect parallel trend.
    2. Generally done on a point estimate basis i.e. generates a single value difference based on pre vs. post.
  3. Synthetic Control and/or Causal Impact* – Uses control groups to construct a counterfactual for the treated group, giving us an idea of what the trend would have been had the treatment not happened.
    1. These methods are highly susceptible to change based on the input data provided. Similar to DiD, they also require a good control reference to ensure the output is an accurate representation of the data.

 

*The difference between Synthetic Control and Causal Impact is that Synthetic Control uses only pre-treatment variables for matching, while Causal Impact uses the full pre- and post-treatment time series of the predictor variables.

Did the remaining keywords pick up volume?

As we can see below, upon pausing a large set of keywords on October 24th, the volume of our pre-selected keyword rose significantly:

If we then look at the specific search term and how the keywords triggering for the term change, we can see the following:

 

Pre/Post, Difference in Differences & Synthetic Control

With all pre-post test approaches that leverage a control group, isolating the most representative ad group(s) to act as the control is key. 

The chart below shows the correlation in conversions between the selected control and the experiment group prior to the test launch**.

 

**As mentioned previously, with control-based experiment analyses the limiting factor in having complete confidence in the result is finding a control that maps closely to the experiment group across all primary measures. For this reason, tests of this nature are not ideal.

Comparing Pre vs Post, the simplistic approach

The most simplistic method of analysis is comparing the average number of daily clicks & conversions Pre & Post test. Here standard deviation is simply being used to illustrate variance in the data:

 

 

If we run a simple T-test on the pre vs post data we can see there is no significant difference in click volume – this is also true of conversion volume.
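
For illustration, a comparison like this can be run with a standard two-sample t-test. The sketch below assumes daily click counts in two lists; the numbers are placeholders, not the client’s data.

```python
# Minimal sketch of the pre vs. post significance check on daily clicks.
from scipy import stats

pre_clicks = [120, 135, 128, 140, 118, 132, 125]   # daily clicks before consolidation (placeholder)
post_clicks = [130, 127, 138, 122, 141, 129, 133]  # daily clicks after consolidation (placeholder)

# Welch's t-test: does not assume equal variances between the two periods.
t_stat, p_value = stats.ttest_ind(pre_clicks, post_clicks, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A large p-value means we cannot reject the hypothesis that the pre and
# post means are the same, i.e. no significant change in click volume.
```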

 

Although this method is a good simple approach to analysing performance – it doesn’t consider other contributors to changes in performance such as:

 

  • Seasonal differences
  • Changes in demand
  • Competition 

Point estimate view using Difference in Differences

A point estimate view of Difference in Differences (DiD) takes into consideration what occurred in the control group to determine the drivers of the impact observed in the test group. Visually this looks as follows, where β3 represents the intervention effect of the test.

 

 

Applying this to our model, the output suggests an increase of 8.6 conversions per day under the DiD methodology, with our Control Ad Group as the control. If we take into consideration the standard deviation of the data in the post period, this lies within one standard deviation of the mean, suggesting we can’t conclusively say this change improved performance.

Additionally, if we run the same analysis with all relevant ad groups as the control, the estimated increase is much smaller: ~3.5 additional conversions per day. This further shows the fragility of this methodology.
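
For readers who want to reproduce a point estimate like this, below is a minimal sketch of the DiD regression using statsmodels on synthetic placeholder data; the column names and figures are illustrative, not the client’s.

```python
# Minimal sketch of a point estimate DiD regression on daily conversions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    # treated = 1 for the experiment ad group, 0 for the control ad group
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    # post = 1 for days after the consolidation date, 0 for days before
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
    "conversions": [40, 42, 55, 53, 38, 39, 41, 40],  # placeholder daily figures
})

# conversions = b0 + b1*treated + b2*post + b3*(treated*post) + error
# b3, the coefficient on the interaction term, is the DiD estimate of the
# intervention effect.
model = smf.ols("conversions ~ treated + post + treated:post", data=df).fit()
print("DiD estimate (b3):", model.params["treated:post"])
print(model.summary())
```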

Synthetic Control approach time-series view

The final view to consider is the Synthetic Control. It involves the construction of a weighted combination of groups used as controls, to which the treatment group is compared. 

This comparison is used to estimate what would have happened to the treatment group if it had not received the treatment. As a technique, it is very similar to Causal Impact in estimating the true impact of a treatment. The chart below shows this method applied to the conversion volume over time.
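
As a rough illustration of how such a counterfactual could be produced, the sketch below assumes one of the Python ports of Google’s CausalImpact package (e.g. tfcausalimpact) and entirely synthetic data; the series names, dates and figures are placeholders.

```python
# Minimal sketch of a Causal Impact style counterfactual on synthetic data.
import numpy as np
import pandas as pd
from causalimpact import CausalImpact  # e.g. the tfcausalimpact package

dates = pd.date_range("2022-09-01", periods=90, freq="D")
rng = np.random.default_rng(42)
x1 = 100 + rng.normal(0, 5, 90)        # control ad group conversions (synthetic)
y = 0.8 * x1 + rng.normal(0, 3, 90)    # experiment ad group tracks the control
y[60:] += 10                           # simulated lift after the intervention
data = pd.DataFrame({"y": y, "x1": x1}, index=dates)

pre_period = ["2022-09-01", "2022-10-30"]   # pre-intervention window
post_period = ["2022-10-31", "2022-11-29"]  # post-intervention window

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())   # estimated lift vs. the modelled counterfactual
ci.plot()             # observed vs. predicted series and the pointwise effect
```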

 

Analysing changes in rates Experimental vs Control

Although looking at overall volume metrics is essential, isolating just these figures does not account for changes in the landscape. What if competition changed, or we increased our spend during the test period? As such, reviewing rate metrics across the pre and post periods for the experiment & control groups is key.
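
As a simple illustration, these rate metrics can be derived from the raw totals per ad group and period; the sketch below uses placeholder figures and illustrative column names.

```python
# Minimal sketch of deriving CTR, CvR and CPA per ad group and period.
import pandas as pd

df = pd.DataFrame({
    "ad_group":    ["experiment", "experiment", "control", "control"],
    "period":      ["pre", "post", "pre", "post"],
    "impressions": [50_000, 52_000, 48_000, 49_000],  # placeholder totals
    "clicks":      [2_500, 2_650, 2_400, 2_450],
    "conversions": [125, 136, 118, 120],
    "cost":        [5_000.0, 5_200.0, 4_800.0, 4_900.0],
})

summary = df.set_index(["ad_group", "period"])
summary["CTR"] = summary["clicks"] / summary["impressions"]
summary["CvR"] = summary["conversions"] / summary["clicks"]
summary["CPA"] = summary["cost"] / summary["conversions"]
print(summary[["CTR", "CvR", "CPA"]].round(3))
```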

The charts below show a summary of the changes in different rates over time throughout the test period between all other ad groups (red) and the experimental ad group (blue). 

 

 

As you can see with CPA, CvR & CTR, we saw no major discrepancy between the experiment & control groups, either pre or post the test launch.

Reviewing the specific keywords explored

Isolating the main keywords pre & post we can see some interesting trends in relation to the data:

 

 

In the above example, upon pausing Keyword 3 we anticipated that the majority of volume would be picked up by Keyword 2; however, what we observed was Keyword 1 picking up the bulk of the volume.

This raises some interesting questions about the loss of control that comes with consolidating keywords in this way, and as such led us to launch a mini test to understand whether we can bias which keyword Google pushes volume through.

How does keyword matching work?

We launched a mini test to try and understand how Google determines the keyword that enters the bulk of the available auctions. 

To do this we paused the [Keyword 1] keyword that was previously picking up most traffic with the hope that [Keyword 2] would take that traffic (the difference between the two example keywords was ordering and spacing, the actual words were the same). 

We then wanted to see upon re-enabling the [Keyword 1] keyword whether the traffic immediately reverted back to that keyword. 

 

Conclusion

The results

Overall, based on the analysis conducted and the data observed we can conclude the following:

 

  • Keyword consolidation had no tangible negative impact on performance (total click volume, conversion volume, CvR or CPA).
  • Based on the DiD results and the Synthetic Control output relative to the control ad group, we saw an improvement in conversion and click volume relative to what we would have expected – although this is inconclusive at this stage due to the variance in the data and the imperfect test design.
  • Upon pausing keywords, we do see a noticeable drop in volume, which recovers within a few days.
  • Query matching was largely unaffected, although some control over query matching is reduced, with some keywords triggering unexpected search terms. Adding negative keywords is still important.

Final Thoughts

Due to changes in how Google matches keywords with user search queries, advertisers should challenge their current account structures to see if improvements can be made by reorganising how they give data to Google.

What we aimed to do with this experiment was challenge the previous best practice of adding a significant number of keywords to your ad groups (to match with more queries) and see whether this is still relevant for modern Paid Search account structures.

If you want to test this for yourself, we recommend moving forward with caution. As you can see in our testing structure, we took a lot of time to ensure we didn’t negatively impact the conversion volume whilst getting enough data to come to conclusions on how to move forward. 

Next Steps

  • Run a controlled A/B test to isolate whether there is a statistically significant increase in performance, and to determine whether this adjustment justifies the concerns around some reduction in control.
  • Run a follow-up experiment centred around cross-match type keywords & the impact of matching.
  • Depending on the results, look to roll out this methodology across all ad groups.

 

We’d love to hear what you think about our case study and what you have tried on your accounts. Contact us and share your experiences.