Skip to content

How you structure data before an AI agent touches it matters more than which model you point at it. We ran a controlled test to measure by how much.

We connected an AI agent to the same paid search performance data in four different ways, then asked the same 18 questions of each, from simple lookups to anomaly detection. The model was held constant throughout. Reliability ranged from 36% to 89%, and the difference came down entirely to how the data was structured before the agent reached it.

 

Direct platform API access averaged 36% reliability on anything beyond basic spend and traffic metrics.
A modelled data warehouse layer averaged 75%, but behaved non-deterministically on the harder questions.
Adding a semantic configuration layer on top of the modelled layer reached 89% average reliability, including anomaly detection.
The gap between 36% and 89% is explained by information structure, not model capability.
The same problem applies to paid social, financial reporting, product analytics, and anywhere an agent has to decide which metric definition to use.

The Reliability Problem

Connecting AI agents to data sources has become one of the most discussed capabilities in performance marketing over the past year. The pattern is consistent across the teams we speak to, which is: connect the agent to the tools it needs, point it at the data, and let it answer questions directly. For paid search and paid social teams that usually means the Google Ads or Meta APIs, a BigQuery warehouse, or both.

The opportunity is real. An agent that can answer reporting questions on demand removes a large amount of manual analysis. What very few teams have measured is whether the answers are correct at any meaningful depth. Asking an agent what you spent last week is a different problem from asking what your CPA was by campaign type, whether you were pacing against your monthly target, or which segment drove a week-on-week drop. The questions that actually drive decisions are harder than simple lookups, and there is a wide gap between an agent that can attempt an answer and one that returns the right number.

Most teams have treated reliability as a model question. The assumption is that a more capable model handles complexity better, reasons more accurately, and makes fewer mistakes, so attention has gone to model selection rather than data preparation.

We think the more important variable sits earlier in the chain. Context engineering is the practice of curating what an agent can access: which tables it can query, which metric definitions it has been given, which business rules it knows about, and what fallback strategies exist for known gaps in the source data. Our working hypothesis was that well-structured data produces more reliable outputs than a more capable model working from poorly prepared data. To test it, we ran a structured study against a live Google Ads account using the same data access patterns we deploy for clients.

Two points before the results.

Scope

This problem is not specific to paid search.

Meta and other platforms present the same complexity. Any reporting context where an agent queries business data faces the same structural question: which definition of the key metric should the agent use when the data does not say? In paid search that is a conversion action and its attribution path. In financial reporting it is which revenue figure. In product analytics it is how an active user is counted. The underlying data engineering problem is identical.

 

Methodology

Two runs against one account is not enough for statistical confidence.

We are not claiming otherwise. The directional results are consistent with what we see across client deployments and with similar published studies. More importantly, the failure modes are structural rather than random: they reproduce reliably whenever the same data conditions are present.

What We Tested: Four Levels of Context

We tested four access patterns that represent a real progression from raw to structured. These are not arbitrary configurations. They map to the path most teams actually follow when building AI on top of performance data, from the quickest setup to the most invested.

Source What it is
1. Google Ads MCP Direct API access. No added context on metric definitions or account structure. The default starting point when teams first connect AI to their ad platforms.
2. BigQuery Raw Staging tables. Data in its pre-modelled state, with no joins, no computed fields, and no metric definitions.
3. BigQuery Curated Our dbt-modelled layer. Pre-joined tables with campaign type computed and ready for querying.
4. Client Semantics The curated layer plus a YAML configuration file encoding metric formulas, segment definitions, and pacing fallback strategies.

How We Tested It

We asked 18 questions across four tiers of increasing complexity.

Tier 1

Simple lookups

Spend, impressions, clicks, conversions.

Tier 2

Calculated metrics

CPA, CVR, CPC, conversion rate by segment.

Tier 3

Week-on-week comparisons

 

Tier 4

Anomalies and business logic

Pacing, segment breakdowns, edge cases.

Each source was tested in a fresh agent session, with identical question wording across all four. Every source ran twice to separate systematic failures from random variance. An answer was counted as correct if it fell within 1% of a reference value taken from a Tableau report currently treated as the source of truth. Every failure was tagged with a root cause.

Results

The results form two distinct clusters rather than a gradient. The Google Ads MCP sat in the mid-thirties on both runs. The modelled sources occupied a different band entirely. Within that upper band, the semantic layer added a further step, most visible at Tier 4.

Source Run 1 Run 2 Average
Google Ads MCP 33% 39% 36%
BigQuery Raw* 39% N/A N/A
BigQuery Curated 61% 89% 75%
Client Semantics 83% 94% 89%

* BigQuery Raw excluded after Run 1.

 
We dropped BigQuery Raw after the first run. It scored one question higher than the Google Ads API across the 18 questions, required more tool calls than any other source, ran up significant BigQuery job cost from processing unnecessary data, and in Tier 4 returned conversion rates above 100%. Without semantic guidance to filter to the right conversion action, the agent took whatever it could find and produced numbers that cannot exist. A confident wrong answer is harder to catch than a declared gap, because nothing signals that anything went wrong. The cost-to-signal ratio did not justify a second run.
 
A note on the BigQuery Curated variance. The 28-point jump from Run 1 to Run 2 is partly an artefact. One question had ambiguous wording that we corrected between runs, and one returned different granularity from run to run because of non-deterministic query behaviour. The true reliability of BigQuery Curated sits somewhere inside that range, not consistently at either end. As we explain below, that instability is itself the finding.

Finding 1: The attribution gap is a structural ceiling

The Google Ads API returns conversions based on whatever is marked as primary in the account. The problem is that the conversion action a team actually reports on is often a custom action, a secondary conversion, or one that aggregates across attribution paths that only exist in the modelled data layer. In our account, the API returned roughly 521 conversions. The figure in our curated layer, combining online-attributed and offline-attributed paths, was 665.

Every metric with conversions in the formula is wrong as a result. CPA was inflated by 28%. That is a magnitude error, and most practitioners would catch it.

Returning the wrong number is one thing. Returning the wrong direction of a week-on-week shift is worse, because it feeds directly into recommended actions. Spend gets reallocated away from what is working. Campaigns get flagged for review that do not need it. A real performance drop goes unaddressed while the team responds to a trend that does not exist.

API reported

+1.6%

Actual result

−8.4%

The instability compounds the problem. The agent chose a different query strategy on each run. Run 1 pulled primary conversions only. Run 2 attempted to sum all conversions using data-driven attribution fractional splits. Neither matched the reference. With no anchor defining which conversion action to use or how it is attributed, the agent guesses differently each time.

It is not all downside. The Google Ads MCP approach is reliable on attribution-independent metrics: spend, impressions, clicks, and CPC. For any account where the conversion action used in reporting differs from what the platform surfaces by default, that is its reliable ceiling.

Finding 2: Three decisions encoded in a configuration file

Client Semantics outperformed BigQuery Curated because three specific data engineering decisions were written into the configuration file that the curated layer alone could not make consistently. A simplified version looks like this.

metrics:
  conversion_rate:
    formula: "conversions / clicks"
  cpa:
    formula: "cost / conversions"
segments:
  campaign_type:
    values: [Brand, Generic, Competitor Type A, Competitor Type B, PMax]
pacing:
  strategy: bq_derived
  fallback: derive from MAX(daily_budget) per campaign_type in performance table

Pacing against target. The curated datasets returned no answer on both runs because the account was absent from where the agent assumed it would find the information. The bq_derived fallback in the config bypassed this entirely, deriving the pacing figure from the performance table instead. Same underlying data, completely different outcome.

Highest-CPA campaign sub-type. The curated approach collapsed two sub-types of a campaign segment into a single aggregated row in Run 1, masking the correct answer. In Run 2, by chance, the agent ran a schema inspection step first and surfaced the distinct values. Client Semantics returned the correct answer on both runs because the segment definitions in the config prevent that collapse.

Lowest conversion rate by campaign type. Here the curated version returned a different wrong answer each run. Client Semantics returned the correct answer both times because an automated campaign type is named as a distinct segment and the formula is anchored explicitly.

None of these are reasoning failures. The curated tables contained the data needed to answer all three correctly. The errors came from how the agent interpreted that data, not from the data itself.

Finding 3: Non-determinism is a production problem

BigQuery Curated’s two-run average of 75% looks acceptable in isolation. The problem is what drives the variance.

On Tier 4 questions, whether the agent gets the right answer depends on whether it runs a schema inspection step at the start of the session. With that step, it surfaces the distinct campaign sub-type values and queries correctly. Without it, it collapses them into aggregated rows and returns the wrong answer. Same data, same model, different result, with no way to predict which one you get unless you explicitly tell the agent to inspect the schema in the prompt.

One of the most notable findings of the study was that the data was right, the modelling was right, but the output was still unreliable.

For a monitoring or reporting use case, non-deterministic accuracy is worse than a consistently lower score. A consistent score is predictable. You know what the agent can and cannot answer, and you build around it. Non-determinism means the same question might be answered correctly on Tuesday and incorrectly on Thursday, with no signal that anything has changed.

Finding 4: A consistent answer is not the same as a matching one

Client Semantics returned the same CVR figure across both full runs. The dashboard showed a different number. The config defined CVR using one formula; the reference dashboard used a different but equally valid definition of the same metric, applied to a different stage of the funnel. The agent followed the config, correctly and consistently.

The semantic layer did not produce a wrong answer. It produced a consistent answer to a question where two valid metric definitions coexisted and had never been reconciled. Without the config, different runs returned different CVR figures depending on which formula the agent happened to choose. With the config, the divergence was stable and traceable to a specific piece of business logic defined in the YAML.

This is one of the more useful things a semantic layer can do: it surfaces metric definition disagreements that already exist rather than hiding them in variance. The fix is straightforward, because two calculations measuring two different things should not share the same metric name.

The infrastructure cost of getting it wrong

When evaluating AI capabilities it is easy to overlook the operational realities of running these models at scale. A poorly connected agent does not just produce bad reports and questionable recommendations. It also carries significant infrastructure overhead.

Source Tool calls Retries End-to-end time
Google Ads MCP 24 2 ~12 min
BigQuery Curated 14 1 ~3 min
Client Semantics 12 ↓50% 0 ↓100% ~3 min ↓75%

 

Token inflation. The Google Ads API approach required 24 tool calls and multiple retries. In an agentic loop, every retry re-sends large payloads of history back into the model’s context window. More tool calls scale up inference costs directly and create more opportunities for compounding logic errors.

Warehouse compute overhead. BigQuery Raw ran up significant job costs from processing entirely unnecessary rows and columns. Because the agent had no semantic guide telling it where to look, it executed broad data sweeps just to answer basic questions.

The latency penalty. A 12-minute wait for a single reporting lookup is functionally useless in production. Injecting a 400-line YAML config up front lets the agent map the exact columns it needs on the first call, bypassing the loop that slows down raw database setups and dropping latency to a production-grade 3 minutes.

What this means

01

Deploy the semantic layer, and resolve the metric definitions first.

89% average reliability across 18 questions, including anomaly detection, is production-grade for reporting. The remaining gap is largely a naming decision in the config. Before deploying, audit whether the metric definitions in the config match what you expect. The CVR finding is a useful reminder that a config can be internally consistent while still diverging from the reference view.

02

Do not rely on platform MCPs for conversions.

Spend, impressions, clicks, and CPC are safe over raw connections. Anything with a conversion action in the formula depends on whether the API and your attribution model agree on what counts. For most real accounts there will be a gap, and in some cases that gap produces directionally wrong week-on-week reads rather than just wrong absolute numbers.

03

Insist on determinism.

The relatively high 75% average of the curated approach masks a non-determinism problem. On anomaly detection and business logic questions it cannot be trusted to answer the same question the same way twice. That makes it unsuitable as a production reporting source without a semantic layer on top.

04

Budget for maintenance.

The semantic layer has a running cost. The config becomes the source of truth, so it has to stay in sync with how the account is actually structured. New campaign types, renamed conversion actions, and changes to naming conventions all need to be reflected in the config to maintain reliability. The benefit is that the config makes these decisions explicit and auditable rather than buried in dashboard logic that is hard to version or review.

Why modern data teams already have the foundation

A modern data stack reduces the friction here significantly. The semantic layer in this study is a YAML config file that defines metric formulas, names segment values, and specifies fallback strategies for known data gaps. It sits alongside dbt models in a Git repository, which means every change to a metric definition or a segment value goes through the same review process as a change to the data models themselves.

For a modern data team, that is the same workflow they already use for the rest of their infrastructure. The decisions encoded in the config are the same decisions required when building a curated reporting layer for a dashboard. Context engineering is not a new discipline bolted on top of data work. It is the same data work applied to a new endpoint.

The teams that struggle with AI reporting reliability are, in most cases, the teams that skipped the modelling step. A team that has invested in modelling its data correctly, defining its metrics explicitly, and maintaining those definitions in version control already has most of what an AI reporting agent needs to work reliably.

If you are building or evaluating an AI reporting agent today, the minimum viable setup for anything beyond simple spend and traffic metrics is a modelled data layer with explicit metric formulas, named segment values, and fallback strategies for known data gaps. Direct API access alone will not get you there, and that holds whether the channel is paid search, paid social, or any other reporting domain.

This is the work we do for clients: building the modelled data layers, metric definitions, and semantic configuration that make agentic reporting trustworthy rather than merely impressive in a demo.

If you are weighing up an AI reporting agent on top of your performance data, we are happy to walk through what the setup actually requires. Get in touch.