Module 4 of 7

Running a Data Quality Programme That Holds Up

How to design SLOs, monitoring, and incident response for action data so it stays trustworthy under regulatory and operational pressure.


From data quality as opinion to data quality as measurement

In most enterprises, "data quality" is an opinion. The data team thinks it's mostly fine. The business team thinks it's mostly broken. Audit thinks it's well-documented. The AI team thinks it's the reason their projects are stuck. They're all partially right, and there is no shared instrument for resolving the disagreement.

Action data forces you to fix this. When the data is feeding automated decisions, "mostly fine" isn't good enough — you need to know exactly which properties of the data are guaranteed, to what degree, over what time window, and what happens when those guarantees are broken. That is what a data quality programme actually is: not a data quality team, not a quarterly score, not a documented framework. A working set of measurable commitments and the operational machinery to enforce them.

This module is about how to build that programme.

Service-level objectives for data

The right starting point is the same one that DevOps used for application reliability a decade ago: service-level objectives. An SLO is an explicit, measurable commitment to a property of the data over a time window, with a clear consequence if it is breached.

For action data, the SLOs that matter most are:

  • Freshness. "The customer wide-row will be updated within 90 seconds of any change to its source systems, 99% of the time over a rolling 7-day window."
  • Completeness. "Every customer wide-row will have a non-null value in the 12 mandatory fields, 99.9% of the time."
  • Validity. "All customer IDs in the wide row will resolve to a valid record in the customer master, 100% of the time."
  • Distributional stability. "The distribution of risk scores will not drift by more than 2 standard deviations week over week without an alert."
  • Lineage availability. "The full lineage of any row in the wide table will be queryable within 2 seconds, 99% of the time."

These are concrete commitments. They can be measured. They can be alerted on. They can be reported to regulators. They can be the basis of incident response. None of them require a "quarterly data quality review."
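An SLO in this sense can be expressed directly as data and evaluated mechanically. Here is a minimal sketch of the freshness commitment above; all names (`DataSLO`, `evaluate`, the example lag figures) are illustrative assumptions, not a prescribed implementation:

```python
from dataclasses import dataclass

@dataclass
class DataSLO:
    dataset: str
    metric: str        # e.g. "freshness", "completeness", "validity"
    threshold: float   # e.g. 90.0 seconds of allowed update lag
    target: float      # fraction of observations that must meet the threshold
    window_days: int   # rolling evaluation window

def evaluate(slo: DataSLO, observations: list[float]) -> bool:
    """True if the share of observations within the threshold
    meets or exceeds the target over the window."""
    if not observations:
        return False  # no measurements means the SLO cannot be shown met
    within = sum(1 for obs in observations if obs <= slo.threshold)
    return within / len(observations) >= slo.target

# "Updated within 90 seconds of any change, 99% of the time,
#  over a rolling 7-day window."
freshness_slo = DataSLO("customer_wide_row", "freshness", 90.0, 0.99, 7)

lags_seconds = [42.0, 75.0, 88.0, 120.0] + [30.0] * 396  # measured update lags
print(evaluate(freshness_slo, lags_seconds))  # one breach in 400 -> still met
```

The point of the sketch is that the commitment is a testable predicate over measurements, not a narrative judgement: the same structure works for completeness (null rates), validity (resolution rates), or any property you can count.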

Monitoring as a first-class concern

You cannot run an SLO regime without production monitoring. The monitoring needs to be in place before the data is consumed by anything — not after.

A working monitoring stack for action data should cover:

  • Freshness monitors. Per-dataset, with thresholds and alerts. Catch a stalled pipeline within the SLO window, not the next day.
  • Volume monitors. Sudden drops in row counts almost always indicate a broken upstream connector. Catch them within minutes.
  • Schema monitors. A new column appears, a type changes, a constraint is dropped — these are silent killers because they don't crash anything; they just produce wrong data. You need automatic detection.
  • Distribution monitors. Statistical drift in key fields. Sometimes harmless, sometimes the signal that an upstream system has changed its meaning of a value. Either way, you want to know.
  • Validity monitors. Foreign key resolution rates, format conformance, range checks. These are the boring, load-bearing data quality checks that catch most real incidents.
  • Lineage observability. Real-time, queryable, integrated with your incident response.

Off-the-shelf tools that do most of this well include Monte Carlo, Bigeye, Datafold, and Great Expectations. The point is not which tool you use; the point is that you have one, and it is wired into your on-call rotation.
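A distribution monitor does not have to be sophisticated to be useful. A simple z-score check on weekly summary statistics is enough to implement the "2 standard deviations week over week" SLO above; the function name and example figures here are illustrative assumptions:

```python
import statistics

def drift_alert(weekly_means: list[float], current_mean: float,
                z_threshold: float = 2.0) -> bool:
    """Alert if this week's mean sits more than z_threshold standard
    deviations away from the historical weekly means."""
    mu = statistics.mean(weekly_means)
    sigma = statistics.stdev(weekly_means)
    if sigma == 0:
        return current_mean != mu  # any movement off a flat history is drift
    return abs(current_mean - mu) / sigma > z_threshold

history = [0.41, 0.43, 0.42, 0.44, 0.42, 0.43]  # prior weekly mean risk scores
print(drift_alert(history, 0.43))  # stable week: no alert
print(drift_alert(history, 0.61))  # large shift: alert fires
```

Commercial observability tools run richer tests than this (per-segment drift, categorical distributions, seasonality), but the operating principle is the same: a numeric threshold, evaluated continuously, wired to an alert.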

Data incidents are real incidents

The biggest behavioural shift for most organisations adopting an action-data discipline is treating data incidents like production incidents. A breached freshness SLO needs the same urgency as a downed API. A schema change with no advance notice is a sev-2. A distribution drift alert that turns out to be a real upstream change is a post-mortem item.

This is uncomfortable for data teams that have grown up in the reporting world, where data issues are routine, slowly resolved, and tracked in tickets. It is necessary for action data because the consequences of slow response are different. While the data team takes three days to triage the issue, the workflow is making thousands of automated decisions on bad inputs.

The operational pattern that works:

  • On-call rotation for the data platform team.
  • Paging on critical data SLO breaches, with documented response procedures.
  • Runbooks per dataset, listing common causes and first-step diagnostics.
  • Incident reviews for any breach of a regulated dataset, with named owner and documented learnings.
  • Quarterly review of incident patterns to drive structural fixes, not just symptomatic ones.
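The paging rules in this pattern benefit from living in code rather than in people's heads, so severity routing is consistent and auditable. A sketch, with assumed severity labels and dataset tiers (the sev-1/sev-2 mapping here is illustrative, not a standard):

```python
# Map (breach type, dataset tier) to an incident severity.
# Anything not explicitly listed falls through to a ticket,
# so unknown breaches are recorded rather than silently dropped.
SEVERITY_BY_BREACH = {
    ("freshness", "regulated"):          "sev-1",  # page on-call now
    ("freshness", "internal"):           "sev-2",
    ("schema_change", "regulated"):      "sev-2",  # unannounced change
    ("distribution_drift", "regulated"): "sev-3",  # triage, then review
}

def route(breach_type: str, dataset_tier: str) -> str:
    """Return the severity for a breach; default to a ticket."""
    return SEVERITY_BY_BREACH.get((breach_type, dataset_tier), "ticket")

print(route("freshness", "regulated"))   # treated like a downed API
print(route("volume_drop", "internal"))  # falls through to the default
```

The design choice worth copying is the explicit default: a routing table with a fail-safe fallback means a new class of breach degrades to a tracked ticket, never to silence.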

What the regulator expects

If you are operating in financial services, expect the regulator to care about your data quality programme more every year. The trajectory is clear:

  • BCBS 239 (Risk Data Aggregation and Reporting Principles) set the early standard for data quality in risk-data pipelines: completeness, accuracy, integrity, timeliness, and adaptability.
  • PRA SS1/23 (Model Risk Management) extended these expectations to model inputs more broadly, requiring named ownership, documented lineage, and observable controls.
  • EU AI Act (2024–2026 phased implementation) requires data quality and governance for any data used in high-risk AI systems, including financial services automation.
  • DORA (Digital Operational Resilience Act) requires demonstrable data integrity and recoverability across critical operational pipelines.
  • FCA SYSC has long required adequate systems and controls for data; supervisory expectations are tightening for data feeding automated decisions.

The pattern across all of these is the same: regulators are converging on the expectation that data feeding automated decisions has named ownership, observable lineage, measurable quality controls, and a documented incident response process. If your data quality programme can show all four of these for the data feeding your AI workflows, you are in a defensible position. If it cannot, you have governance debt that will be expensive to retire under pressure.

A pragmatic starting point

You do not have to build this all at once. The pattern that consistently works in our engagements:

  1. Pick one dataset that is feeding an active AI workflow.
  2. Define 3–5 SLOs for that dataset — freshness, completeness, validity, the ones that matter most.
  3. Wire the monitoring for those SLOs into your existing observability stack.
  4. Set up paging for breaches and document the response procedure.
  5. Hold a real incident review the first time something breaks.
  6. Use the experience to design the SLOs for the next dataset.

Within six months you will have a working data quality programme that covers your AI-critical data — without having tried to boil the ocean.

What's next

In Module 5 we'll cover lineage and observability in detail — how to build them as queryable, production capabilities rather than as documentation.

Module Quiz

5 questions — Pass mark: 60%

Q1. What is a data SLO?

Q2. Why is data quality monitoring more important for action data than reporting data?

Q3. What is 'data observability'?

Q4. What should happen when a data SLO is breached?

Q5. How does data quality work intersect with the regulator?
