A working diagnostic
In Module 1 we made the case that there are two kinds of data inside your organisation, and that AI needs the second kind. This module gives you a working diagnostic: five specific characteristics that every AI-ready data layer should have. You can apply this to any dataset in your organisation and tell, in 30 minutes, whether it is action-ready or reporting-only.
The five characteristics are:
- Captured at the point of action
- Standardised at capture time
- Structured around the workflow, not the report
- Available in real time, with consistent latency
- Lineage and quality observable in production
We'll walk through each one, with the anti-pattern that signals it is missing.
1. Captured at the point of action
What it means. Data is created where the work happens — inside the application, the workflow, or the system of engagement — at the moment the work happens. Not reconstructed afterwards from logs, exports, or end-of-day extracts.
Why it matters. Every layer of indirection between the work and the data introduces three things you cannot afford in an automated decision: lag, gaps, and reconstruction errors. By the time the data has been exported, transformed, loaded, normalised, and joined to other tables, the decision the AI needed to make has already been made — by a human, slowly, with worse information.
Anti-pattern. "We extract the data from the source system every hour into a staging area, then run a normalisation job, then load it into the warehouse, then refresh the BI cube, and the AI team picks it up from there." This is reporting plumbing. Every step adds latency and loses fidelity. By the time the AI has the data, the workflow has moved on.
What to do instead. Push the capture upstream. The application that handles the work should produce the action data as a first-class output, in a structured form, into a stream or store that downstream systems can consume directly. The warehouse can still pull from this stream for reporting — but the action consumers don't have to wait for the warehouse to finish its job.
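As a minimal sketch of what "capture as a first-class output" looks like, the following uses an in-process stand-in for a stream (a real system would publish to something like a Kafka topic); all names here are hypothetical. The point is that the action event is produced by the same code path that does the work, and every consumer — including the warehouse loader — reads the same stream directly:

```python
import time
from dataclasses import dataclass, asdict
from typing import Callable, List

# Hypothetical in-process stand-in for a stream (e.g. a Kafka topic).
# Consumers subscribe directly; none of them waits on a warehouse load.
class ActionStream:
    def __init__(self) -> None:
        self._consumers: List[Callable[[dict], None]] = []

    def subscribe(self, consumer: Callable[[dict], None]) -> None:
        self._consumers.append(consumer)

    def publish(self, event: dict) -> None:
        for consumer in self._consumers:
            consumer(event)

@dataclass
class PaymentAuthorised:
    payment_id: str
    customer_id: str
    amount_pence: int
    captured_at: float  # epoch seconds, set at the moment of the work

def authorise_payment(stream: ActionStream, payment_id: str,
                      customer_id: str, amount_pence: int) -> None:
    # ... the actual authorisation work happens here ...
    # The action event is a first-class output of the same code path,
    # not reconstructed later from logs or end-of-day extracts.
    event = PaymentAuthorised(payment_id, customer_id, amount_pence, time.time())
    stream.publish(asdict(event))
```

An AI consumer and the reporting loader would both call `subscribe` on the same stream; neither is downstream of the other.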
2. Standardised at capture time
What it means. Definitions, units, identifiers, and field formats are enforced as data enters the system, not patched up later in normalisation jobs. If "customer ID" must be a 12-character alphanumeric, the application enforces that on input. If "transaction amount" must be in pence as an integer, the schema makes it impossible to write anything else.
Why it matters. Late normalisation is too late. By the time the midnight batch job runs to clean up the data, downstream systems have already consumed the dirty version and acted on it. Fixing it after the fact is more expensive than preventing it at capture, and every error that is not caught in time leaves a trail of inconsistent decisions behind it.
Anti-pattern. "We have a data quality team that runs validation reports each morning and flags issues for remediation." That is a sign that data is being captured without validation. By the time the issues are flagged, the wrong data has already been used.
What to do instead. Define the canonical schemas at the application layer. Make it impossible to write non-conforming data. Treat capture-time validation as part of the application's correctness, not as a downstream data quality concern.
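Using the two example rules from above (a 12-character alphanumeric customer ID, and transaction amounts as integer pence), a capture-time schema can be sketched as a record type that refuses to construct non-conforming data. This is an illustrative stdlib-only sketch; the class and field names are hypothetical:

```python
import re
from dataclasses import dataclass

# Capture-time rule: customer IDs are exactly 12 alphanumeric characters.
CUSTOMER_ID_PATTERN = re.compile(r"^[A-Za-z0-9]{12}$")

@dataclass(frozen=True)
class Transaction:
    customer_id: str   # must match CUSTOMER_ID_PATTERN
    amount_pence: int  # integer pence — never floats, never pounds

    def __post_init__(self):
        # Validation runs at construction, so a non-conforming record
        # cannot exist, let alone be written downstream.
        if not CUSTOMER_ID_PATTERN.fullmatch(self.customer_id):
            raise ValueError(f"invalid customer_id: {self.customer_id!r}")
        if not isinstance(self.amount_pence, int) or isinstance(self.amount_pence, bool):
            raise TypeError("amount_pence must be an integer number of pence")
        if self.amount_pence < 0:
            raise ValueError("amount_pence must be non-negative")
```

The same idea applies whatever the stack: the enforcement lives in the application's write path, so the "morning validation report" has nothing left to find.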
3. Structured around the workflow, not the report
What it means. The schema is designed to support the next decision, not the next dashboard. When an automated workflow needs to ask "what was this customer's risk score five minutes ago?", the answer should be a single read against a single table — not a 4-way join across three databases.
Why it matters. Reporting schemas optimise for flexibility and aggregation. Action schemas optimise for the specific question the workflow is going to ask. Repurposing a reporting schema for action work is one of the most common reasons enterprise AI initiatives stall.
Anti-pattern. "We have a fact table joined to seven dimension tables and the AI team writes their own materialised views off it." This is reporting architecture being asked to do action work. The materialised views are a patch that introduces yet another sync layer to keep current.
What to do instead. Design schemas for the workflow that will use them. If the workflow needs customer-with-risk-score-as-of-now, model that explicitly. The reporting layer can be derived from the action layer — not the other way around.
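To make "model that explicitly" concrete, here is a minimal sketch of an action-layer table for the customer-with-risk-score-as-of-now question, using SQLite for illustration (the table and column names are hypothetical). The current score is maintained on every write, so the workflow's read is a single keyed lookup against a single table:

```python
import sqlite3

# Hypothetical action-layer table: one row per customer holding the
# current risk score, maintained on every score update.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_risk_current (
        customer_id TEXT PRIMARY KEY,
        risk_score  REAL NOT NULL,
        scored_at   TEXT NOT NULL   -- ISO-8601 timestamp of the score
    )
""")

def update_risk_score(customer_id, risk_score, scored_at):
    # Upsert keeps exactly one current row per customer.
    conn.execute(
        """INSERT INTO customer_risk_current VALUES (?, ?, ?)
           ON CONFLICT(customer_id) DO UPDATE
           SET risk_score = excluded.risk_score,
               scored_at  = excluded.scored_at""",
        (customer_id, risk_score, scored_at),
    )

def current_risk(customer_id):
    # The read the workflow actually performs: one table, one key.
    return conn.execute(
        "SELECT risk_score, scored_at FROM customer_risk_current WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
```

The reporting layer can still derive its history tables from the same stream of updates; the workflow never pays the cost of that derivation.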
4. Available in real time, with consistent latency
What it means. Data is fresh enough to support automated decisions, and its freshness is predictable. Five seconds is fine if it is always five seconds. Five seconds most of the time and four hours occasionally is much worse than a steady minute every time.
Why it matters. AI systems cannot reason safely about data whose freshness varies unpredictably. They cannot decide "should I act on this number?" if the number is sometimes current and sometimes hours stale. Predictability matters more than raw speed.
Anti-pattern. "It usually takes about five minutes but during heavy load it can be longer, and sometimes the pipeline fails and we backfill the next day." This is fine for a dashboard. It is not fine for an AI workflow that has to decide whether to authorise a payment.
What to do instead. Use streaming or change-data-capture patterns where freshness matters. Set explicit freshness SLOs (service-level objectives) per dataset. Monitor freshness in production. Treat freshness violations as incidents, not as background drift.
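A freshness SLO check can be as simple as the sketch below: each dataset declares an explicit budget, and any dataset whose newest record is older than its budget is surfaced as a violation to feed into incident tooling. The dataset names and budgets are hypothetical:

```python
# Hypothetical per-dataset freshness SLOs, in seconds.
FRESHNESS_SLO_SECONDS = {
    "payments_stream": 5,     # near-real-time action data
    "customer_profile": 60,
}

def freshness_violations(latest_event_times: dict, now: float) -> list:
    """Return (dataset, age_seconds, slo_seconds) for every dataset
    whose newest record is older than its SLO allows.

    latest_event_times maps dataset name -> epoch seconds of its
    most recent record, as reported by pipeline instrumentation."""
    violations = []
    for dataset, slo in FRESHNESS_SLO_SECONDS.items():
        age = now - latest_event_times[dataset]
        if age > slo:
            violations.append((dataset, age, slo))
    return violations
```

Run on a schedule and wired into paging, this turns "the pipeline fails and we backfill the next day" into an incident the team sees within one SLO window.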
5. Lineage and quality observable in production
What it means. When a model produces a strange output, the team can trace the input data back to its origin in real time, see who or what touched it, and understand whether it has degraded since the model was trained. Lineage is not a one-time documentation exercise — it is a queryable production capability.
Why it matters. Two reasons. First, debugging: when something goes wrong, you need to be able to walk the chain of dependencies quickly, before the regulator or the customer notices. Second, regulatory defensibility: in financial services, you increasingly need to be able to reconstruct any individual automated decision from the underlying data on demand.
Anti-pattern. "We have a data lineage diagram in Confluence." That is documentation, not observability. The diagram will be out of date within a week of a single schema change.
What to do instead. Use a lineage tool that is integrated with your pipelines and your observability stack — DataHub, OpenLineage, Marquez, or a vendor equivalent. Make it queryable in real time. Make it part of how the team responds to incidents, not just part of how they pass audits.
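The "queryable in real time" requirement boils down to being able to walk the dependency graph programmatically. The sketch below uses a hard-coded graph with hypothetical dataset names purely to show the shape of the query; in practice the graph would come from a tool such as DataHub or an OpenLineage-compatible backend, not from a dict:

```python
# Hypothetical lineage graph: each dataset maps to its direct
# upstream sources. Source systems have no upstreams.
LINEAGE = {
    "risk_score_features": ["customer_profile", "payments_stream"],
    "customer_profile": ["crm_app"],
    "payments_stream": ["payments_app"],
    "crm_app": [],
    "payments_app": [],
}

def trace_to_origin(dataset: str) -> list:
    """Walk the chain of upstream dependencies and return every
    dataset on the path back to the systems of origin."""
    seen, stack, order = set(), [dataset], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(LINEAGE.get(node, []))
    return order
```

This is the query an on-call engineer runs when a model output looks wrong — and the one a Confluence diagram can never answer.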
Putting the diagnostic to work
Pick one dataset in your organisation that you think is "good." Walk it through the five characteristics:
- Is it captured at the point of action, or reconstructed afterwards?
- Is it standardised at capture, or normalised in a batch job?
- Is it structured around a workflow, or around a report?
- Is its freshness predictable enough for an automated system to reason about?
- Can lineage be queried in production?
If the answer to any of these is no, that dataset is reporting data, not action data. That doesn't mean it's bad — it means it isn't ready for AI without rework.
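The diagnostic above can be sketched as a function: a dataset is action-ready only if all five characteristics hold, and a single failure makes it reporting-only. The check names below are a hypothetical encoding of the five questions:

```python
# The five characteristics, as boolean checks a dataset either passes or fails.
CHECKS = [
    "captured_at_point_of_action",
    "standardised_at_capture",
    "structured_around_workflow",
    "freshness_predictable",
    "lineage_queryable",
]

def classify(dataset: dict) -> str:
    """Classify a dataset from its check results; any missing or
    failed check makes it reporting-only."""
    failed = [check for check in CHECKS if not dataset.get(check, False)]
    if not failed:
        return "action-ready"
    return "reporting-only (failed: " + ", ".join(failed) + ")"
```

Recording the failed checks, not just the verdict, gives you the rework list for that dataset.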
In Module 3 we'll cover how to design schemas around workflows in practice, with worked examples in KYC, fraud, and regulatory reporting.