# AI Playbook: 09 - Data Quality Validation
This playbook describes the process of transforming the data quality rules, defined in the object specification documents, into a suite of automated tests. It ensures that the data populating the CDF data model is continuously monitored for accuracy, completeness, and integrity.
**Goal:** To establish an automated data quality framework where business rules, defined in Markdown, are systematically converted into executable tests.
## 1. The Data Quality Lifecycle
The data quality lifecycle follows our core "docs-as-code" philosophy:
- **Define in Markdown:** Business-level data quality rules are defined in a table within each object's `_specification.md` file.
- **Translate to Tests:** An AI agent or helper script reads these rules and translates them into executable tests using a testing framework such as dbt, pytest, or Great Expectations.
- **Execute via CI/CD:** These tests are executed automatically as part of the CI/CD pipeline, typically after data has been transformed.
- **Alert on Failures:** Any test failures trigger alerts, notifying the data stewards or operations team to investigate the issue.
## 2. Defining Rules in the Specification
Data quality rules are defined directly in the `.../*_specification.md` files. This keeps the rules close to the data they are validating.
### Example Rule Definition Table (from `well_specification.md`)
| Rule ID | Description | Severity | Implementation Example |
|---|---|---|---|
| DQ-001 | `externalId` must be unique. | Critical | dbt test `unique` |
| DQ-002 | `spudDate` cannot be in the future. | High | dbt test `"where spudDate <= current_date"` |
| DQ-003 | `status` must be one of `[OPEN, SHUT_IN]`. | Medium | dbt test `accepted_values` |
| DQ-004 | `wellheadElevation` must be > -500. | Medium | dbt test `"where wellheadElevation > -500"` |
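The implementation examples above use dbt, but the same rules can be expressed in the other frameworks mentioned in section 1. As a hedged illustration, DQ-002 written as a pytest test might look like the sketch below; the `wells` fixture and its inline data are hypothetical stand-ins for a real query against the transformed Well view:

```python
from datetime import date

import pandas as pd
import pytest


@pytest.fixture
def wells() -> pd.DataFrame:
    # Hypothetical stand-in: a real fixture would query the transformed
    # Well view (e.g., via the Cognite SDK or a warehouse client).
    return pd.DataFrame(
        {
            "externalId": ["well-001", "well-002"],
            "spudDate": [date(2020, 5, 1), date(2021, 9, 15)],
        }
    )


def test_dq_002_spud_date_not_in_future(wells: pd.DataFrame) -> None:
    """DQ-002: spudDate cannot be in the future."""
    future_rows = wells[wells["spudDate"] > date.today()]
    assert future_rows.empty, f"{len(future_rows)} wells have a future spudDate"
```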
## 3. Translating Rules to Tests
This is the core technical step of the playbook.
- **Action:** A dedicated script (`tools/generate_dq_tests.py`) is executed as part of the CI/CD pipeline; a minimal sketch of such a script follows this list.
- **Process:**
  - The script recursively finds all `_specification.md` files in the project.
  - For each file, it parses the "Data Quality & Validation Rules" table.
  - It then maps each rule to a test template for the chosen testing framework (e.g., dbt).
  - It generates the final test files (e.g., `schema.yml` for dbt) and places them in the appropriate directory for the testing tool.
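A minimal sketch of what this script could look like, assuming PyYAML is available in the pipeline environment; the table-parsing heuristics, the model-naming convention, and the `dbt/models` output path are illustrative assumptions, and the rule-to-test mapping is left as a placeholder:

```python
"""Hedged sketch of tools/generate_dq_tests.py (not the actual implementation)."""
from pathlib import Path

import yaml  # PyYAML, assumed available in the pipeline environment

COLUMNS = ["rule_id", "description", "severity", "implementation"]


def parse_rule_table(text: str) -> list[dict[str, str]]:
    """Extract rows from the 'Data Quality & Validation Rules' table."""
    rows: list[dict[str, str]] = []
    capture = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("| Rule ID"):
            capture = True  # header found; the divider and data rows follow
            continue
        if capture:
            if not stripped.startswith("|"):
                break  # end of table
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if len(cells) == len(COLUMNS) and not cells[0].startswith("-"):
                rows.append(dict(zip(COLUMNS, cells)))
    return rows


def main() -> None:
    for spec in Path(".").rglob("*_specification.md"):
        rules = parse_rule_table(spec.read_text(encoding="utf-8"))
        if not rules:
            continue
        # Placeholder mapping: the real script would translate each rule's
        # implementation hint into the matching dbt test entry.
        model_name = spec.stem.replace("_specification", "") + "_view"
        schema = {"version": 2, "models": [{"name": model_name, "columns": []}]}
        out = Path("dbt/models") / f"{model_name}_schema.yml"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(yaml.safe_dump(schema, sort_keys=False), encoding="utf-8")


if __name__ == "__main__":
    main()
```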
### Example: dbt Test Generation
Given the rule table above, the `generate_dq_tests.py` script would produce a `schema.yml` file like this:
```yaml
version: 2

models:
  - name: well_view # The name of the view for the Well object
    columns:
      - name: externalId
        tests:
          - unique
          - not_null
      - name: spudDate
        tests:
          # dbt_utils prepends the column name to a column-scoped
          # expression, so the column itself is not repeated here.
          - dbt_utils.expression_is_true:
              expression: "<= current_date"
      - name: status
        tests:
          - accepted_values:
              values: ['OPEN', 'SHUT_IN']
      - name: wellheadElevation
        tests:
          - dbt_utils.expression_is_true:
              expression: "> -500"
```
## 4. Execution and Alerting
- **Execution:** The data quality tests are run as a dedicated step in the main CI/CD pipeline, immediately after the `cognite-toolkit apply` step for transformations has completed.
- **Command Example (dbt):** `dbt build`
- **Alerting:** If the `dbt build` command fails, it exits with a non-zero code, failing the pipeline step. This failure should be configured to send an alert to the appropriate team, including a link to the pipeline logs, which contain the detailed test results. A hedged sketch of such an alert hook follows this list.
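Most CI systems provide notification integrations out of the box. Purely as a sketch, the alert could also be sent by a small script like the one below, where the Slack-style `SLACK_WEBHOOK_URL` environment variable and the message format are assumptions, not a prescribed integration:

```python
"""Illustrative failure-alert hook; the webhook variable and message
format are assumptions for this sketch, not a prescribed integration."""
import json
import os
import sys
import urllib.request


def notify_failure(pipeline_url: str) -> None:
    """Post a short failure message with a link to the pipeline logs."""
    payload = {"text": f"Data quality tests failed. Logs: {pipeline_url}"}
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    # Invoked by the CI system only when the dbt build step fails.
    notify_failure(pipeline_url=sys.argv[1])
```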
**Note:** This playbook promotes a powerful synergy between documentation and testing: because the rules are defined in the same place as the object's design, the data quality requirements are never out of sync with the implementation.