Data Contracts are data agreements that are enforceable programmatically. They are an application of governance-as-code. Here's an example below!
There are 3 critical aspects of a data contract that teams can enforce preventatively and in a straightforward way:
1. Schema: Column names, data types & variables
2. Business Logic: NULL values, expected data ranges, etc
3. SLAs: Frequency of data refreshes
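To make the three aspects concrete, here is a minimal Python sketch of one check per aspect. Everything in it is an assumption made for illustration: the `CONTRACT` layout, the column names (`vehicle_id`, `speed_mph`, `reported_at`), the speed range, and the one-hour freshness window are hypothetical, not part of any particular contract spec or tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for a vehicle-status feed; all names and limits are illustrative.
CONTRACT = {
    "schema": {"vehicle_id": str, "speed_mph": float, "reported_at": datetime},
    "not_null": ["vehicle_id", "reported_at"],
    "ranges": {"speed_mph": (0.0, 80.0)},
    "max_staleness": timedelta(hours=1),
}


def check_schema(row, contract):
    """1. Schema: report missing, unexpected, or wrongly typed columns."""
    violations = []
    expected = contract["schema"]
    for column, expected_type in expected.items():
        if column not in row:
            violations.append(f"missing column: {column}")
        elif row[column] is not None and not isinstance(row[column], expected_type):
            violations.append(f"wrong type for {column}")
    for column in row:
        if column not in expected:
            violations.append(f"unexpected column: {column}")
    return violations


def check_business_logic(row, contract):
    """2. Business logic: report NULLs and out-of-range values."""
    violations = []
    for column in contract["not_null"]:
        if row.get(column) is None:
            violations.append(f"NULL in {column}")
    for column, (low, high) in contract["ranges"].items():
        value = row.get(column)
        # Only range-check numeric values; wrong types are the schema check's job.
        if isinstance(value, (int, float)) and not (low <= value <= high):
            violations.append(f"{column} out of range: {value}")
    return violations


def check_sla(latest_refresh, contract, now):
    """3. SLA: report data that has not been refreshed recently enough."""
    if now - latest_refresh > contract["max_staleness"]:
        return ["data is stale"]
    return []


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    row = {"vehicle_id": "bus-1", "speed_mph": 200.0, "reported_at": now}
    print(check_schema(row, CONTRACT) + check_business_logic(row, CONTRACT))
    print(check_sla(now - timedelta(hours=2), CONTRACT, now))
```

Each function returns a list of violations rather than raising, so the same checks can drive either an alert or a hard failure.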
Where each component of the contract is enforced depends on the complexity of the solution that has been implemented and the technology stack of the organization. What's important is that each constraint in the contract can, at a minimum, be alerted on, and ideally be enforced programmatically.
Data can be checked and measured against the contract upon arrival or, in the best-case scenario, contracts can be validated in CI/CD as integration tests, preventing data-generating code that violates a contract from ever being deployed into production.
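As a rough sketch of the CI/CD option, the test below compares the schema a producer's change would emit against the contract and fails on breaking changes. The `breaking_changes` helper, the string type names, and the hard-coded `proposed` schema are all hypothetical; in a real pipeline the proposed schema would be read from the branch under review. Treating added columns as non-breaking is also an assumption, and a stricter contract might reject them.

```python
# Hypothetical contract schema; types are plain strings for illustration only.
CONTRACT_SCHEMA = {
    "vehicle_id": "string",
    "speed_mph": "double",
    "reported_at": "timestamp",
}


def breaking_changes(contract_schema, proposed_schema):
    """List the ways a proposed schema would break the contract."""
    problems = []
    for column, contract_type in contract_schema.items():
        if column not in proposed_schema:
            problems.append(f"column removed: {column}")
        elif proposed_schema[column] != contract_type:
            problems.append(
                f"type changed: {column} {contract_type} -> {proposed_schema[column]}"
            )
    # Columns that exist only in the proposed schema are treated as additive.
    return problems


def test_producer_schema_honors_contract():
    # Assumption: in CI this dict would be derived from the code being reviewed.
    proposed = {
        "vehicle_id": "string",
        "speed_mph": "double",
        "reported_at": "timestamp",
        "route": "string",
    }
    assert breaking_changes(CONTRACT_SCHEMA, proposed) == []
```

A test runner such as pytest would collect `test_producer_schema_honors_contract` automatically, so a failing assertion blocks the merge.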
Here are a few of the essential additional parameters of a contract:
ID: Each contract should have a unique identifier so that it can be referenced by tests in the CI/CD pipeline or elsewhere
Resource Name: Contracts should be 1-to-1 with data resources. I prefer to start implementation on upstream source data sets like databases, protobuf/avro files, JSON schemas, or unstructured event schemas. Anything with a schema can be covered by a contract!
Spec Version: All contracts should be version-controlled and ultimately stored in Git. If any consumers wish to modify their contract, it can be done through a PR. This prevents many contracts from being applied to the same data resource.
Name: The semantic object the data contract describes. In the example below, we have a contract on the vehicle status of each bus in Seattle's bus fleet.
Namespace: The data domain. The combination of Name and Namespace prevents duplicate contracts from being created, and each can be categorized/stored according to their Namespace.
Documentation: In my opinion, documentation should be a required field for all data contracts. Consumers should not get the benefit of data protection without putting in the work to describe what data they expect and how it's being used. (Also, this allows semantic-level checks to be surfaced qualitatively.)
Owner: The owner is responsible for the contract. Contracts should be initially defined by data consumers who have some expectation of their data, but ultimately should be owned by data producers once the cost of managing many downstream dependencies becomes significant.
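The parameters above could be modeled roughly as follows. This is a sketch under stated assumptions, not a reference spec: the `DataContract` and `ContractRegistry` classes and their field names are hypothetical, chosen only to mirror the list. The registry shows two of the rules described: required documentation, and uniqueness of both the Name/Namespace pair and the resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical container for the contract parameters listed above."""

    id: str
    resource_name: str
    spec_version: str
    name: str
    namespace: str
    documentation: str
    owner: str

    def __post_init__(self):
        # Documentation is treated as a required, non-empty field.
        if not self.documentation.strip():
            raise ValueError("documentation is required")


class ContractRegistry:
    """Keeps contracts 1-to-1 with resources and unique per (namespace, name)."""

    def __init__(self):
        self._by_key = {}
        self._resources = set()

    def register(self, contract):
        key = (contract.namespace, contract.name)
        if key in self._by_key:
            raise ValueError(f"duplicate contract: {key}")
        if contract.resource_name in self._resources:
            raise ValueError(f"resource already covered: {contract.resource_name}")
        self._by_key[key] = contract
        self._resources.add(contract.resource_name)

    def in_namespace(self, namespace):
        """Return the contracts filed under one data domain."""
        return [c for (ns, _), c in self._by_key.items() if ns == namespace]
```

For example, registering a `VehicleStatus` contract in a `transit` namespace twice would raise a `ValueError` on the second call.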
Once the foundations are in place, contracts can be extended to include PII, compliance, and other forms of data governance. These checks should be done as far upstream (as far left) as possible, ideally informing data producers when violations are about to occur.
Data Contracts are absolutely ESSENTIAL to scale data governance and data quality to the software engineering organization.
Good luck!
CEO @ Gable.ai (Shift Left Data Platform)
DQC Meet-up registration link: https://www.eventbrite.com/e/data-quality-camp-happy-hour-tickets-569929704087