data-dict.yaml
data-dict.yaml is a data dictionary specification that describes a collection of related tables: their contents, constraints, connections, and the specialised vocabulary you need to understand them. It is designed to be a living document, co-written by humans and agents, that tracks your understanding of a dataset as it evolves.
data-dict.yaml is designed to be lightweight. It doesn’t attempt to precisely describe every possible type of metadata in a machine readable way. Instead it focuses on precisely recording the most important components, leaving the remainder to plain text fields that require a human or agent to interpret. This means that data-dict.yaml doesn’t itself do data cleaning, but it is a useful complement to tools that do.
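To make the split between structured fields and free text concrete, here is a minimal Python sketch. It is not taken from the specification: the keys (tables, columns, type, nullable, notes), the visits table, and the check_rows helper are all hypothetical, chosen only to show how a tool might check the precisely recorded parts while leaving free-form notes for a human or agent to read.

```python
# Hypothetical illustration only: these keys are NOT taken from the
# data-dict.yaml specification. The dict mirrors what a parsed YAML file
# might look like.
dictionary = {
    "tables": {
        "visits": {
            "description": "One row per clinic visit.",
            "columns": {
                "visit_id": {"type": "integer", "nullable": False},
                "score": {
                    "type": "number",
                    "nullable": True,
                    # Free-form quirk, left for a human or agent to interpret.
                    "notes": "Values before 2019 used a 0-10 scale; later ones 0-100.",
                },
            },
        }
    }
}

# Mapping from the (assumed) type names to Python types.
PYTHON_TYPES = {"integer": int, "number": (int, float), "string": str}


def check_rows(table_spec, rows):
    """Return a list of problems, checking only the structured fields."""
    problems = []
    for i, row in enumerate(rows):
        for name, col in table_spec["columns"].items():
            value = row.get(name)
            if value is None:
                if not col["nullable"]:
                    problems.append(f"row {i}: {name} is missing")
            # bool is a subclass of int in Python, so reject it explicitly.
            elif isinstance(value, bool) or not isinstance(
                value, PYTHON_TYPES[col["type"]]
            ):
                problems.append(f"row {i}: {name} should be {col['type']}")
    return problems


rows = [
    {"visit_id": 1, "score": 7.5},
    {"visit_id": None, "score": "high"},
]
problems = check_rows(dictionary["tables"]["visits"], rows)
for problem in problems:
    print(problem)
```

The notes field is deliberately ignored by check_rows: in this sketch, only type and nullability are machine-checked, while the scale change is something a reader has to interpret.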
You can read the details of the spec in the specification, or dive in by looking at a few examples:
Why data-dict.yaml?
There have been many previous attempts to encode data dictionaries in structured text. What makes data-dict.yaml different? Why revisit this problem now?
- The costs of creating a data dictionary are lower than ever before because AI agents can automate much of the boilerplate, including porting documentation from existing unstructured formats (e.g. .doc, .html, .pdf).
- The benefits of creating a data dictionary are higher, because AI agents need the context that currently exists only in your head. As a very pleasant side-effect, this also helps your human colleagues, particularly those who are newer to your organisation.
- LLMs change what it means for something to be machine readable. While we explicitly encode the most important structures, we can leave the more unusual quirks to free-form text.
- Unlike previous data dictionary formats, we assume data is stored in Parquet files or database tables. This means that many parsing details are out of scope, radically simplifying the spec.
- The cost of describing the data semantics in multiple places (i.e. data-dict.yaml and data transformation code) is lower because an AI agent can easily keep both in sync.
Inspirations
Here are a few of the resources that guided the design of data-dict.yaml:
- Data management in large-scale education research
- Frictionless data
- Hex’s semantic modelling
- Snowflake’s semantic views
- Soda’s contract language
- dbt tests
It’s worth noting that while semantic models influenced the design of data-dict.yaml, it is not a semantic model. This means it doesn’t classify columns as dimensions or metrics, because that distinction reflects intended use, not the data itself. It’s primarily designed to support data scientists, not data analysts.
Additionally, while terminology is still evolving, the “semantic” in semantic models is typically interpreted narrowly, focussing on structural semantics (what’s needed for queries to return consistent values) rather than on what the data actually means.
Missing features
data-dict.yaml currently ships a validator (see the CLI) that checks a dictionary against the spec and against the underlying data. We plan to add more tooling in the future:
- User-facing documentation: There’s currently no way to turn your .yaml file into attractive HTML documentation of your data. If you’ve put the time into maintaining an accurate data dictionary, we want to make it easy to turn it into a beautiful website that you can share with your colleagues.
- Large tables: A standalone data-dict.yaml is not designed for hundreds of tables or hundreds of columns. We also plan to provide tools that allow you to aggregate multiple dictionaries and index larger data catalogs.
- Export: Export your data dictionary to other formats like CSV, Excel, and Google Sheets.