data-dict.yaml
This document describes version 0.1.0 of the data-dict.yaml specification.
A data dictionary has one required top-level key, version, plus three optional keys that hold the actual content:
version(required): the version of thedata-dict.yamlspec this document conforms to. Currently0.1.0.tablesis where the bulk of most data-dict.yaml files will be. It describes the tables and their columns.relationshipsdescribes the relationships between tables. It gives the details you need to safely create joins.glossaryprovides a place to define important domain-specific terms. This is a good place to write down those special words that your company loves to use.
While the spec is pre-1.0, breaking changes between versions should be expected. Once the spec stabilises at 1.0, the major version will only change on breaking changes.
Tables
tables is a named list that describes each table in the dataset. Each table represents a rectangle of data with observations in the rows and variables in the columns. Each table has the following properties:
description: a human-readable description of the table. May contain markdown, and is usually a few sentences or a paragraph. A good description answers two questions:- What’s the grain? What does a row represent? (e.g. “each row is a food item”, “each row is one patient visit”).
- What’s the population? What’s been included or filtered out to produce this dataset? (e.g. “only completed orders from 2020 onwards”, “excludes test accounts”).
details: additional information about the table. This is the place for “here be dragons”: assumptions baked into the data, known weak spots, surprising calculations, and known problems. Also covers how the data was collected or constructed. Can be any length.source(required): ways to access the data.columns(required): an ordered list of column metadata.
For example:
tables:
food:
description: >
Each row is a food item in the USDA FoodData Central database.
Includes both branded and foundation foods.
source:
parquet: inst/parquet/food.parquet
R: foodbank::food
SQL: foodbank.food
columns:
- name: fdc_id
type: number(id)
constraints: [primary_key]
description: Unique identifier for the food item.
examples: [167512, 174231, 325871, 534109, 715322]
- name: description
type: string
constraints: [required]
description: Full text description of the food.
examples: [Hummus, Egg rolls, Cheese spread, Grapes, Pickle relish]
- name: food_category_id
type: number(id)
constraints: [foreign_key]
description: Links to the food_category table.
examples: [9, 11, 14, 18, 25]
- name: data_type
type: enum
values: [foundation, branded]
description: Whether the food is a foundation or branded food.Source
source is a map whose keys name the access method and whose values give the location. For example:
source:
parquet: inst/parquet/food.parquet
R: foodbank::food
SQL: foodbank.foodThe currently supported keys are:
parquet: path to a Parquet file (may include globs).SQL: a (possibly schema-qualified) table name (e.g.foodorfoodbank.food) or a fullSELECTquery.RandPython: R or Python code that returns the data (e.g.foodbank::food, orread.csv("food.csv", comment.char = "#")).pin: the name of a Posit Connect pin.
This variety of source types reflects the variety of ways which you might retrieve a dataset. It’s good practice to upstream as much of this processing as possible so that over time you exclusively use parquet or SQL with a table.
Columns
Each entry in the columns list is a column descriptor with the following properties:
name(required): column name. Must match the column name in the underlying data.type: the column’s data type. Must match (approximately) the underlying data type (see Types).constraints: a list of column-level constraints (see Column constraints).description: a human-readable description of the column. Can use markdown.details: additional information about the column, e.g. how it was computed or edge cases to watch out for. Can be any length.units: the unit of measurement, fornumber(quantity)columns only (see Measures).
A column also carries one of values, range, or examples, which represents the data it contains. Which one is determined by its type (see Types).
Description & details
The description and details are free text fields that humans and agents can use to jot down important notes. The description should be short, typically a few sentences or at most a paragraph and will be displayed in user interfaces. The details can be any length, and is a good place to carefully record all the details of the table.
Types
Types capture data types at a level that makes sense for analysis, which is typically coarser than the logical types of the underlying data.
The supported types are:
number: numeric values (integers or floating-point). Can be qualified with a measure in parentheses:number(id),number(ordinal), ornumber(quantity). See Measures.string: UTF-8 text strings.boolean: true/false values.date: calendar dates.datetime: date-times with timezone.enum: a column with repeated values from a known set. The allowed values are listed in thevaluesproperty.
Every type has some way of representing the data it contains: an exhaustive set of values, a range, or a handful of examples. Each column therefore carries exactly one of the following three properties, and which one is determined by the column’s type:
values: the allowed values for anenumcolumn. Can be a list ([M, F, U]) when values are self-explanatory, or a map ({M: Male, F: Female, U: Unknown}) when values need labels. (booleancolumns implicitly havevalues: [true, false], no need to explicitly include it.)range: a two-element list[min, max]giving the inclusive range. Used for the ordered numeric and temporal types:number(ordinal),number(quantity),date, anddatetime.examples: a list of ~5 representative values from the column. Used for all other types:string,number, andnumber(id). A handful of concrete examples helps LLMs understand the column far better than a description alone. For instance, knowing that an id column holds[1, 2, 3, 4, 5]versus[10000, 1235452, 234234]. A good baseline is to select 5 evenly spaced values along the sorted unique values, and then add any particularly surprising values as you encounter them.
Measures
The number type can be qualified with a measure in parentheses that classifies what operations are meaningful:
| Type | Can compare | Can average | Can sum | Examples |
|---|---|---|---|---|
number(id) |
No | No | No | primary keys, foreign keys, codes |
number(ordinal) |
Yes | No | No | ranks, years, sequence numbers |
number(quantity) |
Yes | Yes | Yes | weights, counts, amounts |
A number(quantity) column can also declare its units: a free-text string naming the unit of measurement, such as kg, USD, or seconds. Units are only meaningful for quantities — they’re how you tell apart two columns that share a range but measure different things — so units is an error on any other type.
- name: mass
type: number(quantity)
units: g
range: [0, 5000]Column constraints
The constraints property is a list of constraint names. The supported constraints are:
primary_key: the set of columns with theprimary_keyconstraint uniquely identifies each row. Impliesrequiredandunique.foreign_key: the column references a primary key in another table. The specific relationship is defined inrelationships.required: the column does not contain null/missing values.unique: the column’s values are distinct (no duplicates).
Relationships
relationships is a list of join descriptors. Each entry describes how two tables are related.
cardinality(required): eitherone-to-one,one-to-many, ormany-to-one. Describes the relationship from the left table to the right table in the join expression.join(required): a join expression of the formtable1.column = table2.column, ortable1.date >= table2.start AND table1.date <= table2.end.description: human-readable description of the relationship. Only needed if it’s not clear from the context.conflicts: a list of column names that appear in both tables with different meanings. These fields would cause ambiguity in a join and may need to be renamed or dropped.
For example:
relationships:
cardinality: many-to-one
join: food.food_category_id = food_category.id
conflicts: [description]Glossary
glossary is a map from term to definition. Each entry provides a plain-language definition of a domain-specific term used in the table or column descriptions, or is likely to be used by a domain expert working with this data.
glossary:
foundation food: >
A food whose nutrient and food component values are derived
primarily by chemical analysis.