# SQL Best Practices
## Instructions
### Model Naming
Our models (typically) fit into three main categories: staging, marts, base/intermediate. For more detail about why we use this structure, check out [this discourse post](https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355). The file and naming structures are as follows:
```
├── dbt_project.yml
└── models
├── marts
| └── core
| ├── intermediate
| | ├── intermediate.yml
| | ├── customers__unioned.sql
| | ├── customers__grouped.sql
| └── core.yml
| └── core.docs
| └── dim_customers.sql
| └── fct_orders.sql
└── staging
└── stripe
├── base
| ├── base__stripe_invoices.sql
├── src_stripe.yml
├── src_stripe.docs
├── stg_stripe.yml
├── stg_stripe__customers.sql
└── stg_stripe__invoices.sql
```
- All objects should be plural, such as: `stg_stripe__invoices`
- Base tables are prefixed with `base__`, such as: `base__<source>_<object>`
- Intermediate tables should end with a past tense verb indicating the action performed on the object, such as: `customers__unioned`
- Marts are categorized between fact (immutable, verbs) and dimensions (mutable, nouns) with a prefix that indicates either, such as: `fct_orders` or `dim_customers`
### Model Configuration
- Model-specific attributes (like sort/dist keys) should be specified in the model.
- If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file.
- In-model configurations should be specified like this:
```
{{
config(
materialized = 'table',
sort = 'id',
dist = 'id'
)
}}
```
- Marts should always be configured as tables
### Dbt Conventions
- Only `stg_` models (or `base_` models if your project requires them) should select from `source`s.
- All other models should only select from other models.
### Testing
- Every subdirectory should contain a `.yml` file, in which each model in the subdirectory is tested. For staging folders, the naming structure should be `src_sourcename.yml`. For other folders, the structure should be `foldername.yml` (example `core.yml`).
- At a minimum, unique and not_null tests should be applied to the primary key of each model.
### Naming and Field Conventions
- Schema, table and column names should be in `snake_case`.
- Use names based on the *business* terminology, rather than the source terminology.
- Each model should have a primary key.
- The primary key of a model should be named `<object>_id`, e.g. `account_id` – this makes it easier to know what `id` is being referenced in downstream joined models.
- For base/staging models, fields should be ordered in categories, where identifiers are first and timestamps are at the end.
- Timestamp columns should be named `<event>_at`, e.g. `created_at`, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.g `created_at_pt`.
- Booleans should be prefixed with `is_` or `has_`.
- Price/revenue fields should be in decimal currency (e.g. `19.99` for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with suffix, e.g. `price_in_cents`.
- Avoid reserved words as column names
- Consistency is key! Use the same field names across models where possible, e.g. a key to the `customers` table should be named `customer_id` rather than `user_id`.
### CTEs
For more information about why we use so many CTEs, check out [this discourse post](https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091).
- All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file
- Where performance permits, CTEs should perform a single, logical unit of work.
- CTE names should be as verbose as needed to convey what they do
- CTEs with confusing or notable logic should be commented
- CTEs that are duplicated across models should be pulled out into their own models
- create a `final` or similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!)
- CTEs should be formatted like this:
```
with
events as (
...
),
-- CTE comments go here
filtered_events as (
...
)
select * from filtered_events
```
### SQL Style Guide
Use trailing commas
Indents should be four spaces (except for predicates, which should line up with the `where` keyword)
Lines of SQL should be no longer than 80 characters
Field names and function names should all be lowercase
The `as` keyword should be used when aliasing a field or table
Fields should be stated before aggregates / window functions
Aggregations should be executed as early as possible before joining to another table.
Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see [this rant](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/) for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design.
Prefer `union all` to `union` [*](http://docs.aws.amazon.com/redshift/latest/dg/c_example_unionall_query.html)
Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
If joining two or more tables, *always* prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
Be explicit about your join (i.e. write `inner join` instead of `join`). `left joins` are normally the most useful, `right joins` often indicate that you should change which table you select `from` and which one you `join` to.
*DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE*
#### Example SQL
```
with
my_data as (
select * from {{ ref('my_data') }}
),
some_cte as (
select * from {{ ref('some_cte') }}
),
some_cte_agg as (
select
id,
sum(field_4) as total_field_4,
max(field_5) as max_field_5
from some_cte
group by 1
),
final as (
select [distinct]
my_data.field_1,
my_data.field_2,
my_data.field_3,
-- use line breaks to visually separate calculations into blocks
case
when my_data.cancellation_date is null
and my_data.expiration_date is not null
then expiration_date
when my_data.cancellation_date is null
then my_data.start_date + 7
else my_data.cancellation_date
end as cancellation_date,
some_cte_agg.total_field_4,
some_cte_agg.max_field_5
from my_data
left join some_cte_agg
on my_data.id = some_cte_agg.id
where my_data.field_1 = 'abc'
and (
my_data.field_2 = 'def' or
my_data.field_2 = 'ghi'
)
having count(*) > 1
)
select * from final
```
- Your join should list the "left" table first (i.e. the table you are selecting `from`):
```
select
trips.*,
drivers.rating as driver_rating,
riders.rating as rider_rating
from trips
left join users as drivers
on trips.driver_id = drivers.user_id
left join users as riders
on trips.rider_id = riders.user_id
```
### YAML Style Guide
- Indents should be two spaces
- List items should be indented
- Use a new line to separate list items that are dictionaries where appropriate
- Lines of YAML should be no longer than 80 characters.
#### Example YAML
```
version: 2
models:
- name: events
columns:
- name: event_id
description: This is a unique identifier for the event
tests:
- unique
- not_null
- name: event_time
description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)"
tests:
- not_null
- name: user_id
description: The ID of the user who recorded the event
tests:
- not_null
- relationships:
to: ref('users')
field: id
```
### Jinja Style Guide
- When using Jinja delimiters, use spaces on the inside of your delimiter, like `{{ this }}` instead of `{{this}}`
- Use newlines to visually indicate logical blocks of Jinja
## Overview
🔼Topic:: [[SQL (MOC)]]
◀Origin::
🔗Link:: [Source](https://github.com/dbt-labs/corp/blob/master/dbt_style_guide.md)