The Next-Gen Syllabus // 2026 Edition

The Modern Analytics Engineering Guide

Forget messy stored procedures. The era of dumping raw tables into BI tools is over. Welcome to Analytics Engineering. Master the basics of dbt, progress into Data Contracts and Unit Testing, and culminate by building AI-Native RAG pipelines directly inside your Data Warehouse.

1. What is dbt? (The Basics)

dbt (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouse simply by writing `SELECT` statements. It handles turning those SELECT statements into tables and views automatically.

The Old Way (ETL)

Extract, Transform, Load. You extract data from an API, transform it in a brittle Python script running on an external server, and load the clean data into a database. If the Python server crashes, your data is missing.

Requires managing separate compute resources and complex Python dependency chains.

The dbt Way (ELT)

Extract, Load, Transform. You use tools like Fivetran or Airbyte to dump raw, messy data directly into a powerful cloud warehouse (Snowflake, BigQuery). Then, you use dbt to transform the data in place using native SQL.

Leverages the massive, auto-scaling compute of the modern data warehouse.

The Core Concept: Everything is a SELECT

In dbt, you don't write `CREATE TABLE x AS...`. You just write the logic. When you execute the command `dbt run`, dbt wraps your SELECT statement in the correct Data Definition Language (DDL) for your specific database.

models/my_first_model.sql
-- All you write in dbt is this:
SELECT
    user_id,
    LOWER(email) AS email_address,
    status
FROM raw_database.stripe.users
WHERE status = 'active'

---------------------------------------------------------
-- When you type `dbt run`, dbt compiles and executes this against Snowflake:
-- CREATE OR REPLACE VIEW analytics.prod.my_first_model AS (
--     SELECT user_id, LOWER(email) AS email_address, status
--     FROM raw_database.stripe.users WHERE status = 'active'
-- );

2. Installation & Project Architecture

dbt is a Python application. In the past, engineers struggled with messy Python virtual environments (`venv`). The 2026 standard is to use uv by Astral. It is written in Rust, runs 100x faster than `pip`, and handles isolated CLI tools flawlessly on Ubuntu/Linux.

ubuntu_terminal
# 1. Update Ubuntu and install prerequisites
 sudo apt update && sudo apt upgrade -y
 sudo apt install git curl -y

# 2. Install 'uv' (The hyper-fast Rust-based Python manager)
 curl -LsSf https://astral.sh/uv/install.sh | sh
 source ~/.bashrc

# 3. Install the dbt adapter globally using `uv tool`
# `uv tool` creates an isolated virtual environment in milliseconds 
# and symlinks `dbt` to your path. (dbt-core is pulled automatically).
 uv tool install dbt-snowflake # Or dbt-bigquery, dbt-postgres

# 4. Initialize your new project
 dbt init jaffle_shop

The Two Crucial Configuration Files

Every dbt project relies on two files to function. One lives inside your repository, and one lives securely on your local computer.

dbt_project.yml

Committed to Git. Tells dbt where to look for models, tests, and macros. Defines global configurations.

name: 'jaffle_shop'
version: '1.0.0'
profile: 'jaffle_shop' # Links to profiles.yml

models:
  jaffle_shop:
    marts:
      materialized: table

~/.dbt/profiles.yml

Never commit to Git. Lives in your home directory. Holds your secure database passwords and roles.

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345
      user: alice
      password: secure123
      database: raw_data
      schema: dbt_alice

3. Building Models (The Medallion Architecture)

A dbt project is a Directed Acyclic Graph (DAG). To ensure code reuse and scalability, industry standard dictates splitting your models into three layers: Staging (Bronze), Intermediate (Silver), and Marts (Gold).

Step 1: Declaring Sources

You should never hardcode raw table names (like `raw_db.stripe.customers`) into your SQL. Instead, define them in a `sources.yml` file. This allows dbt to track data freshness and makes it easy to point your project to a new database if you migrate.

models/staging/stripe/sources.yml
version: 2

sources:
  - name: stripe
    database: raw_data
    schema: stripe_ingest
    tables:
      - name: raw_customers
      - name: raw_payments

Step 2: Staging Models (Bronze)

Staging models have a 1-to-1 relationship with source tables. Their *only* job is renaming columns to standard conventions, casting data types, and doing light cleaning. You use the {{ source() }} macro to refer to the YAML definition.

models/staging/stripe/stg_stripe__customers.sql
SELECT
    id AS customer_id,
    LOWER(email) AS email_address,
    CAST(created_date AS TIMESTAMP) AS created_at,
    is_active
FROM {{ source('stripe', 'raw_customers') }}

Step 3: Marts (Gold) and the `ref` Function

Marts combine staging models together to answer business questions. This is where the magic happens. By using {{ ref('model_name') }}, dbt infers the execution order. It knows it must build `stg_stripe__customers` *before* it can build `dim_customers`.

models/marts/dim_customers.sql
WITH customers AS (
    SELECT * FROM {{ ref('stg_stripe__customers') }}
),

payments AS (
    SELECT * FROM {{ ref('stg_stripe__payments') }}
)

SELECT
    c.customer_id,
    c.email_address,
    SUM(p.amount) AS lifetime_value,
    MIN(p.payment_date) AS first_order_date
FROM customers c
LEFT JOIN payments p ON c.customer_id = p.customer_id
GROUP BY 1, 2

4. Macros & Jinja Magic (DRY Code)

Standard SQL is rigid. You can't write a for loop in SQL. But because dbt files are compiled using the Jinja templating language, you can inject Python-like logic into your SQL to make it dynamic and DRY (Don't Repeat Yourself).

Creating a Macro

A Macro is a reusable snippet of SQL. If Stripe provides payments in cents, but Shopify provides them in dollars, you don't want to write `amount / 100` everywhere. Create a macro.

macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, scale=2) %}
    ROUND( CAST({{ column_name }} AS NUMERIC) / 100, {{ scale }} )
{% endmacro %}

Now, you can use it in any model across your entire project:

models/staging/stripe/stg_stripe__payments.sql
SELECT
    payment_id,
    customer_id,
    -- Compiles to: ROUND( CAST(amount_in_cents AS NUMERIC) / 100, 2 ) AS amount_usd
    {{ cents_to_dollars('amount_in_cents') }} AS amount_usd
FROM {{ source('stripe', 'raw_payments') }}

5. Testing & Data Quality

Code without tests is legacy code. dbt allows you to write assertions about your data. If a test fails, `dbt test` throws an error, alerting your team before the CEO sees a broken dashboard.

Generic Tests (YAML)

dbt ships with 4 built-in generic tests: `unique`, `not_null`, `accepted_values`, and `relationships`. You apply them in your schema.yml files.

models/marts/schema.yml
models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'pending']

Singular Tests (Custom SQL)

If you have complex business logic (e.g., "An order cannot be refunded before it was created"), write a singular test. A singular test is a `.sql` file in your `tests/` directory. If the query returns 0 rows, the test passes. If it returns any rows, the test fails.

tests/assert_refund_after_creation.sql
-- This query selects bad data. If 0 rows return, we are good.
SELECT
    order_id,
    created_at,
    refunded_at
FROM {{ ref('fct_orders') }}
WHERE refunded_at < created_at
Unit Testing (New in dbt 1.8+)

Standard tests check the data after it hits the warehouse. Unit Tests allow you to mock input rows in YAML and assert the expected output without querying real data. This allows you to test complex Regex logic instantaneously during development.

6. Advanced Materializations

By default, dbt builds everything as a view. Views are great for staging because they take zero storage space, but they execute the query every time someone looks at the dashboard. For performance, you alter the Materialization.

  • table

    Drops the old table and completely rebuilds it. Fast to query, but slow to build. Perfect for gold `marts` under 10 million rows.

  • ephemeral

    Does not exist in the database at all! dbt interpolates the SQL directly into downstream models as a CTE (Common Table Expression).

The Incremental Deep-Dive

If you have an event log with 10 billion rows, rebuilding a `table` every day will cost a fortune. You need materialized='incremental'. This tells dbt to only process the new rows that arrived since the last run.

models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM {{ ref('stg_tracking__events') }}

{% if is_incremental() %}
    -- This block ONLY runs if the table already exists.
    -- `this` refers to the target table currently in the database.
    WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
Gotcha: The Incremental 'Merge' Trap

dbt defaults to a merge strategy. If you have a massive table and your `unique_key` is not clustered correctly, the database will perform a full table scan just to update 10 rows. This ruins the cost-saving benefits. Always partition on date, and consider the incremental_strategy: insert_overwrite for Big Data.

7. Data Contracts (Guarantees for Downstream APIs)

In the past, an Analytics Engineer might rename a column from `user_id` to `customer_id`. The dbt run would succeed, but the downstream Machine Learning pipeline consuming that table would instantly crash. Data Contracts solve this.

By setting enforced: true, dbt performs a pre-flight check. It compares the SQL output to the YAML definition. If they don't match exactly, dbt halts the build *before* touching the database.

models/marts/schema.yml
models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int  # If the SQL returns VARCHAR, the run fails.
      - name: email
        data_type: varchar
        constraints:
          - type: not_null  # Enforced natively at the DDL level (e.g. Snowflake NOT NULL)

8. The dbt Semantic Layer (MetricFlow)

If your Marketing team defines "Active User" in Tableau, and Finance defines it differently in PowerBI, your company has a trust problem. The Semantic Layer decouples metric definitions from the BI tool. You define business metrics centrally in dbt YAML. Tools then query the dbt API, guaranteeing identical numbers everywhere.

models/metrics/revenue.yml
semantic_models:
  - name: transactions
    model: ref('fct_transactions')
    entities:
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: transaction_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: total_revenue
        agg: sum
        expr: amount_usd

metrics:
  - name: monthly_revenue
    description: "Total gross revenue aggregated monthly."
    type: simple
    type_params:
      measure: total_revenue

9. Agentic AI Integration & RAG

The Frontier

The biggest mistake data teams make today is exposing raw database tables directly to LLMs (Text-to-SQL). LLMs hallucinate table relationships, ignore nuances like timezone conversions, and write hideously unoptimized JOINs.

Never let an AI write raw SQL against your warehouse.

Instead, provide your AI Agents with the dbt Semantic Layer API. If the user asks the Chatbot "What was revenue last month in Germany?", the LLM does not write SQL. It simply calls the dbt API: { metric: "monthly_revenue", dimensions: ["country: Germany"] }. dbt dynamically compiles the perfect, flawless SQL on the fly.

Building RAG (Vector Embeddings) inside dbt

If you are building Retrieval-Augmented Generation (RAG) applications, you need Vector Embeddings. You don't need external Python scripts for this anymore. Modern cloud warehouses support embedding functions natively, allowing you to build Vector DBs directly inside your dbt DAG.

models/marts/support/dim_support_vectors.sql
{{ config(
    materialized='incremental',
    unique_key='ticket_id'
) }}

WITH cleaned_tickets AS (
    SELECT 
        ticket_id,
        -- Use dbt macros to strip PII before sending to the embedding model
        {{ mask_pii('resolution_text') }} AS safe_text
    FROM {{ ref('stg_zendesk__tickets') }}
    WHERE status = 'solved'
)

SELECT
    ticket_id,
    safe_text,
    -- Calling Snowflake Cortex (or BigQuery ML) natively inside dbt!
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m', 
        safe_text
    ) AS vector_embedding
FROM cleaned_tickets

{% if is_incremental() %}
    -- Only generate expensive embeddings for new tickets
    WHERE ticket_id NOT IN (SELECT ticket_id FROM {{ this }})
{% endif %}

10. CI/CD & Data FinOps

The cloud is not free. A poorly written dbt `JOIN` running every hour can cost thousands of dollars a month. You must implement Query Tagging to trace warehouse compute credits directly back to the specific dbt model and Analytics Engineer who wrote it.

dbt_project.yml
name: 'ai_native_analytics'

# This injects metadata as comments into every SQL statement dbt sends to the warehouse.
# You can parse the warehouse query_history view to build a cost dashboard!
query-comment:
  comment: '{"dbt_model": "{{ node.name }}", "team": "{{ node.config.meta.owner }}"}'
  append: true
Slim CI (State:Modified)

When you open a Pull Request, do not run dbt build on the entire project. This wastes massive compute testing models you didn't touch. Instead, use dbt's state features: dbt build --select state:modified+ --defer --state ./prod-run-artifacts. This tells dbt to only build the models you changed, and resolve the upstream dependencies against production data without rebuilding them.