No sections found
Adjust your search query.
The Modern Analytics Engineering Guide
Forget messy stored procedures. The era of dumping raw tables into BI tools is over. Welcome to Analytics Engineering. Master the basics of dbt, progress into Data Contracts and Unit Testing, and culminate by building AI-Native RAG pipelines directly inside your Data Warehouse.
1. What is dbt? (The Basics)
dbt (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouse simply by writing `SELECT` statements. It handles turning those SELECT statements into tables and views automatically.
The Old Way (ETL)
Extract, Transform, Load. You extract data from an API, transform it in a brittle Python script running on an external server, and load the clean data into a database. If the Python server crashes, your data is missing.
The dbt Way (ELT)
Extract, Load, Transform. You use tools like Fivetran or Airbyte to dump raw, messy data directly into a powerful cloud warehouse (Snowflake, BigQuery). Then, you use dbt to transform the data in place using native SQL.
The Core Concept: Everything is a SELECT
In dbt, you don't write `CREATE TABLE x AS...`. You just write the logic. When you execute the command `dbt run`, dbt wraps your SELECT statement in the correct Data Definition Language (DDL) for your specific database.
-- All you write in dbt is this: SELECT user_id, LOWER(email) AS email_address, status FROM raw_database.stripe.users WHERE status = 'active' --------------------------------------------------------- -- When you type `dbt run`, dbt compiles and executes this against Snowflake: -- CREATE OR REPLACE VIEW analytics.prod.my_first_model AS ( -- SELECT user_id, LOWER(email) AS email_address, status -- FROM raw_database.stripe.users WHERE status = 'active' -- );
2. Installation & Project Architecture
dbt is a Python application. In the past, engineers struggled with messy Python virtual environments (`venv`). The 2026 standard is to use uv by Astral. It is written in Rust, runs 100x faster than `pip`, and handles isolated CLI tools flawlessly on Ubuntu/Linux.
# 1. Update Ubuntu and install prerequisites ❯ sudo apt update && sudo apt upgrade -y ❯ sudo apt install git curl -y # 2. Install 'uv' (The hyper-fast Rust-based Python manager) ❯ curl -LsSf https://astral.sh/uv/install.sh | sh ❯ source ~/.bashrc # 3. Install the dbt adapter globally using `uv tool` # `uv tool` creates an isolated virtual environment in milliseconds # and symlinks `dbt` to your path. (dbt-core is pulled automatically). ❯ uv tool install dbt-snowflake # Or dbt-bigquery, dbt-postgres # 4. Initialize your new project ❯ dbt init jaffle_shop
The Two Crucial Configuration Files
Every dbt project relies on two files to function. One lives inside your repository, and one lives securely on your local computer.
dbt_project.yml
Committed to Git. Tells dbt where to look for models, tests, and macros. Defines global configurations.
name: 'jaffle_shop' version: '1.0.0' profile: 'jaffle_shop' # Links to profiles.yml models: jaffle_shop: marts: materialized: table
~/.dbt/profiles.yml
Never commit to Git. Lives in your home directory. Holds your secure database passwords and roles.
jaffle_shop: target: dev outputs: dev: type: snowflake account: xy12345 user: alice password: secure123 database: raw_data schema: dbt_alice
3. Building Models (The Medallion Architecture)
A dbt project is a Directed Acyclic Graph (DAG). To ensure code reuse and scalability, industry standard dictates splitting your models into three layers: Staging (Bronze), Intermediate (Silver), and Marts (Gold).
Step 1: Declaring Sources
You should never hardcode raw table names (like `raw_db.stripe.customers`) into your SQL. Instead, define them in a `sources.yml` file. This allows dbt to track data freshness and makes it easy to point your project to a new database if you migrate.
version: 2 sources: - name: stripe database: raw_data schema: stripe_ingest tables: - name: raw_customers - name: raw_payments
Step 2: Staging Models (Bronze)
Staging models have a 1-to-1 relationship with source tables. Their *only* job is renaming columns to standard conventions, casting data types, and doing light cleaning. You use the {{ source() }} macro to refer to the YAML definition.
SELECT id AS customer_id, LOWER(email) AS email_address, CAST(created_date AS TIMESTAMP) AS created_at, is_active FROM {{ source('stripe', 'raw_customers') }}
Step 3: Marts (Gold) and the `ref` Function
Marts combine staging models together to answer business questions. This is where the magic happens. By using {{ ref('model_name') }}, dbt infers the execution order. It knows it must build `stg_stripe__customers` *before* it can build `dim_customers`.
WITH customers AS ( SELECT * FROM {{ ref('stg_stripe__customers') }} ), payments AS ( SELECT * FROM {{ ref('stg_stripe__payments') }} ) SELECT c.customer_id, c.email_address, SUM(p.amount) AS lifetime_value, MIN(p.payment_date) AS first_order_date FROM customers c LEFT JOIN payments p ON c.customer_id = p.customer_id GROUP BY 1, 2
4. Macros & Jinja Magic (DRY Code)
Standard SQL is rigid. You can't write a for loop in SQL. But because dbt files are compiled using the Jinja templating language, you can inject Python-like logic into your SQL to make it dynamic and DRY (Don't Repeat Yourself).
Creating a Macro
A Macro is a reusable snippet of SQL. If Stripe provides payments in cents, but Shopify provides them in dollars, you don't want to write `amount / 100` everywhere. Create a macro.
{% macro cents_to_dollars(column_name, scale=2) %} ROUND( CAST({{ column_name }} AS NUMERIC) / 100, {{ scale }} ) {% endmacro %}
Now, you can use it in any model across your entire project:
SELECT payment_id, customer_id, -- Compiles to: ROUND( CAST(amount_in_cents AS NUMERIC) / 100, 2 ) AS amount_usd {{ cents_to_dollars('amount_in_cents') }} AS amount_usd FROM {{ source('stripe', 'raw_payments') }}
5. Testing & Data Quality
Code without tests is legacy code. dbt allows you to write assertions about your data. If a test fails, `dbt test` throws an error, alerting your team before the CEO sees a broken dashboard.
Generic Tests (YAML)
dbt ships with 4 built-in generic tests: `unique`, `not_null`, `accepted_values`, and `relationships`. You apply them in your schema.yml files.
models: - name: dim_customers columns: - name: customer_id tests: - unique - not_null - name: status tests: - accepted_values: values: ['active', 'churned', 'pending']
Singular Tests (Custom SQL)
If you have complex business logic (e.g., "An order cannot be refunded before it was created"), write a singular test. A singular test is a `.sql` file in your `tests/` directory. If the query returns 0 rows, the test passes. If it returns any rows, the test fails.
-- This query selects bad data. If 0 rows return, we are good. SELECT order_id, created_at, refunded_at FROM {{ ref('fct_orders') }} WHERE refunded_at < created_at
Standard tests check the data after it hits the warehouse. Unit Tests allow you to mock input rows in YAML and assert the expected output without querying real data. This allows you to test complex Regex logic instantaneously during development.
6. Advanced Materializations
By default, dbt builds everything as a view. Views are great for staging because they take zero storage space, but they execute the query every time someone looks at the dashboard. For performance, you alter the Materialization.
-
table
Drops the old table and completely rebuilds it. Fast to query, but slow to build. Perfect for gold `marts` under 10 million rows.
-
ephemeral
Does not exist in the database at all! dbt interpolates the SQL directly into downstream models as a CTE (Common Table Expression).
The Incremental Deep-Dive
If you have an event log with 10 billion rows, rebuilding a `table` every day will cost a fortune. You need materialized='incremental'. This tells dbt to only process the new rows that arrived since the last run.
{{ config( materialized='incremental', unique_key='event_id' ) }} SELECT * FROM {{ ref('stg_tracking__events') }} {% if is_incremental() %} -- This block ONLY runs if the table already exists. -- `this` refers to the target table currently in the database. WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }}) {% endif %}
dbt defaults to a merge strategy. If you have a massive table and your `unique_key` is not clustered correctly, the database will perform a full table scan just to update 10 rows. This ruins the cost-saving benefits. Always partition on date, and consider the incremental_strategy: insert_overwrite for Big Data.
7. Data Contracts (Guarantees for Downstream APIs)
In the past, an Analytics Engineer might rename a column from `user_id` to `customer_id`. The dbt run would succeed, but the downstream Machine Learning pipeline consuming that table would instantly crash. Data Contracts solve this.
By setting enforced: true, dbt performs a pre-flight check. It compares the SQL output to the YAML definition. If they don't match exactly, dbt halts the build *before* touching the database.
models: - name: dim_customers config: contract: enforced: true columns: - name: customer_id data_type: int # If the SQL returns VARCHAR, the run fails. - name: email data_type: varchar constraints: - type: not_null # Enforced natively at the DDL level (e.g. Snowflake NOT NULL)
8. The dbt Semantic Layer (MetricFlow)
If your Marketing team defines "Active User" in Tableau, and Finance defines it differently in PowerBI, your company has a trust problem. The Semantic Layer decouples metric definitions from the BI tool. You define business metrics centrally in dbt YAML. Tools then query the dbt API, guaranteeing identical numbers everywhere.
semantic_models: - name: transactions model: ref('fct_transactions') entities: - name: customer type: foreign expr: customer_id dimensions: - name: transaction_date type: time type_params: time_granularity: day measures: - name: total_revenue agg: sum expr: amount_usd metrics: - name: monthly_revenue description: "Total gross revenue aggregated monthly." type: simple type_params: measure: total_revenue
9. Agentic AI Integration & RAG
The FrontierThe biggest mistake data teams make today is exposing raw database tables directly to LLMs (Text-to-SQL). LLMs hallucinate table relationships, ignore nuances like timezone conversions, and write hideously unoptimized JOINs.
Instead, provide your AI Agents with the dbt Semantic Layer API. If the user asks the Chatbot "What was revenue last month in Germany?", the LLM does not write SQL. It simply calls the dbt API: { metric: "monthly_revenue", dimensions: ["country: Germany"] }. dbt dynamically compiles the perfect, flawless SQL on the fly.
Building RAG (Vector Embeddings) inside dbt
If you are building Retrieval-Augmented Generation (RAG) applications, you need Vector Embeddings. You don't need external Python scripts for this anymore. Modern cloud warehouses support embedding functions natively, allowing you to build Vector DBs directly inside your dbt DAG.
{{ config( materialized='incremental', unique_key='ticket_id' ) }} WITH cleaned_tickets AS ( SELECT ticket_id, -- Use dbt macros to strip PII before sending to the embedding model {{ mask_pii('resolution_text') }} AS safe_text FROM {{ ref('stg_zendesk__tickets') }} WHERE status = 'solved' ) SELECT ticket_id, safe_text, -- Calling Snowflake Cortex (or BigQuery ML) natively inside dbt! SNOWFLAKE.CORTEX.EMBED_TEXT_768( 'snowflake-arctic-embed-m', safe_text ) AS vector_embedding FROM cleaned_tickets {% if is_incremental() %} -- Only generate expensive embeddings for new tickets WHERE ticket_id NOT IN (SELECT ticket_id FROM {{ this }}) {% endif %}
10. CI/CD & Data FinOps
The cloud is not free. A poorly written dbt `JOIN` running every hour can cost thousands of dollars a month. You must implement Query Tagging to trace warehouse compute credits directly back to the specific dbt model and Analytics Engineer who wrote it.
name: 'ai_native_analytics' # This injects metadata as comments into every SQL statement dbt sends to the warehouse. # You can parse the warehouse query_history view to build a cost dashboard! query-comment: comment: '{"dbt_model": "{{ node.name }}", "team": "{{ node.config.meta.owner }}"}' append: true
When you open a Pull Request, do not run dbt build on the entire project. This wastes massive compute testing models you didn't touch. Instead, use dbt's state features: dbt build --select state:modified+ --defer --state ./prod-run-artifacts. This tells dbt to only build the models you changed, and resolve the upstream dependencies against production data without rebuilding them.