Every data engineer has been woken up at 3am by a failing pipeline. These patterns help prevent that.
Pattern 1: Idempotent Loads
The single most important property of any pipeline: running it twice produces the same result as running it once.
-- Idempotent upsert using ON CONFLICT
INSERT INTO staging.events (event_id, payload, loaded_at)
VALUES ($1, $2, NOW())
ON CONFLICT (event_id) DO UPDATE
SET payload = EXCLUDED.payload,
loaded_at = NOW();
If your pipeline isn't idempotent, it's a ticking time bomb. It's not a matter of if it will run twice; it's when.
Pattern 2: Schema-on-Read with JSONB
Store raw data as JSONB first, parse later:
-- Land raw data fast
INSERT INTO raw.api_responses (source, payload)
VALUES ('stripe', $1::JSONB);
-- Parse when ready
CREATE VIEW clean.payments AS
SELECT
payload->>'id' AS payment_id,
(payload->>'amount')::NUMERIC / 100 AS amount_dollars,
TO_TIMESTAMP((payload->>'created')::BIGINT) AS created_at -- Stripe's created field is epoch seconds, so a direct TIMESTAMPTZ cast would fail
FROM raw.api_responses
WHERE source = 'stripe';
Pattern 3: Watermark Tracking
Track what you've already processed (a SQL sketch follows the list):
- Store the last processed timestamp/ID per source
- On each run, fetch only records after the watermark
- Update watermark only after successful processing
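Here is a minimal Postgres sketch of the idea; the meta.watermarks table and the 'orders' source name are illustrative assumptions, not a real schema:
-- Hypothetical watermark table: one row per source
CREATE TABLE IF NOT EXISTS meta.watermarks (
    source     TEXT PRIMARY KEY,
    high_water TIMESTAMPTZ NOT NULL
);
-- Fetch only records newer than the last successful run
SELECT o.*
FROM source.orders o
WHERE o.created_at > (SELECT high_water FROM meta.watermarks WHERE source = 'orders');
-- Advance the watermark only after the batch commits cleanly
UPDATE meta.watermarks
SET high_water = $1  -- MAX(created_at) of the batch just processed
WHERE source = 'orders';
Running the load and the watermark update in the same transaction keeps a crash from skipping or double-processing records.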
Pattern 4: Dead Letter Queues
Not every record will parse cleanly. Instead of failing the entire batch (a SQL sketch follows the list):
- Try to process the record
- On failure, write it to a dead_letter table with the error message
- Continue processing the rest
- Alert when the dead letter count exceeds a threshold
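A sketch of the supporting table and writes, assuming a Postgres setup; the ops.dead_letter name, its columns, and the alert threshold are invented for illustration:
-- Hypothetical dead-letter table
CREATE TABLE IF NOT EXISTS ops.dead_letter (
    id        BIGSERIAL PRIMARY KEY,
    source    TEXT NOT NULL,
    payload   JSONB,
    error     TEXT NOT NULL,
    failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- On a per-record failure, park the record instead of aborting the batch
INSERT INTO ops.dead_letter (source, payload, error)
VALUES ('stripe', $1::JSONB, $2);
-- Alerting query: count recent failures against an example threshold
SELECT COUNT(*) > 100 AS should_alert
FROM ops.dead_letter
WHERE failed_at > NOW() - INTERVAL '1 hour';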
Pattern 5: Backfill by Design
Assume you will need to reprocess historical data, and design for it from day one (a SQL sketch follows the list):
- Partition by date so you can reprocess one day without touching the rest
- Parameterize date ranges in every pipeline step
- Log processed ranges for auditability
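One way this can look in Postgres, using declarative range partitioning; the table and partition names are illustrative assumptions:
-- Land events into date partitions so one day can be rebuilt in isolation
CREATE TABLE raw.events_by_day (
    event_id   TEXT NOT NULL,
    payload    JSONB NOT NULL,
    event_date DATE NOT NULL
) PARTITION BY RANGE (event_date);
CREATE TABLE raw.events_2024_06_01 PARTITION OF raw.events_by_day
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');
-- Backfill one day: clear just that partition, then re-run the load
-- with the pipeline's date-range parameters set to 2024-06-01
TRUNCATE raw.events_2024_06_01;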
Summary
These five patterns won't make your pipeline bulletproof, but they'll make the 3am pages a lot less frequent.