Metric Granularity for Batch Workloads

The Question This Post Answers

Part 2 established the architecture: StatsD for counts, Pushgateway for state snapshots. But two critical questions remain:

What level of granularity should you achieve with each tool?
Why Gauges over Counters — what does it actually fix, and what doesn't it fix?

The Three Levels of Observability Granularity

Level 1: Aggregate Operational Metrics

Tool: StatsD → Prometheus | Cardinality: Low

These are your "system health" metrics: airflow_ti_finish_total, airflow_dag_duration_seconds, pool utilisation. Granularity is per-DAG, per-task-name, per-state. No run_id. StatsD natively aggregates thousands of UDP bursts into single metrics. These power your SLO alerts and operational dashboards.
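To make the "no run_id" point concrete, here is a minimal Python sketch of the aggregation idea. It is not Airflow's or StatsD's actual code; the event tuples and the `aggregate_finishes` helper are hypothetical names chosen for illustration.

```python
from collections import Counter

def aggregate_finishes(events):
    """Count task-finish events per (dag_id, task_id, state), ignoring run_id."""
    counts = Counter()
    for dag_id, task_id, run_id, state in events:
        # run_id is deliberately dropped: keeping it would create one series per run.
        counts[(dag_id, task_id, state)] += 1
    return counts

# Hypothetical events: three runs of the same task.
events = [
    ("etl_daily", "load", "run_2024_01_01", "success"),
    ("etl_daily", "load", "run_2024_01_02", "failed"),
    ("etl_daily", "load", "run_2024_01_03", "success"),
]
counts = aggregate_finishes(events)
```

Three runs collapse into two series, one per state. The number of series is bounded by the (DAG, task, state) combinations, however many runs follow.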

Level 2: Task State Snapshots

Tool: Pushgateway (Gauges) → Prometheus | Cardinality: Depends on approach

These answer "what is the current state of task X?": airflow_task_instance_status, airflow_task_instance_duration_seconds, airflow_task_instance_retries.

There are two approaches, and the right one depends on your scale:

V3 (Production — Low Cardinality): Grouping key uses only dag_id + task_id + instance. Each task has exactly ONE slot. Latest execution overwrites previous — shows current state only. Cardinality is bounded by the number of unique (dag_id, task_id) pairs. No Sweeper needed.

V2 (Stepping Stone — High Cardinality): Grouping key includes run_id. Each execution gets its own slot. You can see the state of every individual run. But cardinality grows linearly with executions and requires a Sweeper DAG. Acceptable for small systems (≤ hundreds of runs/day). Not production-grade at scale.
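The difference between the two grouping keys can be sketched with an in-memory stand-in for Pushgateway. This is a simplified model under stated assumptions, not the plugin code from this series: `FakePushgateway` and `grouping_key` are hypothetical helpers, and the only behaviour modelled is "one slot per distinct grouping key, last push wins".

```python
def grouping_key(dag_id, task_id, instance, run_id=None):
    """Build a grouping key; passing run_id gives the V2 style, omitting it gives V3."""
    key = {"dag_id": dag_id, "task_id": task_id, "instance": instance}
    if run_id is not None:
        key["run_id"] = run_id
    return key

class FakePushgateway:
    """In-memory stand-in: one slot per distinct grouping key, last push wins."""

    def __init__(self):
        self.slots = {}

    def push(self, key, value):
        # A dict is not hashable, so freeze the labels into sorted pairs.
        self.slots[tuple(sorted(key.items()))] = value

v3, v2 = FakePushgateway(), FakePushgateway()
for n in range(100):  # 100 executions of the same task
    v3.push(grouping_key("etl_daily", "load", "worker-1"), 1)
    v2.push(grouping_key("etl_daily", "load", "worker-1", run_id=f"run_{n}"), 1)
```

After 100 executions the V3 gateway still holds one slot, while the V2 gateway holds 100 that something has to sweep. With the real `prometheus_client` library, a dict of this shape would be passed as the `grouping_key` argument of `push_to_gateway`.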

🚨 The Golden Rule: NEVER inject high-cardinality keys like run_id into Prometheus at scale. Doing so causes severe series churn, bloats the TSDB index, and destroys query performance. If you need per-run history, that's a Level 3 problem — use OLAP or structured logs.

Critical constraint: Both V2 and V3 MUST use Gauges, not Counters. The reason is semantic, not performance-related; the Counter vs Gauge section below breaks it down.

Level 3: Per-Run Execution History (Audit & Traceability)

Tool: NOT Prometheus | Cardinality: Unbounded

This is your "what exactly happened inside the job" data: rows processed, data quality scores, error messages, data lineage, reconciliation records. Must be queryable weeks or months later.

Why NOT Prometheus: Prometheus is optimised for aggregate monitoring with low-to-moderate cardinality and short retention. Per-execution audit data has unbounded cardinality and requires durable, long-term, exact-value storage. This includes per-run_id tracking at scale — if your system processes thousands of runs per day, per-run data is an OLAP problem, not a metrics problem.

Where it belongs:

Structured execution records: Log aggregators like Grafana Loki or OpenSearch are perfect for indexing logs and stack traces without cardinality explosion.
Row counts and data quality scores: OLAP databases like ClickHouse, BigQuery, or DuckDB are optimized for analytical queries over millions of high-cardinality execution logs.
Real-time execution events: Streaming platforms like Kafka decouple execution events and route them safely to OLAP sinks or real-time alerting systems.
Simple audit tables: Traditional relational databases like PostgreSQL are suitable for light transactional audit trails.
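As a rough illustration of a structured execution record, the sketch below serialises one task execution as a single JSON log line. The field names and the `execution_record` helper are assumptions made for this example, not a schema prescribed by Loki, ClickHouse, or Airflow.

```python
import json
from datetime import datetime, timezone

def execution_record(dag_id, task_id, run_id, state, rows_processed, error=None):
    """Serialise one task execution as a single JSON log line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,  # lives in the record body, not in a metric label
        "state": state,
        "rows_processed": rows_processed,
        "error": error,
    }
    return json.dumps(record, sort_keys=True)

# Hypothetical run used for illustration.
line = execution_record("etl_daily", "load", "run_2024_01_01", "success", 12345)
```

Because run_id is a field inside the record rather than a Prometheus label, each new run adds a row to the log or OLAP store instead of a new time series.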

We cover event-based audit implementation in Part 5.

Counter vs Gauge: The Precise Technical Argument

What Most Guides Get Wrong

Many guides claim:

❌ Gauges "solve" cardinality problems that Counters create
❌ count(gauge == -1) is more efficient than sum(counter)
❌ Sweeping Counters from Pushgateway "destroys history" but sweeping Gauges doesn't

None of these are true. If both use the same labels (including run_id), they produce identical cardinality, identical query cost, and identical behaviour after sweeping. Once Prometheus scrapes a metric, that data lives in the TSDB until retention expires, regardless of whether you delete it from Pushgateway.

What Gauges Actually Fix

1. Retry Safety (The Strongest Argument)

Task lifecycle: running → failed → retried → success.

Gauge (status = -1 then overwritten to status = 1): Dashboard query count(status == -1) correctly shows zero failures because the latest state is success.

Counter (failure_total++ then success_total++): Both increments persist. Dashboard shows a failure AND a success for the same task. Failure count is permanently inflated. No way to undo.
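The retry scenario can be replayed in a few lines of Python. This is a toy model, assuming a gauge is a value that each push overwrites and a counter is a value that only increments; the `replay` function and its outcome strings are hypothetical.

```python
FAILED, SUCCESS = -1, 1

def replay(lifecycle):
    """Replay task outcomes into a gauge (overwrite) and counters (increment)."""
    gauge = None
    counters = {"failure_total": 0, "success_total": 0}
    for outcome in lifecycle:
        if outcome == "failed":
            gauge = FAILED                  # overwritten by the next push
            counters["failure_total"] += 1  # can never be decremented
        elif outcome == "success":
            gauge = SUCCESS
            counters["success_total"] += 1
    return gauge, counters

# First attempt fails, the retry succeeds.
gauge, counters = replay(["failed", "success"])
```

The gauge ends at 1 (success), so a query for tasks in state -1 finds nothing. The counters end with one failure and one success recorded for a task that ultimately succeeded.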

2. Natural State Modelling

A Gauge maps directly to task lifecycle:

task_state = 0   →  running
task_state = -1  →  failed
task_state = 1   →  success
task_state = 2   →  skipped

Counters only go up. You'd need separate counters per state with no way to determine a task's final state.
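One way to keep this mapping in a single place is an `IntEnum`. The numeric values come from the list above; the `TaskState` class and `gauge_value` helper are illustrative names, not part of Airflow or the Prometheus client.

```python
from enum import IntEnum

class TaskState(IntEnum):
    """Gauge values for each task lifecycle state."""
    RUNNING = 0
    FAILED = -1
    SUCCESS = 1
    SKIPPED = 2

def gauge_value(state_name):
    """Translate a state name such as "failed" into its gauge value."""
    return int(TaskState[state_name.upper()])
```

For example, `gauge_value("failed")` yields -1, the value a dashboard would filter on to find failed tasks.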

3. Counter Reset Semantics

Counters assume a long-lived process whose value only ever increases. A short-lived task starts again from 0 on every execution. rate() and increase() treat each drop in value as a reset and try to compensate, which produces unpredictable results for ephemeral tasks.
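To see why this goes wrong, consider a deliberately simplified model of reset handling: any drop in value is treated as a restart from 0. The real increase() also extrapolates to the window boundaries, which this sketch omits, and `simple_increase` is a hypothetical name.

```python
def simple_increase(samples):
    """Sum the increase across samples, treating any drop as a counter reset."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur         # reset: assume the counter restarted from 0
        else:
            total += cur - prev  # normal monotonic growth
    return total

long_lived = [1, 2, 4]  # one process, counter keeps climbing
ephemeral = [1, 1, 1]   # three short-lived runs, each pushing failure_total = 1
```

For the long-lived series the result is 3, as expected. For the ephemeral series it is 0: three separate failures look like one flat counter, so nothing is reported.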

What Gauges Do NOT Fix

While Gauges offer semantic improvements, they do not resolve scale and storage issues. Here is a breakdown of what Gauges do and do not fix compared to Counters:

What they do NOT fix:
- Cardinality explosion from run_id labels: Both Counters and Gauges produce identical cardinality.
- Pushgateway OOM without Sweeper: Both will exhaust memory identically if not cleaned up.
- Prometheus series churn: Stale metadata remains an issue for both.
- Query performance (count vs sum): Both require scanning the exact same number of active series.

What they DO fix:
- Task state modelling: Overwriting values cleanly aligns with a discrete lifecycle.
- Retry correctness: Overwriting errors with success ensures accurate final counts.
- Counter reset semantics: Gauges bypass unpredictable rate() calculations for short-lived tasks.

Bottom line: Gauges fix what your data means. They do not fix how much data you produce.

The Complete Decision Matrix

To choose the right monitoring pattern, map your core questions to the correct tool and ingestion plugin:

How many tasks failed today?
- Tool: StatsD → Prometheus
- Plugin/Method: Native Airflow integration (safe, aggregated UDP)

What is the average DAG duration?
- Tool: StatsD → Prometheus
- Plugin/Method: Native Airflow integration

What is the latest state of task X?
- Tool: Pushgateway (Gauge) → Prometheus
- Plugin/Method: V3 plugin (production-grade, low-cardinality)

Did task X in run Y succeed? (at small scale)
- Tool: Pushgateway (Gauge) → Prometheus
- Plugin/Method: V2 plugin (stepping stone, uses run_id)

Did task X in run Y succeed? (at production scale)
- Tool: NOT Prometheus (Grafana Loki or an OLAP backend)
- Plugin/Method: Structured JSON logs

How many rows did run Y process?
- Tool: NOT Prometheus (Loki, ClickHouse, or BigQuery)
- Plugin/Method: Structured logs

What upstream sources did run Y read?
- Tool: NOT Prometheus (OpenLineage and OLAP engines)
- Plugin/Method: Dedicated event pipeline

Key Takeaways

Three levels of granularity — aggregate (StatsD), task state (Pushgateway Gauges), per-run audit (NOT Prometheus).
Gauges fix semantics, not scale. Retry safety and state modelling are the real arguments. Cardinality and query cost are identical for both metric types.
Never inject run_id into Prometheus at scale. Use V3 (low cardinality) for production dashboards. Use OLAP/Loki for per-run history.
Prometheus is not an audit store. Per-execution data belongs in Loki, OLAP, or event streams.
