Metric Granularity for Batch Workloads
What to Track, Where to Store It, and Why Counters Break

The Question This Post Answers
Part 2 established the architecture: StatsD for counts, Pushgateway for state snapshots. But two critical questions remain:
What level of granularity should you achieve with each tool?
Why Gauges over Counters — what does it actually fix, and what doesn't it fix?
The Three Levels of Observability Granularity
Level 1: Aggregate Operational Metrics
Tool: StatsD → Prometheus | Cardinality: Low
These are your "system health" metrics: airflow_ti_finish_total, airflow_dag_duration_seconds, pool utilisation. Granularity is per-DAG, per-task-name, per-state. No run_id. StatsD natively aggregates thousands of UDP bursts into single metrics. These power your SLO alerts and operational dashboards.
Level 2: Task State Snapshots
Tool: Pushgateway (Gauges) → Prometheus | Cardinality: Depends on approach
These answer "what is the current state of task X?": airflow_task_instance_status, airflow_task_instance_duration_seconds, airflow_task_instance_retries.
There are two approaches, and the right one depends on your scale:
V3 (Production — Low Cardinality): Grouping key uses only dag_id + task_id + instance. Each task has exactly ONE slot. Latest execution overwrites previous — shows current state only. Cardinality is bounded by the number of unique (dag_id, task_id) pairs. No Sweeper needed.
V2 (Stepping Stone — High Cardinality): Grouping key includes run_id. Each execution gets its own slot. You can see the state of every individual run. But cardinality grows linearly with executions and requires a Sweeper DAG. Acceptable for small systems (≤ hundreds of runs/day). Not production-grade at scale.
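The difference between the two grouping keys can be sketched with a small simulation. This is an illustrative model only: the `slots` dictionary stands in for Pushgateway's in-memory groups, and the helper names (`grouping_key`, `simulate`), the instance value `airflow-1`, and the DAG and task names are hypothetical, not part of the V2 or V3 plugins.

```python
from itertools import product


def grouping_key(dag_id, task_id, instance, run_id=None):
    """Build a grouping key; run_id is only included in the V2 approach."""
    key = {"dag_id": dag_id, "task_id": task_id, "instance": instance}
    if run_id is not None:
        key["run_id"] = run_id
    # A sorted tuple makes the key hashable and order-independent.
    return tuple(sorted(key.items()))


def simulate(runs, tasks, use_run_id):
    """Return how many slots exist after every task of every run has pushed."""
    slots = {}  # hypothetical stand-in for Pushgateway's stored groups
    for run_id, (dag_id, task_id) in product(runs, tasks):
        key = grouping_key(
            dag_id, task_id, "airflow-1", run_id if use_run_id else None
        )
        slots[key] = 1  # a later push to the same key overwrites the earlier one
    return len(slots)


tasks = [("etl", "extract"), ("etl", "load")]
runs = [f"run_{i}" for i in range(100)]

v3_slots = simulate(runs, tasks, use_run_id=False)
v2_slots = simulate(runs, tasks, use_run_id=True)
print(v3_slots, v2_slots)  # prints: 2 200
```

With 2 tasks and 100 runs, the V3 key stays at 2 slots no matter how many runs execute, while the V2 key grows to 200 and keeps growing until something deletes old groups.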
🚨 The Golden Rule: NEVER inject high-cardinality keys like run_id into Prometheus at scale. Doing so causes severe series churn, bloats the TSDB index, and destroys query performance. If you need per-run history, that's a Level 3 problem — use OLAP or structured logs.
Critical constraint: Both V2 and V3 MUST use Gauges, not Counters. The reason is semantic, not performance-related, and is broken down in the Counter vs Gauge section below.
Level 3: Per-Run Execution History (Audit & Traceability)
Tool: NOT Prometheus | Cardinality: Unbounded
This is your "what exactly happened inside the job" data: rows processed, data quality scores, error messages, data lineage, reconciliation records. Must be queryable weeks or months later.
Why NOT Prometheus: Prometheus is optimised for aggregate monitoring with low-to-moderate cardinality and short retention. Per-execution audit data has unbounded cardinality and requires durable, long-term, exact-value storage. This includes per-run_id tracking at scale — if your system processes thousands of runs per day, per-run data is an OLAP problem, not a metrics problem.
Where it belongs:
Structured execution records: Log aggregators like Grafana Loki or OpenSearch are perfect for indexing logs and stack traces without cardinality explosion.
Row counts and data quality scores: OLAP databases like ClickHouse, BigQuery, or DuckDB are optimized for analytical queries over millions of high-cardinality execution logs.
Real-time execution events: Streaming platforms like Kafka decouple execution events and route them safely to OLAP sinks or real-time alerting systems.
Simple audit tables: Traditional relational databases like PostgreSQL are suitable for light transactional audit trails.
We cover event-based audit implementation in Part 5.
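As a rough illustration of what a structured execution record might look like before it is shipped to a log aggregator or OLAP sink, here is a minimal sketch. The `ExecutionRecord` class and all of its field names are assumptions made for this example, not a schema defined by Airflow, Loki, or any of the backends above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExecutionRecord:
    """Hypothetical per-run audit record; field names are illustrative."""
    dag_id: str
    task_id: str
    run_id: str  # safe here: a log field, not a Prometheus label
    state: str
    rows_processed: int
    error_message: Optional[str] = None


def to_log_line(record):
    """Serialise the record as one JSON object per line."""
    return json.dumps(asdict(record), sort_keys=True)


line = to_log_line(ExecutionRecord("etl", "load", "run_42", "success", 1200))
print(line)
```

Because `run_id` travels as a field inside the record rather than as a metric label, it adds no time series to Prometheus regardless of how many runs execute.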
Counter vs Gauge: The Precise Technical Argument
What Most Guides Get Wrong
Many guides claim:
❌ Gauges "solve" cardinality problems that Counters create
❌ count(gauge == -1) is more efficient than sum(counter)
❌ Sweeping Counters from Pushgateway "destroys history" but sweeping Gauges doesn't
None of these are true. If both use the same labels (including run_id), they produce identical cardinality, identical query cost, and identical behaviour after sweeping. Once Prometheus scrapes a metric, that data lives in the TSDB until retention expires, regardless of whether you delete it from Pushgateway.
What Gauges Actually Fix
1. Retry Safety (The Strongest Argument)
Task lifecycle: running → failed → retried → success.
Gauge (status = -1 then overwritten to status = 1): The dashboard query count(status == -1) correctly reports no failures for this task, because the latest state is success. (When no series match, PromQL returns an empty result rather than a literal 0.)
Counter (failure_total++ then success_total++): Both increments persist. Dashboard shows a failure AND a success for the same task. Failure count is permanently inflated. No way to undo.
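The retry scenario can be replayed in a few lines. This is a simplified model, assuming one gauge slot per task and two hypothetical counters named `failure_total` and `success_total`; the `replay` helper is illustrative and not part of any plugin.

```python
def replay(events):
    """Replay task attempts against one gauge slot and a pair of counters."""
    gauge = None  # single slot, overwritten on every push
    counters = {"failure_total": 0, "success_total": 0}
    for outcome in events:
        if outcome == "failed":
            gauge = -1
            counters["failure_total"] += 1
        elif outcome == "success":
            gauge = 1
            counters["success_total"] += 1
    return gauge, counters


# One task: first attempt fails, the retry succeeds.
gauge, counters = replay(["failed", "success"])
print(gauge, counters)  # prints: 1 {'failure_total': 1, 'success_total': 1}
```

The gauge ends at 1, the task's final state. The counters record one failure and one success for the same task, and nothing in them says which came last.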
2. Natural State Modelling
A Gauge maps directly to task lifecycle:
task_state = 0 → running
task_state = -1 → failed
task_state = 1 → success
task_state = 2 → skipped
Counters only go up. You'd need separate counters per state with no way to determine a task's final state.
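One way to keep the numeric encoding above in a single place is an enum shared by the code that pushes the gauge and the code that reads it. The sketch below assumes the four values listed above; the `TaskState` class and `describe` helper are hypothetical names.

```python
from enum import IntEnum


class TaskState(IntEnum):
    """Gauge encoding for task lifecycle states, as listed above."""
    FAILED = -1
    RUNNING = 0
    SUCCESS = 1
    SKIPPED = 2


def describe(gauge_value):
    """Decode a gauge sample (Prometheus values are floats) to a state name."""
    return TaskState(int(gauge_value)).name.lower()


print(describe(1.0))  # prints: success
```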
3. Counter Reset Semantics
Counters are designed for long-lived processes whose values increase monotonically. In a short-lived task, every execution starts its counters again at 0. rate() and increase() try to compensate for resets, but for ephemeral tasks the compensation produces unpredictable results.
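To see why reset compensation misbehaves for ephemeral tasks, consider a deliberately simplified model of increase(). It sums the deltas between consecutive samples and treats any drop as a reset to zero. It ignores the extrapolation that Prometheus applies at range boundaries, so treat it as an approximation, not the actual implementation. The function name `reset_aware_increase` is hypothetical.

```python
def reset_aware_increase(samples):
    """Simplified increase(): sum deltas, treating any drop as a reset to 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted at 0, so the whole value is new.
        total += (cur - prev) if cur >= prev else cur
    return total


# Long-lived process: the counter climbs across scrapes.
print(reset_aware_increase([0, 1, 2, 3]))  # prints: 3.0

# Ephemeral task: each of three failed runs starts at 0, increments once,
# and pushes failure_total=1. The scraped series never changes.
print(reset_aware_increase([1, 1, 1]))  # prints: 0.0
```

In the second case three real failures occurred, but because every run pushed the same value, the model sees no increase at all.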
What Gauges Do NOT Fix
While Gauges offer semantic improvements, they do not resolve scale and storage issues. Here is a breakdown of what Gauges do and do not fix compared to Counters:
What they do NOT fix:
Cardinality explosion from run_id labels: Both Counters and Gauges produce identical cardinality.
Pushgateway OOM without Sweeper: Both will exhaust memory identically if not cleaned up.
Prometheus series churn: Stale metadata remains an issue for both.
Query performance (count vs sum): Both require scanning the exact same number of active series.
What they DO fix:
Task state modelling: Overwriting values cleanly aligns with a discrete lifecycle.
Retry correctness: Overwriting errors with success ensures accurate final counts.
Counter reset semantics: Gauges bypass unpredictable rate() calculations for short-lived tasks.
Bottom line: Gauges fix what your data means. They do not fix how much data you produce.
The Complete Decision Matrix
To choose the right monitoring pattern, map your core questions to the correct tool and ingestion plugin:
How many tasks failed today?
Tool: StatsD → Prometheus
Plugin: Native Airflow integration (safe, aggregated UDP)
What is the average DAG duration?
Tool: StatsD → Prometheus
Plugin: Native Airflow integration
What is the latest state of task X?
Tool: Pushgateway (Gauge) → Prometheus
Plugin: V3 plugin (production-grade, low-cardinality)
Did task X in run Y succeed? (at small scale)
Tool: Pushgateway (Gauge) → Prometheus
Plugin: V2 plugin (stepping stone, uses run_id)
Did task X in run Y succeed? (at production scale)
Tool: NOT Prometheus (Grafana Loki or an OLAP backend)
Plugin/Method: Structured JSON logs
How many rows did run Y process?
Tool: NOT Prometheus (Loki, ClickHouse, or BigQuery)
Plugin/Method: Structured logs
What upstream sources did run Y read?
Tool: NOT Prometheus (OpenLineage and OLAP engines)
Plugin/Method: Dedicated event pipeline
Key Takeaways
Three levels of granularity — aggregate (StatsD), task state (Pushgateway Gauges), per-run audit (NOT Prometheus).
Gauges fix semantics, not scale. Retry safety and state modelling are the real arguments. Cardinality and query cost are identical for both metric types.
Never inject run_id into Prometheus at scale. Use V3 (low cardinality) for production dashboards. Use OLAP/Loki for per-run history.
Prometheus is not an audit store. Per-execution data belongs in Loki, OLAP, or event streams.


