Metric Granularity for Batch Workloads

What to Track, Where to Store It, and Why Counters Break

Hi, I'm Chirag! By day, I'm a Staff Engineer designing high-scale distributed platforms and full-stack systems. By night, I’m just a highly curious engineer who loves writing down what I learn, sharing my architectural experiments, and chatting about how things work (and why they sometimes break!). I started this blog because I believe the best engineering happens when we share our blueprints openly, embrace our mistakes, and learn from one another. Grab a cup of coffee and let's explore!

The Question This Post Answers

Part 2 established the architecture: StatsD for counts, Pushgateway for state snapshots. But two critical questions remain:

  1. What level of granularity should you target with each tool?

  2. Why Gauges over Counters — what does it actually fix, and what doesn't it fix?

The Three Levels of Observability Granularity

Level 1: Aggregate Operational Metrics

Tool: StatsD → Prometheus | Cardinality: Low

These are your "system health" metrics: airflow_ti_finish_total, airflow_dag_duration_seconds, pool utilisation. Granularity is per-DAG, per-task-name, per-state. No run_id. StatsD natively aggregates thousands of UDP bursts into single metrics. These power your SLO alerts and operational dashboards.
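As a rough illustration of why this level stays low-cardinality, the sketch below aggregates task-finish events by (dag_id, task_id, state) and discards run_id. It is a plain-Python stand-in for the aggregation idea, not the StatsD implementation, and the event fields and values are hypothetical.

```python
from collections import Counter

# Hypothetical task-finish events. run_id exists on each event but is
# deliberately ignored during aggregation.
events = [
    {"dag_id": "etl_daily", "task_id": "load", "state": "success", "run_id": "r1"},
    {"dag_id": "etl_daily", "task_id": "load", "state": "success", "run_id": "r2"},
    {"dag_id": "etl_daily", "task_id": "load", "state": "failed", "run_id": "r3"},
]


def aggregate(events):
    """Count events per (dag_id, task_id, state); run_id never becomes a key."""
    totals = Counter()
    for event in events:
        totals[(event["dag_id"], event["task_id"], event["state"])] += 1
    return totals


totals = aggregate(events)
for key, value in sorted(totals.items()):
    print(key, value)
```

Three runs collapse into two series here, and adding more runs changes only the values, never the number of series.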

Level 2: Task State Snapshots

Tool: Pushgateway (Gauges) → Prometheus | Cardinality: Depends on approach

These answer "what is the current state of task X?": airflow_task_instance_status, airflow_task_instance_duration_seconds, airflow_task_instance_retries.

[Figure: Prometheus Metrics UI]

There are two approaches, and the right one depends on your scale:

V3 (Production, Low Cardinality): The grouping key uses only dag_id + task_id + instance. Each task has exactly ONE slot. The latest execution overwrites the previous one, so the slot shows current state only. Cardinality is bounded by the number of unique (dag_id, task_id) pairs. No Sweeper needed.

V2 (Stepping Stone, High Cardinality): The grouping key includes run_id. Each execution gets its own slot, so you can see the state of every individual run. But cardinality grows linearly with the number of executions, and the approach requires a Sweeper DAG to clean up old slots. Acceptable for small systems (up to a few hundred runs/day). Not production-grade at scale.
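The difference between the two grouping keys can be sketched with a dictionary that stands in for the Pushgateway's group store. This is a simulation under stated assumptions, not the plugin code: the push helper, the label values, and the "airflow" instance name are all hypothetical.

```python
def push(store, grouping_key, value):
    """Simulate a push: one slot per distinct grouping key, last write wins."""
    store[tuple(sorted(grouping_key.items()))] = value


v3_store, v2_store = {}, {}

# Three executions of the same task: one failure, then two successes.
for run_id, status in [("run_1", -1), ("run_2", 1), ("run_3", 1)]:
    base = {"dag_id": "etl_daily", "task_id": "load", "instance": "airflow"}
    push(v3_store, base, status)                        # V3: no run_id
    push(v2_store, {**base, "run_id": run_id}, status)  # V2: run_id in the key

print(len(v3_store))  # 1
print(len(v2_store))  # 3
```

The V3 store holds a single slot whose value is the latest status, while the V2 store gains one slot per run and keeps growing until something deletes old entries.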

🚨 The Golden Rule: NEVER inject high-cardinality keys like run_id into Prometheus at scale. Doing so causes severe series churn, bloats the TSDB index, and destroys query performance. If you need per-run history, that's a Level 3 problem — use OLAP or structured logs.
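To put rough numbers on the rule, the sketch below estimates how many distinct series each approach creates. All inputs (200 task pairs, 3 metrics per task, 24 runs per day, a 30-day window) are made-up figures for illustration, and the helper is hypothetical.

```python
def distinct_series(task_pairs, metrics_per_task, runs_per_day=1, days=1, per_run=False):
    """Estimate distinct series created over a window.

    Without run_id the count depends only on tasks and metrics.
    With run_id every execution adds a new set of series.
    """
    base = task_pairs * metrics_per_task
    return base * runs_per_day * days if per_run else base


v3 = distinct_series(200, 3)
v2 = distinct_series(200, 3, runs_per_day=24, days=30, per_run=True)
print(v3, v2)  # 600 432000
```

Under these assumptions the per-run key creates 720 times as many series over the window, and the gap widens with every additional day of retention.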

Critical constraint: Both V2 and V3 MUST use Gauges, not Counters. The reason is semantic, not performance-related; the Counter vs Gauge section below breaks it down.

Level 3: Per-Run Execution History (Audit & Traceability)

Tool: NOT Prometheus | Cardinality: Unbounded

This is your "what exactly happened inside the job" data: rows processed, data quality scores, error messages, data lineage, reconciliation records. Must be queryable weeks or months later.

Why NOT Prometheus: Prometheus is optimised for aggregate monitoring with low-to-moderate cardinality and short retention. Per-execution audit data has unbounded cardinality and requires durable, long-term, exact-value storage. This includes per-run_id tracking at scale — if your system processes thousands of runs per day, per-run data is an OLAP problem, not a metrics problem.

Where it belongs:

  • Structured execution records: Log aggregators like Grafana Loki or OpenSearch are perfect for indexing logs and stack traces without cardinality explosion.

  • Row counts and data quality scores: OLAP databases like ClickHouse, BigQuery, or DuckDB are optimized for analytical queries over millions of high-cardinality execution logs.

  • Real-time execution events: Streaming platforms like Kafka decouple execution events and route them safely to OLAP sinks or real-time alerting systems.

  • Simple audit tables: Traditional relational databases like PostgreSQL are suitable for light transactional audit trails.
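As one possible shape for a per-run record, the sketch below serialises an execution summary as a single JSON line that a log aggregator or OLAP sink could ingest. The field names and the execution_record helper are assumptions for illustration, not a schema from this series.

```python
import json
from datetime import datetime, timezone


def execution_record(dag_id, task_id, run_id, state, rows_processed):
    """Build one JSON line describing a single task execution."""
    return json.dumps(
        {
            "ts": datetime.now(timezone.utc).isoformat(),
            "dag_id": dag_id,
            "task_id": task_id,
            "run_id": run_id,  # high-cardinality, which is fine outside Prometheus
            "state": state,
            "rows_processed": rows_processed,
        },
        sort_keys=True,
    )


line = execution_record("etl_daily", "load", "run_1", "success", 1200)
print(line)
```

Because run_id is a field in a record rather than a metric label, it adds a row to the store instead of a new time series.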

We cover event-based audit implementation in Part 5.

Counter vs Gauge: The Precise Technical Argument

What Most Guides Get Wrong

Many guides claim:

  • ❌ Gauges "solve" cardinality problems that Counters create

  • ❌ count(gauge == -1) is more efficient than sum(counter)

  • ❌ Sweeping Counters from Pushgateway "destroys history" but sweeping Gauges doesn't

None of these are true. If both use the same labels (including run_id), they produce identical cardinality, identical query cost, and identical behaviour after sweeping. Once Prometheus scrapes a metric, that data lives in the TSDB until retention expires, regardless of whether you delete it from Pushgateway.

What Gauges Actually Fix

[Figure: Lab Comparison Dashboard]

1. Retry Safety (The Strongest Argument)

Task lifecycle: running → failed → retried → success.

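Gauge (status = -1, then overwritten to status = 1): The dashboard query count(status == -1) correctly excludes this task because its latest state is success. If nothing matches, count() returns an empty result rather than 0, which you can render as zero with "or vector(0)".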

Counter (failure_total++ then success_total++): Both increments persist. Dashboard shows a failure AND a success for the same task. Failure count is permanently inflated. No way to undo.
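The retry scenario can be replayed in a few lines of plain Python. This is a simulation of the two bookkeeping styles, not Prometheus client code; the constants reuse the state values used in this post, and run_lifecycle is a hypothetical helper.

```python
RUNNING, FAILED, SUCCESS = 0, -1, 1


def run_lifecycle(states):
    """Replay a task lifecycle against a gauge and a pair of counters."""
    gauge = {"status": None}
    counters = {"failure_total": 0, "success_total": 0}
    for state in states:
        gauge["status"] = state  # a gauge overwrites its previous value
        if state == FAILED:
            counters["failure_total"] += 1  # a counter can only grow
        elif state == SUCCESS:
            counters["success_total"] += 1
    return gauge, counters


# running -> failed -> retried (running again) -> success
gauge, counters = run_lifecycle([RUNNING, FAILED, RUNNING, SUCCESS])
failures_by_gauge = 1 if gauge["status"] == FAILED else 0
print(failures_by_gauge, counters)
```

The gauge view reports no failure for the task, while the counter view still carries the failure from the first attempt alongside the success.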

2. Natural State Modelling

A Gauge maps directly to task lifecycle:

task_state = 0   →  running
task_state = -1  →  failed
task_state = 1   →  success
task_state = 2   →  skipped

Counters only go up. You'd need separate counters per state with no way to determine a task's final state.
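One way to keep this mapping in a single place is an IntEnum, sketched below. The class and helper names are hypothetical, and the numeric values are the ones listed above.

```python
from enum import IntEnum


class TaskState(IntEnum):
    """Gauge values for the task lifecycle, as listed in the mapping above."""
    FAILED = -1
    RUNNING = 0
    SUCCESS = 1
    SKIPPED = 2


def to_gauge_value(state_name: str) -> int:
    """Translate a state name such as 'failed' into its gauge value."""
    return TaskState[state_name.upper()].value


print(to_gauge_value("failed"))  # -1
print(TaskState(2).name)         # SKIPPED
```

An unknown state name raises a KeyError, which is arguably preferable to silently pushing a value that no dashboard understands.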

3. Counter Reset Semantics

Counters assume a long-lived process whose value increases monotonically. A short-lived task starts again from 0 on every execution. rate() and increase() treat any drop in value as a counter reset and try to compensate, which produces unpredictable results for ephemeral tasks.
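The effect can be seen with a deliberately simplified model of reset handling. The function below only sums sample-to-sample deltas and treats a drop as a restart from zero; it ignores the extrapolation and staleness handling that the real increase() performs, so treat it as a sketch of the idea rather than the Prometheus algorithm.

```python
def simple_increase(samples):
    """Sum the deltas between consecutive samples.

    A drop in value is treated as a reset to zero, so the whole new
    value counts as increase. No extrapolation is applied.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total


# Long-lived process: the counter climbs steadily.
print(simple_increase([0, 1, 2, 3]))  # 3

# Three short-lived executions, each pushing failure_total = 1.
# The value never drops, so no reset is detected and nothing is counted.
print(simple_increase([1, 1, 1]))  # 0

# A drop from 5 to 2 is read as a reset: 2 + (4 - 2) = 4.
print(simple_increase([5, 2, 4]))  # 4
```

In the second case three failures occurred but the computed increase is zero, which is the kind of mismatch that makes counters a poor fit for ephemeral tasks.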

What Gauges Do NOT Fix

While Gauges offer semantic improvements, they do not resolve scale and storage issues. Here is a breakdown of what Gauges do and do not fix compared to Counters:

  • What they do NOT fix:

    • Cardinality explosion from run_id labels: Both Counters and Gauges produce identical cardinality.

    • Pushgateway OOM without Sweeper: Both will exhaust memory identically if not cleaned up.

    • Prometheus series churn: Stale series and their index entries remain an issue for both.

    • Query performance (count vs sum): Both require scanning the exact same number of active series.

  • What they DO fix:

    • Task state modelling: Overwriting values cleanly aligns with a discrete lifecycle.

    • Retry correctness: Overwriting errors with success ensures accurate final counts.

    • Counter reset semantics: Gauges bypass unpredictable rate() calculations for short-lived tasks.

Bottom line: Gauges fix what your data means. They do not fix how much data you produce.

The Complete Decision Matrix

To choose the right monitoring pattern, map your core questions to the correct tool and ingestion plugin:

  • How many tasks failed today?

    • Tool: StatsD → Prometheus

    • Plugin: Native Airflow integration (safe, aggregated UDP)

  • What is the average DAG duration?

    • Tool: StatsD → Prometheus

    • Plugin: Native Airflow integration

  • What is the latest state of task X?

    • Tool: Pushgateway (Gauge) → Prometheus

    • Plugin: V3 plugin (production-grade, low-cardinality)

  • Did task X in run Y succeed? (at small scale)

    • Tool: Pushgateway (Gauge) → Prometheus

    • Plugin: V2 plugin (stepping stone, uses run_id)

  • Did task X in run Y succeed? (at production scale)

    • Tool: NOT Prometheus (Grafana Loki or an OLAP backend)

    • Plugin/Method: Structured JSON logs

  • How many rows did run Y process?

    • Tool: NOT Prometheus (Loki, ClickHouse, or BigQuery)

    • Plugin/Method: Structured logs

  • What upstream sources did run Y read?

    • Tool: NOT Prometheus (OpenLineage and OLAP engines)

    • Plugin/Method: Dedicated event pipeline

Key Takeaways

  • Three levels of granularity — aggregate (StatsD), task state (Pushgateway Gauges), per-run audit (NOT Prometheus).

  • Gauges fix semantics, not scale. Retry safety and state modelling are the real arguments. Cardinality and query cost are identical for both metric types.

  • Never inject run_id into Prometheus at scale. Use V3 (low cardinality) for production dashboards. Use OLAP/Loki for per-run history.

  • Prometheus is not an audit store. Per-execution data belongs in Loki, OLAP, or event streams.
