<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[TheStaffBlueprint]]></title><description><![CDATA[The Staff Blueprint is a shared space for exploring the complex, often messy world of high-level data and software architecture. We document our production-grade strategies and architectural experiments—not as final truths, but as evolving blueprints. Here, we bridge the gap between senior intuition and staff-level clarity by building, failing, and iterating together. Technical excellence is our goal, and the journey (mistakes included) is how we get there.]]></description><link>https://blog.thestaffblueprint.com</link><image><url>https://cdn.hashnode.com/uploads/logos/6471d940421f715ac07f9905/45c14bc6-de12-42ae-b349-52d7cbbd3494.png</url><title>TheStaffBlueprint</title><link>https://blog.thestaffblueprint.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sun, 07 Jun 2026 05:26:12 GMT</lastBuildDate><atom:link href="https://blog.thestaffblueprint.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building the Stack — Plugin, StatsD, Sweeper, and Grafana]]></title><description><![CDATA[(Part 4 of the Batch Workloads Observability series. Read Part 3: Metric Granularity for the metric classification rationale.)
(Clone the companion repo: TheStaffBlueprint/batch-workloads-observabilit]]></description><link>https://blog.thestaffblueprint.com/building-the-stack-plugin-statsd-sweeper-and-grafana</link><guid isPermaLink="true">https://blog.thestaffblueprint.com/building-the-stack-plugin-statsd-sweeper-and-grafana</guid><category><![CDATA[airflow]]></category><category><![CDATA[#prometheus]]></category><category><![CDATA[statsd]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[Chirag Bhatia]]></dc:creator><pubDate>Fri, 29 May 2026 10:43:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/f127e49d-970f-473f-be04-16ab4db82599.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>(Part 4 of the Batch Workloads Observability series. Read</em> <a href="/airflow-counter-vs-gauge-batch"><em>Part 3: Metric Granularity</em></a> <em>for the metric classification rationale.)</em></p>
<p><em>(Clone the companion repo:</em> <a href="https://github.com/TheStaffBlueprint/batch-workloads-observability"><em>TheStaffBlueprint/batch-workloads-observability</em></a> <em>for the full docker-compose stack, plugins, and Sweeper DAG.)</em></p>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/15393a40-2fcc-4716-8b6d-524421ba92a0.png" alt="Part 4 Implementation Explainer" style="display:block;margin:0 auto" />

<p>In Parts 2 and 3, we established the architecture and metric classification: StatsD for aggregate counts, Pushgateway with Gauges for per-run state, and structured logs/OLAP for audit data. Today, we're building all of it.</p>
<h2>Step 1: StatsD — Zero Custom Code</h2>
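<p>Add these lines to your <code>airflow.cfg</code> (or set the equivalent <code>AIRFLOW__METRICS__*</code> environment variables). The StatsD client ships with the <code>apache-airflow[statsd]</code> extra:</p>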
<pre><code class="language-ini">[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 8125
statsd_prefix = airflow
</code></pre>
<p>Airflow natively pushes <code>ti.finish</code> counts, <code>dag.duration</code> timers, and dozens of other operational metrics over UDP. The StatsD exporter aggregates these and exposes them to Prometheus as clean, low-cardinality metrics. See <code>statsd_mapping.yml</code> in the companion repo for the full mapping configuration.</p>
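<p>Before building dashboards, it can help to confirm that UDP packets actually reach the exporter. The sketch below is a minimal smoke test and is not part of the companion repo: the helper name <code>send_test_counter</code> and the metric name <code>airflow.smoke_test</code> are hypothetical, and the default host and port simply mirror the config above.</p>

```python
import socket


def send_test_counter(host="statsd-exporter", port=8125, name="airflow.smoke_test"):
    """Send one StatsD counter increment over UDP and return the payload sent."""
    # StatsD line protocol is name:value|type, where type "c" means counter.
    payload = f"{name}:1|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: a successful send does not prove
        # that the exporter received the packet.
        sock.sendto(payload, (host, port))
    return payload
```

<p>After a call, a metric derived from that name should show up on the exporter's <code>/metrics</code> endpoint. If it does not, check the hostname, the port mapping, and whether UDP traffic is allowed between the containers.</p>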
<h2>Step 2: The Pushgateway Plugin — V1 to V2 Evolution</h2>
<h3>The V1 Anti-Pattern (Two Compounding Mistakes)</h3>
<p><strong>Mistake 1: No run isolation.</strong> V1 used a coarse grouping key (<code>dag_id</code> + <code>task_id</code>) with no <code>run_id</code>. Every execution of the same task overwrites the previous one. If 10 concurrent runs of the same task push at the same time, only the last push survives. This is a race condition caused by the <strong>grouping key design</strong>, not the metric type, so it would happen with Gauges too.</p>
<pre><code class="language-python"># V1 ANTI-PATTERN: Coarse key — no run_id
def _get_task_group_key(self, ti):
    return {
        'dag_id': ti.dag_id,
        'task_id': ti.task_id,
        'instance': self.instance_name,
    }
</code></pre>
<p><strong>Mistake 2: Using Counters for state.</strong> Even with <code>run_id</code> added, Counters are semantically wrong. If a task fails and then retries successfully, both <code>failure_total = 1</code> and <code>success_total = 1</code> persist, so the failure count is permanently inflated. Cardinality is not the difference: with <code>run_id</code> in the key, Counters and Gauges produce identical cardinality.</p>
<h3>The V2 Fix (Two Corrections)</h3>
<pre><code class="language-python"># airflow/plugins/v2_gauge_fix_plugin.py
import logging
import os

from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

class PushgatewayV2GaugeListeners:
    def __init__(self):
        self.instance_name = os.environ.get("AIRFLOW_VAR_PROMETHEUS_INSTANCE_NAME", "airflow-local")

    def _push(self, registry, job_name, group_key):
        # THIS is the key: Read from environment to avoid SQLAlchemy session detachment
        enabled_str = os.environ.get("AIRFLOW_VAR_PROMETHEUS_METRICS_ENABLED", "true").lower()
        push_gateway_url = os.environ.get("AIRFLOW_VAR_PUSHGATEWAY_URL", "http://host.docker.internal:9091")
        if enabled_str != "true" or not push_gateway_url:
            return
        try:
            pushadd_to_gateway(push_gateway_url, job=job_name, grouping_key=group_key, registry=registry)
        except Exception as e:
            logging.error(f"Failed to push metric: {e}")
</code></pre>
<h3>The SQLAlchemy Trap</h3>
<p>We use <code>os.environ.get</code> instead of Airflow's <code>Variable.get()</code>. When the Scheduler fires a listener hook, it often does so outside an active database session. Calling <code>Variable.get()</code> there can throw a <code>DetachedInstanceError</code> and crash the scheduler loop. Reading an environment variable needs no database session, so it cannot fail this way.</p>
<h3>Fix 1 — Run Isolation via Grouping Key</h3>
<pre><code class="language-python">    def _get_task_group_key(self, ti):
        key = {
            'dag_id': ti.dag_id,
            'task_id': ti.task_id,
            'run_id': ti.run_id,        # Isolates each DAG run
            'instance': self.instance_name,
        }
        map_index = getattr(ti, 'map_index', -1)
        if map_index &gt;= 0:
            key['map_index'] = str(map_index)  # Isolates mapped tasks
        return key
</code></pre>
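<p>To see why adding <code>run_id</code> isolates each run, it helps to look at how a grouping key becomes the URL path that identifies a metric group in the Pushgateway (<code>/metrics/job/JOB/LABEL/VALUE/...</code>). The sketch below is illustrative only: <code>grouping_key_to_path</code> and the job name <code>airflow_task_metrics</code> are hypothetical, and <code>prometheus_client</code> performs its own escaping internally when you call its push functions.</p>

```python
import base64
from urllib.parse import quote


def grouping_key_to_path(job, grouping_key):
    """Build the Pushgateway path that identifies one metric group.

    Assumes the job name itself contains no "/" character.
    """
    segments = ["metrics", "job", quote(job, safe="")]
    for label, value in grouping_key.items():
        value = str(value)
        if "/" in value:
            # Pushgateway convention: a value containing "/" is sent as
            # URL-safe base64 under the label name with an @base64 suffix.
            encoded = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")
            segments += [f"{label}@base64", encoded]
        else:
            segments += [label, quote(value, safe="")]
    return "/" + "/".join(segments)


key = {"dag_id": "daily_etl", "task_id": "load_users", "run_id": "manual/2026-01-01"}
print(grouping_key_to_path("airflow_task_metrics", key))
```

<p>Two runs with different <code>run_id</code> values produce two different paths, so they occupy two separate groups. Drop <code>run_id</code> from the key and both runs collapse onto the same path.</p>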
<h3>Fix 2 — Gauges for Semantic Correctness</h3>
<p>Gauges represent state as an absolute value. On retry, the Gauge overwrites from -1 to 1 — only the final state is visible:</p>
<pre><code class="language-python">    @hookimpl
    def on_task_instance_success(self, previous_state: TaskInstanceState, task_instance: TaskInstance):
        registry = CollectorRegistry()
        group_key = self._get_task_group_key(task_instance)
        
        g = Gauge('task_status', 'Task state snapshot (1=success, -1=failed)', registry=registry)
        g.set(1)  # 1 = success (overwrites any previous -1 from a failed attempt)
        
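        # _push expects (registry, job_name, group_key); the job name here is illustrative
        self._push(registry, 'airflow_task_status', group_key)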
</code></pre>
<h2>Step 3: The Sweeper DAG — Preventing OOM Crashes</h2>
<p>Pushgateway has no native TTL. Every <code>run_id</code> creates a metric group held in memory forever. The Sweeper runs every 24 hours and deletes anything older than the threshold:</p>
<img src="airflow_sweeper.png" alt="Pushgateway Sweeper DAG" style="display:block;margin:0 auto" />

<pre><code class="language-python"># airflow/dags/pushgateway_sweeper.py
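from datetime import datetime, timezone
from urllib.parse import quote

import requests
from airflow.decorators import dag, task

@task
def clean_stale_metrics(max_age_mins: int):
    gateway_url = "http://host.docker.internal:9091"
    # Fail fast instead of hanging the task if the gateway is unreachable
    response = requests.get(f"{gateway_url}/api/v1/metrics", timeout=10)
    response.raise_for_status()
    data = response.json()
    now = datetime.now(timezone.utc).timestamp()
    max_age_secs = max_age_mins * 60
    
    for group in data.get('data', []):
        push_time = group.get('push_time_seconds', {}).get('metrics', [{}])[0].get('value')
        if push_time is None:
            continue  # No push timestamp reported; skip rather than crash on float(None)
        if now - float(push_time) &gt; max_age_secs:
            labels = dict(group.get('labels', {}))
            delete_url = f"{gateway_url}/metrics/job/{quote(labels.pop('job', 'unknown_job'), safe='')}"
            for k, v in labels.items():
                if v:
                    # Percent-encode values such as run_id; a value containing "/" would need the @base64 form
                    delete_url += f"/{k}/{quote(str(v), safe='')}"
            requests.delete(delete_url, timeout=10)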

@dag(schedule="0 0 * * *", start_date=datetime(2026, 1, 1), catchup=False)
def pushgateway_sweeper():
    clean_stale_metrics(max_age_mins=1440)

sweeper_dag = pushgateway_sweeper()
</code></pre>
<p><strong>The Sweeper is mandatory for V2 deployments with</strong> <code>run_id</code> <strong>in the grouping key.</strong></p>
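<p>A rough back-of-the-envelope estimate shows why. The sketch below counts the metric groups the Pushgateway holds when <code>run_id</code> is part of the grouping key. The function name and the workload figures are hypothetical, and the estimate ignores mapped tasks and manual runs.</p>

```python
def pushgateway_groups(dags, tasks_per_dag, runs_per_day, retention_days):
    """Estimate metric groups held in memory when run_id is in the grouping key."""
    groups_per_day = dags * tasks_per_dag * runs_per_day
    return groups_per_day * retention_days


# Hypothetical workload: 50 DAGs, 10 tasks each, 24 runs per DAG per day.
with_daily_sweep = pushgateway_groups(50, 10, 24, retention_days=1)
without_sweep_30d = pushgateway_groups(50, 10, 24, retention_days=30)
print(with_daily_sweep)   # 12000
print(without_sweep_30d)  # 360000
```

<p>With a daily sweep and a 24-hour threshold, the gateway holds between one and two days of groups at any moment. Without a sweep, the count keeps growing for as long as the process stays up.</p>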
<h2>Step 4: V3 — The Production-Grade Plugin</h2>
<p>V2 works for small systems, but at scale (thousands+ runs/day), <code>run_id</code> in the grouping key causes severe series churn. V3 removes <code>run_id</code> entirely, giving you bounded cardinality with no Sweeper:</p>
<pre><code class="language-python"># airflow/plugins/v3_low_cardinality_plugin.py
class PushgatewayV3LowCardinalityListeners:
    def _get_task_group_key(self, ti):
        """LOW CARDINALITY: No run_id. Each (dag, task) pair has exactly one slot.
        Latest execution overwrites previous — shows current state only."""
        return {
            'dag_id': ti.dag_id,
            'task_id': ti.task_id,
            'instance': self.instance_name,
        }
</code></pre>
<p>The key difference: V3's grouping key has <strong>no</strong> <code>run_id</code>. Each task has exactly one slot in the Pushgateway. The latest execution overwrites the previous one. This means:</p>
<ul>
<li><p>✅ Bounded cardinality — no series churn, no Sweeper needed</p>
</li>
<li><p>✅ Retry safe — Gauge overwrites reflect final state</p>
</li>
<li><p>✅ Fast dashboards — Prometheus queries scan a small, fixed set of series</p>
</li>
<li><p>❌ No per-run history — only the latest execution is visible</p>
</li>
</ul>
<p><strong>For per-run history</strong>, emit structured JSON logs to an OLAP engine or log aggregator. This is covered in Part 5.</p>
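<p>As a preview of the idea, here is a minimal sketch of a per-run record written as one JSON line. It is an assumption-laden illustration, not the Part 5 implementation: the logger name <code>batch.audit</code>, the function <code>log_run_record</code>, and the field names are all hypothetical.</p>

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("batch.audit")


def log_run_record(dag_id, task_id, run_id, state, rows_processed):
    """Emit one per-run record as a single JSON line and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,  # fine here: log stores and OLAP engines handle high cardinality
        "state": state,
        "rows_processed": rows_processed,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

<p>Because each record is a self-contained line, a log shipper can forward it to Loki or an OLAP sink without any change to the Prometheus label set.</p>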
<h3>The Plugin Evolution</h3>
<table>
<thead>
<tr>
<th>Plugin</th>
<th><code>run_id</code></th>
<th>Cardinality</th>
<th>Sweeper</th>
<th>Use Case</th>
</tr>
</thead>
<tbody><tr>
<td>V1</td>
<td>No</td>
<td>Low</td>
<td>No</td>
<td>❌ Anti-pattern (Counters + race condition)</td>
</tr>
<tr>
<td>V2</td>
<td><strong>Yes</strong></td>
<td>High</td>
<td><strong>Yes</strong></td>
<td>⚠️ Small systems only</td>
</tr>
<tr>
<td>V3</td>
<td>No</td>
<td>Low</td>
<td>No</td>
<td>✅ Production-grade</td>
</tr>
</tbody></table>
<h3>The V3 Production Dashboard</h3>
<p>We've added a dedicated dashboard for the V3 plugin: <code>airflow_v3_low_cardinality.json</code>.</p>
<img src="grafana_v3.png" alt="Grafana V3 Low Cardinality Dashboard" style="display:block;margin:0 auto" />

<p>Unlike the V2 dashboard, this one:</p>
<ul>
<li><p><strong>Removes the</strong> <code>run_id</code> <strong>filter:</strong> Since metrics are no longer partitioned by run, the dashboard shows the global fleet state.</p>
</li>
<li><p><strong>Focuses on "Latest State":</strong> The panels show the status of the most recent execution of every task.</p>
</li>
<li><p><strong>Improves performance:</strong> Because the number of series is bounded, the dashboard stays fast even with thousands of historic runs in Prometheus.</p>
</li>
</ul>
<h2>Key Takeaways</h2>
<ul>
<li><p>Use <code>os.environ.get()</code> in Airflow listener plugins, never <code>Variable.get()</code>.</p>
</li>
<li><p>V1 had <strong>two</strong> problems: missing <code>run_id</code> (race condition) AND Counters (semantic mismatch).</p>
</li>
<li><p>V2 fixes both but creates high cardinality — acceptable for small systems, not production-grade at scale.</p>
</li>
<li><p><strong>V3 is the production recommendation</strong> — removes <code>run_id</code>, bounded cardinality, no Sweeper.</p>
</li>
<li><p><strong>🚨 Golden Rule:</strong> NEVER inject <code>run_id</code> into Prometheus at scale. Per-run data belongs in OLAP or structured logs.</p>
</li>
</ul>
<h2>References</h2>
<ul>
<li><p><a href="https://github.com/TheStaffBlueprint/batch-workloads-observability">TheStaffBlueprint/batch-workloads-observability (Companion Repo)</a></p>
</li>
<li><p><a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/plugins.html#listeners">Apache Airflow Listener Plugin Documentation</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Metric Granularity for Batch Workloads]]></title><description><![CDATA[The Question This Post Answers
Part 2 established the architecture: StatsD for counts, Pushgateway for state snapshots. But two critical questions remain:

What level of granularity should you achieve]]></description><link>https://blog.thestaffblueprint.com/metric-granularity-for-batch-workloads</link><guid isPermaLink="true">https://blog.thestaffblueprint.com/metric-granularity-for-batch-workloads</guid><category><![CDATA[#prometheus]]></category><category><![CDATA[observability]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[metrics]]></category><dc:creator><![CDATA[Chirag Bhatia]]></dc:creator><pubDate>Wed, 20 May 2026 15:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/ff081dac-cf0c-469c-8a56-e14c4249b878.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Question This Post Answers</h2>
<p>Part 2 established the architecture: StatsD for counts, Pushgateway for state snapshots. But two critical questions remain:</p>
<ol>
<li><p><strong>What level of granularity</strong> should you achieve with each tool?</p>
</li>
<li><p><strong>Why Gauges over Counters</strong> — what does it actually fix, and what doesn't it fix?</p>
</li>
</ol>
<h2>The Three Levels of Observability Granularity</h2>
<h3>Level 1: Aggregate Operational Metrics</h3>
<p><strong>Tool:</strong> StatsD → Prometheus | <strong>Cardinality:</strong> Low</p>
<p>These are your "system health" metrics: <code>airflow_ti_finish_total</code>, <code>airflow_dag_duration_seconds</code>, pool utilisation. Granularity is per-DAG, per-task-name, per-state. No <code>run_id</code>. StatsD natively aggregates thousands of UDP bursts into single metrics. These power your SLO alerts and operational dashboards.</p>
<h3>Level 2: Task State Snapshots</h3>
<p><strong>Tool:</strong> Pushgateway (Gauges) → Prometheus | <strong>Cardinality:</strong> Depends on approach</p>
<p>These answer "what is the current state of task X?": <code>airflow_task_instance_status</code>, <code>airflow_task_instance_duration_seconds</code>, <code>airflow_task_instance_retries</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/a46b9042-7865-4e91-90c2-6075ee9f4495.png" alt="Prometheus Metrics UI" style="display:block;margin:0 auto" />

<p>There are two approaches, and the right one depends on your scale:</p>
<p><strong>V3 (Production — Low Cardinality):</strong> Grouping key uses only <code>dag_id</code> + <code>task_id</code> + <code>instance</code>. Each task has exactly ONE slot. Latest execution overwrites previous — shows current state only. Cardinality is bounded by the number of unique (dag_id, task_id) pairs. No Sweeper needed.</p>
<p><strong>V2 (Stepping Stone — High Cardinality):</strong> Grouping key includes <code>run_id</code>. Each execution gets its own slot. You can see the state of every individual run. But cardinality grows linearly with executions and requires a Sweeper DAG. <strong>Acceptable for small systems (≤ hundreds of runs/day). Not production-grade at scale.</strong></p>
<p><strong>🚨 The Golden Rule:</strong> NEVER inject high-cardinality keys like <code>run_id</code> into Prometheus at scale. Doing so causes severe series churn, bloats the TSDB index, and destroys query performance. If you need per-run history, that's a Level 3 problem — use OLAP or structured logs.</p>
<p><strong>Critical constraint:</strong> Both V2 and V3 MUST use Gauges, not Counters. The reason is semantic, not performance. The Counter vs Gauge section below breaks this down.</p>
<h3>Level 3: Per-Run Execution History (Audit &amp; Traceability)</h3>
<p><strong>Tool:</strong> NOT Prometheus | <strong>Cardinality:</strong> Unbounded</p>
<p>This is your "what exactly happened inside the job" data: rows processed, data quality scores, error messages, data lineage, reconciliation records. Must be queryable weeks or months later.</p>
<p><strong>Why NOT Prometheus:</strong> Prometheus is optimised for aggregate monitoring with low-to-moderate cardinality and short retention. Per-execution audit data has unbounded cardinality and requires durable, long-term, exact-value storage. <strong>This includes per-</strong><code>run_id</code> <strong>tracking at scale</strong> — if your system processes thousands of runs per day, per-run data is an OLAP problem, not a metrics problem.</p>
<p><strong>Where it belongs:</strong></p>
<ul>
<li><p><strong>Structured execution records:</strong> Log aggregators like <strong>Grafana Loki</strong> or <strong>OpenSearch</strong> are perfect for indexing logs and stack traces without cardinality explosion.</p>
</li>
<li><p><strong>Row counts and data quality scores:</strong> OLAP databases like <strong>ClickHouse</strong>, <strong>BigQuery</strong>, or <strong>DuckDB</strong> are optimized for analytical queries over millions of high-cardinality execution logs.</p>
</li>
<li><p><strong>Real-time execution events:</strong> Streaming platforms like <strong>Kafka</strong> decouple execution events and route them safely to OLAP sinks or real-time alerting systems.</p>
</li>
<li><p><strong>Simple audit tables:</strong> Traditional relational databases like <strong>PostgreSQL</strong> are suitable for light transactional audit trails.</p>
</li>
</ul>
<p>We cover event-based audit implementation in Part 5.</p>
<h2>Counter vs Gauge: The Precise Technical Argument</h2>
<h3>What Most Guides Get Wrong</h3>
<p>Many guides claim:</p>
<ul>
<li><p>❌ Gauges "solve" cardinality problems that Counters create</p>
</li>
<li><p>❌ <code>count(gauge == -1)</code> is more efficient than <code>sum(counter)</code></p>
</li>
<li><p>❌ Sweeping Counters from Pushgateway "destroys history" but sweeping Gauges doesn't</p>
</li>
</ul>
<p><strong>None of these are true.</strong> If both use the same labels (including <code>run_id</code>), they produce identical cardinality, identical query cost, and identical behaviour after sweeping. Once Prometheus scrapes a metric, that data lives in the TSDB until retention expires, regardless of whether you delete it from Pushgateway.</p>
<h3>What Gauges Actually Fix</h3>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/53e0efc5-5107-45b8-bf27-7d8154422300.png" alt="Lab Comparison Dashboard" style="display:block;margin:0 auto" />

<h4>1. Retry Safety (The Strongest Argument)</h4>
<p>Task lifecycle: <code>running → failed → retried → success</code>.</p>
<p><strong>Gauge</strong> (<code>status = -1</code> then overwritten to <code>status = 1</code>): Dashboard query <code>count(status == -1)</code> correctly shows zero failures because the latest state is success.</p>
<p><strong>Counter</strong> (<code>failure_total++</code> then <code>success_total++</code>): Both increments persist. Dashboard shows a failure AND a success for the same task. Failure count is permanently inflated. No way to undo.</p>
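<p>The difference can be replayed in a few lines of plain Python. This is a simulation of the two semantics, not real <code>prometheus_client</code> code, and the function names are hypothetical.</p>

```python
def replay_with_counters(events):
    """Counters only go up: every terminal event leaves a permanent mark."""
    totals = {"failure_total": 0, "success_total": 0}
    for state in events:
        if state == "failed":
            totals["failure_total"] += 1
        elif state == "success":
            totals["success_total"] += 1
    return totals


def replay_with_gauge(events):
    """A Gauge is overwritten: only the most recent terminal state remains."""
    status = None
    for state in events:
        if state == "failed":
            status = -1
        elif state == "success":
            status = 1
    return status


attempts = ["failed", "success"]  # first attempt fails, the retry succeeds
print(replay_with_counters(attempts))  # {'failure_total': 1, 'success_total': 1}
print(replay_with_gauge(attempts))     # 1
```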
<h4>2. Natural State Modelling</h4>
<p>A Gauge maps directly to task lifecycle:</p>
<pre><code class="language-plaintext">task_state = 0   →  running
task_state = -1  →  failed
task_state = 1   →  success
task_state = 2   →  skipped
</code></pre>
<p>Counters only go up. You'd need separate counters per state with no way to determine a task's <strong>final</strong> state.</p>
<h4>3. Counter Reset Semantics</h4>
<p>Counters expect long-lived processes whose values increase monotonically. A short-lived task starts its Counter at 0 on every execution. <code>rate()</code> and <code>increase()</code> treat any drop as a reset and try to compensate, which produces unpredictable results for ephemeral tasks.</p>
<h3>What Gauges Do NOT Fix</h3>
<p>While Gauges offer semantic improvements, they do not resolve scale and storage issues. Here is a breakdown of what Gauges do and do not fix compared to Counters:</p>
<ul>
<li><p><strong>What they do NOT fix:</strong></p>
<ul>
<li><p><strong>Cardinality explosion from</strong> <code>run_id</code> <strong>labels:</strong> Both Counters and Gauges produce identical cardinality.</p>
</li>
<li><p><strong>Pushgateway OOM without Sweeper:</strong> Both will exhaust memory identically if not cleaned up.</p>
</li>
<li><p><strong>Prometheus series churn:</strong> Every new <code>run_id</code> creates a new series and leaves a stale one behind in the TSDB index, whichever metric type you use.</p>
</li>
<li><p><strong>Query performance (</strong><code>count</code> <strong>vs</strong> <code>sum</code><strong>):</strong> Both require scanning the exact same number of active series.</p>
</li>
</ul>
</li>
<li><p><strong>What they DO fix:</strong></p>
<ul>
<li><p><strong>Task state modelling:</strong> Overwriting values cleanly aligns with a discrete lifecycle.</p>
</li>
<li><p><strong>Retry correctness:</strong> Overwriting errors with success ensures accurate final counts.</p>
</li>
<li><p><strong>Counter reset semantics:</strong> Gauges bypass unpredictable <code>rate()</code> calculations for short-lived tasks.</p>
</li>
</ul>
</li>
</ul>
<p><strong>Bottom line:</strong> Gauges fix what your data <em>means</em>. They do not fix how much data you <em>produce</em>.</p>
<h2>The Complete Decision Matrix</h2>
<p>To choose the right monitoring pattern, map your core questions to the correct tool and ingestion plugin:</p>
<ul>
<li><p><strong>How many tasks failed today?</strong></p>
<ul>
<li><p><em>Tool:</em> StatsD → Prometheus</p>
</li>
<li><p><em>Plugin:</em> Native Airflow integration (safe, aggregated UDP)</p>
</li>
</ul>
</li>
<li><p><strong>What is the average DAG duration?</strong></p>
<ul>
<li><p><em>Tool:</em> StatsD → Prometheus</p>
</li>
<li><p><em>Plugin:</em> Native Airflow integration</p>
</li>
</ul>
</li>
<li><p><strong>What is the latest state of task X?</strong></p>
<ul>
<li><p><em>Tool:</em> Pushgateway (Gauge) → Prometheus</p>
</li>
<li><p><em>Plugin:</em> <strong>V3</strong> plugin (production-grade, low-cardinality)</p>
</li>
</ul>
</li>
<li><p><strong>Did task X in run Y succeed? (at small scale)</strong></p>
<ul>
<li><p><em>Tool:</em> Pushgateway (Gauge) → Prometheus</p>
</li>
<li><p><em>Plugin:</em> V2 plugin (stepping stone, uses <code>run_id</code>)</p>
</li>
</ul>
</li>
<li><p><strong>Did task X in run Y succeed? (at production scale)</strong></p>
<ul>
<li><p><em>Tool:</em> <strong>NOT Prometheus</strong> (Grafana Loki or an OLAP backend)</p>
</li>
<li><p><em>Plugin/Method:</em> Structured JSON logs</p>
</li>
</ul>
</li>
<li><p><strong>How many rows did run Y process?</strong></p>
<ul>
<li><p><em>Tool:</em> <strong>NOT Prometheus</strong> (Loki, ClickHouse, or BigQuery)</p>
</li>
<li><p><em>Plugin/Method:</em> Structured logs</p>
</li>
</ul>
</li>
<li><p><strong>What upstream sources did run Y read?</strong></p>
<ul>
<li><p><em>Tool:</em> <strong>NOT Prometheus</strong> (OpenLineage and OLAP engines)</p>
</li>
<li><p><em>Plugin/Method:</em> Dedicated event pipeline</p>
</li>
</ul>
</li>
</ul>
<h2>Key Takeaways</h2>
<ul>
<li><p><strong>Three levels of granularity</strong> — aggregate (StatsD), task state (Pushgateway Gauges), per-run audit (NOT Prometheus).</p>
</li>
<li><p><strong>Gauges fix semantics, not scale.</strong> Retry safety and state modelling are the real arguments. Cardinality and query cost are identical for both metric types.</p>
</li>
<li><p><strong>Never inject</strong> <code>run_id</code> <strong>into Prometheus at scale.</strong> Use V3 (low cardinality) for production dashboards. Use OLAP/Loki for per-run history.</p>
</li>
<li><p><strong>Prometheus is not an audit store.</strong> Per-execution data belongs in Loki, OLAP, or event streams.</p>
</li>
</ul>
<h2>References</h2>
<ul>
<li><p><a href="https://github.com/TheStaffBlueprint/batch-workloads-observability">TheStaffBlueprint/batch-workloads-observability (Companion Repo)</a></p>
</li>
<li><p><a href="https://prometheus.io/docs/concepts/metric_types/">Prometheus Data Model: Metric Types</a></p>
</li>
<li><p><a href="https://prometheus.io/docs/practices/pushing/">Prometheus Official: When to use the Pushgateway</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Architecture — Prometheus, Grafana, and StatsD for Batch Workloads]]></title><description><![CDATA[(This is Part 2 of the Batch Workloads Observability series. Read Part 1: The Green Tick Fallacy first for context.)


The Default Choice: Prometheus
When engineering teams realise they need to extrac]]></description><link>https://blog.thestaffblueprint.com/the-architecture-prometheus-grafana-and-statsd-for-batch-workloads</link><guid isPermaLink="true">https://blog.thestaffblueprint.com/the-architecture-prometheus-grafana-and-statsd-for-batch-workloads</guid><dc:creator><![CDATA[Chirag Bhatia]]></dc:creator><pubDate>Sun, 17 May 2026 12:25:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/71541cae-e099-4373-a3f4-1d84ec08af9b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>(This is Part 2 of the Batch Workloads Observability series. Read</em> <a href="https://thestaffblueprint.substack.com/p/the-green-tick-fallacy-why-batch-observability-is-fundamentally-different"><em>Part 1: The Green Tick Fallacy</em></a> <em>first for context.)</em></p>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/d7568341-bcac-49db-816a-cdd9f8c3715f.png" alt="" style="display:block;margin:0 auto" />

<h2>The Default Choice: Prometheus</h2>
<p>When engineering teams realise they need to extract internal application metrics from their batch pipelines (to escape the "Green Tick Fallacy"), they inevitably reach for Prometheus.</p>
<p>It makes perfect sense. Prometheus is the industry standard. Grafana integrates beautifully with it. PromQL is incredibly powerful. Your infrastructure team already has it running. <em>(If you want a detailed breakdown of each component and how the standard Prometheus architecture works under the hood, check out</em> <a href="https://devopscube.com/prometheus-architecture/"><em>this comprehensive guide by DevOpsCube</em></a><em>.)</em></p>
<p>But there is a fundamental mismatch: <strong>Prometheus is a pull-based system.</strong> It expects to scrape a <code>/metrics</code> HTTP endpoint exposed by a continuously running service.</p>
<p>Batch jobs, however, are ephemeral. A Spark job might spin up, process terabytes of data in 45 seconds, and vanish. With a typical scrape interval of 15 to 30 seconds, Prometheus catches a mid-run sample at best, and the final values disappear with the container. You cannot rely on scraping batch jobs; you must <strong>push</strong> telemetry out before the container dies.</p>
<h2>The Push Problem &amp; The Proxy</h2>
<p>To solve the push-vs-pull mismatch, the standard architecture introduces the <strong>Prometheus Pushgateway</strong>.</p>
<p>The idea is simple: it acts as a middleman. Your ephemeral batch job pushes its metrics to the Pushgateway via HTTP just before exiting. The Pushgateway caches those metrics in memory. Prometheus then continuously scrapes the Pushgateway at its own pace.</p>
<p>Problem solved, right? Not quite. This is where most batch observability architectures begin to rot. To understand why, we have to talk about dimensions.</p>
<h2>Understanding Dimensions and Cardinality</h2>
<p>In Prometheus, data isn't just stored as a flat list of numbers. It's stored as <strong>Time Series</strong>, defined by a metric name and a set of key-value pairs called labels (or dimensions).</p>
<p>For example, a task failure metric might look like this: <code>airflow_task_status{dag_id="daily_etl", task_id="load_users", status="failed"}</code></p>
<p>Every unique combination of labels creates a brand new time series in the database. The total number of unique time series is called <strong>cardinality</strong>.</p>
<ul>
<li><p><strong>Low-Cardinality Labels:</strong> <code>dag_id</code> (maybe 50 total), <code>task_id</code> (maybe 200 total), <code>status</code> (success, failed, running). These are bounded. They are safe.</p>
</li>
<li><p><strong>High-Cardinality Labels:</strong> <code>run_id</code> (a unique identifier for every single execution), <code>user_id</code>, or <code>error_message</code>. These are unbounded. They grow without limit.</p>
</li>
</ul>
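<p>The effect of one unbounded label is easy to quantify. The sketch below computes an upper bound on the number of time series from the distinct values of each label. The function name and the label counts are hypothetical, and real deployments usually see fewer series because not every combination occurs.</p>

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series: one per combination of label values."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "dag_id": [f"dag_{i}" for i in range(50)],
    "status": ["success", "failed", "running"],
}
print(max_series(bounded))  # 150

# Adding run_id multiplies the bound by the number of runs ever recorded.
with_runs = dict(bounded, run_id=[f"run_{i}" for i in range(1000)])
print(max_series(with_runs))  # 150000
```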
<p>When monitoring a batch workload, you naturally want to know: <em>Which DAG? Which task? Which specific run? How many rows did it process?</em> So, engineers intuitively add a <code>run_id</code> label to their Pushgateway metrics.</p>
<h2>The Dumb Proxy and The Race Condition</h2>
<p>If you <em>don't</em> use a unique label like <code>run_id</code>, you hit an immediate wall.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/a7ac83aa-2c9e-4079-bdeb-aa0e00e9523b.png" alt="" style="display:block;margin:0 auto" />

<p>The Pushgateway is essentially a dumb proxy. It doesn't aggregate metrics, it doesn't add numbers together, and it doesn't deduplicate. It simply acts as a key-value store. The "key" is the combination of your labels (the grouping key).</p>
<p>If ten parallel Airflow tasks fail at the exact same millisecond, and they all push <code>failure_count = 1</code> to the Pushgateway without a <code>run_id</code>, they all write to the exact same key. The Pushgateway just overwrites the value ten times. Prometheus scrapes it once and sees: <code>failure_count = 1</code>. You just lost nine failure records.</p>
<p>This is a race condition. Parallel tasks overwrite each other because they share the same grouping key.</p>
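<p>The overwrite behaviour can be reproduced with a plain dictionary standing in for the Pushgateway. This is a simulation under simplified assumptions (one value per group, no HTTP), and <code>push_all</code> is a hypothetical helper.</p>

```python
def push_all(pushes, key_labels):
    """Simulate the Pushgateway: one slot per grouping key, last write wins."""
    store = {}
    for labels, value in pushes:
        key = tuple(labels[name] for name in key_labels)
        store[key] = value  # overwrite, never add
    return store


# Ten parallel runs of the same task each report one failure.
pushes = [
    ({"dag_id": "daily_etl", "task_id": "load_users", "run_id": f"run_{i}"}, 1)
    for i in range(10)
]

coarse = push_all(pushes, key_labels=("dag_id", "task_id"))
print(len(coarse), sum(coarse.values()))  # 1 1

isolated = push_all(pushes, key_labels=("dag_id", "task_id", "run_id"))
print(len(isolated), sum(isolated.values()))  # 10 10
```

<p>With the coarse key, ten pushes land in one slot and nine failures vanish. With a per-run key, every push keeps its own slot, which is exactly what makes the number of slots grow with every run.</p>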
<h2>The Fatal Fix: Fighting Prometheus</h2>
<p>The "obvious" fix is to add <code>run_id</code> to the grouping key. Because every run has a unique ID, every task gets its own isolated slot in the Pushgateway. The race condition is solved!</p>
<p>But you've just created a ticking time bomb.</p>
<p>By adding <code>run_id</code>, you have introduced unbounded cardinality. Every single task execution creates a brand new time series. Furthermore, the Pushgateway has <strong>no native TTL</strong> (Time To Live). It holds every metric group in memory forever.</p>
<p>Within a few days of high-volume DAG runs, the Pushgateway's memory balloons until the container crashes with an OOM (Out of Memory) error. Well before that point, every Prometheus scrape has to pull the entire bloated payload, so scrapes slow down and start timing out.</p>
<p>This happens because teams misunderstand what Prometheus is. <strong>Prometheus is designed for low-cardinality, continuous metrics.</strong> It is not designed to store high-cardinality, event-based data. When you push per-run execution data into Prometheus, you are treating a time-series database like an event log.</p>
<h2>The Realization: Two Different Problems</h2>
<p>The breakthrough happens when you realize that "batch observability" isn't a single problem. You are actually trying to answer two completely different questions:</p>
<ol>
<li><p><strong>Operational Metrics:</strong> <em>"Is the system healthy right now? What is the overall failure rate?"</em> (Needs aggregation, low-cardinality).</p>
</li>
<li><p><strong>State Snapshots:</strong> <em>"What is the current status of this specific task run?"</em> (Needs per-run isolation, high-cardinality).</p>
</li>
</ol>
<p>Trying to force both of these through the Pushgateway is the core architectural mistake. We need two different data paths.</p>
<h2>The Architecture Stack</h2>
<p>Here is the robust, Staff-level architecture we use to solve this cleanly:</p>
<ul>
<li><p><strong>Prometheus:</strong> Time-series database — stores aggregated operational metrics and temporary task state snapshots.</p>
</li>
<li><p><strong>Grafana:</strong> Dashboarding and alerting — visualizes metrics from Prometheus.</p>
</li>
<li><p><strong>StatsD Exporter:</strong> <strong>The missing link</strong> — catches high-frequency UDP bursts, aggregates them, and safely exposes them to Prometheus.</p>
</li>
<li><p><strong>Pushgateway:</strong> Push-based ingestion — restricted <em>only</em> to temporary state snapshots.</p>
</li>
<li><p><strong>Sweeper DAG:</strong> Automated cleanup — deletes stale Pushgateway metric groups to prevent OOM crashes.</p>
</li>
</ul>
<h3>Path 1: Operational Metrics via StatsD</h3>
<pre><code class="language-text">Airflow Task → UDP packet → StatsD Exporter → Prometheus scrape → Grafana
</code></pre>
<p>StatsD solves the race condition by aggregating before Prometheus ever sees the data. The StatsD Exporter listens over UDP. When 10 tasks fire <code>airflow.task.failed:1|c</code> at the exact same millisecond, the exporter receives all 10 packets, adds each one to an in-memory counter, and exposes a single aggregated metric (<code>failure_count = 10</code>) for Prometheus to scrape.</p>
<p>No unique identifiers (<code>run_id</code>) are needed. No race conditions. Zero cardinality explosion. Best of all, Airflow has StatsD support built into its core—you just have to enable it.</p>
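<p>To make the wire format concrete, here is a minimal sketch of what a counter increment looks like and how an aggregator sums it. Airflow emits these packets for you once StatsD is enabled, so this is illustration only. The helper names are hypothetical, <code>aggregate</code> is a toy stand-in for the exporter, and the default port of 9125 is an assumption about your exporter setup.</p>
<pre><code class="language-python">import socket


def statsd_counter_line(name, value=1):
    """Format a StatsD counter increment, e.g. 'airflow.task.failed:1|c'."""
    return f"{name}:{value}|c"


def send_counter(name, host="127.0.0.1", port=9125, value=1):
    """Fire-and-forget UDP send; the task never waits for a reply."""
    payload = statsd_counter_line(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


def aggregate(lines):
    """Toy aggregator: keep a running total per counter name."""
    totals = {}
    for line in lines:
        name, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":
            totals[name] = totals.get(name, 0) + int(value)
    return totals


lines = [statsd_counter_line("airflow.task.failed") for _ in range(10)]
print(aggregate(lines))  # {'airflow.task.failed': 10}
</code></pre>
<p>Because every packet is an increment rather than a "set this value" instruction, ten senders produce a total of ten no matter how their packets interleave.</p>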
<h3>Path 2: Task State Snapshots via Pushgateway</h3>
<pre><code class="language-text">Airflow Plugin → HTTP POST → Pushgateway → Prometheus scrape → Grafana
</code></pre>
<p>StatsD is perfect for counts, but it can't answer: <em>"What is the current state of task X in run Y?"</em></p>
<p>For per-task state visibility at a small-to-medium scale, we <em>do</em> use the Pushgateway, and we <em>do</em> use <code>run_id</code> as a grouping key. However, we apply strict lifecycle management. We deploy an Airflow <strong>Sweeper DAG</strong> that runs on a schedule, queries the Pushgateway REST API, and deletes metric groups older than a configured threshold.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/35eaf282-a39c-4eca-a505-15989729db77.png" alt="" style="display:block;margin:0 auto" />

<p>This prevents OOM crashes while giving us exactly enough runway to monitor active DAGs.</p>
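<p>The core of the sweeper can be sketched as two small functions: one that picks the stale groups, and one that builds the path for the Pushgateway's <code>DELETE /metrics/job/...</code> endpoint. This is a simplified sketch, not the companion repo's implementation. The function names and the input shape (a list of label and timestamp pairs) are assumptions, and fetching that list from the Pushgateway is left out.</p>
<pre><code class="language-python">import time
from urllib.parse import quote


def stale_groups(groups, max_age_seconds, now=None):
    """Return the grouping keys whose last push is older than the threshold.

    `groups` is a list of (labels, last_push_epoch_seconds) pairs.
    """
    now = time.time() if now is None else now
    return [labels for labels, pushed_at in groups if now - pushed_at > max_age_seconds]


def delete_path(labels):
    """Build the URL path that identifies one metric group for deletion.

    Label values containing '/' need the Pushgateway's base64 form,
    which this sketch does not handle.
    """
    parts = ["/metrics/job", quote(labels["job"], safe="")]
    for name, value in sorted(labels.items()):
        if name != "job":
            parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


old = {"job": "airflow", "dag_id": "daily_etl", "run_id": "run_1"}
new = {"job": "airflow", "dag_id": "daily_etl", "run_id": "run_2"}
groups = [(old, 1000), (new, 9500)]

for labels in stale_groups(groups, max_age_seconds=3600, now=10000):
    print(delete_path(labels))  # /metrics/job/airflow/dag_id/daily_etl/run_id/run_1
</code></pre>
<p>Keeping the selection logic free of network calls makes the threshold easy to unit test. The actual delete can then be a plain HTTP <code>DELETE</code> against that path, or a call to <code>delete_from_gateway</code> from the <code>prometheus_client</code> library.</p>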
<p><strong>The Architectural Boundary:</strong> It is critical to understand that this Pushgateway approach is a stepping stone. At a truly massive scale (tens of thousands of tasks per hour), Prometheus ingestion will still choke on the cardinality, and the Sweeper DAG itself becomes a bottleneck. For large-scale environments, you must abandon Prometheus for per-run tracking entirely and push task events to a dedicated event log system (like Elasticsearch, Grafana Loki, or ClickHouse).</p>
<h2>Key Takeaways</h2>
<ul>
<li><p><strong>Prometheus is not an event log:</strong> It is built for low-cardinality, continuous metrics. Do not use it for durable, per-run audit trails.</p>
</li>
<li><p><strong>The Pushgateway is a dumb proxy:</strong> It stores only the last value pushed for each grouping key. Pushing concurrent metrics without unique keys causes overwrites. Adding unique keys without cleanup causes OOM crashes.</p>
</li>
<li><p><strong>StatsD is the missing link:</strong> The StatsD Exporter aggregates concurrent events in memory before Prometheus scrapes them, which removes both the overwrite race and the cardinality bloat for operational metrics.</p>
</li>
<li><p><strong>Scale dictates your state architecture:</strong> Use the Pushgateway + Sweeper DAG for medium-scale task state snapshots. For massive scale, move per-run state tracking entirely to an event logging system like Elasticsearch or Loki.</p>
</li>
</ul>
<h2>References</h2>
<ul>
<li><p><a href="https://github.com/TheStaffBlueprint/batch-workloads-observability">TheStaffBlueprint/batch-workloads-observability (Companion Repo)</a></p>
</li>
<li><p><a href="https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html">Apache Airflow Metrics Configuration</a></p>
</li>
<li><p><a href="https://prometheus.io/docs/practices/pushing/">Prometheus Official: When to use the Pushgateway</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Green Tick Fallacy — Why Batch Observability is Fundamentally Different]]></title><description><![CDATA[The Green Tick Fallacy
There is a dangerous assumption that every junior data engineer makes: If the Airflow task turns green, the job was successful.
This is the "Green Tick Fallacy." When your Spark]]></description><link>https://blog.thestaffblueprint.com/the-green-tick-fallacy-why-batch-observability-is-fundamentally-different</link><guid isPermaLink="true">https://blog.thestaffblueprint.com/the-green-tick-fallacy-why-batch-observability-is-fundamentally-different</guid><category><![CDATA[Grafana]]></category><category><![CDATA[Grafana Monitoring]]></category><category><![CDATA[airflow]]></category><category><![CDATA[apache-airflow]]></category><category><![CDATA[observability]]></category><category><![CDATA[#prometheus]]></category><dc:creator><![CDATA[Chirag Bhatia]]></dc:creator><pubDate>Sat, 16 May 2026 18:34:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6471d940421f715ac07f9905/78c6620c-01b5-4c55-b4e6-716bd18d0675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Green Tick Fallacy</h2>
<p>There is a dangerous assumption that every junior data engineer makes: <em>If the Airflow task turns green, the job was successful.</em></p>
<p>This is the "Green Tick Fallacy." When your Spark job finishes, Airflow checks exactly one thing: did the container exit with status code <code>0</code>? It has absolutely no idea if your job processed 10 billion rows flawlessly, or if it processed 0 rows because an upstream partition was empty. It just knows the container didn't crash.</p>
<p>Relying on the green tick is how you get paged at 3 AM for silent data corruption. To build true batch workload observability, you have to extract internal application metrics — and doing that for batch workloads is fundamentally harder than for services.</p>
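<p>As a first illustration of the gap, here is a minimal guard that ties the exit code to something the job actually measured. It is only a sketch: <code>check_rows</code> and its <code>min_expected</code> threshold are hypothetical, and a guard like this complements pushed metrics rather than replacing them.</p>
<pre><code class="language-python">import sys


def check_rows(rows_processed, min_expected=1):
    """Return an exit code: 0 only when enough rows were actually processed."""
    if rows_processed >= min_expected:
        return 0
    print(
        f"only {rows_processed} rows processed, expected at least {min_expected}",
        file=sys.stderr,
    )
    return 1


# A job entry point would finish with: sys.exit(check_rows(rows_processed))
print(check_rows(10_000))  # 0
print(check_rows(0))       # 1
</code></pre>
<p>With this in place, an empty upstream partition turns the task red instead of green, but it still tells you nothing about how many rows a "successful" run handled. That is what the metrics are for.</p>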
<h2>Why Batch Observability is Hard</h2>
<p>Traditional microservice observability is straightforward. The service runs 24/7, exposing a <code>/metrics</code> HTTP endpoint. Prometheus scrapes it every 15 seconds. The process is always alive to respond.</p>
<p>Batch jobs are ephemeral. They spin up, chew through a terabyte of data in 45 seconds, and vanish. By the time Prometheus tries to scrape them, the process is already dead. You cannot scrape batch jobs — you must <strong>push</strong> telemetry from inside the code out to an aggregator before the container dies.</p>
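<p>One common way to guarantee the push happens before the process dies is a <code>try</code>/<code>finally</code> wrapper. The sketch below assumes a hypothetical <code>emit</code> callable that ships a dictionary of metrics to whatever aggregator you use; the function and field names are illustrative.</p>
<pre><code class="language-python">import time


def run_with_telemetry(job, emit):
    """Run `job` and always emit metrics before the process can exit.

    `job` returns the number of rows processed; `emit` is any callable
    that ships a dict of metrics to an aggregator (hypothetical here).
    """
    started = time.monotonic()
    metrics = {"status": "failed", "rows_processed": 0}
    try:
        metrics["rows_processed"] = job()
        metrics["status"] = "success"
    finally:
        # Runs on success and on exception, so a crash still reports itself.
        metrics["duration_seconds"] = time.monotonic() - started
        emit(metrics)
    return metrics


sent = []
run_with_telemetry(lambda: 42, sent.append)
print(sent[0]["status"], sent[0]["rows_processed"])  # success 42
</code></pre>
<p>The exception still propagates after <code>emit</code> runs, so Airflow marks the task as failed while the aggregator receives the failure metrics.</p>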
<p>This creates a fundamentally different architectural challenge. In the world of services, your observability tool pulls data. In the world of batch, your job pushes data. And the tools designed for pulling don't work cleanly for pushing.</p>
<h2>What Observability Actually Means for Batch</h2>
<p>Before diving into tools, it's worth defining what "observability" actually means for batch workloads. There are three distinct categories of data you need, and each demands a different architectural approach:</p>
<h3>1. Operational Metrics</h3>
<p><em>"Is the system healthy right now?"</em></p>
<ul>
<li><p>How many DAGs ran today?</p>
</li>
<li><p>What's the average task duration?</p>
</li>
<li><p>How many tasks failed in the last hour?</p>
</li>
</ul>
<p>These are <strong>low-cardinality, aggregate</strong> numbers. You don't need per-run-id granularity. You need rates, counts, and histograms. These are the bread and butter of Prometheus.</p>
<h3>2. Task State Snapshots</h3>
<p><em>"What is the current state of this specific task?"</em></p>
<ul>
<li><p>Is task <code>load_customers</code> in the <code>daily_etl</code> DAG currently running, failed, or succeeded?</p>
</li>
<li><p>What was the duration of this specific execution?</p>
</li>
<li><p>Did the task retry, and what is its final state?</p>
</li>
</ul>
<p>These are <strong>point-in-time state snapshots</strong> with moderate-to-high cardinality. Each task execution has a unique identity (<code>run_id</code>), and the state may change over the lifecycle (running → failed → retried → success). These can live in Prometheus temporarily, but require careful lifecycle management.</p>
<h3>3. Execution History &amp; Audit</h3>
<p><em>"What exactly happened in run XYZ?"</em></p>
<ul>
<li><p>How many rows did <code>run_id=abc123</code> process?</p>
</li>
<li><p>What was the data quality score for this specific schema version?</p>
</li>
<li><p>What was the exact error message and stack trace?</p>
</li>
</ul>
<p>This is <strong>high-cardinality, per-execution, durable data</strong>. It must be queryable weeks or months later for debugging, reconciliation, and compliance. This data does <strong>not</strong> belong in Prometheus.</p>
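<p>For a sense of what such a record might look like, here is a sketch of a per-run audit event serialized as one JSON line. The field names and the <code>audit_event</code> helper are assumptions for illustration, not a schema from the companion repo.</p>
<pre><code class="language-python">import json


def audit_event(run_id, dag_id, task_id, rows_processed, error=None):
    """Serialize one task execution as a single JSON line for an event store."""
    record = {
        "run_id": run_id,
        "dag_id": dag_id,
        "task_id": task_id,
        "rows_processed": rows_processed,
        "error": error,
    }
    return json.dumps(record, sort_keys=True)


print(audit_event("abc123", "daily_etl", "load_customers", 0, error="upstream partition empty"))
</code></pre>
<p>Every field here is either unique per run or free-form text, which is exactly the kind of data a log or OLAP store handles well and a time-series database does not.</p>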
<h2>The Trap: Forcing Everything into One Tool</h2>
<p>The mistake most teams make is trying to force all three categories into a single observability system — usually Prometheus via the Pushgateway. This leads to:</p>
<ul>
<li><p><strong>Pushgateway abuse</strong>: Pushing per-<code>run_id</code> metrics to Prometheus via Pushgateway, creating unbounded cardinality</p>
</li>
<li><p><strong>OOM crashes</strong>: Pushgateway has no native TTL, so dynamically labelled metrics accumulate in memory forever</p>
</li>
<li><p><strong>Semantic mismatches</strong>: Using the wrong metric type (Counters where Gauges belong), leading to inflated or incorrect dashboard numbers when tasks retry</p>
</li>
</ul>
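<p>The semantic mismatch is easiest to see with a retry. The sketch below models a single task that fails twice and then succeeds. <code>counter_view</code> and <code>gauge_view</code> are hypothetical helpers that mimic what each metric type would report; they are not Prometheus client calls.</p>
<pre><code class="language-python">def counter_view(attempts):
    """Count every failed attempt, as a Counter incremented per failure would."""
    return sum(1 for state in attempts if state == "failed")


def gauge_view(attempts):
    """Keep only the latest state, as a Gauge set on each transition would."""
    return attempts[-1]


attempts = ["failed", "failed", "success"]  # one task, two retries
print(counter_view(attempts))  # 2
print(gauge_view(attempts))    # success
</code></pre>
<p>A dashboard built on the counter reports two failures for a task that ultimately succeeded, while the gauge reflects its final state.</p>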
<p>In this series, we'll build a clean architecture that uses the right tool for each category. We'll show you exactly how to set up the stack, what code to write, and what traps to avoid.</p>
<h2>What's Coming in This Series</h2>
<ul>
<li><p><strong>Part 2: The Architecture</strong> — How to design a Prometheus + Grafana + StatsD architecture for batch workloads. What belongs in Prometheus, what doesn't, and where StatsD fits.</p>
</li>
<li><p><strong>Part 3: Metric Granularity &amp; Classification</strong> — What level of observability to achieve where. Why Gauges are semantically correct for batch state (and why it's NOT about performance). What data belongs in Prometheus vs. structured logs and OLAP stores for auditing and traceability.</p>
</li>
<li><p><strong>Part 4: The Implementation</strong> — Building a production-ready Airflow plugin with Gauges, configuring StatsD, designing a Sweeper DAG, and setting up Grafana dashboards. Complete code walkthrough.</p>
</li>
<li><p><strong>Part 5: Future Scope</strong> — How to build durable, per-execution audit trails using event streams, OLAP stores, distributed tracing, and data lineage for the data that should never touch Prometheus.</p>
</li>
</ul>
<h2>Key Takeaways</h2>
<ul>
<li><p><strong>For Juniors:</strong> The green tick lies. <code>exit 0</code> means the container didn't crash, not that it processed data correctly. You must push application metrics out of your batch jobs.</p>
</li>
<li><p><strong>For Seniors:</strong> Batch observability has three distinct data categories (operational metrics, state snapshots, execution history). Forcing all three into Prometheus is a common and expensive mistake.</p>
</li>
<li><p><strong>The Rule:</strong> Match your data to the right storage backend. Not everything belongs in a time-series database.</p>
</li>
</ul>
<h2>References</h2>
<ul>
<li><p><a href="https://github.com/TheStaffBlueprint/batch-workloads-observability">TheStaffBlueprint/batch-workloads-observability (Companion Repo)</a></p>
</li>
<li><p><a href="https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html">Apache Airflow Metrics Configuration</a></p>
</li>
<li><p><a href="https://prometheus.io/docs/practices/pushing/">Prometheus Official: When to use the Pushgateway</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>