stevencodes.swe - Nov 16, 2025

Monitoring dead tuples, book snippet

šŸ‘‹ Hey friends,

Here’s what I’ve got in store for you this week:

  • A snippet from Chapter 4 of The Backend Lowdown

  • Why monitoring dead tuples is important

Let’s get into it šŸ‘‡

The Backend Lowdown: Chapter 4 Preview

Every newsletter will include a snippet from my book in progress, The Backend Lowdown, available for $5 right now on Gumroad!

Get The Backend Lowdown →

Observability & Kill Switches

Caching without monitoring is like driving blindfolded: you won't know whether you're helping or hurting until something breaks. Build a thin layer of instrumentation around your cache to track what's working, catch problems early, and disable problematic caches instantly when things go wrong.

This isn't about complex monitoring, just a few key signals and circuit breakers that answer critical questions: Is the cache actually making things faster? Are users getting stale data? Can we turn it off RIGHT NOW if needed?

Kill switches

When production is on fire and the cache is the suspect, you need to turn it off NOW, not in 20 minutes after a deploy. Build simple controls that let you disable, bypass, or tune specific caches instantly through environment variables or feature flags.

Key principle: Make switches granular by namespace so you can disable just the problematic cache (products:*) without taking down everything else that's working fine.

module CacheControl
  # Control cache behavior via env vars or feature flags (Flipper, LaunchDarkly, etc.)
  # Can be toggled instantly without deploy - perfect for incidents
  
  # Disable all reads for a namespace (fall back to database)
  def self.reads_enabled?(ns)
    !flag?("CACHE_READS_OFF:#{ns}") &&    # Specific namespace disabled?
    !flag?("CACHE_READS_OFF:*")           # All caches disabled?
  end
  
  # Disable cache writes (still read existing cached data)
  def self.writes_enabled?(ns)
    !flag?("CACHE_WRITES_OFF:#{ns}") && 
    !flag?("CACHE_WRITES_OFF:*")
  end
  
  # Force cache misses for testing or debugging
  def self.force_miss?(ns)
    flag?("CACHE_FORCE_MISS:#{ns}")
  end
  
  # Dynamically adjust TTLs (e.g., 0.1 = 10% of normal, 2.0 = double)
  def self.ttl_multiplier(ns)
    (ENV["CACHE_TTL_MULTIPLIER:#{ns}"] || "1.0").to_f
  end

  # Check env vars first, fall back to the feature flag system if available
  def self.flag?(key)
    return true if ENV[key] == "1"
    defined?(Flipper) && Flipper.enabled?(key.parameterize.underscore.to_sym)
  rescue StandardError
    false
  end
  private_class_method :flag? # "private" alone doesn't apply to def self. methods
end

Usage at call sites:

def cached_fetch(ns:, key:, base_ttl:, &compute)
  # Skip cache entirely if disabled or forcing misses
  if CacheControl.force_miss?(ns) || !CacheControl.reads_enabled?(ns)
    Rails.logger.info("Cache bypassed namespace=#{ns} reason=kill_switch")
    return compute.call
  end

  ttl = (base_ttl * CacheControl.ttl_multiplier(ns)).seconds

  if !CacheControl.writes_enabled?(ns)
    # True read-only mode: read if present, otherwise compute and return
    cached = Rails.cache.read(key)
    return cached unless cached.nil?
    return compute.call
  end

  # Normal read-through cache
  Rails.cache.fetch(key, expires_in: ttl, race_condition_ttl: 2.seconds) do
    compute.call
  end
end

# Example usage in controller:
def show
  @product = cached_fetch(
    ns: "products",           # Namespace for targeted control
    key: "product:#{params[:id]}",
    base_ttl: 5.minutes
  ) do
    Product.find(params[:id]).to_presenter
  end
end

# During an incident, disable the products cache instantly:
# $ heroku config:set CACHE_READS_OFF:products=1
# Or, if you're using Flipper: Flipper.enable(:cache_reads_off_products)

The Dead Tuple Alert That Caught a Slow Query

Here’s a monitoring pattern I really like for Postgres: track dead tuples globally as an early warning signal.

A typical scenario looks like this:

  • A global ā€œdead tuple ratioā€ alert fires.

  • Latency graphs look… fine.

  • Error rates are… fine.

  • But Postgres is quietly building up more cleanup work than usual.

When you see that, it’s often not ā€œbloatā€ yet. It’s the database telling you, ā€œI’m generating dead rows faster than VACUUM (autovacuum) can reclaim them.ā€

One very common root cause: slow or long-running queries.

Quick refresher: what are dead tuples?

In Postgres, updates and deletes don’t overwrite rows in place. Because of MVCC:

  • An UPDATE creates a new row version and eventually marks the old one as dead.

  • A DELETE marks a row as dead, but doesn’t physically remove it immediately.

Those dead tuples:

  • Still take up space on disk

  • Still live in pages that need to be scanned

  • Contribute to table and index bloat

  • Make queries do more I/O than they should

VACUUM is what eventually cleans them up. If VACUUM falls behind, dead tuples pile up.
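
You can watch this happen on a scratch table. A minimal sketch (the demo table is hypothetical, and the stats view updates asynchronously, so the counts can lag by a moment):

-- Every UPDATE leaves the old row version behind as a dead tuple
CREATE TABLE demo (id int PRIMARY KEY, val int);
INSERT INTO demo SELECT g, 0 FROM generate_series(1, 10000) g;
UPDATE demo SET val = 1;  -- rewrites all 10,000 rows

SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'demo';
-- n_dead_tup sits near 10,000 until VACUUM (or autovacuum) reclaims the space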

How this points to slow queries

When a global dead-tuple alert fires but your user-facing metrics look okay, it’s a great time to ask:

ā€œIs something holding old row versions hostage?ā€

A very common pattern:

  1. A query or background job runs much longer than expected.

  2. It keeps a transaction open.

  3. VACUUM can’t safely remove old row versions that transaction might still see.

  4. Dead tuples accumulate across the tables that query touches.

You can confirm this pretty quickly by checking both dead tuples and long-running queries.

Top tables by dead tuples:

SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

Long-running queries / transactions:

SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC;

If you see a query that’s been running for minutes or hours, touching the same tables that are full of dead tuples, you’ve probably found your culprit.
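
If you want to see exactly which sessions are pinning those old row versions, pg_stat_activity also exposes backend_xmin, the oldest transaction a session can still see. A sketch (available on reasonably recent Postgres versions):

SELECT
  pid,
  state,
  age(backend_xmin)  AS xmin_age,       -- how far behind this session's snapshot is
  now() - xact_start AS xact_duration,  -- how long the transaction has been open
  left(query, 60)    AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

The sessions at the top are the ones holding VACUUM back; fix or terminate them and cleanup can catch up.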

What you can actually do

Once you’ve tied the dead-tuple growth to a workload, the options are usually:

  • Fix or kill the slow query

    • Add an index, adjust the plan, break it into smaller chunks, or cap its runtime (for example with statement_timeout).

  • Tune autovacuum (per table, if needed)

    • Lower autovacuum_vacuum_scale_factor and/or autovacuum_vacuum_threshold for high-churn tables (see the sketch after this list).

    • Read the Postgres docs if you want more info on how to tweak autovacuum!

  • Reduce unnecessary churn

    • Avoid constantly updating the same hot rows if you don’t need to.
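
On the autovacuum side, the usual lever is per-table storage parameters. A minimal sketch, using a hypothetical high-churn orders table and illustrative values:

-- Autovacuum triggers when dead rows exceed threshold + scale_factor * live rows
-- (defaults: 50 and 0.2, i.e. roughly 20% of the table has to be dead first)
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold    = 1000
);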

Dead tuple monitoring isn’t just about ā€œbloat.ā€ It’s an early-warning system for long-running queries and MVCC hygiene.

A simple global dead tuple ratio alert gives you a cheap, low-noise heads-up that something in your workload (or your vacuum tuning) is off, long before it shows up as a full-blown incident.
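
If you want to wire that alert up, the ratio itself is a single query over pg_stat_user_tables (a sketch; pick a threshold that fits your workload’s normal churn):

SELECT
  sum(n_dead_tup) AS dead,
  sum(n_live_tup) AS live,
  round(100.0 * sum(n_dead_tup)
        / nullif(sum(n_live_tup) + sum(n_dead_tup), 0), 2) AS dead_pct
FROM pg_stat_user_tables;

Track it over time and alert on drift from your baseline rather than on a single absolute number.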

That’s a wrap for this week. If something here made your day smoother, feel free to reply and tell me about it. And if you think a friend or teammate would enjoy this too, I’d be grateful if you shared it with them.

Until next time,
Steven