stevencodes.swe - Nov 16, 2025
Monitoring dead tuples, book snippet
Hey friends,
Here's what I've got in store for you this week:
A snippet from Chapter 4 of The Backend Lowdown
Why monitoring dead tuples is important
Let's get into it.
The Backend Lowdown: Chapter 4 Preview
Every newsletter will include a snippet from my book in progress, The Backend Lowdown, available for $5 right now on Gumroad!
Get The Backend Lowdown
Observability & Kill Switches
Caching without monitoring is like driving blindfolded... you won't know if you're helping or hurting until something breaks. Build a thin layer of instrumentation around your cache to track what's working, catch problems early, and disable problematic caches instantly when things go wrong.
This isn't about complex monitoring, just a few key signals and circuit breakers that answer critical questions: Is the cache actually making things faster? Are users getting stale data? Can we turn it off RIGHT NOW if needed?
Kill switches
When production is on fire and cache is the suspect, you need to turn it off NOW, not in 20 minutes after a deploy. Build simple controls that let you disable, bypass, or tune specific caches instantly through environment variables or feature flags.
Key principle: Make switches granular by namespace so you can disable just the problematic cache (products:*) without taking down everything else that's working fine.
module CacheControl
  # Control cache behavior via env vars or feature flags (Flipper, LaunchDarkly, etc.)
  # Can be toggled instantly without a deploy - perfect for incidents

  # Disable all reads for a namespace (fall back to the database)
  def self.reads_enabled?(ns)
    !flag?("CACHE_READS_OFF:#{ns}") && # Specific namespace disabled?
      !flag?("CACHE_READS_OFF:*")      # All caches disabled?
  end

  # Disable cache writes (still read existing cached data)
  def self.writes_enabled?(ns)
    !flag?("CACHE_WRITES_OFF:#{ns}") &&
      !flag?("CACHE_WRITES_OFF:*")
  end

  # Force cache misses for testing or debugging
  def self.force_miss?(ns)
    flag?("CACHE_FORCE_MISS:#{ns}")
  end

  # Dynamically adjust TTLs (e.g., 0.1 = 10% of normal, 2.0 = double)
  def self.ttl_multiplier(ns)
    (ENV["CACHE_TTL_MULTIPLIER:#{ns}"] || "1.0").to_f
  end

  # Check env vars first, fall back to the feature flag system if available
  def self.flag?(key)
    ENV[key] == "1" ||
      Flipper.enabled?(key.parameterize.underscore.to_sym)
  rescue StandardError
    ENV[key] == "1"
  end
  private_class_method :flag? # `private` alone doesn't hide `def self.` methods
end

Usage at call sites:
def cached_fetch(ns:, key:, base_ttl:, &compute)
  # Skip cache entirely if disabled or forcing misses
  if CacheControl.force_miss?(ns) || !CacheControl.reads_enabled?(ns)
    Rails.logger.info("Cache bypassed namespace=#{ns} reason=kill_switch")
    return compute.call
  end

  ttl = (base_ttl * CacheControl.ttl_multiplier(ns)).seconds

  unless CacheControl.writes_enabled?(ns)
    # True read-only mode: read if present, otherwise compute and return
    cached = Rails.cache.read(key)
    return cached unless cached.nil?
    return compute.call
  end

  # Normal read-through cache
  Rails.cache.fetch(key, expires_in: ttl, race_condition_ttl: 2.seconds) do
    compute.call
  end
end
# Example usage in controller:
def show
  @product = cached_fetch(
    ns: "products", # Namespace for targeted control
    key: "product:#{params[:id]}",
    base_ttl: 5.minutes
  ) do
    Product.find(params[:id]).to_presenter
  end
end

# During an incident, disable the products cache instantly:
# $ heroku config:set CACHE_READS_OFF:products=1
# Or in a Rails console: ENV["CACHE_READS_OFF:products"] = "1"

The Dead Tuple Alert That Caught a Slow Query
Here's a monitoring pattern I really like for Postgres: track dead tuples globally as an early warning signal.
A typical scenario looks like this:
A global "dead tuple ratio" alert fires.
Latency graphs look... fine.
Error rates are... fine.
But Postgres is quietly building up more cleanup work than usual.
When you see that, it's often not "bloat" yet. It's the database telling you, "I'm generating dead rows faster than VACUUM (autovacuum) can reclaim them."
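The alert itself doesn't need to be fancy. A check along these lines (a rough sketch; pick whatever alert threshold fits your workload) gives you a global ratio straight from pg_stat_user_tables:

-- Global dead tuple ratio across user tables (sketch)
SELECT
  sum(n_dead_tup) AS dead_tuples,
  sum(n_live_tup) AS live_tuples,
  round(
    sum(n_dead_tup)::numeric / NULLIF(sum(n_live_tup) + sum(n_dead_tup), 0),
    3
  ) AS dead_ratio
FROM pg_stat_user_tables;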
One very common root cause: slow or long-running queries.
Quick refresher: what are dead tuples?
In Postgres, updates and deletes don't overwrite rows in place. Because of MVCC:
An UPDATE creates a new row version and eventually marks the old one as dead.
A DELETE marks a row as dead, but doesnāt physically remove it immediately.
Those dead tuples:
Still take up space on disk
Still live in pages that need to be scanned
Contribute to table and index bloat
Make queries do more I/O than they should
VACUUM is what eventually cleans them up. If VACUUM falls behind, dead tuples pile up.
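If you want to see that in action, here's a throwaway example (mvcc_demo is just a scratch table for illustration; the counters come from the stats system, so they can lag by a few seconds):

-- Every UPDATE leaves the old row version behind as a dead tuple
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val int);
INSERT INTO mvcc_demo SELECT g, 0 FROM generate_series(1, 1000) AS g;
UPDATE mvcc_demo SET val = val + 1;  -- rewrites all 1,000 rows

-- Expect roughly 1,000 dead tuples until (auto)vacuum reclaims them
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'mvcc_demo';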
How this points to slow queries
When a global dead-tuple alert fires but your user-facing metrics look okay, it's a great time to ask:
"Is something holding old row versions hostage?"
A very common pattern:
A query or background job runs much longer than expected.
It keeps a transaction open.
VACUUM can't safely remove old row versions that transaction might still see.
Dead tuples accumulate across the tables that query touches.
You can confirm this pretty quickly by checking both dead tuples and long-running queries.
Top tables by dead tuples:
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

Long-running queries / transactions:
SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC;

If you see a query that's been running for minutes or hours, touching the same tables that are full of dead tuples, you've probably found your culprit.
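One more check worth pairing with that (a sketch, and cheap to run): look at backend_xmin in pg_stat_activity to see which sessions are pinning the oldest row versions, because those are exactly the rows VACUUM has to keep around:

-- Sessions holding back the xmin horizon (VACUUM can't clean up past these)
SELECT
  pid,
  state,
  now() - xact_start AS xact_age,
  backend_xmin,
  left(query, 80) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;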
What you can actually do
Once you've tied the dead-tuple growth to a workload, the options are usually:
Fix or kill the slow query
Add an index, adjust the plan, break it into smaller chunks, or cap runtime (see the sketch after this list).
Tune autovacuum (per table, if needed)
Lower autovacuum_vacuum_scale_factor and/or autovacuum_vacuum_threshold for high-churn tables (also sketched after this list). Read the Postgres docs if you want more info on how to tweak autovacuum!
Reduce unnecessary churn
Avoid constantly updating the same hot rows if you donāt need to.
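For the first two options, here's a rough sketch of the knobs involved. The pid (12345), role (reporting), table (orders), and the specific numbers are all placeholders, not recommendations:

-- 1) Kill or cap the offending query (pid comes from pg_stat_activity)
SELECT pg_cancel_backend(12345);      -- cancel just the current query
SELECT pg_terminate_backend(12345);   -- drop the whole connection if it won't die
ALTER ROLE reporting SET statement_timeout = '30s';  -- cap future runtime for a role

-- 2) Make autovacuum more aggressive on a high-churn table
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,  -- default is 0.2 (20% of the table)
  autovacuum_vacuum_threshold = 1000
);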
Dead tuple monitoring isn't just about "bloat." It's an early-warning system for long-running queries and MVCC hygiene.
A simple global dead tuple ratio alert gives you a cheap, low-noise heads-up that something in your workload, or your vacuum tuning, is off, long before it shows up as a full-blown incident.
That's a wrap for this week. If something here made your day smoother, feel free to reply and tell me about it. And if you think a friend or teammate would enjoy this too, I'd be grateful if you shared it with them.
Until next time,
Steven