When Reverse Proxies Surprise You: Hard Lessons from Operating at Scale

whstl 7 hours ago

It's nice to see someone else preaching this:

> Production Lesson: Never let exceptions dictate the norm. Handle them explicitly, in isolated paths or tiers, instead of polluting the mainline logic. What looks like "flexibility" is often just deferred fragility waiting to surface at scale.

I've seen this pattern far too often in production systems. In the name of "covering edge cases", a huge amount of complexity gets moved into configuration languages, interfaces, APIs, etc., to be more flexible. Not only does this fail to free up developers' time (because it overcomplicates everything), it also makes things worse for the users of those structures. We already have something "flexible": source code itself. No need to reinvent the wheel.

nijave 4 hours ago

I see something similar with AI-generated code, where it tries much too hard to handle all the exceptions and ends up swallowing or obfuscating them instead of making things more reliable. Claude seems particularly bad unless you prompt it to minimize complexity.
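
A rough Go sketch of the difference nijave is pointing at. The port-parsing scenario and both function names are made up for illustration: the first version "handles" the error by hiding it behind a default, the second does less work and simply returns the error with context.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePortSwallow hides the failure: the caller cannot tell a mistyped
// value from a deliberately chosen default.
func parsePortSwallow(s string) int {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 8080 // the error is silently replaced by a default
	}
	return p
}

// parsePort wraps the error with context and returns it, leaving the
// decision about what to do next to the caller.
func parsePort(s string) (int, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	return p, nil
}

func main() {
	// The typo in "80a" goes unnoticed here.
	fmt.Println(parsePortSwallow("80a"))

	// Here the typo surfaces, and the original cause is still inspectable.
	_, err := parsePort("80a")
	fmt.Println(err)
	fmt.Println(errors.Is(err, strconv.ErrSyntax))
}
```

The shorter error path is also the more reliable one here: a bad value stops the program where it was introduced instead of turning into a service listening on the wrong port.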

immibis 7 hours ago

The configuration complexity clock: https://mikehadlow.blogspot.com/2012/05/configuration-comple...
- whstl 4 hours ago
  
  I wish people would realize that moving back to code is possible, though.
  It rarely happens because at this point the codebase is so littered with problems that things start requiring long QA, code freezes and once-a-month deployments, and it's impossible to get anything done.
  - dottedmag 16 minutes ago
    
    Better never stray from code.
    My favourite configuration pattern for SaaS code: all the configuration for all targets, from local development setup, to unit tests, to CI throwaway deployments, to production, is in a single Go package. The current environment is selected by a single environment variable.
    Need something else configured beyond your code? Write Go code to emit configs for the current environment, in a "gen-config some-tool && some-tool" stanza.
- marcosdumay 2 hours ago
  
  Config values and a configurable plugins system completely solve the problem, dominating over the entire clock.
  Iterating further from config values is a great predictor that a project will become a disaster to use, and probably fail completely.
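
A minimal sketch of the single-package pattern dottedmag describes above, not their actual code. The `Config` fields, the environment names, the `APP_ENV` variable and the output format of `RenderToolConfig` are all assumptions made up for the example.

```go
package main

import (
	"fmt"
	"os"
)

// Config holds every setting the service needs. The fields are
// hypothetical; a real service would define its own.
type Config struct {
	DatabaseURL string
	LogLevel    string
	Replicas    int
}

// configs lists all targets in one place, so a difference between
// environments shows up as an ordinary code diff.
var configs = map[string]Config{
	"local": {DatabaseURL: "postgres://localhost/dev", LogLevel: "debug", Replicas: 1},
	"test":  {DatabaseURL: "postgres://localhost/test", LogLevel: "debug", Replicas: 1},
	"ci":    {DatabaseURL: "postgres://ci-db/app", LogLevel: "info", Replicas: 1},
	"prod":  {DatabaseURL: "postgres://prod-db/app", LogLevel: "warn", Replicas: 3},
}

// Lookup returns the config for the named environment and fails loudly
// on an unknown name instead of falling back to a default.
func Lookup(env string) (Config, error) {
	c, ok := configs[env]
	if !ok {
		return Config{}, fmt.Errorf("unknown environment %q", env)
	}
	return c, nil
}

// Current selects the environment from a single variable.
// APP_ENV is an assumed name, not one taken from the thread.
func Current() (Config, error) {
	return Lookup(os.Getenv("APP_ENV"))
}

// RenderToolConfig emits a config file for an external tool, in the
// spirit of "gen-config some-tool". The key=value format is invented.
func RenderToolConfig(c Config) string {
	return fmt.Sprintf("log_level = %s\nreplicas = %d\n", c.LogLevel, c.Replicas)
}

func main() {
	c, err := Current()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(RenderToolConfig(c))
}
```

Run as something like `APP_ENV=prod go run . > some-tool.conf && some-tool`. Because the table is plain Go, a misspelled field fails at compile time, and an unknown environment name fails at startup rather than midway through a deployment.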

stacktrace 7 hours ago

Very interesting read! But I want to point out a small correction - the DNS collapse issue at HAProxy, along with O(N^2), also had some O(N^3) code paths, which is just mind-blowing.

Also, I believe this should be the correct GitHub issue link - https://github.com/haproxy/haproxy/issues/1404

> Production Lesson: Code that "works fine" at small scale may still hide O(N²) or worse behavior. At hundreds or thousands of nodes, those costs stop being theoretical and start breaking production.
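
To make the quoted lesson concrete, here is a hedged Go sketch of the general shape of the problem. It is not HAProxy's code, which is written in C. The `Record` type and both function names are hypothetical. It matches N servers against N DNS records, first with a nested scan and then with an index built once.

```go
package main

import "fmt"

// Record is a hypothetical DNS answer: a server name and its address.
type Record struct{ Name, Addr string }

// matchQuadratic scans the record list once per server, so the work
// grows as O(N²). It also returns the number of comparisons made.
func matchQuadratic(servers []string, records []Record) (map[string]string, int) {
	out := make(map[string]string, len(servers))
	comparisons := 0
	for _, s := range servers {
		for _, r := range records {
			comparisons++
			if r.Name == s {
				out[s] = r.Addr
				break
			}
		}
	}
	return out, comparisons
}

// matchIndexed builds a name-to-address index once, then does one
// map lookup per server: O(N) expected time.
func matchIndexed(servers []string, records []Record) map[string]string {
	index := make(map[string]string, len(records))
	for _, r := range records {
		if _, seen := index[r.Name]; !seen { // keep the first match, like the scan above
			index[r.Name] = r.Addr
		}
	}
	out := make(map[string]string, len(servers))
	for _, s := range servers {
		if addr, ok := index[s]; ok {
			out[s] = addr
		}
	}
	return out
}

func main() {
	const n = 1000
	servers := make([]string, n)
	records := make([]Record, n)
	for i := range servers {
		servers[i] = fmt.Sprintf("srv%d", i)
		records[i] = Record{Name: servers[i], Addr: fmt.Sprintf("10.0.%d.%d", i/256, i%256)}
	}
	_, comparisons := matchQuadratic(servers, records)
	fmt.Println("quadratic comparisons:", comparisons)
	fmt.Println("indexed lookups:", len(matchIndexed(servers, records)))
}
```

With ten servers the nested scan is invisible in any profile. With a thousand it already does about half a million comparisons per DNS response, and the cost quadruples each time the fleet doubles.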

bell-cot 3 hours ago

Re-sort the takeaway points, to put this one first:

> Prioritize human factors. Outage recovery depends on what operators can see and do under stress. When dashboards fail, clear logs, simple commands, and predictable behavior matter more than complex mechanisms.

Why - to make it really, really clear to bullet-skimming managers and complexity-loving engineers that too-clever "solutions", just-an-afterthought "testing & training", and poorly documented configurations will turn into worlds of pain when things really go wrong. The "smart people" won't be in the Operations Center then, let alone with all the details fresh in their minds. And several of them may have taken jobs elsewhere, and not much care that the org is desperate for their help right now.
