It's nice to see someone else preaching this:
> Production Lesson: Never let exceptions dictate the norm. Handle them explicitly, in isolated paths or tiers, instead of polluting the mainline logic. What looks like "flexibility" is often just deferred fragility waiting to surface at scale.
I've seen this pattern far too often in production systems. In the name of "covering edge cases", a huge amount of complexity is moved into configuration languages, interfaces, APIs, etc., to be more flexible. Not only does this not free up developers' time (because it overcomplicates everything), it also makes things worse on the other side for the users of those structures. We already have something "flexible": source code itself; no need to reinvent the wheel.
I see something similar with AI-generated code, where it tries much too hard to handle all the exceptions and ends up swallowing or obfuscating them instead of making things more reliable. Claude seems particularly bad unless you prompt it to minimize complexity.
The configuration complexity clock: https://mikehadlow.blogspot.com/2012/05/configuration-comple...
I wish people would realize that moving back to code is possible, though.
It rarely happens because at this point the codebase is so littered with problems that things start requiring long QA, code freezes and once-a-month deployments, and it's impossible to get anything done.
Better never stray from code.
My favourite configuration pattern for SaaS code: all the configuration for every target (local development setup, unit tests, CI throwaway deployments, production) lives in a single Go package. The current environment is selected by a single environment variable.
Need something else configured beyond your code? Write Go code to emit configs for the current environment, in a "gen-config some-tool && some-tool" stanza.
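A minimal sketch of what that single-package pattern could look like, assuming a hypothetical package name, environment variable, and set of fields:

    // Package appconfig holds every environment's configuration as ordinary Go
    // code. The running environment is chosen by one environment variable,
    // e.g. APP_ENV=prod. All names and values here are illustrative.
    package appconfig

    import (
        "fmt"
        "os"
    )

    // Config is the full set of knobs the service needs.
    type Config struct {
        ListenAddr  string
        DatabaseURL string
        DebugPanics bool
    }

    // Every target, from local dev to prod, spelled out in one place.
    var environments = map[string]Config{
        "local": {ListenAddr: ":8080", DatabaseURL: "postgres://localhost/app_dev", DebugPanics: true},
        "test":  {ListenAddr: ":0", DatabaseURL: "postgres://localhost/app_test", DebugPanics: true},
        "ci":    {ListenAddr: ":8080", DatabaseURL: "postgres://ci-db/app", DebugPanics: true},
        "prod":  {ListenAddr: ":443", DatabaseURL: "postgres://prod-db/app", DebugPanics: false},
    }

    // Current returns the configuration for the environment named in APP_ENV.
    func Current() (Config, error) {
        env := os.Getenv("APP_ENV")
        cfg, ok := environments[env]
        if !ok {
            return Config{}, fmt.Errorf("unknown APP_ENV %q", env)
        }
        return cfg, nil
    }

The "gen-config some-tool" step can then be a small main that imports this package and writes out whatever file the external tool expects for the same environment, so the Go code stays the single source of truth.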
Config values and a configurable plugin system completely solve the problem, dominating the entire clock.
Iterating past config values is a great predictor that a project will become a disaster to use, and will probably fail completely.
Ah, but what happens when your plugins themselves need to be configured for different client deployments?
You add a few flags, then you need to figure out backwards compatibility as your plugin evolves (which involves defining prioritization rules between options), then those rules get complex enough to have conditionals (say, for granular traffic patterns), which means you have a DSL. And when the DSL gets complex enough, it needs an entire Software Development Lifecycle, which means it's effectively hard-coded. Or, you have people fork the plugin, which is a hard-code in and of itself.
All in all, you don't avoid the "configurability clock," you just decentralize it!
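To make the "prioritization rules between options" step concrete, here's a hypothetical sketch of a plugin whose options accreted over a few releases (names, fields, and defaults are invented):

    package plugin

    import "time"

    // Options accumulated by a hypothetical plugin over several releases.
    type Options struct {
        Timeout        time.Duration            // current flag
        TimeoutMs      int                      // legacy flag kept for backwards compatibility
        TimeoutByRoute map[string]time.Duration // later addition: per-route overrides
    }

    // effectiveTimeout encodes the precedence rules that pile up once old and
    // new options coexist; add a few more conditionals for traffic patterns
    // and you are most of the way to an ad-hoc DSL.
    func (o Options) effectiveTimeout(route string) time.Duration {
        if t, ok := o.TimeoutByRoute[route]; ok {
            return t
        }
        if o.Timeout != 0 {
            return o.Timeout
        }
        if o.TimeoutMs != 0 {
            return time.Duration(o.TimeoutMs) * time.Millisecond
        }
        return 5 * time.Second // historical default; changing it breaks someone
    }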
The real problem is that clients inevitably have conflicting needs that cut across any modularization barriers you might think to build. When a configured plugin can have spooky action at a distance, perhaps under-tested due to configuration, is it truly modular? Thus, the clock emerges.
You do multiple plugins, or use constant configuration values for them. That's why you want plugins: to put all the complex stuff in actual code that doesn't have to live with the main product.
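A rough illustration of that split, assuming made-up package and interface names: the host product only knows a small interface, and each exceptional client gets its own compiled plugin whose "configuration" is plain constants and code.

    // file: billing/rates.go (hypothetical host-side seam)
    package billing

    // RateCalculator is all the main product knows about pricing plugins.
    type RateCalculator interface {
        // Rate returns the per-unit price for a given usage volume.
        Rate(units int) float64
    }

    // StandardRates is the default most customers use unchanged.
    type StandardRates struct{}

    func (StandardRates) Rate(units int) float64 { return 0.10 }

    // file: plugins/acmecorp/rates.go (hypothetical client-specific plugin)
    package acmecorp

    // The "configuration" is constants and logic, reviewed and tested like
    // any other code, living outside the main product.
    const bulkThreshold = 10000

    type AcmeRates struct{}

    func (AcmeRates) Rate(units int) float64 {
        if units > bulkThreshold {
            return 0.07 // negotiated bulk discount
        }
        return 0.10
    }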
That doesn't decentralize the clock; it gives a maximally capable interface to the few people who need to handle exceptional cases, and a minimally capable one to the people who just want to use your software as-is. That is, you make the product live on two opposite values of the clock at the same time.
Very interesting read! But I want to point out a small correction: the HAProxy DNS collapse issue, along with O(N^2), also had some O(N^3) code paths, which is just mind-blowing.
Also, I believe this should be the correct GitHub issue link - https://github.com/haproxy/haproxy/issues/1404
> Production Lesson: Code that "works fine" at small scale may still hide O(N²) or worse behavior. At hundreds or thousands of nodes, those costs stop being theoretical and start breaking production.
The engineer killing the proxy because they assumed processes running as "nobody" were stray (whatever that means; processes without a parent don't change user, and "nobody" doesn't mean "no user") doesn't belong in that list. That was just an engineer out of their depth (I assume one used to dealing with other systems).
Re-sort the takeaway points, to put this one first:
> Prioritize human factors. Outage recovery depends on what operators can see and do under stress. When dashboards fail, clear logs, simple commands, and predictable behavior matter more than complex mechanisms.
Why? To make it really, really clear to bullet-skimming managers and complexity-loving engineers that too-clever "solutions", just-an-afterthought "testing & training", and poorly documented configurations will turn into worlds of pain when things really go wrong. The "smart people" won't be in the Operations Center then, let alone with all the details fresh in their minds. And several of them may have taken jobs elsewhere, and won't much care that the org is desperate for their help right now.