
At an organisational level, you have to make a decision on how much time you spend doing one or the other. It might be that some developers never do any refactoring but someone is always going to end up doing it. Or nobody does it, and the code slowly decays.

Unless you're saying that you don't have to do refactoring at all in the organisation, but surely the only way to do that is to always get it right the first time, which isn't hugely practical. You may sometimes encounter a situation where the quickest way to build a feature is to fix some old ugly code, but that's certainly not the case every time.



My whole point is that you don't, because there aren't always just two options. That's the false dilemma logical fallacy.

I'm saying you can fix problems without dropping everything and redoing work. You're allowed to problem solve and work with people to create a third option. And you can prevent new ones by learning and strategizing.


Well, whether you drop everything, clean as you go, or follow whatever other strategy, fixing stuff takes time, even if it's just the mental effort of designing a better way and building consensus.

I'm just using simple analogies for the sake of explanation, but it is nearly always the case that expanding the scope of work to fix previous architectural decisions that were either flawed or no longer relevant will take considerably longer than just fixing the problem at hand.

There may be the odd time, particularly in a large, well defined piece of work, where you can say actually tidying up this other stuff will save time overall. Or perhaps you can batch a bunch of improvements in the same system together into a larger, more thoughtful architectural improvement. All of that is great if you can do it, but it's often not possible.

As far as preventing future architectural issues by learning and strategizing, I feel like that's what we spend our entire career trying to get better at doing ;). But alas I, and everyone else, seem to continue making decisions that don't pan out long term. Even if you did make a perfect decision at the time, often the world/business/third party dependency changes, and what was an excellent decision in the past becomes a pain point a few years later.

It used to be the case that we tried to design infinitely extensible software so future requirements could always be incorporated, but that makes the software unmaintainable. So the pendulum swung to YAGNI and only designing for exactly what was right in front of you, but that leads to major architectural overhauls every few months. The true answer is somewhere in the middle, but learning where only seems to come with decades of experience.

Unfortunately older programmers all seem to be forced out of developing and into management or other careers for some reason.


I'm still trying to challenge your assumptions. Why does a different solution necessarily require expanding the scope of work? Like you said, that's where experience helps to have those skills in your toolbox. Doing things better doesn't have to be harder.


It doesn't always require expanding the scope of work, but very often does. I even suggested a few situations where it doesn't, but in many cases fixing the true underlying problem involves expanding the scope of work.

It's hard to argue the nitty gritty without examples, so here's a real-world one from quite a long time ago, at a company that went bust after the death of the owner.

--

We had a system that had a significant quantity of code written in a custom language that would be compiled by an internally written compiler. This compiler was in some ways a work of genius, written in the 80s, but it had a lot of very deep architectural flaws in the optimiser that meant certain patterns of code would generate invalid output. We didn't write much new code in this language but had a pretty large body of code that needed to continue running.

So during a server hardware refresh, we found that almost everything was crashing. Turns out, a compiler optimiser flaw meant that any time a loop had a number of iterations that wasn't a multiple of the number of CPUs, generated programs would segfault.

We investigated what it would take to fix the underlying issue but it would have been a week or more of work just to understand why it was happening. Porting all the old code would have taken even longer.

Instead, using a pre-existing AST manipulation library we had written, we added a prebuild script that hacked all of the files to include a CPU count check and pad out the number of iterations with NOPs. It took a few hours and unblocked the server upgrade.
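
To make that concrete: the padding arithmetic itself was trivial, the AST library did the heavy lifting. Roughly this shape, translated into Python rather than our internal language (the names and the guard scheme here are illustrative, not the real code):

    import os

    def padded_iterations(n: int, cpus: int) -> int:
        """Round an iteration count up to the next multiple of the CPU count."""
        remainder = n % cpus
        return n if remainder == 0 else n + (cpus - remainder)

    # The (hypothetical) prebuild pass walked each file's AST and rewrote every
    # fixed-count loop "for i in 1..N" into "for i in 1..padded_iterations(N, cpus)",
    # guarding the body with "if i <= N: <original body> else: NOP" so the extra
    # iterations do nothing observable.
    if __name__ == "__main__":
        cpus = os.cpu_count() or 1
        for n in (7, 8, 13):
            print(n, "->", padded_iterations(n, cpus))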

--

Another, perhaps less esoteric and more recent example:

A third party open source library we use had an issue where a particular function call would sometimes get stuck in an infinite loop due to incorrect network code in the library interacting badly with our network hardware.

We submitted a bug report and a fix, but the maintainer wouldn't accept the fix unless we also changed a bunch of other related code, added a bunch of tests, etc., which we didn't have time to do. We considered a fork, but that would have involved keeping it up to date, rebuilding packages and so on.

We worked around the issue by running it in a separate process and monitoring CPU usage. If CPU usage goes beyond a certain threshold, we kill the process and try again.
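
For the curious, the watchdog is only a few dozen lines. This is a from-memory sketch in Python rather than the production code; psutil, the threshold and the retry count are all illustrative:

    import multiprocessing as mp

    import psutil  # assumption: psutil is available for per-process CPU sampling

    CPU_THRESHOLD = 90.0  # percent; illustrative, not the real number
    HIGH_SAMPLES = 10     # consecutive one-second samples before we give up
    MAX_RETRIES = 3

    def _worker(conn, fn, args):
        conn.send(fn(*args))

    def call_with_watchdog(fn, *args):
        """Run fn(*args) in a child process; kill and retry if it pegs a CPU."""
        for _ in range(MAX_RETRIES):
            recv_end, send_end = mp.Pipe(duplex=False)
            proc = mp.Process(target=_worker, args=(send_end, fn, args))
            proc.start()
            watched = psutil.Process(proc.pid)
            hot = 0
            while proc.is_alive():
                # A call stuck in the library's busy loop pegs a core; healthy
                # calls spend most of their time blocked on the network.
                try:
                    busy = watched.cpu_percent(interval=1.0) > CPU_THRESHOLD
                except psutil.NoSuchProcess:
                    break  # the call finished while we were sampling
                hot = hot + 1 if busy else 0
                if hot >= HIGH_SAMPLES:
                    proc.kill()  # stuck: kill the child and retry
                    break
            proc.join()
            if recv_end.poll(0):
                return recv_end.recv()
        raise RuntimeError("library call kept spinning; gave up after retries")

Call sites just wrap the library function (which has to be a top-level function so multiprocessing can pickle it) instead of calling it directly.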

The workaround was quick to build and has been working fine for over a year now. The contributed patch is still languishing in an open PR with various +1s from other users.


I think your examples agree with my point: You found minimal-time solutions that haven't caused continuous suffering afterwards, and can be easily removed when the root cause is fixed. That's a good result.



