Don’t Let the Model Grade its Own Homework

For a few years I have been the only tester on a back-end platform that seven teams push code into. The job gives you a narrow kind of paranoia. You stop trusting things that look fine, because the failures that wake you up almost always looked fine the day before they broke. So when I watch our trade start handing its tests to language models, I don’t feel relief. I feel the same itch I get when a release goes too quiet.

There is a real idea under the noise. Models are good at writing test cases. Give one a function and it hands back a page of inputs, including a few edge cases a tired engineer skips at the end of a sprint. I use that. Most testers I respect use that. Writing cases is the slow, dull part of the work, and giving it to a machine is a fair trade.

The trouble starts one step later, when teams let the model decide whether the test passed.

A test is not really code. It is a promise. It says this behavior is correct, and I will tell you the second it stops being correct. The whole worth of that promise is that it stays fixed. The assertion I wrote yesterday has to mean the same thing today, or the safety net is just decoration. The first time you let a model judge the result, you trade the promise for an opinion. Often a reasonable opinion. But opinions move, and they move quietly. A model that called your output correct in March can shrug at the same output in June, because the weights changed, or the prompt changed, or nobody pinned the temperature in the first place. Nothing in your code moved. Your suite now says something different anyway. That is not a test. That is a mood.

People answer this with the word “evaluation”. We will not assert, they say, we will score. We will ask a bigger model to grade the smaller one. I understand why. For genuinely fuzzy output, a summary or a tone, there may be nothing else to hold on to. But look at what you just built. You made your measuring instrument out of the same stuff you are trying to measure, and you cannot calibrate it, because you do not own it and it will not hold still. In any other kind of engineering we would laugh at a ruler that changes length between two readings. In testing we are shipping it with a straight face.

Then there is the question of who gets blamed. A deterministic assertion fails loudly and points at a line. A model judge fails softly. It keeps waving things through until one day a real defect goes out with a green tick next to it, and the person who has to explain that to a customer is not the model. It is the tester. We are signing for a verdict we were not allowed to control. I have never met an engineer who would put his name on a number he was not allowed to compute, and yet we are being asked to put our names on judgements we were not allowed to write.

So here is the line I draw, and I think most teams land on it once the shine wears off. Let the model write all the cases it wants. Let it widen coverage, suggest inputs, even propose what the right answer should be. Then a human reads that proposal, agrees or argues with it, and freezes it into a plain assertion that any junior can read and any pipeline can run a thousand times with the same result. The cleverness goes into writing the cases. The verdict stays deterministic and human-owned. Generation is where models earn their keep. Verification is where they quietly take your keys.
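One way that freeze could look in Python, as a minimal sketch rather than a prescribed setup. Everything named here is an assumption for illustration: `normalize_amount` is a hypothetical function under test, and `REVIEWED_CASES` stands in for model-proposed cases that a human has read, argued with, and committed. No model is called at verification time; the verdict is a plain equality check.

```python
from decimal import Decimal, ROUND_HALF_UP


def normalize_amount(raw: str) -> Decimal:
    """Hypothetical function under test: parse a user-entered money string."""
    cleaned = raw.strip().replace(",", "")
    # Rounding mode is stated explicitly so the result does not depend on
    # the ambient decimal context.
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustrative data only. A model may have proposed these inputs and even the
# expected values, but a human reviewed each pair before it was frozen here.
REVIEWED_CASES = [
    ("1,234.5", Decimal("1234.50")),
    ("  42 ", Decimal("42.00")),
    ("-7.1", Decimal("-7.10")),
    ("0.005", Decimal("0.01")),  # reviewer decided: round half up
]


def run_frozen_cases() -> list[str]:
    """Return one message per failing case; an empty list means all passed."""
    failures = []
    for raw, expected in REVIEWED_CASES:
        actual = normalize_amount(raw)
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected}, got {actual}")
    return failures


if __name__ == "__main__":
    problems = run_frozen_cases()
    for line in problems:
        print(line)
    # Deterministic verdict: same inputs, same answer, every run.
    assert not problems, "frozen cases failed"
```

The design choice is that the expected values live in the repository as data, so a change to the verdict shows up in a diff with a name attached to it, which a shifted model opinion never does.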

None of this is anti-progress, and I want to be clear about it, because this argument tends to get flattened into “some people are scared of the future”. I am not against the model in the loop. I am against the model at the end of the loop, holding the gavel. The end of the loop is the one place in software where boring is the whole point. It is meant to be stubborn and repeatable, so it cannot be talked out of an answer the way the rest of us can.

The teams that remember this keep a safety net. The teams that forget it keep a mirror, and a mirror will show you a passing test right until the moment it shows you the outage.

The post Don’t Let the Model Grade its Own Homework appeared first on SD Times.