I often use #Claude to measure the performance of my local models. Interesting research...
"Why this matters now: companies are rapidly deploying multi-agent systems where AI monitors AI," Song said. "If the monitor model won't flag failures because it's protecting its peer, the entire oversight architecture breaks."
There’s a lot of software out there where either (1) the users are captive users, or (2) actual outcomes don’t matter, and the important thing is to check the box, to have officially pretended to build the thing.
That’s the sort of software where development costs are especially painful for the MBAs, and where pushing the frontiers of the “fast build, low quality” quadrant may be a killer market — even if it’s just fast and not so cheap.
2/