Potentially hot take: LLMs are hitting a dead end before they could even become remotely useful. The whole approach boils down to brute force - you force-feed the model more data until the problem goes away... and that works until it doesn't, and at this point it's actively breaking things.
Based on the output of these models, it's blatantly obvious that they don't use the data well at all; the whole thing is a glorified e-parrot, not actual machine learning. And yet, as the text shows, it's almost impossible to say why - because the whole thing is a black box.