I have now read so many "ChatGPT can do X job better than workers" papers, and I don't think I've ever found one that wasn't at least flawed, if not complete bunk, once I went through the actual paper. I wrote about this a year ago, and I've since done the occasional follow-up on specific articles, including an official response to one of the most dishonest published papers that I've ever read. That response has itself just passed peer review and is awaiting publication.
That academics are still "benchmarking" ChatGPT like this, a full year after I wrote that, is genuinely astounding to me on so many levels. I don't even have anything left to say about it at this point. At least fewer of them are now purposefully designing their experiments to conclude that AI is awesome, and more are coming to the obvious conclusion that ChatGPT cannot actually replace doctors, because of course it can't.
This is my favorite one of these ChatGPT-as-doctor studies to date. It concluded that "GPT-4 ranked higher than the majority of physicians" on their exams. In reality, it actually can't do the exam, so the researchers made a special, ChatGPT-friendly version of the exam for the sole purpose of concluding that ChatGPT is better than humans.
"Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded."
Just a bunch of serious doctors at serious hospitals showing their whole ass.