We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid.
It sounds like they verify one LLM's answer by having a second one ask the same question over and over in slightly different ways and checking whether the first model's answers stay the same. Interestingly, they do this for each piece of the answer, so essentially the original answer is exploded into factoids and each one is then Monte Carlo sampled.
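To make the scoring step concrete, here is a rough Python sketch of how the per-factoid score could be computed once the answers have been sampled. It is not the authors' code. The function names (`semantic_entropy`, `factoid_score`, `naive_same_meaning`) are made up for illustration, the LLM calls are left out entirely, and the "same meaning" check is a plain normalized string comparison standing in for whatever meaning-equivalence test the paper actually uses. Estimating the entropy from cluster frequencies is also an assumption of this sketch.

```python
import math
from typing import Callable, List


def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, and drop a trailing period.
    return " ".join(text.lower().split()).rstrip(".")


def naive_same_meaning(a: str, b: str) -> bool:
    # ASSUMPTION: placeholder equivalence test. A real system would need
    # a model-based check that two answers mean the same thing.
    return normalize(a) == normalize(b)


def cluster_by_meaning(
    answers: List[str],
    same_meaning: Callable[[str, str], bool] = naive_same_meaning,
) -> List[List[str]]:
    # Greedy clustering: put each answer in the first cluster whose
    # representative (first member) it matches, else start a new cluster.
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str],
    same_meaning: Callable[[str, str], bool] = naive_same_meaning,
) -> float:
    # Entropy over meaning clusters, using cluster frequencies as
    # probability estimates (an assumption of this sketch).
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy += p * math.log(1.0 / p)
    return entropy


def factoid_score(
    answers_per_question: List[List[str]],
    same_meaning: Callable[[str, str], bool] = naive_same_meaning,
) -> float:
    # Average semantic entropy across all questions generated for one
    # factoid. Each inner list holds the sampled answers to one question
    # (the caller would include the original factoid among them).
    if not answers_per_question:
        raise ValueError("need at least one question per factoid")
    entropies = [semantic_entropy(a, same_meaning) for a in answers_per_question]
    return sum(entropies) / len(entropies)
```

A factoid whose sampled answers all land in one cluster scores zero, while answers that scatter across many clusters push the score up, which is the "do the answers stay the same" intuition. Where to put the cutoff for calling something a confabulation is not stated in the quote, so the sketch leaves it out.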