“The majority of mainstream evaluations reward hallucinatory behavior.”

The researchers say a “simple” change in evaluation method would correct this.

This reads to me like humans who refuse to say “I don’t know the answer” wrote the algos & picked the evaluation criteria.

Our weaknesses in code…