[2605.26156] Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Researchers developed a framework called BITE that exploits stylistic biases in large language model (LLM) judges to inflate the scores they assign. The method, which succeeded more than 65% of the time in misleading various judges, reveals significant vulnerabilities in using LLMs for evaluation. The findings call for a reevaluation of assessment methods to mitigate such attacks.
This item is worth keeping only if the source makes its practical relevance clear.
This record is extracted from a published AI Today issue and tied to the original source URL. Treat the source as the record of evidence for the summary.