Testing for LLM response differences: the case of a composite null consisting of semantically irrelevant query perturbations
Aranyak Acharyya, Carey E. Priebe, Hayden S. Helm
https://arxiv.org/abs/2509.10963
Given an input query, generative models such as large language models produce a random response drawn from a response distribution. Given two input queries, it is natural to ask if their response distributions are the same. While traditional statistical hypothesis testing is designed to address this question, the response distribution induced by an input query is often sensitive to semantically irrelevant perturbations to the query, so much so that a traditional test of equality might indicate …
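
To make the "traditional test of equality" concrete, here is a minimal sketch of a two-sample permutation test applied to responses from two queries. It is not the composite-null procedure proposed in the paper, only the kind of simple-null baseline the abstract contrasts against. It assumes each response has already been mapped to a fixed-length embedding vector; the function name `permutation_test`, the mean-difference statistic, and the simulated embeddings are all illustrative assumptions, not details from the source.

```python
import numpy as np


def permutation_test(x, y, n_perm=999, seed=0):
    """Two-sample permutation test of equality of distributions.

    x, y: arrays of shape (n, d) and (m, d) holding response embeddings
    (hypothetical inputs; the embedding step is assumed, not shown).
    Statistic: Euclidean norm of the difference of sample means.
    Returns the permutation p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.vstack([x, y])
    n = len(x)

    def stat(a, b):
        return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

    observed = stat(x, y)
    count = 0
    for _ in range(n_perm):
        # Randomly reassign pooled responses to the two "queries".
        perm = rng.permutation(len(pooled))
        if stat(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)


# Simulated stand-ins for embedded responses (assumption: 50 responses, 8 dims).
rng = np.random.default_rng(1)
responses_a = rng.normal(0.0, 1.0, size=(50, 8))
responses_b = rng.normal(3.0, 1.0, size=(50, 8))  # clearly shifted distribution

p_shifted = permutation_test(responses_a, responses_b)
p_identical = permutation_test(responses_a, responses_a.copy())
```

A test like this rejects whenever the two response distributions differ at all, which is exactly why it can flag a semantically irrelevant rephrasing as a "different" query; the simple null it tests has no notion of which differences matter.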