Study 1: distinguishing AI-generated from human-written poems
As specified in our pre-registration (https://osf.io/5j4w9), we predicted that participants would perform at chance when trying to identify AI-generated vs. human-written poems, and we set the significance level at 0.005 (ref. 18), treating p values between 0.05 and 0.005 as "suggestive". Observed accuracy was in fact slightly below chance (46.6%, χ²(1, N = 16,340) = 75.13, p < 0.0001). Agreement between participants was poor, but higher than chance (Fleiss's kappa = 0.005, p < 0.001). The poor agreement suggests that, as expected, participants found the task very difficult and were at least in part answering randomly. However, as in ref. 10, the below-chance accuracy and the significant agreement between participants led us to conclude that participants were not answering entirely at random: they must have been relying on at least some shared, yet mistaken, heuristics to distinguish AI-generated poems from human-written ones.
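A minimal sketch of the two tests reported above, not the authors' analysis code: a chi-square goodness-of-fit test of observed accuracy against 50% chance, and Fleiss's kappa for inter-participant agreement. It assumes a hypothetical long-format response table with columns poem_id, judged_ai, and correct; the counts shown in the usage comment are approximate, derived from the reported accuracy and N.

```python
# Sketch of the chance-level and agreement analyses (illustrative only).
import pandas as pd
from scipy.stats import chisquare
from statsmodels.stats.inter_rater import fleiss_kappa


def chance_test(correct: pd.Series) -> tuple[float, float]:
    """Chi-square goodness-of-fit of observed accuracy against 50% chance."""
    n_correct = int(correct.sum())
    n_total = len(correct)
    stat, p = chisquare([n_correct, n_total - n_correct],
                        f_exp=[n_total / 2, n_total / 2])
    return stat, p


def agreement(df: pd.DataFrame) -> float:
    """Fleiss's kappa across poems, treating each rating as an anonymous rater.

    Assumes every poem received the same number of ratings, as the
    statsmodels implementation requires equal raters per item.
    """
    counts = (df.groupby(["poem_id", "judged_ai"]).size()
                .unstack(fill_value=0)     # rows: poems, columns: response categories
                .to_numpy())
    return fleiss_kappa(counts, method="fleiss")


# Usage (hypothetical data frame `df` with one row per rating):
# roughly 0.466 * 16,340 ≈ 7,615 correct responses would reproduce the
# reported chi-square value of about 75.
# stat, p = chance_test(df["correct"])
# kappa = agreement(df)
```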