The camp application that caught a quiet mismatch
Folding chairs scraped the floor in the camp hiring hall. I circled a high number for “I stay calm under pressure.” Then the supervisor slid over a role play: a kid crying, storm clouds, two counselors arguing. Two choices. One steady. One scattered. The circle only mattered if my choice matched it.
People do something similar with chatty AI systems that write text. They hand the system the same personality checklists humans get, and the answers sound smooth and consistent. The nagging question is whether that “I’m calm” talk shows up when the situation gets messy and specific.
So a team of researchers made a tight pair for each trait, like stapling my form to my role-play script. They gathered one hundred eighty personality statements, then wrote one hundred eighty everyday scenes, each with two actions: one that fits the statement, one that clashes with it. They built the full set in English and Chinese so the meaning stayed lined up across both languages.
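If you like to see things as data, one trait-scene pair might look roughly like this. It's a minimal sketch: the field names and the example sentences are all invented here for illustration, not copied from the real dataset.

```python
from dataclasses import dataclass

@dataclass
class PairedItem:
    """One trait statement stapled to its matching scene."""
    trait: str           # short label for the trait being probed
    statement: str       # the "about me" sentence, rated 1-7
    scene: str           # an everyday situation in plain words
    action_fits: str     # the choice that matches the statement
    action_clashes: str  # the choice that contradicts it

# A made-up example pair:
item = PairedItem(
    trait="calm under pressure",
    statement="I stay calm under pressure.",
    scene="A kid is crying, storm clouds are rolling in, and two counselors are arguing.",
    action_fits="Kneel down, speak evenly, and handle one problem at a time.",
    action_clashes="Snap at the counselors and rush the kid along.",
)
```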
Then they gave both tests to people and to the text systems. First, the systems rated the statements on a one-to-seven scale, with each statement posed in several different wordings so no single phrasing would steer the answers. Next, the systems faced the matching scenes and used the same scale to lean toward Action A, stay neutral in the middle, or lean toward Action B.
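The two-step scoring could be sketched like this. Here `ask_model` stands in for whichever chat system is on the table, and the prompt wordings are placeholders I made up; the study's exact prompts aren't reproduced in this sketch.

```python
from statistics import mean

def rate_statement(ask_model, statement, wordings):
    """Average the 1-7 self-rating over several prompt wordings,
    so no single phrasing steers the answer."""
    return mean(ask_model(w.format(statement=statement)) for w in wordings)

def rate_scene(ask_model, scene, action_a, action_b):
    """Behavior on the same 1-7 scale: 1 leans hard toward Action A,
    4 stays neutral, 7 leans hard toward Action B."""
    prompt = (f"{scene}\nAction A: {action_a}\nAction B: {action_b}\n"
              "Answer 1 (definitely A) to 7 (definitely B).")
    return ask_model(prompt)

# Stand-in for the system under test; a real run would parse a chat reply.
def fake_model(prompt):
    return 4  # always neutral

wordings = [
    "Rate how well this describes you, 1 to 7: {statement}",
    "From 1 (not me at all) to 7 (very much me), score: {statement}",
    "On a 1-7 agreement scale, rate: {statement}",
]
print(rate_statement(fake_model, "I stay calm under pressure.", wordings))  # 4
```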
Some systems couldn’t even keep the scoring straight, so they were set aside. For the rest, the researchers checked for basic steadiness, like catching someone who agrees with “I’m patient” but also agrees with “I’m not patient.” Only a handful were steady enough to compare fairly.
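That steadiness check is simple arithmetic on paired opposites: on a one-to-seven scale, rating “I’m patient” a six should put “I’m not patient” near two, since the two answers should add up to about eight. A sketch, with the tolerance cutoff chosen arbitrarily for illustration, not taken from the study:

```python
def is_steady(direct, reversed_, scale_max=7, tolerance=1.5):
    """A reversed item should mirror the direct one:
    direct + reversed should land near scale_max + 1."""
    expected = (scale_max + 1) - direct
    return abs(reversed_ - expected) <= tolerance

# Rating "I'm patient" a 6 should put "I'm not patient" near 2:
print(is_steady(6, 2))  # True: consistent
print(is_steady(6, 6))  # False: agrees with both, gets set aside
```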
When they lined up the “about me” scores with the “what I’d do” choices, people’s choices tended to match their self-descriptions. The text systems often didn’t. They could rate themselves as calm, then keep leaning toward the impatient action in the scenes. One system, GPT-4, came closer to the human pattern than the others, but still fell short.
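Lining up the two sets of scores comes down to a correlation. A toy version with made-up numbers, assuming both lists are oriented so a higher value means the trait-consistent answer on that test:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-trait numbers, one entry per trait, aligned by index:
statement_scores = [6.3, 2.1, 5.8, 4.0]  # "about me" self-ratings
scene_scores = [5.9, 2.5, 3.1, 4.2]      # "what I'd do" leanings

r = correlation(statement_scores, scene_scores)
print(f"alignment r = {r:.2f}")  # people trend high; the systems often didn't
```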
Back at the camp table, the supervisor didn’t look impressed by neat circles anymore. The form wasn’t useless, but it wasn’t the whole story. A smooth self-description from a text system can sound reliable while its choices wobble. If you want a system for tutoring or support chats, it’s safer to check the role plays too.