The Sticker Table That Taught the Library a Hard Lesson
In the library’s back room, a cart of new books waited beside rolls of colored stickers. Volunteers flipped pages and chose one sticker per book: family friendly or needs caution. The librarian took the most common sticker as the final call. The takeaway: how you combine small judgments shapes what the system learns.
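In code, the librarian’s rule is just a tally: count the stickers on a book and keep the most common one. Here is a minimal sketch in Python, with made-up volunteers and sticker names that are not from any real dataset:

    from collections import Counter

    def majority_vote(stickers):
        # Most common sticker wins; on a tie, Counter keeps whichever sticker it saw first.
        return Counter(stickers).most_common(1)[0][0]

    book_stickers = ["family friendly", "needs caution", "family friendly"]
    print(majority_vote(book_stickers))  # -> family friendly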
The librarian figured more opinions would cancel out quirks. But some volunteers weren’t random: a few kept judging certain authors, or certain kinds of characters, more harshly. When enough of those votes piled up, the “most common” sticker became a confident, unfair call.
To check whether this happens outside the library, a team looked at two big piles of past labeling where the right answers were already known. One pile involved judging case write-ups from a justice setting; the other involved tagging short online comments as toxic or not.
They scored each person two ways: how often the person matched the known right answer, and whether the person’s wrong calls hit one group harder than another. The surprise was simple: some volunteers were usually right and still leaned unfairly, like a careful sticker-picker who keeps flagging one kind of author.
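Those two scores are easy to write down: one number for how often a volunteer matches the known answer, and one for how unevenly the volunteer’s mistakes land across two groups. A minimal sketch with toy data (the items, stickers, and groups are invented for illustration):

    def accuracy(votes, truth):
        # Fraction of a volunteer's stickers that match the known right answer.
        return sum(votes[i] == truth[i] for i in votes) / len(votes)

    def error_gap(votes, truth, group):
        # Mistake rate on group "a" items minus mistake rate on group "b" items.
        def err(g):
            items = [i for i in votes if group[i] == g]
            return sum(votes[i] != truth[i] for i in items) / max(len(items), 1)
        return err("a") - err("b")

    votes = {1: "caution", 2: "friendly", 3: "caution", 4: "friendly"}
    truth = {1: "caution", 2: "friendly", 3: "friendly", 4: "friendly"}
    group = {1: "a", 2: "a", 3: "b", 4: "b"}
    print(accuracy(votes, truth))          # 0.75: usually right
    print(error_gap(votes, truth, group))  # -0.5: mistakes pile onto group "b"

A volunteer can score well on the first number and still have a lopsided second one, which is exactly the careful but leaning sticker-picker.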
Then they looked at majority voting. On many items, the set of volunteers doing the labeling was packed with ones who leaned, even under a lenient bar for calling someone leaned. The final sticker could flip away from the known right one, not from confusion but from sheer numbers. And kicking those volunteers out often made accuracy drop and left many items with too few stickers to use.
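Both effects fit in a few lines: when the tilted volunteers outnumber the rest, the tally flips; when you drop them, some books no longer have enough stickers to call. A minimal sketch, with invented volunteer names and an invented three-sticker minimum:

    from collections import Counter

    def aggregate(item_votes, excluded=(), min_stickers=3):
        # Majority vote after dropping excluded volunteers; give up if too few stickers remain.
        kept = [sticker for who, sticker in item_votes if who not in excluded]
        if len(kept) < min_stickers:
            return None
        return Counter(kept).most_common(1)[0][0]

    item = [("vol1", "needs caution"), ("vol2", "needs caution"), ("vol3", "family friendly")]
    print(aggregate(item))                             # needs caution: numbers win
    print(aggregate(item, excluded={"vol1", "vol2"}))  # None: too few stickers left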
They also tried fancier ways to combine votes. Sometimes the final stickers moved only a little closer to the known right answers while the group tilt stayed; sometimes the tilt got worse. When later tools learned from these skewed stickers, they became both less accurate and more uneven across groups. Fairness checks during sticker-making helped more than fixes applied after the stickers were set.
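One common family of fancier combiners weights each volunteer’s sticker by how reliable that volunteer seems, instead of counting every vote equally. The methods the team actually used are not named above, so this is only an illustrative weighted vote with invented weights, and it shows why the tilt can survive: if the most reliable volunteers share the same lean, weighting amplifies them.

    def weighted_vote(item_votes, reliability):
        # Add up each sticker's reliability weight and keep the heaviest sticker.
        totals = {}
        for who, sticker in item_votes:
            totals[sticker] = totals.get(sticker, 0.0) + reliability.get(who, 0.5)
        return max(totals, key=totals.get)

    reliability = {"vol1": 0.9, "vol2": 0.9, "vol3": 0.6}  # estimated accuracy per volunteer
    item = [("vol1", "needs caution"), ("vol2", "needs caution"), ("vol3", "family friendly")]
    print(weighted_vote(item, reliability))  # needs caution: the lean gets reinforced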