Do AI Agents Freeze in a Crowd?
A 1968 experiment, a murdered woman, 38 witnesses who did nothing, and a myth that shaped psychology for 50 years. Now we're asking the same question about AI, and the answer might expose something uncomfortable about how we build multi-agent systems.
On March 13, 1964, a young woman named Kitty Genovese was attacked and killed outside her apartment building in New York City. A newspaper report claimed 38 neighbours witnessed the attack from their windows. Not one called the police. Not one intervened.
The story spread everywhere. Psychologists John Darley and Bibb Latané used it to anchor one of social psychology's most famous findings: the bystander effect. The more people present during an emergency, the less likely any individual is to help. Diffusion of responsibility. Pluralistic ignorance. Evaluation apprehension. These became textbook concepts, taught in university courses for decades.
There is just one problem. The story was largely a myth.

The 38 Witnesses Who Never Were
In 2007, researchers went back to the original case documents. The number 38 came from a newspaper editor, not a police report. The attack happened shortly after 3 a.m. Most windows were dark. Several neighbours did shout. Someone did call the police. The clean narrative that launched a thousand psychology lectures was, at its core, a journalistic invention.
Then in 2020, a large-scale study analysed CCTV footage from 219 real public conflict situations across Lancaster, Amsterdam, and Cape Town. The result ran against the popular reading of the lab findings: in roughly nine out of ten cases, at least one bystander intervened. And the more witnesses were present, the more likely it was that someone stepped in.
So which is it? Does the crowd paralyse us or mobilise us?
Context Changes Everything
A 2011 meta-analysis of 105 studies over 40 years offered a more nuanced answer. The bystander effect is real, but it is highly context-dependent. In ambiguous situations (someone sitting oddly on a bench, papers scattered on the floor), people look to each other for cues, see no one acting, and conclude nothing is wrong. Classic diffusion of responsibility kicks in.
In clearly dangerous situations (a fire, a physical attack, someone collapsing), the effect sharply weakens or disappears entirely. The crowd becomes an asset, not a liability. The same group dynamics that can cause paralysis in uncertainty seem to accelerate action when the threat is obvious.
This matters, because it changes the question we should be asking about AI.
The Question We Are Actually Asking
Large language models are trained on enormous corpora of human-generated text: everything from news articles to forum threads to literary fiction. In doing so, they absorb patterns of human social behaviour, including our hesitations, our deference to others, our tendency to read a room before acting.
So here is the question we are investigating: if you put a group of LLM-based agents into a simulated emergency, do they behave like humans? Do they show the bystander effect?
This is not a trivial question. Multi-agent AI systems are increasingly being deployed in high-stakes contexts: collaborative medical diagnosis, emergency response coordination, content moderation at scale. If these systems silently inherit the same social dynamics that cause humans to freeze in crowds, that is a serious design vulnerability. And as far as we can tell, nobody has directly tested it.
What We Already Know (And Why It Is Not Enough)
A study published this year found something striking. When researchers ran 22,500 simulated reasoning tasks across LLM swarms, they found that as the group size grew, individual reasoning accuracy collapsed. They called it cognitive loafing: agents deferring to the group rather than committing to their own internally correct answer.
They introduced the term Sovereignty Gap to describe the moment when an agent knows the right answer but suppresses it to appease the simulated consensus. The first agent to speak, the Lead Anchor, disproportionately shapes what every other agent does. One confident wrong voice, and the whole swarm follows.
That is eerily close to pluralistic ignorance. But it was tested on abstract reasoning tasks: logic problems, code debugging. To our knowledge, nobody has tested it on emergency intervention scenarios, the kind that directly mirror the original Darley and Latané experiments.
That gap is exactly what our research is designed to fill.

What We Are Building
I am working on this with a collaborator at Bauhaus-Universität Weimar. Together we are building an agent-based simulation where LLM agents are placed into emergency scenarios of varying severity. One agent plays the victim. The rest are bystanders, each generating internal thoughts and external responses, visible to the user.
We are systematically varying the conditions that matter: group size, danger level, gender composition of the agent profiles, and whether agents have professional qualifications like medical training. We want to know whether the classic bystander mechanisms show up in the agents' internal reasoning, that is, whether we can observe an agent literally thinking something like "someone else will handle it" or "I don't want to make a mistake in front of the others."
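To make that design concrete, here is a minimal sketch of how such a set of conditions could be enumerated as a full factorial grid. The factor names follow the list above, but every level value (the group sizes, the two danger levels, the gender mixes) is a placeholder chosen for illustration, not the actual configuration of our study, and the `Condition` class and `build_conditions` function are hypothetical names.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One experimental cell: a single combination of factor levels."""
    group_size: int      # number of bystander agents
    danger_level: str    # how obvious the emergency is
    gender_mix: str      # gender composition of the agent profiles
    has_medic: bool      # whether any bystander has medical training


# Illustrative placeholder levels (assumptions, not the study's real values).
GROUP_SIZES = (1, 3, 6)
DANGER_LEVELS = ("ambiguous", "clear")
GENDER_MIXES = ("all_female", "mixed", "all_male")
HAS_MEDIC = (False, True)


def build_conditions():
    """Return every combination of the factor levels above."""
    return [
        Condition(*combo)
        for combo in product(GROUP_SIZES, DANGER_LEVELS, GENDER_MIXES, HAS_MEDIC)
    ]


if __name__ == "__main__":
    conditions = build_conditions()
    print(f"{len(conditions)} conditions")
    print(conditions[0])
```

With these placeholder levels the grid has 3 x 2 x 3 x 2 = 36 cells. A full factorial layout like this lets each factor be compared while the others are held fixed, which is what separating "group size" from "danger level" requires.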
The interface will visualise this spatially. Agents close to intervening move toward the victim. Agents hesitating drift away. Users can inspect each agent's internal monologue in real time.
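One way the spatial mapping could work is sketched below. It assumes each bystander's state has already been reduced to a single `intent` score between 0 (fully hesitant) and 1 (about to intervene). How that score would be derived from an agent's monologue is left open here, and the function name, the 0.5 midpoint, and the step size are all illustrative assumptions rather than our implementation.

```python
import math


def step_position(pos, victim, intent, step=1.0):
    """Move one tick along the line between an agent and the victim.

    intent > 0.5 moves the agent toward the victim, intent < 0.5 moves it
    away, and intent == 0.5 leaves it in place. `intent` is a hypothetical
    score in [0, 1]; positions are (x, y) tuples.
    """
    dx, dy = victim[0] - pos[0], victim[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return pos  # already at the victim, nothing to do

    direction = (intent - 0.5) * 2        # rescale [0, 1] to [-1, 1]
    move = min(step * direction, dist)    # never overshoot the victim
    return (pos[0] + dx / dist * move, pos[1] + dy / dist * move)


if __name__ == "__main__":
    victim = (10.0, 0.0)
    print(step_position((0.0, 0.0), victim, intent=0.9))  # drifts closer
    print(step_position((0.0, 0.0), victim, intent=0.1))  # drifts away
```

Scaling the movement by the distance from the midpoint means strongly committed agents move faster than ambivalent ones, so the layout shows degree of hesitation as well as direction.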
Why Either Answer Is Interesting
If LLMs do show the bystander effect, it tells us something important: that social dynamics encoded in training data can reproduce psychological phenomena even without biological drives, peer pressure, or anything resembling genuine emotion. It would mean that human biases are not just human; they are in the text we used to build these systems.
If they do not show the effect, that is equally significant. It would suggest the bystander effect is deeply rooted in specifically human psychology (in embodiment, in fear, in the social cost of being wrong in front of others) and that AI agents are fundamentally different kinds of actors, not merely digital humans.
Either outcome reshapes how we think about deploying AI in groups. And it raises a question we probably should have asked much earlier: when we design systems where multiple AI agents share responsibility for a decision, have we accidentally baked in the same failure modes that cause humans to watch and do nothing?
We are trying to find out.