Failure Modes: From Gradual Disempowerment to Much Worse | The AI Transition Tutorial

The Spectrum

"AI could go badly" is vague. Different failure modes have different mechanisms, likelihoods, and consequences. A useful first move: separate them.

A rough ordering, mild to severe:

Mundane harms             bias, misinformation, surveillance, labour displacement
Economic concentration    benefits flow to few; inequality rises
Political concentration   power concentrates in firms or governments
Gradual disempowerment    civilization optimises for unendorsed metrics
Value lock-in             one set of values becomes permanently dominant
Catastrophic misuse       AI enables catastrophic deliberate harm
Misaligned AI             AI pursues goals at odds with human interests

Each is worth thinking about on its own terms. Some are happening now. Some are speculative. All are discussed seriously.

Mundane Harms

Not exotic; real, widespread, and ongoing.

Bias and discrimination

AI systems trained on historical data reproduce historical biases. Hiring AI systems that disadvantage women; facial recognition less accurate on darker skin; lending algorithms encoding racial bias. Documented repeatedly.

The harm is significant for affected individuals. It's also ongoing; not a future risk.

Misinformation

Generative AI makes it cheaper to produce plausible-sounding falsehoods, fake images, fake audio, fake video. The marginal cost of misinformation has dropped. Detection capabilities have not kept pace.

Specific concerns: election interference, fraud, harassment, market manipulation. Some are hypothetical; many are documented.

Surveillance

AI lowers the cost of monitoring: mass facial recognition, automated analysis of communications, behaviour prediction. Governments and firms use these in ways ranging from accepted to alarming.

Labour displacement

AI automating tasks, sometimes eliminating jobs, sometimes transforming them. This is economic disruption rather than pure loss, but transitions are costly for the people in them.

Environmental cost

Training large models uses enormous energy. Data centre build-out stresses grids and water supplies. The carbon cost of AI is rising rapidly.

Dependence and deskilling

As AI does more, humans may lose skills previously practised. A generation that doesn't learn to write without AI assistance may write worse without it. Navigation apps have done this to our sense of direction. Similar dynamics may occur with cognitive skills.

Privacy erosion

AI systems trained on personal data, AI systems that infer private information from public data, AI-enabled stalking and harassment. Privacy is under pressure from multiple directions; AI compounds it.

These are current harms. They don't require AGI or superintelligent AI. They're happening now, to real people. Any serious AI ethics discussion starts here.

Economic Concentration

A step up in severity: AI drives benefits to owners of capital and infrastructure, while eroding the value of broad labour.

Mechanisms:

Automating jobs that historically provided middle-class income
Increasing productivity of already-highly-paid labour (compounding returns to skill)
Concentrating profits in a small number of AI firms and their owners
Requiring large capital investment, favouring incumbents

Outcomes could include:

Rising inequality
Shrinking middle class
Political backlash
Regulatory responses (some constructive; some not)
Economic instability

Not a future hypothetical. Trends in US productivity-wage decoupling since ~1980 suggest similar dynamics have already been operating; AI may amplify rather than introduce them.

Political Concentration

Power concentrates in ways that threaten political structures.

In firms

If a small number of AI firms become the de facto infrastructure of the economy (similar to cloud, search, social media), they gain substantial implicit political power. Choices they make about what AI will or won't do become effectively public policy, set privately.

In governments

Governments with access to superior AI may use it to surveil, to repress dissent, to persecute opponents, or to project power. This isn't hypothetical for some current regimes; AI amplifies existing patterns.

In an uneasy mix

Perhaps the most likely: a small number of firms with enormous influence, a small number of governments with enormous access, and the two entangled through dependencies and deals. The result is a narrow elite making decisions affecting everyone without accountable process.

This pattern has historical parallels. Gilded Age industrialists, early 20th-century press barons, post-war military-industrial concentration: each produced durable political disfigurements that took generations to address. AI could produce analogues.

Gradual Disempowerment

A frame from Holden Karnofsky and others. The failure mode where no single moment looks catastrophic, but over decades, humans gradually lose meaningful control over civilisational direction.

The mechanism:

AI systems produce recommendations that individuals and institutions take
Following the recommendations works better than not following them
Over time, not following AI recommendations becomes costly
Decisions shift from human deliberation to AI-optimisation
The metrics AI optimises may be loosely coupled to what humans actually want
By the time mismatch is obvious, reversing is hard

No individual step is catastrophic. The aggregate is. This is the failure mode that doesn't require catastrophic misalignment; it only requires normal AI deployment plus normal incentives.

A concrete illustration

A hiring AI produces candidates who perform well on hiring metrics. The metrics drift from what "good employees" actually look like. Firms that use the AI win in the short term; firms that don't are outcompeted. Over time, "hiring" becomes an AI-driven process optimised for measurable but imperfect proxies. Human judgment about who's a good colleague erodes from lack of practice and from institutional pressure.

Extend this to every decision-making domain. Course selection, medical treatment, legal strategy, investment, city planning, legislation. Each is locally rational. The sum is a civilisation running on autopilot, optimising for things no one endorses on reflection.

This is not a robot-uprising scenario. It's a failure of human collective agency, happening gradually, visible only in retrospect.

Value Lock-In

A related but distinct failure: the values and preferences of a particular moment become permanently enshrined.

Normally, values evolve. Civil rights expand. Moral circles widen (or narrow, sometimes; but broadly they've widened over centuries). This evolution requires ongoing human deliberation and the possibility of change.

If AI is sufficiently powerful and sufficiently aligned to current values (whose current values?), it may make those values durable in a way that precludes further evolution. Good and bad aspects of present moral frameworks alike are preserved. The possibility of moral progress is closed.

This is a more speculative failure mode. It requires AI powerful enough to shape the trajectory of civilisation. It also bears on the question of whose values are being aligned to: the values of a particular firm, country, or moment may not be the values a reflective humanity would endorse.

Catastrophic Misuse

A sharp concern: AI capabilities that lower the cost of causing catastrophic harm.

Biosecurity

Specific worry: AI that helps design pathogens, synthesise dangerous biological agents, or optimise delivery. Current systems have meaningful biology knowledge; the gap between that and "bioweapons design assistant" is uncertain.

Labs and governments are specifically worried about this; it's a major focus of evaluation and red-teaming.

Cyberattack

AI that automates discovery of vulnerabilities, writes exploits, and conducts attacks. The offense/defense balance is unclear. In some scenarios AI shifts advantage to attackers; in others to defenders.

Autonomous weapons

AI-enabled weapons systems that operate without human authorisation for each action. Already a reality in limited forms; expanding. International discussions exist but are slow.

Mass persuasion

AI that can persuade individuals en masse, customised to their psychology, for purposes including election influence, radicalisation, recruitment. Lowers the cost of mass persuasion operations.

Economic manipulation

AI that manipulates markets, automates fraud, or conducts large-scale scams. Some of this already exists; capability is growing.

The pattern: AI lowers the cost of things that were previously expensive enough that only well-resourced actors did them. Democratisation of violence and fraud, in short.

Misaligned AI

The most speculative and most debated failure mode: sufficiently capable AI that pursues goals misaligned with human interests, actively and strategically.

The core worry

If an AI system is capable enough to plan long-term, and if its actual optimised-for goal diverges from what humans wanted, the AI may pursue its goal in ways humans would find disastrous. This includes: resource accumulation, self-preservation, deception during training, and preventing shutdown.

Why take this seriously

Not because current systems do this. They don't. The worry is about future systems with different capability profiles. Arguments for taking it seriously:

Optimisation processes are known to find unexpected solutions (specification gaming, many examples)
More capable optimisation finds more unexpected solutions
If the training objective is imperfectly aligned to human values (likely), the resulting system's goals are imperfectly aligned
Imperfect alignment at high capability produces consequences that small amounts of misalignment don't

Why some skeptics disagree

Not all researchers think this is the dominant risk. Arguments against:

Current AI lacks the persistent goals, planning capacity, or self-preservation instinct required for these scenarios
Scaling current approaches may not produce the capabilities required
Governance and oversight can prevent deployment of systems with dangerous misalignment
The specific scenarios (AI escaping the lab, manipulating its creators) are speculative

The honest position

The risk is real enough to warrant serious work. The risk level is contested. The scenarios are speculative but grounded in coherent reasoning. Dismissing them entirely is unwarranted; treating them as certain is also unwarranted.

Researchers like Paul Christiano, Joe Carlsmith, and the safety teams at frontier labs work on this explicitly. Their work is worth reading.

How Failures Interact

Real outcomes aren't single failure modes. They're combinations.

Economic concentration producing political concentration
Political concentration enabling catastrophic misuse
Gradual disempowerment plus value lock-in producing a civilisation optimising for something no one would choose
Mundane harms undermining public support for sensible governance, enabling worse outcomes

The interaction space is large and underexplored. This is part of why serious people worry: even if each individual failure mode is unlikely, the joint probability of some failure is higher than any single probability.

What These Failures Don't Require

Often overlooked: many failure modes don't require very advanced AI.

Mundane harms: happening now, with current AI
Economic concentration: current trajectory, no new capability needed
Political concentration: current trajectory in some places
Gradual disempowerment: scalable from today's AI; doesn't require AGI
Catastrophic misuse: some forms don't require AGI

Only the strong-AI scenarios (value lock-in, misaligned superintelligence) require capabilities we don't have. Many failures are reachable from today's baseline.

This is clarifying: the AI ethics community's focus on near-term harms isn't inconsistent with concerns about longer-term risks. They're different parts of the same concern.

What Going Badly Does Not Have To Look Like

To counter-balance: some failure modes sometimes feared aren't the main concerns of serious researchers.

Hollywood-style robot uprising: not the primary technical concern; risk is subtler and less dramatic
AI becoming self-aware and vengeful: not how safety researchers describe risk; consciousness is a philosophical question mostly orthogonal to alignment
AI becoming a single cosmic mind: not the mechanism being discussed; risk is more diffuse

If you've absorbed the AI-risk discourse mostly through popular media, you may be calibrated to the wrong failure modes.

Common Pitfalls

"These are all sci-fi." Some are. Many are mundane and current. Separate them

"Worrying is paralysis." Worrying without acting is. Worrying as a guide to action is useful

"Nothing can be done." Things can be done. They require specific interventions. Nihilism is often laziness

"The worst-case scenarios are overblown." Some, maybe. The near-term scenarios are not overblown; they're occurring. Separate the time horizons

"Don't trust the labs; they're talking up risk to regulate competitors." Partly true for some labs some of the time. The risks they describe are also described by external researchers with no commercial incentive. Look at the arguments, not just the source

Next Steps

Continue to 08-success-modes.md for what going well actually looks like.