What Is Happening: Current Capabilities and Trajectory | The AI Transition Tutorial

This chapter describes what AI systems can and can't do as of the mid-2020s, what the scaling trajectory looks like, and where the honest uncertainty lies.

The Short Version

Current large language models (GPT-4/5, Claude 3/4, Gemini, open-source models like Llama and Mistral) plus multimodal systems can do tasks that were solidly science fiction five years ago. They also fail at tasks that a bright 12-year-old finds easy. Both are true. The asymmetry matters.

A practical summary of current capabilities:

Writing                   competent drafts, often excellent with guidance
Code                      writes working software at the level of a competent junior
Summarisation             reliable for most content; occasional errors
Translation               generally excellent for major languages
Visual reasoning          models can describe images, pass visual tests, with gaps
Math                      improving fast; still unreliable for novel problems
Science reasoning         sometimes graduate-level, largely via pattern-matching
Agents                    can execute multi-step tasks; reliability varies
Long-context reasoning    improving; still brittle past moderate complexity
Physical world reasoning  weaker than linguistic; grounding remains a gap

Capabilities are uneven. A system that writes a good essay may make basic arithmetic mistakes. This unevenness is itself a feature of current AI: skills don't come bundled in the way a human's do.

What Actually Works

Some domains where current systems are reliably useful:

Writing and editing

Drafts, revisions, tone adjustment, summarisation. Professional writers increasingly use AI to draft, critique, or rephrase. Not because AI writes better than good writers, but because it's fast, tireless, and often produces a useful starting point.

Software engineering

AI pair-programming has become a standard part of professional workflows. The impact varies by task: AI is strong on boilerplate, common patterns, and explaining existing code; weaker on novel architecture and deep debugging. Productivity gains are real and measurable; their magnitude is hotly contested.

Customer service and routine office work

Structured tasks with clear success criteria: scheduling, email triage, document processing. Automation has been possible for decades; AI extends it further.

Translation and localisation

Near-human quality for major language pairs in many domains. Specialised domains (legal, medical) still benefit from human review.

Research assistance

Summarising papers, finding connections, generating hypotheses. AI isn't replacing researchers; it's speeding up parts of research work.

Education

Tutoring, homework help, foreign-language practice. Highly variable in quality. The best implementations are meaningfully better than nothing; the worst are misleading.

What Doesn't Work

Domains where current AI is unreliable, often dangerously so:

Accurate factual retrieval

Models hallucinate: confidently generate plausible-sounding falsehoods. Rates vary by domain and model generation. For anything where the truth matters, verification is required.

Genuinely novel reasoning

Models are largely trained on human output. Novel problems (outside the training distribution) often produce confident nonsense. Research tasks that require creative leaps rather than pattern-matching are still human territory.

Long-horizon coherent action

Getting an AI system to execute a week-long project reliably remains hard. Memory limits, accumulating errors, and context-switching failures make sustained autonomous work unreliable.
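The "accumulating errors" point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a long task is a chain of independent steps that each succeed with the same probability; real agent failures are neither independent nor uniform, and the function names are hypothetical.

```python
import math


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n_steps independent steps all succeed."""
    return p_step ** n_steps


def max_steps_for(p_step: float, target: float) -> int:
    """Largest number of steps for which overall success stays >= target."""
    return math.floor(math.log(target) / math.log(p_step))


if __name__ == "__main__":
    # Illustrative per-step reliabilities, not measured values.
    for p in (0.95, 0.99, 0.999):
        print(
            f"p_step={p}: 100 steps -> {chain_success(p, 100):.3f}, "
            f"50% horizon -> {max_steps_for(p, 0.5)} steps"
        )
```

Under these assumptions, a system that is 99% reliable per step finishes a 100-step task only about 37% of the time, and drops below even odds after 68 steps. Small per-step error rates compound quickly over long horizons.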

Physical common sense

Models trained mostly on text struggle with physical reality. They often misrepresent simple physical scenarios.

Ethical judgment

Models generate responses optimised for approval, not necessarily wisdom. "What would you do in situation X" produces plausible-sounding guidance that you shouldn't bet important decisions on.

Consistency under pressure

A model can give good answers in normal conditions and wildly wrong ones when pushed, misled, or given adversarial input.

The Scaling Trajectory

A specific empirical observation shaped the field's trajectory. Scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, the "Chinchilla" paper; various updates) found that as you scale model size, training data, and compute, training loss falls predictably, roughly as a power law, and performance on many tasks improves with it. The trend has held over many orders of magnitude.

This is not a guarantee. It's an empirical regularity that has held for some time. Several consequences:

Emergence

At certain scales, models start doing things smaller models couldn't. Arithmetic, multi-step reasoning, code generation: each emerged at a specific scale threshold. Whether further emergent capabilities exist at further scale is an active research question.

Diminishing returns

Scaling continues to produce improvements; improvements per dollar may be declining. Doubling compute doesn't double capability. Progress is still real but not automatic.
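A power-law curve shows why doubling compute doesn't double capability. The sketch below assumes a loss of the form L(C) = a * C^(-alpha) with an illustrative exponent of 0.05; the constants are assumptions chosen to show the shape, not fitted values, and loss is only a proxy for capability. Published scaling laws also typically include an irreducible-loss term, omitted here.

```python
def loss(compute: float, a: float = 1.0, alpha: float = 0.05) -> float:
    """Hypothetical power-law loss curve: L(C) = a * C**(-alpha)."""
    return a * compute ** (-alpha)


def loss_ratio_per_doubling(alpha: float = 0.05) -> float:
    """Factor by which loss shrinks each time compute doubles."""
    return 2.0 ** (-alpha)


def compute_multiplier_to_halve_loss(alpha: float = 0.05) -> float:
    """How many times more compute is needed to cut loss in half."""
    return 2.0 ** (1.0 / alpha)


if __name__ == "__main__":
    ratio = loss_ratio_per_doubling()
    print(f"Each doubling of compute multiplies loss by {ratio:.4f}")
    print(f"Halving loss needs ~{compute_multiplier_to_halve_loss():.3g}x compute")
```

With this assumed exponent, each doubling of compute trims loss by about 3.4%, and halving the loss would take roughly a million times more compute. The improvement never stops, but each increment costs more than the last.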

The bet on scaling

Leading labs (OpenAI, Anthropic, Google DeepMind, Meta) all bet heavily on scaling, with various supplementary research directions. The bet is partly that current architectures, with more compute and data, continue to produce useful improvements. It may be right; it may plateau.

Compute

The compute used to train frontier AI models has grown roughly 4-6x per year for several years. Some specific data points:

GPT-3 (2020): ~3e23 FLOPs
GPT-4 (2023): ~2e25 FLOPs (roughly 70x GPT-3)
Reported 2024-2025 models: another order of magnitude
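As a sanity check on these figures, the implied annual growth rate can be worked out directly. This is a minimal sketch using only the two approximate GPT-3 and GPT-4 estimates listed above and treating the gap as exactly three years; the function names are hypothetical.

```python
import math

GPT3_FLOPS = 3e23  # approximate estimate, 2020
GPT4_FLOPS = 2e25  # approximate estimate, 2023


def annual_growth(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years)


def doubling_time_months(growth_per_year: float) -> float:
    """Months needed to double at a given yearly growth multiplier."""
    return 12.0 * math.log(2.0) / math.log(growth_per_year)


if __name__ == "__main__":
    ratio = GPT4_FLOPS / GPT3_FLOPS
    growth = annual_growth(GPT3_FLOPS, GPT4_FLOPS, years=2023 - 2020)
    print(f"GPT-4 / GPT-3 compute ratio: ~{ratio:.0f}x")
    print(f"Implied growth: ~{growth:.1f}x per year")
    print(f"Implied doubling time: ~{doubling_time_months(growth):.1f} months")
```

The two estimates imply a ratio of about 67x, which works out to roughly 4x per year, or a doubling about every six months. That sits at the lower end of the ~4-6x range quoted above.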

This has several consequences:

The frontier is expensive (hundreds of millions to billions of dollars per training run)
Only a small number of actors can afford the frontier
Infrastructure (data centres, chips, power) is increasingly a constraint
Geopolitics of chips (US export controls to China) matters

Compute economics partly drive the concentration dynamic that chapter 04 discusses.

Where the Honest Uncertainty Is

Several questions are genuinely open:

Will scaling continue to produce improvements?

Plausibly yes. Plausibly decreasingly. The transition from "LLM just predicts tokens" to "LLM reasons" was a specific emergence; further such emergences are not guaranteed. Labs are working on architectural and training innovations (mixture of experts, test-time compute, reasoning models) that may open up further progress.

Honest forecasters disagree substantially. Nobody knows.

Are current architectures the right architectures?

LLM-based systems might be on a path to more general intelligence. They might be an impressive but fundamentally limited approach. Some researchers (notably Yann LeCun) argue that the current paradigm misses key aspects of intelligence; others think the paradigm will scale until we're surprised.

What happens between "useful" and "transformative"?

Current systems help with specific tasks. Transformative AI would do something qualitatively different: sustained autonomous economic activity, novel scientific research, large-scale job replacement. The gap between current and transformative is not nothing. It may be closable soon; it may require breakthroughs we don't have.

Are reliability problems fundamentally solvable?

Hallucinations, adversarial robustness, honest calibration: all hard current problems. Whether they're engineering problems (solvable with enough work) or fundamental (requiring different approaches) is contested.

What Skeptics Emphasise

A useful corrective to AI-boosterism: skeptics point to real limits.

Melanie Mitchell: current systems lack genuine understanding; pattern matching is not reasoning
Gary Marcus: LLMs have persistent failure modes that may require hybrid symbolic approaches
Emily Bender: "stochastic parrots" framing; LLMs produce text without comprehension
Arvind Narayanan, Sayash Kapoor: much commercial AI is snake oil; rigorous evaluation beats marketing

These critics aren't denying capabilities exist. They're questioning whether the current generation of AI is on a path to genuine understanding or will hit a ceiling before that. The debate is real; so are the limits they describe.

What Believers Emphasise

The case for expecting transformation is built on:

Steep capability gains year over year
Emergent capabilities as compute grows
Economic traction in many real use cases
Significant funding driving further investment
Theoretical arguments that current approaches may continue scaling

Key believers:

Holden Karnofsky's "Most Important Century" argues we're plausibly in a transformative period
Researchers at frontier labs generally express serious expectations about further capabilities
Investors and policy people increasingly take transformative AI seriously as a working hypothesis

The Range of Plausible Trajectories

A rough map:

Plateau                  Current approaches plateau; AI is useful but not transformative
Gradual improvement      Steady progress over decades; integration in many fields
Rapid expansion          Scaling continues; capabilities cross important thresholds in 5-15 years
Explosive takeoff        Recursive improvement: AI helps build more capable AI, fast

Serious people place probability mass on each of these. Nobody should be certain which is true.

How to Watch

A few practical habits:

Track capability benchmarks

AI capability improves in measurable ways: pass rates on exams, benchmark scores, agent task completion. These aren't the whole picture, but they're objective data. Reports from METR and the Stanford AI Index track this.

Track compute

Where compute is flowing, what chips are going to whom, how data centres are being built. Energy and infrastructure are leading indicators; they show where the industry is betting.

Track deployment

What AI is doing in real economic activity. Are companies using AI more or less intensively? Are jobs changing? What does productivity data show? Look for signal over narrative.

Track research

New architectures, new training methods, new safety approaches. These preview what might be possible in 1-3 years.

Common Pitfalls

"The demos always lie." Often true. Also: the demos sometimes understate. A demo that works in a curated setting may fail in deployment, or vice versa. Watch for sustained usage data, not demos alone.

"It just predicts the next token." Accurate as mechanism; incomplete as explanation. The same is true of human speech in some reductive sense. What matters is what the system can accomplish, not the mechanical description of how.

"GPT-4 was peak; it's been downhill." Not supported by benchmark data, though depending on usage, a particular user's experience may not be improving.

"Everything I see about AI is hype." Some is. Some isn't. Distinguishing requires looking at real outputs, not headlines.

"We're months away from AGI." Some say this; most don't. The odds it happens in the next few years are non-trivial but far from certain. Be skeptical of confident short timelines and confident long timelines alike.

Next Steps

Continue to 03-alignment.md for the central technical and philosophical problem.