What Is Happening: Current Capabilities and Trajectory
This chapter describes what AI systems can and can't do as of the mid-2020s, what the scaling trajectory looks like, and where the honest uncertainty lies.
The Short Version
Current large language models (GPT-4/5, Claude 3/4, Gemini, open-source models like Llama and Mistral) plus multimodal systems can do tasks that were solidly science fiction five years ago. They also fail at tasks that a bright 12-year-old finds easy. Both are true. The asymmetry matters.
A practical summary of current capabilities:
- Writing: competent drafts, often excellent with guidance
- Code: writes working software at the level of a competent junior
- Summarisation: reliable for most content; occasional errors
- Translation: generally excellent for major languages
- Visual reasoning: models can describe images and pass visual tests, with gaps
- Math: improving fast; still unreliable for novel problems
- Science reasoning: sometimes pattern-matches research at graduate level
- Agents: can execute multi-step tasks; reliability varies
- Long-context reasoning: improving; still brittle past moderate complexity
- Physical world reasoning: weaker than linguistic; grounding remains a gap
Capabilities are uneven. A system that writes a good essay may make basic arithmetic mistakes. This unevenness is itself a feature of current AI: skills don't come bundled in the way a human's do.
What Actually Works
Some domains where current systems are reliably useful:
Writing and editing
Drafts, revisions, tone adjustment, summarisation. Professional writers increasingly use AI to draft, critique, or rephrase. Not because AI writes better than good writers, but because it's fast, tireless, and often produces a useful starting point.
Software engineering
AI pair-programming has become a standard part of professional workflows. The impact varies by task: AI is strong on boilerplate, common patterns, and explaining existing code; weaker on novel architecture and deep debugging. Productivity gains are real and measurable, but their magnitude is hotly contested.
Customer service and routine office work
Structured tasks with clear success criteria: scheduling, email triage, document processing. Automation has been possible for decades; AI extends it further.
Translation and localisation
Near-human quality for major language pairs in many domains. Specialised domains (legal, medical) still benefit from human review.
Research assistance
Summarising papers, finding connections, generating hypotheses. AI isn't replacing researchers; it's speeding up parts of research work.
Education
Tutoring, homework help, foreign-language practice. Highly variable in quality. The best implementations are meaningfully better than nothing; the worst are misleading.
What Doesn't Work
Domains where current AI is unreliable, often dangerously so:
Accurate factual retrieval
Models hallucinate: they confidently generate plausible-sounding falsehoods. Rates vary by domain and model generation. For anything where the truth matters, verification is required.
Genuinely novel reasoning
Models are largely trained on human output. Novel problems (outside the training distribution) often elicit confident nonsense. Research tasks that require creative leaps rather than pattern-matching are still human territory.
Long-horizon coherent action
Getting an AI system to execute a week-long project reliably remains hard. Memory limits, accumulating errors, and context-switching failures make sustained autonomous work unreliable.
Physical common sense
Models trained mostly on text struggle with physical reality. They often misrepresent simple physical scenarios.
Ethical judgment
Models generate responses optimised for approval, not necessarily wisdom. Asking "What would you do in situation X?" produces plausible-sounding guidance that you shouldn't bet important decisions on.
Consistency under pressure
A model can give good answers in normal conditions and wildly wrong ones when pushed, misled, or given adversarial input.
The Scaling Trajectory
A specific empirical observation shaped the field's trajectory. Scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, the "Chinchilla" paper; various updates) found that as you scale model size, training data, and compute, model loss falls along a smooth, predictable curve, and performance on many tasks improves with it. The trend has held across many orders of magnitude.
This is not a guarantee. It's an empirical regularity that has held for some time. Several consequences:
Emergence
At certain scales, models start doing things smaller models couldn't. Arithmetic, multi-step reasoning, code generation: each emerged at a specific scale threshold. Whether further emergent capabilities exist at further scale is the subject of heavy research.
Diminishing returns
Scaling continues to produce improvements; improvements per dollar may be declining. Doubling compute doesn't double capability. Progress is still real but not automatic.
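To make the diminishing-returns point concrete, here is a minimal sketch that assumes loss follows a simple power law in training compute, L(C) = a * C^(-alpha). The constants `a` and `alpha` are illustrative placeholders, not values fitted to any real model family, and loss is only a rough proxy for capability.

```python
# Illustrative power-law scaling curve. The constants below are assumptions
# chosen for readability, not values fitted to any real model family.

def loss(compute_flops: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Hypothetical loss as a power law in training compute."""
    return a * compute_flops ** (-alpha)

def improvement_per_doubling(alpha: float = 0.05) -> float:
    """Fractional loss reduction from doubling compute: 1 - 2^(-alpha)."""
    return 1.0 - 2.0 ** (-alpha)

# Loss keeps falling as compute grows by factors of 100...
for c in (1e21, 1e23, 1e25):
    print(f"compute={c:.0e} FLOPs  loss={loss(c):.3f}")

# ...but each doubling buys the same small fractional improvement.
print(f"loss reduction per doubling: {improvement_per_doubling():.1%}")
```

Under these assumed constants, every doubling of compute trims loss by only about 3.4%, no matter how large the starting budget. That is the shape of the argument: the curve never stops improving, but each increment costs twice as much as the last.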
The bet on scaling
Leading labs (OpenAI, Anthropic, Google DeepMind, Meta) all bet heavily on scaling, with various supplementary research directions. The bet is partly that current architectures, with more compute and data, continue to produce useful improvements. It may be right; it may plateau.
Compute
The compute used to train frontier AI models has grown roughly 4-6x per year for several years. Some specific data points:
- GPT-3 (2020): ~3e23 FLOPs
- GPT-4 (2023): ~2e25 FLOPs (roughly 70x GPT-3)
- Reported 2024-2025 models: another order of magnitude
This has several consequences:
- The frontier is expensive (hundreds of millions to billions of dollars per training run)
- Only a small number of actors can afford the frontier
- Infrastructure (data centres, chips, power) is increasingly a constraint
- Geopolitics of chips (US export controls to China) matters
Compute economics partly drive the concentration dynamic that chapter 04 discusses.
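The GPT-3 and GPT-4 figures quoted in this section can be turned into an implied annual growth rate with a few lines of arithmetic. This is a back-of-envelope sketch: both FLOP counts are outside estimates, and the variable names are only for illustration.

```python
# Back-of-envelope check on the compute figures quoted in this section.
# Both FLOP counts are estimates, so treat the result as rough.

gpt3_flops = 3e23   # GPT-3 (2020), as quoted in the text
gpt4_flops = 2e25   # GPT-4 (2023), as quoted in the text
years = 2023 - 2020

ratio = gpt4_flops / gpt3_flops        # total growth between the two runs
annual_growth = ratio ** (1 / years)   # implied geometric-mean growth per year

print(f"total growth: ~{ratio:.0f}x over {years} years")
print(f"implied annual growth: ~{annual_growth:.1f}x per year")
```

The two data points imply growth of about 4x per year, at the low end of the 4-6x range, which is what you would expect from a comparison between just two training runs.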
Where the Honest Uncertainty Is
Several questions are genuinely open:
Will scaling continue to produce improvements?
Plausibly yes. Plausibly with diminishing returns. The transition from "LLM just predicts tokens" to "LLM reasons" was a specific emergence; further such emergences are not guaranteed. Labs are working on innovations beyond raw scale (mixture of experts, test-time compute, reasoning models) that may open up further progress.
Honest forecasters disagree substantially. Nobody knows.
Are current architectures the right architectures?
LLM-based systems might be on a path to more general intelligence. They might be an impressive but fundamentally limited approach. Some researchers (notably Yann LeCun) argue that the current paradigm misses key aspects of intelligence; others think the paradigm will scale until we're surprised.
What happens between "useful" and "transformative"?
Current systems help with specific tasks. Transformative AI would do something qualitatively different: sustained autonomous economic activity, novel scientific research, large-scale job replacement. The gap between current and transformative is not nothing. It may be closable soon; it may require breakthroughs we don't have.
Are reliability problems fundamentally solvable?
Hallucinations, adversarial robustness, honest calibration: all hard current problems. Whether they're engineering problems (solvable with enough work) or fundamental (requiring different approaches) is contested.
What Skeptics Emphasise
A useful corrective to AI-boosterism: skeptics point to real limits.
- Melanie Mitchell: current systems lack genuine understanding; pattern matching is not reasoning
- Gary Marcus: LLMs have persistent failure modes that may require hybrid symbolic approaches
- Emily Bender: "stochastic parrots" framing; LLMs produce text without comprehension
- Arvind Narayanan, Sayash Kapoor: much commercial AI is snake oil; rigorous evaluation beats marketing
These critics aren't denying capabilities exist. They're questioning whether the current generation of AI is on a path to genuine understanding or will hit a ceiling before that. The debate is real; so are the limits they describe.
What Believers Emphasise
The case for expecting transformation is built on:
- Steep capability gains year over year
- Emergent capabilities as compute grows
- Economic traction in many real use cases
- Significant funding driving further investment
- Theoretical arguments that current approaches may continue scaling
Key believers:
- Holden Karnofsky's "Most Important Century" argues we're plausibly in a transformative period
- Researchers at frontier labs generally express serious expectations about further capabilities
- Investors and policy people increasingly take transformative AI seriously as a working hypothesis
The Range of Plausible Trajectories
A rough map:
- Plateau: current approaches plateau; AI is useful but not transformative
- Gradual improvement: steady progress over decades; integration in many fields
- Rapid expansion: scaling continues; capabilities cross important thresholds in 5-15 years
- Explosive takeoff: recursive improvement in which AI helps build more capable AI, fast
Serious people place probability mass on each of these. Nobody should be certain which is true.
How to Watch
A few practical habits:
Track capability benchmarks
AI capability improves in measurable ways: pass rates on exams, benchmark scores, agent task completion. These aren't the whole picture but they're objective data. Publications such as METR's evaluations and the Stanford AI Index Report track this.
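When reading benchmark pass rates, it helps to check whether a reported improvement is larger than the noise. The sketch below computes an approximate 95% Wilson score interval for a pass rate. The task counts are hypothetical, invented for illustration, and overlapping intervals are a rough warning sign, not a formal significance test.

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical results for two model versions on a 200-task benchmark.
old_low, old_high = wilson_interval(passes=120, total=200)  # 60% pass rate
new_low, new_high = wilson_interval(passes=132, total=200)  # 66% pass rate

print(f"old model: {old_low:.3f} to {old_high:.3f}")
print(f"new model: {new_low:.3f} to {new_high:.3f}")

# If the intervals overlap, the headline gain may be within the noise.
intervals_overlap = new_low <= old_high
print(f"intervals overlap: {intervals_overlap}")
```

With these made-up numbers, a six-point gain on 200 tasks still leaves the two intervals overlapping, which is why small benchmark deltas on small test sets deserve caution.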
Track compute
Where compute is flowing, what chips are going to whom, how data centres are being built. Energy and infrastructure are leading indicators; they show where the industry is betting.
Track deployment
What AI is doing in real economic activity. Are companies using AI more or less intensively? Are jobs changing? What does productivity data show? Look for signal over narrative.
Track research
New architectures, new training methods, new safety approaches. These preview what might be possible in 1-3 years.
Common Pitfalls
- "The demos always lie." Often true. Also: the demos sometimes understate. A demo that works in a curated setting may fail in deployment, or vice versa. Watch for sustained usage data, not demos alone.
- "It just predicts the next token." Accurate as mechanism; incomplete as explanation. The same is true of human speech in some reductive sense. What matters is what the system can accomplish, not the mechanical description of how.
- "GPT-4 was peak; it's been downhill." Not supported by benchmark data, though depending on usage, a particular user's experience may not be improving.
- "Everything I see about AI is hype." Some is. Some isn't. Distinguishing requires looking at real outputs, not headlines.
- "We're months away from AGI." Some say this; most don't. The odds it happens in the next few years are non-trivial but far from certain. Be skeptical of confident timelines, whether short or long.
Next Steps
Continue to 03-alignment.md for the central technical and philosophical problem.