Observability: LangSmith, Studio, and Debugging

This chapter covers observability: LangSmith tracing, LangGraph Studio for visual debugging, and the moves that save you hours when a graph misbehaves.

The Problem Observability Solves

Your agent answered wrong. Why?

  • Did the LLM get bad input? (Show me the prompt.)
  • Did it call the wrong tool? (Show me the tool call.)
  • Did the tool return garbage? (Show me the tool output.)
  • Did the agent loop three times when it should have been one? (Show me the routing decisions.)

Without tracing, you're printing state to stdout and scrolling. With tracing, you click on a run and see every step.

For agents, this isn't optional. Debugging an agent as a black box is a way to lose afternoons.

LangSmith: The Standard Tool

LangSmith is LangChain's hosted tracing service. Free tier is generous; paid tiers for teams.

Setup is two environment variables (plus an optional third):

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_..."
export LANGCHAIN_PROJECT="my-agent"   # optional; groups runs

That's it. Every LangChain/LangGraph invocation now sends traces to LangSmith.

No code changes. The library auto-detects the env vars and wraps the relevant calls.

What Gets Captured

For each graph invocation, LangSmith records:

  • The input.
  • Every node that ran, with inputs and outputs.
  • Every LLM call: prompt, response, tokens used, latency.
  • Every tool call: name, args, result, latency.
  • State at each step.
  • Errors.

The UI gives you a tree view per run: click a node, see its input and output; click an LLM call, see the prompt and completion.

Reading a Trace

A typical agent trace looks like:

graph.invoke (2.1s)
├── agent (1.4s)
│   └── ChatAnthropic.invoke (1.3s)
│       prompt: [...user message, system prompt...]
│       response: "I'll need the weather. Let me check."
│                 + tool_calls: [{'name': 'get_weather', ...}]
│       tokens: 124 in / 38 out
├── tools (0.3s)
│   └── get_weather('Tokyo') = "18°C, sunny"
└── agent (0.4s)
    └── ChatAnthropic.invoke (0.4s)
        response: "The weather in Tokyo is 18°C and sunny."
        tokens: 156 in / 18 out

You see the shape, the timing, the content. Debugging "why did it call get_weather twice?" becomes clicking through the tree.

Metadata and Tags

Tag runs for filtering:

from langchain_core.runnables import RunnableConfig

config: RunnableConfig = {
    "configurable": {"thread_id": "user-42"},
    "tags": ["production", "v2.1"],
    "metadata": {"user_tier": "pro"},
}

graph.invoke(input, config=config)

In LangSmith, filter by tag or metadata to find specific runs. Useful for "show me all production runs from v2.1 for pro users".

Evaluations

Beyond tracing, LangSmith has evals: run a dataset of inputs through your graph, score outputs, track regressions.

from langsmith import Client

client = Client()
dataset = client.create_dataset("agent-eval-v1")

for item in known_good_cases:
    client.create_example(
        inputs={"message": item["question"]},
        outputs={"expected": item["answer"]},
        dataset_id=dataset.id,
    )

# Run your graph against the dataset
client.run_on_dataset(
    dataset_name="agent-eval-v1",
    llm_or_chain_factory=graph,
)

Use for regression tests: a new prompt shouldn't drop accuracy on known-good cases.
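The scoring side of an eval is just a function from expected output to a score. A minimal sketch of an exact-match scorer you might plug into an eval run (the `exact_match` helper and its normalization rules are illustrative, not part of the LangSmith API):

```python
def exact_match(expected: str, actual: str) -> dict:
    """Score 1.0 if the normalized strings match, else 0.0.

    Normalization (lowercase, collapse whitespace) is an illustrative
    choice; adjust to what counts as "correct" for your agent.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    return {"key": "exact_match",
            "score": 1.0 if norm(expected) == norm(actual) else 0.0}


# Score one known-good case.
result = exact_match("The weather in Tokyo is 18°C and sunny.",
                     "the weather in  Tokyo is 18°C and sunny.")
print(result)  # {'key': 'exact_match', 'score': 1.0}
```

Exact match is the bluntest possible scorer; for open-ended answers you would swap in a semantic comparison, but the shape — dataset in, score out, tracked over time — stays the same.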

LangGraph Studio

A desktop app for LangGraph specifically. Visualizes your graph, lets you invoke it interactively, inspects state, and supports time-travel debugging.

Setup

Install the Studio app from langchain-ai/langgraph-studio on GitHub.

Your repo needs a langgraph.json config:

{
  "dependencies": ["."],
  "graphs": {
    "my_agent": "./src/agent.py:graph"
  },
  "env": ".env"
}

Point Studio at the project; it loads the graph.

What Studio Gives You

  • Visual graph. Boxes and arrows, exactly the structure you built.
  • Interactive run. Send input, see each step highlight as it executes.
  • State inspection. Click a step, see input/output state.
  • Time travel. Rewind to any step, edit state, re-run from there.
  • Interrupt testing. Simulate human-in-the-loop interactions.

For developing agents, Studio is what a debugger is for regular code: you step through, poke at state, and understand what happened.

Printing the Graph

Not tracing, but quick. Every compiled graph can print itself:

# ASCII
print(graph.get_graph().draw_ascii())

# Mermaid text
print(graph.get_graph().draw_mermaid())

# PNG
graph.get_graph().draw_mermaid_png(output_file_path="graph.png")

# With subgraphs expanded
print(graph.get_graph(xray=True).draw_ascii())

Paste Mermaid text into a PR description; the GitHub renderer draws it. Good for code review.

Debugging Patterns

Patterns that save time.

Log State at Every Node

Temporary, for deep debugging:

def debug(state: State) -> State:
    import json
    print(json.dumps(state, default=str, indent=2))
    return {}    # don't change state

Wire it between nodes:

builder.add_node("debug_1", debug)
builder.add_edge("agent", "debug_1")
builder.add_edge("debug_1", "tools")

Remove before shipping.

Use a Console Callback Handler

from langchain_core.tracers.stdout import ConsoleCallbackHandler

llm = ChatAnthropic(
    model="claude-sonnet-4-5",
    callbacks=[ConsoleCallbackHandler()],
)

Prints every prompt and response to stdout. Helpful when LangSmith isn't set up.

Use set_debug

Global LangChain debug mode:

from langchain.globals import set_debug
set_debug(True)

Very verbose. Great for tracking down mysterious behavior; too noisy for everyday use.

Inspect State with get_state_history

Already covered in Chapter 5:

for cp in graph.get_state_history(config):
    print(cp.values)

Walks backwards through every step. See exactly what state was when.

Add Assertions in Nodes

Defensive programming catches state corruption early.

def agent(state: State) -> State:
    assert "messages" in state, "messages missing from state"
    assert isinstance(state["messages"], list), "messages should be a list"
    # ...

Errors are better than silent wrong behavior.

Production Monitoring

LangSmith is for development and debugging. In production, you also want:

  • Metrics. Requests/sec, p95 latency, error rate. Send to Prometheus or your APM.
  • Alerts. p99 latency above threshold, error rate above 1%. Page someone.
  • Logs. Structured logs for every invocation. Tie to a request ID.
  • Costs. Token usage per run. LangSmith tracks this; so does Anthropic's dashboard.

A typical setup: LangSmith for deep traces, Prometheus for metrics, Sentry for errors, application logs for everything else.
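The metrics-and-logs layer can be sketched with only the standard library. The `invoke_with_telemetry` wrapper and its field names are illustrative — in practice you'd emit to Prometheus or your APM instead of a plain dict — but the shape (request ID, latency, error count, structured log line) is the one described above:

```python
import json
import time
import uuid


def invoke_with_telemetry(graph, payload, *, metrics: dict):
    """Wrap a graph invocation with a request ID, latency, and error count.

    `graph` is anything with an .invoke() method; `metrics` is a plain
    dict standing in for a real metrics client.
    """
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    try:
        return graph.invoke(payload)
    except Exception:
        metrics["errors"] = metrics.get("errors", 0) + 1
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        metrics["requests"] = metrics.get("requests", 0) + 1
        # Structured log line, tied to the request ID.
        print(json.dumps({"request_id": request_id,
                          "latency_ms": round(latency_ms, 1)}))


# Usage with a stub graph.
class StubGraph:
    def invoke(self, payload):
        return {"answer": "ok"}

metrics = {}
result = invoke_with_telemetry(StubGraph(), {"message": "hi"}, metrics=metrics)
print(metrics)  # {'requests': 1}
```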

Privacy and Data

Traces can contain user PII. By default, LangSmith stores everything sent.

Options:

  • Disable tracing for sensitive endpoints (unset LANGCHAIN_TRACING_V2, or enable tracing selectively with the tracing context manager instead of globally).
  • Redact in a wrapper layer before sending.
  • Use LangSmith's self-hosted option for air-gapped deployments.
  • Hash user IDs so traces can be correlated without exposing identities.

Decide what's acceptable for your use case. "Log everything forever" is not.
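The last option above, hashing user IDs, can be sketched with the standard library. A keyed HMAC beats a bare hash here, since small ID spaces are trivially brute-forced; the secret value and the `pseudonymize` helper name are illustrative:

```python
import hashlib
import hmac

# Illustrative; load from a secret store, and keep it out of traces.
SECRET = b"rotate-me"


def pseudonymize(user_id: str) -> str:
    """Stable, non-reversible token for a user ID.

    The same user always maps to the same token, so runs can still be
    correlated in LangSmith without exposing the real identity.
    """
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()[:16]


# Put the token, not the raw ID, into run metadata.
config = {"metadata": {"user": pseudonymize("user-42")}}
assert pseudonymize("user-42") == pseudonymize("user-42")   # stable
assert pseudonymize("user-42") != pseudonymize("user-43")   # distinct
```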

Common Pitfalls

No tracing in development. You're debugging blind. Set up LangSmith; it takes two minutes.

Tracing on in prod without retention limits. Traces pile up. Configure retention policies.

Trusting a trace to be the ground truth. Traces are what the library captured. If your code does something outside LangChain (a raw HTTP call), it's not traced. Instrument it or route through LangChain.

Not tagging runs. Everything in one bucket. Hard to filter. Add tags and metadata.

Ignoring Studio. For agent development, the visualizer saves a lot of guessing. Try it before you write another print statement.

Next Steps

Continue to 11-deployment.md to get a graph running in production.