Usability Testing
Usability testing is watching real people try to use your product. Not asking them what they think. Not having them fill out a survey. Watching them attempt actual tasks and observing where they struggle, succeed, and give up. It's the highest-impact research method and the one most teams skip.
The 5-User Rule
Nielsen Norman Group's research shows that 5 users find approximately 85% of usability problems. You don't need a massive study. You need a small study done often.
Usability problems found vs. number of test participants:
[Chart: cumulative share of problems found vs. number of participants. The curve climbs steeply over the first few users, reaches roughly 85% at 5 participants, and flattens toward 100% beyond that.]
5 users → ~85% of problems found
Beyond 5 → diminishing returns
Better approach: Test with 5 users, fix the problems, then test with 5 more users. Two rounds of 5 find more problems than one round of 10.
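The curve comes from a simple model: if each participant independently uncovers a fixed share of the problems (Nielsen and Landauer estimated roughly 31% per user, though the rate varies by project), the expected proportion found by n participants is 1 - (1 - 0.31)^n. A minimal sketch in Python:

```python
def problems_found(n_participants: int, discovery_rate: float = 0.31) -> float:
    """Expected share of usability problems found by n participants,
    assuming each one independently uncovers `discovery_rate` of them
    (0.31 is Nielsen and Landauer's estimate; it varies by project)."""
    return 1 - (1 - discovery_rate) ** n_participants

for n in range(1, 9):
    print(f"{n} participants: {problems_found(n):.0%} of problems found")
# 5 participants: ~84%, which is roughly where the curve starts to flatten
```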
Test Types
| Type | Participants | Facilitator | Duration | Cost | Best For |
|---|---|---|---|---|---|
| Moderated in-person | 5-8 | Present in room | 45-60 min | High | Deep insights, complex products, early-stage designs |
| Moderated remote | 5-8 | Video call | 30-60 min | Medium | Geographic reach, pandemic-friendly, screen sharing |
| Unmoderated remote | 10-50 | Recorded, no facilitator | 10-20 min | Low | Quick validation, high volume, simple tasks |
| Guerrilla | 5-10 | Informal, in a cafe or hallway | 5-15 min | Very Low | Fast feedback on specific interactions |
Choosing the Right Type
| Question | Type to Use |
|---|---|
| Need to understand why users struggle? | Moderated (ask follow-up questions) |
| Need fast results from many users? | Unmoderated |
| Testing a complex, multi-step workflow? | Moderated (need to guide and observe) |
| Testing a specific button/layout/page? | Unmoderated (simple, focused task) |
| No budget and need feedback today? | Guerrilla (find people and ask them) |
| Users are geographically distributed? | Remote (moderated or unmoderated) |
Moderated vs. Unmoderated: Tradeoffs
| Aspect | Moderated | Unmoderated |
|---|---|---|
| Follow-up questions | Yes, probe deeper on interesting moments | No, limited to pre-set tasks |
| Cost per participant | $100-300+ (incentive + facilitator time) | $10-50 (platform fee + small incentive) |
| Time to results | 1-2 weeks (sessions + analysis) | 2-3 days (automated recording) |
| Depth of insight | Deep: observe body language, ask "why?" | Shallow: observe behavior only |
| Facilitator bias | Risk of leading participants | No facilitator bias |
| Participant quality | Higher (screened, scheduled) | Lower (faster, less committed) |
Planning a Usability Test
Define Your Goals
Before writing tasks, answer these questions:
| Question | Example Answer |
|---|---|
| What are you testing? | The new checkout flow redesign |
| What do you want to learn? | Can users complete a purchase without help? Where do they get stuck? |
| What will you do with the results? | Fix the top 3 usability issues before launch |
| What's your success benchmark? | 80%+ task completion rate, average < 3 minutes |
Recruit the Right Participants
| Criterion | Why It Matters |
|---|---|
| Match your target user profile | Testing with the wrong people gives misleading results |
| Mix of tech comfort levels | Your users aren't all power users |
| Not employees or close friends | They know too much about the product |
| Screener survey | Filter for relevant experience and demographics |
| Incentive | $50-100 per session (or equivalent gift card) |
Screener question examples:
- "How often do you shop online?" (Filter for frequency)
- "Which of these tools have you used in the past 6 months?" (Filter for experience)
- "What is your primary role at work?" (Filter for job relevance)
Writing Good Tasks
Tasks are the heart of usability testing. Bad tasks produce useless data. Good tasks reveal real usability problems.
Task Structure
Every task should include:
- Context/scenario: Why the user is doing this
- Goal: What they need to accomplish
- No instructions: Don't tell them how to do it
Good vs. Bad Tasks
| Bad Task | Why It's Bad | Good Task |
|---|---|---|
| "Click the Search button and type 'running shoes'" | Tells them exactly what to do, tests nothing | "You need new running shoes for a marathon. Find a pair you'd buy." |
| "Navigate to Settings > Account > Security" | Gives the answer away | "You want to change your password. Go ahead and do that." |
| "Test the checkout flow" | Too vague, no scenario | "You've found a gift for a friend. Complete the purchase using any payment method." |
| "Find the FAQ page" | Tests navigation, not whether FAQ solves their problem | "Your order hasn't arrived. Find out what to do." |
| "Do you like the new design?" | Opinion question, not a task | "You want to track your monthly spending. Walk me through how you'd do that." |
Task Difficulty Spectrum
Include a mix of task difficulties:
| Difficulty | Purpose | Example |
|---|---|---|
| Easy (warm-up) | Build confidence, calibrate | "Find the pricing page" |
| Medium | Core functionality testing | "Add an item to your cart and apply a promo code" |
| Hard | Edge cases, complex flows | "Return a gift item purchased by someone else" |
| Exploratory | Discovery, open-ended | "Explore the dashboard and tell me what you can learn about your account" |
How Many Tasks?
| Session Length | Number of Tasks | Notes |
|---|---|---|
| 15 min (guerrilla/unmoderated) | 3-5 tasks | Quick, focused |
| 30 min (remote) | 5-7 tasks | Standard session |
| 45-60 min (moderated) | 6-10 tasks | Includes follow-up questions |
Facilitating a Session
Before the Session
- Test your recording software
- Prepare a printed/digital test script
- Have the prototype/product ready
- Remove any personal data from test accounts
- Silence your phone
During the Session
| Do | Don't |
|---|---|
| Read the task verbatim | Paraphrase and accidentally add hints |
| Stay silent while they work | Fill silences with hints or explanations |
| Ask "What are you thinking?" when they pause | Ask "Why did you click that?" (feels judgmental) |
| Say "There are no wrong answers" | Say "That's not right" or "Try the other button" |
| Note what they do, not just what they say | Only record verbal feedback |
| Ask follow-up after they finish the task | Interrupt them mid-task |
| End on time even if tasks remain | Run over and exhaust the participant |
The Think-Aloud Protocol
Ask participants to narrate their thoughts as they work:
"Please think out loud as you go through this. Tell me what you're looking for, what you expect to happen, and what you're thinking as you make decisions."
If they go silent:
- "What are you thinking right now?"
- "What are you looking for?"
- "What do you expect to happen next?"
Don't: Ask "Why did you do that?" during the task. It feels like being tested. Save "why" questions for after the task is complete.
Metrics to Track
Quantitative Metrics
| Metric | What It Measures | How to Capture | Benchmark |
|---|---|---|---|
| Task success rate | Can users complete the task? | Pass/fail per task per user | > 78% (industry average) |
| Time on task | How long does it take? | Timer per task | Compare to your target/baseline |
| Error rate | How often do users make mistakes? | Count wrong clicks, backtracking | Lower is better, compare over time |
| Lostness score | How much do users wander? | Smith's formula: sqrt((N/S - 1)^2 + (R/N - 1)^2), where S = total pages visited, N = unique pages visited, R = minimum pages needed | 0 = optimal, higher = more lost |
| Misclick rate | How often do users click the wrong thing? | Count clicks on non-target elements | Compare between designs |
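These metrics are straightforward to compute once each attempt is logged. A minimal sketch, assuming a hypothetical list of per-attempt records (the field names are illustrative, not from any particular tool), with lostness computed via Smith's formula from the table above:

```python
from math import sqrt

# Hypothetical attempt records -- field names are illustrative only.
attempts = [
    {"participant": "P1", "success": True,  "seconds": 94,
     "pages_visited": 7, "unique_pages": 5, "optimal_pages": 4},
    {"participant": "P2", "success": False, "seconds": 210,
     "pages_visited": 12, "unique_pages": 8, "optimal_pages": 4},
    {"participant": "P3", "success": True,  "seconds": 120,
     "pages_visited": 5, "unique_pages": 4, "optimal_pages": 4},
]

def lostness(total: int, unique: int, optimal: int) -> float:
    """Smith's lostness measure: 0 = perfectly efficient navigation."""
    return sqrt((unique / total - 1) ** 2 + (optimal / unique - 1) ** 2)

success_rate = sum(a["success"] for a in attempts) / len(attempts)
avg_time = sum(a["seconds"] for a in attempts) / len(attempts)
avg_lostness = sum(
    lostness(a["pages_visited"], a["unique_pages"], a["optimal_pages"])
    for a in attempts
) / len(attempts)

print(f"Task success rate: {success_rate:.0%}")
print(f"Average time on task: {avg_time:.0f}s")
print(f"Average lostness: {avg_lostness:.2f}")
```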
Qualitative Metrics
| Metric | What It Measures | How to Capture |
|---|---|---|
| Frustration indicators | Points of struggle | Sighs, swearing, face touching, long pauses |
| Confusion points | Where users don't understand | "I'm not sure what this means", scanning behavior |
| Delight moments | What works well | "Oh, that's nice!", smiles, quick task completion |
| Mental model gaps | Mismatch between expectation and reality | "I expected this to be under Settings" |
| Workarounds | When users find unofficial solutions | "I usually just Google the answer instead" |
Post-Task Questionnaires
Capture perception after each task:
Single Ease Question (SEQ):
"Overall, how easy or difficult was this task?" 1 (Very Difficult) to 7 (Very Easy) Average: 5.5. Below 5 indicates a problem.
After all tasks, System Usability Scale (SUS):
10 alternating positive/negative statements, rated 1-5:
- I think I would like to use this system frequently
- I found the system unnecessarily complex
- I thought the system was easy to use
- I think I would need tech support to use this
- I found the functions well integrated
- I thought there was too much inconsistency
- I imagine most people would learn this quickly
- I found the system cumbersome to use
- I felt confident using the system
- I needed to learn a lot before using this
SUS Scoring:
- Odd-numbered items (positive statements): response - 1
- Even-numbered items (negative statements): 5 - response
- Sum the adjusted values and multiply by 2.5
- Result: a 0-100 scale
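A small sketch of that arithmetic in Python (the example responses are made up):

```python
def sus_score(responses: list[int]) -> float:
    """Compute a System Usability Scale score from ten 1-5 responses,
    given in the order the statements are asked (items 1, 3, 5... are positive)."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    adjusted = [
        (r - 1) if i % 2 == 0 else (5 - r)  # odd-numbered vs. even-numbered items
        for i, r in enumerate(responses)
    ]
    return sum(adjusted) * 2.5  # scales the 0-40 sum to a 0-100 score

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # hypothetical participant -> 80.0
```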
| SUS Score | Interpretation | Grade |
|---|---|---|
| 0-50 | Poor usability | F |
| 51-67 | Below average | D |
| 68 | Average (industry benchmark) | C |
| 69-80 | Good | B |
| 81-90 | Excellent | A |
| 91-100 | Best imaginable | A+ |
Analyzing Results
Step 1: Compile Findings
For each task, document:
- Success/failure for each participant
- Time taken
- Errors observed
- Quotes and reactions
- Observations about behavior
Step 2: Categorize Severity
| Severity | Definition | Examples | Action |
|---|---|---|---|
| Critical | Prevents task completion | Can't find checkout button, form doesn't submit | Fix before launch |
| Major | Causes significant difficulty or frustration | Confusing error message, unclear navigation | Fix in next sprint |
| Minor | Causes slight hesitation but users recover | Unexpected label, small layout issue | Fix when convenient |
| Cosmetic | Noticed but doesn't affect task success | Color seems off, spacing feels uneven | Backlog |
Step 3: Prioritize
Plot issues on an impact/frequency matrix:
|  | Low Frequency | High Frequency |
|---|---|---|
| High Impact | Fix soon (rare but severe) | Fix now (critical) |
| Low Impact | Backlog | Monitor (common but minor) |
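Once findings are logged with an impact rating and the share of participants affected, the same sorting can be scripted. A minimal sketch with hypothetical issue records and an arbitrary 50% frequency threshold:

```python
# Hypothetical findings -- "frequency" is the share of participants affected.
issues = [
    {"issue": "Coupon field not found",   "impact": "major",    "frequency": 0.8},
    {"issue": "Form does not submit",     "impact": "critical", "frequency": 0.4},
    {"issue": "Label wording unexpected", "impact": "minor",    "frequency": 0.6},
]

IMPACT_RANK = {"critical": 3, "major": 2, "minor": 1, "cosmetic": 0}

def quadrant(issue: dict, freq_threshold: float = 0.5) -> str:
    """Place an issue in the impact/frequency matrix."""
    high_impact = IMPACT_RANK[issue["impact"]] >= 2  # major or critical
    high_freq = issue["frequency"] >= freq_threshold
    if high_impact and high_freq:
        return "FIX NOW"
    if high_impact:
        return "FIX SOON"
    return "MONITOR" if high_freq else "BACKLOG"

for i in sorted(issues, key=lambda x: (IMPACT_RANK[x["impact"]], x["frequency"]), reverse=True):
    print(f'{quadrant(i):8}  {i["issue"]} ({i["impact"]}, {i["frequency"]:.0%} of participants)')
```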
Step 4: Create Recommendations
For each issue, document:
ISSUE: Users can't find the "Apply Coupon" field during checkout
SEVERITY: Major
OBSERVED: 4 of 5 participants missed it
EVIDENCE: "I have a coupon code but I don't see where to enter it"
Participants scrolled past it, looked in cart summary
RECOMMENDATION: Move coupon field above the order summary,
add a visible "Have a promo code?" link
BEFORE: [screenshot]
AFTER: [mockup]
Step 5: Share Results
| Format | Audience | Content |
|---|---|---|
| Highlight reel (2-5 min video) | Everyone | The most impactful moments clipped together |
| Executive summary (1 page) | Leadership | Top 3-5 findings, severity, business impact |
| Detailed report | Product/design team | All findings, severity ratings, recommendations |
| Live debrief (30 min meeting) | Cross-functional team | Walk through findings, discuss priorities |
Running Unmoderated Tests
Tools
| Tool | Strengths | Price Range |
|---|---|---|
| Maze | Prototype testing, heatmaps, metrics | Free-$99/mo |
| UserTesting | Large participant pool, video recordings | $$$$ |
| Lookback | Live and unmoderated, screen + webcam | $$-$$$ |
| Hotjar | In-context feedback, heatmaps, recordings | Free-$99/mo |
| UsabilityHub/Lyssna | Quick preference tests, 5-second tests | $-$$ |
Unmoderated Test Structure
1. WELCOME SCREEN
"Thanks for participating! This test takes about 10 minutes.
We're testing a design, not you. There are no wrong answers."
2. SCREENER QUESTIONS (2-3)
Filter out non-qualifying participants
3. TASKS (3-5)
Task description → Participant completes task → Post-task question (SEQ)
4. POST-TEST QUESTIONS (3-5)
SUS or overall impressions
5. THANK YOU
"Thanks! Your $X gift card will arrive within 24 hours."
Specialized Test Types
First-Click Testing
Show a design and ask: "Where would you click first to [accomplish goal]?" If the first click is correct, users succeed 87% of the time. If the first click is wrong, success drops to 46%.
5-Second Test
Show a design for 5 seconds, then ask:
- "What is this page about?"
- "What do you remember?"
- "What would you do on this page?"
Tests first impressions and visual hierarchy.
A/B Testing
| Factor | Guideline |
|---|---|
| Sample size | Depends on baseline rate and the effect size you want to detect; run a power calculation first (1,000+ per variant is a bare minimum) |
| Duration | Run for at least 1-2 full business cycles (2-4 weeks) |
| Variables | Change only ONE thing at a time |
| Statistical significance | Wait for 95% confidence before declaring a winner |
| Metric | Define your primary metric BEFORE the test starts |
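The 95%-confidence check on a conversion-style metric is typically a two-proportion z-test. A minimal sketch using only the Python standard library (the conversion counts are made up):

```python
from math import sqrt, erf

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via the error function; doubled for a two-sided test.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical results: 230/4000 conversions for A vs. 290/4000 for B.
p = two_proportion_p_value(230, 4000, 290, 4000)
print(f"p = {p:.3f} -> {'significant at 95%' if p < 0.05 else 'keep the test running'}")
```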
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Testing with colleagues | They know too much, they'll always succeed | Recruit external participants matching your user profile |
| Leading participants | They follow your hints instead of their instincts | Read tasks verbatim, stay silent while they work |
| Only testing the happy path | You miss edge cases and error scenarios | Include tasks that may trigger errors or dead ends |
| Too many tasks per session | Participant fatigue, rushing on later tasks | 5-7 tasks in 30 min, 6-10 in 60 min |
| Not recording sessions | Rely on memory, miss details, can't share clips | Always record (with consent). Video is more persuasive than notes. |
| Testing too late | Design is finished, findings can't be acted on | Test early with wireframes or prototypes |
| Only reporting problems | Team doesn't know what's working well | Include positive findings, e.g., "5/5 users completed this easily" |
| Not iterating | Fix issues but never verify the fix works | Test again after making changes |
Key Takeaways
- Test with 5 users, fix, test again. Two small rounds beat one large study.
- Write scenario-based tasks that give context and goals, not step-by-step instructions.
- Stay silent during tasks. The urge to help is strong. Resist it.
- Watch what users do, not what they say. Behavior reveals truth.
- Categorize findings by severity (critical/major/minor/cosmetic) and prioritize accordingly.
- Share findings with video clips. A 30-second clip of a user struggling is worth more than a 20-page report.
- SUS score of 68 is average. Below that, you have usability problems to fix.
- Test early with wireframes. Don't wait for a polished product.