Regular Languages and Expressions

Introduction

Regular expressions provide an algebraic and declarative way to describe regular languages, complementing the operational view provided by finite automata. They are ubiquitous in computer science, appearing in text editors, programming languages, command-line tools, and formal verification systems. This reading establishes the mathematical foundation of regular expressions and explores the theoretical limits of regular languages.

Learning Objectives

By the end of this reading, you will be able to:

  • Write and interpret regular expressions using formal syntax
  • Convert regular expressions to NFAs and vice versa
  • Prove equivalence of regular expressions and finite automata
  • Apply the pumping lemma to prove languages are non-regular
  • Understand closure properties of regular languages
  • Solve practical problems using regular expressions
  • Recognize the limitations of regular languages

Regular Expressions: Formal Definition

Syntax and Semantics

Definition: The regular expressions over an alphabet Σ are defined inductively:

Base cases:

  1. ∅ is a regular expression (denotes the empty language)
  2. ε is a regular expression (denotes the language {ε})
  3. For each a ∈ Σ, a is a regular expression (denotes the language {a})

Inductive cases: If R₁ and R₂ are regular expressions, then: 4. (R₁ ∪ R₂) is a regular expression (union) 5. (R₁ ∘ R₂) or (R₁R₂) is a regular expression (concatenation) 6. (R₁)* is a regular expression (Kleene star)

Language denoted: Each regular expression R denotes a language L(R):

L(∅) = ∅
L(ε) = {ε}
L(a) = {a} for a ∈ Σ
L(R₁ ∪ R₂) = L(R₁) ∪ L(R₂)
L(R₁R₂) = {xy : x ∈ L(R₁), y ∈ L(R₂)}
L(R*) = {x₁x₂...xₖ : k ≥ 0, each xᵢ ∈ L(R)}

Note: We often write R₁ + R₂ instead of R₁ ∪ R₂, and omit the concatenation operator: R₁R₂.

Examples

  1. L(ab): All strings of zero or more a's followed by zero or more b's

    • Matches: ε, a, b, aa, ab, aab, bbb, aaabbb
    • Doesn't match: ba, aba
  2. L((a ∪ b)*): All strings over {a, b}

    • This is Σ* where Σ = {a, b}
  3. L((a ∪ b)*abb): All strings ending in "abb"

    • Matches: abb, aabb, babb, aaabb, ababb
  4. L(abababa): All strings with exactly three b's

    • Matches: bbb, abbb, babab, aaabaaabab
  5. L((ε ∪ a)(bb)*): Strings that are either empty or start with a, followed by an even number of b's

    • Matches: ε, a, abb, abbbb, bbbbb

Operator Precedence

To reduce parentheses, we use the following precedence (highest to lowest):

  1. Kleene star (*)
  2. Concatenation (∘)
  3. Union (∪ or +)

Example: a ∪ bc* means a ∪ (b(c*)), not ((a ∪ b)c)* or (a ∪ (bc))*

Extended Regular Expression Syntax

Many practical implementations include additional operators (syntactic sugar):

  • R+: One or more repetitions (R+ = RR*)
  • R?: Zero or one occurrence (R? = ε ∪ R)
  • R{n}: Exactly n repetitions
  • R{n,m}: Between n and m repetitions
  • [abc]: Character class (a ∪ b ∪ c)
  • [a-z]: Character range
  • [^abc]: Negation (anything except a, b, or c)
  • .: Any character
  • ^: Start of line
  • $: End of line

These can all be expressed in terms of the basic operators.

Equivalence of Regular Expressions and Finite Automata

Fundamental Theorem

Theorem (Kleene's Theorem): A language is regular (accepted by a DFA/NFA) if and only if it can be described by a regular expression.

We prove both directions:

  1. RE → NFA: Every regular expression can be converted to an NFA
  2. NFA → RE: Every NFA can be converted to a regular expression

Part 1: Regular Expression to NFA

We construct NFAs inductively following the structure of regular expressions.

Base cases:

  1. ∅:
  ┌────┐
→ │ q₀ │
  └────┘
  (no accept state, no transitions)
  1. ε:
  ┌────┐
→ │(q₀)│
  └────┘
  (start state is accept state)
  1. a ∈ Σ:
  ┌────┐  a  ┌────┐
→ │ q₀ │────►│(q₁)│
  └────┘     └────┘

Inductive cases:

  1. Union (R₁ ∪ R₂):
                 ε
         ┌──────────────┐
         │              ▼
      ┌────┐       ┌────────┐
      │    │  ε    │  NFA   │  ε  ┌────┐
   →  │ q₀ │──────►│  for   │────►│(qf)│
      │    │       │   R₁   │     └────┘
      └────┘       └────────┘        ▲
         │                           │
         │         ┌────────┐        │
         │    ε    │  NFA   │   ε    │
         └────────►│  for   │────────┘
                   │   R₂   │
                   └────────┘
  1. Concatenation (R₁R₂):
                      ε
  ┌────┐       ┌────────┐       ┌────────┐       ┌────┐
→ │ q₀ │──────►│  NFA   │──────►│  NFA   │──────►│(qf)│
  └────┘   ε   │  for   │   ε   │  for   │   ε   └────┘
               │   R₁   │       │   R₂   │
               └────────┘       └────────┘
  1. Kleene Star (R₁)*:
                ┌──────────ε─────────┐
                │                    │
                │                    ▼
      ┌────┐    │  ε    ┌────────┐  │   ┌────┐
   →  │ q₀ │────┼──────►│  NFA   │──┼──►│(qf)│
      └────┘    │       │  for   │  │   └────┘
                │       │   R₁   │  │      ▲
                │       └────────┘  │      │
                │            │      │      │
                └────────────┘      └──────┘
                       ε                ε

Example: Convert (a ∪ b)*abb to NFA

  1. Start with a ∪ b
  2. Apply Kleene star
  3. Concatenate with a, then b, then b

The resulting NFA will have ε-transitions and multiple paths.

Part 2: NFA to Regular Expression

We use the state elimination method (also called Kleene's algorithm).

Algorithm:

  1. Convert NFA to a generalized NFA (GNFA) where transitions are labeled with regular expressions
  2. Add new start state with ε-transition to old start
  3. Add new accept state with ε-transitions from all old accept states
  4. Eliminate states one by one until only start and accept remain
  5. The remaining transition label is the regular expression

State elimination rule: When eliminating state q_rip:

  • For each pair of states q_i, q_j (where q_i ≠ q_rip ≠ q_j):
  • Replace transition from q_i to q_j with:
    R_ij = (original q_i → q_j) ∪ (q_i → q_rip)(q_rip → q_rip)*(q_rip → q_j)
    

Example: Convert this NFA to a regular expression:

States: {q₁, q₂}, Start: q₁, Accept: {q₂}
δ(q₁, a) = {q₁, q₂}
δ(q₁, b) = {q₁}
δ(q₂, b) = {q₂}

State diagram:

        a,b       a         b
    ┌────────┐       ┌─────────┐
    │        ▼       │         │
  ┌────┐       ┌────┐          │
→ │ q₁ │──────►│(q₂)│──────────┘
  └────┘   a   └────┘

Step 1: Add new start q₀ and accept q₃:

  ┌────┐  ε  ┌────┐  a,b  ┌────┐  a  ┌────┐  ε  ┌────┐
→ │ q₀ │────►│ q₁ │◄──────┤ q₁ │────►│ q₂ │────►│ q₃ │
  └────┘     └────┘       └────┘     └────┘     └────┘
                                        │ b
                                        └──┘

Step 2: Eliminate q₁ (the self-loop):

  • R(q₁ → q₁) = (a ∪ b)
  • R(q₀ → q₂) = ε(a ∪ b)*a

Step 3: Eliminate q₂:

  • R(q₂ → q₂) = b
  • R(q₀ → q₃) = ε(a ∪ b)a·b·ε = (a ∪ b)ab

Result: Regular expression is (a ∪ b)ab

Complexity Analysis

  • RE → NFA: Linear in the size of the regular expression
  • NFA → RE: Can produce exponentially large regular expressions
  • The state elimination order affects expression size

The Pumping Lemma for Regular Languages

Motivation

Not all languages are regular. How do we prove a language is not regular?

Examples of non-regular languages:

  • L₁ = {0ⁿ1ⁿ : n ≥ 0} (equal numbers of 0s and 1s)
  • L₂ = {ww : w ∈ {0,1}*} (strings repeated twice)
  • L₃ = {0^(n²) : n ≥ 0} (strings of length perfect squares)

The pumping lemma provides a necessary condition for regularity.

The Pumping Lemma

Theorem (Pumping Lemma for Regular Languages):

If L is a regular language, then there exists a positive integer p (the "pumping length") such that every string s ∈ L with |s| ≥ p can be divided into three parts, s = xyz, satisfying:

  1. |xy| ≤ p (the first two parts are not too long)
  2. |y| > 0 (the middle part is non-empty)
  3. For all i ≥ 0, xy^i z ∈ L (we can "pump" the middle part)

Intuition: If L is regular, it's accepted by a DFA with finitely many states. Any sufficiently long string must cause the DFA to revisit some state, creating a "loop" that can be repeated.

Proof of the Pumping Lemma

Proof: Let L be regular and D be a DFA with p states accepting L. Consider any string s ∈ L with |s| ≥ p.

As D processes s = s₁s₂...sₙ (where n ≥ p), it visits n+1 states: q₀, q₁, ..., qₙ.

By the pigeonhole principle, among the first p+1 states (q₀, ..., qₚ), some state must repeat. Say q_i = q_j where i < j ≤ p.

Define:

  • x = s₁...s_i (portion before the loop)
  • y = s_(i+1)...s_j (the loop)
  • z = s_(j+1)...sₙ (portion after the loop)

Then:

  1. |xy| = j ≤ p ✓
  2. |y| = j - i > 0 ✓
  3. We can repeat the loop any number of times: xy^i z ∈ L for all i ≥ 0 ✓

Using the Pumping Lemma

To prove a language L is not regular using the pumping lemma:

  1. Assume L is regular (for contradiction)
  2. Let p be the pumping length
  3. Choose a specific string s ∈ L with |s| ≥ p (this choice is crucial!)
  4. Consider all possible divisions s = xyz satisfying conditions 1 and 2
  5. For each division, find an i such that xy^i z ∉ L
  6. This contradicts the pumping lemma, so L is not regular

Example 1: L = {0ⁿ1ⁿ : n ≥ 0} is not regular

Proof: Assume L is regular with pumping length p.

Choose: s = 0^p 1^p ∈ L (clearly |s| = 2p ≥ p)

Consider any division s = xyz where |xy| ≤ p and |y| > 0:

  • Since |xy| ≤ p, both x and y consist only of 0s
  • So y = 0^k for some k > 0

Pump: Consider i = 2:

  • xy²z = x·y·y·z = 0^(p-k)·0^k·0^k·1^p = 0^(p+k)1^p
  • This has p+k zeros and p ones (where k > 0)
  • Therefore xy²z ∉ L

This contradicts the pumping lemma, so L is not regular. ∎

Example 2: L = {ww : w ∈ {0,1}*} is not regular

Proof: Assume L is regular with pumping length p.

Choose: s = 0^p 1^p 0^p 1^p ∈ L (we have w = 0^p 1^p)

Consider any division s = xyz where |xy| ≤ p and |y| > 0:

  • Since |xy| ≤ p, both x and y are in the first block of 0s
  • So y = 0^k for some 0 < k ≤ p

Pump: Consider i = 2:

  • xy²z = 0^(p+k) 1^p 0^p 1^p
  • For this to be in L, it must equal ww for some w
  • If w has length (2p+k)/2, then w is not an integer (contradiction)
  • If we split differently, the two halves don't match

Therefore xy²z ∉ L, contradicting the pumping lemma. ∎

Example 3: L = {0^(n²) : n ≥ 0} is not regular

Proof: Assume L is regular with pumping length p.

Choose: s = 0^(p²) ∈ L

Consider any division s = xyz where |xy| ≤ p and |y| > 0:

  • We have y = 0^k for some 0 < k ≤ p

Pump: Consider i = 2:

  • xy²z = 0^(p²+k)
  • For this to be in L, we need p² + k = m² for some integer m
  • We have p² < p² + k ≤ p² + p < p² + 2p + 1 = (p+1)²
  • So p² < p² + k < (p+1)², meaning p² + k is strictly between consecutive perfect squares
  • Therefore xy²z ∉ L

This contradicts the pumping lemma. ∎

Common Mistakes

  1. Don't choose the division: The adversary chooses xyz (you must show it fails for ALL divisions)
  2. Don't choose p: The pumping length is given (you work with arbitrary p)
  3. Be strategic in choosing s: Pick a string where any valid division leads to contradiction
  4. Remember i can be 0: Sometimes pumping down (i=0) gives the contradiction

Closure Properties of Regular Languages

Regular languages form a Boolean algebra and have strong closure properties.

Proven Closure Properties

Theorem: The class of regular languages is closed under:

  1. Union: L₁ ∪ L₂
  2. Intersection: L₁ ∩ L₂
  3. Complement: L̄ = Σ* \ L
  4. Concatenation: L₁L₂
  5. Kleene star: L*
  6. Reversal: L^R = {w^R : w ∈ L}
  7. Difference: L₁ \ L₂ = L₁ ∩ L̄₂
  8. Homomorphism: h(L) for homomorphism h
  9. Inverse homomorphism: h⁻¹(L)

Proof Techniques

Union, Concatenation, Star: Use regular expression operations or NFA constructions.

Intersection (Product Construction): Given DFAs D₁ = (Q₁, Σ, δ₁, q₁, F₁) and D₂ = (Q₂, Σ, δ₂, q₂, F₂):

Construct D = (Q₁ × Q₂, Σ, δ, (q₁, q₂), F₁ × F₂) where:

  • δ((p, q), a) = (δ₁(p, a), δ₂(q, a))

Complement: Given DFA D = (Q, Σ, δ, q₀, F): Construct D̄ = (Q, Σ, δ, q₀, Q \ F) (swap accept and non-accept states)

Reversal: Given NFA N = (Q, Σ, δ, q₀, F): Construct N^R = (Q ∪ {q_new}, Σ, δ^R, q_new, {q₀}) where:

  • Add ε-transitions from q_new to all states in F
  • Reverse all transitions: a ∈ δ^R(q, a') iff q ∈ δ(a, a')
  • Old start becomes sole accept state

Homomorphism: A homomorphism h: Σ* → Γ* is defined by h(a) for each a ∈ Σ and extended to strings:

  • h(ε) = ε
  • h(wa) = h(w)h(a)

If L is regular with DFA D, construct DFA for h(L) by replacing each transition labeled a with h(a).

Applications of Closure Properties

Example: Prove L = {w ∈ {0,1}* : w has equal numbers of 01 and 10 substrings} is regular.

Solution:

  • Observation: A string has equal 01s and 10s iff it's of the form 01 or 10
  • L = 01 ∪ 10, both are regular
  • By closure under union, L is regular

Example: Is L = {0^i 1^j : i ≠ j} regular?

Solution:

  • We know L₁ = {0^n 1^n : n ≥ 0} is not regular
  • L₂ = {0^i 1^j : i, j ≥ 0} is regular (01)
  • Notice that L = L₂ \ L₁
  • If L were regular, then L̄ = L̄₂ ∪ L₁ would be regular (closure under complement and union)
  • This would imply L₁ = (L̄₂ ∪ L₁) ∩ L₂ is regular (closure under intersection)
  • Contradiction! So L is not regular.

Decision Problems for Regular Languages

Decidable Problems

For regular languages represented as DFAs/NFAs/REs, the following are decidable:

  1. Membership: Given L and string w, is w ∈ L?

    • Algorithm: Simulate DFA on w (O(n) time)
  2. Emptiness: Is L = ∅?

    • Algorithm: Check if any accept state is reachable from start (DFS/BFS)
  3. Finiteness: Is |L| < ∞?

    • Algorithm: Check if there's a cycle on any path from start to accept state
  4. Equivalence: Given L₁ and L₂, is L₁ = L₂?

    • Algorithm: Check if L₁ ⊕ L₂ = (L₁ \ L₂) ∪ (L₂ \ L₁) is empty
  5. Inclusion: Is L₁ ⊆ L₂?

    • Algorithm: Check if L₁ \ L₂ is empty
  6. Intersection: Is L₁ ∩ L₂ = ∅?

    • Algorithm: Construct product DFA and check emptiness

All these can be solved in polynomial time for DFAs.

Complexity

  • Membership: O(n) for DFA, O(n·|Q|²) for NFA
  • Emptiness: O(|Q| + |δ|) using graph search
  • Equivalence: O(n log n) using minimization
  • DFA minimization: O(n log n) using Hopcroft's algorithm

Practical Regular Expression Patterns

Common Patterns

  1. Email validation (simplified):

    [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
    
  2. URL matching:

    https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/.*)?
    
  3. IPv4 address:

    ([0-9]{1,3}\.){3}[0-9]{1,3}
    
  4. Phone number (US):

    \d{3}-\d{3}-\d{4}
    
  5. Date (YYYY-MM-DD):

    \d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])
    

Limitations of Practical Regex Engines

Modern regex engines (Perl, Python, Java) include features beyond regular languages:

  • Backreferences: \1, \2 (make language context-free or beyond)
  • Lookahead/Lookbehind: (?=...), (?<=...)
  • Recursion: (?R) or (?n)

These make the language not regular in the theoretical sense!

Example: ^(a*)b\1$ matches {a^n b a^n : n ≥ 0}, which is not regular.

Exercises

Basic Exercises

  1. Regular Expression Practice: Write regular expressions for:

    • a) Binary strings with at least three 1s
    • b) Strings over {a,b,c} containing "abc" as a substring
    • c) Binary numbers divisible by 4 (as binary strings)
  2. Language Description: Describe the language of each regular expression:

    • a) (0 ∪ 1)*0(0 ∪ 1)(0 ∪ 1)
    • b) ((ε ∪ 0)(1*))*
    • c) (01 ∪ 10)*
  3. Conversion: Convert the regular expression (a ∪ b)*a(a ∪ b) to an NFA using the construction from this reading.

Intermediate Exercises

  1. State Elimination: Convert this DFA to a regular expression:
States: {q₀, q₁, q₂}, Start: q₀, Accept: {q₂}
δ(q₀, a) = q₁,  δ(q₀, b) = q₀
δ(q₁, a) = q₂,  δ(q₁, b) = q₁
δ(q₂, a) = q₂,  δ(q₂, b) = q₂
  1. Pumping Lemma: Use the pumping lemma to prove these languages are not regular:

    • a) L = {0^n 1^m 0^n : n, m ≥ 0}
    • b) L = {w ∈ {0,1}* : w has equal numbers of 0s and 1s}
    • c) L = {0^p : p is prime}
  2. Closure Properties:

    • a) Prove that if L is regular, then HALF(L) = {x : ∃y, |x| = |y| and xy ∈ L} may not be regular
    • b) Is the set of palindromes over {a,b} regular? Prove your answer.

Advanced Exercises

  1. Minimality: Prove that the DFA accepting (a ∪ b)*a(a ∪ b)^(n-1) requires at least 2^n states.

  2. Pumping Lemma Variant: Prove that if L is infinite and regular, then there exist strings x, y, z with |y| > 0 such that {xy^i z : i ≥ 0} ⊆ L.

  3. Myhill-Nerode Application: Use the Myhill-Nerode theorem to prove L = {a^n b^n : n ≥ 0} is not regular by showing infinitely many equivalence classes.

  4. Regular Expression Inequivalence: Find a language L such that there exist two regular expressions R₁ and R₂ with L(R₁) = L(R₂) but R₁ and R₂ are syntactically different and neither can be obtained from the other by algebraic laws.

  5. State Complexity: Given DFAs D₁ with n states and D₂ with m states:

    • a) What's the maximum number of states needed for D₁ ∪ D₂?
    • b) What's the maximum number of states needed for D₁ ∩ D₂?
    • c) What's the maximum number of states needed for D₁*?
  6. Decidability: Design an algorithm to determine if a given DFA accepts an infinite language. Prove your algorithm is correct and analyze its complexity.

Summary

In this reading, we explored regular languages and expressions:

  1. Regular Expressions provide an algebraic notation for regular languages with three basic operations: union, concatenation, and Kleene star

  2. Kleene's Theorem establishes the equivalence of regular expressions, DFAs, and NFAs - they all recognize exactly the regular languages

  3. Conversion Algorithms:

    • RE → NFA: Structural induction (Thompson's construction)
    • NFA → RE: State elimination (Kleene's algorithm)
  4. The Pumping Lemma provides a necessary condition for regularity and is the primary tool for proving languages are non-regular

  5. Closure Properties: Regular languages are closed under union, intersection, complement, concatenation, star, reversal, and homomorphisms

  6. Decision Problems: Membership, emptiness, equivalence, and other properties are decidable for regular languages

  7. Limitations: Many natural languages (balanced parentheses, equal 0s and 1s, palindromes) are not regular

Key Insights:

  • Regular expressions and automata are dual views of the same concept
  • The pumping lemma exploits the finite memory of DFAs
  • Closure properties allow complex proofs via language operations
  • Practical regex engines often exceed theoretical regular languages

Looking Ahead: Regular languages can't count unboundedly or match nested structures. We need more powerful models: context-free grammars and pushdown automata.

Next Steps

In the next reading, we'll explore Context-Free Grammars, which extend regular languages by adding:

  • Recursive production rules
  • Stack-based memory (pushdown automata)
  • Ability to recognize nested structures like {0^n 1^n}
  • Parse trees and ambiguity
  • Applications to programming language syntax

← Previous: Finite Automata | Continue to Context-Free Grammars →

References and Further Reading

  1. Sipser, Michael. "Introduction to the Theory of Computation" (Chapter 1)
  2. Hopcroft, John E., et al. "Introduction to Automata Theory" (Chapters 3-4)
  3. Friedl, Jeffrey E. F. "Mastering Regular Expressions" (O'Reilly)
  4. Thompson, Ken. "Regular Expression Search Algorithm" (1968)
  5. Brzozowski, Janusz. "Derivatives of Regular Expressions" (1964)