Regular Languages and Expressions
Introduction
Regular expressions provide an algebraic and declarative way to describe regular languages, complementing the operational view provided by finite automata. They are ubiquitous in computer science, appearing in text editors, programming languages, command-line tools, and formal verification systems. This reading establishes the mathematical foundation of regular expressions and explores the theoretical limits of regular languages.
Learning Objectives
By the end of this reading, you will be able to:
- Write and interpret regular expressions using formal syntax
- Convert regular expressions to NFAs and vice versa
- Prove equivalence of regular expressions and finite automata
- Apply the pumping lemma to prove languages are non-regular
- Understand closure properties of regular languages
- Solve practical problems using regular expressions
- Recognize the limitations of regular languages
Regular Expressions: Formal Definition
Syntax and Semantics
Definition: The regular expressions over an alphabet Σ are defined inductively:
Base cases:
- ∅ is a regular expression (denotes the empty language)
- ε is a regular expression (denotes the language {ε})
- For each a ∈ Σ, a is a regular expression (denotes the language {a})
Inductive cases: If R₁ and R₂ are regular expressions, then: 4. (R₁ ∪ R₂) is a regular expression (union) 5. (R₁ ∘ R₂) or (R₁R₂) is a regular expression (concatenation) 6. (R₁)* is a regular expression (Kleene star)
Language denoted: Each regular expression R denotes a language L(R):
L(∅) = ∅
L(ε) = {ε}
L(a) = {a} for a ∈ Σ
L(R₁ ∪ R₂) = L(R₁) ∪ L(R₂)
L(R₁R₂) = {xy : x ∈ L(R₁), y ∈ L(R₂)}
L(R*) = {x₁x₂...xₖ : k ≥ 0, each xᵢ ∈ L(R)}
Note: We often write R₁ + R₂ instead of R₁ ∪ R₂, and omit the concatenation operator: R₁R₂.
Examples
L(ab): All strings of zero or more a's followed by zero or more b's
- Matches: ε, a, b, aa, ab, aab, bbb, aaabbb
- Doesn't match: ba, aba
L((a ∪ b)*): All strings over {a, b}
- This is Σ* where Σ = {a, b}
L((a ∪ b)*abb): All strings ending in "abb"
- Matches: abb, aabb, babb, aaabb, ababb
L(abababa): All strings with exactly three b's
- Matches: bbb, abbb, babab, aaabaaabab
L((ε ∪ a)(bb)*): Strings that are either empty or start with a, followed by an even number of b's
- Matches: ε, a, abb, abbbb, bbbbb
Operator Precedence
To reduce parentheses, we use the following precedence (highest to lowest):
- Kleene star (*)
- Concatenation (∘)
- Union (∪ or +)
Example: a ∪ bc* means a ∪ (b(c*)), not ((a ∪ b)c)* or (a ∪ (bc))*
Extended Regular Expression Syntax
Many practical implementations include additional operators (syntactic sugar):
- R+: One or more repetitions (R+ = RR*)
- R?: Zero or one occurrence (R? = ε ∪ R)
- R{n}: Exactly n repetitions
- R{n,m}: Between n and m repetitions
- [abc]: Character class (a ∪ b ∪ c)
- [a-z]: Character range
- [^abc]: Negation (anything except a, b, or c)
- .: Any character
- ^: Start of line
- $: End of line
These can all be expressed in terms of the basic operators.
Equivalence of Regular Expressions and Finite Automata
Fundamental Theorem
Theorem (Kleene's Theorem): A language is regular (accepted by a DFA/NFA) if and only if it can be described by a regular expression.
We prove both directions:
- RE → NFA: Every regular expression can be converted to an NFA
- NFA → RE: Every NFA can be converted to a regular expression
Part 1: Regular Expression to NFA
We construct NFAs inductively following the structure of regular expressions.
Base cases:
- ∅:
┌────┐
→ │ q₀ │
└────┘
(no accept state, no transitions)
- ε:
┌────┐
→ │(q₀)│
└────┘
(start state is accept state)
- a ∈ Σ:
┌────┐ a ┌────┐
→ │ q₀ │────►│(q₁)│
└────┘ └────┘
Inductive cases:
- Union (R₁ ∪ R₂):
ε
┌──────────────┐
│ ▼
┌────┐ ┌────────┐
│ │ ε │ NFA │ ε ┌────┐
→ │ q₀ │──────►│ for │────►│(qf)│
│ │ │ R₁ │ └────┘
└────┘ └────────┘ ▲
│ │
│ ┌────────┐ │
│ ε │ NFA │ ε │
└────────►│ for │────────┘
│ R₂ │
└────────┘
- Concatenation (R₁R₂):
ε
┌────┐ ┌────────┐ ┌────────┐ ┌────┐
→ │ q₀ │──────►│ NFA │──────►│ NFA │──────►│(qf)│
└────┘ ε │ for │ ε │ for │ ε └────┘
│ R₁ │ │ R₂ │
└────────┘ └────────┘
- Kleene Star (R₁)*:
┌──────────ε─────────┐
│ │
│ ▼
┌────┐ │ ε ┌────────┐ │ ┌────┐
→ │ q₀ │────┼──────►│ NFA │──┼──►│(qf)│
└────┘ │ │ for │ │ └────┘
│ │ R₁ │ │ ▲
│ └────────┘ │ │
│ │ │ │
└────────────┘ └──────┘
ε ε
Example: Convert (a ∪ b)*abb to NFA
- Start with a ∪ b
- Apply Kleene star
- Concatenate with a, then b, then b
The resulting NFA will have ε-transitions and multiple paths.
Part 2: NFA to Regular Expression
We use the state elimination method (also called Kleene's algorithm).
Algorithm:
- Convert NFA to a generalized NFA (GNFA) where transitions are labeled with regular expressions
- Add new start state with ε-transition to old start
- Add new accept state with ε-transitions from all old accept states
- Eliminate states one by one until only start and accept remain
- The remaining transition label is the regular expression
State elimination rule: When eliminating state q_rip:
- For each pair of states q_i, q_j (where q_i ≠ q_rip ≠ q_j):
- Replace transition from q_i to q_j with:
R_ij = (original q_i → q_j) ∪ (q_i → q_rip)(q_rip → q_rip)*(q_rip → q_j)
Example: Convert this NFA to a regular expression:
States: {q₁, q₂}, Start: q₁, Accept: {q₂}
δ(q₁, a) = {q₁, q₂}
δ(q₁, b) = {q₁}
δ(q₂, b) = {q₂}
State diagram:
a,b a b
┌────────┐ ┌─────────┐
│ ▼ │ │
┌────┐ ┌────┐ │
→ │ q₁ │──────►│(q₂)│──────────┘
└────┘ a └────┘
Step 1: Add new start q₀ and accept q₃:
┌────┐ ε ┌────┐ a,b ┌────┐ a ┌────┐ ε ┌────┐
→ │ q₀ │────►│ q₁ │◄──────┤ q₁ │────►│ q₂ │────►│ q₃ │
└────┘ └────┘ └────┘ └────┘ └────┘
│ b
└──┘
Step 2: Eliminate q₁ (the self-loop):
- R(q₁ → q₁) = (a ∪ b)
- R(q₀ → q₂) = ε(a ∪ b)*a
Step 3: Eliminate q₂:
- R(q₂ → q₂) = b
- R(q₀ → q₃) = ε(a ∪ b)a·b·ε = (a ∪ b)ab
Result: Regular expression is (a ∪ b)ab
Complexity Analysis
- RE → NFA: Linear in the size of the regular expression
- NFA → RE: Can produce exponentially large regular expressions
- The state elimination order affects expression size
The Pumping Lemma for Regular Languages
Motivation
Not all languages are regular. How do we prove a language is not regular?
Examples of non-regular languages:
- L₁ = {0ⁿ1ⁿ : n ≥ 0} (equal numbers of 0s and 1s)
- L₂ = {ww : w ∈ {0,1}*} (strings repeated twice)
- L₃ = {0^(n²) : n ≥ 0} (strings of length perfect squares)
The pumping lemma provides a necessary condition for regularity.
The Pumping Lemma
Theorem (Pumping Lemma for Regular Languages):
If L is a regular language, then there exists a positive integer p (the "pumping length") such that every string s ∈ L with |s| ≥ p can be divided into three parts, s = xyz, satisfying:
- |xy| ≤ p (the first two parts are not too long)
- |y| > 0 (the middle part is non-empty)
- For all i ≥ 0, xy^i z ∈ L (we can "pump" the middle part)
Intuition: If L is regular, it's accepted by a DFA with finitely many states. Any sufficiently long string must cause the DFA to revisit some state, creating a "loop" that can be repeated.
Proof of the Pumping Lemma
Proof: Let L be regular and D be a DFA with p states accepting L. Consider any string s ∈ L with |s| ≥ p.
As D processes s = s₁s₂...sₙ (where n ≥ p), it visits n+1 states: q₀, q₁, ..., qₙ.
By the pigeonhole principle, among the first p+1 states (q₀, ..., qₚ), some state must repeat. Say q_i = q_j where i < j ≤ p.
Define:
- x = s₁...s_i (portion before the loop)
- y = s_(i+1)...s_j (the loop)
- z = s_(j+1)...sₙ (portion after the loop)
Then:
- |xy| = j ≤ p ✓
- |y| = j - i > 0 ✓
- We can repeat the loop any number of times: xy^i z ∈ L for all i ≥ 0 ✓
Using the Pumping Lemma
To prove a language L is not regular using the pumping lemma:
- Assume L is regular (for contradiction)
- Let p be the pumping length
- Choose a specific string s ∈ L with |s| ≥ p (this choice is crucial!)
- Consider all possible divisions s = xyz satisfying conditions 1 and 2
- For each division, find an i such that xy^i z ∉ L
- This contradicts the pumping lemma, so L is not regular
Example 1: L = {0ⁿ1ⁿ : n ≥ 0} is not regular
Proof: Assume L is regular with pumping length p.
Choose: s = 0^p 1^p ∈ L (clearly |s| = 2p ≥ p)
Consider any division s = xyz where |xy| ≤ p and |y| > 0:
- Since |xy| ≤ p, both x and y consist only of 0s
- So y = 0^k for some k > 0
Pump: Consider i = 2:
- xy²z = x·y·y·z = 0^(p-k)·0^k·0^k·1^p = 0^(p+k)1^p
- This has p+k zeros and p ones (where k > 0)
- Therefore xy²z ∉ L
This contradicts the pumping lemma, so L is not regular. ∎
Example 2: L = {ww : w ∈ {0,1}*} is not regular
Proof: Assume L is regular with pumping length p.
Choose: s = 0^p 1^p 0^p 1^p ∈ L (we have w = 0^p 1^p)
Consider any division s = xyz where |xy| ≤ p and |y| > 0:
- Since |xy| ≤ p, both x and y are in the first block of 0s
- So y = 0^k for some 0 < k ≤ p
Pump: Consider i = 2:
- xy²z = 0^(p+k) 1^p 0^p 1^p
- For this to be in L, it must equal ww for some w
- If w has length (2p+k)/2, then w is not an integer (contradiction)
- If we split differently, the two halves don't match
Therefore xy²z ∉ L, contradicting the pumping lemma. ∎
Example 3: L = {0^(n²) : n ≥ 0} is not regular
Proof: Assume L is regular with pumping length p.
Choose: s = 0^(p²) ∈ L
Consider any division s = xyz where |xy| ≤ p and |y| > 0:
- We have y = 0^k for some 0 < k ≤ p
Pump: Consider i = 2:
- xy²z = 0^(p²+k)
- For this to be in L, we need p² + k = m² for some integer m
- We have p² < p² + k ≤ p² + p < p² + 2p + 1 = (p+1)²
- So p² < p² + k < (p+1)², meaning p² + k is strictly between consecutive perfect squares
- Therefore xy²z ∉ L
This contradicts the pumping lemma. ∎
Common Mistakes
- Don't choose the division: The adversary chooses xyz (you must show it fails for ALL divisions)
- Don't choose p: The pumping length is given (you work with arbitrary p)
- Be strategic in choosing s: Pick a string where any valid division leads to contradiction
- Remember i can be 0: Sometimes pumping down (i=0) gives the contradiction
Closure Properties of Regular Languages
Regular languages form a Boolean algebra and have strong closure properties.
Proven Closure Properties
Theorem: The class of regular languages is closed under:
- Union: L₁ ∪ L₂
- Intersection: L₁ ∩ L₂
- Complement: L̄ = Σ* \ L
- Concatenation: L₁L₂
- Kleene star: L*
- Reversal: L^R = {w^R : w ∈ L}
- Difference: L₁ \ L₂ = L₁ ∩ L̄₂
- Homomorphism: h(L) for homomorphism h
- Inverse homomorphism: h⁻¹(L)
Proof Techniques
Union, Concatenation, Star: Use regular expression operations or NFA constructions.
Intersection (Product Construction): Given DFAs D₁ = (Q₁, Σ, δ₁, q₁, F₁) and D₂ = (Q₂, Σ, δ₂, q₂, F₂):
Construct D = (Q₁ × Q₂, Σ, δ, (q₁, q₂), F₁ × F₂) where:
- δ((p, q), a) = (δ₁(p, a), δ₂(q, a))
Complement: Given DFA D = (Q, Σ, δ, q₀, F): Construct D̄ = (Q, Σ, δ, q₀, Q \ F) (swap accept and non-accept states)
Reversal: Given NFA N = (Q, Σ, δ, q₀, F): Construct N^R = (Q ∪ {q_new}, Σ, δ^R, q_new, {q₀}) where:
- Add ε-transitions from q_new to all states in F
- Reverse all transitions: a ∈ δ^R(q, a') iff q ∈ δ(a, a')
- Old start becomes sole accept state
Homomorphism: A homomorphism h: Σ* → Γ* is defined by h(a) for each a ∈ Σ and extended to strings:
- h(ε) = ε
- h(wa) = h(w)h(a)
If L is regular with DFA D, construct DFA for h(L) by replacing each transition labeled a with h(a).
Applications of Closure Properties
Example: Prove L = {w ∈ {0,1}* : w has equal numbers of 01 and 10 substrings} is regular.
Solution:
- Observation: A string has equal 01s and 10s iff it's of the form 01 or 10
- L = 01 ∪ 10, both are regular
- By closure under union, L is regular
Example: Is L = {0^i 1^j : i ≠ j} regular?
Solution:
- We know L₁ = {0^n 1^n : n ≥ 0} is not regular
- L₂ = {0^i 1^j : i, j ≥ 0} is regular (01)
- Notice that L = L₂ \ L₁
- If L were regular, then L̄ = L̄₂ ∪ L₁ would be regular (closure under complement and union)
- This would imply L₁ = (L̄₂ ∪ L₁) ∩ L₂ is regular (closure under intersection)
- Contradiction! So L is not regular.
Decision Problems for Regular Languages
Decidable Problems
For regular languages represented as DFAs/NFAs/REs, the following are decidable:
Membership: Given L and string w, is w ∈ L?
- Algorithm: Simulate DFA on w (O(n) time)
Emptiness: Is L = ∅?
- Algorithm: Check if any accept state is reachable from start (DFS/BFS)
Finiteness: Is |L| < ∞?
- Algorithm: Check if there's a cycle on any path from start to accept state
Equivalence: Given L₁ and L₂, is L₁ = L₂?
- Algorithm: Check if L₁ ⊕ L₂ = (L₁ \ L₂) ∪ (L₂ \ L₁) is empty
Inclusion: Is L₁ ⊆ L₂?
- Algorithm: Check if L₁ \ L₂ is empty
Intersection: Is L₁ ∩ L₂ = ∅?
- Algorithm: Construct product DFA and check emptiness
All these can be solved in polynomial time for DFAs.
Complexity
- Membership: O(n) for DFA, O(n·|Q|²) for NFA
- Emptiness: O(|Q| + |δ|) using graph search
- Equivalence: O(n log n) using minimization
- DFA minimization: O(n log n) using Hopcroft's algorithm
Practical Regular Expression Patterns
Common Patterns
Email validation (simplified):
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}URL matching:
https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/.*)?IPv4 address:
([0-9]{1,3}\.){3}[0-9]{1,3}Phone number (US):
\d{3}-\d{3}-\d{4}Date (YYYY-MM-DD):
\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])
Limitations of Practical Regex Engines
Modern regex engines (Perl, Python, Java) include features beyond regular languages:
- Backreferences: \1, \2 (make language context-free or beyond)
- Lookahead/Lookbehind: (?=...), (?<=...)
- Recursion: (?R) or (?n)
These make the language not regular in the theoretical sense!
Example: ^(a*)b\1$ matches {a^n b a^n : n ≥ 0}, which is not regular.
Exercises
Basic Exercises
Regular Expression Practice: Write regular expressions for:
- a) Binary strings with at least three 1s
- b) Strings over {a,b,c} containing "abc" as a substring
- c) Binary numbers divisible by 4 (as binary strings)
Language Description: Describe the language of each regular expression:
- a) (0 ∪ 1)*0(0 ∪ 1)(0 ∪ 1)
- b) ((ε ∪ 0)(1*))*
- c) (01 ∪ 10)*
Conversion: Convert the regular expression (a ∪ b)*a(a ∪ b) to an NFA using the construction from this reading.
Intermediate Exercises
- State Elimination: Convert this DFA to a regular expression:
States: {q₀, q₁, q₂}, Start: q₀, Accept: {q₂}
δ(q₀, a) = q₁, δ(q₀, b) = q₀
δ(q₁, a) = q₂, δ(q₁, b) = q₁
δ(q₂, a) = q₂, δ(q₂, b) = q₂
Pumping Lemma: Use the pumping lemma to prove these languages are not regular:
- a) L = {0^n 1^m 0^n : n, m ≥ 0}
- b) L = {w ∈ {0,1}* : w has equal numbers of 0s and 1s}
- c) L = {0^p : p is prime}
Closure Properties:
- a) Prove that if L is regular, then HALF(L) = {x : ∃y, |x| = |y| and xy ∈ L} may not be regular
- b) Is the set of palindromes over {a,b} regular? Prove your answer.
Advanced Exercises
Minimality: Prove that the DFA accepting (a ∪ b)*a(a ∪ b)^(n-1) requires at least 2^n states.
Pumping Lemma Variant: Prove that if L is infinite and regular, then there exist strings x, y, z with |y| > 0 such that {xy^i z : i ≥ 0} ⊆ L.
Myhill-Nerode Application: Use the Myhill-Nerode theorem to prove L = {a^n b^n : n ≥ 0} is not regular by showing infinitely many equivalence classes.
Regular Expression Inequivalence: Find a language L such that there exist two regular expressions R₁ and R₂ with L(R₁) = L(R₂) but R₁ and R₂ are syntactically different and neither can be obtained from the other by algebraic laws.
State Complexity: Given DFAs D₁ with n states and D₂ with m states:
- a) What's the maximum number of states needed for D₁ ∪ D₂?
- b) What's the maximum number of states needed for D₁ ∩ D₂?
- c) What's the maximum number of states needed for D₁*?
Decidability: Design an algorithm to determine if a given DFA accepts an infinite language. Prove your algorithm is correct and analyze its complexity.
Summary
In this reading, we explored regular languages and expressions:
Regular Expressions provide an algebraic notation for regular languages with three basic operations: union, concatenation, and Kleene star
Kleene's Theorem establishes the equivalence of regular expressions, DFAs, and NFAs - they all recognize exactly the regular languages
Conversion Algorithms:
- RE → NFA: Structural induction (Thompson's construction)
- NFA → RE: State elimination (Kleene's algorithm)
The Pumping Lemma provides a necessary condition for regularity and is the primary tool for proving languages are non-regular
Closure Properties: Regular languages are closed under union, intersection, complement, concatenation, star, reversal, and homomorphisms
Decision Problems: Membership, emptiness, equivalence, and other properties are decidable for regular languages
Limitations: Many natural languages (balanced parentheses, equal 0s and 1s, palindromes) are not regular
Key Insights:
- Regular expressions and automata are dual views of the same concept
- The pumping lemma exploits the finite memory of DFAs
- Closure properties allow complex proofs via language operations
- Practical regex engines often exceed theoretical regular languages
Looking Ahead: Regular languages can't count unboundedly or match nested structures. We need more powerful models: context-free grammars and pushdown automata.
Next Steps
In the next reading, we'll explore Context-Free Grammars, which extend regular languages by adding:
- Recursive production rules
- Stack-based memory (pushdown automata)
- Ability to recognize nested structures like {0^n 1^n}
- Parse trees and ambiguity
- Applications to programming language syntax
← Previous: Finite Automata | Continue to Context-Free Grammars →
References and Further Reading
- Sipser, Michael. "Introduction to the Theory of Computation" (Chapter 1)
- Hopcroft, John E., et al. "Introduction to Automata Theory" (Chapters 3-4)
- Friedl, Jeffrey E. F. "Mastering Regular Expressions" (O'Reilly)
- Thompson, Ken. "Regular Expression Search Algorithm" (1968)
- Brzozowski, Janusz. "Derivatives of Regular Expressions" (1964)