Introduction to Databases

Learning Objectives

By the end of this reading, you will be able to:

  • Understand what databases are and why they are essential in modern computing
  • Differentiate between various database management systems (DBMS)
  • Explain different data models (hierarchical, network, relational, NoSQL)
  • Understand the relational model and its fundamental concepts
  • Describe the advantages of using a DBMS over file-based storage
  • Identify when to use different types of databases

Introduction

In our digital world, data is everywhere. From social media posts to bank transactions, from medical records to weather forecasts, managing data efficiently is crucial. Databases provide the foundation for storing, retrieving, and managing this data in an organized and efficient manner.

A database is an organized collection of structured information or data, typically stored electronically in a computer system. A Database Management System (DBMS) is software that interacts with users, applications, and the database itself to capture and analyze data.

Why Use Databases?

Problems with File-Based Storage

Before databases became widespread, organizations stored data in file systems. This approach had several limitations:

  1. Data Redundancy: The same data is stored in multiple places
  2. Data Inconsistency: Updates in one file may not reflect in others
  3. Difficult Data Access: Writing new programs for each query
  4. Data Isolation: Data scattered in various files and formats
  5. Integrity Problems: Hard to enforce constraints
  6. Atomicity Problems: Difficult to ensure all-or-nothing operations
  7. Concurrent Access Anomalies: Multiple users cause conflicts
  8. Security Problems: Hard to provide selective access

Advantages of DBMS

A Database Management System solves these problems by providing:

  1. Data Independence: Applications are isolated from how data is stored
  2. Efficient Data Access: Optimized queries and indexing
  3. Data Integrity and Security: Enforced constraints and access control
  4. Data Administration: Centralized control of data
  5. Concurrent Access: Multiple users can access data simultaneously
  6. Crash Recovery: Mechanisms to recover from failures
  7. Reduced Application Development Time: Standard interfaces and query languages

Data Models

A data model is a collection of concepts for describing data, relationships, semantics, and constraints. Several data models have evolved over time:

1. Hierarchical Model

Data is organized in a tree-like structure with parent-child relationships.

Company
├── Department A
│   ├── Employee 1
│   └── Employee 2
└── Department B
    ├── Employee 3
    └── Employee 4

Advantages: Efficient for one-to-many relationships, fast access Disadvantages: Rigid structure, difficult to represent many-to-many relationships

2. Network Model

Extends the hierarchical model by allowing multiple parent-child relationships.

Product 1 ──┬── Supplier A
            └── Supplier B
Product 2 ──┬── Supplier A
            └── Supplier C

Advantages: More flexible than hierarchical, supports complex relationships Disadvantages: Complex to design and maintain

3. Relational Model

Data is organized in tables (relations) with rows and columns. This is the most widely used model today.

Employees Table
+----+--------+-------------+--------+
| ID | Name   | Department  | Salary |
+----+--------+-------------+--------+
| 1  | Alice  | Engineering | 75000  |
| 2  | Bob    | Marketing   | 65000  |
| 3  | Carol  | Engineering | 80000  |
+----+--------+-------------+--------+

Advantages: Simple, flexible, strong mathematical foundation Disadvantages: Can be slower for hierarchical data

4. NoSQL Models

Modern databases designed for specific use cases:

  • Document Stores: MongoDB, CouchDB (JSON-like documents)
  • Key-Value Stores: Redis, DynamoDB (simple key-value pairs)
  • Column-Family Stores: Cassandra, HBase (columns grouped into families)
  • Graph Databases: Neo4j, Amazon Neptune (nodes and relationships)

Advantages: Scalability, flexibility, performance for specific workloads Disadvantages: Less standardization, eventual consistency in some cases

The Relational Model: Fundamental Concepts

The relational model, introduced by E.F. Codd in 1970, is based on mathematical set theory and predicate logic. Let's explore its core concepts:

Tables (Relations)

A relation (or table) is a two-dimensional structure with:

  • Rows (tuples): Individual records
  • Columns (attributes): Properties of the records
Students Relation
+------------+----------+-----+-------+
| StudentID  | Name     | Age | Major |
+------------+----------+-----+-------+
| S001       | Alice    | 20  | CS    |
| S002       | Bob      | 22  | Math  |
| S003       | Carol    | 21  | CS    |
+------------+----------+-----+-------+

Domain

A domain is the set of permitted values for an attribute. For example:

  • Age domain: integers from 0 to 120
  • Major domain: {'CS', 'Math', 'Physics', 'Biology', ...}

Schema vs. Instance

  • Schema: The structure of the database (table names, column names, data types)
  • Instance: The actual data in the database at a particular moment
-- Schema (structure)
CREATE TABLE Students (
    StudentID VARCHAR(10) PRIMARY KEY,
    Name VARCHAR(100) NOT NULL,
    Age INTEGER CHECK (Age >= 0 AND Age <= 120),
    Major VARCHAR(50)
);

-- Instance (data)
INSERT INTO Students VALUES ('S001', 'Alice', 20, 'CS');
INSERT INTO Students VALUES ('S002', 'Bob', 22, 'Math');

Keys

Keys are crucial for identifying and relating data:

Primary Key

A primary key uniquely identifies each row in a table.

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,  -- Primary key
    Name VARCHAR(100),
    Email VARCHAR(100)
);

Properties of a primary key:

  • Must be unique
  • Cannot be NULL
  • Each table has at most one primary key
  • Can be composite (multiple columns)

Composite Primary Key

CREATE TABLE Enrollments (
    StudentID VARCHAR(10),
    CourseID VARCHAR(10),
    Semester VARCHAR(20),
    Grade CHAR(2),
    PRIMARY KEY (StudentID, CourseID, Semester)
);

Foreign Key

A foreign key is a column that references the primary key of another table, establishing relationships.

CREATE TABLE Courses (
    CourseID VARCHAR(10) PRIMARY KEY,
    CourseName VARCHAR(100),
    Credits INT
);

CREATE TABLE Enrollments (
    StudentID VARCHAR(10),
    CourseID VARCHAR(10),
    Grade CHAR(2),
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);

Candidate Key

A candidate key is any column or combination of columns that could serve as a primary key.

-- Both EmployeeID and Email could be primary keys
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,     -- Chosen as primary key
    Email VARCHAR(100) UNIQUE,      -- Candidate key (unique constraint)
    Name VARCHAR(100)
);

Constraints

Constraints enforce rules on data:

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL,        -- Cannot be NULL
    Price DECIMAL(10,2) CHECK (Price > 0),   -- Must be positive
    Category VARCHAR(50) DEFAULT 'General',   -- Default value
    SupplierID INT,
    FOREIGN KEY (SupplierID) REFERENCES Suppliers(SupplierID),
    UNIQUE (ProductName, SupplierID)          -- Combination must be unique
);

Database Languages

Data Definition Language (DDL)

Used to define and modify database structure:

-- Create a table
CREATE TABLE Departments (
    DeptID INT PRIMARY KEY,
    DeptName VARCHAR(100) NOT NULL,
    Location VARCHAR(100)
);

-- Modify a table
ALTER TABLE Departments ADD Budget DECIMAL(15,2);

-- Delete a table
DROP TABLE Departments;

Data Manipulation Language (DML)

Used to query and modify data:

-- Insert data
INSERT INTO Departments VALUES (1, 'Engineering', 'Building A', 500000);

-- Update data
UPDATE Departments SET Budget = 550000 WHERE DeptID = 1;

-- Delete data
DELETE FROM Departments WHERE DeptID = 1;

-- Query data
SELECT * FROM Departments WHERE Budget > 400000;

Data Control Language (DCL)

Used to control access to data:

-- Grant privileges
GRANT SELECT, INSERT ON Departments TO user1;

-- Revoke privileges
REVOKE INSERT ON Departments FROM user1;

Types of Database Systems

Relational DBMS (RDBMS)

Examples: PostgreSQL, MySQL, Oracle, SQL Server

Best for:

  • Structured data with clear relationships
  • ACID compliance requirements
  • Complex queries and reporting
  • Financial transactions
-- Example: Banking system
CREATE TABLE Accounts (
    AccountID INT PRIMARY KEY,
    CustomerID INT,
    Balance DECIMAL(15,2),
    AccountType VARCHAR(20),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

CREATE TABLE Transactions (
    TransactionID INT PRIMARY KEY,
    AccountID INT,
    Amount DECIMAL(15,2),
    TransactionDate TIMESTAMP,
    Type VARCHAR(20),
    FOREIGN KEY (AccountID) REFERENCES Accounts(AccountID)
);

NoSQL Databases

Document Databases (MongoDB)

// Example: Blog posts
{
    "_id": "post1",
    "title": "Introduction to Databases",
    "author": "Alice",
    "tags": ["database", "tutorial", "beginner"],
    "comments": [
        {
            "user": "Bob",
            "text": "Great article!",
            "date": "2024-01-15"
        }
    ]
}

Best for:

  • Semi-structured data
  • Flexible schemas
  • Rapid development

Key-Value Stores (Redis)

SET user:1000:name "Alice"
SET user:1000:email "alice@example.com"
GET user:1000:name  // Returns "Alice"

Best for:

  • Caching
  • Session management
  • Real-time analytics

Graph Databases (Neo4j)

// Create nodes and relationships
CREATE (alice:Person {name: 'Alice'})
CREATE (bob:Person {name: 'Bob'})
CREATE (alice)-[:KNOWS]->(bob)

// Query
MATCH (p:Person)-[:KNOWS]->(friend)
RETURN p.name, friend.name

Best for:

  • Social networks
  • Recommendation engines
  • Network analysis

Database Architecture

Three-Schema Architecture

The ANSI-SPARC architecture provides data independence through three levels:

┌─────────────────────────────────────┐
│  External Level (Views)             │  <- User-specific views
│  View1  View2  View3                │
└─────────────────────────────────────┘
           ↕ (External/Conceptual Mapping)
┌─────────────────────────────────────┐
│  Conceptual Level (Logical Schema)  │  <- Complete database structure
│  Tables, Relationships, Constraints │
└─────────────────────────────────────┘
           ↕ (Conceptual/Internal Mapping)
┌─────────────────────────────────────┐
│  Internal Level (Physical Schema)   │  <- How data is stored
│  Files, Indexes, Storage            │
└─────────────────────────────────────┘

Example:

-- Conceptual Level: Complete schema
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Salary DECIMAL(10,2),
    SSN VARCHAR(11),
    Department VARCHAR(50)
);

-- External Level: Views for different users
CREATE VIEW PublicEmployees AS
SELECT EmployeeID, Name, Department
FROM Employees;  -- No salary or SSN

CREATE VIEW ManagerView AS
SELECT EmployeeID, Name, Salary, Department
FROM Employees;  -- Salary but no SSN

Client-Server Architecture

┌──────────┐  ┌──────────┐  ┌──────────┐
│ Client 1 │  │ Client 2 │  │ Client 3 │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     └─────────┬───┴─────────────┘
               │
        ┌──────▼──────┐
        │   Network   │
        └──────┬──────┘
               │
        ┌──────▼──────┐
        │  DB Server  │
        │   (DBMS)    │
        └──────┬──────┘
               │
        ┌──────▼──────┐
        │  Database   │
        └─────────────┘

Practical Example: University Database

Let's design a simple university database:

-- Create database
CREATE DATABASE University;

-- Students table
CREATE TABLE Students (
    StudentID VARCHAR(10) PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName VARCHAR(50) NOT NULL,
    Email VARCHAR(100) UNIQUE,
    DateOfBirth DATE,
    Major VARCHAR(50)
);

-- Courses table
CREATE TABLE Courses (
    CourseID VARCHAR(10) PRIMARY KEY,
    CourseName VARCHAR(100) NOT NULL,
    Credits INT CHECK (Credits > 0),
    Department VARCHAR(50)
);

-- Instructors table
CREATE TABLE Instructors (
    InstructorID VARCHAR(10) PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName VARCHAR(50) NOT NULL,
    Email VARCHAR(100) UNIQUE,
    Department VARCHAR(50)
);

-- Enrollments table (many-to-many relationship)
CREATE TABLE Enrollments (
    StudentID VARCHAR(10),
    CourseID VARCHAR(10),
    Semester VARCHAR(20),
    Grade CHAR(2),
    PRIMARY KEY (StudentID, CourseID, Semester),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);

-- Course Assignments table (which instructor teaches which course)
CREATE TABLE CourseAssignments (
    CourseID VARCHAR(10),
    InstructorID VARCHAR(10),
    Semester VARCHAR(20),
    PRIMARY KEY (CourseID, Semester),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID),
    FOREIGN KEY (InstructorID) REFERENCES Instructors(InstructorID)
);

-- Insert sample data
INSERT INTO Students VALUES
    ('S001', 'Alice', 'Johnson', 'alice.j@university.edu', '2002-05-15', 'Computer Science'),
    ('S002', 'Bob', 'Smith', 'bob.s@university.edu', '2001-08-22', 'Mathematics'),
    ('S003', 'Carol', 'Davis', 'carol.d@university.edu', '2002-11-30', 'Computer Science');

INSERT INTO Courses VALUES
    ('CS101', 'Introduction to Programming', 3, 'Computer Science'),
    ('CS201', 'Data Structures', 4, 'Computer Science'),
    ('MATH101', 'Calculus I', 4, 'Mathematics');

INSERT INTO Instructors VALUES
    ('I001', 'Dr. Jane', 'Wilson', 'j.wilson@university.edu', 'Computer Science'),
    ('I002', 'Dr. John', 'Brown', 'j.brown@university.edu', 'Mathematics');

INSERT INTO Enrollments VALUES
    ('S001', 'CS101', 'Fall 2023', 'A'),
    ('S001', 'MATH101', 'Fall 2023', 'B+'),
    ('S002', 'MATH101', 'Fall 2023', 'A'),
    ('S003', 'CS101', 'Fall 2023', 'A-');

INSERT INTO CourseAssignments VALUES
    ('CS101', 'I001', 'Fall 2023'),
    ('MATH101', 'I002', 'Fall 2023');

-- Simple queries
SELECT * FROM Students WHERE Major = 'Computer Science';

SELECT s.FirstName, s.LastName, c.CourseName, e.Grade
FROM Students s
JOIN Enrollments e ON s.StudentID = e.StudentID
JOIN Courses c ON e.CourseID = c.CourseID
WHERE s.StudentID = 'S001';

Exercises

Basic Exercises

  1. Define Database Terms: Explain the difference between a database and a DBMS.

  2. Identify Keys: Given the following table, identify potential primary keys and explain your choice:

    Books
    +---------+------------------+-------------+------+-------+
    | ISBN    | Title            | Author      | Year | Pages |
    +---------+------------------+-------------+------+-------+
    | 1234567 | Database Systems | A. Silbers  | 2020 | 450   |
    | 7654321 | SQL Mastery      | B. Johnson  | 2019 | 380   |
    +---------+------------------+-------------+------+-------+
    
  3. Create Tables: Write SQL to create a table for a library system that stores book information (ISBN, title, author, publication year, available copies).

  4. Insert Data: Write SQL to insert 5 books into your library table.

Intermediate Exercises

  1. Design a Schema: Design a database schema for a simple e-commerce system with customers, products, and orders. Include:

    • At least 3 tables
    • Primary keys for each table
    • At least one foreign key relationship
    • Appropriate constraints
  2. Data Models: Compare and contrast when you would use a relational database vs. a document database. Give specific examples.

  3. Constraints: Create a table for a bank account system with appropriate constraints:

    • Account number (primary key)
    • Customer ID (foreign key)
    • Balance (must be non-negative)
    • Account type (must be 'Checking' or 'Savings')
    • Date opened (cannot be in the future)
  4. Views: Given the Employees table:

    CREATE TABLE Employees (
        EmployeeID INT PRIMARY KEY,
        Name VARCHAR(100),
        Salary DECIMAL(10,2),
        Department VARCHAR(50),
        SSN VARCHAR(11)
    );
    

    Create two views:

    • One for HR that shows all columns
    • One for public that shows only name and department

Advanced Exercises

  1. Three-Schema Architecture: Explain how the three-schema architecture provides data independence. Give an example of how you could change the internal schema without affecting the external schema.

  2. Complex Schema Design: Design a database for a hospital system that includes:

    • Patients with medical history
    • Doctors with specializations
    • Appointments
    • Prescriptions
    • Medical records

    Include all necessary tables, relationships, and constraints. Explain your design decisions.

  3. DBMS Advantages: For each of the following scenarios, explain which DBMS advantage is most relevant:

    • A bank needs to ensure that when money is transferred, it's deducted from one account and added to another, or neither happens
    • A company wants to run reports without writing custom programs each time
    • Multiple airline agents need to book seats on the same flight simultaneously
    • A healthcare system needs to ensure only authorized personnel can access patient records
  4. NoSQL vs SQL: You're building a social media platform. Decide which parts of the system should use:

    • Relational database
    • Document database
    • Key-value store
    • Graph database

    Justify each choice with specific features and requirements.

Summary

In this reading, we explored the fundamental concepts of databases:

  • Databases are organized collections of data, while a DBMS is the software that manages them
  • DBMSs provide advantages over file-based storage including data independence, integrity, concurrent access, and security
  • Various data models exist: hierarchical, network, relational, and NoSQL
  • The relational model organizes data in tables with rows and columns
  • Keys (primary, foreign, candidate) are essential for identifying records and establishing relationships
  • Constraints enforce business rules and data integrity
  • Different types of databases (RDBMS, document, key-value, graph) serve different purposes
  • The three-schema architecture provides data independence through external, conceptual, and internal levels

Understanding these foundations is crucial as we move forward to explore SQL, database design, and optimization in subsequent readings.

Next Reading

Continue to 02-relational-sql.md to learn about the relational model in depth and master SQL queries including SELECT, JOIN, GROUP BY, and subqueries.