Genes and Genomes: How DNA Is Organised

What a Gene Is

Informal definition: a gene is a stretch of DNA that encodes a functional product, usually a protein.

Formal definition: it depends, because genes are more complex than they look.

The simple version

A gene has three main parts:

Promoter     the start signal (where RNA polymerase binds)
Coding       the transcribed DNA that specifies the protein (exons, plus introns that are spliced out)
Terminator   the stop signal

The coding region is copied into mRNA (transcription), processed, and translated into protein. Simple enough.
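The transcription and translation steps above can be sketched in a few lines of Python. This is a toy model, not a bioinformatics tool: the 15-letter sequence is made up, the codon table holds only the five codons the example needs (the real table has 64), and the processing step (splicing) is skipped entirely.

```python
# Toy model: DNA coding strand -> mRNA -> protein.
# CODON_TABLE is deliberately incomplete; any codon not listed raises KeyError.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the coding (sense) strand: T becomes U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTAAATGGTAA"  # hypothetical coding sequence, for illustration only
mrna = transcribe(dna)
print(mrna)             # AUGGCUAAAUGGUAA
print(translate(mrna))  # MAKW
```

Each letter in the output is one amino acid, so this made-up gene encodes a four-residue peptide.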

The messy version

Real genes have regulatory regions scattered around them: enhancers (which can sit far away), silencers, insulators, alternative splicing signals. The same gene can produce different protein variants in different tissues through alternative splicing, which chooses different combinations of exons.

The human dystrophin gene, for instance, is about 2.2 million DNA letters long but produces proteins of various sizes depending on the tissue it is expressed in. Calling it "a gene" is like calling a novel "a sentence": technically true in one sense, misleading in another.
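To make the combinatorics of alternative splicing concrete, here is a minimal sketch. It assumes a hypothetical four-exon gene with made-up sequences, and it models only one splicing mode (skipping internal exons while keeping the first and last). Real cells regulate which combinations are produced, so not every variant listed here would necessarily exist.

```python
from itertools import combinations

# Hypothetical gene: exon names and sequences are invented for illustration.
exons = {"E1": "ATGGCT", "E2": "AAATGG", "E3": "GATCCA", "E4": "TTTTAA"}


def splice_variants(exons: dict[str, str]) -> dict[str, str]:
    """Enumerate transcripts that keep the first and last exon and any
    subset of the internal exons, always in genomic order."""
    first, *internal, last = list(exons)
    variants = {}
    for k in range(len(internal) + 1):
        for chosen in combinations(internal, k):
            kept = [first, *chosen, last]
            variants["-".join(kept)] = "".join(exons[name] for name in kept)
    return variants


variants = splice_variants(exons)
for name, sequence in variants.items():
    print(name, sequence)
print(len(variants))  # 4
```

With two optional exons there are 2 x 2 = 4 possible transcripts; a gene with ten optional exons would allow 1,024 under the same simplification.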

The Human Genome

Some numbers to anchor:

Total DNA             ~3 billion base pairs (3 Gb)
Chromosomes           46 (23 pairs)
Genes                 ~20,000 protein-coding
Protein-coding DNA    ~2% of the genome
Non-coding DNA        ~98%
Distinct proteins     ~100,000 from those genes (thanks to splicing)

That 2% fact is worth dwelling on. Only a small fraction of your DNA directly encodes proteins. The rest is non-coding, and for decades was dismissed as "junk". Much of it turns out to be anything but.
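The rounded figures above imply a few other numbers worth working out. The sketch below uses only the approximate values from the table, so the results are order-of-magnitude averages rather than measurements.

```python
# Approximate values taken from the table above.
GENOME_BP = 3_000_000_000   # ~3 billion base pairs
CODING_PERCENT = 2          # ~2% of the genome is protein-coding
GENES = 20_000              # ~20,000 protein-coding genes
PROTEINS = 100_000          # ~100,000 distinct proteins

coding_bp = GENOME_BP * CODING_PERCENT // 100
noncoding_bp = GENOME_BP - coding_bp
coding_bp_per_gene = coding_bp // GENES   # average only; real genes vary widely
proteins_per_gene = PROTEINS / GENES

print(f"coding DNA:        {coding_bp:,} bp")       # 60,000,000 bp
print(f"non-coding DNA:    {noncoding_bp:,} bp")    # 2,940,000,000 bp
print(f"coding bp / gene:  {coding_bp_per_gene:,}")  # 3,000
print(f"proteins / gene:   {proteins_per_gene}")     # 5.0
```

So roughly 60 million letters do the protein-coding work, and each gene yields about five distinct proteins on average under these figures.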

Non-Coding DNA

The 98% breaks down into several kinds:

Regulatory DNA

Promoters, enhancers, silencers: sequences that control which genes are on in which cells at which times. A human and a chimpanzee share over 98% of their coding DNA sequence but express those genes differently; much of the difference is in regulatory DNA.

RNA-producing genes

Genes that make RNA as their final product (not mRNA to be translated into protein). Includes rRNA, tRNA, miRNA, lncRNA, and more. Collectively, thousands of these exist in the human genome.

Repetitive DNA

Large stretches of DNA consisting of repeated sequences. Some are functional (centromeres, telomeres). Some are transposon "fossils" (see below). Some appear to be simply repetitive without obvious function.

Transposable elements

DNA sequences that can copy or move themselves around the genome, together with the remnants of such sequences. Around 45% of the human genome is transposon-derived. Most copies are inactive; a few still move occasionally, which is a source of mutation.

Pseudogenes

Former genes that have acquired mutations making them non-functional. They sometimes retain regulatory roles despite being "broken".

Chromosomes

Human DNA is packaged into 23 pairs of chromosomes:

  • Autosomes: pairs 1 through 22, numbered roughly from longest to shortest (chromosome 21 is actually slightly shorter than 22)
  • Sex chromosomes: pair 23. XX in females, XY in males (with exceptions)

Chromosomes are visible as distinct, compact structures only during cell division. Most of the time, DNA is in a looser, more extended state called chromatin, wrapped around histones.

Telomeres and centromeres

Each chromosome has a centromere (a pinched-in constriction, often but not always near the middle) and telomeres (the ends). Telomeres are repetitive sequences that protect the chromosome ends. They shorten with each cell division, which is one contributor to cellular ageing. Stem cells and most cancer cells use the enzyme telomerase to maintain their telomeres.

Genome Size vs Complexity

A common beginner surprise: genome size does not correlate with organism complexity.

Organism                       Genome size
E. coli (bacterium)            4.6 million bp
Saccharomyces (yeast)          12 million bp
C. elegans (worm)              100 million bp
Arabidopsis (plant)            135 million bp
Fruit fly                      140 million bp
Human                          3 billion bp
Onion                          16 billion bp
Marbled lungfish               130 billion bp
Amoeba (Polychaos)             670 billion bp

An onion has more than five times as much DNA as you. A lungfish has over forty times more. This is the C-value paradox, and the explanation is that most of that extra DNA is repetitive or non-coding. Complexity is not about raw DNA amount.
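The ratios quoted above can be checked directly from the table. The sketch below copies the genome sizes from the table as given and expresses each one as a multiple of the human genome.

```python
# Genome sizes in base pairs, copied from the table above.
genome_size_bp = {
    "E. coli": 4_600_000,
    "Saccharomyces": 12_000_000,
    "C. elegans": 100_000_000,
    "Arabidopsis": 135_000_000,
    "Fruit fly": 140_000_000,
    "Human": 3_000_000_000,
    "Onion": 16_000_000_000,
    "Marbled lungfish": 130_000_000_000,
    "Amoeba (Polychaos)": 670_000_000_000,
}

human = genome_size_bp["Human"]
ratios = {name: size / human for name, size in genome_size_bp.items()}

# Print smallest to largest, as a multiple of the human genome.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<20} {ratio:>10.4f} x human")
```

The onion comes out at about 5.3 times the human genome, the lungfish at about 43 times, and the amoeba at over 220 times.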

Epigenetics

Not every difference between cells is encoded in DNA sequence. Epigenetic modifications change how genes are expressed without changing the sequence:

DNA methylation

Adding a small chemical group (a methyl group) to cytosine bases. Methylated regions tend to be silenced. Patterns of methylation are tissue-specific and can be inherited across cell divisions.
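In mammals, most DNA methylation happens at cytosines that are immediately followed by a guanine (so-called CpG sites). As a rough illustration, the sketch below finds CpG sites in a made-up sequence and computes what fraction of them are methylated. The sequence and the set of methylated positions are hypothetical; real methylation data comes from experiments such as bisulfite sequencing.

```python
def cpg_positions(seq: str) -> list[int]:
    """Return the 0-based position of every C that is directly followed by G."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def methylated_fraction(sites: list[int], methylated: set[int]) -> float:
    """Fraction of CpG sites that appear in the methylated set."""
    if not sites:
        return 0.0
    return sum(1 for pos in sites if pos in methylated) / len(sites)


seq = "ATCGCGTACGGA"   # hypothetical sequence
methylated = {2, 8}    # hypothetical: positions found methylated in one tissue

sites = cpg_positions(seq)
print(sites)                                             # [2, 4, 8]
print(round(methylated_fraction(sites, methylated), 2))  # 0.67
```

Running the same calculation on data from two tissues would give different fractions for the same sequence, which is the point: the DNA is identical, the marks are not.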

Histone modifications

Adding chemical groups to the histones DNA wraps around. Some modifications open up the DNA for transcription; others close it down.

Non-coding RNAs

Some RNAs directly modulate gene expression without being translated.

Epigenetics explains why a liver cell and a neuron, with identical DNA, behave so differently. It also partly explains how the environment affects gene expression. It is not, despite some popular claims, a way to inherit acquired traits at large scale; most epigenetic marks are reset between generations.

The Mitochondrial Genome

A quirk worth knowing: mitochondria have their own DNA.

  • Circular, not linear
  • About 16,000 base pairs (human)
  • 37 genes
  • Inherited only from your mother (sperm mitochondria are eliminated after fertilisation)

Mitochondrial DNA is useful for tracing maternal ancestry. It also mutates faster than nuclear DNA, which is medically relevant: various inherited mitochondrial diseases exist.

The Genome in Biotech

Understanding the genome matters for:

Diagnostics

Genetic tests check specific sequences for known disease-causing variants. Examples include BRCA1/2 for breast cancer risk, CFTR for cystic fibrosis, and HFE for haemochromatosis. Direct-to-consumer tests (23andMe, AncestryDNA) look at a subset of the genome.

Drug targeting

Some drugs work only on patients with specific genetic profiles. Trastuzumab (Herceptin) works on breast cancers that overexpress HER2. Imatinib (Gleevec) works on cancers with the BCR-ABL fusion.

Disease association studies

Genome-wide association studies (GWAS) scan hundreds of thousands of markers across thousands of people to find variants associated with a disease. They identify many small-effect variants. Translation into clinical use has been slower than hoped.
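The core calculation behind a GWAS can be sketched for a single marker: compare how often a variant allele appears in cases versus controls, then demand a very small p-value because so many markers are tested at once. Everything numeric below is hypothetical (the allele counts, the marker names, and the figure of 500,000 markers), and the simple Bonferroni correction stands in for the more careful statistics real studies use.

```python
import math


def odds_ratio(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> float:
    """Odds of carrying the variant allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


N_MARKERS = 500_000            # hypothetical number of markers tested
THRESHOLD = 0.05 / N_MARKERS   # Bonferroni-corrected significance level

# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref)
markers = {
    "marker_A": (300, 700, 250, 750),
    "marker_B": (400, 600, 250, 750),
}

results = {}
for name, counts in markers.items():
    chi2 = chi_square_2x2(*counts)
    p = p_value_1df(chi2)
    results[name] = {
        "odds_ratio": odds_ratio(*counts),
        "chi2": chi2,
        "p": p,
        "significant": p < THRESHOLD,
    }
    print(name, results[name])
```

In this made-up data, marker_A would pass an ordinary 0.05 cut-off but fails the corrected threshold, while marker_B clears it. That gap is why GWAS need thousands of participants to detect small-effect variants.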

Gene therapy

Fixing a broken gene, or adding a missing one, at the DNA level. Chapter 9 goes deeper.

Reading a Gene Name

Gene names follow conventions. A few rules:

  • Human genes are uppercase italics: BRCA1
  • Mouse genes are italic with only the first letter capitalised: Brca1
  • Proteins from those genes use plain text: BRCA1

Gene symbols are usually short (3 to 6 characters) and often acronyms of something. TP53 is "tumour protein p53". CFTR is "cystic fibrosis transmembrane conductance regulator". Sometimes they're whimsical: the fruit-fly gene hedgehog was named for the spiky look of mutant larvae, and one of its vertebrate homologs was named sonic hedgehog (SHH) after the video-game character. It turned out to be a critical developmental gene.
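The capitalisation part of these conventions is easy to apply mechanically. The helper below is a hypothetical sketch covering only human and mouse gene symbols; italics cannot be represented in a plain string, so it handles letter case alone. It also assumes the two species use the same symbol, which is often but not always true.

```python
def format_gene_symbol(symbol: str, species: str) -> str:
    """Apply the letter-case convention for a gene symbol (italics not handled)."""
    if species == "human":
        return symbol.upper()
    if species == "mouse":
        # First character upper case, the rest lower case.
        return symbol[:1].upper() + symbol[1:].lower()
    raise ValueError(f"no convention defined here for species: {species!r}")


print(format_gene_symbol("brca1", "human"))  # BRCA1
print(format_gene_symbol("BRCA1", "mouse"))  # Brca1
print(format_gene_symbol("cftr", "mouse"))   # Cftr
```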

The Human Pangenome

A recent development: rather than one "reference" human genome, scientists are now assembling a pangenome representing human genetic diversity. The first pangenome reference, released in 2023, includes sequences from 47 people across global populations.

This matters because the original reference genome over-represented some populations. A pangenome captures variation that single-reference analysis misses, which affects disease studies and clinical genomics.

Common Pitfalls

"We have 20,000 genes; that's how complex we are." Complexity is not in gene count. It's in the regulatory networks and the combinatorial expression.

"Junk DNA is junk." Some of it is. Much of it isn't. The blanket dismissal has been retired.

"Genes cause diseases." Some do directly (Huntington's, sickle-cell). Most disease has complex genetic contributions modulated by environment. "Gene for X" is usually inaccurate for complex diseases.

"Your genome is your destiny." Overstated. Heritability varies across traits. For some (eye colour) the genome is close to deterministic; for others (most complex diseases, intelligence, personality) genetic influence is real but partial.

"Non-coding means non-functional." Non-coding means doesn't-make-protein. Regulatory sequences and RNA-encoding sequences are non-coding and very functional.

Next Steps

Continue to 05-proteins-in-action.md for what the products of genes actually do.