Information Theory: Entropy, Compression, and Communication

Applied Mathematics Applied Mathematics 9 min read 1821 words Intermediate ExcellentWiki Editorial Team

Introduction

Information theory, founded by Claude Shannon in his 1948 paper A Mathematical Theory of Communication, quantifies information itself. Before Shannon, information was a vague concept. After Shannon, it became a precisely measurable quantity with units of bits — the amount of information gained from learning the outcome of a fair coin flip.

Shannon’s theory addresses two fundamental problems in communication: source coding (compressing data to its essential size) and channel coding (reliably transmitting data over noisy channels). The solutions reveal deep connections between information, probability, and computation that extend far beyond communication engineering.

Entropy and Information

Entropy measures the uncertainty or information content of a random variable. For a discrete random variable X with probability distribution p(x), the entropy H(X) = −Σ p(x) log₂ p(x) bits. A fair coin has entropy 1 bit. A biased coin has less entropy because the outcome is more predictable.

Quantifying Information

The self-information of an event x is I(x) = −log₂ p(x) bits. Rare events carry more information than common events. Learning that a highly improbable event occurred provides more surprise and more information than learning that a routine event happened. The entropy is the expected value of self-information.

The binary entropy function H(p) = −p log₂ p − (1−p) log₂(1−p) gives the entropy of a Bernoulli random variable with success probability p. It peaks at p = 1/2 with value 1 bit and approaches 0 as p approaches 0 or 1. This function appears throughout information theory, coding theory, and machine learning.

The Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence D(p||q) = Σ p(x) log₂(p(x)/q(x)) measures the inefficiency of using distribution q to approximate distribution p. It is always nonnegative and equals zero only when p = q. The KL divergence is not symmetric and does not satisfy the triangle inequality, so it is not a true distance metric.

KL divergence appears in machine learning as the loss function for variational inference and as the regularization term in variational autoencoders. The cross-entropy H(p,q) = H(p) + D(p||q) measures the average number of bits needed to encode samples from p using a code optimized for q. Minimizing cross-entropy is equivalent to minimizing KL divergence when p is fixed, making it the standard loss function for classification.

Properties of Entropy

Entropy is nonnegative and maximized by the uniform distribution. For n equally likely outcomes, H = log₂ n bits. Entropy is additive for independent random variables: H(X,Y) = H(X) + H(Y). Joint entropy satisfies H(X,Y) ≤ H(X) + H(Y) with equality only for independent variables.

Conditional entropy H(Y|X) = H(X,Y) − H(X) measures the remaining uncertainty about Y after knowing X. The chain rule H(X₁,X₂,…,Xₙ) = Σ H(Xᵢ|X₁,…,Xᵢ₋₁) generalizes this decomposition.

Differential Entropy

For continuous random variables, differential entropy h(X) = −∫ f(x) log₂ f(x) dx extends the concept. Unlike discrete entropy, differential entropy can be negative and depends on the measurement scale. The Gaussian distribution maximizes differential entropy for a given variance, making it the least informative distribution among those with fixed second moment.

Mutual Information

Mutual information I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) measures the amount of information one random variable contains about another. It is symmetric, nonnegative, and zero only for independent variables. Mutual information is the fundamental quantity for communication: it measures how much information the channel output reveals about the input.

Data Processing Inequality

If X → Y → Z form a Markov chain (Z depends on X only through Y), then I(X;Z) ≤ I(X;Y). Processing cannot increase information: no amount of data manipulation can create information that was not present in the data. This inequality has profound implications for statistical inference and machine learning.

Source Coding

Source coding (data compression) represents data using as few bits as possible. Lossless compression preserves the original data exactly. Lossy compression discards some information, accepting distortion in exchange for lower bit rates.

Shannon’s Source Coding Theorem

The source coding theorem states that the minimum expected codeword length for lossless compression of a source X is at least H(X) bits per symbol and can be made arbitrarily close to H(X). Entropy is the fundamental limit of lossless compression — no scheme can beat it on average.

Huffman coding achieves the entropy bound for dyadic distributions (probabilities that are powers of 1/2) and comes close otherwise. Arithmetic coding handles noninteger code lengths efficiently and is the basis for JPEG, MP3, and video compression standards.

Lossy Compression

Rate-distortion theory addresses lossy compression. The rate-distortion function R(D) gives the minimum bit rate needed to achieve distortion at most D. The tradeoff between rate and distortion is fundamental: higher quality requires more bits.

For a Gaussian source with mean squared error distortion, the rate-distortion function is R(D) = (1/2) log₂(σ²/D) for D ≤ σ². This formula quantifies how many bits per sample are needed to represent the source with a given level of distortion. The Blahut-Arimoto algorithm computes rate-distortion functions numerically for arbitrary sources and distortion measures.

JPEG image compression exploits the fact that human vision is less sensitive to high-frequency details. The discrete cosine transform converts image blocks to frequency coefficients, which are quantized (discarding high-frequency information) and then entropy-coded.

Channel Capacity

Channel capacity C = max_{p(x)} I(X;Y) is the maximum rate at which information can be reliably transmitted over a noisy channel. Shannon’s channel coding theorem states that for any rate R < C, there exists a code achieving arbitrarily small error probability. For rates above C, reliable communication is impossible.

Binary Symmetric Channel

The binary symmetric channel (BSC) flips each bit with crossover probability p. Its capacity is C = 1 − H(p) bits per channel use. When p = 0 (no noise), capacity is 1. When p = 1/2 (maximally noisy), capacity is 0 — no information gets through.

Gaussian Channel

The additive white Gaussian noise (AWGN) channel adds Gaussian noise to the transmitted signal. With power constraint P and noise variance σ², the capacity is C = (1/2) log₂(1 + P/σ²) bits per channel use. This formula drives the design of modems, wireless systems, and satellite communications.

The Shannon-Hartley theorem C = B log₂(1 + S/N) expresses capacity for a channel with bandwidth B and signal-to-noise ratio S/N. Doubling bandwidth doubles capacity in the low-SNR regime, while increasing power is more effective in the high-SNR regime.

Error-Correcting Codes

Channel coding adds redundancy to protect against errors. Block codes process fixed-length chunks of data. Convolutional codes process streaming data with shift registers. Turbo codes and low-density parity-check (LDPC) codes approach the Shannon limit in practice.

Linear Block Codes

A linear block code maps k information bits to n coded bits using a generator matrix G. The code rate is k/n. The minimum Hamming distance d_min determines error correction capability: a code can correct up to ⌊(d_min − 1)/2⌋ errors.

Hamming codes (n = 2ʳ − 1, k = 2ʳ − r − 1, d_min = 3) correct single errors with minimal redundancy. Reed-Solomon codes handle burst errors in CDs, QR codes, and deep-space communications.

Modern Codes

Turbo codes, invented in 1993, achieve performance within a fraction of a dB of the Shannon limit by iteratively decoding two constituent convolutional codes. LDPC codes, rediscovered around the same time, achieve similar performance with lower decoding complexity. Both are used in 5G, WiFi, and satellite communications.

Polar codes, invented by Erdal Arikan in 2009, are the first codes to provably achieve the symmetric capacity of binary-input memoryless channels with low encoding and decoding complexity. They exploit channel polarization: as code length increases, virtual channels become either perfectly noisy or perfectly noiseless. Information is transmitted over the noiseless channels while frozen bits are sent over the noisy ones.

Applications in Machine Learning

Information theory provides foundational concepts for machine learning. The cross-entropy loss function measures the difference between predicted and true distributions. Maximum likelihood estimation is equivalent to minimizing KL divergence between the empirical and model distributions.

The information bottleneck method finds optimal representations by maximizing mutual information with the target while compressing the input. Variational autoencoders use the ELBO (evidence lower bound), which decomposes into a reconstruction term and a KL divergence term.

Decision tree learning uses information gain (reduction in entropy) to select splitting attributes. Feature selection uses mutual information to identify the most relevant features for prediction.

What is entropy in simple terms? Entropy measures average uncertainty or information content. A fair coin flip has 1 bit of entropy because the outcome is maximally uncertain. A predictable source has low entropy.

What does Shannon’s channel coding theorem say? Any communication channel has a capacity C. For any rate below C, reliable communication is possible using appropriate coding. For rates above C, reliable communication is impossible regardless of coding.

What is the difference between lossless and lossy compression? Lossless compression preserves exact original data; the decompressed data matches the original bit-for-bit. Lossy compression discards some information to achieve higher compression ratios, accepting some distortion.

How do error-correcting codes work? Error-correcting codes add structured redundancy to data. The receiver uses the structure to detect and correct errors in the received message, as long as the number of errors does not exceed the code’s correction capability.

Probability Theory Guide — Data Science Mathematics — Fourier Analysis

Frequently Asked Questions

What are the fundamental principles behind information theory?

Scientific understanding builds on observation, hypothesis formation, experimentation, and peer review. The core principles vary by discipline but share a foundation of empirical evidence and reproducible results. Understanding these fundamentals helps evaluate new claims and apply knowledge to real-world situations.

How does this connect to everyday life?

Science explains phenomena we encounter daily — from weather patterns and cooking chemistry to the physics of motion. Understanding these connections makes science more relevant and memorable. Many technological advances that shape modern life originated from basic scientific research that seemed abstract at the time.

What are the most significant recent discoveries in this area?

Recent advances continue to refine our understanding. New research methods and technologies enable observations that were previously impossible. Staying current with peer-reviewed journals and reputable science news sources helps track the evolving understanding of natural phenomena.

How do scientists study information theory?

Scientists use a combination of direct observation, controlled experiments, computer modeling, and statistical analysis. The specific methods depend on the scale and nature of the phenomenon. Field studies, laboratory experiments, and theoretical modeling each contribute different types of understanding.

Share this article

X LinkedIn Facebook Email