Information Theory: Entropy, Compression, and Communication
Introduction
Information theory, founded by Claude Shannon in his 1948 paper A Mathematical Theory of Communication, quantifies information itself. Before Shannon, information was a vague concept. After Shannon, it became a precisely measurable quantity with units of bits — the amount of information gained from learning the outcome of a fair coin flip.
Shannon’s theory addresses two fundamental problems in communication: source coding (compressing data to its essential size) and channel coding (reliably transmitting data over noisy channels). The solutions reveal deep connections between information, probability, and computation that extend far beyond communication engineering.
Entropy and Information
Entropy measures the uncertainty or information content of a random variable. For a discrete random variable X with probability distribution p(x), the entropy H(X) = −Σ p(x) log₂ p(x) bits. A fair coin has entropy 1 bit. A biased coin has less entropy because the outcome is more predictable.
Quantifying Information
The self-information of an event x is I(x) = −log₂ p(x) bits. Rare events carry more information than common events. Learning that a highly improbable event occurred provides more surprise and more information than learning that a routine event happened. The entropy is the expected value of self-information.
The binary entropy function H(p) = −p log₂ p − (1−p) log₂(1−p) gives the entropy of a Bernoulli random variable with success probability p. It peaks at p = 1/2 with value 1 bit and approaches 0 as p approaches 0 or 1. This function appears throughout information theory, coding theory, and machine learning.
Entropy measures the uncertainty or information content of a random variable. For a discrete random variable X with probability distribution p(x), the entropy H(X) = −Σ p(x) log₂ p(x) bits. A fair coin has entropy 1 bit. A biased coin has less entropy because the outcome is more predictable.
The Kullback-Leibler Divergence
The Kullback-Leibler (KL) divergence D(p||q) = Σ p(x) log₂(p(x)/q(x)) measures the inefficiency of using distribution q to approximate distribution p. It is always nonnegative and equals zero only when p = q. The KL divergence is not symmetric and does not satisfy the triangle inequality, so it is not a true distance metric.
KL divergence appears in machine learning as the loss function for variational inference and as the regularization term in variational autoencoders. The cross-entropy H(p,q) = H(p) + D(p||q) measures the average number of bits needed to encode samples from p using a code optimized for q. Minimizing cross-entropy is equivalent to minimizing KL divergence when p is fixed, making it the standard loss function for classification.
Properties of Entropy
Entropy is nonnegative and maximized by the uniform distribution. For n equally likely outcomes, H = log₂ n bits. Entropy is additive for independent random variables: H(X,Y) = H(X) + H(Y). Joint entropy satisfies H(X,Y) ≤ H(X) + H(Y) with equality only for independent variables.
Conditional entropy H(Y|X) = H(X,Y) − H(X) measures the remaining uncertainty about Y after knowing X. The chain rule H(X₁,X₂,…,Xₙ) = Σ H(Xᵢ|X₁,…,Xᵢ₋₁) generalizes this decomposition.
Differential Entropy
For continuous random variables, differential entropy h(X) = −∫ f(x) log₂ f(x) dx extends the concept. Unlike discrete entropy, differential entropy can be negative and depends on the measurement scale. The Gaussian distribution maximizes differential entropy for a given variance, making it the least informative distribution among those with fixed second moment.
Mutual Information
Mutual information I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) measures the amount of information one random variable contains about another. It is symmetric, nonnegative, and zero only for independent variables. Mutual information is the fundamental quantity for communication: it measures how much information the channel output reveals about the input.
Data Processing Inequality
If X → Y → Z form a Markov chain (Z depends on X only through Y), then I(X;Z) ≤ I(X;Y). Processing cannot increase information: no amount of data manipulation can create information that was not present in the data. This inequality has profound implications for statistical inference and machine learning.
Source Coding
Source coding (data compression) represents data using as few bits as possible. Lossless compression preserves the original data exactly. Lossy compression discards some information, accepting distortion in exchange for lower bit rates.
Shannon’s Source Coding Theorem
The source coding theorem states that the minimum expected codeword length for lossless compression of a source X is at least H(X) bits per symbol and can be made arbitrarily close to H(X). Entropy is the fundamental limit of lossless compression — no scheme can beat it on average.
Huffman coding achieves the entropy bound for dyadic distributions (probabilities that are powers of 1/2) and comes close otherwise. Arithmetic coding handles noninteger code lengths efficiently and is the basis for JPEG, MP3, and video compression standards.
Lossy Compression
Rate-distortion theory addresses lossy compression. The rate-distortion function R(D) gives the minimum bit rate needed to achieve distortion at most D. The tradeoff between rate and distortion is fundamental: higher quality requires more bits.
For a Gaussian source with mean squared error distortion, the rate-distortion function is R(D) = (1/2) log₂(σ²/D) for D ≤ σ². This formula quantifies how many bits per sample are needed to represent the source with a given level of distortion. The Blahut-Arimoto algorithm computes rate-distortion functions numerically for arbitrary sources and distortion measures.
JPEG image compression exploits the fact that human vision is less sensitive to high-frequency details. The discrete cosine transform converts image blocks to frequency coefficients, which are quantized (discarding high-frequency information) and then entropy-coded.
Channel Capacity
Channel capacity C = max_{p(x)} I(X;Y) is the maximum rate at which information can be reliably transmitted over a noisy channel. Shannon’s channel coding theorem states that for any rate R < C, there exists a code achieving arbitrarily small error probability. For rates above C, reliable communication is impossible.
Binary Symmetric Channel
The binary symmetric channel (BSC) flips each bit with crossover probability p. Its capacity is C = 1 − H(p) bits per channel use. When p = 0 (no noise), capacity is 1. When p = 1/2 (maximally noisy), capacity is 0 — no information gets through.
Gaussian Channel
The additive white Gaussian noise (AWGN) channel adds Gaussian noise to the transmitted signal. With power constraint P and noise variance σ², the capacity is C = (1/2) log₂(1 + P/σ²) bits per channel use. This formula drives the design of modems, wireless systems, and satellite communications.
The Shannon-Hartley theorem C = B log₂(1 + S/N) expresses capacity for a channel with bandwidth B and signal-to-noise ratio S/N. Doubling bandwidth doubles capacity in the low-SNR regime, while increasing power is more effective in the high-SNR regime.
Error-Correcting Codes
Channel coding adds redundancy to protect against errors. Block codes process fixed-length chunks of data. Convolutional codes process streaming data with shift registers. Turbo codes and low-density parity-check (LDPC) codes approach the Shannon limit in practice.
Linear Block Codes
A linear block code maps k information bits to n coded bits using a generator matrix G. The code rate is k/n. The minimum Hamming distance d_min determines error correction capability: a code can correct up to ⌊(d_min − 1)/2⌋ errors.
Hamming codes (n = 2ʳ − 1, k = 2ʳ − r − 1, d_min = 3) correct single errors with minimal redundancy. Reed-Solomon codes handle burst errors in CDs, QR codes, and deep-space communications.
Modern Codes
Turbo codes, invented in 1993, achieve performance within a fraction of a dB of the Shannon limit by iteratively decoding two constituent convolutional codes. LDPC codes, rediscovered around the same time, achieve similar performance with lower decoding complexity. Both are used in 5G, WiFi, and satellite communications.
Polar codes, invented by Erdal Arikan in 2009, are the first codes to provably achieve the symmetric capacity of binary-input memoryless channels with low encoding and decoding complexity. They exploit channel polarization: as code length increases, virtual channels become either perfectly noisy or perfectly noiseless. Information is transmitted over the noiseless channels while frozen bits are sent over the noisy ones.
Turbo codes, invented in 1993, achieve performance within a fraction of a dB of the Shannon limit by iteratively decoding two constituent convolutional codes. LDPC codes, rediscovered around the same time, achieve similar performance with lower decoding complexity. Both are used in 5G, WiFi, and satellite communications.
Applications in Machine Learning
Information theory provides foundational concepts for machine learning. The cross-entropy loss function measures the difference between predicted and true distributions. Maximum likelihood estimation is equivalent to minimizing KL divergence between the empirical and model distributions.
The information bottleneck method finds optimal representations by maximizing mutual information with the target while compressing the input. Variational autoencoders use the ELBO (evidence lower bound), which decomposes into a reconstruction term and a KL divergence term.
Decision tree learning uses information gain (reduction in entropy) to select splitting attributes. Feature selection uses mutual information to identify the most relevant features for prediction.
What is entropy in simple terms? Entropy measures average uncertainty or information content. A fair coin flip has 1 bit of entropy because the outcome is maximally uncertain. A predictable source has low entropy.
What does Shannon’s channel coding theorem say? Any communication channel has a capacity C. For any rate below C, reliable communication is possible using appropriate coding. For rates above C, reliable communication is impossible regardless of coding.
What is the difference between lossless and lossy compression? Lossless compression preserves exact original data; the decompressed data matches the original bit-for-bit. Lossy compression discards some information to achieve higher compression ratios, accepting some distortion.
How do error-correcting codes work? Error-correcting codes add structured redundancy to data. The receiver uses the structure to detect and correct errors in the received message, as long as the number of errors does not exceed the code’s correction capability.
Probability Theory Guide — Data Science Mathematics — Fourier Analysis