Statistics Guide: Data Analysis and Inference Methods
Introduction
Statistics transforms data into insight. In a world awash with information, statistical methods separate signal from noise, quantify uncertainty, and support evidence-based decision making. From clinical trials that determine whether new drugs work to opinion polls that track political sentiment, statistics provides the tools for drawing reliable conclusions from imperfect data.
The field divides into two branches: descriptive statistics summarizes data through numbers and graphics, while inferential statistics draws conclusions about populations from samples. Both branches rest on the foundation of probability theory, which models the random processes that generate data.
Descriptive Statistics
Descriptive statistics provides the first look at any dataset. Measures of central tendency describe typical values. Measures of dispersion describe variability. Graphical displays reveal patterns, outliers, and distributions.
Measures of Central Tendency
The mean (average) is the sum of values divided by the count. The median is the middle value when the data are ordered (or the average of the two middle values when the count is even). The mode is the most frequent value. The mean is sensitive to outliers: a single extreme value can pull it far from the typical range. The median is robust, changing little when outliers are present. The choice between mean and median depends on the data distribution and the question being asked.
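The sensitivity of the mean to a single extreme value can be seen in a small sketch. The salary figures below are made up for illustration; only the standard-library statistics module is used.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands); the second list adds one extreme value.
salaries = [42, 45, 45, 48, 50]
with_outlier = salaries + [400]

# Without the outlier: mean 46, median 45, mode 45.
print(mean(salaries), median(salaries), mode(salaries))

# With the outlier: mean 105, median 46.5, mode 45.
print(mean(with_outlier), median(with_outlier), mode(with_outlier))
```

The mean more than doubles while the median moves by 1.5 and the mode does not move at all.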
The geometric mean (the nth root of the product of n values) is appropriate for rates of change, growth rates, and ratios. The geometric mean of investment returns over multiple periods gives the average compound growth rate. The harmonic mean (the reciprocal of the arithmetic mean of reciprocals) applies to averaging rates, such as average speed over journeys with the same distance but different speeds.
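As a rough illustration of both means, the sketch below uses invented growth factors and speeds. The functions geometric_mean and harmonic_mean come from Python's standard statistics module.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical yearly growth factors: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]
avg_factor = geometric_mean(growth_factors)

# Compounding the average factor three times reproduces the total growth.
total_growth = 1.10 * 0.95 * 1.20
print(avg_factor ** 3, total_growth)

# Same distance driven at 40 km/h and then at 60 km/h.
# The average speed is the harmonic mean (48 km/h), not the arithmetic mean (50).
avg_speed = harmonic_mean([40, 60])
print(avg_speed)
```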
For symmetric distributions without outliers, the mean is the natural measure. For skewed distributions like income or housing prices, the median better represents the typical case. Governments reporting median household income rather than mean income acknowledge this distinction.
Measures of Dispersion
The range (maximum minus minimum) is the simplest measure of spread but depends entirely on two extreme values. The interquartile range (IQR = Q3 − Q1) captures the middle 50% of data and is resistant to outliers. The variance and standard deviation measure spread around the mean. They use every data point, which also makes them sensitive to outliers.
Box plots visualize the five-number summary: minimum, Q1, median, Q3, maximum. In the common variant that flags outliers, points more than 1.5×IQR below Q1 or above Q3 are plotted individually, and the whiskers stop at the most extreme values inside those limits. Histograms show the shape of the distribution, revealing skewness, multimodality, and gaps.
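The 1.5×IQR rule can be sketched as follows. The data are invented, and the quartiles come from statistics.quantiles with its default method; other quartile conventions give slightly different cut-offs, so treat the exact limits as one possible choice.

```python
from statistics import quantiles

# Hypothetical measurements with one suspiciously large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Limits beyond which a box plot would draw points individually.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(outliers)  # [30]
```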
Standard deviation has the same units as the original data, making it the preferred measure of spread for interpretation. The coefficient of variation CV = σ/μ expresses relative variability, useful for comparing dispersion across datasets with different scales or units.
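A minimal sketch of these measures on made-up data follows. It uses the sample versions (dividing by n − 1), so the coefficient of variation is computed as s/x̄, the sample analogue of σ/μ.

```python
from statistics import mean, stdev, variance

# Hypothetical measurements, all in the same unit.
data = [12.0, 15.0, 11.0, 18.0, 14.0, 13.0, 16.0, 13.0]

data_range = max(data) - min(data)   # depends only on the two extremes
sample_var = variance(data)          # squared units
sample_sd = stdev(data)              # same units as the data
cv = sample_sd / mean(data)          # unitless relative spread

print(data_range, sample_var, sample_sd, cv)
```

Because the coefficient of variation divides by the mean, it is only meaningful for data measured on a scale where the mean is positive and zero is a true zero.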
Skewness and Kurtosis
Skewness measures the asymmetry of a distribution. Positive skew (a long right tail) typically pulls the mean above the median, as in income distributions and housing prices. Negative skew (a long left tail) typically pulls the mean below the median, as with age at death in developed countries.
Kurtosis measures the tail heaviness of a distribution. High kurtosis (leptokurtic) indicates more extreme outliers than the normal distribution. Low kurtosis (platykurtic) indicates fewer or less extreme outliers. The normal distribution has kurtosis of 3 (excess kurtosis of 0). Financial return distributions typically exhibit high kurtosis — more extreme events than the normal distribution predicts.
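Both quantities can be computed from central moments. The sketch below uses the simple population (moment-based) definitions; statistical packages often apply small-sample corrections, so their values may differ slightly. The helper names and data are illustrative.

```python
from statistics import mean

def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    m = mean(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment: 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

Subtracting 3 from the kurtosis value gives the excess kurtosis mentioned above.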
Correlation
Correlation measures the strength and direction of linear relationships between two variables. The Pearson correlation coefficient r ranges from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). Correlation does not imply causation — a fundamental caution that applies whenever observational data is interpreted.
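The sketch below computes Pearson's r directly from its definition on two invented datasets. The second one shows that r near zero rules out only a linear relationship: y is completely determined by x, yet r is 0.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfect positive linear relationship: y = 2x + 1.
xs = [1, 2, 3, 4, 5]
linear = [2 * x + 1 for x in xs]

# Perfect but nonlinear relationship: y = x squared on symmetric x values.
sym_xs = [-2, -1, 0, 1, 2]
parabola = [x ** 2 for x in sym_xs]

print(pearson_r(xs, linear))        # 1 (up to rounding)
print(pearson_r(sym_xs, parabola))  # 0
```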
Sampling Distributions
Statistical inference begins with understanding how sample statistics vary. The sampling distribution of the sample mean describes how the mean of a random sample behaves across repeated sampling. Its standard deviation, called the standard error, quantifies the precision of the sample mean as an estimator.
The Central Limit Theorem in Practice
The Central Limit Theorem ensures that the sampling distribution of the mean is approximately normal for large sample sizes, whatever the shape of the population distribution, provided the population has finite variance. How large a sample is needed depends on how skewed or heavy-tailed the population is. This theorem justifies the normal-based confidence intervals and hypothesis tests that form the backbone of classical statistics.
The standard error of the mean equals σ/√n, where σ is the population standard deviation. Since σ is usually unknown, we estimate it with the sample standard deviation s, giving the estimated standard error s/√n. The t-distribution replaces the normal when using estimated standard errors, with thicker tails to account for the additional uncertainty.
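A small simulation can make the σ/√n formula concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (strongly skewed, so not normal) and draws repeated samples of size 50. The seed, sample size, and repetition count are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 50            # size of each sample
repetitions = 5000

# Each entry is the mean of one random sample from an exponential population
# with rate 1, which has mean 1 and standard deviation 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

theoretical_se = 1.0 / sqrt(n)      # sigma / sqrt(n)
observed_se = stdev(sample_means)   # spread of the simulated sample means

print(mean(sample_means), observed_se, theoretical_se)
```

The spread of the simulated means should land close to the theoretical standard error, and a histogram of sample_means would look roughly bell-shaped despite the skewed population.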
Confidence Intervals
A confidence interval provides a range of plausible values for an unknown population parameter. A 95% confidence interval for the population mean is x̄ ± t* × s/√n, where t* is the critical value from the t-distribution with n−1 degrees of freedom.
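The formula can be applied by hand as in the sketch below. The ten measurements are invented, and because Python's standard library has no t-distribution, the critical value 2.262 (95%, two-sided, 9 degrees of freedom) is hard-coded from a standard t table; a statistics package would normally supply it.

```python
from math import sqrt
from statistics import mean, stdev

# Critical value t* for 95% confidence with n - 1 = 9 degrees of freedom,
# taken from a t table (assumption: n stays at 10).
T_STAR_DF9 = 2.262

# Hypothetical sample of ten measurements.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 4.7, 5.2, 5.9]

n = len(sample)
x_bar = mean(sample)
standard_error = stdev(sample) / sqrt(n)   # s / sqrt(n)
margin = T_STAR_DF9 * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```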
The interpretation is subtle: 95% of confidence intervals constructed from repeated samples will contain the true population mean. A single interval either contains the mean or does not — the confidence is in the procedure, not the specific interval. This frequentist interpretation contrasts with Bayesian credible intervals, which directly state the probability that the parameter lies in the interval.
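The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with a known true mean of 10, builds a 95% t-interval from each of many samples of size 10, and counts how often the interval covers the true mean. The critical value 2.262 is again hard-coded from a t table.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

T_STAR_DF9 = 2.262   # 95% two-sided t critical value for 9 degrees of freedom
TRUE_MEAN = 10.0     # known here only because the data are simulated
n, trials = 10, 4000

hits = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(n)]
    margin = T_STAR_DF9 * stdev(sample) / sqrt(n)
    # Does this particular interval contain the true mean?
    if abs(mean(sample) - TRUE_MEAN) <= margin:
        hits += 1

coverage = hits / trials
print(coverage)  # should be close to 0.95
```

Each individual interval either covers 10 or misses it; the 95% describes the long-run proportion of intervals that cover.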
Hypothesis Testing
Hypothesis testing evaluates evidence against a null hypothesis H₀, typically a statement of no effect. The alternative hypothesis Hₐ represents the effect of interest. The p-value is the probability of observing data as extreme as or more extreme than the actual data, assuming the null hypothesis is true.
The Logic of Significance Testing
A small p-value (conventionally below 0.05) provides evidence against the null hypothesis. A p-value of 0.03 means that if the null hypothesis were true, data at least this extreme would occur only 3% of the time. At the conventional 0.05 significance level, this is treated as sufficient evidence to reject the null.
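A p-value can be computed exactly in a simple case. The sketch below uses a hypothetical experiment that is not from the text: a coin lands heads 15 times in 20 flips, and the null hypothesis is that the coin is fair. The p-value adds up the probabilities of every outcome at least as far from 10 heads as the observed one.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """P(outcome at least as far from flips/2 as observed), assuming a fair coin."""
    observed_distance = abs(heads - flips / 2)
    extreme = [
        k for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    ]
    # Under the null, every sequence of flips has probability 1 / 2**flips.
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

p_value = two_sided_binomial_p(15, 20)
print(round(p_value, 4))  # 0.0414
```

Here the extreme outcomes are 0 to 5 heads and 15 to 20 heads, so the result falls just under the 0.05 threshold.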
Type I error (false positive) occurs when a true null hypothesis is rejected. The significance level α controls the Type I error rate. Type II error (false negative) occurs when a false null hypothesis is not rejected. The power of a test is the probability of correctly rejecting a false null hypothesis. Sample size calculations balance these error rates against practical constraints.
Common Tests
The one-sample t-test compares a sample mean to a known value. The two-sample t-test compares means from two independent groups. The paired t-test compares measurements on the same subjects before and after treatment. The chi-squared test assesses independence between categorical variables. ANOVA extends the t-test to compare means across three or more groups.
Each test makes assumptions — normality, equal variances, independence — that should be checked before drawing conclusions. Transformations or nonparametric alternatives address violations of these assumptions.
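As one worked case, the chi-squared statistic for independence compares each observed count with the count expected if rows and columns were independent. The 2×2 table below is invented, and the critical value 3.841 (α = 0.05, 1 degree of freedom) is taken from a standard chi-squared table rather than computed.

```python
def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are groups, columns are outcomes.
table = [[30, 10],
         [20, 40]]

CRITICAL_DF1 = 3.841  # from a chi-squared table, alpha = 0.05, 1 df

stat = chi_squared_statistic(table)
print(stat, stat > CRITICAL_DF1)
```

For an r×c table the degrees of freedom are (r − 1)(c − 1), so a different table shape needs a different critical value.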
Effect Size and Power
Statistical significance does not imply practical importance. Effect size measures the magnitude of an effect independent of sample size. Cohen’s d expresses the difference between two means in standard deviation units. A large sample can make a trivial effect statistically significant, making effect size reporting essential for interpreting results.
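A minimal sketch of Cohen's d follows. It assumes the common convention of dividing by the pooled sample standard deviation of the two groups; other standardizers exist. The scores are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical scores for a treated and a control group.
treated = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]

d = cohens_d(treated, control)
print(d)
```

Unlike a test statistic, d does not grow with the sample size: collecting more data from the same populations sharpens the estimate of d without inflating it.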
Power analysis determines the sample size needed to detect an effect of a given size with a chosen probability. Studies with low power waste resources and have a high probability of missing real effects. Power depends on the effect size, sample size, significance level, and test type. A priori power analysis ensures that studies are adequately sized before data collection begins.
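For a two-sided comparison of two group means, a widely used normal approximation gives the required size per group as n ≈ 2((z₁₋α/₂ + z_power)/d)², where d is the standardized effect size. The sketch below implements that approximation only; exact t-based calculations give slightly larger numbers, and the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample comparison."""
    z = NormalDist()                      # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)            # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

Halving the effect size roughly quadruples the required sample, which is why small effects are expensive to detect.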
Regression Analysis
Regression models the relationship between a response variable and one or more predictor variables. Simple linear regression fits a straight line y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ is the slope, and ε is random error.
Fitting and Interpreting Regression
Least squares estimation chooses β₀ and β₁ to minimize the sum of squared residuals. The slope β₁ represents the expected change in y for a one-unit increase in x. The coefficient of determination R² measures the proportion of variance in y explained by the model.
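For simple linear regression the least squares solution has a closed form: the slope is Sxy/Sxx and the intercept is ȳ minus the slope times x̄. The sketch below applies it to invented data and computes R² from the residuals; the function names are illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Proportion of variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(intercept, slope, r2)
```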
Residual analysis checks regression assumptions. Residuals should be approximately normally distributed with constant variance (homoscedasticity) and no systematic patterns. Plots of residuals versus fitted values reveal heteroscedasticity (fan-shaped patterns) and nonlinearity (curved patterns). Q-Q plots assess normality of residuals.
Multiple regression extends to several predictors: y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε. Each coefficient represents the expected change in y for a one-unit change in that predictor, holding all others constant. Model selection, multicollinearity, and interaction terms are important considerations in multiple regression analysis.
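With several predictors, the least squares coefficients solve the normal equations (XᵀX)β = Xᵀy. The sketch below builds and solves them in plain Python on a tiny invented dataset generated exactly from y = 1 + 2x₁ − 3x₂. It is meant only to show the mechanics; real analyses would normally use a numerical library with more stable methods.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical predictors (x1, x2) and responses from y = 1 + 2*x1 - 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple(rows, ys)
print(coefficients)  # approximately [1, 2, -3]
```

If two predictors were exact multiples of each other, XᵀX would be singular and the division by the pivot would fail, which is the extreme form of the multicollinearity problem noted above.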
Bayesian Statistics
Bayesian statistics treats parameters as random variables with prior distributions that reflect initial beliefs. Bayes’ theorem updates the prior to the posterior distribution given observed data. The posterior distribution represents all current knowledge about the parameter.
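The prior-to-posterior update has a simple closed form in one classic case, used here purely as an illustration: a Beta(a, b) prior on a success probability combined with binomial data yields a Beta(a + successes, b + failures) posterior. The counts and the uniform Beta(1, 1) prior below are assumptions chosen for the example.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior: every success probability is equally plausible beforehand.
prior = (1, 1)

# Hypothetical data: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print(posterior)              # (8, 4)
print(beta_mean(*prior))      # 0.5
print(beta_mean(*posterior))  # about 0.667
```

The posterior mean sits between the prior mean (0.5) and the observed proportion (0.7), and moves closer to the data as more observations arrive.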
Bayesian methods offer several advantages: intuitive interpretation of probability intervals, natural incorporation of prior information, and inference for complex models that does not rely on large-sample (asymptotic) approximations. Markov chain Monte Carlo (MCMC) methods make Bayesian computation feasible for realistic problems by drawing samples from the posterior distribution. Bayesian frameworks are widely used in machine learning and predictive modeling.
What is the difference between descriptive and inferential statistics? Descriptive statistics summarizes observed data. Inferential statistics draws conclusions about populations from sample data, quantifying uncertainty in those conclusions.
What does a p-value actually tell you? A p-value is the probability of observing data at least as extreme as yours if the null hypothesis were true. It is not the probability that the null hypothesis is true or false.
When should you use a t-test instead of a z-test? Use a t-test when the population standard deviation is unknown and estimated from the sample. The t-test with n−1 degrees of freedom accounts for this extra uncertainty.
What is the difference between correlation and causation? Correlation measures association between variables. Causation requires that changing one variable directly changes the other. Confounding variables, reverse causation, and spurious correlations make establishing causation from observational data difficult.