
📊 Topic 1: Introduction to Statistics

Topic 1/7 ⭐ Beginner ⏱️ ~15 min

🎯 Learning Objectives

  • State a concise definition of statistics as used in the IITM curriculum.
  • Describe how the field evolved from accounting ledgers to modern data science.
  • Differentiate descriptive and inferential statistics with practical examples.
  • Recognise the connection between statistics and decision making under uncertainty.

1.1 What is Statistics?

Course definition

Statistics is the art and science of learning from data: we collect, summarise, analyse, interpret, and finally communicate findings so that decisions can be made even when uncertainty is present.

Why "art" and "science"?

Rigorous probability theory provides the science. Choosing the right model, summarising the story concisely, and convincing stakeholders require judgement — the art.

Key stages

  • Design a study that captures the question of interest.
  • Collect data with minimal bias and adequate quality.
  • Summarise using tables, charts, and numerical descriptors.
  • Infer patterns beyond the sample and quantify risk.

1.2 Evolution of the field

19th century

Governments recorded births, deaths, and harvests. Statistics meant "state arithmetic" — keeping careful tallies.

Early 20th century

Fisher, Pearson, and Gosset formalised experimental design, sampling distributions, and hypothesis tests. Statistics became a toolkit for inference.

Late 20th century

Computers enabled complex models, resampling, and simulation. Data volumes grew and so did the demand for automation.

Today

Statistics underpins machine learning workflows: data wrangling, exploratory analysis, feature engineering, and uncertainty quantification.

Keep this historical arc in mind: modern buzzwords still rest on the same principles of careful data collection and valid inference.

1.3 Two complementary branches

Descriptive statistics

Uses tables, charts, and summary numbers to describe the data we collected.

  • Measures of centre (mean, median, mode).
  • Measures of variation (range, variance, standard deviation).
  • Data visualisation: bar charts, histograms, scatter plots.

Example: summarising daily COVID-19 cases in Chennai during July.
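The measures of centre and variation listed above can be computed with Python's standard `statistics` module. The daily case counts below are made-up illustrative numbers, not real Chennai data:

```python
import statistics

# Hypothetical daily case counts for one week (illustrative only).
daily_cases = [120, 135, 150, 135, 160, 145, 155]

mean = statistics.mean(daily_cases)               # measure of centre
median = statistics.median(daily_cases)           # robust measure of centre
mode = statistics.mode(daily_cases)               # most frequent value
data_range = max(daily_cases) - min(daily_cases)  # spread: range
variance = statistics.variance(daily_cases)       # sample variance
std_dev = statistics.stdev(daily_cases)           # sample standard deviation

print(f"mean={mean:.1f}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.1f}, sd={std_dev:.1f}")
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n − 1) versions; use `pvariance`/`pstdev` if you genuinely have the whole population.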

Inferential statistics

Draws conclusions about a wider population by analysing a sample.

  • Confidence intervals quantify plausible ranges for unknown parameters.
  • Hypothesis testing assesses evidence for competing claims.
  • Predictive models estimate future outcomes with quantified uncertainty.

Example: using a sample of customer ratings to estimate the true satisfaction level of all customers.
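As a sketch of the confidence-interval idea, the snippet below estimates mean satisfaction from a small invented sample of ratings and builds an approximate 95% interval using the normal quantile from the standard library (for samples this small, a t quantile would be more appropriate; the ratings themselves are made up):

```python
import statistics
from statistics import NormalDist

# Hypothetical customer ratings on a 1-5 scale (illustrative only).
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4, 4]

n = len(ratings)
sample_mean = statistics.mean(ratings)
sample_sd = statistics.stdev(ratings)
standard_error = sample_sd / n ** 0.5

# Approximate 95% interval using the standard normal quantile (z ~ 1.96).
z = NormalDist().inv_cdf(0.975)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"mean rating = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval quantifies exactly what the bullet above describes: a plausible range for the unknown population mean, not a statement about any individual customer.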

Neither branch stands alone. In practice we start with descriptive statistics to understand the sample, then deploy inferential methods to generalise responsibly.

1.4 Statistics and data science

Within data science pipelines, statistics keeps us honest:

  • Exploratory data analysis (EDA): descriptive tools reveal anomalies that could break a model.
  • Modelling assumptions: probability distributions describe residuals and guide diagnostics.
  • Evaluation: statistical tests compare competing models using hold-out data.
  • Communication: confidence statements and risk intervals make stakeholders comfortable acting on results.
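To make the evaluation bullet concrete, here is a minimal paired permutation test (a sign-flip test on per-example loss differences) comparing two hypothetical models on the same hold-out set. The loss values are invented for illustration:

```python
import random

# Hypothetical per-example hold-out losses for two models (invented numbers).
losses_a = [0.31, 0.28, 0.35, 0.30, 0.26, 0.33, 0.29, 0.32]
losses_b = [0.25, 0.24, 0.30, 0.27, 0.22, 0.28, 0.26, 0.27]

diffs = [a - b for a, b in zip(losses_a, losses_b)]
observed = sum(diffs) / len(diffs)

# Sign-flip permutation test: under the null hypothesis of no difference,
# each paired difference is equally likely to be positive or negative.
rng = random.Random(0)
n_perms = 10_000
extreme = 0
for _ in range(n_perms):
    flipped = [d if rng.random() < 0.5 else -d for d in diffs]
    if abs(sum(flipped) / len(flipped)) >= abs(observed):
        extreme += 1
p_value = extreme / n_perms

print(f"mean loss gap = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value says the observed gap would be surprising if the models were equivalent; it is this kind of statement, rather than a raw accuracy difference, that lets stakeholders act on a model comparison.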

If you are ever unsure which algorithm to run next, pause and revisit the statistical questions: What are we measuring? Whom does the sample represent? What uncertainty matters?