Statistics for Data Science - Week 1

1. Introduction to Statistics

Statistics is the discipline of learning from data. The IITM definition emphasises the full cycle: design a study, collect observations, summarise the evidence, analyse patterns, interpret uncertainty, and communicate conclusions responsibly.

Why statistics matters for data science

EDA: descriptive summaries reveal anomalies before modelling.
Modelling: probability distributions describe noise and help validate assumptions.
Evaluation: statistical tests compare alternative models.
Communication: intervals and risk statements show stakeholders how much uncertainty surrounds a result before they act on it.

19th century

State arithmetic: census counts and agriculture ledgers.

Early 20th century

Fisher, Pearson, Gosset formalise experimental design and hypothesis tests.

Late 20th century

Computers enable complex models and simulation.

Present

Statistics anchors machine learning workflows with rigorous uncertainty estimation.

Descriptive statistics

Describe the sample you collected: tables, graphs, averages, spreads.

Example: summarising daily COVID-19 cases in Chennai during July.

Inferential statistics

Generalise beyond the sample using probability theory.

Example: estimating true customer satisfaction from a survey sample.
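
A minimal sketch of the two activities, using only Python's standard library. The satisfaction scores are made-up illustrative values, not real survey data, and the interval relies on a normal approximation (z = 1.96), which is an assumption and only rough for a sample this small.

```python
import math
import statistics

# Hypothetical satisfaction scores (1-10 scale) from a survey sample.
scores = [7, 8, 6, 9, 7, 8, 5, 7, 8, 6, 9, 7]

# Descriptive: summarise the sample itself.
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

# Inferential: estimate the population mean with an approximate 95% interval.
standard_error = sample_sd / math.sqrt(len(scores))
z = 1.96  # normal approximation; a t quantile would suit a small sample better
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}, sample SD: {sample_sd:.2f}")
print(f"Approximate 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first print statement is purely descriptive. The second makes a claim about customers who were never surveyed, which is why it comes with an interval rather than a single number.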

2. Populations and Samples

Every analysis must specify the population (all units of interest) and the sample (the units observed). A sample supports valid inference only when it mirrors key characteristics of the population.

Population examples

All registered voters in Tamil Nadu.
Every device manufactured on a production line this week.
All transactions processed by a fintech app yesterday.

Sample examples

1200 randomly selected voters interviewed by phone.
50 devices chosen for destructive testing.
10% of transactions flagged for manual audit.

Watch out for bias

Undercoverage: some groups were never sampled.
Self-selection: only vocal participants respond.
Non-response: invited units decline to participate.
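
Undercoverage can be illustrated with a small deterministic sketch. The population, the `uses_app` flag, and the spend figures are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical population: 60 app users who spend 120 each,
# and 40 non-users who spend 60 each.
population = (
    [{"uses_app": True, "spend": 120}] * 60
    + [{"uses_app": False, "spend": 60}] * 40
)

true_mean = statistics.mean(p["spend"] for p in population)

# Undercoverage: an in-app survey can only ever reach app users,
# so non-users have zero chance of being sampled.
frame = [p for p in population if p["uses_app"]]
frame_mean = statistics.mean(p["spend"] for p in frame)

print(f"Population mean spend: {true_mean}")
print(f"Mean spend in the reachable group: {frame_mean}")
```

No matter how many app users are surveyed, the estimate stays at 120 while the population mean is 96. A larger sample does not repair a sampling frame that excludes part of the population.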

Simple random sampling

Every unit has an equal chance of selection, and every possible sample of the chosen size is equally likely.

Systematic sampling

Select every kth unit after a random start.

Stratified sampling

Divide the population into internally homogeneous strata and draw a random sample within each.

Cluster sampling

Randomly select whole clusters (schools, districts) and survey all units inside them.
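
The four schemes above can be sketched with Python's standard `random` module. The function names, the population of 100 students, and the `school` field are hypothetical and introduced only for illustration; real surveys add details such as sampling weights that this sketch leaves out.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(units, n)

def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then take every kth unit."""
    start = rng.randrange(k)
    return units[start::k]

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Group units into strata, then draw a simple random sample inside each."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(units, cluster_of, n_clusters, rng):
    """Randomly choose whole clusters and keep every unit inside them."""
    clusters = sorted({cluster_of(unit) for unit in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if cluster_of(unit) in chosen]

# Hypothetical population: 100 students spread evenly over 5 schools.
population = [{"id": i, "school": f"S{i % 5}"} for i in range(100)]
rng = random.Random(42)  # fixed seed so the run is repeatable

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda u: u["school"], 2, rng)
clustered = cluster_sample(population, lambda u: u["school"], 2, rng)

print(len(srs), len(systematic), len(stratified), len(clustered))
```

With this population the first three schemes each return 10 students, while the cluster sample returns 40 because it keeps both chosen schools in full. The stratified sample is guaranteed to include every school; the cluster sample covers only two of them.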

3. Purpose of Statistical Analysis

Statistics serves two complementary goals: describing the sample and inferring conclusions about the population. Most data science workflows involve both.

Descriptive vs inferential

Descriptive: "What does the collected data show?"
Inferential: "What does this imply about the wider group?"

Profile the sample to understand context and detect errors.
Formulate the inferential question using statistical language.
Select models and tests compatible with the data type.
Communicate results with uncertainty and limitations.

4. Data Foundations

Data are recorded facts gathered to answer a question. Recording context (who, what, when, how) is as important as the values themselves.

Motivations for data collection

Monitor performance of processes or policies.
Explore behaviour patterns and clusters.
Ensure regulatory compliance.
Build predictive models for forecasting.

Source types

Primary data are collected firsthand via surveys, experiments, or observations. Secondary data are published by governments, researchers, or open portals.

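Source    | Advantages                              | Challenges
--------- | --------------------------------------- | -----------------------------------------
Primary   | High control, custom definitions.       | Expensive, slower to gather.
Secondary | Immediate access, often large coverage. | Must adapt to someone else’s definitions.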

Structured data

Tabular format with rows and columns; easy to query and analyse.

Unstructured data

Text, audio, video; requires preprocessing or feature extraction.

5. Organising Data

Well-organised data accelerate analysis and reduce errors. Adopt a tidy format: each row is one case, each column one variable.

Case | Student | Programme | EntranceScore | HostelResident
---- | ------- | --------- | ------------- | --------------
1    | Anjali  | BSc DS    | 88            | Yes
2    | Karim   | BSc DS    | 73            | No
3    | Meera   | Diploma   | 91            | Yes

Data dictionary essentials

Variable name and plain-language description.
Data type and measurement units.
Allowed values or coding schemes.
Notes on missing value conventions.

Quality checklist

Flag missing values and record reasons.
Ensure consistent units (all temperatures in deg C, not mixed with Fahrenheit).
Identify out-of-range entries and duplicates.
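
The checklist can be sketched as a small validation pass in plain Python. The first three rows come from the student table in this section; the last two rows are hypothetical bad records added to trigger each check, and the 0 to 100 score range is an assumption that would normally come from the data dictionary.

```python
# Tidy rows: one dict per case, one key per variable.
rows = [
    {"Case": 1, "Student": "Anjali", "Programme": "BSc DS", "EntranceScore": 88, "HostelResident": "Yes"},
    {"Case": 2, "Student": "Karim", "Programme": "BSc DS", "EntranceScore": 73, "HostelResident": "No"},
    {"Case": 3, "Student": "Meera", "Programme": "Diploma", "EntranceScore": 91, "HostelResident": "Yes"},
    # Hypothetical problem records, not part of the original table:
    {"Case": 4, "Student": "Ravi", "Programme": "Diploma", "EntranceScore": None, "HostelResident": "No"},
    {"Case": 3, "Student": "Meera", "Programme": "Diploma", "EntranceScore": 191, "HostelResident": "Yes"},
]

SCORE_RANGE = (0, 100)  # assumed valid range for EntranceScore

def quality_report(rows, score_range=SCORE_RANGE):
    """Return the Case ids that fail each check in the quality checklist."""
    low, high = score_range
    missing, out_of_range, duplicates = [], [], []
    seen = set()
    for row in rows:
        score = row["EntranceScore"]
        if score is None:
            missing.append(row["Case"])          # flag missing values
        elif not low <= score <= high:
            out_of_range.append(row["Case"])     # flag out-of-range entries
        if row["Case"] in seen:
            duplicates.append(row["Case"])       # flag repeated case ids
        seen.add(row["Case"])
    return {"missing": missing, "out_of_range": out_of_range, "duplicates": duplicates}

report = quality_report(rows)
print(report)
```

The report lists case 4 as missing a score and the repeated case 3 as both out of range and a duplicate. It only flags problems; deciding whether to correct, drop, or keep a record is still a judgement that should be documented alongside the data.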

6. Types of Data

Classifying variables correctly determines which summaries and models are appropriate.

Categorical

Nominal: unordered labels (blood group).
Ordinal: ordered labels without fixed spacing (survey ratings).

Numerical

Discrete: counts such as number of calls handled.
Continuous: measurements like temperature or height.

Data structure

Cross-sectional: many units at one time (e.g., incomes of all students in 2025).
Time-series: one unit across time (e.g., daily rainfall in July).
Panel data: multiple units tracked over time.

Type       | Summaries                | Visuals                      | Techniques
---------- | ------------------------ | ---------------------------- | --------------------------
Nominal    | Counts, proportions      | Bar chart, stacked bar       | Chi-square tests
Ordinal    | Medians, percentiles     | Ordered bar, cumulative plot | Non-parametric tests
Discrete   | Mean, variance           | Dot plot, histogram          | Poisson/Binomial models
Continuous | Mean, standard deviation | Histogram, box plot, density | t-tests, regression, ANOVA
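
The Summaries column of the table can be sketched with the standard `statistics` and `collections` modules. All values below are hypothetical, and the 1 to 5 coding of the ratings is an assumption made for the example.

```python
import statistics
from collections import Counter

# Hypothetical values, one small list per variable type.
blood_group = ["A", "O", "B", "O", "AB", "O", "A"]  # nominal
rating = [1, 3, 4, 4, 5, 2, 4]                      # ordinal, coded 1 (low) to 5 (high)
calls_handled = [12, 15, 9, 15, 11]                 # discrete counts
temperature_c = [31.2, 29.8, 33.1, 30.4]            # continuous measurements

# Nominal: counts and proportions are meaningful; a mean is not.
counts = Counter(blood_group)
proportions = {group: n / len(blood_group) for group, n in counts.items()}

# Ordinal: order is meaningful, so summarise with the median.
median_rating = statistics.median(rating)

# Discrete and continuous: means and spreads are meaningful.
mean_calls = statistics.mean(calls_handled)
var_calls = statistics.variance(calls_handled)  # sample variance
mean_temp = statistics.mean(temperature_c)
sd_temp = statistics.stdev(temperature_c)       # sample standard deviation

print(counts.most_common(1), median_rating, mean_calls)
```

Python would compute a mean of the rating codes without complaint, but the result assumes equal spacing between categories, which an ordinal scale does not guarantee. The variable type, not the storage format, decides which summary is valid.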

7. Measurement Scales & Recap

Measurement scales indicate which mathematical operations make sense.

Scale    | Order | Differences      | Ratios | Examples
-------- | ----- | ---------------- | ------ | ------------------------------------
Nominal  | No    | No               | No     | Blood group, browser type
Ordinal  | Yes   | No fixed spacing | No     | Customer satisfaction, ranks
Interval | Yes   | Yes              | No     | Temperature in deg C, calendar years
Ratio    | Yes   | Yes              | Yes    | Height, weight, income, duration
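
A short worked example of why ratios fail on an interval scale. The two temperatures are arbitrary illustrative values, and the Kelvin conversion is brought in here (it is not part of the table) as a familiar ratio scale with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a ratio scale whose zero is a true zero."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: 10 degrees warmer.
difference_c = warm_c - cool_c

# Ratios are not: 20 / 10 = 2 in deg C, but the zero point is arbitrary.
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio is meaningful, and it is close to 1, not 2.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 3))
```

The difference of 10 degrees survives the change of scale, but the ratio drops from 2 to roughly 1.035. "Twice as hot" is therefore not a valid statement for temperatures in deg C.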

Foundations

Statistics merges descriptive storytelling with inferential decision making.

Sampling

Representative samples prevent bias and support inference.

Data management

Document variables, maintain tidy tables, and manage quality.

Classification

Data types and measurement scales guide the choice of statistical tools.
