Week 1 — Foundations of Statistics

Instructor: IITM Online BS in Data Science | Focus: Descriptive foundations, sampling, and data vocabulary


1. Introduction to Statistics

Statistics is the discipline of learning from data. The IITM definition emphasises the full cycle: design a study, collect observations, summarise the evidence, analyse patterns, interpret uncertainty, and communicate conclusions responsibly.

Why statistics matters for data science

  • Exploratory data analysis (EDA): descriptive summaries reveal anomalies before modelling.
  • Modelling: probability distributions describe noise and help validate assumptions.
  • Evaluation: statistical tests compare alternative models.
  • Communication: intervals and risk statements make stakeholders comfortable acting on results.

A brief history

  • 19th century: state arithmetic (census counts and agriculture ledgers).
  • Early 20th century: Fisher, Pearson, and Gosset formalise experimental design and hypothesis tests.
  • Late 20th century: computers enable complex models and simulation.
  • Present: statistics anchors machine learning workflows with rigorous uncertainty estimation.

Descriptive statistics

Describe the sample you collected: tables, graphs, averages, spreads.

Example: summarising daily COVID-19 cases in Chennai during July.

Inferential statistics

Generalise beyond the sample using probability theory.

Example: estimating true customer satisfaction from a survey sample.
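The two modes can be contrasted in a few lines of Python. This is a minimal sketch with made-up satisfaction scores; the normal-approximation interval is purely illustrative (a t-interval would suit a sample this small).

```python
import math
import statistics

# Hypothetical satisfaction ratings (1-5) from a survey sample
sample = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5, 3, 4]

# Descriptive: summarise the sample we actually collected
mean = statistics.mean(sample)
sd = statistics.stdev(sample)
print(f"sample mean = {mean:.2f}, sample sd = {sd:.2f}")

# Inferential: a rough 95% interval for the *population* mean,
# using the normal approximation for simplicity
se = sd / math.sqrt(len(sample))
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"approx. 95% CI for the true mean: ({low:.2f}, {high:.2f})")
```

The first two lines of output describe only the sample; the interval is the inferential step, a statement about the wider population with quantified uncertainty.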


2. Populations and Samples

Every analysis must specify the population (all units of interest) and the sample (the units observed). A sample supports valid inference only when it mirrors key characteristics of the population.

Population examples

  • All registered voters in Tamil Nadu.
  • Every device manufactured on a production line this week.
  • All transactions processed by a fintech app yesterday.

Sample examples

  • 1200 randomly selected voters interviewed by phone.
  • 50 devices chosen for destructive testing.
  • 10% of transactions flagged for manual audit.

Watch out for bias

  • Undercoverage: some groups were never sampled.
  • Self-selection: only vocal participants respond.
  • Non-response: invited units decline to participate.

Simple random sampling

Every unit has equal chance of selection.

Systematic sampling

Select every kth unit after a random start.

Stratified sampling

Divide into homogeneous strata and sample within each.

Cluster sampling

Select clusters (schools, districts) and survey all units inside.
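The four designs above can be sketched with Python's random module. The population of 100 numbered units, the two strata, and the cluster size are all hypothetical:

```python
import random

random.seed(42)  # reproducible draws
population = list(range(1, 101))  # unit IDs 1..100

# Simple random sampling: every unit has an equal chance of selection
srs = random.sample(population, 10)

# Systematic sampling: every k-th unit after a random start
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample within homogeneous strata
strata = {"low": population[:50], "high": population[50:]}
stratified = [u for s in strata.values() for u in random.sample(s, 5)]

# Cluster sampling: pick whole clusters, survey every unit inside
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [u for c in random.sample(clusters, 2) for u in c]

print(len(srs), len(systematic), len(stratified), len(cluster_sample))
```

Note that the first three designs each yield 10 units, while cluster sampling yields everyone inside the chosen clusters, which is why it is usually cheaper but noisier.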


3. Purpose of Statistical Analysis

Statistics serves two complementary goals: describing the sample and inferring conclusions about the population. Both steps appear in every data science workflow.

Descriptive vs inferential

  • Descriptive: "What does the collected data show?"
  • Inferential: "What does this imply about the wider group?"

A typical workflow

  1. Profile the sample to understand context and detect errors.
  2. Formulate the inferential question using statistical language.
  3. Select models and tests compatible with the data type.
  4. Communicate results with uncertainty and limitations.


4. Data Foundations

Data are recorded facts gathered to answer a question. Recording context (who, what, when, how) is as important as the values themselves.

Motivations for data collection

  • Monitor performance of processes or policies.
  • Explore behaviour patterns and clusters.
  • Ensure regulatory compliance.
  • Build predictive models for forecasting.

Source types

Primary data are collected firsthand via surveys, experiments, or observations. Secondary data are published by governments, researchers, or open portals.

Source    | Advantages                              | Challenges
--------- | --------------------------------------- | -----------------------------------------
Primary   | High control, custom definitions.       | Expensive, slower to gather.
Secondary | Immediate access, often large coverage. | Must adapt to someone else’s definitions.

Structured data

Tabular format with rows and columns; easy to query and analyse.

Unstructured data

Text, audio, video; requires preprocessing or feature extraction.


5. Organising Data

Well-organised data accelerate analysis and reduce errors. Adopt a tidy format: each row is one case, each column one variable.

Case | Student | Programme | EntranceScore | HostelResident
---- | ------- | --------- | ------------- | --------------
1    | Anjali  | BSc DS    | 88            | Yes
2    | Karim   | BSc DS    | 73            | No
3    | Meera   | Diploma   | 91            | Yes

Data dictionary essentials

  • Variable name and plain-language description.
  • Data type and measurement units.
  • Allowed values or coding schemes.
  • Notes on missing value conventions.

Quality checklist

  • Flag missing values and record reasons.
  • Ensure consistent units (all temperatures in deg C, not mixed with Fahrenheit).
  • Identify out-of-range entries and duplicates.
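The tidy table, data dictionary, and quality checklist above can be combined into a small sketch. The `quality_report` helper and the dictionary fields are illustrative, not a standard API:

```python
# Tidy data: one row per case, one column per variable (the table above)
rows = [
    {"Student": "Anjali", "Programme": "BSc DS", "EntranceScore": 88, "HostelResident": "Yes"},
    {"Student": "Karim", "Programme": "BSc DS", "EntranceScore": 73, "HostelResident": "No"},
    {"Student": "Meera", "Programme": "Diploma", "EntranceScore": 91, "HostelResident": "Yes"},
]

# A minimal data dictionary: description, type, allowed values or range
data_dictionary = {
    "Student": {"description": "Given name", "type": "nominal"},
    "Programme": {"description": "Enrolled programme", "type": "nominal",
                  "allowed": {"BSc DS", "Diploma"}},
    "EntranceScore": {"description": "Entrance exam score", "type": "discrete",
                      "range": (0, 100)},
    "HostelResident": {"description": "Lives in hostel", "type": "nominal",
                       "allowed": {"Yes", "No"}},
}

def quality_report(rows, dictionary):
    """Flag missing values, out-of-range entries, invalid codes, duplicates."""
    issues, seen = [], set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((i, "duplicate row"))
        seen.add(key)
        for name, meta in dictionary.items():
            value = row.get(name)
            if value is None:
                issues.append((i, f"missing {name}"))
            elif "allowed" in meta and value not in meta["allowed"]:
                issues.append((i, f"invalid {name}: {value}"))
            elif "range" in meta and not meta["range"][0] <= value <= meta["range"][1]:
                issues.append((i, f"out-of-range {name}: {value}"))
    return issues

print(quality_report(rows, data_dictionary))  # empty list: the table is clean
```

Running the same report on a row with `EntranceScore` of 150 or an unknown `Programme` would flag it, which is the point of recording allowed values in the dictionary up front.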


6. Types of Data

Classifying variables correctly determines which summaries and models are appropriate.

Categorical

  • Nominal: unordered labels (blood group).
  • Ordinal: ordered labels without fixed spacing (survey ratings).

Numerical

  • Discrete: counts such as number of calls handled.
  • Continuous: measurements like temperature or height.

Data structure

  • Cross-sectional: many units at one time (e.g., incomes of all students in 2025).
  • Time-series: one unit across time (e.g., daily rainfall in July).
  • Panel data: multiple units tracked over time.

Type       | Summaries                | Visuals                      | Techniques
---------- | ------------------------ | ---------------------------- | ---------------------------
Nominal    | Counts, proportions      | Bar chart, stacked bar       | Chi-square tests
Ordinal    | Medians, percentiles     | Ordered bar, cumulative plot | Non-parametric tests
Discrete   | Mean, variance           | Dot plot, histogram          | Poisson/Binomial models
Continuous | Mean, standard deviation | Histogram, box plot, density | t-tests, regression, ANOVA
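As a quick illustration of matching summaries to types with Python's standard library (the small datasets are made up):

```python
import statistics
from collections import Counter

blood_group = ["O+", "A+", "O+", "B+", "A+", "O+"]  # nominal
rating = [3, 4, 5, 2, 4, 4, 3, 5]                   # ordinal (1-5 scale)
calls = [12, 9, 15, 11, 8]                          # discrete counts
height_cm = [158.2, 171.5, 165.0, 180.3]            # continuous

print(Counter(blood_group))        # nominal: counts and proportions
print(statistics.median(rating))   # ordinal: median, percentiles
print(statistics.mean(calls))      # discrete: mean, variance
print(statistics.mean(height_cm),  # continuous: mean and
      statistics.stdev(height_cm)) #   standard deviation
```

Averaging the nominal labels would be meaningless, and reporting only a mean for the ordinal ratings would pretend the gaps between categories are equal, which the table above warns against.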


7. Measurement Scales & Recap

Measurement scales indicate which mathematical operations make sense.

Scale    | Order | Differences      | Ratios | Examples
-------- | ----- | ---------------- | ------ | ------------------------------------
Nominal  | No    | No               | No     | Blood group, browser type
Ordinal  | Yes   | No fixed spacing | No     | Customer satisfaction, ranks
Interval | Yes   | Yes              | No     | Temperature in deg C, calendar years
Ratio    | Yes   | Yes              | Yes    | Height, weight, income, duration
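A short numeric check of why ratios fail on an interval scale: converting the same two temperatures from Celsius to Fahrenheit changes their ratio, but preserves their difference (up to the scale factor), because Celsius has no true zero.

```python
# The same two temperatures on two interval scales
c1, c2 = 10.0, 20.0
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32  # 50.0 and 68.0 deg F

print(c2 / c1)            # 2.0  -> "twice as hot"?
print(f2 / f1)            # 1.36 -> the "ratio" depends on the unit

print((c2 - c1) * 9 / 5)  # 18.0
print(f2 - f1)            # 18.0 -> differences survive the conversion
```

For a ratio-scale quantity such as height, 2 m really is twice 1 m in any unit, which is why ratios are listed as meaningful only in the last row of the table.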

Foundations

Statistics merges descriptive storytelling with inferential decision making.

Sampling

Representative samples prevent bias and support inference.

Data management

Document variables, maintain tidy tables, and manage quality.

Classification

Data types and measurement scales guide the choice of statistical tools.
