1. Introduction to Statistics
Statistics is the discipline of learning from data. The IITM definition emphasises the full cycle: design a study, collect observations, summarise the evidence, analyse patterns, interpret uncertainty, and communicate conclusions responsibly.
Why statistics matters for data science
- EDA: descriptive summaries reveal anomalies before modelling.
- Modelling: probability distributions describe noise and help validate assumptions.
- Evaluation: statistical tests compare alternative models.
- Communication: intervals and risk statements show stakeholders how much uncertainty surrounds a result before they act on it.
Statistics began as state arithmetic: census counts and agricultural ledgers.
Fisher, Pearson, and Gosset later formalised experimental design and hypothesis tests.
Computers then made complex models and simulation practical.
Today, statistics anchors machine learning workflows with rigorous uncertainty estimation.
Descriptive statistics
Describe the sample you collected: tables, graphs, averages, spreads.
Example: summarising daily COVID-19 cases in Chennai during July.
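A descriptive summary of this kind can be sketched with Python's standard `statistics` module. The counts in `daily_cases` are made-up illustrative values, not real Chennai data, and the variable names are assumptions chosen for this example.

```python
import statistics

# Hypothetical daily case counts for one week (illustrative values only).
daily_cases = [120, 135, 128, 150, 142, 138, 160]

# Centre and spread of the sample we actually collected.
summary = {
    "mean": statistics.mean(daily_cases),
    "median": statistics.median(daily_cases),
    "min": min(daily_cases),
    "max": max(daily_cases),
    "stdev": statistics.stdev(daily_cases),  # sample standard deviation
}

print(f"days observed: {len(daily_cases)}")
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Nothing here goes beyond the observed week; the numbers describe the sample and make no claim about other days.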
Inferential statistics
Generalise beyond the sample using probability theory.
Example: estimating true customer satisfaction from a survey sample.
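As a rough illustration of inference, the sketch below computes a normal-approximation (Wald) confidence interval for a proportion. The survey figures (312 satisfied out of 400) are hypothetical, the function name `proportion_ci` is an assumption, and the approximation is only reasonable for a random sample that is not too small.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Point estimate and approximate 95% interval for a population proportion."""
    p_hat = successes / n                       # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se

# Hypothetical survey: 312 of 400 sampled customers report being satisfied.
p_hat, low, high = proportion_ci(successes=312, n=400)
print(f"estimate: {p_hat:.3f}, interval: ({low:.3f}, {high:.3f})")
```

The interval is a statement about the wider customer population, which is what separates this step from a purely descriptive summary.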
2. Populations and Samples
Every analysis must specify the population (all units of interest) and the sample (the units observed). A sample supports valid inference only when it mirrors key characteristics of the population.
Population examples
- All registered voters in Tamil Nadu.
- Every device manufactured on a production line this week.
- All transactions processed by a fintech app yesterday.
Sample examples
- 1200 randomly selected voters interviewed by phone.
- 50 devices chosen for destructive testing.
- 10% of transactions flagged for manual audit.
Watch out for bias
- Undercoverage: some groups were never sampled.
- Self-selection: only vocal participants respond.
- Non-response: invited units decline to participate.
Simple random sampling
Every unit has an equal chance of selection.
Systematic sampling
Select every kth unit after a random start.
Stratified sampling
Divide the population into homogeneous strata and sample within each stratum.
Cluster sampling
Select clusters (schools, districts) at random and survey all units inside each selected cluster.
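The four designs above can be sketched with the standard `random` module. The sampling frame below is hypothetical (100 units with made-up `stratum` and `cluster` labels), and the function names are assumptions for illustration rather than an established API.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 units, each tagged with a stratum and a cluster.
population = [
    {"id": i, "stratum": "urban" if i % 4 else "rural", "cluster": i // 10}
    for i in range(100)
]

def simple_random(units, n):
    """Draw n units so that every unit is equally likely to be chosen."""
    return rng.sample(units, n)

def systematic(units, k):
    """Take every kth unit after a random start in the first k positions."""
    start = rng.randrange(k)
    return units[start::k]

def stratified(units, key, n_per_stratum):
    """Group units by stratum, then draw a simple random sample in each group."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster(units, key, n_clusters):
    """Pick whole clusters at random and keep every unit inside them."""
    cluster_ids = sorted({unit[key] for unit in units})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [unit for unit in units if unit[key] in chosen]

print(len(simple_random(population, 10)))
print(len(systematic(population, 10)))
print(len(stratified(population, "stratum", 5)))
print(len(cluster(population, "cluster", 2)))
```

In this sketch the stratified design draws the same number of units from each stratum; proportional allocation is a common alternative when strata differ in size.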
3. Purpose of Statistical Analysis
Statistics serves two complementary goals: describing the sample and inferring conclusions about the population. Both steps appear in every data science workflow.
Descriptive vs inferential
- Descriptive: "What does the collected data show?"
- Inferential: "What does this imply about the wider group?"
- Profile the sample to understand context and detect errors.
- Formulate the inferential question using statistical language.
- Select models and tests compatible with the data type.
- Communicate results with uncertainty and limitations.
4. Data Foundations
Data are recorded facts gathered to answer a question. Recording context (who, what, when, how) is as important as the values themselves.
Motivations for data collection
- Monitor performance of processes or policies.
- Explore behaviour patterns and clusters.
- Ensure regulatory compliance.
- Build predictive models for forecasting.
Source types
Primary data are collected firsthand via surveys, experiments, or observations. Secondary data are published by governments, researchers, or open portals.
| Source | Advantages | Challenges |
|---|---|---|
| Primary | High control, custom definitions. | Expensive, slower to gather. |
| Secondary | Immediate access, often large coverage. | Must adapt to someone else’s definitions. |
Structured data
Tabular format with rows and columns; easy to query and analyse.
Unstructured data
Text, audio, video; requires preprocessing or feature extraction.
5. Organising Data
Well-organised data accelerate analysis and reduce errors. Adopt a tidy format: each row is one case, each column one variable.
| Case | Student | Programme | EntranceScore | HostelResident |
|---|---|---|---|---|
| 1 | Anjali | BSc DS | 88 | Yes |
| 2 | Karim | BSc DS | 73 | No |
| 3 | Meera | Diploma | 91 | Yes |
Data dictionary essentials
- Variable name and plain-language description.
- Data type and measurement units.
- Allowed values or coding schemes.
- Notes on missing value conventions.
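One way to make a data dictionary machine-readable is a plain Python mapping, sketched below for two of the student-table variables. The descriptions, the 0 to 100 score range, and the missing-value codes are assumptions invented for this example; the source does not specify them.

```python
# Hypothetical data dictionary: every entry records the four essentials above.
data_dictionary = {
    "EntranceScore": {
        "description": "Score obtained in the entrance examination",
        "type": int,
        "units": "points",
        "allowed": range(0, 101),   # assumed 0-100 scale
        "missing": None,            # assumed convention: None means not recorded
    },
    "HostelResident": {
        "description": "Whether the student lives in a hostel",
        "type": str,
        "units": None,
        "allowed": {"Yes", "No"},
        "missing": "",              # assumed convention: empty string
    },
}

def conforms(variable, value):
    """Return True if value is the missing code or a valid entry for variable."""
    spec = data_dictionary[variable]
    if value == spec["missing"]:
        return True
    return isinstance(value, spec["type"]) and value in spec["allowed"]

print(conforms("EntranceScore", 88))
print(conforms("EntranceScore", 140))
print(conforms("HostelResident", "Maybe"))
```

Keeping the dictionary next to the data lets the same definitions drive both documentation and validation.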
Quality checklist
- Flag missing values and record reasons.
- Ensure consistent units (all temperatures in °C, not mixed with °F).
- Identify out-of-range entries and duplicates.
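The checklist can be turned into a small automated report, sketched below. The first three records come from the student table in this section; cases 4 and 5 and the repeated case 3 are hypothetical rows added to trigger each check, and the 0 to 100 valid range is an assumption.

```python
records = [
    {"Case": 1, "Student": "Anjali", "Programme": "BSc DS", "EntranceScore": 88, "HostelResident": "Yes"},
    {"Case": 2, "Student": "Karim", "Programme": "BSc DS", "EntranceScore": 73, "HostelResident": "No"},
    {"Case": 3, "Student": "Meera", "Programme": "Diploma", "EntranceScore": 91, "HostelResident": "Yes"},
    # Hypothetical problem rows, added only to exercise the checks.
    {"Case": 4, "Student": "Ravi", "Programme": "BSc DS", "EntranceScore": None, "HostelResident": "No"},
    {"Case": 5, "Student": "Lata", "Programme": "Diploma", "EntranceScore": 140, "HostelResident": "Yes"},
    {"Case": 3, "Student": "Meera", "Programme": "Diploma", "EntranceScore": 91, "HostelResident": "Yes"},
]

def quality_report(rows, score_range=(0, 100)):
    """List the case numbers with missing, out-of-range, or duplicated entries."""
    low, high = score_range
    missing = [r["Case"] for r in rows if r["EntranceScore"] is None]
    out_of_range = [
        r["Case"] for r in rows
        if r["EntranceScore"] is not None and not low <= r["EntranceScore"] <= high
    ]
    seen, duplicates = set(), []
    for r in rows:
        if r["Case"] in seen:
            duplicates.append(r["Case"])  # case number already encountered
        seen.add(r["Case"])
    return {"missing": missing, "out_of_range": out_of_range, "duplicates": duplicates}

report = quality_report(records)
print(report)
```

The report only flags problems; deciding whether to correct, impute, or drop a flagged row still needs the recorded reason and domain judgement.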
6. Types of Data
Classifying variables correctly determines which summaries and models are appropriate.
Categorical
- Nominal: unordered labels (blood group).
- Ordinal: ordered labels without fixed spacing (survey ratings).
Numerical
- Discrete: counts such as number of calls handled.
- Continuous: measurements like temperature or height.
Data structure
- Cross-sectional: many units at one time (e.g., incomes of all students in 2025).
- Time-series: one unit across time (e.g., daily rainfall in July).
- Panel data: multiple units tracked over time.
| Type | Summaries | Visuals | Techniques |
|---|---|---|---|
| Nominal | Counts, proportions | Bar chart, stacked bar | Chi-square tests |
| Ordinal | Medians, percentiles | Ordered bar, cumulative plot | Non-parametric tests |
| Discrete | Mean, variance | Dot plot, histogram | Poisson/Binomial models |
| Continuous | Mean, standard deviation | Histogram, box plot, density | t-tests, regression, ANOVA |
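The "Summaries" column of the table can be sketched as a dispatch on variable type. The function name `summarise`, the `kind` labels, and the sample values are assumptions for illustration; `statistics.median_low` is used for ordinal data so that the result is always one of the observed categories.

```python
import statistics
from collections import Counter

def summarise(values, kind):
    """Return a summary that is meaningful for the given variable type."""
    if kind == "nominal":
        counts = Counter(values)
        return {label: count / len(values) for label, count in counts.items()}
    if kind == "ordinal":
        # Values are assumed to be ordered codes, e.g. ratings 1 (low) to 5 (high).
        return {"median": statistics.median_low(values)}
    if kind == "discrete":
        return {"mean": statistics.mean(values), "variance": statistics.variance(values)}
    if kind == "continuous":
        return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}
    raise ValueError(f"unknown variable type: {kind}")

print(summarise(["A", "B", "A", "O"], "nominal"))
print(summarise([1, 2, 2, 4, 5], "ordinal"))
print(summarise([2, 4, 6], "discrete"))
print(summarise([36.5, 37.0, 38.2], "continuous"))
```

Computing a mean of nominal labels would raise an error or produce a meaningless number, which is why the type must be settled before the summary is chosen.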
7. Measurement Scales & Recap
Measurement scales indicate which mathematical operations make sense.
| Scale | Order | Differences | Ratios | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Blood group, browser type |
| Ordinal | Yes | No fixed spacing | No | Customer satisfaction, ranks |
| Interval | Yes | Yes | No | Temperature in °C, calendar years |
| Ratio | Yes | Yes | Yes | Height, weight, income, duration |
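A short worked example shows why differences are meaningful on an interval scale but ratios are not. The two temperatures are arbitrary illustrative values; Kelvin is used as the ratio-scale counterpart because its zero is a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert from the interval scale (°C) to the ratio scale (K)."""
    return celsius + 273.15

a_c, b_c = 10.0, 20.0                      # two temperatures in °C
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales, so "10 degrees warmer" is meaningful.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios disagree: the °C ratio depends on an arbitrary zero point.
ratio_c = b_c / a_c                        # 2.0, but 20 °C is not "twice as hot"
ratio_k = b_k / a_k                        # about 1.035 on the ratio scale

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio: {ratio_c:.3f} (°C) vs {ratio_k:.3f} (K)")
```

The same reasoning explains why ratios of heights, weights, incomes, or durations are valid: those scales already start from a true zero.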
Foundations
Statistics merges descriptive storytelling with inferential decision making.
Sampling
Representative samples prevent bias and support inference.
Data management
Document variables, maintain tidy tables, and manage quality.
Classification
Data types and measurement scales guide the choice of statistical tools.