Topic 2: Populations and Samples | Statistics for Data Science

🎯 Learning Objectives

Define population, census, and sample in precise terms.
Explain what makes a sample representative.
Compare probability and non-probability sampling methods.
Recognise sampling bias and understand how to mitigate it.

2.1 Key definitions

Population

The complete set of units we wish to describe or understand. Examples include all registered voters in Tamil Nadu, every device manufactured on a production line, or every transaction processed by an app last month.

Sample

A subset selected from the population for practical analysis. We rely on the sample because observing the entire population is usually impossible or too expensive.

Census

Collecting data from every unit of the population is called a census. A full census is rare: it is practical for small populations, or reserved for critical counts such as the national censuses that many countries hold once a decade.

2.2 Representativeness matters

A sample supports valid inference only when it mirrors key characteristics of the population. Three warning signs:

Undercoverage: parts of the population were never sampled (e.g., surveying only urban voters).
Self-selection: only motivated respondents provide data (e.g., online product reviews skewed toward extreme opinions).
Non-response: selected units choose not to participate, shifting the sample composition.

Document the sampling frame (the list from which you sample) and, where possible, compare the sample's composition against known population totals.
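One way to make that comparison concrete is to put sample shares next to known population shares. The sketch below is illustrative only: the group names, population shares, and sample counts are invented, not figures from any real survey.

```python
# Hypothetical known population shares (e.g., from an official count).
population_share = {"urban": 0.48, "rural": 0.52}

# Hypothetical number of respondents per group in our sample.
sample_counts = {"urban": 310, "rural": 90}

total = sum(sample_counts.values())

# Positive gap: group is over-represented; negative gap: under-represented.
gaps = {
    group: sample_counts[group] / total - population_share[group]
    for group in population_share
}

for group, gap in gaps.items():
    print(f"{group}: sample share differs from population share by {gap:+.1%}")
```

A large gap for any group is a hint of undercoverage or non-response and is worth investigating before drawing conclusions.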

2.3 Sampling methods

Simple random sample

Every unit has an equal chance of selection; more precisely, every possible sample of the chosen size is equally likely. Implement using a random number generator or a lottery method.

Example: using Python's random.sample to pick 200 student IDs from a list of 10,000.
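The student-ID example might look like the following minimal sketch. The student_ids list is a made-up stand-in for a real sampling frame, and the seed is fixed only so that the draw can be repeated.

```python
import random

# Hypothetical sampling frame: 10,000 student IDs (illustrative only).
student_ids = [f"S{i:05d}" for i in range(1, 10_001)]

# A seeded generator makes the draw reproducible; drop the seed in practice
# if reproducibility is not needed.
rng = random.Random(42)

# random.sample draws without replacement, so no ID can appear twice.
srs = rng.sample(student_ids, k=200)

print(len(srs), "IDs drawn,", len(set(srs)), "of them distinct")
```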

Systematic sample

Select every kth unit after a random start within the first k units. Works well on ordered lists, provided the list has no periodic pattern that lines up with k.

Example: inspecting every 50th widget leaving an assembly line.
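A possible sketch of the widget example, assuming the units are available as an ordered Python list. The function name systematic_sample and the serial numbers are hypothetical.

```python
import random

def systematic_sample(units, k, rng=None):
    """Return every k-th unit, starting at a random position among the first k."""
    rng = rng or random.Random()
    start = rng.randrange(k)   # random start in 0 .. k-1
    return units[start::k]     # then every k-th unit after it

# Hypothetical serial numbers of 1,000 widgets in production order.
widgets = list(range(1, 1001))

picked = systematic_sample(widgets, k=50, rng=random.Random(7))
print(len(picked), "widgets selected for inspection")
```

Only the starting point is random here; once it is fixed, the rest of the sample is determined, which is why hidden periodicity in the list can bias the result.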

Stratified sample

Divide the population into non-overlapping, internally homogeneous groups (strata) and draw a random sample within each group. Ensures representation across key categories.

Example: sampling students by programme (BSc, BS, Diplomas) to compare performance.
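The programme example could be sketched as below. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules; the roster, its sizes, and the helper stratified_sample are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = {}
    for name, members in strata.items():
        n = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample[name] = rng.sample(members, n)       # simple random sample within the stratum
    return sample

# Hypothetical roster of (programme, student number) pairs.
students = (
    [("BSc", i) for i in range(600)]
    + [("BS", i) for i in range(300)]
    + [("Diploma", i) for i in range(100)]
)

by_programme = stratified_sample(students, lambda s: s[0], 0.10, random.Random(1))
sizes = {name: len(rows) for name, rows in by_programme.items()}
print(sizes)
```

Because each programme is sampled separately, even the smallest stratum is guaranteed to appear in the sample.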

Cluster sample

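Randomly select clusters (e.g., classrooms, districts) and survey every unit inside the selected clusters. Reduces travel cost, usually at the price of some precision, because units within the same cluster tend to resemble one another.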

Example: randomly select five schools and interview all teachers in those schools.
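A minimal sketch of the school example, assuming the clusters are held in a dictionary that maps each school to its teachers. The school names, teacher names, and staff sizes are all made up.

```python
import random

# Hypothetical population: 20 schools, each with a handful of teachers.
schools = {
    f"school_{s:02d}": [f"teacher_{s:02d}_{t}" for t in range(5 + s % 4)]
    for s in range(20)
}

rng = random.Random(3)

# Stage 1: randomly select whole clusters (schools), not individual teachers.
chosen_schools = rng.sample(sorted(schools), k=5)

# Stage 2: take every unit (teacher) inside each selected cluster.
interviewees = [t for school in chosen_schools for t in schools[school]]

print(len(chosen_schools), "schools,", len(interviewees), "teachers to interview")
```

Note that the number of teachers interviewed is not fixed in advance; it depends on which schools happen to be drawn.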

2.4 Sample size considerations

Margin of error

Larger samples reduce sampling variability, shrinking the margin of error. However, the margin of error falls roughly with the square root of the sample size, so doubling the sample does not halve the error; halving it takes about four times as many units.
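The diminishing returns can be seen with a small calculation. This sketch assumes a simple random sample and uses the usual normal-approximation margin of error for a proportion, z * sqrt(p(1 - p) / n); the function name and the default values p = 0.5 and 95% confidence are illustrative choices, not part of the original notes.

```python
import math
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 200, 400, 1600):
    print(f"n = {n:5d}  margin of error = {margin_of_error(n):.4f}")
```

Going from 100 to 200 units shrinks the margin by a factor of about 0.71, while halving it requires going from 100 to 400.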

Practical constraints

Budget, time, and data collection logistics often limit the sample size. Balance these with the need for precision.

Finite population correction

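When sampling without replacement and the sample is a sizeable fraction of the population (a common rule of thumb is more than about 5%), adjust the variance estimates using the finite population correction factor. What matters is the sampling fraction rather than the absolute size of the population.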
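As a rough illustration, the correction commonly used for a simple random sample without replacement multiplies the standard error by sqrt((N - n) / (N - 1)), where N is the population size and n the sample size; the variance is multiplied by the square of that factor. The helper name fpc below is hypothetical.

```python
import math

def fpc(N, n):
    """Finite population correction factor for the standard error.

    Square the result to obtain the factor that applies to the variance.
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if N == 1:
        return 0.0  # the single unit was observed, so no sampling error remains
    return math.sqrt((N - n) / (N - 1))

# Same sample size, very different sampling fractions.
for N in (500, 1_000, 1_000_000):
    print(f"N = {N:9,d}  n = 100  FPC = {fpc(N, 100):.4f}")
```

The factor is close to 1 when the sample is a tiny share of the population and drops to 0 when the sample is the whole population, that is, a census.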

2.5 Quick self-check

Have you written a one-line description of the population and the sample?
What units are in the sampling frame? Who might be missing?
Which sampling method is feasible for your study?
How large should the sample be to detect meaningful differences?
