Define populations and samples clearly, preserve representativeness, and choose sampling strategies that minimise bias before any inference begins.
A population is the complete set of units we wish to describe or understand. Examples include all registered voters in Tamil Nadu, every device manufactured on a production line, or every transaction processed by an app last month.
A sample is a subset selected from the population for practical analysis. We rely on the sample because observing the entire population is usually impossible or too expensive.
Census
When we collect data from every unit of the population, we perform a census. This is rare and reserved for small populations or critical counts, such as national censuses held every decade.
A sample supports valid inference only when it mirrors key characteristics of the population. Three warning signs:
Document the sampling frame (the list from which you sample) and compare known population totals where possible.
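One way to compare a sample against known population totals is to look at the gap between each group's share in the sample and its known share in the population. The sketch below is illustrative only: the group names, the shares, and the helper share_gaps are hypothetical and not taken from any real dataset.

```python
from collections import Counter

def share_gaps(sample_labels, population_shares):
    """Return sample share minus known population share for each group."""
    counts = Counter(sample_labels)
    n = len(sample_labels)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }

# Hypothetical known population shares and a hypothetical sample of 200 units.
population_shares = {"urban": 0.48, "rural": 0.52}
sample_labels = ["urban"] * 130 + ["rural"] * 70

gaps = share_gaps(sample_labels, population_shares)
for group, gap in gaps.items():
    # A large positive gap means the group is over-represented in the sample.
    print(f"{group}: {gap:+.2f}")
```

Large gaps on a characteristic that matters for the analysis suggest the sampling frame or the selection step needs a second look.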
In simple random sampling, every unit has an equal chance of selection. Implement it using a random number generator or a lottery method.
Example: using Python's random.sample to pick 200 student IDs from a list of 10,000.
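The draw described in this example could look like the minimal sketch below. The ID format and the seed value are assumptions made for illustration; only random.sample itself comes from the text.

```python
import random

# Hypothetical sampling frame: 10,000 student IDs (illustrative values only).
student_ids = [f"S{i:05d}" for i in range(1, 10_001)]

# A dedicated, seeded generator makes the draw reproducible for auditing.
rng = random.Random(42)

# sample() draws without replacement, so no ID can be picked twice.
sample_ids = rng.sample(student_ids, k=200)

print(len(sample_ids), len(set(sample_ids)))
```

Recording the seed alongside the sampling frame lets someone else reproduce exactly the same 200 IDs later.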
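In systematic sampling, select every kth unit after a random start. This works well on ordered lists without periodic patterns, because a cycle in the list that lines up with k would bias the sample.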
Example: inspecting every 50th widget leaving an assembly line.
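A minimal sketch of this selection rule follows, assuming the widgets are available as an ordered list. The function name systematic_sample and the serial numbers are hypothetical.

```python
import random

def systematic_sample(units, k, rng=None):
    """Return every k-th unit, starting from a random position in [0, k)."""
    rng = rng or random.Random()
    start = rng.randrange(k)  # random start within the first interval
    return units[start::k]

# Hypothetical serial numbers 1..1000 in production order.
widgets = list(range(1, 1001))
inspected = systematic_sample(widgets, k=50, rng=random.Random(7))

print(len(inspected), inspected[:3])
```

With 1,000 units and k = 50, any random start yields exactly 20 inspected widgets spaced 50 apart.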
In stratified sampling, divide the population into internally homogeneous groups (strata) and sample within each group. This ensures representation across key categories.
Example: sampling students by programme (BSc, BS, Diplomas) to compare performance.
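The sketch below shows one possible version of this idea, using proportional allocation (the same sampling fraction in every stratum). Proportional allocation is an assumption here, since the text does not say how many students to draw per programme, and the enrolment figures and the helper stratified_sample are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction, without replacement, from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        # Keep at least one unit per stratum so no category is left out.
        n = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical enrolment: (student_id, programme) pairs.
students = (
    [(f"A{i}", "BSc") for i in range(600)]
    + [(f"B{i}", "BS") for i in range(300)]
    + [(f"C{i}", "Diploma") for i in range(100)]
)

picked = stratified_sample(students, lambda s: s[1], 0.10, random.Random(1))
print({name: len(rows) for name, rows in picked.items()})
```

If the smallest programme needs more precision than a 10% draw gives, a fixed number per stratum is a common alternative, at the cost of needing weights when the strata are combined.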
In cluster sampling, randomly select clusters (e.g., classrooms, districts) and survey every unit inside the selected clusters. This reduces travel cost.
Example: randomly select five schools and interview all teachers in those schools.
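As a rough sketch of the school example, the code below draws five clusters and then takes every unit inside them. The school names, teacher lists, and the total of 20 schools are made up for illustration.

```python
import random

# Hypothetical frame: 20 schools, each with its own list of teachers.
schools = {
    f"school_{s:02d}": [f"teacher_{s:02d}_{t}" for t in range(5 + s % 4)]
    for s in range(20)
}

rng = random.Random(3)

# Stage 1: randomly select five clusters (schools) without replacement.
chosen_schools = rng.sample(sorted(schools), k=5)

# Stage 2: take every teacher inside the chosen schools, with no further sampling.
interviewees = [t for name in chosen_schools for t in schools[name]]

print(chosen_schools, len(interviewees))
```

Note that the number of interviewees is not fixed in advance: it depends on how large the selected schools happen to be.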
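Larger samples reduce sampling variability and shrink the margin of error. However, the margin of error falls roughly with the square root of the sample size, so doubling the sample cuts the error by only about 29%, and halving it takes about four times as many units. Watch for these diminishing returns.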
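To make the diminishing returns concrete, the sketch below computes the approximate 95% margin of error for a proportion under simple random sampling. The formula z * sqrt(p(1 - p) / n), the worst-case value p = 0.5, and z = 1.96 are standard textbook choices assumed here, not values given in the text.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Each step doubles the sample size; the margin shrinks by a factor of 1/sqrt(2).
for n in (250, 500, 1000, 2000):
    print(n, round(margin_of_error(n), 4))

# Ratio of margins when the sample size is doubled and quadrupled.
ratio_doubled = margin_of_error(1000) / margin_of_error(500)
ratio_quadrupled = margin_of_error(2000) / margin_of_error(500)
```

The doubled ratio is about 0.71 and the quadrupled ratio is 0.5, which is why precision targets get expensive quickly.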
Budget, time, and data collection logistics often limit the sample size. Balance these with the need for precision.
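When sampling without replacement from a small population (say fewer than 10,000 units), adjust the variance estimates using the finite population correction factor, commonly written as (N - n) / (N - 1), where N is the population size and n is the sample size. The adjustment matters most when the sample is a sizeable fraction of the population; a common rule of thumb is more than about 5%.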
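The effect of the correction can be sketched as follows, assuming the common textbook form in which the standard error of a mean is multiplied by sqrt((N - n) / (N - 1)). The function names and the numbers are hypothetical, and some texts use sqrt(1 - n/N) instead when the standard deviation is estimated from the sample.

```python
import math

def fpc(N, n):
    """Finite population correction factor for a standard error."""
    return math.sqrt((N - n) / (N - 1))

def standard_error_mean(sigma, n, N=None):
    """Standard error of a sample mean, corrected when the population size N is given."""
    se = sigma / math.sqrt(n)
    return se if N is None else se * fpc(N, n)

# Hypothetical numbers: standard deviation 10, sample of 100.
uncorrected = standard_error_mean(10, 100)
small_pop = standard_error_mean(10, 100, N=1_000)      # sample is 10% of the population
large_pop = standard_error_mean(10, 100, N=1_000_000)  # sample is 0.01% of the population

print(round(uncorrected, 4), round(small_pop, 4), round(large_pop, 4))
```

For the large population the factor is almost exactly 1, so the correction can be ignored; for the small one it noticeably tightens the estimate.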