Build tidy tables, maintain a clear data dictionary, and run essential quality checks before modelling.
Each row corresponds to a single unit of analysis: a student, a household, a machine reading. Keep one row per unit to maintain tidy data.
Columns record characteristics of the unit: exam score, income, temperature. Use consistent types and units.
Case | Student | Programme | EntranceScore | HostelResident ---- | ------- | --------- | ------------- | -------------- 1 | Anjali | BSc DS | 88 | Yes 2 | Karim | BSc DS | 73 | No 3 | Meera | Diploma | 91 | Yes
A data dictionary documents every variable so that future analysts understand its meaning.
| Variable | Description | Type | Valid values / units | Notes |
|---|---|---|---|---|
| Student | Full name of learner | Categorical (text) | String | Capitalise first letter |
| Programme | Academic track | Categorical (nominal) | {"BSc DS", "BS DS", "Diploma"} | Use standard abbreviations |
| EntranceScore | Score in entrance exam | Numerical (ratio) | 0–100 | Two decimal places allowed |
| HostelResident | Whether student lives in hostel | Categorical (binary) | {"Yes", "No"} | Map to 1/0 if model requires numeric |
Organise categories consistently to avoid typos and analysis errors. Strategies include: