📊 Topic 5: Organising Data

Topic 5/7 ⭐⭐ Applied ⏱️ ~20 min

Build tidy tables, maintain a clear data dictionary, and run essential quality checks before modelling.

🎯 Learning Objectives

  • Differentiate between cases (rows) and variables (columns).
  • Create a simple data dictionary.
  • Identify and flag data quality issues (missing values, inconsistent units).
  • Organise categorical information using proper codes.

5.1 Anatomy of a data table

Cases (observations)

Each row corresponds to a single unit of analysis: a student, a household, a machine reading. Keep one row per unit to maintain tidy data.

Variables (attributes)

Columns record characteristics of the unit: exam score, income, temperature. Keep a single data type and a single unit of measurement within each column.

Case | Student | Programme | EntranceScore | HostelResident
---- | ------- | --------- | ------------- | --------------
1    | Anjali  | BSc DS    | 88            | Yes
2    | Karim   | BSc DS    | 73            | No
3    | Meera   | Diploma   | 91            | Yes
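
As a rough sketch, the table above can be held in Python as a list of dictionaries, one per case, with the keys acting as variables. The names `rows`, `same_columns` and `one_row_per_case` are illustrative choices, and using `Case` as the unique identifier is an assumption based on the table.

```python
# Each dictionary is one case (row); each key is one variable (column).
rows = [
    {"Case": 1, "Student": "Anjali", "Programme": "BSc DS", "EntranceScore": 88, "HostelResident": "Yes"},
    {"Case": 2, "Student": "Karim", "Programme": "BSc DS", "EntranceScore": 73, "HostelResident": "No"},
    {"Case": 3, "Student": "Meera", "Programme": "Diploma", "EntranceScore": 91, "HostelResident": "Yes"},
]

# Variables are the shared keys; every row should carry the same set.
variables = list(rows[0].keys())
same_columns = all(list(r.keys()) == variables for r in rows)

# One row per unit: no Case identifier should appear twice (assumed unique).
case_ids = [r["Case"] for r in rows]
one_row_per_case = len(case_ids) == len(set(case_ids))

print(len(rows), "cases,", len(variables), "variables")
print("same columns:", same_columns, "| one row per case:", one_row_per_case)
```

If a student appeared in two rows (say, one per exam attempt), the unit of analysis would no longer be the student, and the second check would flag it.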

5.2 Building a data dictionary

A data dictionary documents every variable so that future analysts understand its meaning.

Variable       | Description                      | Type                  | Valid values / units           | Notes
-------------- | -------------------------------- | --------------------- | ------------------------------ | -----
Student        | Full name of learner             | Categorical (text)    | String                         | Capitalise first letter
Programme      | Academic track                   | Categorical (nominal) | {"BSc DS", "BS DS", "Diploma"} | Use standard abbreviations
EntranceScore  | Score in entrance exam           | Numerical (ratio)     | 0–100                          | Two decimal places allowed
HostelResident | Whether student lives in hostel  | Categorical (binary)  | {"Yes", "No"}                  | Map to 1/0 if model requires numeric
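
One way to make a data dictionary do some work is to store it in a machine-readable form and check rows against it. The sketch below is a minimal illustration, not a prescribed format: the `DATA_DICTIONARY` structure, its `type` and `valid` fields, and the `validate_row` helper are all hypothetical names introduced here.

```python
# Hypothetical encoding of the data dictionary above. A set lists the
# allowed categories; a (low, high) tuple gives an inclusive numeric range.
DATA_DICTIONARY = {
    "Student": {"type": str, "valid": None},
    "Programme": {"type": str, "valid": {"BSc DS", "BS DS", "Diploma"}},
    "EntranceScore": {"type": (int, float), "valid": (0, 100)},
    "HostelResident": {"type": str, "valid": {"Yes", "No"}},
}


def validate_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in row:
            problems.append(f"{name}: missing variable")
            continue
        value = row[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: unexpected type {type(value).__name__}")
            continue
        valid = spec["valid"]
        if isinstance(valid, set) and value not in valid:
            problems.append(f"{name}: {value!r} is not an allowed value")
        elif isinstance(valid, tuple) and not (valid[0] <= value <= valid[1]):
            problems.append(f"{name}: {value} is outside {valid[0]}-{valid[1]}")
    return problems


good = {"Student": "Anjali", "Programme": "BSc DS", "EntranceScore": 88, "HostelResident": "Yes"}
bad = {"Student": "Karim", "Programme": "B.Sc DS", "EntranceScore": 173, "HostelResident": "No"}

print(validate_row(good))  # no problems expected
print(validate_row(bad))   # flags Programme and EntranceScore
```

Keeping the allowed values in one place means the documentation and the validation cannot drift apart silently.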

5.3 Data quality checks

  • Missing values: track how many values are missing and why (not recorded vs not applicable).
  • Inconsistent units: ensure measurement units are aligned (all weights in kg, not a mix of kg and pounds).
  • Out-of-range entries: validate against domain limits (a probability must lie between 0 and 1).
  • Duplicate rows: identify duplicates using unique identifiers.
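
The four checks above could be sketched in plain Python as follows. The `records` data, the field names (`id`, `weight`, `unit`, `p_pass`) and the use of `None` for a missing value are assumptions made for illustration. The code only counts missing values; the reason each one is missing still has to be recorded separately.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound

# Hypothetical readings; None marks a value that was not recorded.
records = [
    {"id": "A1", "weight": 70.0, "unit": "kg", "p_pass": 0.82},
    {"id": "A2", "weight": 154.0, "unit": "lb", "p_pass": 1.30},
    {"id": "A3", "weight": None, "unit": "kg", "p_pass": 0.55},
    {"id": "A1", "weight": 70.0, "unit": "kg", "p_pass": 0.82},
]

# Missing values: count them per variable.
missing = {key: sum(r[key] is None for r in records) for key in records[0]}

# Inconsistent units: convert every weight to kilograms.
for r in records:
    if r["weight"] is not None and r["unit"] == "lb":
        r["weight"] = round(r["weight"] * LB_TO_KG, 2)
        r["unit"] = "kg"

# Out-of-range entries: a probability must lie in [0, 1].
out_of_range = [
    r["id"] for r in records
    if r["p_pass"] is not None and not 0 <= r["p_pass"] <= 1
]

# Duplicate rows: report identifiers that have already been seen.
seen, duplicates = set(), []
for r in records:
    if r["id"] in seen:
        duplicates.append(r["id"])
    seen.add(r["id"])

print("missing per variable:", missing)
print("out of range:", out_of_range)
print("duplicate ids:", duplicates)
```

Flagging problems rather than silently deleting rows keeps the decision about what to do with them visible to whoever analyses the data next.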

5.4 Coding categorical variables

Organise categories consistently to avoid typos and analysis errors. Strategies include:

  • Use enumerated codes (1 = Yes, 0 = No) with a lookup table.
  • Store ordered categories with explicit order labels (Poor < Fair < Good < Excellent).
  • Document the meaning of “Other” or “Not Available”.
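
A small sketch of the first two strategies, using only the standard library. The names `YES_NO_CODES`, `RATING_ORDER`, `RATING_RANK` and `encode_yes_no` are hypothetical, and the choice to return `None` for an unrecognised entry (rather than raising an error) is an assumption made so that typos can be counted and reviewed.

```python
# Lookup table for a binary variable: the only place the codes are defined.
YES_NO_CODES = {"Yes": 1, "No": 0}

# Ordered categories: list position encodes Poor < Fair < Good < Excellent.
RATING_ORDER = ["Poor", "Fair", "Good", "Excellent"]
RATING_RANK = {label: rank for rank, label in enumerate(RATING_ORDER)}


def encode_yes_no(value):
    """Map a raw answer to its code; return None if it is not in the lookup."""
    return YES_NO_CODES.get(value.strip().title())


raw_answers = ["Yes", "no ", "Yess"]
codes = [encode_yes_no(v) for v in raw_answers]

ratings = ["Good", "Poor", "Excellent", "Fair"]
ranked = sorted(ratings, key=RATING_RANK.__getitem__)

print(codes)   # the typo "Yess" is not silently coded
print(ranked)  # sorted by the declared order, not alphabetically
```

Sorting by `RATING_RANK` instead of alphabetically is what keeps "Fair" from landing ahead of "Poor" in summaries and charts.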