
📊 Topic 4: Data Foundations

Topic 4/7 ⭐ Beginner ⏱️ ~15 min

Understand what counts as data, why organisations collect it, and how structured versus unstructured formats influence analysis.

🎯 Learning Objectives

  • Describe what counts as data in different contexts.
  • List common reasons organisations invest in data collection.
  • Distinguish between primary and secondary data sources.
  • Explain the difference between structured and unstructured data.

4.1 What are data?

Data are recorded facts, measurements, or observations, usually collected to answer a question. They can take the form of numbers, text, audio, video, or any other digital artefact.

Entities

People, organisations, experiments, and devices all generate data: exam scores, website clicks, lab sensor readings, patient vitals.

Context

Recording why data were collected is as important as the raw values. Context prevents misinterpretation.
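
One way to keep context attached to raw values is to store both in the same record. The sketch below is only an illustration: the `Dataset` class, its field names, and the temperature readings are hypothetical and not part of the course material.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Raw values bundled with the context needed to interpret them."""
    values: list[float]
    unit: str          # what the numbers measure
    purpose: str       # why the data were collected
    collected_by: str  # who or what recorded them


# Hypothetical readings: without the unit and purpose, 38.4 is just a number.
body_temps = Dataset(
    values=[36.6, 37.1, 38.4],
    unit="degrees Celsius",
    purpose="monitor patient vitals on a hospital ward",
    collected_by="bedside thermometer",
)


def describe(ds: Dataset) -> str:
    """Summarise a dataset together with its context."""
    return f"{len(ds.values)} readings in {ds.unit}, collected to {ds.purpose}"


print(describe(body_temps))
```

Keeping the unit and purpose next to the values means anyone reusing the readings later can see what they represent.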

4.2 Why collect data?

  • Monitoring: track performance of a process or policy (e.g., energy usage per campus building).
  • Exploration: uncover patterns or clusters in behaviour (e.g., streaming service watch habits).
  • Regulation: maintain compliance with standards (e.g., pharmaceutical batch testing).
  • Prediction: build models to forecast future demand or risk.

4.3 Data sources

| Source type | Description | Examples | Advantages | Challenges |
| --- | --- | --- | --- | --- |
| Primary (collected) | Data gathered firsthand for a specific purpose. | Surveys, experiments, observations. | High control over quality and definitions. | Expensive and time-consuming. |
| Secondary (published) | Data compiled by someone else and shared. | Government portals, research articles, reports. | Fast access, often large coverage. | Definitions may differ; documentation varies. |

4.4 Structured vs unstructured data

Structured data

Organised into tables with rows and columns. Easy to search and analyse using SQL, spreadsheets, and statistical software.

Example: admissions dataset with columns for student ID, programme, entrance score, and admission status.
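
As a minimal sketch of why structured data are easy to query, the admissions example can be loaded into an in-memory SQLite table using Python's built-in `sqlite3` module. The table name, column names, and all four rows are made up for illustration.

```python
import sqlite3

# Hypothetical rows: (student_id, programme, entrance_score, admission_status).
rows = [
    (101, "Data Science", 82.5, "admitted"),
    (102, "Statistics", 67.0, "rejected"),
    (103, "Data Science", 91.0, "admitted"),
    (104, "Statistics", 74.5, "admitted"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions ("
    "student_id INTEGER PRIMARY KEY, programme TEXT, "
    "entrance_score REAL, admission_status TEXT)"
)
conn.executemany("INSERT INTO admissions VALUES (?, ?, ?, ?)", rows)

# Every row has the same columns, so filtering and aggregating is a single query.
query = """
    SELECT programme, COUNT(*) AS admitted, AVG(entrance_score) AS mean_score
    FROM admissions
    WHERE admission_status = 'admitted'
    GROUP BY programme
    ORDER BY programme
"""
summary = conn.execute(query).fetchall()
conn.close()

for programme, admitted, mean_score in summary:
    print(programme, admitted, mean_score)
```

The same question (how many students were admitted per programme, and with what average score) would take a pivot table in a spreadsheet or a group-by in statistical software.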

Unstructured data

Free-form content such as text, images, audio, and video. Requires preprocessing or feature extraction before analysis.

Example: support ticket descriptions or customer feedback voice recordings.
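
To illustrate what "preprocessing or feature extraction" can look like for text, the sketch below turns free-form support tickets into word counts. The ticket texts and the small stopword list are invented for this example, and word counting is only one simple choice among many possible features.

```python
import re
from collections import Counter

# Hypothetical support tickets: free-form text with no fixed columns.
tickets = [
    "Cannot log in after password reset. Login page just reloads.",
    "Payment failed twice, card was charged anyway!",
    "Password reset email never arrived.",
]

# A tiny, hand-picked stopword list (an assumption made for this sketch).
STOPWORDS = {"a", "after", "the", "was", "in", "just", "never"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


# Feature extraction: each ticket becomes a bag of word counts.
features = [Counter(tokenize(t)) for t in tickets]

# Aggregate the counts across all tickets to see recurring terms.
totals = Counter()
for counts in features:
    totals.update(counts)

print(totals.most_common(3))
```

Once the text is reduced to counts, it can be stored in a table and analysed like any other structured data.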

Many modern projects combine both types: an e-commerce platform may have structured transaction tables plus unstructured product reviews.
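
The e-commerce case can be sketched by summarising each data type per product and then joining the results. Everything here is hypothetical: the product IDs, prices, review texts, and the keyword lists. The keyword-based score is a deliberately naive stand-in for real sentiment analysis.

```python
from collections import defaultdict

# Hypothetical structured transactions: (product_id, quantity, unit_price).
transactions = [
    ("P1", 2, 10.0),
    ("P2", 1, 25.0),
    ("P1", 1, 10.0),
]

# Hypothetical unstructured reviews: (product_id, free-form text).
reviews = [
    ("P1", "Great value, works great."),
    ("P2", "Broken on arrival, very poor packaging."),
    ("P1", "Poor battery life."),
]

# Naive keyword lists (assumptions made for this sketch).
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"poor", "broken", "bad"}


def sentiment_score(text: str) -> int:
    """Count positive keywords minus negative keywords in a review."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Structured side: total revenue per product.
revenue = defaultdict(float)
for product_id, quantity, unit_price in transactions:
    revenue[product_id] += quantity * unit_price

# Unstructured side: summed keyword score per product.
sentiment = defaultdict(int)
for product_id, text in reviews:
    sentiment[product_id] += sentiment_score(text)

# Join both views on the shared product ID.
combined = {
    pid: {"revenue": revenue[pid], "sentiment": sentiment[pid]}
    for pid in sorted(revenue)
}
print(combined)
```

The shared product ID is what makes the combination possible: the reviews only become comparable with the transactions after they have been reduced to a number per product.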