BLOG · LEARNING ROADMAP · 2026

From Business Analyst
to Data Analyst.

A six-stage roadmap I wish I had when I started — what to learn, in what order, and why it matters once you are inside an enterprise data team. Tap any topic to mark it done; your progress is saved on this device.

0 of 22 topics marked as learned
Stages
01 Data Basics
02 Analytics
03 Data Platform
04 Data Modeling
05 ETL & ELT
06 Data Governance
BA → Data Analyst
Learning Roadmap
01Data Basicsunderstand the landscape
OLTP vs OLAP
Structured (CSV, tables)
Semi-structured (JSON, Parquet)
Unstructured (docs, logs)
03Data Platformwhere you practice
Snowflake
Databricks
BigQuery
05ETL & ELThow data moves
dbt
Azure Data Factory
Airflow
02Analyticslearn this first
SQL Fundamentals
Window Functions / CTEs
Python (Pandas)
Statistical Thinking
Power BI / Tableau
04Data Modelingdesign skills
Normalized vs Denormalized
Dimensional Modeling
Star Schema
Snowflake Schema
06Data Governancetrust & compliance
MDM
Data Security & Privacy
Data Quality Framework
01 · DATA BASICS

Data Basics

Before you analyze data, you need to understand what shape it comes in and where it lives. These four concepts appear in nearly every data conversation — knowing them helps you ask the right questions from day one.

OLTP vs OLAP
OLTP (e.g., Salesforce) databases handle live operations — each write is a fixed type at a fixed position. OLAP (analytical) databases store historical data optimized for aggregated queries. Always ask which one you're querying before you optimize anything; the two behave very differently.
Structured (CSV, tables)
Structured data lives in rows and columns — spreadsheets, SQL tables, CSV files. Every field has a fixed type and position. The easiest format to search and work with, and the one analysts do most of their analysis work in.
Semi-structured (JSON, Parquet)
Semi-structured: flexible structure with key-value pairs. JSON (APIs, event logs), Parquet (the data-warehouse pivot format). Has fields and values, but the structure can vary between records — you cannot treat it like a CSV until you parse it.
Unstructured (docs, logs)
Raw: free text, logs, images, PDFs. No fixed schema. Requires significant pre-processing — usually NLP or ML — before it becomes usable for analysis.
↑ Back to map
02 · ANALYTICS

Analytics

This is the core toolkit. SQL is the language of data — nothing else matters as much in day-to-day work. You don't need to be a data engineer, but you do need to be able to answer questions yourself without depending on others for every query.

SQL Fundamentals
SELECT, FROM, WHERE, JOIN, GROUP BY, row aggregates — the core of analytics work. You cannot do this without it. Know it well and don't rely on someone else for every query.
Window Functions / CTEs
Window functions calculate rankings, running totals, and moving averages without collapsing your data. CTEs (WITH clause) break complex queries into readable, sequential steps. Together they replace most subquery spaghetti.
Python (Pandas)
Python specifically for loading and reshaping data. Not for ML pipelines. Pandas puts you one read_csv() away from a full DataFrame so you can quickly inspect it, then pass you to a much stronger position alongside familiar BI tooling.
Statistical Thinking
Mean, median, distribution, correlation. SQL tells you what the data says; statistics helps you understand whether that answer is meaningful. Understanding variation prevents you from drawing too much insight from noise.
Power BI / Tableau
Knowing how to build a dashboard and tell the underlying data story. The goal is decision-making findings, not just charts. Pick one tool and go deep — breadth here matters less than depth.
↑ Back to map
03 · DATA PLATFORM

Data Platform

The cloud warehouses are where you will spend most of your time querying and building. You don't need to administer them — but knowing their architecture helps you write better queries and ask smarter questions of the engineers around you.

Snowflake
A cloud data warehouse purpose-built for analytics. Separates storage and compute so teams can scale independently. The most common place BAs find themselves querying production data in enterprise environments.
Databricks
A Lakehouse platform built on Apache Spark, purpose-built for data engineering and ML pipelines. Increasingly the engine behind enterprise data transformations — if your company has a large data science team, you will encounter it.
BigQuery
Google's serverless analytics warehouse. Pay-per-query pricing and native integration with GCP services. Often the entry point for teams already invested in Google Workspace or using Google Analytics data.
↑ Back to map
04 · DATA MODELING

Data Modeling

Data modeling is how analysts and engineers agree on what the data means before it's built. Getting this wrong makes every downstream report inconsistent. Getting it right makes the warehouse easy to query for years.

Normalized vs Denormalized
Normalization removes data redundancy by splitting tables; denormalization trades storage for query speed. Knowing when to apply which is a core design decision that determines how easy — or painful — your data is to query downstream.
Dimensional Modeling
A technique for organizing data warehouses into fact tables (measurements) and dimension tables (context). It is the standard mental model for analytical databases — understanding this unlocks how real enterprise data is structured and why.
Star Schema
The simplest dimensional model: one fact table in the centre, dimension tables radiating outward. Easy to query, easy to explain to stakeholders, and what most BI tools are optimized to consume.
Snowflake Schema
An extension of the star schema where dimension tables are further normalized. More storage-efficient but more complex to query. Knowing when to choose one over the other separates junior data analysts from senior ones.
↑ Back to map
05 · ETL & ELT

ETL & ELT

Data doesn't arrive analysis-ready. Understanding how it moves — from source systems into the warehouse — helps you trace data quality issues, contribute to pipeline design conversations, and avoid requesting reports built on broken data.

dbt
The de-facto tool for transforming raw data into analytics-ready models, in SQL. Version-controlled data transformations with built-in testing and documentation. If your team isn't using dbt, someone is spending time solving a problem it has already solved.
Azure Data Factory
Microsoft's cloud ETL service for ingesting and moving data at scale. The standard connector between source systems and the warehouse in most enterprise Microsoft stacks. Knowing it means you can trace where data comes from without asking an engineer every time.
Airflow
An orchestration platform that schedules and monitors data pipelines as code. When pipelines grow complex enough that "run this after that finishes" needs to be reliable and auditable, Airflow is the standard choice.
↑ Back to map
06 · DATA GOVERNANCE

Data Governance

Trust is slow to build and fast to lose. Governance is the set of practices that keeps data trustworthy — who can access it, what it means, and how good it is. In enterprise environments, this is non-negotiable from day one.

MDM
Master Data Management — the practice of creating a single, authoritative record for key business entities (customers, products, accounts). Without MDM, "customer count" returns a different number in every system, and every report becomes a negotiation.
Data Security & Privacy
Defining who can see what, how long sensitive data is retained, and how it is classified. GDPR, CCPA, and internal compliance policies all fall here. Non-negotiable in any regulated industry — and increasingly standard everywhere else.
Data Quality Framework
The systematic process for measuring, monitoring, and improving data quality — completeness, accuracy, consistency, timeliness. Bad data is the most common reason analytics projects lose stakeholder trust long before the insight is ever acted on.
↑ Back to map