AI Learning Roadmap

Study Notes

Everything covered from Python Foundations through Machine Learning — detailed notes, code examples, and projects all in one place.

✓ Phase 1 — Python, APIs, Databases · ▶ Phase 2 — Machine Learning · Phase 3 — Deep Learning

5 Projects Built · 2 APIs with DB · 3 ML Models · 12+ Topics Covered
Phase 1 Python Foundations & APIs

Variables & Datatypes

A variable is a named container that stores a value. Python automatically detects the type — you don't need to declare it.

# Common datatypes
name = "Manish"   # str — text
age = 25          # int — whole number
score = 98.5      # float — decimal number
passed = True     # bool — True or False

str (String)

Text. Wrap in quotes.
"hello", 'world'

int (Integer)

Whole numbers, no decimal.
1, 42, -7

float

Numbers with a decimal point.
3.14, 98.5

bool (Boolean)

Only two values.
True or False
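You can confirm what Python inferred with the built-in type(); a quick sketch using the variables from above:

```python
# Python infers each type from the assigned value
name = "Manish"
age = 25
score = 98.5
passed = True

print(type(name).__name__)    # → str
print(type(age).__name__)     # → int
print(type(score).__name__)   # → float
print(type(passed).__name__)  # → bool
```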

Conditional Statements

Conditionals let the program make decisions — run different code depending on whether a condition is True or False.

score = 85
if score >= 90:
    print("Grade: A")
elif score >= 75:
    print("Grade: B")
elif score >= 50:
    print("Grade: C")
else:
    print("Grade: Fail")
Real use in your project: The student report card uses exactly this logic to assign grades based on average marks.

Looping Constructs

Loops let you repeat a block of code multiple times without writing it over and over.

for loop — iterate over a sequence

subjects = ["Maths", "Science", "English"]
for subject in subjects:
    print(subject)
# Prints Maths, Science, English — one per line

while loop — repeat while a condition is True

choice = ""
while choice != "exit":
    choice = input("Enter command: ")
    print("You entered:", choice)
# Used in your expense tracker menu

Functions

A function is a reusable block of code. Define it once, call it anywhere. Keeps code clean and avoids repetition.

def calculate_grade(avg):
    if avg >= 90:
        return "A"
    elif avg >= 75:
        return "B"
    elif avg >= 50:
        return "C"
    else:
        return "Fail"

# Call it
grade = calculate_grade(82)  # → "B"

def

Keyword to define a function.

Parameters

Inputs the function receives — avg in the example above.

return

Sends a value back to wherever the function was called.

Call

Execute the function by writing its name with arguments: calculate_grade(82)

Data Structures

Ways to store and organise collections of data in Python.

[ ]
List
  • Ordered
  • Mutable (can change)
  • Allows duplicates
( )
Tuple
  • Ordered
  • Immutable (cannot change)
  • Allows duplicates
{ }
Dictionary
  • Key : Value pairs
  • Mutable
  • Keys must be unique
# List — ordered, mutable
marks = [85, 90, 78]
marks.append(95)  # add item

# Tuple — ordered, immutable
subjects = ("Maths", "Science", "English")

# Dictionary — key:value
student = {
    "name": "Manish",
    "maths": 85,
    "science": 90
}
print(student["name"])  # → "Manish"
Used in your projects: Expenses stored as a list of dictionaries — each expense is a dict {"category": "food", "amount": 50}, and all expenses are collected in a list.

Student Report Card Generator

📋

Student Report Card

CLI app that collects student data, calculates averages and grades, and prints formatted report cards.

Python CLI

What it does

Input

Student ID, name, and marks for Maths, Science, English

Processing

Calculates total, average, and assigns a grade using conditionals

Storage

Each student stored as a dictionary inside a list

Output

Prints a formatted report card for every student

Key concepts applied

# Student stored as a dictionary
student = {
    "id": "S001",
    "name": "Manish",
    "marks": [85, 90, 78],
    "average": 84.3,
    "grade": "B"
}

# Grade logic (function + conditionals)
def calculate_grade(avg):
    if avg >= 90:
        return "A"
    elif avg >= 75:
        return "B"
    elif avg >= 50:
        return "C"
    else:
        return "Fail"

Personal Expense Tracker (CLI)

💸

Expense Tracker

Interactive CLI menu app to track daily expenses — add, view, filter, and summarise spending.

Python CLI

Features built

Feature | How it works
Add Expense | Input category + amount → append dict to list
Show All | Loop through list, print each expense
Total Spent | sum() with a generator expression
Highest Expense | max() with a lambda key
Filter by Category | List comprehension to filter matching items
Group by Category | Dictionary to accumulate totals per category
# Expenses: list of dictionaries
expenses = [
    {"category": "food", "amount": 200},
    {"category": "travel", "amount": 150},
    {"category": "food", "amount": 80},
]

# Highest expense
highest = max(expenses, key=lambda x: x["amount"])

# Group by category
summary = {}
for exp in expenses:
    cat = exp["category"]
    summary[cat] = summary.get(cat, 0) + exp["amount"]

Request & Response / JSON

An API (Application Programming Interface) is a way for two systems to communicate. You send a Request, and the server sends back a Response.

HTTP Methods

Method | Purpose | Example
GET | Retrieve data | Get all expenses
POST | Create new data | Add an expense
PUT | Update existing data | Edit an expense
DELETE | Remove data | Delete an expense

JSON Structure

JSON (JavaScript Object Notation) is the standard format for sending data between client and server. It looks just like a Python dictionary.

# JSON response from an API
{
  "success": true,
  "data": {
    "id": 1,
    "category": "food",
    "amount": 200
  }
}
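The resemblance to a Python dict is more than cosmetic: the standard-library json module converts between the two. A small sketch (note that JSON's lowercase true becomes Python's True):

```python
import json

# JSON text as received from an API
raw = '{"success": true, "data": {"id": 1, "category": "food", "amount": 200}}'

payload = json.loads(raw)          # JSON text → Python dict
print(payload["success"])          # → True (JSON true maps to Python True)
print(payload["data"]["amount"])   # → 200

body = json.dumps(payload)         # Python dict → JSON text, e.g. for a request body
```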

Request & Response Flow

1. Client sends a Request — HTTP method + URL + optional body (for POST/PUT)
2. Server processes it — reads the request, queries the DB or performs logic
3. Server returns a Response — status code (200 OK, 404 Not Found) + JSON body

FastAPI

FastAPI is a modern Python framework for building APIs quickly. It uses type hints to validate data automatically and generates interactive docs at /docs.

Key Concepts

Pydantic / BaseModel

Defines the structure of request body data. FastAPI validates incoming data against it automatically.

Path Parameters

Part of the URL — /expenses/5
Defined with {expense_id} in the route.

Query Parameters

After the ? in the URL — /search?q=food
Passed as function arguments.

Request Body

JSON data sent with POST/PUT. Mapped to a Pydantic model in the function parameter.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Expense(BaseModel):
    category: str
    amount: float

# POST /expenses — request body validated by Pydantic
@app.post("/expenses")
def add_expense(expense: Expense):
    return {"success": True, "data": expense}

# GET /expenses/5 — path parameter
@app.get("/expenses/{expense_id}")
def get_expense(expense_id: int):
    return {"id": expense_id}
Route ordering matters: FastAPI matches routes top-down. Always define specific routes like /expenses/highest before dynamic ones like /expenses/{id}, otherwise "highest" gets treated as an ID.

SQL Basics (PostgreSQL)

SQL (Structured Query Language) is used to create, read, update, and delete data in relational databases like PostgreSQL.

Core Commands

-- Create a table
CREATE TABLE expenses (
    id SERIAL PRIMARY KEY,
    category VARCHAR(100),
    amount NUMERIC
);

-- Insert data
INSERT INTO expenses (category, amount) VALUES ('food', 200);

-- Select all
SELECT * FROM expenses;

-- Filter rows
SELECT * FROM expenses WHERE category = 'food';
SELECT * FROM expenses WHERE amount > 100;

-- Update a row
UPDATE expenses SET amount = 250 WHERE id = 1;

-- Delete a row
DELETE FROM expenses WHERE id = 1;

-- Aggregate — group and sum
SELECT category, SUM(amount) FROM expenses GROUP BY category;
Command | Purpose
SELECT | Read / retrieve data
INSERT | Add new rows
UPDATE | Modify existing rows
DELETE | Remove rows
CREATE TABLE | Define a new table structure
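To try these statements without a PostgreSQL server, Python's built-in sqlite3 accepts nearly identical SQL. An illustrative sketch, with SERIAL swapped for SQLite's auto-incrementing INTEGER PRIMARY KEY:

```python
import sqlite3

# SQLite ships with Python — handy for practising SQL without a server.
# SERIAL is PostgreSQL-specific; in SQLite an INTEGER PRIMARY KEY auto-increments.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")

cur.executemany(
    "INSERT INTO expenses (category, amount) VALUES (?, ?)",
    [("food", 200), ("travel", 150), ("food", 80)],
)

# Aggregate — group and sum, same SQL as in PostgreSQL
cur.execute("SELECT category, SUM(amount) FROM expenses GROUP BY category ORDER BY category")
rows = cur.fetchall()
print(rows)  # → [('food', 280.0), ('travel', 150.0)]
conn.close()
```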

Python + PostgreSQL (psycopg2)

psycopg2 is the Python library used to connect to and interact with a PostgreSQL database from code.

Connection Pattern

import psycopg2
import os
from dotenv import load_dotenv

load_dotenv()

def get_connection():
    return psycopg2.connect(
        dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"),
        host=os.getenv("DB_HOST"),
        port=os.getenv("DB_PORT")
    )

Execute a Query

conn = get_connection()
cursor = conn.cursor()

# Parameterised query — safe from SQL injection
cursor.execute(
    "INSERT INTO expenses (category, amount) VALUES (%s, %s) RETURNING id;",
    (expense.category, expense.amount)
)
new_id = cursor.fetchone()[0]
conn.commit()  # save the change
cursor.close()
conn.close()
Always use %s placeholders instead of string formatting to pass values into queries. This prevents SQL injection attacks.
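The same rule can be demonstrated with the standard-library sqlite3 (its placeholder is ? rather than psycopg2's %s, but the principle is identical); the hostile category value here is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")

category = "food'; DROP TABLE expenses; --"  # hostile, made-up input

# UNSAFE — an f-string splices the input straight into the SQL text:
#   cur.execute(f"INSERT INTO expenses (category, amount) VALUES ('{category}', 50)")

# SAFE — the driver sends the value separately, so it can never become SQL
cur.execute("INSERT INTO expenses (category, amount) VALUES (?, ?)", (category, 50))
conn.commit()

stored = cur.execute("SELECT category FROM expenses").fetchone()[0]
print(stored)  # the hostile string was stored as plain data, not executed
conn.close()
```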

try / finally Pattern

Always close the connection in a finally block so it gets cleaned up even if an error occurs.

conn = cursor = None  # so the finally block can check them even if connecting fails
try:
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM expenses;")
    rows = cursor.fetchall()
except Exception as e:
    return {"success": False, "error": str(e)}
finally:
    if cursor:
        cursor.close()
    if conn:
        conn.close()  # always runs

Expense Tracker — API + Database

💳

Expense Tracker API

FastAPI + psycopg2 + Pydantic. Data stored permanently in PostgreSQL instead of in-memory.

FastAPI PostgreSQL

Combined everything from Phase 1 — FastAPI for the HTTP layer, psycopg2 for the database layer, and Pydantic for request validation.

APIs built

Method | Endpoint | Description
POST | /expenses | Add a new expense to DB
GET | /expenses | Get all expenses from DB
GET | /expenses/highest | Get the highest expense
GET | /expenses/summary | Total per category
GET | /expenses/category/{cat} | Filter by category
GET | /expenses/category/{cat}/total | Total for one category
PUT | /expenses/{id} | Update an expense
DELETE | /expenses/{id} | Delete an expense

Architecture

Client (Postman / Browser) — sends HTTP requests
FastAPI — receives the request, validates the body via Pydantic
psycopg2 — executes the SQL query against PostgreSQL
Response — returns JSON with success status and data
Phase 1 milestone: You went from basic Python variables all the way to a fully functional REST API backed by a real database.

Student API — Refactored

🎓

Student API

Full CRUD REST API for student records with reusable DB connection helpers and env-based config.

FastAPI PostgreSQL

APIs built

Method | Endpoint | Description
POST | /students | Add a new student
GET | /students | Get all students
GET | /students/{id} | Get student by ID
PUT | /students/{id} | Update student record
DELETE | /students/{id} | Delete a student

Reusable DB helper

# db.py — shared across the whole app
def get_connection():
    return psycopg2.connect(
        dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"),
        host=os.getenv("DB_HOST"),
        port=os.getenv("DB_PORT")
    )

# student_db_api.py — imports from db.py
from db import get_connection
Key refactor: Moving get_connection() to its own module means every route file imports it from one place — changes to DB config only need to happen once.
Phase 2 Machine Learning

What is Machine Learning?

Machine Learning is a way of teaching computers to learn from data — instead of writing explicit rules, you show the model examples and let it figure out the patterns.

Analogy: Traditional programming is like giving someone a recipe. ML is like letting someone taste 1000 dishes and figure out the recipe themselves.

ML is broadly split into three categories based on how the model learns:

Supervised Learning

Learn from labeled data. You know the correct answers during training.

Unsupervised Learning

No labels. The model discovers hidden patterns and structure on its own.

Reinforcement Learning

Learn through trial and error by receiving rewards or penalties. (Not in Phase 2)

Semi-Supervised

Mix of labeled and unlabeled data. (Not in Phase 2)

Supervised Learning

You teach the model using labeled data — data where you already know the correct answer. The model learns the relationship between inputs and outputs, then predicts outputs for new unseen inputs.

Analogy: Like a student studying past exam papers that already have answer keys. They learn from examples, then sit the real exam.

How it works

1. Labeled Dataset — each data point has an input (features) AND a known output (label)
2. Train the Model — the model learns the mapping: input → output
3. Predict on New Data — give it unseen inputs → it predicts the output
4. Evaluate — compare predicted vs actual to measure performance

Two Types

Regression

Predicts a continuous number
e.g. "What score will this student get?" → 78.5

Classification

Predicts a category/label
e.g. "Is this email spam?" → Spam / Not Spam

Unsupervised Learning

No labels. No answer key. You give the model raw data and it finds hidden structure, patterns, or groupings by itself.

Analogy: Sorting a pile of mixed fruits with no instructions — you naturally group them by color, size, and shape without being told what to look for.

Key Type — Clustering

Groups similar data points together into clusters. Points in the same cluster are more similar to each other than to those in other clusters.

Your Project Example

Input: customer age, income, spending score
No labels given
Output: Group A (high spenders), Group B (budget), Group C (casual)

Topics You'll Cover

K-Means Elbow Method Matplotlib Seaborn

Supervised vs Unsupervised

 | Supervised | Unsupervised
Labeled data? | ✓ Yes | ✗ No
Goal | Predict a known output | Discover hidden patterns
Output type | Number or Category | Groups / Clusters
Your projects | Student Predictor, Spam Classifier | Customer Segmentation
Evaluation | MSE, Accuracy, Precision, Recall | Visual inspection, Elbow Method

Regression

Predicts a continuous numerical value. The model learns from input-output pairs and finds the best-fitting line through the data.

Example Dataset

Hours Studied | Exam Score
1 | 40
2 | 50
3 | 60
5 | 75
8 | 90

The Equation — Linear Regression

y = mx + b

y → predicted score (output)
x → hours studied (input / feature)
m → slope (how much score increases per hour)
b → intercept (base score with 0 hours studied)

# With multiple inputs:
y = m1·x1 + m2·x2 + m3·x3 + b

How the Model Learns — Cost Function (MSE)

MSE = average of (predicted − actual)²

Example:
Actual score:    75
Predicted score: 70
Error: (70 − 75)² = 25

The model keeps adjusting m and b to minimise MSE.
Key Idea: The lower the MSE, the better the model. Training = finding values of m and b that give the lowest possible MSE.
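The formula is simple enough to verify by hand. A pure-Python sketch with made-up actual and predicted scores:

```python
def mse(actual, predicted):
    """Mean of the squared differences between prediction and truth."""
    errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual    = [75, 60, 90]
predicted = [70, 62, 88]
# (70-75)² + (62-60)² + (88-90)² = 25 + 4 + 4 = 33, and 33 / 3 = 11.0
print(mse(actual, predicted))  # → 11.0
```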

Train / Test Split

Training Data — 80%

The model learns patterns from this data.

Testing Data — 20%

Model is evaluated on this. Never seen during training.

End-to-End Flow

1. Raw Data — Kaggle student scores dataset
2. Data Cleansing — handle missing values, fix formats, remove outliers
3. Train / Test Split — 80% training, 20% testing
4. Train the Model — Scikit-Learn fits the line, minimises MSE
5. Test the Model — predict on unseen test data
6. Evaluate — calculate MSE: how low is the error?
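Scikit-Learn handles the fitting in the project itself, but with a single feature the best-fit m and b have a closed form (slope = covariance of x and y divided by variance of x). A sketch using the example table above:

```python
# Example dataset from above: hours studied → exam score
x = [1, 2, 3, 5, 8]
y = [40, 50, 60, 75, 90]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope: covariance(x, y) / variance(x)
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - m * mean_x  # the fitted line passes through the mean point

print(f"y = {m:.2f}x + {b:.2f}")  # roughly y = 7.08x + 36.10
print(m * 4 + b)                  # predicted score for 4 hours of study
```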

Classification

Predicts a category/label — the output is one of a fixed set of classes. Instead of a number, the model decides which group something belongs to.

Examples

Binary Classification (2 classes)

Email → Spam or Not Spam

Multi-Class Classification

Review → Positive Neutral Negative

Text Preprocessing (for Spam Classifier)

1. Clean — remove punctuation, lowercase everything, remove stop words
2. Vectorize — convert words to numbers: count how often each word appears
3. Feed to Model — the model works with number arrays, not raw text
"Win a free prize" → [0, 1, 0, 1, 1, 0, 1, ...]
                       ↑ each number = presence/frequency of a word
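As an illustration of that vectorize step, here is a toy count-vectorizer in plain Python; real projects use scikit-learn's CountVectorizer or TfidfVectorizer, and the tiny corpus below is invented:

```python
# Toy count-vectorizer — illustrative only
corpus = ["win a free prize", "meeting at noon", "free prize inside"]

# Vocabulary: every unique word, in a fixed sorted order
vocab = sorted({word for doc in corpus for word in doc.split()})

def vectorize(text):
    """Count how often each vocabulary word appears in the text."""
    words = text.split()
    return [words.count(word) for word in vocab]

print(vocab)
print(vectorize("win a free prize"))  # 1s where a vocab word is present
```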

Evaluation Metrics

Accuracy
Overall

How many total predictions were correct?

Precision
Quality

Of all predicted spam, how many were actually spam?

Recall
Coverage

Of all actual spam, how many did we catch?

Accuracy  = Correct Predictions / Total Predictions
Precision = True Positives / (True Positives + False Positives)
Recall    = True Positives / (True Positives + False Negatives)

Confusion Matrix

 | Predicted: Spam | Predicted: Not Spam
Actual: Spam | True Positive (TP) = 90 | False Negative (FN) = 10
Actual: Not Spam | False Positive (FP) = 5 | True Negative (TN) = 895
TP = correctly predicted spam  |  TN = correctly predicted not-spam  |  FP = predicted spam but wasn't (false alarm)  |  FN = missed actual spam
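Plugging those four counts into the metric formulas above:

```python
# Counts from the confusion matrix above
TP, FN, FP, TN = 90, 10, 5, 895
total = TP + FN + FP + TN  # 1000 messages

accuracy  = (TP + TN) / total  # correct predictions overall
precision = TP / (TP + FP)     # of predicted spam, how much really was spam
recall    = TP / (TP + FN)     # of actual spam, how much we caught

print(f"Accuracy:  {accuracy:.3f}")   # → 0.985
print(f"Precision: {precision:.3f}")  # → 0.947
print(f"Recall:    {recall:.3f}")     # → 0.900
```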

Student Score Predictor

📊

Student Score Predictor

Trains a Linear Regression model on the Kaggle Students Performance dataset to predict exam scores from parental education, lunch type, test prep, and gender.

scikit-learn Python Regression pandas

Dataset — StudentsPerformance.csv

Feature (Input) | Description
gender | male / female
race/ethnicity | group A–E
parental level of education | high school → master's degree
lunch | standard / free-reduced
test preparation course | completed / none
math score (Target) | Score to predict (0–100)

Pipeline

1. Load CSV with pandas — pd.read_csv("StudentsPerformance.csv")
2. Encode categorical columns — pd.get_dummies() converts text labels to 0/1 numbers
3. Train / Test Split (80/20) — train_test_split(X, y, test_size=0.2)
4. Fit Linear Regression — model.fit(X_train, y_train)
5. Evaluate on test set — mean_squared_error(y_test, y_pred)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

df = pd.read_csv("StudentsPerformance.csv")
df = pd.get_dummies(df)  # encode categoricals

X = df.drop("math score", axis=1)
y = df["math score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")
Phase 2 milestone: First real ML project — took raw CSV data all the way through cleaning, encoding, splitting, training, and evaluation.

Spam Email Classifier

🛡️

Spam Email Classifier

Trains a Multinomial Naive Bayes model on 5,572 real SMS messages to classify them as spam or ham. Uses TF-IDF vectorization to convert text into numbers the model can learn from.

scikit-learn Python Classification NLP TF-IDF

Dataset — spam.csv

Column | Description
v1 (label) | ham (not spam) or spam
v2 (message) | Raw SMS text content
5,572 messages — 4,825 ham · 747 spam

Pipeline

1. Text Preprocessing — lowercase, strip punctuation, normalise whitespace with regex
2. Label Encoding — ham → 0, spam → 1
3. Train / Test Split (80/20) — stratify=y keeps the ham/spam ratio equal in both splits
4. TF-IDF Vectorization — TfidfVectorizer(max_features=5000), fit on train only, transform both
5. Train Naive Bayes — MultinomialNB().fit(X_train_vec, y_train)
6. Evaluate + Test Unseen Messages — accuracy, precision, recall, confusion matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)  # learn vocab from train only
X_test_vec = vectorizer.transform(X_test)        # apply same vocab to test

model = MultinomialNB()
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)    # 96.05%
precision = precision_score(y_test, y_pred)  # 100.00% — zero false alarms
recall = recall_score(y_test, y_pred)        # 70.47% — catches most spam
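To make TF-IDF less of a black box, here is a toy version using the textbook formula tf · log(N / df); scikit-learn's TfidfVectorizer applies extra smoothing and L2 normalisation, so its exact numbers differ, and the three-message corpus below is invented:

```python
import math

docs = [
    "win a free prize",
    "free prize inside",
    "meeting at noon",
]
N = len(docs)
tokenized = [doc.split() for doc in docs]

def tfidf(term, doc_words):
    tf = doc_words.count(term)                      # raw count in this document
    df = sum(term in words for words in tokenized)  # documents containing the term
    return tf * math.log(N / df)                    # rare terms get higher weight

# "free" appears in 2 of 3 docs → low weight; "win" in 1 of 3 → higher weight
print(round(tfidf("free", tokenized[0]), 3))  # → 0.405
print(round(tfidf("win", tokenized[0]), 3))   # → 1.099
```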

Performance Metrics

Accuracy
96.05%

Fraction of all messages labelled correctly

Precision
100.00%

Zero legitimate messages wrongly flagged as spam

Recall
70.47%

~70% of spam caught — model errs on side of caution

Key insight: Precision of 100% means no false positives — not a single real message was lost to the spam folder. The trade-off is a recall of 70%: some spam slips through, which is the safer failure mode.

Customer Segmentation

🧩

Customer Segmentation

Unsupervised K-Means clustering that automatically groups 200 mall customers into 5 segments based on annual income and spending score — no labels needed.

scikit-learn Python Clustering Matplotlib Seaborn

Dataset — Mall_Customers.csv

Feature | Description
CustomerID | Unique identifier (dropped before training)
Gender | Male / Female → encoded 0 / 1
Age | Customer age in years
Annual Income (k$) | Annual income in thousands of dollars
Spending Score (1-100) (cluster feature) | Mall-assigned score based on spending behaviour

Pipeline

1. Load & Clean Data — encode Gender, drop CustomerID, check for nulls
2. Feature Scaling — StandardScaler (mean=0, std=1) so both features contribute equally to distance
3. Elbow Method — run K-Means for K=1–10, plot inertia; elbow at K=5
4. K-Means Clustering (K=5) — KMeans(n_clusters=5).fit_predict(X_scaled)
5. Visualize & Summarize — scatter plot, Seaborn pairplot, per-cluster averages
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow Method — find best K
inertia = [KMeans(n_clusters=k, random_state=42).fit(X_scaled).inertia_
           for k in range(1, 11)]

# Train final model
model = KMeans(n_clusters=5, random_state=42)
df["Cluster"] = model.fit_predict(X_scaled)

# Visualize
sns.scatterplot(data=df, x="Annual Income (k$)",
                y="Spending Score (1-100)", hue="Cluster")
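What fit_predict does internally can be sketched with a toy one-dimensional K-Means (K=2); the points and starting centroids below are made up, and real K-Means also handles multiple features and smarter initialisation:

```python
# Toy 1-D K-Means (K=2) — the assign/update loop behind fit_predict
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [1.0, 12.0]  # deliberately simple starting centroids

for _ in range(10):  # a few assign/update rounds is plenty here
    # Assign each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # → [2.0, 11.0] — the two natural groups in the data
```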

Clusters Discovered

Cluster | Count | Avg Income | Avg Spend | Segment
0 | 81 | $55k | 49.5 | Average / Mixed
1 | 39 | $87k | 82.1 | High Income, High Spenders ⭐
2 | 22 | $26k | 79.4 | Low Income, High Spenders
3 | 35 | $88k | 17.1 | High Income, Low Spenders
4 | 23 | $26k | 20.9 | Low Income, Low Spenders
Key insight: No labels were provided — the model discovered all 5 segments entirely on its own. Cluster 1 (high income, high spend) is the most valuable customer group to target. Cluster 3 (high income, low spend) represents an untapped opportunity.

Build Timeline

Every commit in the order it was built — from first FastAPI setup to the first trained ML model.

2026-04-13

Customer Segmentation

Unsupervised K-Means clustering on Mall_Customers.csv — Elbow Method, feature scaling, 5 cluster segments, Matplotlib + Seaborn visualizations.

2026-04-13

Spam Email Classifier

Multinomial Naive Bayes on 5,572 SMS messages — TF-IDF vectorization, text preprocessing, 96% accuracy, 100% precision.

2026-04-13

Student Score Predictor — Updated

Added results.txt output, LabelEncoder for categoricals, predicted vs actual table, RMSE evaluation report.

c531da4

ML Score Predictor

Trained Linear Regression model on StudentsPerformance.csv — Phase 2 first project.

6c8fe28

Code Cleanup

Refactored and tidied existing API and app files.

caa5774

Expense Tracker API + DB

Full CRUD, category filter, summary & totals with PostgreSQL — Phase 1 capstone.

9731706

Student API Refactor

Extracted reusable DB connection helper + complete CRUD with PostgreSQL.

7f49b2e

FastAPI + PostgreSQL Setup

First FastAPI student API with env-based configuration — first real API with a database.