AI Learning Roadmap

Study Notes

Everything covered from Python Foundations through Machine Learning — detailed notes, code examples, and projects all in one place.

✓ Phase 1 — Python, APIs, Databases · ▶ Phase 2 — Machine Learning · Phase 3 — Deep Learning

5 Projects Built · 2 APIs with DB · 3 ML Models · 12+ Topics Covered
Phase 1 Python Foundations & APIs

Variables & Datatypes

A variable is a named container that stores a value. Python automatically detects the type — you don't need to declare it.

# Common datatypes
name = "Manish"   # str — text
age = 25          # int — whole number
score = 98.5      # float — decimal number
passed = True     # bool — True or False

str (String)

Text. Wrap in quotes.
"hello", 'world'

int (Integer)

Whole numbers, no decimal.
1, 42, -7

float

Numbers with a decimal point.
3.14, 98.5

bool (Boolean)

Only two values.
True or False
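You can confirm what Python inferred with the built-in type(); a quick sketch using the variables from above:

```python
# Python infers each type from the assigned value
name = "Manish"
age = 25
score = 98.5
passed = True

print(type(name).__name__)    # → str
print(type(age).__name__)     # → int
print(type(score).__name__)   # → float
print(type(passed).__name__)  # → bool
```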

Conditional Statements

Conditionals let the program make decisions — run different code depending on whether a condition is True or False.

score = 85
if score >= 90:
    print("Grade: A")
elif score >= 75:
    print("Grade: B")
elif score >= 50:
    print("Grade: C")
else:
    print("Grade: Fail")
Real use in your project: The student report card uses exactly this logic to assign grades based on average marks.

Looping Constructs

Loops let you repeat a block of code multiple times without writing it over and over.

for loop — iterate over a sequence

subjects = ["Maths", "Science", "English"]
for subject in subjects:
    print(subject)
# Prints Maths, Science, English — one per line

while loop — repeat while a condition is True

choice = ""
while choice != "exit":
    choice = input("Enter command: ")
    print("You entered:", choice)
# Used in your expense tracker menu

Functions

A function is a reusable block of code. Define it once, call it anywhere. Keeps code clean and avoids repetition.

def calculate_grade(avg):
    if avg >= 90:
        return "A"
    elif avg >= 75:
        return "B"
    elif avg >= 50:
        return "C"
    else:
        return "Fail"

# Call it
grade = calculate_grade(82)  # → "B"

def

Keyword to define a function.

Parameters

Inputs the function receives — avg in the example above.

return

Sends a value back to wherever the function was called.

Call

Execute the function by writing its name with arguments: calculate_grade(82)

Data Structures

Ways to store and organise collections of data in Python.

[ ]
List
  • Ordered
  • Mutable (can change)
  • Allows duplicates
( )
Tuple
  • Ordered
  • Immutable (cannot change)
  • Allows duplicates
{ }
Dictionary
  • Key : Value pairs
  • Mutable
  • Keys must be unique
# List — ordered, mutable
marks = [85, 90, 78]
marks.append(95)  # add item

# Tuple — ordered, immutable
subjects = ("Maths", "Science", "English")

# Dictionary — key:value
student = {
    "name": "Manish",
    "maths": 85,
    "science": 90
}
print(student["name"])  # → "Manish"
Used in your projects: Expenses stored as a list of dictionaries — each expense is a dict {"category": "food", "amount": 50}, and all expenses are collected in a list.

Student Report Card Generator

📋

Student Report Card

CLI app that collects student data, calculates averages and grades, and prints formatted report cards.

Python CLI

What it does

Input

Student ID, name, and marks for Maths, Science, English

Processing

Calculates total, average, and assigns a grade using conditionals

Storage

Each student stored as a dictionary inside a list

Output

Prints a formatted report card for every student

Key concepts applied

# Student stored as a dictionary
student = {
    "id": "S001",
    "name": "Manish",
    "marks": [85, 90, 78],
    "average": 84.3,
    "grade": "B"
}

# Grade logic (function + conditionals)
def calculate_grade(avg):
    if avg >= 90:
        return "A"
    elif avg >= 75:
        return "B"
    elif avg >= 50:
        return "C"
    else:
        return "Fail"

Personal Expense Tracker (CLI)

💸

Expense Tracker

Interactive CLI menu app to track daily expenses — add, view, filter, and summarise spending.

Python CLI

Features built

Feature | How it works
Add Expense | Input category + amount → append dict to list
Show All | Loop through list, print each expense
Total Spent | sum() with a generator expression
Highest Expense | max() with a lambda key
Filter by Category | List comprehension to filter matching items
Group by Category | Dictionary to accumulate totals per category
# Expenses: list of dictionaries
expenses = [
    {"category": "food", "amount": 200},
    {"category": "travel", "amount": 150},
    {"category": "food", "amount": 80},
]

# Highest expense
highest = max(expenses, key=lambda x: x["amount"])

# Group by category
summary = {}
for exp in expenses:
    cat = exp["category"]
    summary[cat] = summary.get(cat, 0) + exp["amount"]

Request & Response / JSON

An API (Application Programming Interface) is a way for two systems to communicate. You send a Request, and the server sends back a Response.

HTTP Methods

Method | Purpose | Example
GET | Retrieve data | Get all expenses
POST | Create new data | Add an expense
PUT | Update existing data | Edit an expense
DELETE | Remove data | Delete an expense

JSON Structure

JSON (JavaScript Object Notation) is the standard format for sending data between client and server. It looks just like a Python dictionary.

# JSON response from an API
{
  "success": true,
  "data": {
    "id": 1,
    "category": "food",
    "amount": 200
  }
}
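The resemblance to a Python dict is more than cosmetic: the standard-library json module converts between the two. A small sketch (note that JSON's lowercase true becomes Python's True):

```python
import json

# JSON text as received from an API
raw = '{"success": true, "data": {"id": 1, "category": "food", "amount": 200}}'

payload = json.loads(raw)          # JSON text → Python dict
print(payload["success"])          # → True (JSON true maps to Python True)
print(payload["data"]["amount"])   # → 200

body = json.dumps(payload)         # Python dict → JSON text, e.g. for a request body
```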

Request & Response Flow

1. Client sends a Request — HTTP method + URL + optional body (for POST/PUT)
2. Server processes it — reads the request, queries the DB or performs logic
3. Server returns a Response — status code (200 OK, 404 Not Found) + JSON body

FastAPI

FastAPI is a modern Python framework for building APIs quickly. It uses type hints to validate data automatically and generates interactive docs at /docs.

Key Concepts

Pydantic / BaseModel

Defines the structure of request body data. FastAPI validates incoming data against it automatically.

Path Parameters

Part of the URL — /expenses/5
Defined with {expense_id} in the route.

Query Parameters

After the ? in the URL — /search?q=food
Passed as function arguments.

Request Body

JSON data sent with POST/PUT. Mapped to a Pydantic model in the function parameter.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Expense(BaseModel):
    category: str
    amount: float

# POST /expenses — request body validated by Pydantic
@app.post("/expenses")
def add_expense(expense: Expense):
    return {"success": True, "data": expense}

# GET /expenses/5 — path parameter
@app.get("/expenses/{expense_id}")
def get_expense(expense_id: int):
    return {"id": expense_id}
Route ordering matters: FastAPI matches routes top-down. Always define specific routes like /expenses/highest before dynamic ones like /expenses/{id}, otherwise "highest" gets treated as an ID.

SQL Basics (PostgreSQL)

SQL (Structured Query Language) is used to create, read, update, and delete data in relational databases like PostgreSQL.

Core Commands

-- Create a table
CREATE TABLE expenses (
    id SERIAL PRIMARY KEY,
    category VARCHAR(100),
    amount NUMERIC
);

-- Insert data
INSERT INTO expenses (category, amount) VALUES ('food', 200);

-- Select all
SELECT * FROM expenses;

-- Filter rows
SELECT * FROM expenses WHERE category = 'food';
SELECT * FROM expenses WHERE amount > 100;

-- Update a row
UPDATE expenses SET amount = 250 WHERE id = 1;

-- Delete a row
DELETE FROM expenses WHERE id = 1;

-- Aggregate — group and sum
SELECT category, SUM(amount) FROM expenses GROUP BY category;
Command | Purpose
SELECT | Read / retrieve data
INSERT | Add new rows
UPDATE | Modify existing rows
DELETE | Remove rows
CREATE TABLE | Define a new table structure
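To try these statements without a PostgreSQL server, Python's built-in sqlite3 accepts nearly identical SQL. An illustrative sketch, with SERIAL swapped for SQLite's auto-incrementing INTEGER PRIMARY KEY:

```python
import sqlite3

# SQLite ships with Python — handy for practising SQL without a server.
# SERIAL is PostgreSQL-specific; in SQLite an INTEGER PRIMARY KEY auto-increments.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")

cur.executemany(
    "INSERT INTO expenses (category, amount) VALUES (?, ?)",
    [("food", 200), ("travel", 150), ("food", 80)],
)

# Aggregate — group and sum, same SQL as in PostgreSQL
cur.execute("SELECT category, SUM(amount) FROM expenses GROUP BY category ORDER BY category")
rows = cur.fetchall()
print(rows)  # → [('food', 280.0), ('travel', 150.0)]
conn.close()
```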

Python + PostgreSQL (psycopg2)

psycopg2 is the Python library used to connect to and interact with a PostgreSQL database from code.

Connection Pattern

import psycopg2
import os
from dotenv import load_dotenv

load_dotenv()

def get_connection():
    return psycopg2.connect(
        dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"),
        host=os.getenv("DB_HOST"),
        port=os.getenv("DB_PORT")
    )

Execute a Query

conn = get_connection()
cursor = conn.cursor()

# Parameterised query — safe from SQL injection
cursor.execute(
    "INSERT INTO expenses (category, amount) VALUES (%s, %s) RETURNING id;",
    (expense.category, expense.amount)
)
new_id = cursor.fetchone()[0]
conn.commit()  # save the change
cursor.close()
conn.close()
Always use %s placeholders instead of string formatting to pass values into queries. This prevents SQL injection attacks.
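The same rule can be demonstrated with the standard-library sqlite3 (its placeholder is ? rather than psycopg2's %s, but the principle is identical); the hostile category value here is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")

category = "food'; DROP TABLE expenses; --"  # hostile, made-up input

# UNSAFE — an f-string splices the input straight into the SQL text:
#   cur.execute(f"INSERT INTO expenses (category, amount) VALUES ('{category}', 50)")

# SAFE — the driver sends the value separately, so it can never become SQL
cur.execute("INSERT INTO expenses (category, amount) VALUES (?, ?)", (category, 50))
conn.commit()

stored = cur.execute("SELECT category FROM expenses").fetchone()[0]
print(stored)  # the hostile string was stored as plain data, not executed
conn.close()
```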

try / finally Pattern

Always close the connection in a finally block so it gets cleaned up even if an error occurs.

conn = cursor = None  # so the finally block can check them even if connecting fails
try:
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM expenses;")
    rows = cursor.fetchall()
except Exception as e:
    return {"success": False, "error": str(e)}
finally:
    if cursor:
        cursor.close()
    if conn:
        conn.close()  # always runs

Expense Tracker — API + Database

💳

Expense Tracker API

FastAPI + psycopg2 + Pydantic. Data stored permanently in PostgreSQL instead of in-memory.

FastAPI PostgreSQL

Combined everything from Phase 1 — FastAPI for the HTTP layer, psycopg2 for the database layer, and Pydantic for request validation.

APIs built

Method | Endpoint | Description
POST | /expenses | Add a new expense to DB
GET | /expenses | Get all expenses from DB
GET | /expenses/highest | Get the highest expense
GET | /expenses/summary | Total per category
GET | /expenses/category/{cat} | Filter by category
GET | /expenses/category/{cat}/total | Total for one category
PUT | /expenses/{id} | Update an expense
DELETE | /expenses/{id} | Delete an expense

Architecture

Client (Postman / Browser) — sends HTTP requests
FastAPI — receives the request, validates the body via Pydantic
psycopg2 — executes the SQL query against PostgreSQL
Response — returns JSON with success status and data
Phase 1 milestone: You went from basic Python variables all the way to a fully functional REST API backed by a real database.

Student API — Refactored

🎓

Student API

Full CRUD REST API for student records with reusable DB connection helpers and env-based config.

FastAPI PostgreSQL

APIs built

Method | Endpoint | Description
POST | /students | Add a new student
GET | /students | Get all students
GET | /students/{id} | Get student by ID
PUT | /students/{id} | Update student record
DELETE | /students/{id} | Delete a student

Reusable DB helper

# db.py — shared across the whole app
def get_connection():
    return psycopg2.connect(
        dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"),
        host=os.getenv("DB_HOST"),
        port=os.getenv("DB_PORT")
    )

# student_db_api.py — imports from db.py
from db import get_connection
Key refactor: Moving get_connection() to its own module means every route file imports it from one place — changes to DB config only need to happen once.
Phase 2 Machine Learning

What is Machine Learning?

Machine Learning is a way of teaching computers to learn from data — instead of writing explicit rules, you show the model examples and let it figure out the patterns.

Analogy: Traditional programming is like giving someone a recipe. ML is like letting someone taste 1000 dishes and figure out the recipe themselves.

ML is broadly split into three categories based on how the model learns:

Supervised Learning

Learn from labeled data. You know the correct answers during training.

Unsupervised Learning

No labels. The model discovers hidden patterns and structure on its own.

Reinforcement Learning

Learn through trial and error by receiving rewards or penalties. (Not in Phase 2)

Semi-Supervised

Mix of labeled and unlabeled data. (Not in Phase 2)

Supervised Learning

You teach the model using labeled data — data where you already know the correct answer. The model learns the relationship between inputs and outputs, then predicts outputs for new unseen inputs.

Analogy: Like a student studying past exam papers that already have answer keys. They learn from examples, then sit the real exam.

How it works

1. Labeled Dataset — each data point has an input (features) AND a known output (label)
2. Train the Model — the model learns the mapping: input → output
3. Predict on New Data — give it unseen inputs → it predicts the output
4. Evaluate — compare predicted vs actual to measure performance

Two Types

Regression

Predicts a continuous number
e.g. "What score will this student get?" → 78.5

Classification

Predicts a category/label
e.g. "Is this email spam?" → Spam / Not Spam

Unsupervised Learning

No labels. No answer key. You give the model raw data and it finds hidden structure, patterns, or groupings by itself.

Analogy: Sorting a pile of mixed fruits with no instructions — you naturally group them by color, size, and shape without being told what to look for.

Key Type — Clustering

Groups similar data points together into clusters. Points in the same cluster are more similar to each other than to those in other clusters.

Your Project Example

Input: customer age, income, spending score
No labels given
Output: Group A (high spenders), Group B (budget), Group C (casual)

Topics You'll Cover

K-Means Elbow Method Matplotlib Seaborn

Supervised vs Unsupervised

 | Supervised | Unsupervised
Labeled data? | ✓ Yes | ✗ No
Goal | Predict a known output | Discover hidden patterns
Output type | Number or Category | Groups / Clusters
Your projects | Student Predictor, Spam Classifier | Customer Segmentation
Evaluation | MSE, Accuracy, Precision, Recall | Visual inspection, Elbow Method

Regression

Predicts a continuous numerical value. The model learns from input-output pairs and finds the best-fitting line through the data.

Example Dataset

Hours Studied | Exam Score
1 | 40
2 | 50
3 | 60
5 | 75
8 | 90

The Equation — Linear Regression

y = mx + b

y → predicted score (output)
x → hours studied (input / feature)
m → slope (how much score increases per hour)
b → intercept (base score with 0 hours studied)

# With multiple inputs:
y = m1·x1 + m2·x2 + m3·x3 + b

How the Model Learns — Cost Function (MSE)

MSE = average of (predicted − actual)²

Example:
Actual score:    75
Predicted score: 70
Error: (70 − 75)² = 25

The model keeps adjusting m and b to minimise MSE.
Key Idea: The lower the MSE, the better the model. Training = finding values of m and b that give the lowest possible MSE.
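The formula is simple enough to verify by hand. A pure-Python sketch with made-up actual and predicted scores:

```python
def mse(actual, predicted):
    """Mean of the squared differences between prediction and truth."""
    errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual    = [75, 60, 90]
predicted = [70, 62, 88]
# (70-75)² + (62-60)² + (88-90)² = 25 + 4 + 4 = 33, and 33 / 3 = 11.0
print(mse(actual, predicted))  # → 11.0
```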

Train / Test Split

Training Data — 80%

The model learns patterns from this data.

Testing Data — 20%

Model is evaluated on this. Never seen during training.

End-to-End Flow

1. Raw Data — Kaggle student scores dataset
2. Data Cleansing — handle missing values, fix formats, remove outliers
3. Train / Test Split — 80% training, 20% testing
4. Train the Model — Scikit-Learn fits the line, minimises MSE
5. Test the Model — predict on unseen test data
6. Evaluate — calculate MSE: how low is the error?
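Scikit-Learn handles the fitting in the project itself, but with a single feature the best-fit m and b have a closed form (slope = covariance of x and y divided by variance of x). A sketch using the example table above:

```python
# Example dataset from above: hours studied → exam score
x = [1, 2, 3, 5, 8]
y = [40, 50, 60, 75, 90]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope: covariance(x, y) / variance(x)
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - m * mean_x  # the fitted line passes through the mean point

print(f"y = {m:.2f}x + {b:.2f}")  # roughly y = 7.08x + 36.10
print(m * 4 + b)                  # predicted score for 4 hours of study
```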

Classification

Predicts a category/label — the output is one of a fixed set of classes. Instead of a number, the model decides which group something belongs to.

Examples

Binary Classification (2 classes)

Email → Spam or Not Spam

Multi-Class Classification

Review → Positive Neutral Negative

Text Preprocessing (for Spam Classifier)

1. Clean — remove punctuation, lowercase everything, remove stop words
2. Vectorize — convert words to numbers: count how often each word appears
3. Feed to Model — the model works with number arrays, not raw text
"Win a free prize" → [0, 1, 0, 1, 1, 0, 1, ...]
                       ↑ each number = presence/frequency of a word
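As an illustration of that vectorize step, here is a toy count-vectorizer in plain Python; real projects use scikit-learn's CountVectorizer or TfidfVectorizer, and the tiny corpus below is invented:

```python
# Toy count-vectorizer — illustrative only
corpus = ["win a free prize", "meeting at noon", "free prize inside"]

# Vocabulary: every unique word, in a fixed sorted order
vocab = sorted({word for doc in corpus for word in doc.split()})

def vectorize(text):
    """Count how often each vocabulary word appears in the text."""
    words = text.split()
    return [words.count(word) for word in vocab]

print(vocab)
print(vectorize("win a free prize"))  # 1s where a vocab word is present
```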

Evaluation Metrics

Accuracy
Overall

How many total predictions were correct?

Precision
Quality

Of all predicted spam, how many were actually spam?

Recall
Coverage

Of all actual spam, how many did we catch?

Accuracy  = Correct Predictions / Total Predictions
Precision = True Positives / (True Positives + False Positives)
Recall    = True Positives / (True Positives + False Negatives)

Confusion Matrix

 | Predicted: Spam | Predicted: Not Spam
Actual: Spam | True Positive (TP) = 90 | False Negative (FN) = 10
Actual: Not Spam | False Positive (FP) = 5 | True Negative (TN) = 895
TP = correctly predicted spam  |  TN = correctly predicted not-spam  |  FP = predicted spam but wasn't (false alarm)  |  FN = missed actual spam
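Plugging those four counts into the metric formulas above:

```python
# Counts from the confusion matrix above
TP, FN, FP, TN = 90, 10, 5, 895
total = TP + FN + FP + TN  # 1000 messages

accuracy  = (TP + TN) / total  # correct predictions overall
precision = TP / (TP + FP)     # of predicted spam, how much really was spam
recall    = TP / (TP + FN)     # of actual spam, how much we caught

print(f"Accuracy:  {accuracy:.3f}")   # → 0.985
print(f"Precision: {precision:.3f}")  # → 0.947
print(f"Recall:    {recall:.3f}")     # → 0.900
```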

Student Score Predictor

📊

Student Score Predictor

Trains a Linear Regression model on the Kaggle Students Performance dataset to predict exam scores from parental education, lunch type, test prep, and gender.

scikit-learn Python Regression pandas

Dataset — StudentsPerformance.csv

Feature (Input) | Description
gender | male / female
race/ethnicity | group A–E
parental level of education | high school → master's degree
lunch | standard / free-reduced
test preparation course | completed / none
math score (Target) | Score to predict (0–100)

Pipeline

1. Load CSV with pandas — pd.read_csv("StudentsPerformance.csv")
2. Encode categorical columns — pd.get_dummies() converts text labels to 0/1 numbers
3. Train / Test Split (80/20) — train_test_split(X, y, test_size=0.2)
4. Fit Linear Regression — model.fit(X_train, y_train)
5. Evaluate on test set — mean_squared_error(y_test, y_pred)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

df = pd.read_csv("StudentsPerformance.csv")
df = pd.get_dummies(df)  # encode categoricals

X = df.drop("math score", axis=1)
y = df["math score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")
Phase 2 milestone: First real ML project — took raw CSV data all the way through cleaning, encoding, splitting, training, and evaluation.

Spam Email Classifier

🛡️

Spam Email Classifier

Trains a Multinomial Naive Bayes model on 5,572 real SMS messages to classify them as spam or ham. Uses TF-IDF vectorization to convert text into numbers the model can learn from.

scikit-learn Python Classification NLP TF-IDF

Dataset — spam.csv

Column | Description
v1 (label) | ham (not spam) or spam
v2 (message) | Raw SMS text content
5,572 messages — 4,825 ham · 747 spam

Pipeline

1. Text Preprocessing — lowercase, strip punctuation, normalise whitespace with regex
2. Label Encoding — ham → 0, spam → 1
3. Train / Test Split (80/20) — stratify=y keeps the ham/spam ratio equal in both splits
4. TF-IDF Vectorization — TfidfVectorizer(max_features=5000), fit on train only, transform both
5. Train Naive Bayes — MultinomialNB().fit(X_train_vec, y_train)
6. Evaluate + Test Unseen Messages — accuracy, precision, recall, confusion matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)  # learn vocab from train only
X_test_vec = vectorizer.transform(X_test)        # apply same vocab to test

model = MultinomialNB()
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)    # 96.05%
precision = precision_score(y_test, y_pred)  # 100.00% — zero false alarms
recall = recall_score(y_test, y_pred)        # 70.47% — catches most spam
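To make TF-IDF less of a black box, here is a toy version using the textbook formula tf · log(N / df); scikit-learn's TfidfVectorizer applies extra smoothing and L2 normalisation, so its exact numbers differ, and the three-message corpus below is invented:

```python
import math

docs = [
    "win a free prize",
    "free prize inside",
    "meeting at noon",
]
N = len(docs)
tokenized = [doc.split() for doc in docs]

def tfidf(term, doc_words):
    tf = doc_words.count(term)                      # raw count in this document
    df = sum(term in words for words in tokenized)  # documents containing the term
    return tf * math.log(N / df)                    # rare terms get higher weight

# "free" appears in 2 of 3 docs → low weight; "win" in 1 of 3 → higher weight
print(round(tfidf("free", tokenized[0]), 3))  # → 0.405
print(round(tfidf("win", tokenized[0]), 3))   # → 1.099
```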

Performance Metrics

Accuracy
96.05%

Fraction of all messages labelled correctly

Precision
100.00%

Zero legitimate messages wrongly flagged as spam

Recall
70.47%

~70% of spam caught — model errs on side of caution

Key insight: Precision of 100% means no false positives — not a single real message was lost to the spam folder. The trade-off is a recall of 70%: some spam slips through, which is the safer failure mode.

Customer Segmentation

🧩

Customer Segmentation

Unsupervised K-Means clustering that automatically groups 200 mall customers into 5 segments based on annual income and spending score — no labels needed.

scikit-learn Python Clustering Matplotlib Seaborn

Dataset — Mall_Customers.csv

Feature | Description
CustomerID | Unique identifier (dropped before training)
Gender | Male / Female → encoded 0 / 1
Age | Customer age in years
Annual Income (k$) | Annual income in thousands of dollars
Spending Score (1-100) (cluster feature) | Mall-assigned score based on spending behaviour

Pipeline

1. Load & Clean Data — encode Gender, drop CustomerID, check for nulls
2. Feature Scaling — StandardScaler (mean=0, std=1) so both features contribute equally to distance
3. Elbow Method — run K-Means for K=1–10, plot inertia; elbow at K=5
4. K-Means Clustering (K=5) — KMeans(n_clusters=5).fit_predict(X_scaled)
5. Visualize & Summarize — scatter plot, Seaborn pairplot, per-cluster averages
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow Method — find best K
inertia = [KMeans(n_clusters=k, random_state=42).fit(X_scaled).inertia_
           for k in range(1, 11)]

# Train final model
model = KMeans(n_clusters=5, random_state=42)
df["Cluster"] = model.fit_predict(X_scaled)

# Visualize
sns.scatterplot(data=df, x="Annual Income (k$)",
                y="Spending Score (1-100)", hue="Cluster")
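What fit_predict does internally can be sketched with a toy one-dimensional K-Means (K=2); the points and starting centroids below are made up, and real K-Means also handles multiple features and smarter initialisation:

```python
# Toy 1-D K-Means (K=2) — the assign/update loop behind fit_predict
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [1.0, 12.0]  # deliberately simple starting centroids

for _ in range(10):  # a few assign/update rounds is plenty here
    # Assign each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # → [2.0, 11.0] — the two natural groups in the data
```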

Clusters Discovered

Cluster | Count | Avg Income | Avg Spend | Segment
0 | 81 | $55k | 49.5 | Average / Mixed
1 | 39 | $87k | 82.1 | High Income, High Spenders ⭐
2 | 22 | $26k | 79.4 | Low Income, High Spenders
3 | 35 | $88k | 17.1 | High Income, Low Spenders
4 | 23 | $26k | 20.9 | Low Income, Low Spenders
Key insight: No labels were provided — the model discovered all 5 segments entirely on its own. Cluster 1 (high income, high spend) is the most valuable customer group to target. Cluster 3 (high income, low spend) represents an untapped opportunity.

Build Timeline

Every commit in the order it was built — from first FastAPI setup to the first trained ML model.

2026-04-13

Customer Segmentation

Unsupervised K-Means clustering on Mall_Customers.csv — Elbow Method, feature scaling, 5 cluster segments, Matplotlib + Seaborn visualizations.

2026-04-13

Spam Email Classifier

Multinomial Naive Bayes on 5,572 SMS messages — TF-IDF vectorization, text preprocessing, 96% accuracy, 100% precision.

2026-04-13

Student Score Predictor — Updated

Added results.txt output, LabelEncoder for categoricals, predicted vs actual table, RMSE evaluation report.

c531da4

ML Score Predictor

Trained Linear Regression model on StudentsPerformance.csv — Phase 2 first project.

6c8fe28

Code Cleanup

Refactored and tidied existing API and app files.

caa5774

Expense Tracker API + DB

Full CRUD, category filter, summary & totals with PostgreSQL — Phase 1 capstone.

9731706

Student API Refactor

Extracted reusable DB connection helper + complete CRUD with PostgreSQL.

7f49b2e

FastAPI + PostgreSQL Setup

First FastAPI student API with env-based configuration — first real API with a database.