Master Optimization for Data Science

Learn optimization techniques from fundamental mathematical concepts to advanced machine learning applications. Designed by IIT faculty for students at all levels.

What You'll Learn

Mathematical Foundations

Derivatives, gradients, Hessians, and convexity concepts explained with visual intuition

Classical Algorithms

Linear programming, gradient descent, Newton's method, and constraint handling

ML Applications

Optimization in neural networks, hyperparameter tuning, and model selection

Practical Implementation

Python code examples, real datasets, and hands-on programming exercises

Mathematical Fundamentals

Understanding Derivatives and Gradients

In optimization, derivatives tell us the rate of change of a function. For data science, this helps us understand how small changes in our model parameters affect the loss function.

Single Variable Case

For a function $f(x) = x^2 + 2x + 1$, the derivative is:

$$f'(x) = 2x + 2$$

This tells us the slope at any point $x$.

Interactive Visualization

Adjust $x$ to see the slope $f'(x)$ at that point; for example, at $x = 0$ the slope is $f'(0) = 2$.

Python Implementation


import numpy as np
import matplotlib.pyplot as plt

# Define function and its derivative
def f(x):
    return x**2 + 2*x + 1

def f_prime(x):
    return 2*x + 2

# Calculate gradient at a point
x = 1.0
gradient = f_prime(x)
print(f"Gradient at x={x}: {gradient}")

# Numerical gradient (for verification)
h = 1e-5  # small, but not so small that floating-point round-off dominates
numerical_gradient = (f(x + h) - f(x - h)) / (2 * h)
print(f"Numerical gradient: {numerical_gradient}")

Convexity in Optimization

Convex functions are the "nice" functions in optimization: every local minimum is also a global minimum, so algorithms such as gradient descent behave reliably and predictably on them.

Mathematical Definition

A function $f$ is convex if for any two points $x_1, x_2$ and any $\lambda \in [0,1]$:

$$f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)$$
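
A quick numerical check of this definition (a minimal sketch; the function $f(x) = x^2$ and the random test points are chosen here purely for illustration):

import numpy as np

def f(x):
    return x**2  # a known convex function

rng = np.random.default_rng(0)
for _ in range(5):
    x1, x2 = rng.uniform(-10, 10, size=2)
    lam = rng.uniform(0, 1)
    lhs = f(lam * x1 + (1 - lam) * x2)       # f(lambda*x1 + (1-lambda)*x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)    # lambda*f(x1) + (1-lambda)*f(x2)
    print(f"{lhs:.3f} <= {rhs:.3f}: {lhs <= rhs + 1e-12}")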

Visual Understanding

Geometrically, the chord joining any two points on the graph of a convex function lies on or above the graph; this is exactly what the inequality above states.

Handling Constraints

In real-world problems, we often have constraints on our variables. Understanding how to handle these is crucial for practical optimization.

Equality Constraints

$g(x) = 0$ - The solution must satisfy this exactly

Example: $x_1 + x_2 = 1$ (budget constraint)

Inequality Constraints

$h(x) \leq 0$ - The solution must be in the feasible region

Example: $x_1 \geq 0$, $x_2 \geq 0$ (non-negativity; in the standard form above, $-x_1 \leq 0$ and $-x_2 \leq 0$)

Lagrange Multipliers

For constrained optimization, we use the Lagrangian:

$$L(x, \lambda) = f(x) + \lambda g(x)$$

The optimal solution satisfies: $\nabla_x L = 0$ and $g(x) = 0$
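
As a small worked example (a sketch with an objective chosen only for illustration), the stationarity condition $\nabla_x L = 0$ together with $g(x) = 0$ gives a linear system when the objective is quadratic and the constraint is linear, which we can solve directly:

import numpy as np

# Illustrative problem (not from the text above):
#   minimize f(x) = (x1 - 2)^2 + (x2 - 1)^2  subject to  g(x) = x1 + x2 - 1 = 0
# Stationarity of L(x, lam) = f(x) + lam * g(x) gives:
#   2(x1 - 2) + lam = 0
#   2(x2 - 1) + lam = 0
#   x1 + x2 - 1     = 0
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([4.0, 2.0, 1.0])
x1, x2, lam = np.linalg.solve(A, b)
print(f"x1 = {x1}, x2 = {x2}, lambda = {lam}")  # expect x1 = 1.0, x2 = 0.0, lambda = 2.0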

Optimization Algorithms

Gradient Descent

The workhorse of machine learning optimization

  1. Start with initial guess $x_0$
  2. Compute gradient $\nabla f(x_k)$
  3. Update: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
  4. Repeat until convergence
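
A minimal sketch of these four steps in Python, reusing the example function $f(x) = x^2 + 2x + 1$ from earlier (the starting point, step size, and tolerance are arbitrary choices for illustration):

def f_prime(x):
    return 2*x + 2          # gradient of f(x) = x^2 + 2x + 1

x = 5.0                     # step 1: initial guess x0
alpha = 0.1                 # learning rate
for k in range(100):
    grad = f_prime(x)       # step 2: compute the gradient
    if abs(grad) < 1e-8:    # step 4: stop when the gradient is (nearly) zero
        break
    x = x - alpha * grad    # step 3: x_{k+1} = x_k - alpha * gradient
print(f"Minimum found at x = {x:.6f}")  # the true minimum is at x = -1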

Newton's Method

Faster convergence near the optimum by using second-order (curvature) information

  1. Start with initial guess $x_0$
  2. Compute gradient $\nabla f(x_k)$ and Hessian $H_k$
  3. Update: $x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)$
  4. Repeat until convergence
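
A minimal one-dimensional sketch, where the Hessian reduces to the scalar second derivative $f''(x)$ (the example function $f(x) = e^x + x^2$ is chosen here for illustration because it is convex but not quadratic):

import numpy as np

def f_prime(x):
    return np.exp(x) + 2*x       # f'(x) for f(x) = exp(x) + x^2

def f_double_prime(x):
    return np.exp(x) + 2         # f''(x), the 1-D "Hessian"

x = 2.0                          # initial guess x0
for k in range(20):
    grad = f_prime(x)
    if abs(grad) < 1e-10:
        break
    x = x - grad / f_double_prime(x)   # x_{k+1} = x_k - H^{-1} * gradient
print(f"Minimum found at x = {x:.6f} after {k} iterations")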

Adam Optimizer

Adaptive learning rates for modern deep learning

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

$$x_{t+1} = x_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} m_t$$

(The bias-correction terms $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$ used in the full algorithm are omitted here for simplicity.)
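
A minimal sketch of this update rule on the earlier example function (the hyperparameter values below are the commonly used defaults; as above, bias correction is omitted):

import numpy as np

def f_prime(x):
    return 2*x + 2                      # gradient of f(x) = x^2 + 2x + 1

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # common default values
x, m, v = 5.0, 0.0, 0.0
for _ in range(2000):
    g = f_prime(x)
    m = beta1 * m + (1 - beta1) * g     # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2  # second-moment estimate
    x = x - alpha * m / (np.sqrt(v) + eps)
print(f"x after Adam updates: {x:.4f}") # approaches the minimum at x = -1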

Simulated Annealing

A stochastic search heuristic for escaping local minima in non-convex problems

  1. Start with high "temperature" $T$
  2. Generate random neighbor solution
  3. Accept the move if it is better; otherwise accept it with probability $e^{-\Delta E/T}$
  4. Gradually reduce temperature
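
A minimal sketch on a one-dimensional non-convex test function (the function, neighborhood size, and cooling schedule are arbitrary choices for illustration):

import numpy as np

# Test function with several local minima, chosen for illustration
def f(x):
    return x**2 + 10 * np.sin(3 * x)

rng = np.random.default_rng(0)
x = 4.0                         # initial solution
T = 10.0                        # step 1: initial "temperature"
best_x = x
for step in range(5000):
    candidate = x + rng.normal(scale=0.5)        # step 2: random neighbor
    delta = f(candidate) - f(x)
    if delta < 0 or rng.random() < np.exp(-delta / T):
        x = candidate                            # step 3: accept the move
    if f(x) < f(best_x):
        best_x = x
    T *= 0.999                                   # step 4: cooling schedule
print(f"Best x found: {best_x:.3f}, f = {f(best_x):.3f}")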

Real-World Applications

Linear Regression: A Simple Case Study

Let's see how optimization works in the simplest machine learning model.

Problem Setup

We want to find the best line $y = wx + b$ to fit our data points.

Loss function (Mean Squared Error): $L(w,b) = \frac{1}{n}\sum_{i=1}^n (y_i - wx_i - b)^2$
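
A minimal sketch of fitting $w$ and $b$ by gradient descent on a small synthetic dataset (the data, learning rate, and iteration count are chosen here purely for illustration):

import numpy as np

# Synthetic data for illustration: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

w, b = 0.0, 0.0
alpha = 0.01
n = len(x)
for _ in range(5000):
    residual = y - w * x - b
    grad_w = -2.0 / n * np.sum(x * residual)   # dL/dw
    grad_b = -2.0 / n * np.sum(residual)       # dL/db
    w -= alpha * grad_w
    b -= alpha * grad_b

loss = np.mean((y - w * x - b) ** 2)
print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.4f}")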


Neural Network Training

Understanding how backpropagation uses optimization to train neural networks.

Backpropagation Algorithm

  1. Forward Pass: Compute predictions and loss
  2. Backward Pass: Compute gradients using chain rule
  3. Update: Adjust weights using gradient descent
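
A minimal from-scratch sketch of these three steps for a tiny network with one hidden layer, sigmoid activations, and squared-error loss (the architecture and the XOR-style toy data are chosen here only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output weights
alpha = 0.5

for epoch in range(5000):
    # 1. Forward pass: compute predictions and loss
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    loss = np.mean((y_hat - y) ** 2)

    # 2. Backward pass: gradients via the chain rule
    d_yhat = 2 * (y_hat - y) / len(X)            # dL/dy_hat
    d_z2 = d_yhat * y_hat * (1 - y_hat)          # through the output sigmoid
    dW2, db2 = h.T @ d_z2, d_z2.sum(axis=0)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1 - h)                     # through the hidden sigmoid
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)

    # 3. Update: a gradient descent step on every parameter
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

print(f"Final training loss: {loss:.4f}")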

Hyperparameter Optimization

Finding the best hyperparameters is itself an optimization problem!

Grid Search

Exhaustively search over a grid of hyperparameter values

✓ Guaranteed to find best in grid
✗ Computationally expensive

Random Search

Randomly sample from hyperparameter distributions

✓ More efficient than grid search
✗ No guarantee of finding optimum

Bayesian Optimization

Use probabilistic model to guide search

✓ Sample efficient
✗ Complex to implement
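
A minimal random-search sketch. The function validation_score below is a hypothetical stand-in for training a model with the sampled hyperparameters and returning its validation score; in practice you would replace it with a real training and evaluation routine:

import numpy as np

def validation_score(learning_rate, l2_strength):
    # Toy surrogate with a peak near lr = 1e-2, l2 = 1e-3 (illustration only)
    return -((np.log10(learning_rate) + 2) ** 2) - (np.log10(l2_strength) + 3) ** 2

rng = np.random.default_rng(0)
best_score, best_params = -np.inf, None
for trial in range(50):
    lr = 10 ** rng.uniform(-4, 0)       # sample the learning rate log-uniformly
    l2 = 10 ** rng.uniform(-5, -1)      # sample the L2 strength log-uniformly
    score = validation_score(lr, l2)
    if score > best_score:
        best_score, best_params = score, (lr, l2)

print(f"Best hyperparameters: lr = {best_params[0]:.4g}, l2 = {best_params[1]:.4g}")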

Portfolio Optimization

Markowitz portfolio theory: minimizing risk for a given target level of return.

Mathematical Formulation

Minimize portfolio variance: $\min \frac{1}{2} w^T \Sigma w$

Subject to: $\sum w_i = 1$ (budget constraint) and $\mu^T w \geq r_{\text{target}}$ (return constraint)
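
A minimal sketch using scipy's SLSQP solver with made-up expected returns and covariance for three assets (the numbers, and the added long-only bounds $0 \leq w_i \leq 1$, are illustrative assumptions, not part of the formulation above):

import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.12, 0.15])                 # assumed expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])            # assumed covariance matrix
r_target = 0.10

def objective(w):
    return 0.5 * w @ Sigma @ w                    # (1/2) w^T Sigma w

constraints = [
    {"type": "eq",   "fun": lambda w: np.sum(w) - 1},      # budget constraint
    {"type": "ineq", "fun": lambda w: mu @ w - r_target},   # return >= target
]
bounds = [(0, 1)] * 3                             # long-only (extra assumption)
w0 = np.ones(3) / 3
result = minimize(objective, w0, method="SLSQP", bounds=bounds, constraints=constraints)
print("Optimal weights:", np.round(result.x, 3))
print("Expected return:", round(float(mu @ result.x), 4))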


Practice Problems

Problem 1: Basic Optimization

Find the minimum of $f(x) = x^4 - 4x^3 + 6x^2 - 4x + 1$

Problem 2: Constrained Optimization

Minimize $f(x,y) = x^2 + y^2$ subject to $x + y = 1$

Problem 3: Machine Learning

Implement gradient descent for logistic regression

Quick Quiz

Which algorithm typically converges faster near a well-behaved minimum: gradient descent or Newton's method?