Lecture 2 - ML Refresher / Softmax Regression

Supervised learning

  • Covers classification tasks (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
  • Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
  • Requires a large amount of accurately labeled data for effective training.

Unsupervised learning

  • Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA).
  • Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
  • Can work with unlabeled data, which is often more readily available and cost-effective to collect.

Multi-class classification setting
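
A minimal sketch of the standard setup (the symbols $n$, $m$, $k$ are the usual convention for input dimension, number of examples, and number of classes, assumed here rather than stated above):

$$x^{(i)} \in \mathbb{R}^{n}, \quad y^{(i)} \in \{1, \dots, k\}, \quad i = 1, \dots, m, \qquad h : \mathbb{R}^{n} \to \mathbb{R}^{k}$$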

Matrix batch notation

  • $X \in \mathbb{R}^{m \times n}$ stacks the $m$ training examples as rows; the labels form a vector $y \in \{1, \dots, k\}^{m}$
  • One matrix operation over the whole batch is more efficient than many per-example vector operations (see the sketch below)
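
As an illustration, a small NumPy sketch (the names `X`, `theta` and the sizes are made up for the example) comparing a per-example loop against a single batched matrix product:

```python
import numpy as np

# Hypothetical sizes: m examples, n features, k classes.
m, n, k = 128, 20, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))        # batch of examples, one per row
theta = rng.normal(size=(n, k))    # linear hypothesis parameters

# Per-example (vector) version: one matrix-vector product per example.
logits_loop = np.stack([X[i] @ theta for i in range(m)])

# Batch (matrix) version: a single matrix-matrix product.
logits_batch = X @ theta

assert np.allclose(logits_loop, logits_batch)
```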

Loss function: classification error

  • $\ell_{\text{err}}(h(x), y) = \begin{cases} 0 & \text{if } \operatorname{argmax}_i \, h_i(x) = y \\ 1 & \text{otherwise} \end{cases}$
  • Bad as an optimization objective: it is not differentiable (piecewise constant, with zero gradient almost everywhere)
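
For reference, a minimal NumPy sketch of this error over a batch (the names `logits` and `y` are hypothetical: an $m \times k$ array of hypothesis outputs and a length-$m$ label vector):

```python
import numpy as np

def error_rate(logits, y):
    """Average 0/1 classification error: fraction of rows whose argmax
    does not equal the true label."""
    return np.mean(np.argmax(logits, axis=1) != y)
```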

Loss function: softmax/cross-entropy loss

  • $\text{softmax}(h(x))_i = \dfrac{\exp(h_i(x))}{\sum_{j=1}^{k} \exp(h_j(x))}$, i.e., normalize the exponentiated outputs into a probability distribution
  • $\ell_{\text{ce}}(h(x), y) = -h_y(x) + \log \sum_{j=1}^{k} \exp(h_j(x))$; applied to a linear hypothesis class $h_\theta(x) = \theta^{T} x$, this gives softmax (multinomial logistic) regression
  • Find $\theta$ that minimizes the average loss over the training set: $\min_\theta \frac{1}{m} \sum_{i=1}^{m} \ell_{\text{ce}}(h_\theta(x^{(i)}), y^{(i)})$
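
A small NumPy sketch of this loss, written in the numerically stable log-sum-exp form (function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def softmax_cross_entropy(logits, y):
    """Average cross-entropy loss -h_y(x) + log sum_j exp(h_j(x)),
    computed stably by subtracting the per-row maximum first."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=1))
    return np.mean(log_sum_exp - shifted[np.arange(len(y)), y])
```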

SGD

$\alpha$ is the step size (learning rate). Sample a minibatch of size $B$ and take a step of size $\alpha/B$ against the summed minibatch gradient: $\theta \leftarrow \theta - \frac{\alpha}{B} \sum_{i \in \text{minibatch}} \nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$. A runnable sketch follows below.
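
A minimal sketch of this update loop for softmax regression (it uses the batch gradient $X^{T}(Z - I_y)$ discussed in the next subsection; all names and hyperparameter values are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sgd_softmax_regression(X, y, k, alpha=0.1, batch_size=32, epochs=10):
    """Minibatch SGD for softmax regression: for each minibatch of size B,
    update theta by alpha/B times the summed minibatch gradient."""
    m, n = X.shape
    theta = np.zeros((n, k))
    for _ in range(epochs):
        for start in range(0, m, batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            Z = softmax(Xb @ theta)          # B x k predicted probabilities
            Iy = np.eye(k)[yb]               # one-hot labels
            grad = Xb.T @ (Z - Iy)           # gradient summed over the minibatch
            theta -= (alpha / len(yb)) * grad
    return theta
```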

gradient of cross entropy

Hack to compute the loss's derivative: pretend everything is a scalar, differentiate as if it were, and then rearrange/transpose the matrices and vectors until the sizes work out. Always check the result against a numerical gradient. This approach also works for the matrix batch form of the loss, where it gives $\nabla_\Theta \ell_{\text{ce}}(X\Theta, y) = \frac{1}{m} X^{T} (Z - I_y)$, with $Z = \text{softmax}(X\Theta)$ applied row-wise and $I_y$ the one-hot encoding of the labels.
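
A sketch of such a numerical check for softmax regression, comparing the analytic batch gradient against central differences (sizes and names are made up for the example):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def loss(theta, X, y):
    """Average softmax cross-entropy loss for a linear hypothesis X @ theta."""
    logits = X @ theta
    shifted = logits - logits.max(axis=1, keepdims=True)
    return np.mean(np.log(np.exp(shifted).sum(axis=1))
                   - shifted[np.arange(len(y)), y])

def analytic_grad(theta, X, y, k):
    """(1/m) X^T (Z - I_y), the matrix-batch gradient of the loss above."""
    Z = softmax(X @ theta)
    Iy = np.eye(k)[y]
    return X.T @ (Z - Iy) / len(y)

# Numerical check with central differences on a tiny random problem.
rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
X, y = rng.normal(size=(m, n)), rng.integers(0, k, size=m)
theta = rng.normal(size=(n, k))
num = np.zeros_like(theta)
eps = 1e-6
for i in range(n):
    for j in range(k):
        e = np.zeros_like(theta)
        e[i, j] = eps
        num[i, j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * eps)
assert np.allclose(num, analytic_grad(theta, X, y, k), atol=1e-6)
```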

Lecture 3 - “Manual” Neural Networks

Need to create a feature function $\phi(x)$ so that the classifier is more powerful than a purely linear one (the hypothesis becomes linear in the features, $h_\theta(x) = \theta^{T} \phi(x)$).

  • Traditional: manually extract features
  • Neural network: learn features from data, automated

$\theta = \{W_1, W_2\}$
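
Assuming the usual two-layer formulation with an elementwise nonlinearity $\sigma$ (e.g., ReLU) and hidden dimension $d$, the hypothesis described by this parameter set is:

$$h_\theta(x) = W_2^{T} \, \sigma(W_1^{T} x), \qquad W_1 \in \mathbb{R}^{n \times d}, \quad W_2 \in \mathbb{R}^{d \times k}$$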

A two-layer network can approximate any smooth function arbitrarily well (universal approximation), given enough hidden units.

Deeper networks can represent some functions, e.g., the parity function, far more efficiently than a single hidden layer can. Empirically, deeper networks also work better when the total number of parameters is held fixed.

Backpropagation

Two-layer network: gradient w.r.t. $W_2$

Gradient w.r.t. $W_1$ (both formulas sketched below)
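
One standard way to write these gradients in batch form (a sketch, assuming the hypothesis $\sigma(X W_1) W_2$, mean cross-entropy loss, row-wise softmax outputs $S$, one-hot labels $I_y$, and elementwise product $\circ$):

$$\nabla_{W_2} \ell = \frac{1}{m}\, \sigma(X W_1)^{T} (S - I_y), \qquad \nabla_{W_1} \ell = \frac{1}{m}\, X^{T} \big[ (S - I_y) W_2^{T} \circ \sigma'(X W_1) \big]$$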

General form

Forward pass: compute and cache the intermediate activations $Z_i$

Backward pass: compute the gradients $G_i = \nabla_{Z_i} \ell$ (general form sketched below)
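
A sketch of the general iteration for an $L$-layer fully connected network, assuming layers $Z_{i+1} = \sigma_i(Z_i W_i)$ with $Z_1 = X$, cross-entropy loss on $Z_{L+1}$, and hence $G_{L+1} = S - I_y$:

$$G_i = \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big) W_i^{T}, \qquad \nabla_{W_i} \ell = \frac{1}{m}\, Z_i^{T} \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big)$$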

Vector-Jacobian product: each backward step multiplies the incoming gradient by one layer's Jacobian without ever forming that Jacobian explicitly (see the sketch below).
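
A NumPy sketch that puts the pieces together: a manual forward and backward pass for a two-layer ReLU network, where each backward line is a vector-Jacobian product (all function and variable names are illustrative):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0)

def two_layer_gradients(X, y, W1, W2, k):
    """Manual forward/backward pass for h(x) = relu(x W1) W2 with mean
    cross-entropy loss. Each backward line multiplies the incoming gradient
    into one layer's Jacobian without materializing that Jacobian."""
    m = X.shape[0]
    # Forward pass: compute and cache the intermediate activations.
    A1 = X @ W1                                # hidden pre-activation
    Z1 = relu(A1)                              # hidden activations
    logits = Z1 @ W2
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)          # softmax probabilities
    # Backward pass: propagate gradients from the loss back to the weights.
    G2 = S - np.eye(k)[y]                      # gradient of the summed loss w.r.t. logits
    grad_W2 = Z1.T @ G2 / m                    # VJP through "Z1 @ W2" w.r.t. W2
    G1 = (G2 @ W2.T) * (A1 > 0)                # VJP through W2, then through relu
    grad_W1 = X.T @ G1 / m                     # VJP through "X @ W1" w.r.t. W1
    return grad_W1, grad_W2
```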