Lecture 2 - ML Refresher / Softmax Regression

Supervised learning

  • Covers classification tasks (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
  • Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
  • Requires a large amount of accurately labeled data for effective training.

Unsupervised learning

  • Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA).
  • Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
  • Can work with unlabeled data, which is often more readily available and cost-effective to collect.

Multi-class classification setting
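
A minimal sketch of the standard setup (the symbols $n$, $m$, $k$ are the usual convention for input dimension, number of examples, and number of classes, assumed here rather than stated above):

$$x^{(i)} \in \mathbb{R}^{n}, \quad y^{(i)} \in \{1, \dots, k\}, \quad i = 1, \dots, m, \qquad h : \mathbb{R}^{n} \to \mathbb{R}^{k}$$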

Matrix batch notation

  • $X \in \mathbb{R}^{m \times n}$ stacks the $m$ training examples as rows; the labels form a vector $y \in \{1, \dots, k\}^{m}$
  • One matrix operation over the whole batch is more efficient than many per-example vector operations (see the sketch below)
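
As an illustration, a small NumPy sketch (the names `X`, `theta` and the sizes are made up for the example) comparing a per-example loop against a single batched matrix product:

```python
import numpy as np

# Hypothetical sizes: m examples, n features, k classes.
m, n, k = 128, 20, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))        # batch of examples, one per row
theta = rng.normal(size=(n, k))    # linear hypothesis parameters

# Per-example (vector) version: one matrix-vector product per example.
logits_loop = np.stack([X[i] @ theta for i in range(m)])

# Batch (matrix) version: a single matrix-matrix product.
logits_batch = X @ theta

assert np.allclose(logits_loop, logits_batch)
```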

Loss function: classification error

  • $\ell_{\text{err}}(h(x), y) = \begin{cases} 0 & \text{if } \operatorname{argmax}_i \, h_i(x) = y \\ 1 & \text{otherwise} \end{cases}$
  • Bad as an optimization objective: it is not differentiable (piecewise constant, with zero gradient almost everywhere)
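
For reference, a minimal NumPy sketch of this error over a batch (the names `logits` and `y` are hypothetical: an $m \times k$ array of hypothesis outputs and a length-$m$ label vector):

```python
import numpy as np

def error_rate(logits, y):
    """Average 0/1 classification error: fraction of rows whose argmax
    does not equal the true label."""
    return np.mean(np.argmax(logits, axis=1) != y)
```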

Loss function: softmax/cross-entropy loss

  • $\text{softmax}(h(x))_i = \dfrac{\exp(h_i(x))}{\sum_{j=1}^{k} \exp(h_j(x))}$, i.e., normalize the exponentiated outputs into a probability distribution
  • $\ell_{\text{ce}}(h(x), y) = -h_y(x) + \log \sum_{j=1}^{k} \exp(h_j(x))$; applied to a linear hypothesis class $h_\theta(x) = \theta^{T} x$, this gives softmax (multinomial logistic) regression
  • Find $\theta$ that minimizes the average loss over the training set: $\min_\theta \frac{1}{m} \sum_{i=1}^{m} \ell_{\text{ce}}(h_\theta(x^{(i)}), y^{(i)})$
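
A small NumPy sketch of this loss, written in the numerically stable log-sum-exp form (function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def softmax_cross_entropy(logits, y):
    """Average cross-entropy loss -h_y(x) + log sum_j exp(h_j(x)),
    computed stably by subtracting the per-row maximum first."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=1))
    return np.mean(log_sum_exp - shifted[np.arange(len(y)), y])
```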

SGD

$\alpha$ is the step size (learning rate). Sample a minibatch of size $B$ and take a step of size $\alpha/B$ against the summed minibatch gradient: $\theta \leftarrow \theta - \frac{\alpha}{B} \sum_{i \in \text{minibatch}} \nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$. A runnable sketch follows below.
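
A minimal sketch of this update loop for softmax regression (it uses the batch gradient $X^{T}(Z - I_y)$ discussed in the next subsection; all names and hyperparameter values are illustrative):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sgd_softmax_regression(X, y, k, alpha=0.1, batch_size=32, epochs=10):
    """Minibatch SGD for softmax regression: for each minibatch of size B,
    update theta by alpha/B times the summed minibatch gradient."""
    m, n = X.shape
    theta = np.zeros((n, k))
    for _ in range(epochs):
        for start in range(0, m, batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            Z = softmax(Xb @ theta)          # B x k predicted probabilities
            Iy = np.eye(k)[yb]               # one-hot labels
            grad = Xb.T @ (Z - Iy)           # gradient summed over the minibatch
            theta -= (alpha / len(yb)) * grad
    return theta
```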

gradient of cross entropy

Hack to compute the loss's derivative: pretend everything is a scalar, differentiate as if it were, and then rearrange/transpose the matrices and vectors until the sizes work out. Always check the result against a numerical gradient. This approach also works for the matrix batch form of the loss, where it gives $\nabla_\Theta \ell_{\text{ce}}(X\Theta, y) = \frac{1}{m} X^{T} (Z - I_y)$, with $Z = \text{softmax}(X\Theta)$ applied row-wise and $I_y$ the one-hot encoding of the labels.
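
A sketch of such a numerical check for softmax regression, comparing the analytic batch gradient against central differences (sizes and names are made up for the example):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def loss(theta, X, y):
    """Average softmax cross-entropy loss for a linear hypothesis X @ theta."""
    logits = X @ theta
    shifted = logits - logits.max(axis=1, keepdims=True)
    return np.mean(np.log(np.exp(shifted).sum(axis=1))
                   - shifted[np.arange(len(y)), y])

def analytic_grad(theta, X, y, k):
    """(1/m) X^T (Z - I_y), the matrix-batch gradient of the loss above."""
    Z = softmax(X @ theta)
    Iy = np.eye(k)[y]
    return X.T @ (Z - Iy) / len(y)

# Numerical check with central differences on a tiny random problem.
rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
X, y = rng.normal(size=(m, n)), rng.integers(0, k, size=m)
theta = rng.normal(size=(n, k))
num = np.zeros_like(theta)
eps = 1e-6
for i in range(n):
    for j in range(k):
        e = np.zeros_like(theta)
        e[i, j] = eps
        num[i, j] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * eps)
assert np.allclose(num, analytic_grad(theta, X, y, k), atol=1e-6)
```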

Lecture 3 - “Manual” Neural Networks

Need to create a feature function $\phi(x)$ so that the classifier is more powerful than a purely linear one (the hypothesis becomes linear in the features, $h_\theta(x) = \theta^{T} \phi(x)$).

  • Traditional: manually extract features
  • Neural network: learn features from data, automated

$\theta = \{W_1, W_2\}$
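
Assuming the usual two-layer formulation with an elementwise nonlinearity $\sigma$ (e.g., ReLU) and hidden dimension $d$, the hypothesis described by this parameter set is:

$$h_\theta(x) = W_2^{T} \, \sigma(W_1^{T} x), \qquad W_1 \in \mathbb{R}^{n \times d}, \quad W_2 \in \mathbb{R}^{d \times k}$$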

A two-layer network can approximate any smooth function arbitrarily well (universal approximation), given enough hidden units.

Deeper networks can represent some functions, e.g., the parity function, far more efficiently than a single hidden layer can. Empirically, deeper networks also work better when the total number of parameters is held fixed.

Backpropagation

Two-layer network: gradient w.r.t. $W_2$

Gradient w.r.t. $W_1$ (both formulas sketched below)
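
One standard way to write these gradients in batch form (a sketch, assuming the hypothesis $\sigma(X W_1) W_2$, mean cross-entropy loss, row-wise softmax outputs $S$, one-hot labels $I_y$, and elementwise product $\circ$):

$$\nabla_{W_2} \ell = \frac{1}{m}\, \sigma(X W_1)^{T} (S - I_y), \qquad \nabla_{W_1} \ell = \frac{1}{m}\, X^{T} \big[ (S - I_y) W_2^{T} \circ \sigma'(X W_1) \big]$$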

General form

Forward pass: compute and cache the intermediate activations $Z_i$

Backward pass: compute the gradients $G_i = \nabla_{Z_i} \ell$ (general form sketched below)
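
A sketch of the general iteration for an $L$-layer fully connected network, assuming layers $Z_{i+1} = \sigma_i(Z_i W_i)$ with $Z_1 = X$, cross-entropy loss on $Z_{L+1}$, and hence $G_{L+1} = S - I_y$:

$$G_i = \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big) W_i^{T}, \qquad \nabla_{W_i} \ell = \frac{1}{m}\, Z_i^{T} \big(G_{i+1} \circ \sigma_i'(Z_i W_i)\big)$$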

Vector-Jacobian product: each backward step multiplies the incoming gradient by one layer's Jacobian without ever forming that Jacobian explicitly (see the sketch below).
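
A NumPy sketch that puts the pieces together: a manual forward and backward pass for a two-layer ReLU network, where each backward line is a vector-Jacobian product (all function and variable names are illustrative):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0)

def two_layer_gradients(X, y, W1, W2, k):
    """Manual forward/backward pass for h(x) = relu(x W1) W2 with mean
    cross-entropy loss. Each backward line multiplies the incoming gradient
    into one layer's Jacobian without materializing that Jacobian."""
    m = X.shape[0]
    # Forward pass: compute and cache the intermediate activations.
    A1 = X @ W1                                # hidden pre-activation
    Z1 = relu(A1)                              # hidden activations
    logits = Z1 @ W2
    S = np.exp(logits - logits.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)          # softmax probabilities
    # Backward pass: propagate gradients from the loss back to the weights.
    G2 = S - np.eye(k)[y]                      # gradient of the summed loss w.r.t. logits
    grad_W2 = Z1.T @ G2 / m                    # VJP through "Z1 @ W2" w.r.t. W2
    G1 = (G2 @ W2.T) * (A1 > 0)                # VJP through W2, then through relu
    grad_W1 = X.T @ G1 / m                     # VJP through "X @ W1" w.r.t. W1
    return grad_W1, grad_W2
```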