Lecture 2 & 3
Lecture 2 - ML Refresher / Softmax Regression
Supervised learning
- classification (e.g., spam detection, image recognition) and regression tasks (e.g., predicting house prices, stock forecasting).
- Algorithms include decision trees, support vector machines (SVM), neural networks, and linear regression.
- Requires a large amount of accurately labeled data for effective training.
Unsupervised learning
- Useful for clustering (e.g., customer segmentation), association (e.g., market basket analysis), and dimensionality reduction (e.g., principal component analysis, or PCA).
- Algorithms include k-means clustering, hierarchical clustering, and autoencoders.
- Can work with unlabeled data, which is often more readily available and cost-effective to collect.
Multi-class classification setting: inputs $x \in R^n$, labels $y \in \{1, \dots, k\}$, hypothesis $h_\theta : R^n \to R^k$ whose $i$-th output scores class $i$
Matrix batch notation
- $X \in R^{m \times n}$ (each of the $m$ rows is one training example with $n$ features), $y \in \{1, \dots, k\}^m$
- A single matrix operation over the whole batch is more efficient than many per-example vector operations (see the sketch below)
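A minimal numpy sketch of this batch convention, assuming the notation above ($X$ of shape $m \times n$, a linear hypothesis with parameters $\theta$ of shape $n \times k$); the sizes and random data are purely illustrative:

```python
import numpy as np

m, n, k = 5, 4, 3                   # examples, input features, classes (illustrative sizes)
X = np.random.randn(m, n)           # batch of m examples, one per row
theta = np.random.randn(n, k)       # parameters of a linear hypothesis

# One matrix multiply produces the logits for the whole batch ...
H = X @ theta                       # shape (m, k)

# ... and matches a per-example loop, which is much slower in practice.
H_loop = np.stack([theta.T @ X[i] for i in range(m)])
assert np.allclose(H, H_loop)
```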
Loss function: classification error
- $\ell_{\text{err}}(h(x), y) = \begin{cases} 0 & \text{if } \text{argmax}_i \, h_i(x) = y \\ 1 & \text{otherwise} \end{cases}$
- Bad for optimization: it is not differentiable (piecewise constant), so its gradient gives no useful signal (quick example below)
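A quick numpy illustration of the average 0/1 classification error over a small batch; the logits `H` and labels `y` here are made up:

```python
import numpy as np

H = np.array([[2.0, 0.5, -1.0],   # h(x) for 3 examples, 3 classes
              [0.1, 0.3,  0.2],
              [1.0, 2.0,  3.0]])
y = np.array([0, 2, 2])           # true labels

# Average classification error: fraction of examples where argmax_i h_i(x) != y
err = np.mean(np.argmax(H, axis=1) != y)
print(err)  # 1/3: only the second example is misclassified
```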
Loss function: softmax/cross-entropy loss
- $\text{Softmax}(h(x)) = \text{normalize}(\exp(h(x)))$, i.e., $z_i = \dfrac{\exp(h_i(x))}{\sum_{j=1}^k \exp(h_j(x))}$
- $\ell_{\text{ce}}(h(x), y) = -h_y(x) + \log \sum_{j=1}^k \exp(h_j(x))$, applied to a linear hypothesis class $h_\theta(x) = \theta^T x$
- Find $\theta$ that minimizes the average loss over the training set: $\min_\theta \frac{1}{m} \sum_{i=1}^m \ell_{\text{ce}}(h_\theta(x^{(i)}), y^{(i)})$
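A minimal numpy sketch of the cross-entropy loss in the $-h_y(x) + \log \sum_j \exp(h_j(x))$ form above, with the usual max-subtraction trick for numerical stability; the function name and test data are illustrative:

```python
import numpy as np

def softmax_ce_loss(H, y):
    """Average cross-entropy loss. H: (m, k) logits, y: (m,) integer labels."""
    H = H - H.max(axis=1, keepdims=True)           # shift logits for stability (loss unchanged)
    log_sum_exp = np.log(np.exp(H).sum(axis=1))    # log sum_j exp(h_j(x))
    return np.mean(log_sum_exp - H[np.arange(H.shape[0]), y])   # mean of -h_y(x) + logsumexp

H = np.random.randn(4, 3)        # logits for 4 examples, 3 classes
y = np.array([0, 2, 1, 1])
print(softmax_ce_loss(H, y))
```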
SGD
$\alpha$ is the step size / learning rate. Sample a minibatch of size $B$ and update $\theta \leftarrow \theta - \frac{\alpha}{B} \sum_{i=1}^{B} \nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$
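A sketch of this minibatch SGD loop, under the assumption that `grad_fn(theta, Xb, yb)` returns the sum of per-example gradients over the minibatch; the function names and defaults are illustrative, not a fixed API:

```python
import numpy as np

def sgd(theta, X, y, grad_fn, alpha=0.1, batch_size=32, epochs=10):
    """theta <- theta - (alpha / B) * sum of per-example gradients over each minibatch."""
    m = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(m)            # reshuffle the data every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            B = len(idx)
            theta = theta - (alpha / B) * grad_fn(theta, X[idx], y[idx])
    return theta
```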
Gradient of cross-entropy: $\nabla_\theta \ell_{\text{ce}}(\theta^T x, y) = x (z - e_y)^T$, where $z = \text{Softmax}(\theta^T x)$ and $e_y$ is the one-hot encoding of $y$; for the average loss over a batch, $\nabla_\Theta = \frac{1}{m} X^T (Z - I_y)$
Hack to compute the loss's derivative: pretend everything is a scalar, apply the ordinary chain rule, and then rearrange/transpose the matrices and vectors so the sizes work out. Always check the result against numerical gradients. This approach also works for the matrix batch form of the loss (see the check below).
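A numpy sketch of the batch gradient $\frac{1}{m} X^T (Z - I_y)$ from above, together with the recommended numerical check via central finite differences; all names and sizes are illustrative:

```python
import numpy as np

def softmax_ce_loss(H, y):
    H = H - H.max(axis=1, keepdims=True)
    return np.mean(np.log(np.exp(H).sum(axis=1)) - H[np.arange(len(y)), y])

def softmax_ce_grad(theta, X, y, k):
    """Analytic gradient (1/m) X^T (Z - I_y), with Z = Softmax(X theta) row-wise."""
    H = X @ theta
    Z = np.exp(H - H.max(axis=1, keepdims=True))
    Z /= Z.sum(axis=1, keepdims=True)
    I_y = np.eye(k)[y]                              # one-hot labels, shape (m, k)
    return X.T @ (Z - I_y) / X.shape[0]

# Check the analytic gradient against finite differences.
m, n, k = 6, 5, 3
X, y = np.random.randn(m, n), np.random.randint(k, size=m)
theta = np.random.randn(n, k)
num = np.zeros_like(theta)
eps = 1e-5
for i in range(n):
    for j in range(k):
        d = np.zeros_like(theta)
        d[i, j] = eps
        num[i, j] = (softmax_ce_loss(X @ (theta + d), y) -
                     softmax_ce_loss(X @ (theta - d), y)) / (2 * eps)
print(np.max(np.abs(num - softmax_ce_grad(theta, X, y, k))))   # should be tiny (~1e-8)
```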
Lecture 3 - “Manual” Neural Networks
Need to create a feature mapping $\phi(x)$ so that the classifier is more powerful than a purely linear one.
- Traditional: manually extract features
- Neural network: learn the features from data automatically
$\theta = \{W_1, W_2\}$ for a two-layer network, $h_\theta(x) = W_2^T \sigma(W_1^T x)$ (batch form $h_\theta(X) = \sigma(X W_1) W_2$), where $\sigma$ is an elementwise nonlinearity
A two-layer network with enough hidden units can approximate any smooth function well (universal approximation)
Deep networks are more efficient than a single layer at representing some functions that a shallow network cannot learn well, e.g., the parity function. Empirically, deeper networks also work better when the number of parameters is fixed.
Backpropagation
Two-layer network gradients (for the average batch loss, with $S = \text{Softmax}(\sigma(X W_1) W_2)$ applied row-wise):
- w.r.t. $W_2$: $\nabla_{W_2} = \frac{1}{m} \, \sigma(X W_1)^T (S - I_y)$
- w.r.t. $W_1$: $\nabla_{W_1} = \frac{1}{m} \, X^T \big( (S - I_y) W_2^T \circ \sigma'(X W_1) \big)$
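A numpy sketch of these two gradients for a ReLU nonlinearity, with the forward pass included; the function name, shapes, and $1/m$ averaging are illustrative assumptions consistent with the batch formulas above:

```python
import numpy as np

def two_layer_grads(X, y, W1, W2):
    """Manual gradients of the average softmax CE loss of a two-layer ReLU network.

    Network: logits = relu(X @ W1) @ W2.  Shapes: X (m, n), W1 (n, d), W2 (d, k).
    """
    m, k = X.shape[0], W2.shape[1]
    Z1 = np.maximum(X @ W1, 0)                   # sigma(X W1)
    H = Z1 @ W2                                  # logits
    S = np.exp(H - H.max(axis=1, keepdims=True))
    S /= S.sum(axis=1, keepdims=True)            # Softmax, row-wise
    I_y = np.eye(k)[y]                           # one-hot labels
    G2 = (S - I_y) / m                           # gradient w.r.t. the logits
    grad_W2 = Z1.T @ G2                          # sigma(X W1)^T (S - I_y) / m
    G1 = (G2 @ W2.T) * (Z1 > 0)                  # (S - I_y) W2^T  o  sigma'(X W1)
    grad_W1 = X.T @ G1
    return grad_W1, grad_W2

# Illustrative usage with random data
m, n, d, k = 8, 5, 16, 3
X, y = np.random.randn(m, n), np.random.randint(k, size=m)
W1, W2 = np.random.randn(n, d), np.random.randn(d, k)
gW1, gW2 = two_layer_grads(X, y, W1, W2)
print(gW1.shape, gW2.shape)                      # (5, 16) (16, 3)
```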
General form (an $L$-layer network $Z_{i+1} = f_i(Z_i, W_i)$, with $Z_1 = X$):
Forward pass: compute and cache each $Z_i$ in order
Backward pass: compute $G_i = \nabla_{Z_i} \ell$ from $G_{i+1}$, in reverse order
Each backward step is a vector-Jacobian product: $G_i = G_{i+1} \, \frac{\partial Z_{i+1}}{\partial Z_i}$, so the full Jacobian never has to be formed explicitly (sketch below)
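A minimal sketch of the general forward/backward pattern: each layer caches its input $Z_i$ on the forward pass and implements its vector-Jacobian product on the backward pass. The layer classes and names below are illustrative, not the course's actual library:

```python
import numpy as np

class Dense:
    """Fully connected layer: Z_{i+1} = Z_i @ W."""
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * 0.1
    def forward(self, Z):
        self.Z = Z                        # cache the input for the backward pass
        return Z @ self.W
    def backward(self, G):                # G = G_{i+1}, shape (m, n_out)
        self.grad_W = self.Z.T @ G        # VJP w.r.t. the weights
        return G @ self.W.T               # VJP w.r.t. the input, i.e. G_i

class ReLU:
    def forward(self, Z):
        self.Z = Z
        return np.maximum(Z, 0)
    def backward(self, G):
        return G * (self.Z > 0)           # elementwise Jacobian, applied as a VJP

def forward_backward(layers, X, G_out):
    """Forward pass computes/caches the Z_i; backward pass propagates the G_i."""
    Z = X
    for layer in layers:
        Z = layer.forward(Z)
    G = G_out                             # gradient of the loss w.r.t. the network output
    for layer in reversed(layers):
        G = layer.backward(G)
    return Z

# Illustrative usage; in practice G_out would be (S - I_y) / m from the softmax CE loss.
layers = [Dense(5, 16), ReLU(), Dense(16, 3)]
X = np.random.randn(8, 5)
logits = forward_backward(layers, X, np.ones((8, 3)) / 8)
print(logits.shape, layers[0].grad_W.shape)   # (8, 3) (5, 16)
```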