Big thanks to Michigan Online, Andrej Karpathy, and Justin Johnson for creating and sharing the fantastic Deep Learning for Computer Vision (EECS598) course online! This project is fully licensed under EECS598.
A K-NN classifier was implemented from scratch and applied to the CIFAR-10 dataset for image classification.
Cross-validation was used to select the hyperparameter k, and the selected model was then evaluated on the test set.
As a result, the K-NN classifier achieved a top-1 accuracy of 33.86% with k = 10.
The figure on the left illustrates how to select the optimal hyperparameter k using cross-validation, while the figure on the right demonstrates how the K-NN algorithm operates when k = 5; a similar process applies to other values of k.
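The voting step described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the repository's implementation: the function names, the squared L2 distance, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort training indices by distance to the query and keep the k closest.
    order = sorted(range(len(train_x)), key=lambda i: squared_l2(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # Assumption: ties are broken toward the smaller label to stay deterministic.
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Toy 2-D points standing in for flattened image vectors.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
```

Calling `knn_predict(train_x, train_y, [0.05, 0.05], k=3)` votes among the three closest points; changing `k` only changes how many neighbors take part in the vote.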
See 🔥here🔥 for more details about K-NN algorithms.
A single-layer neural network is trained from scratch on the CIFAR-10 dataset for image classification.
Here, I avoided using nn.Linear.forward() and loss.backward().
Instead, I implemented the forward pass of the linear layer and the backward pass (manually calculating gradients using the chain rule) entirely from scratch❗️
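As a rough illustration of what "manually calculating gradients using the chain rule" means for a linear layer, here is a small sketch in plain Python. The names `linear_forward` and `linear_backward`, the nested-list layout, and the cache tuple are assumptions for this example; the actual code works on PyTorch tensors.

```python
def linear_forward(x, w, b):
    """Compute out = x @ w + b for x: [N][D], w: [D][M], b: [M]."""
    out = [
        [sum(row[d] * w[d][m] for d in range(len(w))) + b[m] for m in range(len(b))]
        for row in x
    ]
    cache = (x, w)  # values needed again in the backward pass
    return out, cache


def linear_backward(dout, cache):
    """Given dout = dL/dout of shape [N][M], return (dL/dx, dL/dw, dL/db)."""
    x, w = cache
    N, D, M = len(x), len(w), len(w[0])
    # dL/dx = dout @ w^T
    dx = [[sum(dout[n][m] * w[d][m] for m in range(M)) for d in range(D)] for n in range(N)]
    # dL/dw = x^T @ dout
    dw = [[sum(x[n][d] * dout[n][m] for n in range(N)) for m in range(M)] for d in range(D)]
    # dL/db = column sums of dout
    db = [sum(dout[n][m] for n in range(N)) for m in range(M)]
    return dx, dw, db
```

The three gradient formulas follow directly from differentiating `out = x @ w + b` with respect to each input.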
Two different loss functions, SVM loss and SoftMax loss, are compared; both are implemented from the ground up without using PyTorch modules.
The SVM classifier achieves 38.99% accuracy on the validation set, while the SoftMax classifier achieves 39.69%.
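For reference, the two losses and their gradients with respect to the class scores can be sketched for a single example as follows. This is an illustrative version under assumptions (a margin of 1.0, per-example lists instead of batched tensors, hypothetical function names), not the code used to obtain the numbers above.

```python
import math


def softmax_loss(scores, label):
    """Cross-entropy loss and its gradient w.r.t. the scores for one example."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    # d(loss)/d(scores) = probs - one_hot(label)
    grad = [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]
    return loss, grad


def svm_loss(scores, label, margin=1.0):
    """Multiclass hinge loss and its gradient w.r.t. the scores for one example."""
    loss = 0.0
    grad = [0.0] * len(scores)
    for j, s in enumerate(scores):
        if j == label:
            continue
        violation = s - scores[label] + margin
        if violation > 0:
            loss += violation
            grad[j] += 1.0      # pushes the wrong class score down
            grad[label] -= 1.0  # pushes the correct class score up
    return loss, grad
```

Either gradient can then be passed backward through the linear layer to obtain the weight gradient.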
The figure below visualizes the learned weights of the linear layer. As you can see, the weights attempt to mimic the original objects, though they look a little blurry.
See 🔥here🔥 for more details about the single linear layer network.
A two-layer linear neural network is trained from scratch on the CIFAR-10 dataset for image classification.
As I mentioned in A2-1, I implemented the forward and backward passes of two linear layers all from scratch without using nn.Linear.forward() and loss.backward()❗️
I ran experiments with different hyperparameters (hidden dimension in the lower-left figure, regularization term in the upper-right figure, learning rate in the upper-left figure) and reached a best validation accuracy of 52.32%!
Afterwards, I visualized the weights of the first linear layer (W1) both before and after training. Refer to the figure below.
Similar to the learned weights figure in A2-1, the weights here also attempt to mimic the original object but with greater clarity.
See 🔥here🔥 for more details about the two-layer linear neural network.
I implemented forward and backward functions for Linear layers, ReLU activation, and DropOut from scratch without using nn.Linear.forward() and loss.backward()❗️.
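A minimal sketch of the ReLU and DropOut forward/backward pairs is shown below. It assumes "inverted" dropout where `p` is the probability of dropping a unit; the course code may define `p` differently, and the function names and list-based layout are chosen only for this example.

```python
import random


def relu_forward(x):
    """Return max(0, x) elementwise, plus the input as cache."""
    return [max(0.0, v) for v in x], x


def relu_backward(dout, cache):
    """Pass the upstream gradient only where the input was positive."""
    return [d if v > 0 else 0.0 for d, v in zip(dout, cache)]


def dropout_forward(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p, scale the rest by 1/(1-p)."""
    if not train:
        return list(x), None  # dropout is the identity at test time
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in x]
    return [v * m for v, m in zip(x, mask)], mask


def dropout_backward(dout, mask):
    """Apply the same mask to the upstream gradient."""
    if mask is None:
        return list(dout)
    return [d * m for d, m in zip(dout, mask)]
```

Scaling by `1/(1-p)` during training keeps the expected activation unchanged, so no rescaling is needed at test time.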
Then, I built a network of two fully connected layers with a ReLU non-linearity and trained it with different optimization algorithms: SGD, RMSProp, and Adam.
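To make the difference between the three optimizers concrete, here is a sketch of their update rules applied to a single scalar parameter. The hyperparameter defaults, the `state` dictionary, and the toy objective `(w - 3)^2` are assumptions for illustration only.

```python
import math


def sgd_step(w, grad, state, lr=0.1):
    """Vanilla SGD: step against the gradient."""
    return w - lr * grad


def rmsprop_step(w, grad, state, lr=0.1, decay=0.9, eps=1e-8):
    """RMSProp: divide the step by a running RMS of recent gradients."""
    state["sq"] = decay * state.get("sq", 0.0) + (1 - decay) * grad * grad
    return w - lr * grad / (math.sqrt(state["sq"]) + eps)


def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum plus RMS scaling, with bias correction for early steps."""
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize(step_fn, w=0.0, steps=200):
    """Minimize the toy objective f(w) = (w - 3)^2 with the given update rule."""
    state = {}
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = step_fn(w, grad, state)
    return w
```

For example, `minimize(sgd_step)` drives `w` toward 3; swapping in `rmsprop_step` or `adam_step` changes only how each step is scaled.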
See 🔥here🔥 for more details about two layer linear networks and related experiments.
I implemented forward and backward functions for Convolution layers, MaxPooling, and Batch Normalization from scratch without using nn.Conv2d.forward() and loss.backward()❗️.
(I used three nested for-loops to implement the forward and backward passes of the convolution layers, since convolution operates over the batch size, kernel size, and width & height dimensions.)
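A loop-based convolution forward pass can be sketched as follows. This version is an assumption-laden illustration: it uses nested Python lists, no padding, and four explicit loops (batch, filter, output row, output column), which is not necessarily the exact loop structure used in this repository.

```python
def conv_forward_naive(x, w, b, stride=1):
    """Naive convolution without padding.

    x: [N][C][H][W] inputs, w: [F][C][HH][WW] filters, b: [F] biases.
    Returns out: [N][F][H_out][W_out].
    """
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    F, HH, WW = len(w), len(w[0][0]), len(w[0][0][0])
    H_out = (H - HH) // stride + 1
    W_out = (W - WW) // stride + 1
    out = [[[[0.0] * W_out for _ in range(H_out)] for _ in range(F)] for _ in range(N)]
    for n in range(N):                  # each image in the batch
        for f in range(F):              # each filter
            for i in range(H_out):      # each output row
                for j in range(W_out):  # each output column
                    top, left = i * stride, j * stride
                    # Dot product between the filter and the input window.
                    out[n][f][i][j] = b[f] + sum(
                        x[n][c][top + p][left + q] * w[f][c][p][q]
                        for c in range(C)
                        for p in range(HH)
                        for q in range(WW)
                    )
    return out
```

The backward pass reuses the same loop nest, accumulating gradients into the input window and the filter instead of computing a dot product.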
Afterwards, I built a three-layer convolutional network in which each layer is a Convolution-BatchNorm-ReLU-MaxPool block.
I also added Kaiming initialization to stabilize training in its early stage.
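Kaiming initialization draws weights from a zero-mean distribution whose variance is scaled by the layer's fan-in (2 / fan_in for ReLU layers). The sketch below shows the normal-distribution variant for a fully connected layer; the function name and the `gain` parameter are assumptions for this example, and for a convolution layer fan_in would be input channels times kernel height times kernel width.

```python
import math
import random


def kaiming_normal(fan_in, fan_out, rng, gain=2.0):
    """Return a [fan_in][fan_out] weight matrix with std = sqrt(gain / fan_in).

    gain=2.0 is the usual choice when the layer is followed by ReLU.
    """
    std = math.sqrt(gain / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
```

Scaling by fan-in keeps the variance of activations roughly constant from layer to layer at the start of training.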
On the CIFAR-10 dataset, I achieved 71.9% top-1 accuracy.
The figure below visualizes the trained first convolution kernel, which looks entirely different from the weights of the linear layer shown in A2.
It resembles edges or one-dimensional shapes.
See 🔥here🔥 for more details about convolution, maxpool, and batchnorm operators.
The COCO Captions dataset includes 80,000 training images and 40,000 validation images, each paired with 5 captions provided by workers on Amazon Mechanical Turk.
The figure below illustrates examples from the dataset.
For this image captioning task, I implemented vanilla RNN and LSTM models, as they are well-suited for processing sequential text data as input.
I implemented those models from scratch using only torch.nn modules without using built-in nn.RNN() and nn.LSTM()❗️
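The core of a vanilla RNN is a single recurrence applied at every timestep. A dependency-free sketch is shown below; the weight layout (`Wx` as [D][H], `Wh` as [H][H]) and the function names are assumptions for illustration, and the LSTM adds gates on top of the same pattern.

```python
import math


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """One timestep: next_h = tanh(x @ Wx + prev_h @ Wh + b)."""
    H = len(b)
    next_h = []
    for j in range(H):
        pre = b[j]
        pre += sum(x[d] * Wx[d][j] for d in range(len(x)))
        pre += sum(prev_h[k] * Wh[k][j] for k in range(H))
        next_h.append(math.tanh(pre))
    return next_h


def rnn_forward(xs, h0, Wx, Wh, b):
    """Run the recurrence over a sequence and return every hidden state."""
    h, hs = h0, []
    for x in xs:
        h = rnn_step_forward(x, h, Wx, Wh, b)
        hs.append(h)
    return hs
```

For captioning, `h0` would typically be derived from image features and each `x` would be a word embedding.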
See 🔥here🔥 for more details about implementation of RNN and LSTM!
(Note: This lecture was conducted in 2019, prior to the publication of the Vision Transformer paper.)
Based on the "Attention Is All You Need" paper, I implemented the Transformer's Self-Attention, Multi-head Attention, Encoder and Decoder blocks, as well as Layer Normalization from scratch using torch.nn modules (without using nn.MultiheadAttention() or nn.LayerNorm())❗️
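The building block behind self-attention and multi-head attention is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal sketch on nested lists without masking; the function names are hypothetical and the real implementation operates on batched tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: [n][d_k], K: [m][d_k], V: [m][d_v] -> (outputs [n][d_v], weights [n][m])."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wt * v[j] for wt, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Multi-head attention runs several such attention operations in parallel on different learned projections of Q, K, and V and concatenates the results.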
I used a simple toy dataset designed for text-based calculations. Here are a few examples from the dataset:
Expression: BOS NEGATIVE 30 subtract NEGATIVE 34 EOS, Output: BOS POSITIVE 04 EOS
Expression: BOS NEGATIVE 34 add NEGATIVE 15 EOS, Output: BOS NEGATIVE 49 EOS
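The format of these examples can be reproduced with a short helper. This is a guess at the data format inferred only from the two examples above (sign word followed by a zero-padded two-digit magnitude); how the real dataset encodes zero or results outside two digits is not shown in the source, so those cases are assumptions here.

```python
def encode_number(n):
    """Encode an integer as 'POSITIVE dd' or 'NEGATIVE dd' (two-digit, zero-padded)."""
    # Assumption: zero is encoded as POSITIVE 00.
    sign = "NEGATIVE" if n < 0 else "POSITIVE"
    return f"{sign} {abs(n):02d}"


def make_example(a, op, b):
    """Build an (expression, output) pair for op in {'add', 'subtract'}."""
    if op == "add":
        result = a + b
    elif op == "subtract":
        result = a - b
    else:
        raise ValueError(f"unsupported operation: {op}")
    expression = f"BOS {encode_number(a)} {op} {encode_number(b)} EOS"
    output = f"BOS {encode_number(result)} EOS"
    return expression, output
```

`make_example(-30, "subtract", -34)` and `make_example(-34, "add", -15)` reproduce the two pairs listed above.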
After training a Transformer seq2seq model on this text-based calculation dataset, the final model reached 69.92% accuracy.
See 🔥here🔥 for more details about transformer implementation!
I implemented FCOS: Fully Convolutional One-Stage Object Detector from scratch and trained it on the PASCAL VOC 2007 dataset. (Detection sucks... Debug here ☹️)
See 🔥here🔥 for more details about FCOS implementation!
I implemented a two-stage object detector based on Faster R-CNN, which comprises two main modules: the Region Proposal Network (RPN) and Fast R-CNN.
I used FCOS as a backbone instead of Fast R-CNN.
As in the previous section (5-1), I used the PASCAL VOC 2007 dataset and evaluated performance using mean Average Precision (mAP) as the metric.
(Detection sucks... Debug here ☹️)
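mAP is built on Intersection over Union (IoU), which decides whether a predicted box counts as a match for a ground-truth box. A minimal sketch is given below, assuming boxes in `(x1, y1, x2, y2)` corner format; the function name and box format are assumptions and may differ from the repository's helpers.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A detection is usually counted as correct when its IoU with a ground-truth box of the same class exceeds a threshold such as 0.5, so checking IoU values on a few predictions can be a useful first debugging step.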
See 🔥here🔥 for more details about Faster R-CNN with FCOS implementation!
VAE, which stands for Variational AutoEncoder, is a type of generative model p(x) that incorporates a probabilistic approach into the traditional autoencoder.
Given an input x, the encoder compresses the data into a latent space z represented as q(z|x), while the decoder reconstructs x from the latent representation z as p(x|z).
Here, I used MNIST dataset to train the VAE.
(A conditional VAE is almost the same as a VAE, except that it is additionally conditioned on an input y, so it models x given y.)
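Two pieces that VAEs commonly rely on are the reparameterization trick for sampling z from q(z|x) and the KL term that regularizes q(z|x). The sketch below assumes a diagonal Gaussian posterior parameterized by `mu` and `logvar` and a standard normal prior; neither assumption is stated in the text above, and the function names are chosen for this example.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
```

The training loss is then typically the reconstruction error from the decoder p(x|z) plus this KL term.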
See 🔥here🔥 for more details about VAE and conditional VAE.
GAN, which stands for Generative Adversarial Network, is a type of generative model p(x) that employs two neural networks in a competitive framework: a generator and a discriminator.
The generator creates synthetic data G(z) from a latent space z, while the discriminator attempts to distinguish between real data x and generated data G(z).
Both networks are trained simultaneously, improving each other's performance iteratively.
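The competition between the two networks can be written as a pair of losses. The sketch below uses the standard binary cross-entropy form with the non-saturating generator loss; whether this repository uses exactly this variant is an assumption, and the inputs are plain lists of discriminator probabilities rather than tensors.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """-mean(log D(x)) - mean(log(1 - D(G(z)))) for probabilities in (0, 1)."""
    real_term = sum(-math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: -mean(log D(G(z)))."""
    return sum(-math.log(p + eps) for p in d_fake) / len(d_fake)
```

The discriminator loss falls as it separates real from generated samples, while the generator loss falls as generated samples are scored closer to 1.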
Here, I implemented two types of GAN: "Deep Convolutional GAN" (DCGAN) and "Fully Connected GAN".
The figure below shows images generated by the DCGAN with latent interpolation.
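Latent interpolation means generating images from points along a path between two latent vectors. A minimal sketch of the linear case follows; the function name is hypothetical, and whether the figure uses linear or some other interpolation scheme is an assumption.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` latent vectors linearly spaced from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

Feeding each vector in the returned path to the generator yields a sequence of images that gradually morphs from one sample to the other.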
See 🔥here🔥 for more details about GAN!