## neural network
Neural networks mimic the architecture of the brain and serve as the underpinning of modern [[artificial intelligence]].
A neural network is a multilayer perceptron with an input layer, some number of hidden layers, and an output layer. If all neurons in adjacent layers are connected, it is called a **fully-connected neural network** and can be referred to as **dense**.
Architecture hyperparameters include the number of hidden layers, number of nodes per layer, and activation function. Training hyperparameters include learning rate, momentum, optimization method, and regularization (among others).
[[Back propagation]] is used to update the weights for each layer.
### training neural networks
- Monitor overfitting as the number of epochs grows
- Hyperparameter tuning: learning rate, etc.
- Architecture: number of layers, number of neurons, activation function, etc.
- Optimization methods: RMSProp, Adam
- Regularization: dropout and batch normalization, or add L1/L2 regularization on the loss.
If the learning rate is too high, the algorithm may skip right past the minimum and shoot off in another direction, or roll back and forth without settling. A learning rate that is too low will not only take a long time to find the minimum, but may in fact get stuck in some local depression of the error surface.
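A minimal sketch of these failure modes, assuming a toy loss $f(w) = w^2$ that is not part of these notes; the function name `descend` and the three rates are illustrative choices only.

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # step against the gradient
    return w

too_high = descend(lr=1.1)     # each step multiplies w by -1.2, so it overshoots and diverges
too_low = descend(lr=0.001)    # each step multiplies w by 0.998, so it barely moves
reasonable = descend(lr=0.1)   # each step multiplies w by 0.8, so it approaches the minimum at 0
```

On this convex toy loss the low rate is merely slow; getting stuck in a local depression needs a bumpier error surface.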
Dropout is used to randomly exclude some nodes in each step to find a more generalizable solution.
Batch normalization standardizes the inputs to a layer across each minibatch in [[stochastic gradient descent]].
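Both ideas can be sketched in a few lines of NumPy. This is a simplified illustration, not a library implementation: the rescaling by `1 / (1 - rate)` ("inverted dropout") is an assumed convention, and the learned scale and shift that batch normalization layers usually carry are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep = rng.random(a.shape) >= rate
    return a * keep / (1.0 - rate)

def batch_norm(z, eps=1e-5):
    """Standardize each feature (column) across the minibatch (rows)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

batch = rng.normal(loc=3.0, scale=2.0, size=(128, 4))   # one minibatch of 128 examples
normed = batch_norm(batch)                               # each column now has mean ~0, std ~1
dropped = dropout(np.ones((1000, 10)), rate=0.5, rng=rng)
```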
### history of the neural network
Scientists first began to understand the architecture of the brain and its role in cognition in the mid-1800s. Santiago Ramon y Cajal is known as the father of neuroscience for his work in this era. That work fed into "connectionism", a branch of science, led by scientists like Alexander Bain, that sought to explain mental phenomena using artificial neural networks.

*Santiago Ramon y Cajal's depiction of neurons in the cerebellum.*
## perceptron
A perceptron is the computer's equivalent of a neuron: an artificial neuron. In the same way dendrites carry electrical impulses into a neuron which determine whether the neuron fires and sends a signal down its axon to the axon terminals, weighted inputs to the perceptron feed an activation function that determines the output of the perceptron.
The perceptron is the building block of the [[neural network]].
Common activation functions are sigmoid, tanh, ReLU (linear after a certain point), and the step function.
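As a rough sketch, a perceptron is a weighted sum followed by one of these activation functions. The names `ACTIVATIONS` and `perceptron` are hypothetical, and placing the step threshold at zero is an assumed convention.

```python
import math

ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
    "step": lambda z: 1.0 if z >= 0 else 0.0,  # threshold at 0 is an assumption
}

def perceptron(inputs, weights, bias, activation="step"):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return ACTIVATIONS[activation](z)

out_step = perceptron([1.0, 0.0], [0.6, 0.6], bias=-1.0)              # z = -0.4, does not fire
out_sigmoid = perceptron([1.0, 1.0], [0.5, -0.5], 0.0, "sigmoid")     # z = 0, output 0.5
```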
### training algorithm
Training a perceptron can be accomplished with either the perceptron rule or delta rule (gradient descent), both of which resolve to the same equation.
$
\omega_j \gets \omega_j - \alpha(\hat y_i - y_i)X_{ij}
$
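where $\omega_j$ is the weight on the $j$th input, $X_{ij}$ is the $j$th input of training example $i$, $\hat y_i$ and $y_i$ are the predicted and true outputs for that example, and $\alpha$ is the learning (or step) rate.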
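A minimal sketch of this update rule, assuming a step activation, a bias handled as an extra weight on a constant input of 1, and logical AND as a toy dataset. None of these choices come from the notes above, and `train_perceptron` and `predict` are hypothetical names.

```python
def train_perceptron(X, y, alpha=1.0, epochs=20):
    """Learn weights (the last one acts as the bias) with the update rule above."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xi = list(xi) + [1.0]                      # constant input for the bias
            z = sum(wj * xj for wj, xj in zip(w, xi))
            y_hat = 1.0 if z >= 0 else 0.0             # step activation
            w = [wj - alpha * (y_hat - yi) * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, xi):
    """Apply the trained weights to one example."""
    z = sum(wj * xj for wj, xj in zip(w, list(xi) + [1.0]))
    return 1.0 if z >= 0 else 0.0

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]          # logical AND, a linearly separable toy problem
w = train_perceptron(X, y)
predictions = [predict(w, xi) for xi in X]
```

Because AND is linearly separable, the weights stop changing once every example is classified correctly.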
### history of the perceptron
The [[perceptron]] was first proposed by Frank Rosenblatt in 1958 in the paper *The perceptron: a probabilistic model for information storage and organization in the brain*.

*The original perceptron concept from Rosenblatt (ref. 7); artificial neurons mimic the function of the brain, transforming inputs at the retina into responses.*
## back propagation
Back propagation is used in training a [[neural network]]. Using the [[chain rule]], the loss at each layer is propagated backward through the network with [[gradient descent]] to update the weights at each step. A [[computation graph]] is used to simplify calculations.
### simple backpropagation example
Consider a simple linear neural net with two hidden layers.
$
\begin{array}{cccc}
& \bigcirc & \bigcirc & \bigcirc \\
& \vert & \vert & \vert \\
\bigcirc & \bigcirc & \bigcirc & \bigcirc \\
x & & & y \\
\end{array}
$
Note that this network has three weights $W^1$, $W^2$, $W^3$ and three biases $b^1$, $b^2$, and $b^3$.
Suppose that each hidden and output neuron is equipped with a sigmoid activation function given by
$\sigma(z^L) = \frac{1}{1 + e^{-z^L}}$
The loss function is given by
$
\ell(y, a^4) = \frac{1}{2}(y - a^4)^2
$
where $a^4$ is the value of the activation at the output neuron and $y \in \{0,1\}$ is the true label associated with the training example.
Recall that the activity $z^L$ is simply
$z^L = W^{L-1} * a^{L-1} + b^{L-1}$
### feed forward
Let's find the pre-activation values $z^L$ (activities) and the post-activation values $a^L$ (activations) of the two hidden nodes and the output $\hat y$, starting with the second layer $L=2$ and one training example $(x, y) = (0.5, 0)$. Suppose each of the weights is initialized to $W^k = 1.0$ and each bias is initialized to $b^k = -0.5$.
---
$
\begin{array}{cccc}
& \bigcirc & \bigcirc & \bigcirc \\
& \vert & \vert & \vert \\
\bigcirc & \circledcirc & \bigcirc & \bigcirc \\
x & & & y \\
\end{array}
$
The activity $z^2$ of the second layer (first hidden node) with $x = 0.5$ is given by
$
z^2 = W^1 * x + b^1 = 1 * (0.5) + (-0.5) = 0
$
Given a sigmoid activation function $\sigma$, the activation $a^2$ is
$
a^2 = \sigma(z^2) = \frac{1}{1 + e^{-0}} = 0.5
$
---
The activation at the next node is then
$
\begin{array}{cccc}
& \bigcirc & \bigcirc & \bigcirc \\
& \vert & \vert & \vert \\
\bigcirc & \bigcirc & \circledcirc & \bigcirc \\
x & & & y \\
\end{array}
$
$
a^3 = \sigma(W^2 * a^2 + b^2) = \sigma(1 * 0.5 + (-0.5)) = \sigma(0) = 0.5
$
---
The pattern repeats for $a^4$ to get $\hat y=0.5$.
$
\begin{array}{cccc}
& \bigcirc & \bigcirc & \bigcirc \\
& \vert & \vert & \vert \\
\bigcirc & \bigcirc & \bigcirc & \circledcirc \\
x & & & y \\
\end{array}
$
However, we know from the training data that $y$ should be $0$.
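The feed-forward pass can be reproduced with a short loop. This is a sketch that stores the weights and biases in dictionaries keyed by their superscripts and treats the input $x$ as the layer-1 activation, both of which are conveniences introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 0.5
W = {1: 1.0, 2: 1.0, 3: 1.0}      # W^1, W^2, W^3
b = {1: -0.5, 2: -0.5, 3: -0.5}   # b^1, b^2, b^3

a = {1: x}                        # treat the input as the layer-1 activation
z = {}
for L in (2, 3, 4):
    z[L] = W[L - 1] * a[L - 1] + b[L - 1]   # activity of layer L
    a[L] = sigmoid(z[L])                    # activation of layer L

y_hat = a[4]   # 0.5, matching the hand calculation
```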
### backpropagation
Let's use backpropagation to update the weights and biases. From above, we have $(x, y) =(0.5, 0)$, $W^L = 1.0$, $b^L = -0.5$, activities $z^2 = z^3 = z^4 = 0$, and activations $a^2 = a^3 = a^4 = 0.5$.
### calculating the weight updates
Starting from the right side, we can find the change in the loss function $\ell$ with respect to the change in $W^3$.
$
\frac{\partial \ell}{\partial W^3} = \frac{\partial \ell}{\partial a^4} * \frac{\partial a^4}{\partial z^4} * \frac{\partial z^4}{\partial W^3}
$
Breaking this down by component, we have
$\frac{\partial \ell}{\partial a^4} = a^4 - y$
$\frac{\partial a^4}{\partial z^4} = \sigma'(z^4) = \sigma(z^4)(1 - \sigma(z^4))$
$\frac{\partial z^4}{\partial W^3} = a^3$
Plugging in the values we calculate
$\begin{align}
\frac{\partial \ell}{\partial W^3} &= (a^4 - y) * \sigma'(z^4) * a^3 \\
&= (0.5 - 0) * \sigma(0) * (1 - \sigma(0)) * 0.5 \\
&= 0.5 * 0.5 * (1 - 0.5) * 0.5 \\
&= 0.0625
\end{align}
$
Using the chain rule to continue through the network for $W^2$, we have
$
\frac{\partial \ell}{\partial W^2} = \frac{\partial \ell}{\partial a^4} * \frac{\partial a^4}{\partial z^4} * \frac{\partial z^4}{\partial a^3} * \frac{\partial a^3}{\partial z^3} * \frac{\partial z^3}{\partial W^2}
$
We have already calculated some of these terms, but we need to find
$
\frac{\partial z^4}{\partial a^3} = W^3
$
$\frac{\partial a^3}{\partial z^3} = \sigma'(z^3) = \sigma(z^3)(1 - \sigma(z^3))$
$
\frac{\partial z^3}{\partial W^2} = a^2
$
Plugging those values in we get
$\begin{align}
\frac{\partial \ell}{\partial W^2} &= (a^4 - y) * \sigma'(z^4) * W^3 * \sigma'(z^3) * a^2 \\
&= 0.5 * 0.25 * 1 * 0.25 * 0.5 \\
&\approx 0.016
\end{align}
$
The pattern continues such that
$\begin{align}
\frac{\partial \ell}{\partial W^1} &= (a^4 - y) * \sigma'(z^4) * W^3 * \sigma'(z^3) * W^2 * \sigma'(z^2) * x \\
&\approx 0.004
\end{align}
$
### updating the weights
The gradient $\nabla \ell = [0.004, 0.016, 0.0625]$ points in the direction of increasing loss, so we adjust each weight in the opposite direction using the formula
$W^L := W^L - \alpha \frac{\partial \ell}{\partial W^L}$
where $\alpha$ is the learning rate. With $\alpha=0.1$, we can update the weights as follows
$\begin{align}
W^1 &= 1.0 - 0.1 * 0.004 = 0.9996 \\
W^2 &= 1.0 - 0.1 * 0.016 = 0.9984 \\
W^3 &= 1.0 - 0.1 * 0.0625 = 0.99375
\end{align}$
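The weight gradients and updates can be checked numerically. This sketch assumes the label $y = 0$ from the feed-forward example and writes each gradient with the error term as $a^4 - y$, the derivative of the squared-error loss; the `delta` variable names are introduced here for illustration.

```python
x, y = 0.5, 0.0                   # training example (label assumed to be 0)
W1 = W2 = W3 = 1.0
a2 = a3 = a4 = 0.5                # activations from the forward pass
dsig = 0.5 * (1 - 0.5)            # sigma'(0) = 0.25 at every node
alpha = 0.1

delta4 = (a4 - y) * dsig          # d loss / d z^4
delta3 = delta4 * W3 * dsig       # d loss / d z^3
delta2 = delta3 * W2 * dsig       # d loss / d z^2

grad_W3 = delta4 * a3             # 0.0625
grad_W2 = delta3 * a2             # 0.015625, about 0.016
grad_W1 = delta2 * x              # 0.00390625, about 0.004

# Step against the gradient
W3_new = W3 - alpha * grad_W3
W2_new = W2 - alpha * grad_W2
W1_new = W1 - alpha * grad_W1
```

The unrounded gradients give slightly different final digits for $W^1$ and $W^2$ than the rounded values used in the hand calculation.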
### calculating the bias updates
The bias updates are exactly like the weight updates except there is no need to scale the delta by the input since the partial derivative of $W * a +b$ with respect to $b$ is just $1$. We can use the same formulas above to compute:
$
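\frac{\partial \ell}{\partial b^L} = (a^4 - y) \cdot \sigma'(z^4) \cdot \prod_{k=L+1}^{3} W^k \cdot \sigma'(z^k)
$
where the product runs over the weights $W^k$ between $b^L$ and the output and is empty (equal to $1$) for $b^3$.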
Using the values we've already calculated for the deltas at each layer, we get
$
\begin{align}
\frac{\partial \ell}{\partial b^3} &= (a^4 - y) \cdot \sigma'(z^4) = 0.5 \cdot 0.25 = 0.125 \\
\frac{\partial \ell}{\partial b^2} &= 0.5 \cdot 0.25 \cdot 1 \cdot 0.25 = 0.03125 \\
\frac{\partial \ell}{\partial b^1} &= 0.5 \cdot 0.25 \cdot 1 \cdot 0.25 \cdot 1 \cdot 0.25 = 0.0078125
\end{align}
$
### applying the bias updates
We update the bias using the same gradient descent rule:
$
b^L \gets b^L - \alpha \cdot \frac{\partial \ell}{\partial b^L}
$
With learning rate $\alpha=0.1$, we get:
$
\begin{align}
b^1 &= -0.5 - 0.1 \cdot 0.0078125 = -0.50078125 \\
b^2 &= -0.5 - 0.1 \cdot 0.03125 = -0.503125 \\
b^3 &= -0.5 - 0.1 \cdot 0.125 = -0.5125
\end{align}
$
### next steps
In this toy example, we might run the training data point through another **epoch** to update the weights again and help the model "memorize" this one data point. Depending on the learning rate, it might take 50 - 100 epochs for the weights to stabilize.
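Repeating the update over several epochs might look like the sketch below. It assumes the single training example $(x, y) = (0.5, 0)$ and the same initial weights and biases as above; `train` is a hypothetical helper, and the lists are zero-indexed so `W[0]` holds $W^1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, y, epochs, alpha=0.1):
    W = [1.0, 1.0, 1.0]            # W^1, W^2, W^3
    b = [-0.5, -0.5, -0.5]         # b^1, b^2, b^3
    losses = []
    for _ in range(epochs):
        # Forward pass: a[0] is the input, a[3] is the output a^4
        a = [x]
        for Wk, bk in zip(W, b):
            a.append(sigmoid(Wk * a[-1] + bk))
        losses.append(0.5 * (y - a[-1]) ** 2)
        # Backward pass: delta is d loss / d z, starting at the output node
        delta = (a[-1] - y) * a[-1] * (1 - a[-1])
        for k in (2, 1, 0):
            grad_W, grad_b = delta * a[k], delta
            if k > 0:
                # Propagate through the old weight before updating it
                delta = delta * W[k] * a[k] * (1 - a[k])
            W[k] -= alpha * grad_W
            b[k] -= alpha * grad_b
    return W, b, losses

W, b, losses = train(x=0.5, y=0.0, epochs=100)
```

After one epoch this reproduces the hand-calculated bias updates, and the recorded loss falls as the epochs accumulate.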
In real life, we'll have many training examples and so we would feed all of those in and update the weights for the average cost vector across all examples. Again, we would repeat multiple epochs to improve the model.
We would likely also increase the complexity of the architecture of the model, including more layers and more artificial neurons per layer. We could also experiment with different activation functions and different loss functions, like cross-entropy loss.
> [!Tip]- Additional Resources
> - [Artem Kirsanov | The Most Important Algorithm in Machine Learning (YT)](https://youtu.be/SmZmBKc7Lrs?si=uWZOq_X37JoTkEGM)
> - [3Blue1Brown | ]()
## gradient descent
Gradient descent is one of the most important algorithms in machine learning. Gradient descent can be used to find a local minimum for any differentiable function.
Intuitively, think about trying to get to the lowest point on a landscape by walking quickly downhill. Your next step should be in the direction with the steepest slope. After taking a step, you reassess your next step. You can easily get down a mountain and into a valley in this way, but as you probably can guess you are not guaranteed to be in the lowest point. Where you start from will make a difference. So too will your decision to go left or right at a saddle point. It's also best to avoid cliffs; if you find yourself on a plateau you may not know where to go next. All of these considerations also apply to gradient descent in machine learning.
While you can look out and see where to go next, the gradient descent algorithm must rely on the [[derivative]] of the optimization function. The derivative tells the algorithm whether the function is increasing or decreasing at a given point. If increasing, it tells the algorithm to adjust in the opposite direction, and vice versa. If increasing strongly, the algorithm can take a bigger step with confidence.
A learning rate $\alpha$ is typically used to avoid taking too big a step and overshooting.
Consider updating a weight in a [[multilayer perceptron]].
$
W^L_t = W^L_{t-1} - \alpha \frac{\partial{\ell}}{\partial{W^L_{t-1}}}
$
The weight $W$ in layer $L$ at time step $t$ is given by the weight at the previous time step less the change in the loss function $\ell$ given a change in the weight (the partial derivative of $\ell$ with respect to $W^L$) multiplied by the learning rate $\alpha$.
The update procedure is continued until the loss function stabilizes or otherwise meets some specification.
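As a sketch of the full procedure, including the stopping rule and the dependence on the starting point, consider an assumed toy loss $f(w) = w^4 - 3w^2 + w$ with two valleys. The function names, the tolerance, and the step limit are illustrative choices.

```python
def loss(w):
    """Toy loss with two valleys (an assumption for illustration)."""
    return w**4 - 3 * w**2 + w

def loss_gradient(w):
    """Derivative of the toy loss."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, alpha=0.01, tol=1e-6, max_steps=10_000):
    """Step against the gradient until it is nearly zero."""
    for _ in range(max_steps):
        grad = loss_gradient(w)
        if abs(grad) < tol:       # the loss has stabilized
            break
        w -= alpha * grad
    return w

w_left = gradient_descent(-2.0)   # settles in the deeper valley near w = -1.30
w_right = gradient_descent(2.0)   # settles in the shallower valley near w = 1.13
```

Both runs end where the gradient vanishes, but only the first one reaches the lower of the two minima.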
### stochastic gradient descent
In stochastic gradient descent (SGD), the [[gradient descent]] algorithm is used on a randomized subset of the data, called a **minibatch**.
Each step only sees a small subset of the data, which greatly decreases the computational requirements. The next step is not necessarily the best next step (not necessarily in the direction of steepest slope for the full training set), but with the correct batch size the marginal gains of any additional training examples are less than the marginal gains in speed.
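A minimal NumPy sketch of minibatch SGD, assuming a one-parameter linear regression on synthetic data with a true slope of 3. The data, batch size, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)   # true slope is 3

w, alpha, batch_size = 0.0, 0.1, 32
for epoch in range(5):
    order = rng.permutation(len(X))                     # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]           # indices of one minibatch
        error = w * X[idx, 0] - y[idx]
        grad = (error * X[idx, 0]).mean()               # gradient of half the mean squared error
        w -= alpha * grad
```

Each update touches only 32 of the 1000 examples, yet the estimate of `w` still ends up close to the true slope.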
#### momentum
SGD is still slow to converge but adding a **momentum** (moving average) improves speed. Intuitively, momentum is meant to make it harder to change direction from step to step.
Using the multilayer perceptron example, the momentum coefficient is given by $\beta$, and a separate velocity variable $v$ holds a decaying moving average of past gradients.
$
\begin{align}
v_t &= \beta v_{t-1} - \alpha \frac{\partial \ell}{\partial W^L_{t-1}} \\
W^L_t &= W^L_{t-1} + v_t
\end{align}
$
Setting $\beta = 0$ recovers plain gradient descent (see [[lit/textbooks/Deep Learning|Deep Learning]] section 8.3.2).
Momentum also helps overcome plateaus, where the gradient is zero, allowing the algorithm to simply continue in a direction until the gradient changes again.
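A small sketch of a momentum update that keeps the velocity in its own variable `v`. The gradient values fed in are made up to imitate a slope followed by a plateau, and `momentum_step` is a hypothetical name.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    """One momentum update: v accumulates past gradients, then moves w."""
    v = beta * v - alpha * grad
    return w + v, v

# On a plateau the gradient is zero, yet the stored velocity keeps w moving.
w, v = 0.0, 0.0
w, v = momentum_step(w, v, grad=-1.0)   # downhill slope: v = 0.1, w = 0.1
w, v = momentum_step(w, v, grad=0.0)    # plateau: v = 0.09, w = 0.19
```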
Nesterov momentum is intended to achieve convergence faster, but empirically the improvement over standard momentum is often small.
#### decay
Decay is used to reduce the learning rate at each time step to prevent overshooting. The learning rate, and thus step size, decreases over time. This can help with overfitting as well.
Instead of a constant decay, in `keras` you can set a learning rate schedule with a stepwise function which can also improve prediction error.
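A stepwise schedule can be sketched in plain Python. The function name `step_decay` and its default values are hypothetical and are not taken from the `keras` API.

```python
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Learning rate at epochs 0, 9, 10, and 25
schedule = [step_decay(e) for e in (0, 9, 10, 25)]
```

The rate stays flat within each block of ten epochs and drops at the boundaries.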
#### SGD optimization
There are a number of SGD variants available in packages like `keras` that manipulate momentum and learning rate primarily for numerical stability and speed in convergence. Adam and RMSprop are good general purpose options.
See [this article](https://imgur.com/a/visualizing-optimization-algos-Hqolp) on how these variants compare.

#### AdaGrad
In AdaGrad, the learning rate is normalized by the square root of the running sum of squared gradients.
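A one-parameter sketch of this update, assuming a constant gradient of 2 so the shrinking step size is easy to see. The name `adagrad_step` and the default `alpha` and `eps` are illustrative.

```python
import math

def adagrad_step(w, grad, accum, alpha=0.1, eps=1e-8):
    """Scale the step by the root of the running sum of squared gradients."""
    accum = accum + grad ** 2
    w = w - alpha * grad / (math.sqrt(accum) + eps)
    return w, accum

w, accum = 1.0, 0.0
steps = []
for _ in range(3):
    w_next, accum = adagrad_step(w, grad=2.0, accum=accum)
    steps.append(w - w_next)      # size of the step just taken
    w = w_next
```

Because the sum only grows, each step is smaller than the last even though the gradient never changes.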
#### AdaDelta
In AdaDelta, the learning rate is normalized by the root mean square (RMS) of recent gradients for numerical stability. The change in weights is proportional to the ratio of the RMS of recent weight updates to the RMS of recent gradients.
#### RMSprop
RMSprop, like AdaDelta, uses an exponentially decaying moving average of the squared gradient, rather than AdaGrad's running sum, when it calculates the RMS of the gradient for numerical stability.
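A one-parameter sketch of an RMSprop-style update under the same constant-gradient assumption. The name `rmsprop_step` and the default `alpha`, `rho`, and `eps` are illustrative.

```python
import math

def rmsprop_step(w, grad, avg_sq, alpha=0.01, rho=0.9, eps=1e-8):
    """Scale the step by the root of a moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w = w - alpha * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

w, avg_sq = 1.0, 0.0
for _ in range(200):
    w_next, avg_sq = rmsprop_step(w, grad=2.0, avg_sq=avg_sq)
    last_step = w - w_next        # size of the step just taken
    w = w_next
```

With a moving average the denominator levels off, so the step size settles near `alpha` instead of decaying toward zero.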
#### Adam
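Adaptive Moment Estimation (Adam) applies a momentum-style moving average to both the gradient and the squared gradient, using the first to set the direction of the step and the second to scale it.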
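A one-parameter sketch of a single Adam step. The bias-correction terms and the default hyperparameters follow the commonly quoted settings, and `adam_step` is a hypothetical name.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=4.0, m=m, v=v, t=1)   # first step has size close to alpha
```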
## computation graph
Computation graphs are used to represent formulas step-by-step.
In [[back propagation]], computation graphs are used to decompose a function of arbitrary complexity into component primitive functions for which derivatives are known (e.g., $x^n$, $e^x$). This allows for the efficient calculation of the derivative of the complex function using the [[chain rule]].
Let's use a simple example to illustrate the basics. Let's say we have some function
$
F(a, b, c) = 3 \times (a + b) \times c
$
The computation graph can be represented as
```mermaid
graph LR
A[a]
B[b]
C[c]
A --> U["u=a+b"]
B --> U
U --> Mul["v=u*c"]
C --> Mul
Mul --> F["F=3*v"]
```
Note that in some representations, only the operator (e.g., $+$ or $*$) is shown.
First, we must forward propagate the graph with some initial values. Let's use $a=2$, $b=4$, and $c=8$.
```mermaid
graph LR
A[a=2]
B[b=4]
C[c=8]
A --> U["u=a+b=6"]
B --> U
U --> Mul["v=u*c=48"]
C --> Mul
Mul --> F["F=3*v=144"]
```
For back propagation, we ultimately want to calculate the sensitivity of $F$ to changes in input variables $a$, $b$, and $c$, which is to say, using variable $a$ for example, what is
$\frac{\partial F}{\partial a}$?
From the chain rule, we can break this down as
$
\frac{\partial F}{\partial a} = \frac{\partial F}{\partial v} \times \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial a}
$
What impact would a $0.001$ increase in $a$ have on $F$? It would be
$
F = 3 * (2.001 + 4) * 8 = 144.024
$
which is a change $24$ times the size of the change in $a$ ($0.024 / 0.001 = 24$).
Let's compute the partial derivatives of each step in the computation graph, which gives
$\begin{align}
\frac{\partial F}{\partial v} = 3 &&
\frac{\partial v}{\partial u} = c = 8 &&
\frac{\partial u}{\partial a} = 1
\end{align}$
Therefore
$
\frac{\partial F}{\partial a} = \frac{\partial F}{\partial v} \times \frac{\partial v}{\partial u} \times \frac{\partial u}{\partial a} = 3 \times 8 \times 1 = 24
$
Now we can see how the chain rule works for this example.
To use the computation graph, simply start from the right side of the graph and work backwards, filling in the interim value at each step in the computation. For each step, you are calculating the partial derivative of $F$ with respect to the associated variable at that step in the graph (for completeness, technically, we start with $\frac{\partial F}{\partial F} = 1$).
To recap, to use a computation graph, follow these steps.
1. Convert your complex formula into a computation graph
2. Forward feed the initial values
3. Find the partial derivative at each step
4. Starting from the right side of the graph with value 1, plug in the necessary values for each partial derivative to get the partial derivative of $F$ with respect to the associated variable at that step in the graph.
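The four steps can be traced in code for $F(a, b, c) = 3 \times (a + b) \times c$. This is a hand-written sketch of the forward and backward passes; the `dF_d*` variable names are introduced here for illustration.

```python
a, b, c = 2.0, 4.0, 8.0

# Forward pass through the graph
u = a + b                  # 6
v = u * c                  # 48
F = 3 * v                  # 144

# Backward pass, starting from the right side with dF/dF = 1
dF_dF = 1.0
dF_dv = dF_dF * 3          # F = 3v, so dF/dv = 3
dF_du = dF_dv * c          # v = u*c, so dv/du = c
dF_dc = dF_dv * u          # v = u*c, so dv/dc = u
dF_da = dF_du * 1          # u = a+b, so du/da = 1
dF_db = dF_du * 1          # u = a+b, so du/db = 1

# Finite-difference check of dF/da using a 0.001 bump in a
approx_dF_da = (3 * ((a + 0.001) + b) * c - F) / 0.001
```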
### Computation graph with basic neural network
Let's consider a two-layer perceptron with inputs $X_1$ and $X_2$ and a hidden layer with perceptrons $h1$ and $h2$ and a single output $a$.
> [!Tip]- Additional Resources
> - [DeepLearning.ai | Derivatives with Computation Graphs](https://youtu.be/nJyUyKN-XBQ?si=AODv0fRICvFc9SXW)
> - [MITOpenCourseware | Differentiation on Computational Graphs](https://youtu.be/r9_5dxtDTOk?si=J5cJ6yyjte6EwNRg)
## keras
[Keras](https://keras.io/) is a [[base/Deep Learning/deep learning]] library developed at Google that provides a high-level API on top of backends such as [[Tensorflow]] and [[PyTorch]].
You'll want to run `keras` on a [[graphics processing unit|GPU]]; a CPU-only local environment will likely be too slow, even on small toy models. The best way to access a GPU is to [[enable GPU or TPU in Google Colab]].
In the example below, we'll build a simple neural network with one hidden layer to predict 5 classes from inputs with 100 features. The number of epochs is the number of times the model is shown the full training data.
```python
import keras
from keras.models import Sequential
from keras.layers import Dense, Input
# Create basic neural net
model = Sequential()
# Declare the input: each example is a vector of 100 features
model.add(Input(shape=(100,)))
# Add densely connected hidden layer with ReLU activation
model.add(Dense(units=64, activation='relu'))
# Add output layer with softmax activation function for 5 classes
model.add(Dense(units=5, activation='softmax'))
# Create an optimizer and specify learning rate
opt = keras.optimizers.SGD(learning_rate=0.1)
# Print summary
model.summary()
```
To train the model
```python
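# Compile model
model.compile(
    loss='categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)
# Train model. X must have shape (n_samples, 100) and y must be one-hot
# encoded with shape (n_samples, 5); defining them is left as a TODO.
model.fit(X, y, epochs=5)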
```
You can also access `keras` from within `tensorflow`.
```python
import tensorflow as tf
model = tf.keras.Sequential()
...
```
> [!Tip]- Additional Resources
> - [Stanford Deep Learning for Computer Vision (CS231n)](https://cs231n.stanford.edu/)
## keras image
Efficient data streaming is essential for speeding up data processing. Kaggle makes available both GPU and TPU processors with limited free use per week. The Keras utility `image_dataset_from_directory` streams images directly from disk, which allows better performance on a GPU. However, it requires that images be converted from TIFF to a supported file format (e.g., JPG) and arranged in a directory structure with one subfolder per label, like
```
*
--+ train/
--+ 1/
--+ 0/
--+ test/
--+ 1/
--+ 0/
```
I'll copy the files into a new directory matching the above and then read them using the Keras utility `image_dataset_from_directory` before training the model.
```python
import os

import keras
import pandas as pd
from PIL import Image
from tensorflow import data as tf_data
from tqdm import tqdm

# NOTE: `train_labels` (a DataFrame with an image id column followed by a
# `label` column) and `train_dir` (the folder of source .tif files) are
# assumed to be defined earlier.

# Balance training data (subsample for debugging)
n_train = train_labels['label'].value_counts().min()
n_train = round(n_train * 0.05)  # Limit to 5% for debugging
benign = train_labels[train_labels['label'] == 0].sample(n_train)
malignant = train_labels[train_labels['label'] == 1].sample(n_train)
df_train_all = pd.concat([benign, malignant], axis=0).reset_index(drop=True)

# Create one subdirectory per label under each split
keras_train_dir = 'keras/train'
keras_test_dir = 'keras/test'
for fold in [keras_train_dir, keras_test_dir]:
    for subf in ["0", "1"]:
        os.makedirs(os.path.join(fold, subf), exist_ok=True)


def copy_and_convert_images_to_jpg(src_dir, dest_dir, df_fids):
    """Convert each listed .tif in src_dir to a .jpg under dest_dir/<label>/."""
    # Each row is unpacked as (file id, label), so df_fids needs exactly
    # those two columns in that order.
    for _, (fid, label) in tqdm(
        df_fids.iterrows(), total=len(df_fids), desc="Converting images"
    ):
        src = os.path.join(src_dir, str(fid) + ".tif")
        dst = os.path.join(dest_dir, str(label), str(fid) + ".jpg")
        # Convert to RGB first since JPEG cannot store an alpha channel
        with Image.open(src) as img:
            img.convert("RGB").save(dst, format="JPEG")


copy_and_convert_images_to_jpg(train_dir, keras_train_dir, df_train_all)

# Set up data streams
image_size = (96, 96)
batch_size = 128
train_ds, val_ds = keras.utils.image_dataset_from_directory(
    keras_train_dir,
    validation_split=0.2,
    subset="both",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
    shuffle=True
)
# Prefetching samples in GPU memory helps maximize GPU utilization.
train_ds = train_ds.prefetch(tf_data.AUTOTUNE)
val_ds = val_ds.prefetch(tf_data.AUTOTUNE)
# Set up image stream for test data
test_ds = keras.utils.image_dataset_from_directory(
    keras_test_dir,
    image_size=image_size,
    batch_size=batch_size,
    shuffle=False
)
```
## activation function
The activation function determines the value that is passed to the next layer in the neural network. Critically, it is a nonlinear function of the inputs (which include all of the activations in the previous layer multiplied by their weights).
## sigmoid
### softmax
Softmax is useful in multi-class classification problems and is given by
$
P(y = j | x) = \frac{e^{x^T w_j}}{\sum^K_{k=1}e^{x^Tw_k}}
$
Softmax will give the probability for each category and is typically used to resolve to only one class.
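A plain-Python sketch of softmax applied to a list of class scores (the $x^T w_k$ terms above). Subtracting the largest score before exponentiating is a standard numerical-stability trick added here; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))   # resolve to the single most likely class
```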
### rectified linear unit
Rectified linear unit (ReLU) is the most popular [[activation function]] for neural networks given its ease of computation. Previously, the [[sigmoid]] activation function was popular but ReLU was found to be as accurate while also being faster to compute.
$\max(0, z)$
Leaky ReLU or Parametric ReLU (PReLU) allows some negative values as opposed to a minimum of 0.
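The two variants differ only in how they treat negative inputs. In this sketch the slope of `0.01` is an assumed default; in PReLU the slope is a parameter learned during training.

```python
def relu(z):
    """Pass positive values through and clamp negatives to zero."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but scale negative inputs by a small slope instead of zeroing them."""
    return z if z > 0 else slope * z

outputs = [(relu(z), leaky_relu(z)) for z in (-2.0, 0.0, 3.0)]
```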
## convolutional neural network
A convolutional neural network (CNN) augments the [[multilayer perceptron]] by adding one or more [[convolution]] steps to the inputs. CNNs can be used to constrain the number of parameters in neural networks with large inputs (e.g., images) and to encode the context of a pixel in its representation, which is otherwise lost when the image is flattened to a 1-D array. Because the same filter is applied at every position, the network gains a degree of **translational invariance**.
In image processing, convolution layers can be understood as identifying specific features (like eyebrows and mouths). For this reason, the convolution and [[pooling layer]] are together referred to as the **feature extractor**.
In a convolutional neural network, multiple convolutions are used to create **feature layers** that are then fed into the neural network (however you could use convolution simply to create features for any other classifier like [[XGBoost]]).
Use **padding** to avoid reducing the dimension of the output matrix. Use **stride** to take larger steps in the moving window operation. Stride will reduce the output matrix by a factor of the stride length (a stride of 2 returns an output matrix roughly half the size of the input in each dimension).
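The effect of padding and stride on output size can be sketched with the usual formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for an input of width $n$, kernel width $k$, padding $p$, and stride $s$. The function name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Width of the output along one dimension of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1

no_padding = conv_output_size(32, kernel=3)                        # shrinks to 30
same_size = conv_output_size(32, kernel=3, padding=1)              # stays at 32
strided = conv_output_size(32, kernel=3, padding=1, stride=2)      # halves to 16
```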
The weights in the filter (or kernel) are learned through the fitting process.
CNN architecture can become quite complicated. A few examples of famous CNN architectures include:
- **VGGNet** (2014) used n-layers comprised of two convolution layers and one pooling layer.
- **GoogLeNet InceptionNet** (2014) used repeating inception layers combining 1x1, 3x3 and 5x5 convolution filters in parallel.
- **ResNet** (2015) used skip connections to avoid the unstable training that results when many non-linear activation functions in series cause the gradient to vanish or explode.
### convolution
A convolution is a type of matrix operation given by
$
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
*
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
= d + 2c + 3b + 4a
$
Convolution can be used as a **moving window** filter. Note the above convolution reduces the 2x2 input matrix to a 1x1 scalar output.
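A plain-Python sketch of the moving-window operation with no padding and a stride of 1. It flips the kernel so that the 2x2 case matches the formula above; many deep learning libraries skip the flip (computing a cross-correlation), which makes no practical difference when the kernel weights are learned.

```python
def convolve2d(image, kernel):
    """2-D convolution without padding: flip the kernel, then slide it over the image."""
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            # Multiply the window element-wise with the flipped kernel and sum
            row.append(sum(flipped[di][dj] * image[i + di][j + dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# With a=10, b=20, c=30, d=40: d + 2c + 3b + 4a = 200
single = convolve2d([[1, 2], [3, 4]], [[10, 20], [30, 40]])
# A 3x3 input with a 2x2 kernel gives a 2x2 output
window = convolve2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0, 0], [0, 1]])
```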
### pooling layer
Pooling reduces the size of the convolution layers. Typical pooling size is 2x2, which reduces the convolution layer by half in each dimension. There are two flavors:
- **Max pool** takes the maximum value in the pool
- **Average pool** averages the values
Stride without padding in the convolution can also be used to reduce the size of the convolution layers.
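Both pooling flavors can be sketched with one helper that applies a reducing function to each non-overlapping window. The names `pool` and `average` are hypothetical, and the 4x4 input is made up.

```python
def pool(matrix, reduce_fn, size=2):
    """Apply reduce_fn to each non-overlapping size x size window."""
    return [[reduce_fn([matrix[i + di][j + dj]
                        for di in range(size) for dj in range(size)])
             for j in range(0, len(matrix[0]) - size + 1, size)]
            for i in range(0, len(matrix) - size + 1, size)]

def average(values):
    return sum(values) / len(values)

layer = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
max_pooled = pool(layer, max)        # [[6, 8], [14, 16]]
avg_pooled = pool(layer, average)    # [[3.5, 5.5], [11.5, 13.5]]
```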
### training tips for CNN
- Use learning rate of 0.01 to 0.001
- Use Adam or RMSProp optimization method
- Use ReLU or PReLU activation for hidden layers
- Use Sigmoid, Softmax, Tanh, or PReLU for output layers
- Use 3x3 filters
- Use n-layers of Conv-Conv-MaxPool for architecture
- Use L2 regularization
- Use batch normalization (and/or dropouts but batch normalization is usually sufficient)
## transfer learning
Transfer learning describes re-training a fitted model for new datasets.
If you have access to the model components, you could for example take just the feature extractor layer (or subcomponents) and pipe it into a different model.
Fine-tuning is used to update the weights from the pretrained model based on new labeled data. Starting from the pretrained weights can save significant compute compared to training a new model.
## Cycle GAN
Introduced in the paper [Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/pdf/1703.10593), CycleGAN is an adversarial approach to translating an image from one domain (e.g., a photo) to another domain (e.g., the painting style of Monet). The algorithm learns a mapping $G: X \to Y$ (i.e., from photos to Monet) together with an inverse mapping $F: Y \to X$ under the constraint $F(G(X)) \approx X$, so that cycling an image through both mappings returns it to its original domain. This is akin to what humans do when imagining the actual landscape that Monet painted, or the way Monet might paint a landscape.
## softmax
Softmax is used to resolve multi-class classification probabilities.
$
P(y=c|x;w) = \frac{e^{z_c}}{\sum_{j=1}^k e^{z_j}}
$
where $z_c = W_c x$ for class $c$.
Softmax is analogous to the [[sigmoid]] for binary classification problems.