Variational Autoencoders (VAEs) are neural networks that learn to create new data, like images or text, by understanding and mimicking patterns in existing data. In this article, we will gain an intuitive understanding of how and why they work. But before we get into the specifics of VAEs, it is crucial to understand the basics of autoencoders, as they are the building blocks upon which VAEs are built.
Autoencoder fundamentals
Autoencoders are neural networks that learn a compressed representation of input data for dimensionality reduction or feature learning. This lower-dimensional compressed representation is called the latent space.
An autoencoder consists of two main components (a minimal code sketch follows this list):
- An encoder, which maps each input sample to a lower-dimensional vector (the latent vector) in the latent space.
- A decoder, which attempts to reconstruct the original sample from this latent vector.
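To make this concrete, here is a minimal sketch of an autoencoder, assuming PyTorch. The input size of 784 (e.g. a flattened 28x28 image), the hidden size of 256, and the 32-dimensional latent space are illustrative choices, not requirements of the architecture:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: 784-dim inputs, 32-dim latent space (illustrative sizes)."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps each input sample to a lower-dimensional latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: attempts to reconstruct the original sample from the latent vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent vector
        return self.decoder(z)   # reconstruction
```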
This leads to their application to tasks such as data compression, classification, and anomaly detection. Autoencoders, however, face significant challenges when it comes to generating new, unseen samples. This limitation arises because their loss function focuses solely on minimizing the reconstruction loss, without imposing any structure on the latent space.
Reconstruction-only focus
Traditional autoencoders aim to minimize the reconstruction loss, which is the difference between the original input and the reconstructed output. The encoder creates latent vectors from input samples, and the decoder can (with some error) reconstruct the input samples from these latent vectors. But if we modify a latent vector slightly, hoping to generate new data, the decoder’s output can change abruptly and become completely meaningless. This is because a traditional autoencoder does not learn a well-defined latent space. Therefore, we cannot sample a latent vector randomly from the latent space and expect to generate meaningful new data. In other words, the latent space is fragmented, or non-smooth.
Why is this important for data generation?
For generating new data, we need:
- A well-defined latent space from which we can randomly sample latent vectors.
- All of these sampled latent vectors should produce meaningful outputs.
- Small changes in latent vectors should translate to small changes in the output.
Traditional autoencoders do not meet these requirements because their latent space is not well-defined, and their decoder can only reconstruct learned latent vectors.
Variational Autoencoders (VAEs)
Variational Autoencoders (VAEs) are generative models that overcome these limitations by introducing a regularization term in their loss function. The regularization encourages the latent space to follow a specific, well-defined distribution—typically a Gaussian distribution with zero mean and unit variance \mathcal{N}_{\textbf z}(\textbf 0, \textbf I) . Just like an autoencoder, a VAE consists of an encoder and a decoder.
The Encoder: Compressing Data into Latent Space
The encoder’s role is to map the input data \textbf x into a lower-dimensional latent space. Instead of mapping each input sample \textbf x to a fixed latent vector, the encoder in a VAE outputs a vector of means \mu(\textbf x) and a vector of log-variances \log(\sigma^2 (\textbf x)) . Both vectors have a length D equal to the number of dimensions in the latent space. Together, these vectors define a Gaussian distribution in the latent space for each input sample. A D-dimensional latent vector \textbf z can then be sampled from this distribution for decoding.
Mathematically, for each input sample, the encoder outputs a Gaussian distribution given by:
q(\textbf z | \textbf x)=\mathcal N(\textbf z; \mu(\textbf x), \sigma^2(\textbf{x}))
VAEs use a Kullback-Leibler (KL) divergence term in their loss function to push this distribution towards a standard Gaussian distribution with zero mean and unit variance:
D_{KL}(q(\textbf z| \textbf x) || \mathcal{N}_{\textbf z}(\textbf 0, \textbf I))
Note that this only encourages each latent distribution to have a mean close to zero and a variance close to one. It does not force every latent distribution to have exactly zero mean and unit variance.
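As an illustration, a VAE encoder is often implemented as a shared hidden layer followed by two parallel output layers, one producing the means and the other the log-variances. The sketch below assumes PyTorch; the class name VAEEncoder and all layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps an input x to the parameters of a Gaussian q(z|x): a mean vector
    and a log-variance vector, each of length latent_dim (illustrative sizes)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mu(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log(sigma^2(x))

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)
```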
Sampling the latent vector
To sample a latent vector from this distribution, we use the reparameterization trick, given by:
\textbf z = \mu(\textbf x) + \sigma(\textbf x) \odot \epsilon
where:
- \epsilon is a noise vector sampled from a standard normal distribution \mathcal{N}_{\epsilon}(\textbf 0, \textbf I).
- \odot denotes element-wise multiplication.
Essentially, each pair of mean and standard deviation gives one element of the latent vector: \mu_1 and \sigma_1 are used to calculate z_1 , \mu_2 and \sigma_2 give z_2 , and so on, until we have the D-dimensional latent vector \textbf z .
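In code, the reparameterization trick is only a few lines. The sketch below assumes PyTorch and that the encoder outputs the log-variance \log(\sigma^2(\textbf x)) , as above, so the standard deviation is recovered with an exponential:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).
    Predicting log(sigma^2) keeps sigma positive and training stable."""
    sigma = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(sigma)     # noise from a standard normal distribution
    return mu + sigma * eps           # element-wise multiplication
```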
The Decoder: Reconstructing Data from Latent Space
Due to the random component \epsilon in the calculation of \textbf z , the latent vector \textbf z can be different each time it is calculated, even for the same input sample. The decoder in a VAE therefore learns to generate meaningful results (i.e., to reduce the reconstruction loss) for any vector \textbf z sampled from these Gaussian distributions. This is different from the decoders in traditional autoencoders, which only learn to reconstruct fixed latent vectors, and it is a very important property for generating new data. Once a VAE has been trained, we can sample D-dimensional latent vectors from a standard Gaussian distribution and the decoder can transform these vectors into new, realistic data.
The reconstruction loss combined with the regularization of latent space leads to a smooth well-defined latent space. This means that small changes in latent vectors are interpreted by the VAE decoder as small and coherent changes in the output data. The decoder can even generate meaningful results for a latent vector not seen in the training data by combining features from similar latent vectors nearby.
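Once trained, generation only needs the decoder and a standard Gaussian sampler. The sketch below again assumes PyTorch; VAEDecoder and its layer sizes are hypothetical, and in practice the decoder would of course be trained before sampling:

```python
import torch
import torch.nn as nn

class VAEDecoder(nn.Module):
    """Maps a latent vector z back to data space (illustrative sizes)."""
    def __init__(self, latent_dim=32, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

# After training, new data is generated by sampling latent vectors
# from a standard Gaussian and passing them through the decoder.
decoder = VAEDecoder()
z = torch.randn(16, 32)   # 16 latent vectors sampled from N(0, I)
samples = decoder(z)      # 16 generated outputs (random here, since untrained)
```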
The VAE Loss Function
The VAE optimizes a combined loss function to balance accurate reconstruction and a smooth latent space:
L = \text{Reconstruction Loss} + \beta \cdot \text{KL Divergence}
where \beta is a weighting factor that balances the two objectives.
Reconstruction Loss
Reconstruction loss measures how closely the reconstructed output \hat{\textbf{x}} matches the original input \textbf{x} . Common metrics include Mean Squared Error (MSE) for continuous data or Binary Cross-Entropy (BCE) for binary data.
For one input sample \textbf{x} and its reconstruction \hat{\textbf{x}} , MSE is calculated as:
MSE = \frac{1}{D_{\textbf x}} \sum_{d=1}^{D_{\textbf x}} (x_d - \hat{x}_d)^2
where:
- x_d is the d -th dimension of the input vector \textbf x .
- \hat{x}_d is the d -th dimension of the reconstructed vector \hat{\textbf x}.
- D_{\textbf x} is the number of dimensions in \textbf x .
KL Divergence Loss
This term measures how closely the learned latent space distribution q(\textbf{z}|\textbf{x}) matches a standard Gaussian distribution \mathcal{N}(\textbf{0}, \textbf{I}) . For one input sample \textbf{x} , it is calculated as:
D_{KL}(q(\textbf z| \textbf x) || \mathcal{N}_{\textbf z}(\textbf 0, \textbf I)) = \frac{1}{2} \sum_{d=1}^{D_{\textbf z}} \left( \mu_{d}^2 + \sigma_{d}^2 - \log(\sigma_{d}^2) - 1 \right)
where:
- \mu_d and \sigma_d^2 are the mean and variance of the d -th dimension of q(\textbf{z}|\textbf{x}) .
- D_{\textbf{z}} is the number of dimensions in the latent space.
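Putting the two terms together, the combined loss from the equations above can be sketched as follows (assuming PyTorch; mu and logvar are the encoder outputs, x_hat is the decoder output, and MSE is used for the reconstruction term):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    """Combined VAE loss: reconstruction term plus beta-weighted KL divergence.
    MSE is used here; BCE would be the usual choice for binary data."""
    # Reconstruction loss: mean squared error between input and reconstruction
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # KL divergence between q(z|x) = N(mu, sigma^2) and N(0, I), using the
    # closed-form expression above, averaged over the batch
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=1).mean()
    return recon + beta * kl
```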