Please note that this is an optional notebook that is meant to introduce more advanced concepts, if you’re up for a challenge. So, don’t worry if you don’t completely follow every step! We provide external resources for extra base knowledge required to grasp some components of the advanced material.
In this notebook, you’re going to learn about TGAN, from the paper Temporal Generative Adversarial Nets with Singular Value Clipping (Saito, Matsumoto, & Saito, 2017), and its origins in image generation. Here’s the quick version:
Two Generators: TGAN was the first video generation work to use two distinct generators, a temporal generator and an image generator. The temporal generator $G_t$ produces a sequence of temporal latent vectors $\vec{z}_t$, which the image generator $G_i$ then transforms into frames. Later works adopt similar approaches.
Created an Inception Score Benchmark: At the time, the most common quantitative comparison method was the Inception Score (IS). Calculating the IS for a GAN trained on ImageNet requires a pretrained Inception model. No comparable pretrained model existed for videos, so the authors proposed using a C3D model trained on the UCF101 dataset. With this pretrained model they established a common method for calculating IS for video generation.
Singular Value Clipping (SVC): To enforce a 1-Lipschitz constraint on the discriminator, the authors propose clipping the singular values of the weights in the convolutional and linear layers. Every 5 iterations they perform a Singular Value Decomposition of each weight matrix and apply the following update:
$\begin{gather}U \Sigma V^* = W \\ \Sigma_{ii} := \min(\Sigma_{ii}, 1) \\ W := U \Sigma V^* \end{gather}$
In their experiments they showed TGAN trained with SVC outperforms the normal GAN setup.
For this notebook, we will be focusing on the two generators. But first, some useful imports and commands:
!echo Installing Library to Display gifs:
!pip install moviepy
!echo Downloading pre-trained weights
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1mk9JdmJH79_vtQkl8zk-jDxa7xUXpck-' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1mk9JdmJH79_vtQkl8zk-jDxa7xUXpck-" -O state_normal81000.ckpt && rm -rf /tmp/cookies.txt
import torch
import torch.nn as nn
import numpy as np
from moviepy.editor import ImageSequenceClip
from IPython.display import Image
def genSamples(g, n=8):
    '''
    Generate an n by n grid of videos, given a generator g
    '''
    with torch.no_grad():
        s = g(torch.rand((n**2, 100), device='cuda')*2-1).cpu().detach().numpy()
    out = np.zeros((3, 16, 64*n, 64*n))
    for j in range(n):
        for k in range(n):
            out[:, :, 64*j:64*(j+1), 64*k:64*(k+1)] = s[j*n+k, :, :, :, :]
    out = out.transpose((1, 2, 3, 0))
    out = (out + 1) / 2 * 255
    out = out.astype(int)
    clip = ImageSequenceClip(list(out), fps=20)
    clip.write_gif('sample.gif', fps=20)
The first thing to note about video generation is that we are now generating tensors with an added dimension. While conventional image methods generate tensors in $\mathbb{R}^{C \times H \times W}$, we are now generating tensors in $\mathbb{R}^{T \times C \times H \times W}$.
To handle the extra time dimension, TGAN proposed generating temporal dynamics first, then generating images. Gordon and Parde (2020) have a visual that summarizes the generator’s process.

A latent vector $\vec{z}_c$ is sampled from a distribution. This vector is fed into a temporal generator $G_t$, which transforms it into a series of temporal latent vectors: $G_t:\vec{z}_c \mapsto \{\vec{z}_0, \vec{z}_1, \dots, \vec{z}_t\}$. From there, each temporal vector is joined with $\vec{z}_c$ and fed into an image generator $G_i$, which produces one frame per temporal vector. Once all frames are created, the last step is to concatenate them to form a video. This setup decouples the temporal dynamics from the image content.
Today we will be trying to represent the UCF101 dataset. This dataset is composed of 101 action classes. Below is a sample of real examples:

Here we will implement our temporal generator. It transforms one vector in $\mathbb{R}^{100}$ into 16 vectors in $\mathbb{R}^{100}$. TGAN uses a series of transposed 1D convolutions for this; we will discuss the limitations of this choice later.
class TemporalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Create a sequential model to turn one vector into 16
        self.model = nn.Sequential(
            nn.ConvTranspose1d(100, 512, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.ConvTranspose1d(512, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.ConvTranspose1d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 100, kernel_size=4, stride=2, padding=1),
            nn.Tanh()
        )
        # initialize weights according to paper
        self.model.apply(self.init_weights)

    def init_weights(self, m):
        if type(m) == nn.ConvTranspose1d:
            nn.init.xavier_uniform_(m.weight, gain=2**0.5)

    def forward(self, x):
        # reshape x to (N, 100, 1) so that it can have convolutions done
        x = x.view(-1, 100, 1)
        # apply the model and swap the channel and time axes: (N, 100, 16) -> (N, 16, 100)
        x = self.model(x).transpose(1, 2)
        return x
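As a quick sanity check on the layer sizes above, the following sketch traces the sequence length through the five transposed convolutions, using the output-length formula from the PyTorch `ConvTranspose1d` documentation (with dilation 1 and no output padding, as in the layers above). The helper name `conv_transpose_out_len` is ours and is not part of PyTorch.

```python
def conv_transpose_out_len(length, kernel_size, stride, padding):
    """Output length of a transposed convolution (dilation=1, output_padding=0)."""
    return (length - 1) * stride - 2 * padding + kernel_size

# (kernel_size, stride, padding) for each ConvTranspose1d in TemporalGenerator
temporal_layers = [(1, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

lengths = [1]  # the latent vector is viewed as a sequence of length 1
for kernel_size, stride, padding in temporal_layers:
    lengths.append(conv_transpose_out_len(lengths[-1], kernel_size, stride, padding))

print(lengths)  # [1, 1, 2, 4, 8, 16]
```

The first layer only changes the channel count, and each of the remaining four doubles the length, which is how a single vector becomes 16 temporal vectors.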
With our $\vec{z}_c$ generated, and our temporal vectors created, it is time to generate our individual images. The first step is to map the two vectors into appropriate sizes to be fed into a transposed 2D convolutional kernel. This is done by a linear transformation with a nonlinearity. Each newly transformed vector is reshaped to a tensor of $\mathbb{R}^{256 \times 4 \times 4}$. In this shape the two sets of vectors are concatenated across the channel dimension.
After the vectors are transformed, reshaped, and concatenated, it’s finally time for us to make the images! TGAN uses a generic image generator built from multiple transposed 2D convolutions. After several rounds of transposed convolution, batch normalization, and ReLU, the final two operations are a transposed convolution to 3 color channels and a $\tanh$ activation. Our last step is to reshape the output so that each video has color-channel, time, height, and width dimensions. We now have a video!
class VideoGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # instantiate the temporal generator
        self.temp = TemporalGenerator()
        # create a transformation for the temporal vectors
        self.fast = nn.Sequential(
            nn.Linear(100, 256 * 4**2, bias=False),
            nn.BatchNorm1d(256 * 4**2),
            nn.ReLU()
        )
        # create a transformation for the content vector
        self.slow = nn.Sequential(
            nn.Linear(100, 256 * 4**2, bias=False),
            nn.BatchNorm1d(256 * 4**2),
            nn.ReLU()
        )
        # define the image generator
        self.model = nn.Sequential(
            nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=3, stride=1, padding=1),
            nn.Tanh()
        )
        # initialize weights according to the paper
        self.fast.apply(self.init_weights)
        self.slow.apply(self.init_weights)
        self.model.apply(self.init_weights)

    def init_weights(self, m):
        if type(m) == nn.ConvTranspose2d or type(m) == nn.Linear:
            nn.init.uniform_(m.weight, a=-0.01, b=0.01)

    def forward(self, x):
        # pass our latent vector through the temporal generator and reshape
        z_fast = self.temp(x).contiguous()
        z_fast = z_fast.view(-1, 100)
        # transform the content and temporal vectors
        z_fast = self.fast(z_fast).view(-1, 256, 4, 4)
        z_slow = self.slow(x).view(-1, 256, 4, 4).unsqueeze(1)
        # after z_slow is transformed and expanded we can duplicate it
        z_slow = torch.cat([z_slow]*16, dim=1).view(-1, 256, 4, 4)
        # concatenate the temporal and content vectors
        z = torch.cat([z_slow, z_fast], dim=1)
        # transform into image frames
        out = self.model(z)
        return out.view(-1, 16, 3, 64, 64).transpose(1, 2)
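The duplication of `z_slow` in `forward` relies on the flattened batch being ordered sample by sample, with the 16 frames of each sample adjacent. The sketch below mimics that bookkeeping with plain Python lists standing in for tensors (the labels and the small sizes are made up for illustration) and shows that every frame is paired with the content vector of its own sample.

```python
num_samples, num_frames = 2, 3  # toy sizes; the notebook uses 16 frames

# stand-in for z_fast after .view(-1, 100): one row per (sample, frame), sample-major
z_fast = [f"motion[{i}][{t}]" for i in range(num_samples) for t in range(num_frames)]

# stand-in for z_slow: one row per sample, repeated once per frame and then
# flattened in the same sample-major order, like unsqueeze(1) + cat + view
z_slow = [f"content[{i}]" for i in range(num_samples)]
z_slow_repeated = [row for row in z_slow for _ in range(num_frames)]

# the channel-wise concatenation pairs rows by position
pairs = list(zip(z_slow_repeated, z_fast))
for content, motion in pairs:
    print(content, "+", motion)
```

If the two tensors were flattened in different orders, frames would silently be combined with another sample's content vector, so this ordering is worth checking whenever the reshapes are changed.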
We’re no longer operating on images, so we need to rethink our discriminator. 2D convolutions alone won’t work because of the time dimension, so what should we do? TGAN proposes a discriminator composed of a series of 3D convolutions followed by a single 2D convolution. For each video it produces a single scalar score.
class VideoDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.model3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=4, padding=1, stride=2),
            nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=4, padding=1, stride=2),
            nn.BatchNorm3d(128),
            nn.LeakyReLU(0.2),
            nn.Conv3d(128, 256, kernel_size=4, padding=1, stride=2),
            nn.BatchNorm3d(256),
            nn.LeakyReLU(0.2),
            nn.Conv3d(256, 512, kernel_size=4, padding=1, stride=2),
            nn.BatchNorm3d(512),
            nn.LeakyReLU(0.2)
        )
        self.conv2d = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=0)
        # initialize weights according to paper
        self.model3d.apply(self.init_weights)
        self.init_weights(self.conv2d)

    def init_weights(self, m):
        if type(m) == nn.Conv3d or type(m) == nn.Conv2d:
            nn.init.xavier_normal_(m.weight, gain=2**0.5)

    def forward(self, x):
        h = self.model3d(x)
        # the time axis now has length 1, so drop it: (N, 512, 1, 4, 4) -> (N, 512, 4, 4)
        # (-1 infers the batch size instead of hard-coding 32)
        h = torch.reshape(h, (-1, 512, 4, 4))
        h = self.conv2d(h)
        return h
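To see why the 3D features can be handed to a 2D convolution, the sketch below traces the time and spatial sizes through the four `Conv3d` layers using the standard convolution output-size formula (dilation 1). It assumes the 16-frame, 64 by 64 videos produced by the generator above, and `conv_out_len` is a hypothetical helper name.

```python
def conv_out_len(length, kernel_size, stride, padding):
    """Output size of a convolution along one axis (dilation=1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

frames, side = 16, 64  # time and height/width of one input video
trace = [(frames, side)]
for _ in range(4):  # four Conv3d layers, each with kernel 4, stride 2, padding 1
    frames = conv_out_len(frames, kernel_size=4, stride=2, padding=1)
    side = conv_out_len(side, kernel_size=4, stride=2, padding=1)
    trace.append((frames, side))

print(trace)  # [(16, 64), (8, 32), (4, 16), (2, 8), (1, 4)]

# the final Conv2d (kernel 4, stride 1, padding 0) reduces 4x4 to a single value
final_side = conv_out_len(side, kernel_size=4, stride=1, padding=0)
print(final_side)  # 1
```

Because the time axis has collapsed to length 1 after the fourth layer, dropping it loses no information, and the 4 by 4 kernel of the last layer turns each video into one score.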
Once our discriminator has scored some samples, the resulting scalars are used in the WGAN formulation (you’ll learn more about this next week!):
$$\operatorname*{argmax}_D \operatorname*{argmin}_G \; \mathbb{E}_{x\sim \mathbb{P}_r}[D(x)]-\mathbb{E}_{z\sim p(z)}[D(G(z))]$$
During training this looks like the following.
# update discriminator
# clear gradients left over from the previous generator update
disOpt.zero_grad()
pr = dis(real)
fake = gen(torch.rand((batch_size, 100), device='cuda')*2-1)
pf = dis(fake)
dis_loss = torch.mean(-pr) + torch.mean(pf)
dis_loss.backward()
disOpt.step()
# update generator
genOpt.zero_grad()
fake = gen(torch.rand((batch_size, 100), device='cuda')*2-1)
pf = dis(fake)
gen_loss = torch.mean(-pf)
gen_loss.backward()
genOpt.step()
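To make the sign conventions concrete, here is a small pure-Python sketch of the two losses computed from made-up critic scores. The numbers are illustrative only and `mean` is a local helper; in the notebook the same quantities come from `torch.mean` applied to the discriminator outputs.

```python
def mean(values):
    return sum(values) / len(values)

# hypothetical critic scores for a batch of real and generated videos
real_scores = [2.0, 1.0, 3.0]
fake_scores = [-1.0, 0.5, 2.0]

# the discriminator wants real scores high and fake scores low
dis_loss = mean([-s for s in real_scores]) + mean(fake_scores)

# the generator wants its samples to receive high scores
gen_loss = mean([-s for s in fake_scores])

print(dis_loss)  # -1.5
print(gen_loss)  # -0.5
```

Minimizing `dis_loss` widens the gap between real and fake scores, while minimizing `gen_loss` pushes the fake scores up, which matches the two sides of the objective above.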
This model took 16 hours to train on an RTX 2080 Ti, so we’ll use a pretrained version to explore the results.
Note: Make sure to use a GPU runtime!
# instantiate the generator, load the weights, and create a sample
gen = VideoGenerator().cuda()
gen.load_state_dict(torch.load('state_normal81000.ckpt')['model_state_dict'][0])
genSamples(gen)
# Run this cell to see results!
Image(open('sample.gif', 'rb').read())

Your first thought is most likely that these results are less than spectacular. Video generation is not yet anywhere near the success of image models such as StyleGAN. Surprisingly, though, these samples come from the state-of-the-art model in 64 by 64 pixel video generation. As of right now, the results are unpublished, but the model holds the highest average Inception Score, 14.74, calculated over 10 runs of 2048 samples, with the next best being 13.62. In the original TGAN paper the model achieved 11.85. The quantitative and qualitative results open up a lot of discussion about this problem. What could cause such extreme variation in training results? What is holding video generation back from reaching our qualitative standards?
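For reference, the Inception Score is the exponential of the average KL divergence between each sample's predicted class distribution and the marginal class distribution over all samples. The sketch below computes it from made-up probability vectors in pure Python. In the video setting described earlier the probabilities would come from the pretrained C3D classifier, which is not loaded here, and `inception_score` is our own helper name.

```python
import math

def inception_score(probs):
    """probs: list of per-sample class-probability lists, each summing to 1."""
    num_samples = len(probs)
    num_classes = len(probs[0])
    # marginal class distribution p(y), averaged over all samples
    marginal = [sum(p[c] for p in probs) / num_samples for c in range(num_classes)]
    # average KL(p(y|x) || p(y)) over samples, skipping zero-probability terms
    total_kl = 0.0
    for p in probs:
        total_kl += sum(p_c * math.log(p_c / m_c)
                        for p_c, m_c in zip(p, marginal) if p_c > 0)
    return math.exp(total_kl / num_samples)

# confident and diverse predictions score close to the number of classes
confident = [[1.0, 0.0], [0.0, 1.0]]
# uninformative predictions score 1, the minimum
uniform = [[0.5, 0.5], [0.5, 0.5]]

print(inception_score(confident))
print(inception_score(uniform))
```

With 101 classes in UCF101, the score is bounded above by 101, which gives some context for the values quoted above.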
One of the first limitations of this paper is that the temporal generator relies on transposed 1D convolutions. This design doesn’t fully match how we as humans understand time. Later works take a recurrent approach: MoCoGAN uses a recurrent network (a GRU), and TGANv2 uses a convolutional LSTM. A pre-registered paper even proposed using neural differential equations for the temporal generator. To see how the field has progressed, here is a brief chronology:
Another development has been the inclusion of Fréchet Inception Distance (FID) scores to benchmark the models. While there is not yet a perfect way to quantify GAN performance, FID has some benefits over IS. The main one is that it compares the synthetic data distribution to the real data distribution. An added bonus is that you can also use the same C3D model by selecting a certain feature layer!
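For reference, FID models the features of the real and generated samples as Gaussians and measures the Fréchet distance between them. Writing $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ for the feature mean and covariance of the real and generated data (our notation, not the notebook’s), the standard definition is:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

Lower is better, and the score is zero when the two Gaussians coincide.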
Now you’ve seen the primary changes, and you understand the current state of the art in 64 by 64 pixel video generation: TGAN. Congratulations!
SVC worked well in the original TGAN paper, and its improvements have been replicated. Constraining the discriminator to be a 1-Lipschitz function stabilizes training. The following graph compares the Inception Scores over the course of training for TGAN trained with and without SVC.
To enforce the 1-Lipschitz constraint on the discriminator, certain parameters must be altered during training. The TGAN paper includes a helpful figure that explains which parameters to constrain and how.
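To see why clipping singular values is enough for a linear layer, recall that the largest singular value $\sigma_{\max}(W)$ of a weight matrix is its operator norm with respect to the Euclidean norm. A short derivation, in our own notation:

$$\lVert Wx - Wy \rVert_2 = \lVert W(x - y) \rVert_2 \le \sigma_{\max}(W)\,\lVert x - y \rVert_2$$

After the update $\Sigma_{ii} := \min(\Sigma_{ii}, 1)$ we have $\sigma_{\max}(W) \le 1$, so the layer is 1-Lipschitz. A composition of 1-Lipschitz layers is itself 1-Lipschitz, which is why the constraint is applied layer by layer. For convolutional layers, this notebook applies the same clipping to the kernel reshaped into a matrix.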
The following code/pseudocode explains how to do this within native PyTorch.
def singular_value_clip(w):
    dim = w.shape
    # reshape into matrix if not already MxN
    if len(dim) > 2:
        w = w.reshape(dim[0], -1)
    u, s, v = torch.svd(w, some=True)
    # clip every singular value to at most 1
    s[s > 1] = 1
    return (u @ torch.diag(s) @ v.t()).view(dim)

for iteration in range(steps):
    # update generator and discriminator weights
    # enforce 1-Lipschitz
    if iteration % 5 == 0:
        for module in list(dis.model3d.children()) + [dis.conv2d]:
            if type(module) == nn.Conv3d or type(module) == nn.Conv2d:
                # operate on .data so the SVD is not tracked by autograd
                module.weight.data = singular_value_clip(module.weight.data)
            elif type(module) == nn.BatchNorm3d:
                gamma = module.weight.data
                std = torch.sqrt(module.running_var)
                # keep gamma within [0.01 * std, std]
                gamma[gamma > std] = std[gamma > std]
                gamma[gamma < 0.01 * std] = 0.01 * std[gamma < 0.01 * std]
                module.weight.data = gamma