
 

Programming Project #5 (proj5)

COMPSCI 180 Intro to Computer Vision and Computational Photography

Chuyan Zhou

This webpage uses the Typora Academic theme for markdown files.

Part A

0. Setup

2 Stages

We first use 3 text prompts and let the model generate output images. The generated images and their captions are displayed below for different numbers of inference steps:

Reflection on the generation

We find that with 5 inference steps the outputs are not very clear; specifically, the added noise is not completely removed. Many noisy dots remain in the generated images, and the generated features are also indistinct.

For 20 steps, the noise is removed and the generated images start to look decent; they are quite close to the text prompts.

For 100 steps, the generated images are quite clear, the features are well formed, and the results match the text prompts even more closely.

Seed

We use the seed SEED=42 throughout this part of the project.

1. Sampling Loops

1.1 Implementing the Forward Process

A key part of diffusion is the forward process, which takes a clean image and adds noise to it.

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right) \tag{1}$$

which is equivalent to the following equation giving $x_t$:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

That is, given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$ by sampling from a Gaussian with mean $\sqrt{\bar\alpha_t}\,x_0$ and variance $(1-\bar\alpha_t)$. Note that the forward process is not just adding noise -- we also scale the image by $\sqrt{\bar\alpha_t}$ and scale the noise by $\sqrt{1-\bar\alpha_t}$. The cumulative product $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ is equivalent to iteratively adding noise with the scheduled $\alpha_t$'s. $\bar\alpha_t$ is close to 1 for small $t$ and close to 0 for large $t$.
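For concreteness, here is a minimal PyTorch sketch of the forward process in equation (2). The name `alphas_cumprod` (a 1-D tensor holding the $\bar\alpha_t$ values) and the batched tensor shapes are assumptions for this sketch, not the exact variables of the actual pipeline code.

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add scaled Gaussian noise (eq. 2).

    x0:              clean image tensor of shape (B, C, H, W)
    t:               timestep (int, or a tensor with one timestep per batch element)
    alphas_cumprod:  1-D tensor of cumulative products alpha_bar_t
    """
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over (C, H, W)
    eps = torch.randn_like(x0)                        # epsilon ~ N(0, I)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
    return xt, eps
```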

We run the forward process on the test image with $t \in \{250, 500, 750\}$. The noisy images at the different timesteps are shown below:

Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

 

1.2 Classical Denoising

Starting from the noisy images above at different timesteps, we try Gaussian blur filtering to denoise them. For $t \in \{250, 500, 750\}$, the kernel sizes are 3, 5, 7 and the sigmas are 1.5, 2.5, 3.5, respectively. The results of this classical denoising approach are shown below.

Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Blur Denoising at t=250
Gaussian Blur Denoising at t=500
Gaussian Blur Denoising at t=750

 

We can see that the Gaussian filters denoise the images poorly: the original noise is not eliminated, while the main features and shape of the Campanile are blurred.

1.3 One-step Denoising

Now, we try to recover $x_0$ from $x_t$ using a UNet, where $t \in \{250, 500, 750\}$. The UNet is not used to predict $x_0$ directly, but to predict the added noise $\epsilon$. We denote the noise-predicting model (UNet) as $\epsilon_\theta(x_t, t)$, which is also conditioned on the timestep $t$.

Here, an estimate of $x_0$ can be obtained directly by inverting the forward equation (2) above, which gives one-step denoising:

$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\,x_t - \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\,\epsilon_\theta(x_t, t) \tag{2.1}$$

where $\epsilon_\theta$ is the UNet noise predictor.
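A minimal sketch of this inversion, under the same assumed `alphas_cumprod` tensor as in the forward-process sketch above:

```python
import torch

def one_step_denoise(xt, t, eps_pred, alphas_cumprod):
    """Equation (2.1): recover the clean-image estimate x_0 from x_t and the
    noise predicted by the UNet."""
    alpha_bar = alphas_cumprod[t]
    return (xt - torch.sqrt(1 - alpha_bar) * eps_pred) / torch.sqrt(alpha_bar)
```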

The one-step denoising results (the original image, the noisy image, and the estimate of the original image) are shown below.

Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile at t=250
One-Step Denoised Campanile at t=500
One-Step Denoised Campanile at t=750

One-step denoising in 1.3 performs much better than classical denoising. But as $t$ grows larger, the results still degrade: the denoised image becomes blurry.

1.4 Iterative Denoising

Iterative denoising, as used in diffusion models, addresses the problem in 1.3 that the denoised image blurs for larger $t$. Mathematically, the one-step equation would be equivalent to the iterative scheme if the model were perfect; in practice, for a model with limited capacity, the iterative scheme works better because it breaks the task into smaller, easier steps.

The formula for iterative denoising, which estimates the previous forward-process step (i.e., the next step in denoising), is

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\,x_t + v_\sigma \tag{3}$$

where

- $x_t$ is the image at the current (noisier) timestep $t$, and $x_{t'}$ is the image at the previous timestep $t' < t$ (less noisy);
- $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$;
- $x_0$ is the current estimate of the clean image, obtained from formula (2.1);
- $v_\sigma$ is random noise scaled according to the predicted variance.

Given $x_t$ from the last step and $x_0$ predicted by formula (2.1) in 1.3, we can compute $x_{t'}$ from formula (3). In this project, we start at $t = 990$ with a stride of 30, so the model skips 30 timesteps per iteration and finally arrives at $t = 0$, i.e. the clean image. The denoising results are shown below.
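For reference, a minimal PyTorch sketch of a single update of formula (3) follows. The `alphas_cumprod` tensor is the same assumption as above, and the variance term $v_\sigma$ is approximated here with $\sqrt{\beta_t}$-scaled Gaussian noise, which is an assumption of this sketch rather than the exact variance computation used in the project.

```python
import torch

def iterative_denoise_step(xt, x0_est, t, t_prev, alphas_cumprod):
    """One update of formula (3): move from the noisier x_t to the cleaner x_{t'}.
    x0_est is the clean-image estimate from (2.1) at the current step."""
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    # Variance term v_sigma (approximated; dropped at the final step).
    v_sigma = torch.sqrt(beta_t) * torch.randn_like(xt) if t_prev > 0 else 0.0
    x_prev = (torch.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_est \
           + (torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * xt \
           + v_sigma
    return x_prev
```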

Denoised to t=690
Denoised to t=540
Denoised to t=390
Denoised to t=240
Denoised to t=90
Original
Iteratively Denoised
One-step denoised
Gaussian Blurred

 

1.5 Diffusion Model Sampling

In this part, we explore another important use of diffusion models besides denoising: sampling from the real-image manifold. We feed randomly generated Gaussian noise into the iterative denoising function, using the prompt "a high quality photo" as a "null" prompt so that the model effectively performs unconditional generation.

Here are 5 images sampled from the "null" prompt:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

These images are reasonable, but not particularly sharp or impressive. We can improve on this with CFG in the next section.

1.6 Classifier-free Guidance

For a noise image (or, more generally, any input image), generation is conditioned on some prompt. For the same input, the model can also produce an unconditional noise estimate, denoted $\epsilon_u$, alongside the usual prompt-conditioned estimate, denoted $\epsilon_c$. Note that we use a truly empty prompt to generate $\epsilon_u$, not the "null" prompt mentioned above; in this setting the "null" prompt actually serves as the conditioning, even though the outer task is unconditional generation.

The combined noise estimate is then expressed as

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) = \gamma\,\epsilon_c + (1-\gamma)\,\epsilon_u \tag{4}$$

where $\gamma$ is the guidance scale, which we set to $\gamma = 7$ in this project.

Intuitively, this is a guidance: a push $(\epsilon_c - \epsilon_u)$ from the unconditional point on the manifold toward the conditional point, which makes the image exhibit more of the conditioned attribute. For example, with a dog as the conditioning, this push makes the image resemble a dog more, i.e. have more "dog-ness".

If we set $\gamma = 1$, the push is equivalent to that of the previous section, which we saw is not very effective. If $\gamma > 1$, the push is amplified, which is exactly what CFG does.
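As a rough illustration, here is how the CFG combination in (4) might look in code. The diffusers-style UNet call (keyword `encoder_hidden_states`, `.sample` output) and the embedding variable names are assumptions for this sketch and may not match the exact project code.

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(unet, xt, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Equation (4): combine conditional and unconditional noise predictions.

    cond_embeds:   text embedding of the conditioning prompt (epsilon_c pass)
    uncond_embeds: text embedding of a truly empty prompt (epsilon_u pass)
    """
    eps_c = unet(xt, t, encoder_hidden_states=cond_embeds).sample
    eps_u = unet(xt, t, encoder_hidden_states=uncond_embeds).sample
    return eps_u + gamma * (eps_c - eps_u)   # = gamma * eps_c + (1 - gamma) * eps_u
```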

Here are 5 images sampled from the "null" prompt, with CFG at scale $\gamma = 7$:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

The resulting images are much better.

1.7 Image-to-Image Translation

In this part, we follow the SDEdit algorithm to transform an input image into another image under some conditioning. This is done by feeding our input image into the iterative denoising pipeline, starting from a chosen forward timestep $t$ (or, equivalently, an index i_start into the strided timesteps, a.k.a. the noise level). $t$ acts as the claimed amount of noise present in the input, i.e. how much "noise" the model should "remove" to recover "the original image". The smaller the noise-level index, the larger the corresponding $t$, and the more the image is altered (edited).

We use the given noise levels [1, 3, 5, 7, 10, 20] and the "null" prompt, i.e. "a high quality photo", as the conditioning. Results are shown below:

Result 1: Berkeley Campanile

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Berkeley Campanile

Result 2: Self-selected image 1: kusa.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
kusa.png

Result 3: Self-selected image 2: pien.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
pien.png

1.7.1 Editing Hand-Drawn and Web Images

As above, we pick an image from the web and two hand-drawn images and feed them into the translation pipeline.

Result 1: Web image

Web Image
Web Image
Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Web Image resized

Result 2: Hand-drawn image 1: A Cruise

Hand-drawn Image 1
Hand-drawn Image 1: A Cruise
Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Hand-drawn Image 1 resized

Result 3: Hand-drawn image 2: A Lemon

Hand-drawn Image 2
Hand-drawn Image 2: A Lemon

 

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Hand-drawn Image 2 resized

1.7.2 Inpainting

Now, we implement a hole-filling (inpainting) algorithm using the same iterative denoising pipeline together with a mask $m$ on the input image. Mask values in $m$ are set to 1 for pixels to be inpainted and 0 for the rest (the known pixels). The mask is used to composite each intermediate result with the noised original image. As before, we start from Gaussian noise, and we keep the original image as $x_{orig}$. Then, every time we iteratively denoise from $t$ to $t'$, we apply the following step according to this paper:

$$x_t \leftarrow m \odot x_t + (1-m) \odot \text{forward}(x_{orig}, t) \tag{5}$$

where $\odot$ is element-wise multiplication. This keeps the denoised (inpainted) pixels inside the hole while forcing the known pixels to agree with the appropriately noised original image. The results are shown below.
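A minimal sketch of the compositing step in (5), again assuming an `alphas_cumprod` tensor of $\bar\alpha_t$ values as in the earlier sketches:

```python
import torch

def inpaint_composite(xt, x_orig, mask, t, alphas_cumprod):
    """Equation (5): keep denoised pixels inside the hole (mask == 1) and
    re-impose appropriately noised original pixels in the known region."""
    alpha_bar = alphas_cumprod[t]
    noised_orig = torch.sqrt(alpha_bar) * x_orig \
                + torch.sqrt(1 - alpha_bar) * torch.randn_like(x_orig)  # forward(x_orig, t)
    return mask * xt + (1 - mask) * noised_orig
```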

Result 1: Berkeley Campanile

Original
Mask
Hole
Inpainted

Result 2: Self-selected image 1 (Pagoda)

Original
Mask
Hole
Inpainted

Result 3: Self-selected image 2 (Pien)

Original
Mask
Hole
Inpainted

1.7.3 Text-Conditional Image-to-image Translation

In this part, we do the same as in 1.7 and 1.7.1, but we use a text prompt as the conditioning. The text prompt is "a rocket ship". The results are shown below.

Result 1: Berkeley Campanile

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Berkeley Campanile

Result 2: Self-selected image 1: kusa.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
kusa.png

Result 3: Self-selected image 2: pien.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
pien.png

1.8 Visual Anagrams

In this part, we use the iterative denoising pipeline to generate visual anagrams (following this research): an image that shows one thing when viewed normally and a different thing when viewed upside down.

We implement this by modifying the noise estimate. One estimate is computed from the current noisy image $x_t$ with the first prompt $p_1$; another is computed from the flipped image $\text{flip}(x_t)$ with the second prompt $p_2$. The second estimate is then flipped back so that it is aligned with the upright view. Finally, the two estimates are averaged to give the noise estimate we use. The process can be expressed as:

$$\begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}\!\left(\text{UNet}(\text{flip}(x_t), t, p_2)\right) \\ \epsilon &= \frac{\epsilon_1 + \epsilon_2}{2} \end{aligned} \tag{6}$$
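A sketch of the modified noise estimate in (6). The UNet call signature is the same diffusers-style assumption as in the CFG sketch above, and CFG, if applied, is omitted for brevity.

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(unet, xt, t, p1_embeds, p2_embeds):
    """Equation (6): average the upright estimate for prompt p1 with the
    flipped-back estimate for prompt p2 on the vertically flipped image."""
    eps1 = unet(xt, t, encoder_hidden_states=p1_embeds).sample
    eps2 = unet(torch.flip(xt, dims=[-2]), t, encoder_hidden_states=p2_embeds).sample
    eps2 = torch.flip(eps2, dims=[-2])   # re-align with the upright orientation
    return (eps1 + eps2) / 2
```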

The results are shown below.

Result 1

Ordinary: an oil painting of people around a campfire
Flipped: an oil painting of an old man

Result 2

Ordinary: a lithograph of waterfalls
Flipped: a photo of a dog

Result 3

Ordinary: an oil painting of a snowy mountain village
Flipped: a photo of a hipster barista

1.9 Hybrid Images

In this section, we perform hybrid image generation: creating an image that shows one thing at low frequencies (from far away, or when blurred) and another at high frequencies (up close), based on this paper (Factorized Diffusion). We estimate the noise with these formulas:

$$\begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \end{aligned} \tag{7}$$

where $f_{\text{lowpass}}$ and $f_{\text{highpass}}$ are the low-pass and high-pass filters, respectively.

For the low-pass filter we use a Gaussian filter with kernel size 33 and sigma 2, as recommended in the project spec; the high-pass filter is the difference between the original image and its low-pass-filtered version, i.e. identity minus the low-pass filter. The results are shown below.
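A sketch of the noise estimate in (7) with the low-pass/high-pass split, using the kernel size and sigma above. The UNet call signature and embedding names are the same assumptions as in the earlier sketches.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(unet, xt, t, p_low_embeds, p_high_embeds, ksize=33, sigma=2.0):
    """Equation (7): keep low frequencies of the noise for the first prompt and
    high frequencies of the noise for the second prompt."""
    eps1 = unet(xt, t, encoder_hidden_states=p_low_embeds).sample
    eps2 = unet(xt, t, encoder_hidden_states=p_high_embeds).sample
    low = TF.gaussian_blur(eps1, kernel_size=ksize, sigma=sigma)            # f_lowpass
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=ksize, sigma=sigma)    # f_highpass = id - LP
    return low + high
```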

I used the text encoder instead of the provided .pth embeddings to obtain embeddings for my custom prompts in Results 2 and 3.

Result 1

Low pass: a lithograph of a skull
High pass: a lithograph of waterfalls

Hybrid Image of a skull and waterfalls

Result 2

Low pass: a salmon sushi nigiri
High pass: a sitting orange cat with a white belly

Hybrid Image of a salmon sushi nigiri and a cat

Result 3

Low pass: a photo of the Ayers rock
High pass: a photo of a dog lying on stomach

Hybrid Image of the Ayers rock and a dog

 

2. Bells & Whistles

2.1 A logo for the CS180 course

I designed a logo for this course, CS180, using stage 1 of the model above, and upsampled it to a higher resolution using stage 2 of the model.

The logo is a pixelated bear holding a camera, ready to take a photo.

The logo is shown below:

CS180 Logo

Part B

1. Training a Single-Step Denoising UNet

Given a noisy image $z$, we want to train a UNet denoiser $D_\theta$ that maps $z$ to a clean image $x$. An L2 loss is used for this training (and throughout Part B):

$$L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2 \tag{8}$$

1.1 Implementing the UNet

We implement an unconditional UNet following the computation graph above, where the operation blocks are defined as in the accompanying diagrams.

1.2 Using the UNet to Train a Denoiser

To train the unconditional UNet denoiser, we dynamically generate $(z, x)$ pairs (rather than using pre-computed noise) from clean images in the training data. The clean image drawn from the training data is $x$, and

$$z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{9}$$
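A minimal sketch of this on-the-fly pair generation, applied to each clean training batch:

```python
import torch

def make_noisy_pair(x, sigma=0.5):
    """Equation (9): build a (z, x) training pair by adding sigma-scaled
    Gaussian noise to a clean image batch x, generated freshly every step."""
    z = x + sigma * torch.randn_like(x)
    return z, x
```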

We show varying levels of noise on MNIST digits, with $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$.

Varying noise levels on MNIST digits

1.2.1 Training

Now, we train the denoiser with $\sigma = 0.5$, batch size 256, 128 hidden channels ($D = 128$, where $D$ is the hidden dimension in the computation graph above), and an Adam optimizer with a learning rate of 1e-4 for 5 epochs.

The training loss curve is shown below.

Training Loss Curve

We visualize denoised results on the test set at the end of training.

Results on digits from the test set after 1 epoch of training
Results on digits from the test set after 5 epochs of training

1.2.2 OOD Testing

Although the denoiser was trained with $\sigma = 0.5$, we can also perform out-of-distribution testing over a range of noise levels, $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$.

Results on digits from the test set with varying noise levels

2. Training a Diffusion Model

Now we implement DDPM. We want the UNet to predict the noise instead of the clean image, i.e. the model is $\epsilon_\theta$ and the loss is

$$L = \mathbb{E}_{\epsilon,z}\,\|\epsilon_\theta(z) - \epsilon\|^2 \tag{10}$$

From (2) we know

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

for a certain timestep $t \in \{0, \dots, T\}$ in the noise-adding (forward) process. Because the noise level now varies, we condition the model on $t$ so that a single model can handle all noise levels. The time-conditional diffusion model has the following computation graph:

where the FCBlock is

In DDPM, we also have a noise schedule, i.e. a list of $\beta_t$, $\alpha_t$, and $\bar\alpha_t$, related by $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$.

2.1 Adding Time Conditioning to UNet

We add the encoded time conditioning, via broadcasting, to the outputs of the first UpBlock and the Unflatten layer, as shown in the graph above.

Now, the objective with time conditioning is

$$L = \mathbb{E}_{\epsilon, x_0, t}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2 \tag{11}$$

where $x_t$ is produced by (2).

2.2 Training the Time-Conditional DDPM

The training algorithm is as follows:

In our implementation, we train the DDPM on MNIST (as in the parts below) with batch size 128, 20 epochs, $D = 64$, and an Adam optimizer with an initial learning rate of 1e-3. An exponential LR decay scheduler with gamma $0.1^{1/\text{n\_epochs}}$ is also used. Also, $t$ is always normalized before being fed to the model.
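As a rough sketch of one training epoch under objective (11): the `unet(xt, t_normalized)` call signature, the choice `T = 300`, and an `alphas_cumprod` tensor with $T+1$ entries (indexed 0 to $T$) are assumptions here, not necessarily the exact values and interfaces used in the project.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(unet, loader, opt, alphas_cumprod, T=300, device="cuda"):
    """One epoch of time-conditional DDPM training: sample t, noise the batch
    with equation (2), and regress the added noise with the L2 loss (11)."""
    unet.train()
    for x0, _ in loader:                                # MNIST labels are unused here
        x0 = x0.to(device)
        t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
        alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x0)
        xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
        eps_pred = unet(xt, t.float() / T)              # t is normalized before conditioning
        loss = F.mse_loss(eps_pred, eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```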

The training loss curve is shown below.

Training Loss Curve

2.3 Sampling from the Time-Conditional DDPM

Following the sampling algorithm of DDPM as follows:

we can now sample from the model. We show sampling results after the 5th and 20th epoch.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

2.4 Adding Class-Conditioning to UNet

We want the DDPM to generate images of a specified class. To modify the UNet architecture, we add 2 more FCBlocks and feed both with one-hot class vectors, which are masked to 0 with probability $p_{\text{uncond}} = 0.1$ so that the model retains the ability to generate unconditionally.

At the points where the time conditioning is added, we first multiply the hidden activations element-wise by the outputs of the class-conditioning FCBlocks, then add the time conditioning.
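A minimal sketch of this conditioning at one injection point; the tensor shapes are assumptions for illustration.

```python
def modulate(h, t_embed, c_embed):
    """Scale the hidden activations by the class-FCBlock output, then add the
    time-FCBlock output.

    h:       hidden feature map, shape (B, D, H, W)
    t_embed: time-conditioning FCBlock output, shape (B, D)
    c_embed: class-conditioning FCBlock output, shape (B, D); rows are zeroed
             with probability p_uncond during training (unconditional dropout)
    """
    B, D = c_embed.shape
    return c_embed.view(B, D, 1, 1) * h + t_embed.view(B, D, 1, 1)
```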

We use the same set of hyperparameters as in 2.2. The class-conditional training algorithm is as follows:

The training loss curve is shown below.

Training Loss Curve

2.5 Sampling from the Class-Conditional DDPM

With class conditioning, we also use the classifier-free guidance described in Part A. We use CFG with guidance scale $\gamma = 5.0$ for this part; the sampling algorithm is shown below, where $\epsilon_u$ is the unconditional predicted noise and $\epsilon_c$ is the conditional one.

The sampling results are shown below. We can see that the class conditioning is followed very well.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

3. Bells & Whistles: Improving Time-conditional UNet Architecture

For ease of explanation and implementation, our UNet architecture above is pretty simple.

I added skip connections (shortcuts) to ConvBlock, DownBlock, and UpBlock: a plain convolution of the block input (serving as the residual, a.k.a. the "identity" path) is added to the block's output. We train with the same set of hyperparameters as in 2.2.
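A sketch of the modified ConvBlock; the exact layers in the main branch (the conv/BatchNorm/GELU stack) follow the standard project UNet and are an assumption of this sketch.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """ConvBlock with a shortcut: a plain convolution of the input (the
    'identity'/residual path) is added to the output of the main branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # plain conv as the shortcut

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```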

The improved UNet can achieve a better test loss (0.02820390514746497) than the original (0.02956294636183147).

The training loss curve is shown below.

Training Loss Curve

The sampling results are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

4. Bells & Whistles: Rectified Flow

Instead of DDPM, we now implement a newer state-of-the-art framework: Rectified Flow.

4.1 Overview

Rectified Flow (RF) is a generative modeling method that transports data from a source distribution $\pi_0$ (here the pure Gaussian $\pi_0 = \mathcal{N}(0, I)$) to a target distribution $\pi_1$, the distribution of clean images.

The overall objective is to align the velocity estimate (now produced by the UNet, denoted $v_\theta$) with the actual velocity between the source image $X_0$ and the target image $X_1$. First, the timesteps here are normalized to $t \in [0, 1]$ instead of ranging over $\{0, \dots, T\}$.

In the general case with $t \in [0, T]$, the velocity is $\frac{X_1 - X_0}{T}$ while the displacement is $X_1 - X_0$. With normalized time ($T = 1$), the two quantities are numerically equal.

We define the interpolation path as

$$X_t = t\,X_1 + (1-t)\,X_0, \quad t \in [0, 1] \tag{12}$$

where $X_0 \sim \pi_0$ and $X_1 \sim \pi_1$. Then, the time-conditional objective is

$$\min_\theta \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t)\right\|^2\right] dt \tag{13.1}$$

and we can also add class conditioning, where $X_1 \sim \pi_1 \mid c$:

$$\min_\theta \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t, c)\right\|^2\right] dt. \tag{13.2}$$

In other words, we want the learned transport path to be as straight as possible.

4.2 Training

For an RF, the objective is as listed above: it is minimized over the whole dataset and over all times.

However, neither the integral over $t$ from 0 to 1 nor the expectation inside it can be computed exactly; both must be estimated by sampling.

For a time-conditional RF, using the Monte Carlo method, we can estimate the objective (loss) by

$$L = \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t)\right\|^2\right] dt \approx \frac{1}{n}\sum_{i=1}^{n}\left\|x_1^{(i)} - x_0^{(i)} - v_\theta\!\left(x_t^{(i)}, t^{(i)}\right)\right\|^2 \tag{14.1}$$

where $x_1^{(i)}$ is a data point (a clean image) drawn from the target distribution (the training dataset), $x_0^{(i)}$ is dynamically generated Gaussian noise (i.e. drawn from the source distribution), and $x_t^{(i)}$ is the interpolation at a timestep $t^{(i)}$ sampled from a distribution over timesteps. The distribution of $t$ can be discrete uniform on $\{0, \dots, T\}$, continuous uniform, or a nonlinear schedule such as a sigmoid-transformed 1-D Gaussian.

For a class-conditional RF, the estimate is

$$L = \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t, c)\right\|^2\right] dt \approx \frac{1}{n}\sum_{i=1}^{n}\left\|x_1^{(i)} - x_0^{(i)} - v_\theta\!\left(x_t^{(i)}, t^{(i)}, c^{(i)}\right)\right\|^2 \tag{14.2}$$

where $c^{(i)}$ is the class label of the drawn $x_1^{(i)}$.
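As a rough illustration, here is one Monte Carlo training step for the class-conditional objective (14.2). The `v_model(xt, t, c)` call signature, the continuous-uniform choice for $t$, and the one-hot class handling are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def rf_training_step(v_model, x1, labels, opt, p_uncond=0.1, num_classes=10):
    """One Monte Carlo step of objective (14.2): draw noise x0 and a time t,
    form the interpolation X_t, and regress the displacement x1 - x0."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                            # source sample ~ N(0, I)
    t = torch.rand(b, device=x1.device)                  # continuous uniform t in [0, 1]
    tb = t.view(-1, 1, 1, 1)
    xt = tb * x1 + (1 - tb) * x0                         # interpolation path (12)
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(b, 1, device=x1.device) > p_uncond).float()
    c = c * keep                                         # drop class signal for CFG training
    v_pred = v_model(xt, t, c)
    loss = F.mse_loss(v_pred, x1 - x0)                   # predicted velocity vs. displacement
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```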

The loss can also be regarded as an L2 loss between the predicted velocity and the actual displacement. The training algorithm is shown below:

4.3 Sampling

For an RF, we sample by solving an ODE. The ODE setup for a time-conditional RF is

$$\frac{dZ_t}{dt} = v_\theta(Z_t, t), \quad Z_0 \sim \pi_0 \tag{15.1}$$

and we take $Z_1$ as the generated image. The general form of the solution is

$$Z_t = Z_0 + \int_0^t v_\theta(Z_s, s)\,ds. \tag{16.1}$$

This integral is likewise not directly computable, so we estimate it with a numerical ODE solver.

Possible methods include Euler's method and RK45; we implement the former as a simple but effective choice.

Using Euler's method, the estimate for $Z_1$ is

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1} v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}\right) \tag{17.1}$$

where $\frac{1}{T}$ acts as the sampling step size $\Delta t$.
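A minimal sketch of this Euler integration for the time-conditional case; the `v_model(z, t)` call signature and the default `T = 300` are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def rf_sample_euler(v_model, shape, T=300, device="cuda"):
    """Euler integration of (15.1)/(17.1): start from Z_0 ~ N(0, I) and take T
    steps of size 1/T along the predicted velocity field."""
    z = torch.randn(shape, device=device)
    for k in range(T):
        t = torch.full((shape[0],), k / T, device=device)
        z = z + (1.0 / T) * v_model(z, t)
    return z                                             # approximate Z_1
```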

For a class-conditional RF, the framework is similar, but we specify the class $c$:

$$\frac{dZ_t}{dt} = v_\theta(Z_t, t, c), \quad Z_0 \sim \pi_0, \tag{15.2}$$

$$Z_t = Z_0 + \int_0^t v_\theta(Z_s, s, c)\,ds, \tag{16.2}$$

and the estimate

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1} v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, c\right). \tag{17.2}$$

4.4 Implementation Detail and Results

I implemented two kinds of RF (time-conditional and class-conditional) on top of the DDPM code structure.

I used the time-conditional UNet for the time-conditional RF and the class-conditional UNet for the class-conditional one. The architecture of this core model remains the same as in DDPM.

The $\beta$ schedule (the list of $\beta_t$) is no longer needed, but the number of timesteps is still required as a hyperparameter for the forward and sampling procedures.

For the class-conditional RF, CFG is also slightly changed: it now guides the conditional velocity estimate (rather than the noise estimate) away from the unconditional one:

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1}\left[\gamma\, v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, c\right) + (1-\gamma)\, v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, \mathbf{0}\right)\right]. \tag{18}$$

We train with the same set of hyperparameters as in 2.2. The training and test losses are higher than in DDPM training, but the generated (sampled) images are fairly good and free of residual noise.

Results of Time-Conditional RF

The training loss curve for the time-conditional RF is shown below.

Training Loss Curve

The sampling results for the time-conditional RF are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

Results of Class-Conditional RF

The training loss curve for the class-conditional RF is shown below.

Training Loss Curve

The sampling results for the class-conditional RF are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

5. Bells & Whistles: Sampling Gifs

I implemented the GIF generating code, and the generated Gifss are juxtaposed with static images in every section above and below.