
 

Programming Project #5 (proj5)

COMPSCI 180 Intro to Computer Vision and Computational Photography

Chuyan Zhou

This webpage uses the Typora Academic theme for markdown files.

Part A

0. Setup

2 Stages

We first use 3 text prompts and let the model generate output images. The generated images and their captions are displayed below for different numbers of inference steps:

Reflection on the generation

We find that with 5 inference steps the outputs are not very clear; specifically, the added noise is not completely removed. Many noisy dots remain in the generated images, and the generated features are also indistinct.

For 20 steps, the noise is removed and the generated images start to look decent; they are quite close to the text prompts.

For 100 steps, the generated images are quite clear, the features are well formed, and the results match the text prompts even more closely.

Seed

We use the seed SEED=42 throughout this part of the project.

1. Sampling Loops

1.1 Implementing the Forward Process

A key part of diffusion is the forward process, which takes a clean image and adds noise to it.

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right) \tag{1}$$

which is equivalent to the following equation giving $x_t$:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

That is, given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$ by sampling from a Gaussian with mean $\sqrt{\bar\alpha_t}\,x_0$ and variance $(1-\bar\alpha_t)$. Note that the forward process is not just adding noise -- we also scale the image by $\sqrt{\bar\alpha_t}$ and scale the noise by $\sqrt{1-\bar\alpha_t}$. The cumulative product $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ is equivalent to iteratively adding noise with the scheduled $\alpha_t$'s. $\bar\alpha_t$ is close to 1 for small $t$ and close to 0 for large $t$.
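For concreteness, here is a minimal PyTorch sketch of the forward process in equation (2). The name `alphas_cumprod` (a 1-D tensor holding the $\bar\alpha_t$ values) and the batched tensor shapes are assumptions for this sketch, not the exact variables of the actual pipeline code.

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image and add scaled Gaussian noise (eq. 2).

    x0:              clean image tensor of shape (B, C, H, W)
    t:               timestep (int, or a tensor with one timestep per batch element)
    alphas_cumprod:  1-D tensor of cumulative products alpha_bar_t
    """
    alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # broadcast over (C, H, W)
    eps = torch.randn_like(x0)                        # epsilon ~ N(0, I)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
    return xt, eps
```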

We run the forward process on the test image with $t \in \{250, 500, 750\}$. The noisy images at the different timesteps are shown below:

Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

 

1.2 Classical Denoising

Starting from the noisy images above at different timesteps, we try Gaussian blur filtering to denoise them. For $t \in \{250, 500, 750\}$, the kernel sizes are 3, 5, 7 and the sigmas are 1.5, 2.5, 3.5, respectively. The results of this classical denoising approach are shown below.

Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Blur Denoising at t=250
Gaussian Blur Denoising at t=500
Gaussian Blur Denoising at t=750

 

We can see that the Gaussian filters denoise the images poorly: the original noise is not eliminated, while the main features and shape of the Campanile are blurred.

1.3 One-step Denoising

Now, we try to recover $x_0$ from $x_t$ using a UNet, where $t \in \{250, 500, 750\}$. The UNet is not used to predict $x_0$ directly, but to predict the added noise $\epsilon$. We denote the noise-predicting model (UNet) as $\epsilon_\theta(x_t, t)$, which is also conditioned on the timestep $t$.

Here, an estimate of $x_0$ can be obtained directly by inverting the forward equation (2) above, which gives one-step denoising:

$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\,x_t - \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\,\epsilon_\theta(x_t, t) \tag{2.1}$$

where $\epsilon_\theta$ is the UNet noise predictor.
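A minimal sketch of this inversion, under the same assumed `alphas_cumprod` tensor as in the forward-process sketch above:

```python
import torch

def one_step_denoise(xt, t, eps_pred, alphas_cumprod):
    """Equation (2.1): recover the clean-image estimate x_0 from x_t and the
    noise predicted by the UNet."""
    alpha_bar = alphas_cumprod[t]
    return (xt - torch.sqrt(1 - alpha_bar) * eps_pred) / torch.sqrt(alpha_bar)
```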

The one-step denoising results (the original image, the noisy image, and the estimate of the original image) are shown below.

Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile at t=250
One-Step Denoised Campanile at t=500
One-Step Denoised Campanile at t=750

One-step denoising in 1.3 performs much better than classical denoising. But as $t$ grows larger, the results still degrade: the denoised image becomes blurry.

1.4 Iterative Denoising

Iterative denoising, as used in diffusion models, addresses the problem in 1.3 that the denoised image blurs for larger $t$. Mathematically, the one-step equation would be equivalent to the iterative scheme if the model were perfect; in practice, for a model with limited capacity, the iterative scheme works better because it breaks the task into smaller, easier steps.

The formula for iterative denoising, which estimates the previous forward-process step (i.e., the next step in denoising), is

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\,x_t + v_\sigma \tag{3}$$

where

- $x_t$ is the image at the current (noisier) timestep $t$, and $x_{t'}$ is the image at the previous timestep $t' < t$ (less noisy);
- $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$;
- $x_0$ is the current estimate of the clean image, obtained from formula (2.1);
- $v_\sigma$ is random noise scaled according to the predicted variance.

Given $x_t$ from the last step and $x_0$ predicted by formula (2.1) in 1.3, we can compute $x_{t'}$ from formula (3). In this project, we start at $t = 990$ with a stride of 30, so the model skips 30 timesteps per iteration and finally arrives at $t = 0$, i.e. the clean image. The denoising results are shown below.
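For reference, a minimal PyTorch sketch of a single update of formula (3) follows. The `alphas_cumprod` tensor is the same assumption as above, and the variance term $v_\sigma$ is approximated here with $\sqrt{\beta_t}$-scaled Gaussian noise, which is an assumption of this sketch rather than the exact variance computation used in the project.

```python
import torch

def iterative_denoise_step(xt, x0_est, t, t_prev, alphas_cumprod):
    """One update of formula (3): move from the noisier x_t to the cleaner x_{t'}.
    x0_est is the clean-image estimate from (2.1) at the current step."""
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    # Variance term v_sigma (approximated; dropped at the final step).
    v_sigma = torch.sqrt(beta_t) * torch.randn_like(xt) if t_prev > 0 else 0.0
    x_prev = (torch.sqrt(abar_prev) * beta_t / (1 - abar_t)) * x0_est \
           + (torch.sqrt(alpha_t) * (1 - abar_prev) / (1 - abar_t)) * xt \
           + v_sigma
    return x_prev
```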

Denoised to t=690
Denoised to t=540
Denoised to t=390
Denoised to t=240
Denoised to t=90
Original
Iteratively Denoised
One-step denoised
Gaussian Blurred

 

1.5 Diffusion Model Sampling

In this part, we explore another important use of diffusion models besides denoising: sampling from the real-image manifold. We feed randomly generated Gaussian noise into the iterative denoising function, using the prompt "a high quality photo" as a "null" prompt so that the model effectively performs unconditional generation.

Here are 5 images sampled from the "null" prompt:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

These images are reasonable, but not particularly sharp or impressive. We can improve on this with CFG in the next section.

1.6 Classifier-free Guidance

For a noise image (or, more generally, any input image), generation is conditioned on some prompt. For the same input, the model can also produce an unconditional noise estimate, denoted $\epsilon_u$, alongside the usual prompt-conditioned estimate, denoted $\epsilon_c$. Note that we use a truly empty prompt to generate $\epsilon_u$, not the "null" prompt mentioned above; in this setting the "null" prompt actually serves as the conditioning, even though the outer task is unconditional generation.

The combined noise estimate is then expressed as

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) = \gamma\,\epsilon_c + (1-\gamma)\,\epsilon_u \tag{4}$$

where $\gamma$ is the guidance scale, which we set to $\gamma = 7$ in this project.

Intuitively, this is a guidance: a push $(\epsilon_c - \epsilon_u)$ from the unconditional point on the manifold toward the conditional point, which makes the image exhibit more of the conditioned attribute. For example, with a dog as the conditioning, this push makes the image resemble a dog more, i.e. have more "dog-ness".

If we set $\gamma = 1$, the push is equivalent to that of the previous section, which we saw is not very effective. If $\gamma > 1$, the push is amplified, which is exactly what CFG does.
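As a rough illustration, here is how the CFG combination in (4) might look in code. The diffusers-style UNet call (keyword `encoder_hidden_states`, `.sample` output) and the embedding variable names are assumptions for this sketch and may not match the exact project code.

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(unet, xt, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Equation (4): combine conditional and unconditional noise predictions.

    cond_embeds:   text embedding of the conditioning prompt (epsilon_c pass)
    uncond_embeds: text embedding of a truly empty prompt (epsilon_u pass)
    """
    eps_c = unet(xt, t, encoder_hidden_states=cond_embeds).sample
    eps_u = unet(xt, t, encoder_hidden_states=uncond_embeds).sample
    return eps_u + gamma * (eps_c - eps_u)   # = gamma * eps_c + (1 - gamma) * eps_u
```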

Here are 5 images sampled from the "null" prompt, with CFG at scale $\gamma = 7$:

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5

The resulting images are much better.

1.7 Image-to-Image Translation

In this part, we follow the SDEdit algorithm to transform an input image into another image under some conditioning. This is done by feeding our input image into the iterative denoising pipeline, starting from a chosen forward timestep $t$ (or, equivalently, an index i_start into the strided timesteps, a.k.a. the noise level). $t$ acts as the claimed amount of noise present in the input, i.e. how much "noise" the model should "remove" to recover "the original image". The smaller the noise-level index, the larger the corresponding $t$, and the more the image is altered (edited).

We use the given noise levels [1, 3, 5, 7, 10, 20] and the "null" prompt, i.e. "a high quality photo", as the conditioning. Results are shown below:

Result 1: Berkeley Campanile

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Berkeley Campanile

Result 2: Self-selected image 1: kusa.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
kusa.png

Result 3: Self-selected image 2: pien.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
pien.png

1.7.1 Editing Hand-Drawn and Web Images

As above, we pick an image from the web and two hand-drawn images and feed them into the translation pipeline.

Result 1: Web image

Web Image
Web Image
Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Web Image resized

Result 2: Hand-drawn image 1: A Cruise

Hand-drawn Image 1
Hand-drawn Image 1: A Cruise
Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Hand-drawn Image 1 resized

Result 3: Hand-drawn image 2: A Lemon

Hand-drawn Image 2
Hand-drawn Image 2: A Lemon

 

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Hand-drawn Image 2 resized

1.7.2 Inpainting

Now, we implement a hole-filling (inpainting) algorithm using the same iterative denoising pipeline together with a mask $m$ on the input image. Mask values in $m$ are set to 1 for pixels to be inpainted and 0 for the rest (the known pixels). The mask is used to composite each intermediate result with the noised original image. As before, we start from Gaussian noise, and we keep the original image as $x_{orig}$. Then, every time we iteratively denoise from $t$ to $t'$, we apply the following step according to this paper:

$$x_t \leftarrow m \odot x_t + (1-m) \odot \text{forward}(x_{orig}, t) \tag{5}$$

where $\odot$ is element-wise multiplication. This keeps the denoised (inpainted) pixels inside the hole while forcing the known pixels to agree with the appropriately noised original image. The results are shown below.
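A minimal sketch of the compositing step in (5), again assuming an `alphas_cumprod` tensor of $\bar\alpha_t$ values as in the earlier sketches:

```python
import torch

def inpaint_composite(xt, x_orig, mask, t, alphas_cumprod):
    """Equation (5): keep denoised pixels inside the hole (mask == 1) and
    re-impose appropriately noised original pixels in the known region."""
    alpha_bar = alphas_cumprod[t]
    noised_orig = torch.sqrt(alpha_bar) * x_orig \
                + torch.sqrt(1 - alpha_bar) * torch.randn_like(x_orig)  # forward(x_orig, t)
    return mask * xt + (1 - mask) * noised_orig
```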

Result 1: Berkeley Campanile

Original
Mask
Hole
Inpainted

Result 2: Self-selected image 1 (Pagoda)

Original
Mask
Hole
Inpainted

Result 3: Self-selected image 2 (Pien)

Original
Mask
Hole
Inpainted

1.7.3 Text-Conditional Image-to-image Translation

In this part, we do the same as in 1.7 and 1.7.1, but we use a text prompt as the conditioning. The text prompt is "a rocket ship". The results are shown below.

Result 1: Berkeley Campanile

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
Berkeley Campanile

Result 2: Self-selected image 1: kusa.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
kusa.png

Result 3: Self-selected image 2: pien.png

Noise Level 1
Noise Level 3
Noise Level 5
Noise Level 7
Noise Level 10
Noise Level 20
pien.png

1.8 Visual Anagrams

In this part, we use the iterative denoising pipeline to generate visual anagrams (following this research): an image that shows one thing when viewed normally and a different thing when viewed upside down.

We implement this by modifying the noise estimate. One estimate is computed from the current noisy image $x_t$ with the first prompt $p_1$; another is computed from the flipped image $\text{flip}(x_t)$ with the second prompt $p_2$. The second estimate is then flipped back so that it is aligned with the upright view. Finally, the two estimates are averaged to give the noise estimate we use. The process can be expressed as:

$$\begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}\!\left(\text{UNet}(\text{flip}(x_t), t, p_2)\right) \\ \epsilon &= \frac{\epsilon_1 + \epsilon_2}{2} \end{aligned} \tag{6}$$
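A sketch of the modified noise estimate in (6). The UNet call signature is the same diffusers-style assumption as in the CFG sketch above, and CFG, if applied, is omitted for brevity.

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(unet, xt, t, p1_embeds, p2_embeds):
    """Equation (6): average the upright estimate for prompt p1 with the
    flipped-back estimate for prompt p2 on the vertically flipped image."""
    eps1 = unet(xt, t, encoder_hidden_states=p1_embeds).sample
    eps2 = unet(torch.flip(xt, dims=[-2]), t, encoder_hidden_states=p2_embeds).sample
    eps2 = torch.flip(eps2, dims=[-2])   # re-align with the upright orientation
    return (eps1 + eps2) / 2
```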

The results are shown below.

Result 1

Ordinary: an oil painting of people around a campfire
Flipped: an oil painting of an old man

Result 2

Ordinary: a lithograph of waterfalls
Flipped: a photo of a dog

Result 3

Ordinary: an oil painting of a snowy mountain village
Flipped: a photo of a hipster barista

1.9 Hybrid Images

In this section, we perform hybrid image generation: creating an image that shows one thing at low frequencies (from far away, or when blurred) and another at high frequencies (up close), based on this paper (Factorized Diffusion). We estimate the noise with these formulas:

$$\begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \end{aligned} \tag{7}$$

where $f_{\text{lowpass}}$ and $f_{\text{highpass}}$ are the low-pass and high-pass filters, respectively.

For the low-pass filter we use a Gaussian filter with kernel size 33 and sigma 2, as recommended in the project spec; the high-pass filter is the difference between the original image and its low-pass-filtered version, i.e. identity minus the low-pass filter. The results are shown below.
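A sketch of the noise estimate in (7) with the low-pass/high-pass split, using the kernel size and sigma above. The UNet call signature and embedding names are the same assumptions as in the earlier sketches.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(unet, xt, t, p_low_embeds, p_high_embeds, ksize=33, sigma=2.0):
    """Equation (7): keep low frequencies of the noise for the first prompt and
    high frequencies of the noise for the second prompt."""
    eps1 = unet(xt, t, encoder_hidden_states=p_low_embeds).sample
    eps2 = unet(xt, t, encoder_hidden_states=p_high_embeds).sample
    low = TF.gaussian_blur(eps1, kernel_size=ksize, sigma=sigma)            # f_lowpass
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=ksize, sigma=sigma)    # f_highpass = id - LP
    return low + high
```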

I used the text encoder instead of the provided .pth embeddings to obtain embeddings for my custom prompts in Results 2 and 3.

Result 1

Low pass: a lithograph of a skull
High pass: a lithograph of waterfalls

Hybrid Image of a skull and waterfalls

Result 2

Low pass: a salmon sushi nigiri
High pass: a sitting orange cat with a white belly

Hybrid Image of a salmon sushi nigiri and a cat

Result 3

Low pass: a photo of the Ayers rock
High pass: a photo of a dog lying on stomach

Hybrid Image of the Ayers rock and a dog

 

2. Bells & Whistles

2.1 A logo for the CS180 course

I designed a logo for this course, CS180, using stage 1 of the model above, and upsampled it to a higher resolution using stage 2 of the model.

The logo is a pixelated bear holding a camera, ready to take a photo.

The logo is shown below:

CS180 Logo

Part B

1. Training a Single-Step Denoising UNet

Given a noisy image $z$, we want to train a UNet denoiser $D_\theta$ that maps $z$ to a clean image $x$. An L2 loss is used for this training (and throughout Part B):

$$L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2 \tag{8}$$

1.1 Implementing the UNet

We implement an unconditional UNet following the computation graph above, where the operation blocks are defined as in the accompanying diagrams.

1.2 Using the UNet to Train a Denoiser

To train the unconditional UNet denoiser, we dynamically generate $(z, x)$ pairs (rather than using pre-computed noise) from clean images in the training data. The clean image drawn from the training data is $x$, and

$$z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{9}$$
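A minimal sketch of this on-the-fly pair generation, applied to each clean training batch:

```python
import torch

def make_noisy_pair(x, sigma=0.5):
    """Equation (9): build a (z, x) training pair by adding sigma-scaled
    Gaussian noise to a clean image batch x, generated freshly every step."""
    z = x + sigma * torch.randn_like(x)
    return z, x
```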

We show varying levels of noise on MNIST digits, with $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$.

Varying noise levels on MNIST digits

1.2.1 Training

Now, we train the denoiser with $\sigma = 0.5$, batch size 256, 128 hidden channels ($D = 128$, where $D$ is the hidden dimension in the computation graph above), and an Adam optimizer with a learning rate of 1e-4 for 5 epochs.

The training loss curve is shown below.

Training Loss Curve

We visualize denoised results on the test set at the end of training.

Results on digits from the test set after 1 epoch of training
Results on digits from the test set after 5 epochs of training

1.2.2 OOD Testing

Although the denoiser was trained with $\sigma = 0.5$, we can also perform out-of-distribution testing over a range of noise levels, $\sigma \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}$.

Results on digits from the test set with varying noise levels

2. Training a Diffusion Model

Now we implement DDPM. We want the UNet to predict the noise instead of the clean image, i.e. the model is $\epsilon_\theta$ and the loss is

$$L = \mathbb{E}_{\epsilon,z}\,\|\epsilon_\theta(z) - \epsilon\|^2 \tag{10}$$

From (2) we know

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

for a certain timestep $t \in \{0, \dots, T\}$ in the noise-adding (forward) process. Because the noise level now varies, we condition the model on $t$ so that a single model can handle all noise levels. The time-conditional diffusion model has the following computation graph:

where the FCBlock is

In DDPM, we also have a noise schedule, i.e. a list of $\beta_t$, $\alpha_t$, and $\bar\alpha_t$, related by $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$.

2.1 Adding Time Conditioning to UNet

We add the encoded time conditioning, via broadcasting, to the outputs of the first UpBlock and the Unflatten layer, as shown in the graph above.

Now, the objective with time conditioning is

$$L = \mathbb{E}_{\epsilon, x_0, t}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2 \tag{11}$$

where $x_t$ is produced by (2).

2.2 Training the Time-Conditional DDPM

The training algorithm is as follows:

In our implementation, we train the DDPM on MNIST (as in the parts below) with batch size 128, 20 epochs, $D = 64$, and an Adam optimizer with an initial learning rate of 1e-3. An exponential LR decay scheduler with gamma $0.1^{1/\text{n\_epochs}}$ is also used. Also, $t$ is always normalized before being fed to the model.
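As a rough sketch of one training epoch under objective (11): the `unet(xt, t_normalized)` call signature, the choice `T = 300`, and an `alphas_cumprod` tensor with $T+1$ entries (indexed 0 to $T$) are assumptions here, not necessarily the exact values and interfaces used in the project.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(unet, loader, opt, alphas_cumprod, T=300, device="cuda"):
    """One epoch of time-conditional DDPM training: sample t, noise the batch
    with equation (2), and regress the added noise with the L2 loss (11)."""
    unet.train()
    for x0, _ in loader:                                # MNIST labels are unused here
        x0 = x0.to(device)
        t = torch.randint(1, T + 1, (x0.shape[0],), device=device)
        alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x0)
        xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * eps
        eps_pred = unet(xt, t.float() / T)              # t is normalized before conditioning
        loss = F.mse_loss(eps_pred, eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```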

The training loss curve is shown below.

Training Loss Curve

2.3 Sampling from the Time-Conditional DDPM

Following the sampling algorithm of DDPM as follows:

we can now sample from the model. We show sampling results after the 5th and 20th epoch.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

2.4 Adding Class-Conditioning to UNet

We want the DDPM to generate images of a specified class. To modify the UNet architecture, we add 2 more FCBlocks and feed both with one-hot class vectors, which are masked to 0 with probability $p_{\text{uncond}} = 0.1$ so that the model retains the ability to generate unconditionally.

At the points where the time conditioning is added, we first multiply the hidden activations element-wise by the outputs of the class-conditioning FCBlocks, then add the time conditioning.
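A minimal sketch of this conditioning at one injection point; the tensor shapes are assumptions for illustration.

```python
def modulate(h, t_embed, c_embed):
    """Scale the hidden activations by the class-FCBlock output, then add the
    time-FCBlock output.

    h:       hidden feature map, shape (B, D, H, W)
    t_embed: time-conditioning FCBlock output, shape (B, D)
    c_embed: class-conditioning FCBlock output, shape (B, D); rows are zeroed
             with probability p_uncond during training (unconditional dropout)
    """
    B, D = c_embed.shape
    return c_embed.view(B, D, 1, 1) * h + t_embed.view(B, D, 1, 1)
```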

We use the same set of hyperparameters as in 2.2. The class-conditional training algorithm is as follows:

The training loss curve is shown below.

Training Loss Curve

2.5 Sampling from the Class-Conditional DDPM

With class conditioning, we also use the classifier-free guidance described in Part A. We use CFG with guidance scale $\gamma = 5.0$ for this part; the sampling algorithm is shown below, where $\epsilon_u$ is the unconditional predicted noise and $\epsilon_c$ is the conditional one.

The sampling results are shown below. We can see that the class conditioning is followed very well.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

3. Bells & Whistles: Improving Time-conditional UNet Architecture

For ease of explanation and implementation, our UNet architecture above is pretty simple.

I added skip connections (shortcuts) to ConvBlock, DownBlock, and UpBlock: a plain convolution of the block input (serving as the residual, a.k.a. the "identity" path) is added to the block's output. We train with the same set of hyperparameters as in 2.2.
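A sketch of the modified ConvBlock; the exact layers in the main branch (the conv/BatchNorm/GELU stack) follow the standard project UNet and are an assumption of this sketch.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """ConvBlock with a shortcut: a plain convolution of the input (the
    'identity'/residual path) is added to the output of the main branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # plain conv as the shortcut

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```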

The improved UNet can achieve a better test loss (0.02820390514746497) than the original (0.02956294636183147).

The training loss curve is shown below.

Training Loss Curve

The sampling results are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

4. Bells & Whistles: Rectified Flow

Instead of DDPM, we now implement a newer state-of-the-art framework: Rectified Flow.

4.1 Overview

Rectified Flow (RF) is a generative modeling method that transports data from a source distribution $\pi_0$ (here the pure Gaussian $\pi_0 = \mathcal{N}(0, I)$) to a target distribution $\pi_1$, the distribution of clean images.

The overall objective is to align the velocity estimate (now produced by the UNet, denoted $v_\theta$) with the actual velocity between the source image $X_0$ and the target image $X_1$. First, the timesteps here are normalized to $t \in [0, 1]$ instead of ranging over $\{0, \dots, T\}$.

In the general case with $t \in [0, T]$, the velocity is $\frac{X_1 - X_0}{T}$ while the displacement is $X_1 - X_0$. With normalized time ($T = 1$), the two quantities are numerically equal.

We define the interpolation path as

$$X_t = t\,X_1 + (1-t)\,X_0, \quad t \in [0, 1] \tag{12}$$

where $X_0 \sim \pi_0$ and $X_1 \sim \pi_1$. Then, the time-conditional objective is

$$\min_\theta \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t)\right\|^2\right] dt \tag{13.1}$$

and we can also add class conditioning, where $X_1 \sim \pi_1 \mid c$:

$$\min_\theta \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t, c)\right\|^2\right] dt. \tag{13.2}$$

In other words, we want the learned transport path to be as straight as possible.

4.2 Training

For an RF, the objective is as listed above: it is minimized over the whole dataset and over all times.

However, neither the integral over $t$ from 0 to 1 nor the expectation inside it can be computed exactly; both must be estimated by sampling.

For a time-conditional RF, using the Monte Carlo method, we can estimate the objective (loss) by

$$L = \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t)\right\|^2\right] dt \approx \frac{1}{n}\sum_{i=1}^{n}\left\|x_1^{(i)} - x_0^{(i)} - v_\theta\!\left(x_t^{(i)}, t^{(i)}\right)\right\|^2 \tag{14.1}$$

where $x_1^{(i)}$ is a data point (a clean image) drawn from the target distribution (the training dataset), $x_0^{(i)}$ is dynamically generated Gaussian noise (i.e. drawn from the source distribution), and $x_t^{(i)}$ is the interpolation at a timestep $t^{(i)}$ sampled from a distribution over timesteps. The distribution of $t$ can be discrete uniform on $\{0, \dots, T\}$, continuous uniform, or a nonlinear schedule such as a sigmoid-transformed 1-D Gaussian.

For a class-conditional RF, the estimate is

$$L = \int_0^1 \mathbb{E}\!\left[\left\|(X_1 - X_0) - v_\theta(X_t, t, c)\right\|^2\right] dt \approx \frac{1}{n}\sum_{i=1}^{n}\left\|x_1^{(i)} - x_0^{(i)} - v_\theta\!\left(x_t^{(i)}, t^{(i)}, c^{(i)}\right)\right\|^2 \tag{14.2}$$

where $c^{(i)}$ is the class label of the drawn $x_1^{(i)}$.
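As a rough illustration, here is one Monte Carlo training step for the class-conditional objective (14.2). The `v_model(xt, t, c)` call signature, the continuous-uniform choice for $t$, and the one-hot class handling are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def rf_training_step(v_model, x1, labels, opt, p_uncond=0.1, num_classes=10):
    """One Monte Carlo step of objective (14.2): draw noise x0 and a time t,
    form the interpolation X_t, and regress the displacement x1 - x0."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                            # source sample ~ N(0, I)
    t = torch.rand(b, device=x1.device)                  # continuous uniform t in [0, 1]
    tb = t.view(-1, 1, 1, 1)
    xt = tb * x1 + (1 - tb) * x0                         # interpolation path (12)
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(b, 1, device=x1.device) > p_uncond).float()
    c = c * keep                                         # drop class signal for CFG training
    v_pred = v_model(xt, t, c)
    loss = F.mse_loss(v_pred, x1 - x0)                   # predicted velocity vs. displacement
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```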

The loss can also be regarded as an L2 loss between the predicted velocity and the actual displacement. The training algorithm is shown below:

4.3 Sampling

For an RF, we sample by solving an ODE. The ODE setup for a time-conditional RF is

$$\frac{dZ_t}{dt} = v_\theta(Z_t, t), \quad Z_0 \sim \pi_0 \tag{15.1}$$

and we take $Z_1$ as the generated image. The general form of the solution is

$$Z_t = Z_0 + \int_0^t v_\theta(Z_s, s)\,ds. \tag{16.1}$$

This integral is likewise not directly computable, so we estimate it with a numerical ODE solver.

Possible methods include Euler's method and RK45; we implement the former as a simple but effective choice.

Using Euler's method, the estimate for $Z_1$ is

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1} v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}\right) \tag{17.1}$$

where $\frac{1}{T}$ acts as the sampling step size $\Delta t$.
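A minimal sketch of this Euler integration for the time-conditional case; the `v_model(z, t)` call signature and the default `T = 300` are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def rf_sample_euler(v_model, shape, T=300, device="cuda"):
    """Euler integration of (15.1)/(17.1): start from Z_0 ~ N(0, I) and take T
    steps of size 1/T along the predicted velocity field."""
    z = torch.randn(shape, device=device)
    for k in range(T):
        t = torch.full((shape[0],), k / T, device=device)
        z = z + (1.0 / T) * v_model(z, t)
    return z                                             # approximate Z_1
```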

For a class-conditional RF, the framework is similar, but we specify the class $c$:

$$\frac{dZ_t}{dt} = v_\theta(Z_t, t, c), \quad Z_0 \sim \pi_0, \tag{15.2}$$

$$Z_t = Z_0 + \int_0^t v_\theta(Z_s, s, c)\,ds, \tag{16.2}$$

and the estimate

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1} v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, c\right). \tag{17.2}$$

4.4 Implementation Detail and Results

I implemented two kinds of RF (time-conditional and class-conditional) on top of the DDPM code structure.

I used the time-conditional UNet for the time-conditional RF and the class-conditional UNet for the class-conditional one. The architecture of this core model remains the same as in DDPM.

The $\beta$ schedule (the list of $\beta_t$) is no longer needed, but the number of timesteps is still required as a hyperparameter for the forward and sampling procedures.

For the class-conditional RF, CFG is also slightly changed: it now guides the conditional velocity estimate (rather than the noise estimate) away from the unconditional one:

$$Z_1 \approx Z_0 + \frac{1}{T}\sum_{k=0}^{T-1}\left[\gamma\, v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, c\right) + (1-\gamma)\, v_\theta\!\left(Z_{k/T}, \tfrac{k}{T}, \mathbf{0}\right)\right]. \tag{18}$$

We train with the same set of hyperparameters as in 2.2. The training and test losses are higher than in DDPM training, but the generated (sampled) images are fairly good and free of residual noise.

Results of Time-Conditional RF

The training loss curve for the time-conditional RF is shown below.

Training Loss Curve

The sampling results for the time-conditional RF are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

Results of Class-Conditional RF

The training loss curve for the class-conditional RF is shown below.

Training Loss Curve

The sampling results for the class-conditional RF are shown below.

Epoch 5
Epoch 5, animated
Epoch 20
Epoch 20, animated

5. Bells & Whistles: Sampling Gifs

I implemented the GIF generating code, and the generated Gifss are juxtaposed with static images in every section above and below.