
Programming Project #1 (proj1)

COMPSCI 180 Intro to Computer Vision and Computational Photography

Chuyan Zhou

This webpage uses the Typora Newsprint theme for Markdown files.

Overview

Sergei Mikhailovich Prokudin-Gorskii (1863-1944) [Сергей Михайлович Прокудин-Горский, to his Russian friends] was a man well ahead of his time. Convinced, as early as 1907, that color photography was the wave of the future, he won the Tzar's special permission to travel across the vast Russian Empire and take color photographs of everything he saw, including the only color portrait of Leo Tolstoy. And he really photographed everything: people, buildings, landscapes, railroads, bridges... thousands of color pictures! His idea was simple: record three exposures of every scene onto a glass plate using a red, a green, and a blue filter. Never mind that there was no way to print color photographs until much later -- he envisioned special projectors to be installed in "multimedia" classrooms all across Russia where the children would be able to learn about their vast country. Alas, his plans never materialized: he left Russia in 1918, right after the revolution, never to return again. Luckily, his RGB glass plate negatives, capturing the last years of the Russian Empire, survived and were purchased in 1948 by the Library of Congress. The LoC has recently digitized the negatives and made them available on-line.

The goal of this assignment is to take the digitized Prokudin-Gorskii glass plate images and, using image processing techniques, automatically produce a color image with as few visual artifacts as possible. In order to do this, you will need to extract the three color channel images, place them on top of each other, and align them so that they form a single RGB color image.

Step 1: Preprocessing

Each provided image is a digitized glass plate with the three channel exposures arranged vertically; one example, the original cathedral.jpg, is shown below:

[Figure: the original cathedral.jpg glass plate]

The digitized glass plate images all contain black and white borders. Because these borders can skew the similarity metrics between two channels, we preprocess the images to remove them. First, we cut the original rectangular image into 3 equal parts (guaranteed to have the same shape), ordered B, G, R from top to bottom; then we crop each B/G/R channel subimage by 10% of the subimage height. After this procedure, no borders are visible in the project data.
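As a reference, a minimal sketch of this preprocessing step in Python, assuming the plate has been loaded as a single grayscale NumPy float array; trimming all four sides by a fixed fraction is our reading of the crop, not a prescribed detail:

```python
import numpy as np

def split_and_crop(plate, crop_frac=0.10):
    """Split the vertical plate into B, G, R thirds and trim their borders."""
    h = plate.shape[0] // 3
    channels = [plate[i * h:(i + 1) * h] for i in range(3)]  # B, G, R, top to bottom
    dh = max(1, int(h * crop_frac))               # 10% of the subimage height
    dw = max(1, int(plate.shape[1] * crop_frac))  # analogous trim for the width (assumption)
    return [c[dh:-dh, dw:-dw] for c in channels]
```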

This preprocessing proves effective. For example, the cropped version of monastery.jpg aligned with NCC exhaustive search:

[Figure: monastery.jpg aligned after cropping]

is significantly visibly better than the uncropped version:

[Figure: monastery.jpg aligned without cropping]

Step 2: Exhaustive Search

We align the Green and Red channels to the Blue channel, displacing these channels with the roll function in the NumPy library.

2.1 Algorithm Overview

To align one image (the G or R channel) to another (the B channel), i.e. to displace (roll) one image while the other serves as a fixed reference, the naive approach is to try a range of displacement vectors (tuples) for the first image (call it img1), evaluate each possible displacement with a metric, and choose the best one w.r.t. the optimum of that metric, which may be a maximum or a minimum.

Some metrics are given in the project description; we implement these first for the baseline method.

The search range here is set to $[-15, 15]$ in both width and height.
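A minimal sketch of this exhaustive search, assuming `metric(img1, img2)` is a function to be minimized (the concrete metrics follow in the next subsections):

```python
import numpy as np

def exhaustive_align(img1, ref, metric, radius=15):
    """Try every (dy, dx) in [-radius, radius]^2, return the metric-minimizing one."""
    best_disp, best_score = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rolled = np.roll(img1, shift=(dy, dx), axis=(0, 1))
            score = metric(rolled, ref)
            if score < best_score:
                best_disp, best_score = (dy, dx), score
    return best_disp
```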

2.2 Sum of Squared Differences (SSD)

For the image img1 and the reference img2, denote them as $I_1$ and $I_2$; also denote the rolled img1 as $\tilde{I}_1$.

The SSD is the sum of the squared differences between every pair of pixels at the same position in the two images, i.e.

$$\mathrm{SSD}(\tilde{I}_1, I_2) = \sum_{i,j} \left[ (\tilde{I}_1)_{i,j} - (I_2)_{i,j} \right]^2,$$

where every $(i,j)$ pair is a pixel position.

We search for the minimum SSD to find the best alignment.

Notice that the MSE and Euclidean distance metrics are just variants of SSD (in this project): minimizing them yields the same minimizer as minimizing SSD.
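A direct transcription of the SSD formula above (a sketch, assuming same-shape float arrays), which can be passed straight to the `exhaustive_align` sketch:

```python
import numpy as np

def ssd(img1, img2):
    """Sum of squared per-pixel differences."""
    return np.sum((img1 - img2) ** 2)
```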

2.3 Normalized Cross-Correlation (NCC)

The other metric given is NCC, the dot product of the two images flattened and normalized, i.e.

$$\mathrm{NCC}(\tilde{I}_1, I_2) = \frac{\sum_{i,j} (\tilde{I}_1)_{i,j} (I_2)_{i,j}}{\|\tilde{I}_1\|_F \, \|I_2\|_F},$$

where $\|\cdot\|_F$ denotes the Frobenius norm, i.e. the square root of the sum of all squared elements of the matrix. We maximize the NCC to find the best alignment.
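NCC transcribes just as directly; negating it lets the same minimizing search loop be reused (for 2-D arrays, `np.linalg.norm` defaults to the Frobenius norm):

```python
import numpy as np

def neg_ncc(img1, img2):
    """Negated NCC, so lower is better, matching the SSD search."""
    return -np.sum(img1 * img2) / (np.linalg.norm(img1) * np.linalg.norm(img2))
```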

2.4 Results for JPEGs

With exhaustive search implemented, we are able to find decent displacements for the images. These are the results and their displacement vectors.

[Figures: aligned results for cathedral.jpg, monastery.jpg, and tobolsk.jpg]

| Image | Green displacement | Red displacement |
| --- | --- | --- |
| cathedral.jpg | (5, 2) | (12, 3) |
| monastery.jpg | (-3, 2) | (3, 2) |
| tobolsk.jpg | (3, 3) | (6, 3) |

SSD and NCC metrics turn out to have the same displacement results for these 3 images.

2.5 Problem Encountered

The provided TIFF images are too large for exhaustive search to process in a reasonable time, so we need new methods with lower time complexity.

Step 3: Image Pyramid

3.1 Algorithm Overview


The image pyramid algorithm processes an image (or several images) by first operating at the low-resolution levels and then propagating the scaled results up through progressively higher-resolution levels, until the original image is processed.

The levels, with level 0 being the original image, are ordered and processed in sequence. We call this ordered list, or stack, a pyramid; the top of the pyramid is the smallest scale.

In our scenario, i.e. image alignment, we build a pyramid for img1, i.e. $[I_1, D(I_1), \ldots, D^{(n)}(I_1)]$, and likewise for img2, i.e. $[I_2, D(I_2), \ldots, D^{(n)}(I_2)]$. Here $D(\cdot)$ denotes downsampling an image to $1/2$ of its width and height, where $1/2$ is a parameter that can be adjusted to a smaller scale. The number $n$ is chosen so that the minimum of the width and height of $D^{(n)}(\cdot)$ is around a fixed number of pixels, which we set to 200 in this project.

To leverage the image pyramid in the displacement search, we first run exhaustive search on $D^{(n)}(I_1)$. Because this image is at the 200-pixel scale, exhaustive search here is very fast compared to the full-resolution search. We then save the displacement found at this level.

Then we repeat the following until there is no next level, i.e. until we reach the original image: roll the current level's image by the 2x-scaled displacement from the previous level; exhaustively search for the best residual displacement of the rolled image; add this residual to the 2x-scaled displacement; and save the updated displacement before moving on to the next level.

Notice that because we choose 2x as the scaling factor, each move to the next level inserts one new pixel into the gap between every two pixels of the current level. Hence the search space can be compressed to $[-1, 1]$ in both width and height, i.e. one 3x3 square, except at the top level, where we set the search space to $[-25, 25]$ in both width and height. For a general scaling factor $s$, we should search $[-s+1, s-1]$ in width and height.

We can still use the metrics above, and the result will not change, because the overall search spaces are equivalent.
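A recursive sketch of this coarse-to-fine search, reusing the `exhaustive_align` sketch above; using `skimage.transform.rescale` for the 1/2 downsampling is an assumption (any downsampler works):

```python
import numpy as np
from skimage.transform import rescale

def pyramid_align(img1, ref, metric, min_size=200):
    """Coarse-to-fine displacement search over a 2x image pyramid."""
    # Base case: the pyramid top, searched exhaustively with a wide radius.
    if min(img1.shape) <= min_size:
        return exhaustive_align(img1, ref, metric, radius=25)
    # Recurse on the half-resolution images, then refine the 2x-scaled estimate.
    dy, dx = pyramid_align(rescale(img1, 0.5), rescale(ref, 0.5), metric, min_size)
    dy, dx = 2 * dy, 2 * dx
    rolled = np.roll(img1, shift=(dy, dx), axis=(0, 1))
    rdy, rdx = exhaustive_align(rolled, ref, metric, radius=1)
    return dy + rdy, dx + rdx
```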

3.2 Results for all

Because the original files are too large, we post JPEG-compressed versions of all the TIFF results here.

NCC and SSD lead to the same displacement results. In the TIFF-processing stage, however, NCC runs slower than SSD: the SSD pyramid takes about 38 s, while NCC takes about 1 min.

[Figures: aligned results for all images, JPEG-compressed]

| Image | Green displacement | Red displacement |
| --- | --- | --- |
| Harvesters | (59, 16) | (124, 13) |
| Self Portrait | (78, 29) | (176, 37) |
| Melons | (81, 10) | (178, 13) |
| Church | (25, 4) | (58, -4) |
| Icon | (41, 17) | (89, 23) |
| Emir | (49, 24) | (98, -206) |
| Train | (42, 5) | (87, 32) |
| Monastery | (-3, 2) | (3, 2) |
| Sculpture | (33, -11) | (140, -27) |
| Tobolsk | (3, 3) | (6, 3) |
| Onion Church | (51, 26) | (108, 36) |
| Three Generations | (53, 14) | (112, 11) |
| Cathedral | (5, 2) | (12, 3) |
| Lady | (51, 9) | (111, 12) |

3.3 The Problem of Emir

Searching for displacements over the cropped images with only the SSD/NCC metrics, emir.tif is not aligned well, because the channels being matched do not actually have the same brightness values (they are different color channels). We can balance the brightness values or use other metrics and methods.

[Figure: emir.tif misaligned under SSD/NCC]

Step 4: Cleverer Metrics

We use the metrics introduced below combined with the image pyramid method to try to align the emir's image correctly.

4.1 Mutual Information (MI)

We can evaluate the MI metric of two images. In probability theory and information theory, the mutual information of two discrete random variables $X$ and $Y$ is defined as

$$\mathrm{MI}(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x,y) \log\!\left( \frac{P_{X,Y}(x,y)}{P_X(x)\, P_Y(y)} \right) = H(X) + H(Y) - H(X,Y),$$

where $H(\cdot)$ is the entropy operator.

For the two images referred to before as img1 and img2, or $I_1$ and $I_2$, the mutual information of the rolled $\tilde{I}_1$ and $I_2$ is

$$\mathrm{MI}(\tilde{I}_1, I_2) = \sum_{i,j} P(x_{i,j}, y_{i,j}) \log\!\left( \frac{P(x_{i,j}, y_{i,j})}{P_1(x_{i,j})\, P_2(y_{i,j})} \right),$$

where $x_{i,j}$ denotes the pixel of $\tilde{I}_1$ at position $(i,j)$, i.e. $x_{i,j} = (\tilde{I}_1)_{i,j}$ and $y_{i,j} = (I_2)_{i,j}$ for simplicity.

For the probability symbols, $P(\cdot, \cdot)$ denotes the joint probability of two pixel values appearing at the same position in the two images, $P_1(\cdot)$ denotes the marginal probability of a pixel value appearing in $\tilde{I}_1$, and $P_2(\cdot)$ is the analogue for $I_2$.

Because mutual information quantifies the "amount of information" obtained about one variable by observing the other, higher values are better. So we maximize the MI metric over displacements of G/R, with B as the reference.
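A sketch of this metric computed from a joint histogram of pixel values (the bin count is an assumption), again negated so the same minimizing search loop applies:

```python
import numpy as np

def neg_mutual_information(img1, img2, bins=64):
    """Negated MI from the joint histogram of two same-shape images."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p_xy = joint / joint.sum()              # joint distribution P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P1(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P2(y)
    nz = p_xy > 0                           # skip empty bins to avoid log(0)
    return -np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz]))
```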

4.1.1 The Emir w.r.t. MI

[Figure: emir.tif aligned with MI]

Green: (49, 23) Red: (106, 40)

We can see the channels align correctly now under this metric.

4.1.2 Running Time

Though MI is an effective metric, it is much more time-consuming than the provided SSD or NCC metrics with the same pyramid parameters. For the emir, it takes about 40 s to find the displacement under MI, but only about 3~5 s under SSD/NCC.

4.2 Structural Similarity Index Measure (SSIM)

For the images A and B, the SSIM metric is defined as

$$\mathrm{SSIM}(A,B) = [l(A,B)]^{\alpha}\, [c(A,B)]^{\beta}\, [s(A,B)]^{\gamma},$$

where $l(A,B)$ is the luminance comparison, $c(A,B)$ is the contrast comparison, and $s(A,B)$ is the structure comparison; $\alpha, \beta, \gamma$ are the weights of these three comparisons.

$l$, $c$, and $s$ are defined as

$$l(A,B) = \frac{2\mu_A \mu_B + C_1}{\mu_A^2 + \mu_B^2 + C_1}, \qquad c(A,B) = \frac{2\sigma_A \sigma_B + C_2}{\sigma_A^2 + \sigma_B^2 + C_2}, \qquad s(A,B) = \frac{\sigma_{AB} + C_3}{\sigma_A \sigma_B + C_3},$$

where $\mu_A$ and $\mu_B$ are the means of $A$ and $B$, $\sigma_A$ and $\sigma_B$ are the standard deviations of $A$ and $B$, and $\sigma_{AB}$ is the covariance of $A$ and $B$. $C_1$, $C_2$, and $C_3$ are constants that avoid zero denominators.

In the most common case, $\alpha = \beta = \gamma = 1$, $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$, and $C_3 = C_2 / 2$, where $L$ is the dynamic range (data range) of the pixel values, and $k_1 = 0.01$ and $k_2 = 0.03$ are constants.

We can thus simplify the SSIM metric as

$$\mathrm{SSIM}(A,B) = \frac{2\mu_A \mu_B + C_1}{\mu_A^2 + \mu_B^2 + C_1} \cdot \frac{2\sigma_A \sigma_B + C_2}{\sigma_A^2 + \sigma_B^2 + C_2} \cdot \frac{\sigma_{AB} + C_3}{\sigma_A \sigma_B + C_3} = \frac{(2\mu_A \mu_B + C_1)(2\sigma_{AB} + C_2)}{(\mu_A^2 + \mu_B^2 + C_1)(\sigma_A^2 + \sigma_B^2 + C_2)}.$$

We maximize the simplified SSIM metric to find the best alignment, because it evaluates the similarity between two images.

In practice, because human eyes perceive an image in a localized way (saccadic eye movement), we compute the means and variances over localized windows (with a uniform filter): we first convolve the images with the filter to obtain the local statistics, and then take the overall mean of the resulting SSIM map.
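A sketch of this windowed SSIM using `scipy.ndimage.uniform_filter`; the window size and the data range $L = 1.0$ (float images in $[0, 1]$) are assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neg_ssim(img1, img2, win=7, L=1.0):
    """Negated mean of the local SSIM map (simplified form above)."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu1, mu2 = uniform_filter(img1, win), uniform_filter(img2, win)
    # Local (co)variances via E[XY] - E[X]E[Y] over the same window.
    var1 = uniform_filter(img1 * img1, win) - mu1 ** 2
    var2 = uniform_filter(img2 * img2, win) - mu2 ** 2
    cov = uniform_filter(img1 * img2, win) - mu1 * mu2
    ssim_map = ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) / \
               ((mu1 ** 2 + mu2 ** 2 + C1) * (var1 + var2 + C2))
    return -ssim_map.mean()
```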

4.2.1 The Emir w.r.t. SSIM

[Figure: emir.tif aligned with SSIM]

Green: (50, 23) Red: (105, 40)

Though the displacement differs slightly from the MI one, we can see the channels align correctly under this metric as well.

4.2.2 Running Time

It takes about 30 s to run, which is better than MI but still slower than SSD/NCC.

Step 5: Edge Detection

Edge detection algorithms can also be used to solve the problem with the emir image. We implement a method that first detects the edges in the three channels, then aligns the edge images, and finally rolls the original channels by the displacements produced by the edge alignment.

We align the edge images with the image-pyramid speed-up and the SSD metric, so this can be seen as another kind of preprocessing, or feature engineering, applied to the images fed into the search.

5.1 Canny Edge Detection

The Canny Algorithm for edge detection is a renowned algorithm with merits such as high accuracy and high resolution of edges detected.

5.1.1 Algorithm Overview

Canny edge detection proceeds in 4 ordered steps.

  1. Use a Gaussian filter to smooth the image $I$ and suppress noise: $I \leftarrow K_G * I$, where $K_G$ is the Gaussian kernel.

  2. Use the Sobel operator as a filter to estimate the gradients of the image, $G$:

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * I, \quad G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * I, \quad G = \sqrt{G_x^2 + G_y^2}, \quad \theta = \arctan(G_y / G_x),$$

    where the square root and the arctangent are elementwise. These approximate the gradients $G_x(i,j) \approx I_{i+1,j} - I_{i,j}$ and $G_y(i,j) \approx I_{i,j+1} - I_{i,j}$.

  3. Non-maximum suppression: generate thin edges from the gradient matrix by zeroing every pixel whose gradient magnitude is not the local maximum along its gradient direction.

  4. Apply low and high thresholds to the gradients. Gradients above the high threshold are marked as edges, and those below the low threshold as non-edges; pixels in between are kept as edges only if they are connected to above-high-threshold edge pixels (hysteresis).

Once these steps are complete, we can use the resulting edge images (the fully processed $I$) for the same alignment, with the same set of metrics and the image pyramid method.
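A sketch of this pipeline with scikit-image's Canny detector, reusing the `pyramid_align` and `ssd` sketches above; the `sigma` value is an assumption:

```python
from skimage.feature import canny

def align_on_canny_edges(img1, ref):
    """Align on Canny edge maps; the displacement transfers to the raw channel."""
    edges1 = canny(img1, sigma=2.0).astype(float)
    edges_ref = canny(ref, sigma=2.0).astype(float)
    return pyramid_align(edges1, edges_ref, metric=ssd)
```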

5.1.2 Result

The result is shown below:

[Figure: emir.tif aligned with Canny edges]

Green: (49, 24) Red: (107, 40)

The detected edges are shown below:

[Figures: Canny edge maps of the B, G, and R channels]

5.2 Sobel Edge Detection

5.2.1 Algorithm Overview

This algorithm, though much simpler than the previous one, can also generate good edge maps for aligning the channels.

It uses the Sobel operator mentioned above to find $G_x$ and $G_y$, and creates an edge image by simply calculating $G = \sqrt{G_x^2 + G_y^2}$.
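A sketch of this edge map with `scipy.ndimage.sobel`; the result can be fed to the same pyramid alignment as the Canny edges:

```python
import numpy as np
from scipy.ndimage import sobel

def sobel_edges(img):
    """Gradient-magnitude edge map G = sqrt(Gx^2 + Gy^2)."""
    gx = sobel(img, axis=1)  # horizontal derivative G_x
    gy = sobel(img, axis=0)  # vertical derivative G_y
    return np.sqrt(gx ** 2 + gy ** 2)
```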

5.2.2 Result

Because this algorithm is relatively simple, the detected edges appear rougher than Canny's result.

The result is shown below:

[Figure: emir.tif aligned with Sobel edges]

Green: (49, 24) Red: (106, 41)

The detected edges are shown below:

[Figures: Sobel edge maps of the B, G, and R channels]