
Programming Project #4 (proj4)

COMPSCI 180 Intro to Computer Vision and Computational Photography

Chuyan Zhou

This webpage uses the Typora Academic theme for Markdown files.

Part A

1. Shoot the pictures

I took two pictures of one place in my Minecraft world. The two pictures clearly have different perspectives, and there is a projective relationship between them, visible especially on the surface of the main building. The two pictures are shown below:

Image 1
Image 2
Correspondence points in Image 1
Correspondence points in Image 2

I think the effort of configuring and taking photos inside a game with proper perspective properties may be counted as one kind of Bells & Whistles under the "own idea" section.

2. Recover Homographies

Given that there is a projective relationship between these two images, we know that we can use a homography matrix H to transform the points in one image to the other. More technically, the homography matrix works as a transformation matrix that warps the points of one image into the other. Here is the mathematical model of the homography matrix:

Given a point $p = (x, y)^T$ in the first image and its corresponding point $p' = (x', y')^T$ in the second image, we have the following equation:

$$\begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad \text{i.e.} \quad \begin{bmatrix} w\,p' \\ w \end{bmatrix} = H \begin{bmatrix} p \\ 1 \end{bmatrix}$$

where H is a 3×3 matrix modeled as:

$$H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix}$$

having 8 degrees of freedom, where $w$ is the scaling factor. We stack the unknown entries of $H$ into a vector

$$h = [a, b, c, d, e, f, g, h]^T,$$

and the transformation equation can be expanded as

$$\begin{cases} wx' = ax + by + c \\ wy' = dx + ey + f \\ w = gx + hy + 1 \end{cases} \;\Longrightarrow\; \begin{cases} (gx + hy + 1)\,x' = ax + by + c \\ (gx + hy + 1)\,y' = dx + ey + f, \end{cases}$$

which can be collapsed into the $Ah = b$ form as

$$\underbrace{\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x'x & -x'y \\ 0 & 0 & 0 & x & y & 1 & -y'x & -y'y \end{bmatrix}}_{A} h = \begin{bmatrix} x' \\ y' \end{bmatrix} =: b$$

where $A$ is a $2 \times 8$ matrix, $h$ is an $8 \times 1$ vector, and $b$ is a $2 \times 1$ vector. By stacking 4 pairs of points into $A$, the system has a guaranteed solution. If we have more than 4 pairs of points, we solve the equation as an overdetermined system by least squares:

$$\min_h \|Ah - b\|^2$$
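As a concrete illustration, here is a minimal sketch of solving this least-squares system with NumPy (the function name and array layout are my own choices, not the project starter code):

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Estimate H mapping pts1 -> pts2 from n >= 4 point pairs (n x 2 arrays)."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        # Two rows of A per correspondence, as derived above.
        A.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y])
        A.append([0, 0, 0, x, y, 1, -yp * x, -yp * y])
        b.extend([xp, yp])
    A, b = np.asarray(A, float), np.asarray(b, float)
    # Least-squares solution of the (possibly overdetermined) system A h = b.
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```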

3. Warp the images

After we have the homography matrix H, we can warp the points in one image to the other. The warping process can be done by the following steps:

  1. For each pixel in the target image, we use the inverse homography to find its corresponding (generally non-integer) position in the source image.

  2. We use bilinear interpolation to sample values at these non-integer positions in the source image.

Because the warped image can extend to negative pixel coordinates, we apply a translation matrix $T$ to both images to bring them into a common, non-negative coordinate frame so they stay aligned.
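Below is a minimal sketch of the inverse-warping step, assuming `H` maps source coordinates to target coordinates and using SciPy's `map_coordinates` for bilinear interpolation (names are illustrative, not the actual project code):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_image(src, H, out_shape):
    """Inverse-warp src into a canvas of shape out_shape using homography H."""
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[:out_shape[0], :out_shape[1]]
    # Map every target pixel back into (possibly non-integer) source coordinates.
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    source = H_inv @ target
    sx, sy = source[0] / source[2], source[1] / source[2]   # dehomogenize
    out = np.zeros((*out_shape[:2], src.shape[2]), dtype=float)
    for c in range(src.shape[2]):
        # order=1 gives bilinear interpolation; pixels outside the source stay 0.
        out[..., c] = map_coordinates(src[..., c].astype(float), [sy, sx],
                                      order=1, cval=0.0).reshape(out_shape[:2])
    return out
```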

We warp image 1 to 2 now.

Image 2 (Reference), scaled and shifted
Image 2
Image 1 (Warped)

Here is the version without the correspondence points:

Image 1 (Warped)

4. Image Rectification

In this part, we choose 4 points from one original image, and warp the image to make the 4 points become a rectangle.

4.1 Rectified Image 1

Original Image
Original Image, marked with corresponding points
Rectified Image
Rectified Image, marked with corresponding points

A large rectified image as above:

Rectified Image (shown again)

4.2 Rectified Image 2

Original Image
Original Image, marked with corresponding points
Rectified Image
Rectified Image, marked with corresponding points
Rectified Image (shown again)

5. Blend the images into a mosaic

In this part, we blend the two images mentioned in parts 2 and 3 into a mosaic. For a 2-image mosaic (which is our case), we first warp the two images into the same coordinate system, and then blend them into one.

5.1 Alpha Blending

First, we use linear blending, a.k.a. alpha blending, to blend the two images: we set the weight to 0.5 in the overlapping region of both masks and take the weighted sum.
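A minimal sketch of this alpha blending, assuming both images have already been warped onto the same canvas and `mask1`/`mask2` are boolean masks of valid pixels (names are my own):

```python
import numpy as np

def alpha_blend(im1, im2, mask1, mask2):
    """Weight 0.5 where both masks overlap, 1 where only one image is valid."""
    w1, w2 = mask1.astype(float), mask2.astype(float)
    overlap = mask1 & mask2
    w1[overlap], w2[overlap] = 0.5, 0.5
    denom = np.maximum(w1 + w2, 1e-8)   # avoid division by zero outside both masks
    return (im1 * w1[..., None] + im2 * w2[..., None]) / denom[..., None]
```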

Image 1 (Warped)
Alpha-blending Mask for Image 1
Image 2 (Scaled and Shifted)
Alpha-blending Mask for Image 2
Alpha-blended Image

However, we can observe apparent artifacts near the edges of the images.

To mitigate them, we can use methods such as the Laplacian pyramid or the distance transform.

5.2 Distance Transform

We can see the two images are not perfectly matched because of different illumination/exposure conditions and (possibly) hand-picked correspondences, so some artifacts remain in the final mosaic image.

So we use the distance transform method to blend the two images. The distance transform method calculates, for each pixel of each mask, the distance to the nearest edge (the nearest zero pixel of the mask), and blends the two images based on these distances. The farther a pixel is from the edge (i.e., the deeper it lies inside its mask), the more weight it gets in the blending.

Here is the step-by-step procedure to blend the two images using the distance transform method:

  1. Calculate, for each mask, the distance of every pixel to the nearest zero pixel of the mask. Let the mask be $M$:

    $$D_k(x, y) \leftarrow \min_{M(x', y') = 0} \sqrt{(x - x')^2 + (y - y')^2}, \quad k \in [n].$$
  2. Normalize the distance map to $[0, 1]$:

    $$D_k(x, y) \leftarrow \frac{D_k(x, y) - \min D_k}{\max D_k - \min D_k}, \quad k \in [n].$$
  3. Blend the two images based on the distance maps. Let the two images be $I_1$ and $I_2$:

    $$I_B(x, y) \leftarrow \frac{\sum_{k=1}^{n} D_k(x, y)\, I_k(x, y)}{\sum_{k=1}^{n} D_k(x, y)}.$$

Here n=2 in our blending process: even though we have multiple images, we can blend them one by one (pairwise) to get the final mosaic image.
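A minimal sketch of this blending, using SciPy's `distance_transform_edt` for step 1 (the function and argument names are my own; it assumes the images are already warped onto a common canvas):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_transform_blend(images, masks):
    """Blend n images weighted by their normalized distance-transform maps."""
    weights = []
    for m in masks:
        # Distance of every pixel inside the mask to the nearest zero (background) pixel.
        d = distance_transform_edt(m)
        d = (d - d.min()) / (d.max() - d.min() + 1e-8)   # normalize to [0, 1]
        weights.append(d)
    num = sum(im * w[..., None] for im, w in zip(images, weights))
    den = np.maximum(sum(weights), 1e-8)
    return num / den[..., None]
```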

Image 1 (Warped)
DT Mask for Image 1
Image 2 (Scaled and Shifted)
DT Mask for Image 2
Distance-transform-blended Image

This seems much better.

I also implemented 3-image warping and blending.

Distance-transform-blended Image, but from 3 images

5.3 Deliverables: 3 Mosaics

Here are my 3 finished mosaics.

Mosaic 1

Image 1
Image 2
Image 3
Mosaic from 3 images

Mosaic 2

Image 1
Image 2
Mosaic

Mosaic 3

Image 1
Image 2
Mosaic

 

Part B

1. Corner detection

1.1 Harris Interest Point Detector

Initially, to find feature points in an image, we pay attention to corners. Corners are regions of the image where the intensity changes in all directions. Harris corner detection (interest point detection) is a corner detection algorithm commonly used in computer vision. Below is the implementation of the Harris corner detection algorithm.

The Harris corner detection algorithm is based on thresholding and finding the local maxima of the Harris response function. We set the threshold to 0.002 times the maximum of the Harris heatmap.

Harris corners detected
Harris response heatmap
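A minimal sketch of this detector using scikit-image (the library choice and exact thresholding details here are illustrative; the actual course-provided starter code may differ):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

def harris_corners(image, rel_threshold=0.002, min_distance=1):
    """Return the Harris response map and corner coordinates above a relative threshold."""
    h = corner_harris(rgb2gray(image))                     # Harris response heatmap
    # Local maxima whose response exceeds rel_threshold * max(h).
    coords = peak_local_max(h, min_distance=min_distance,
                            threshold_abs=rel_threshold * h.max())
    return h, coords                                       # coords are (row, col) pairs
```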

1.2 ANMS (Adaptive Non-Maximal Suppression)

The Harris corner detection algorithm gives us a lot of corner points. However, not all of them are useful, and there are too many of them. Only slightly more than 4 feature points are needed to recover the homography from the overdetermined system.

To reduce the number of corner points, we use the Adaptive Non-Maximal Suppression (ANMS) algorithm. ANMS algorithm selects the best corner points from the corner points detected by the Harris corner detection algorithm. Below is the implementation of the ANMS algorithm.

Here is the algorithm step by step from the paper:

  1. We are given the harris heatmap, which is a 2D array of corner scores.

  2. For each corner, find the minimum distance to another corner whose score, scaled by $c_{\text{robust}}$, still exceeds the current corner's score, i.e.

    $$r_i = \min_j \|x_i - x_j\| \quad \text{s.t.} \quad \mathrm{score}(x_i) < c_{\text{robust}} \cdot \mathrm{score}(x_j), \; \|x_i - x_j\| < r_{\max},$$

    where $x_i$ is the location of the $i$-th corner detected by the Harris corner detector, $x_j$ is that of the $j$-th corner, and $r_{\max}$ is a maximum search radius. If there is no such corner, we set $r_i = \infty$.

  3. Given a required number of corners n, we select the n corners with the largest ri values.

We use a k-d tree to find the nearest neighbors efficiently and a heap as an ordered list, so that the ANMS algorithm runs in a reasonable time.

Also, for the hyperparameters, we choose $c_{\text{robust}} = 0.9$, which is a common choice in practice and in the paper. We also tune $r_{\max}$ starting from 0, increasing it until $n$ points can be found within reasonable time and memory, which yields 500 as the final choice.
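Here is a simplified sketch of ANMS using only a k-d tree ball query (it omits the heap optimization mentioned above; names and the exact loop structure are my own):

```python
import numpy as np
from scipy.spatial import cKDTree

def anms(coords, scores, n_points=500, c_robust=0.9, r_max=500):
    """Keep the n_points corners with the largest suppression radius r_i."""
    tree = cKDTree(coords)
    radii = np.full(len(coords), np.inf)
    for i, (pt, s) in enumerate(zip(coords, scores)):
        # Neighbors within r_max whose score, scaled by c_robust, dominates corner i.
        neighbors = tree.query_ball_point(pt, r_max)
        dominating = [j for j in neighbors if s < c_robust * scores[j]]
        if dominating:
            radii[i] = np.linalg.norm(coords[dominating] - pt, axis=1).min()
    keep = np.argsort(-radii)[:n_points]                   # largest radii first
    return coords[keep]
```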

Feature points, filtered by ANMS

2. Extracting Feature Descriptors

2.1 Axis-aligned Patches

Now we have acquired the feature points from the image. The next step is to extract feature descriptors from the image to find matches. We use MOPS (Multi-Scale Oriented Patches) to extract the feature descriptors. In 2.1, we first compute axis-aligned patches.

In detail, for every corner point from ANMS, we extract a 40×40 patch around the corner point. We then apply a Gaussian filter to the patch and downsample it to 8×8. Finally, we flatten the 8×8 patch into a 64-dimensional feature vector and normalize it to have zero mean and unit variance.
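A minimal sketch of this descriptor extraction (assuming a grayscale image and (row, col) corner coordinates; the helper names are my own):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(gray, coords, patch_size=40, out_size=8):
    """40x40 patch -> Gaussian blur -> 8x8 subsample -> flatten -> bias/gain normalize."""
    half, step = patch_size // 2, patch_size // out_size
    descriptors, kept = [], []
    for r, c in coords:
        if r < half or c < half or r + half > gray.shape[0] or c + half > gray.shape[1]:
            continue                                       # skip patches falling off the image
        patch = gaussian_filter(gray[r - half:r + half, c - half:c + half], sigma=step / 2)
        vec = patch[::step, ::step].ravel().astype(float)  # 8x8 = 64-dimensional vector
        descriptors.append((vec - vec.mean()) / (vec.std() + 1e-8))
        kept.append((r, c))
    return np.array(descriptors), np.array(kept)
```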

We also visualize the patches without being flattened and normalized as raw feature descriptors. Note that we yield grayscale patches for matching.

First 121 raw descriptors of the image above (RGB-corresponded, not normalized, and not flattened)

2.2 Rotation-Invariant Descriptors (Bells & Whistles)

This part is independently shown in B.5.

3. Feature Matching

Now we have extracted the feature descriptors from each image. The next step is to find matches between the feature descriptors from the two images. We use the Sum of Squared Differences (SSD) to find the best matches between the feature descriptors from the two images. To further improve the matching and reduce the number of false matches, we use:

  1. Symmetry Check: Check if the best match from x from image 1 to y from image 2 is also the best match from y to x.

  2. Lowe's Ratio Test: Check if the best match from x to y is significantly better than the second-best match from x to y. We use a threshold of 0.8 as suggested by Lowe: if $\frac{\text{best match}}{\text{second best match}} < 0.8$, we consider the match valid.

  3. RANSAC (Random Sample Consensus): Find the largest set of inliers from the matches in a stochastic way.

3.1 Symmetry Check and Lowe's Ratio Test

We implement the symmetry check and Lowe's ratio test to find valid matches between the feature descriptors from the two images, and we visualize the resulting matches below.
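A minimal sketch of the SSD matching with the symmetry check and ratio test (the brute-force distance computation and the names are my own simplifications):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.8):
    """Return index pairs (i, j) that pass the symmetry check and Lowe's ratio test."""
    # Pairwise SSD between the two descriptor sets.
    d = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(-1)
    best12 = d.argmin(axis=1)                  # best match in image 2 for each feature in 1
    best21 = d.argmin(axis=0)                  # best match in image 1 for each feature in 2
    matches = []
    for i, j in enumerate(best12):
        if best21[j] != i:                     # symmetry check
            continue
        first, second = np.sort(d[i])[:2]
        if first / (second + 1e-12) < ratio:   # Lowe's ratio test
            matches.append((i, j))
    return matches
```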

Matching relations after Symmetry check and Lowe's trick

One can see that even after these checks and tricks, there are still many spurious correspondences.

3.2 RANSAC

We now apply RANSAC on these feature pairs. RANSAC includes these steps:

  1. Select 4 feature pairs at random;

  2. Compute homography H based on these feature pairs;

  3. Compute the inliers, i.e. the pairs satisfying $d(Hx, y) < \text{threshold}$, where $x$ is the feature from image 1 and $y$ is the feature from image 2;

  4. Repeat 1-3 for n iterations and keep the homography with the most populous inlier set;

  5. Refit the homography H from the overdetermined inlier set.
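A minimal sketch of this loop, reusing the `compute_homography` least-squares solver sketched in Part A (the hyperparameter values shown here are placeholders, not necessarily the ones actually used):

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=1000, threshold=3.0):
    """Fit H from random 4-point samples, keep the largest inlier set, then refit."""
    n = len(pts1)
    pts1_h = np.hstack([pts1, np.ones((n, 1))])        # homogeneous coordinates
    best_inliers = np.array([], dtype=int)
    for _ in range(n_iters):
        sample = np.random.choice(n, 4, replace=False)
        H = compute_homography(pts1[sample], pts2[sample])
        proj = (H @ pts1_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]              # dehomogenize
        inliers = np.where(np.linalg.norm(proj - pts2, axis=1) < threshold)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the whole inlier set (overdetermined least squares).
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```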

Matching relations after applying RANSAC

Note that because of the stochastic nature of RANSAC, the automatically found correspondence points can occasionally be wrong when mosaicing. In such cases, we can rerun the pipeline or fine-tune its hyperparameters to get more robust auto-stitching. The process may not succeed every time, but the implemented pipeline finds the correct correspondences and homography with a tractable probability.

4. Mosaic Generation (Image Blending)

Finally, we generate the mosaic of the two images using the homography matrix obtained from RANSAC. We use the homography matrix to warp the first image to the second image and blend the two images to generate the mosaic using distance transform. We also show the final mosaic of the two images.

Below are the deliverables, i.e. 3 mosaics with comparison between manually and automatically stitched images.

Here we only use axis-aligned feature descriptors. For the Bells & Whistles (rotation-invariant descriptors), see part B.5.

Deliverables: 3 Mosaics

Manually Stitched
Automatically Stitched
Manually Stitched
Automatically Stitched
Manually Stitched
Automatically Stitched

 

5. Rotation-Invariant Descriptors

We compute the orientation of each corner point, rotate the image accordingly (which is equivalent to rotating the patch window), and slice the patch from the rotated window to make the feature descriptors rotation-invariant. The gradients and orientations are computed using the Sobel operator on the grayscale version of the image.
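A minimal sketch of the oriented-patch extraction for a single interior corner (the rotation sign may need flipping depending on the axis convention; names and window sizes are illustrative):

```python
import numpy as np
from scipy import ndimage

def oriented_descriptor(gray, r, c, patch_size=40, out_size=8):
    """Estimate the gradient orientation at (r, c) with Sobel filters, rotate a local
    window to cancel it, then subsample the centered 40x40 patch to 8x8 and normalize."""
    gy = ndimage.sobel(gray, axis=0)
    gx = ndimage.sobel(gray, axis=1)
    theta = np.degrees(np.arctan2(gy[r, c], gx[r, c]))   # dominant orientation at the corner
    big = patch_size                                     # larger window so rotation keeps coverage
    window = gray[r - big:r + big, c - big:c + big]      # assumes the corner is far from borders
    rotated = ndimage.rotate(window, theta, reshape=False, order=1)
    cy, cx, half = rotated.shape[0] // 2, rotated.shape[1] // 2, patch_size // 2
    patch = rotated[cy - half:cy + half, cx - half:cx + half]
    step = patch_size // out_size
    vec = patch[::step, ::step].ravel().astype(float)
    return (vec - vec.mean()) / (vec.std() + 1e-8)       # bias/gain normalize
```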

Rotation-invariant descriptors
First 121 raw descriptors (rotation invariant) of the image above (RGB-corresponded, not normalized, and not flattened)

Deliverables: 3 Mosaics using Rotation-Invariant Descriptors

Manually Stitched
Automatically Stitched (with rotation-invariant descriptors)
Manually Stitched
Automatically Stitched (with rotation-invariant descriptors)
Manually Stitched
Automatically Stitched (with rotation-invariant descriptors)

What I've learned

In part A, I learned about perspective projection, the importance of fixing the center of projection (COP) while taking photos, the fine-grained details of recovering homography matrices and warping images, and the distance transform method for blending images.

In part B, I learned various algorithms for detecting and selecting feature points, describing features, and matching features (as well as correctly reducing the matches to a reasonable number).