Introduction

The Prokudin-Gorskii Photo Collection is a set of photographs of various parts of the Russian Empire made between 1905 and 1915. As the Library of Congress describes it, the collection’s “2,607 distinct images include people, religious architecture, historic sites, industry and agriculture, public works construction, scenes along water and railway transportation routes, and views of villages and cities.” The photographer behind the collection was Sergei Mikhailovich Prokudin-Gorskii, who was also a scientist.

He set out to capture color images using a specific technique: photographing the same scene through three separate filters (red, green, and blue) to produce three different exposures. These exposures became the glass negatives that make up the collection. His vision was that more sophisticated projectors would eventually be built to turn his glass negatives into colorized images, but this dream was not fulfilled at the time.

Problem Statement

Today, color photography is far more advanced, so we no longer need to capture scenes on glass negatives. Better yet, we can reconstruct the images that Prokudin-Gorskii took using computer vision techniques: aligning, cropping, and matching the properties of the three different exposures.

Solution Approach

To align these images, we could exhaustively search all possible arrangements, but this would not scale to very large images. As a result, we need to devise more efficient algorithms that still find reasonable alignments between the color channels.

Naive Implementation

For smaller images, we can afford to iterate over a small range of displacements (e.g. [-15, 15]) along both the x- and y-axes to find the best arrangement of two images. Since we have three channels, we can pick one channel as the reference image and align the other two to it. In my case, I used the blue channel as the reference and aligned both the red and green channels to it.

How do we determine which alignment is best? We can define a similarity/loss function that measures how close two images are to each other, based solely on the features and pixel values in the images.
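Below is a minimal sketch of this exhaustive search, assuming a generic loss_fn (candidate loss functions are discussed in the next section). The function and parameter names here are my own, and np.roll is just one convenient way to apply a candidate displacement; it wraps pixels around the border, which is acceptable for small shifts.

    import numpy as np

    def align_exhaustive(channel, reference, loss_fn, window=15):
        """Try every displacement in [-window, window] on both axes and
        return the (dy, dx) that minimizes loss_fn against the reference."""
        best_disp, best_loss = (0, 0), np.inf
        for dy in range(-window, window + 1):
            for dx in range(-window, window + 1):
                # np.roll wraps pixels around the edges, which is fine for small shifts.
                shifted = np.roll(channel, shift=(dy, dx), axis=(0, 1))
                loss = loss_fn(shifted, reference)
                if loss < best_loss:
                    best_disp, best_loss = (dy, dx), loss
        return best_disp

With the blue channel as the reference, the same call aligns both other channels, e.g. align_exhaustive(g, b, loss_fn) and align_exhaustive(r, b, loss_fn).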

Loss Function

There are a variety of ways to measure the similarity or difference between two input images. One of the simplest is to define a difference formula between the pixel values in one image and those in another. I tried the three metrics below; minimal sketches of them as loss functions appear after this list.

  1. Mean Squared Error (MSE) - closely related to Euclidean distance: sklearn.metrics.mean_squared_error

    Distance measures how far apart two values are, which makes it a natural measure of difference. In our case, the squared difference is computed between every pair of overlapping pixels in the two channels and averaged. This is the loss metric I started with, since it is the one I am most familiar with. However, the results were not optimal: the channels were still visibly misaligned when zooming into the image.

    [Figure: cathedral.jpg, MSE loss, g→b [1, -1], r→b [-1, 7]]

    [Figure: monastery.jpg, MSE loss, g→b [0, -6], r→b [1, 9]]
    
  2. Normalized Cross Correlation (NCC): np.trace(A.T @ B) / (np.linalg.norm(A) * np.linalg.norm(B))

    Instead of measuring the difference between the two images, I wanted to see whether a similarity metric like NCC would behave differently. At its core, NCC takes the dot product between two vectors (the flattened images), normalized by their magnitudes, to find the arrangement with the highest similarity between the images. To use this as a loss function, I negated the output so that more similar arrangements yield smaller loss values.

    However, this method yielded the same results as MSE. The main reason is that both rely only on raw pixel values, while differences in contrast between the channels, important features, and lots of other information stored in the images go unused. There are other metrics that take advantage of some of these properties beyond raw pixel values. One of them is the Structural Similarity Index.

  3. Structural Similarity Index Measure (SSIM): skimage.metrics.structural_similarity (Bells & Whistles)

    The Structural Similarity Index is an improved metric for measuring the similarity between two images; it uses local structure and texture on top of pixel values to produce a score. Originally developed to predict the perceived quality of television and cinematic pictures, it is now a commonly used similarity metric that performs better than MSE here.

    [Figure: cathedral.jpg, MSE loss, g→b [1, -1], r→b [-1, 7]]

    [Figure: harvesters.jpg, MSE loss, g→b [-3, 118], r→b [7, 120]]

    [Figure: cathedral.jpg, SSIM loss, g→b [2, 5], r→b [3, 12]]

    [Figure: harvesters.jpg, SSIM loss, g→b [14, 59], r→b [11, 122]]

Given this difference in quality, I chose the Structural Similarity Index Measure for my final results.
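To make these metrics concrete, here is a minimal sketch of how each can be written as a loss function (lower is better). The function names are my own; the SSIM variant simply wraps skimage.metrics.structural_similarity, and the images are assumed to be float arrays scaled to [0, 1].

    import numpy as np
    from skimage.metrics import structural_similarity

    def mse_loss(a, b):
        # Mean squared difference over the overlapping pixels (a, b are float arrays).
        return np.mean((a - b) ** 2)

    def ncc_loss(a, b):
        # Negated normalized cross-correlation: the dot product of the flattened
        # images divided by the product of their norms, negated so lower is better.
        return -np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def ssim_loss(a, b):
        # structural_similarity returns a score in [-1, 1]; negate it to use as a loss.
        # data_range=1.0 assumes images scaled to [0, 1].
        return -structural_similarity(a, b, data_range=1.0)

Any of these can be passed as loss_fn to the exhaustive search sketched earlier.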

Image Pyramid Algorithm

While the naive implementation works well for smaller images, we can't apply the same algorithm to larger images: the range of displacements we would need to search becomes too computationally expensive. Instead, we can cover larger displacements across the original image by using an image pyramid structure in our algorithm.
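A minimal sketch of this coarse-to-fine idea, reusing the align_exhaustive helper from the naive sketch above, might look like the following; the halving factor, the base-case size, and the refinement window are illustrative choices rather than the only reasonable ones.

    import numpy as np
    from skimage.transform import rescale

    def align_pyramid(channel, reference, loss_fn, min_size=400):
        """Coarse-to-fine alignment: recurse on half-resolution copies,
        then refine the doubled estimate with a small local search."""
        if min(channel.shape) <= min_size:
            # Small enough for the naive exhaustive search.
            return align_exhaustive(channel, reference, loss_fn, window=15)
        # Estimate the displacement at half resolution.
        dy, dx = align_pyramid(rescale(channel, 0.5, anti_aliasing=True),
                               rescale(reference, 0.5, anti_aliasing=True),
                               loss_fn, min_size)
        dy, dx = 2 * dy, 2 * dx
        # Apply the upsampled estimate and refine it within a small window.
        shifted = np.roll(channel, shift=(dy, dx), axis=(0, 1))
        rdy, rdx = align_exhaustive(shifted, reference, loss_fn, window=2)
        return dy + rdy, dx + rdx

Because the search window stays small at every level, the total work is a handful of cheap searches rather than one enormous one over the full-resolution image.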