Algorithm description


We have two goals:

  • Scene segmentation: Segment a video into scenes (extra points: scene classification)
  • Face detection and tracking: Find and track faces in a video (extra points: facial classification)

Scene Segmentation

There are many approaches to scene segmentation. Essentially, we want to detect when the pixels in the video stream change significantly – usually a different camera shot, or a drastic change in lighting or color. The naive approach of using raw pixel differences suffers from too many false positives: movement within a frame, under the same camera shot, generally should not constitute a new scene, and with raw pixel differences it is hard to choose a threshold that separates such motion from a true transition. Slow fade-ins/outs complicate matters further.
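To see the problem concretely, here is a minimal numpy sketch of that naive score on synthetic frames (the gradient "frame" and the motion/cut examples are purely illustrative):

```python
import numpy as np

def pixel_diff_score(frame_a, frame_b):
    """Mean absolute per-pixel difference between two grayscale frames."""
    return np.mean(np.abs(frame_a.astype(float) - frame_b.astype(float)))

# A synthetic gradient "frame" (values 0-159 across each row)
frame = np.tile(np.arange(160), (120, 1))

motion = np.roll(frame, 5, axis=1)  # same shot, slight horizontal motion
cut = 159 - frame                   # drastically different content

print(pixel_diff_score(frame, motion))  # small, but nonzero
print(pixel_diff_score(frame, cut))     # large -- but where is the threshold?
```

Both scores are nonzero, and fades would fill in every value in between, which is exactly why a single threshold is hard to pick.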

A more robust approach leverages histogram “summaries” of each frame. Inspired by Local Binary Pattern Histograms, or LBPHs (which concatenate many histograms into a single feature vector), I found that stacking a grayscale-intensity histogram and a hue histogram works robustly. Such an approach is sensitive to changes in either intensity or hue.

Below is a frame from a TED talk by Christopher deCharms (“A look inside the brain in real time”) transformed to grayscale and hue-only (ignoring the saturation/value channels, or the “SV” in “HSV”, and using artificial coloring to visualize).

[Figures: the frame rendered as grayscale, and as hue-only with artificial coloring]

An effective way to capture the information in the images is to convert them into histograms. Because these are 8-bit images, the grayscale histogram goes from 0-255. The hue color space is an angle mapped from 0-180. Converting the grayscale and hue images to histograms looks like this:

import numpy as np

histbins_hsv = np.arange(0, 188, 8)  # hue bin edges, width 8, covering 0-180
histbins_int = np.arange(0, 264, 8)  # intensity bin edges, width 8, covering 0-255
# np.histogram returns (counts, bin_edges); we only need the counts
hist_hsv, _ = np.histogram(hframe, bins=histbins_hsv)
hist_int, _ = np.histogram(gframe, bins=histbins_int)

[Plots: the hue histogram, with ‘Hue value (angle)’ on the x-axis, and the grayscale histogram, with ‘Grayscale intensity’ on the x-axis]

Notice that I used relatively coarse bins (steps of 8 units for both histograms) to reduce bin jitter. To combine the two color and intensity spaces, I concatenated the hue and grayscale histograms for each image. The plot below shows the histogram values for the entire TED talk (again, Christopher deCharms), with time on the x-axis and the concatenated hue/intensity histogram on the y-axis. This is similar to visualizing an audio recording with a spectrogram:

Notice that I blanked out the first 15 seconds to skip the intro animation for simplicity: it’s an unnecessary source of noise, and adds no information since all TED talk videos share it.

Just by eye, you can not only see the structure of the scene transitions above, but also classify the scenes! Look at 00:30 and 00:50 – those are clearly the same camera shot. Same for 00:15, 00:45, 01:00, and so on.
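Building that time × feature matrix frame by frame can be sketched as follows (numpy only; the `frames_gray` and `frames_hue` arrays are random stand-ins for the decoded video):

```python
import numpy as np

histbins_hsv = np.arange(0, 188, 8)  # hue bin edges, width 8, covering 0-180
histbins_int = np.arange(0, 264, 8)  # intensity bin edges, width 8, covering 0-255

def frame_feature(gframe, hframe):
    """Concatenated grayscale + hue histogram for one frame."""
    hist_int, _ = np.histogram(gframe, bins=histbins_int)
    hist_hsv, _ = np.histogram(hframe, bins=histbins_hsv)
    return np.concatenate([hist_int, hist_hsv])

# Illustrative stand-ins for the decoded grayscale and hue channels
rng = np.random.default_rng(1)
frames_gray = rng.integers(0, 256, size=(100, 90, 160))
frames_hue = rng.integers(0, 180, size=(100, 90, 160))

# One row per frame, like a spectrogram: time on one axis, bins on the other
features = np.array([frame_feature(g, h)
                     for g, h in zip(frames_gray, frames_hue)])
```

Each row sums to twice the pixel count (every pixel lands in one intensity bin and one hue bin), so rows are directly comparable across frames.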

At this point, we can take two main approaches:

  • Calculate the difference from one histogram to the next to find the scene transitions, or
  • Use machine learning to cluster the similar scenes together (scene transitions come for free)

Let’s start with the second one, since it’s much more interesting 🙂

PCA+DBSCAN approach
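The idea – project the concatenated histograms down to a few principal components, then cluster with DBSCAN so that each cluster corresponds to one recurring camera shot – can be sketched with a small self-contained version. This uses PCA via SVD and a minimal brute-force DBSCAN (in practice scikit-learn’s PCA and DBSCAN do the same job), and the feature rows, `eps`, and `min_samples` here are all illustrative:

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def dbscan(X, eps, min_samples):
    """Minimal brute-force DBSCAN; unclustered points keep label -1."""
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in dists]
    labels = np.full(len(X), -1)
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for i in range(len(X)):
        if visited[i] or len(neighbors[i]) < min_samples:
            continue  # not an unvisited core point
        stack = [i]
        visited[i] = True
        while stack:
            j = stack.pop()
            labels[j] = cluster
            for k in neighbors[j]:
                if not visited[k]:
                    visited[k] = True
                    if len(neighbors[k]) >= min_samples:
                        stack.append(k)       # core point: keep expanding
                    else:
                        labels[k] = cluster   # border point: label only
        cluster += 1
    return labels

# Illustrative "frames": shot A, shot B, then shot A again, with noise
rng = np.random.default_rng(2)
shot_a = rng.integers(0, 200, size=55).astype(float)
shot_b = rng.integers(0, 200, size=55).astype(float)
features = np.vstack([shot_a + rng.normal(0, 2, (30, 55)),
                      shot_b + rng.normal(0, 2, (30, 55)),
                      shot_a + rng.normal(0, 2, (30, 55))])

coords = pca(features, n_components=3)
labels = dbscan(coords, eps=10.0, min_samples=5)

# Scene cuts fall wherever consecutive frame labels differ --
# and frames 0-29 and 60-89 land in the SAME cluster (same shot)
cuts = np.flatnonzero(np.diff(labels) != 0) + 1
```

This is why the transitions “come for free”: clustering recovers both the cut positions and the fact that two separated segments are the same camera shot.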



Histogram difference method
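The first approach is simpler: score each frame-to-frame transition by the distance between consecutive concatenated histograms, and call a cut wherever the score spikes above a threshold. A minimal numpy sketch using the L1 distance (the feature rows and the threshold value are illustrative):

```python
import numpy as np

def hist_diff_cuts(features, threshold):
    """Indices where the L1 distance between consecutive frame
    histograms exceeds the threshold -> candidate scene cuts."""
    diffs = np.abs(np.diff(features, axis=0)).sum(axis=1)
    return np.flatnonzero(diffs > threshold) + 1

# Illustrative features: two flat "shots" with a hard cut at frame 50
features = np.vstack([np.tile([10.0, 0.0, 5.0], (50, 1)),
                      np.tile([0.0, 12.0, 1.0], (50, 1))])
features += np.random.default_rng(3).normal(0, 0.1, features.shape)

cuts = hist_diff_cuts(features, threshold=5.0)
# with this synthetic data, cuts == array([50])
```

The threshold is still a tuning knob, but thanks to the coarse histogram bins the within-shot differences stay far below the cut spikes, making it much easier to set than with raw pixel differences.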


Face detection

For face detection, we are using the Viola-Jones object detection algorithm with Haar cascade filters. The basic idea is to run a series of increasingly harder “tests” (hence the name “cascade”) on a subwindow of the image. If any test in the cascade fails, the subwindow is rejected and we move on to the next location; if all tests pass, the subwindow is marked as a detection. Multiple passes are made over the image with increasing subwindow sizes (normally 10-20% increases in linear size from one pass to the next), and regions that satisfy all cascade tests across multiple passes are saved as positive detections.

The Viola-Jones detection algorithm can be used to detect any object (see e.g. detecting bananas) by first training on positive and negative images. From these examples, the algorithm determines the series of cascade tests to execute on a new image.

Check out this amazing visualization by Adam Harvey (from CV Dazzle) showing the algorithm in action:

The execution of cascade tests with Haar filters is relatively quick because Haar filters require only sums. Haar filters take their name from the Haar wavelet, the most basic wavelet form, which resembles a square wave with -1 on one region and +1 on another; the wavelet is shown below for reference (from Wikipedia). The Haar filters themselves are vertical or horizontal alternating stripes (the white and black ones in the video above), and the summed pixel values under the different stripe colors are compared.
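The “only sums” point can be made concrete with an integral image: after one cumulative-sum pass over the frame, the sum under any rectangle costs four array lookups, so a two-stripe Haar feature costs eight, regardless of the rectangle’s size. A numpy sketch (not OpenCV’s implementation; the feature geometry here is illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table, padded so ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of img[y:y+h, x:x+w] in four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_stripe(ii, y, x, h, w):
    """Vertical two-rectangle feature: left stripe minus right stripe."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)

img = np.arange(36.0).reshape(6, 6)
ii = integral_image(img)

# The four-lookup sum matches the direct (slow) sum
assert rect_sum(ii, 1, 2, 3, 2) == img[1:4, 2:4].sum()
```

The cascade’s speed comes from exactly this trick: every test at every subwindow size reuses the same precomputed table.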




Face tracking with optical flow

Because the Viola-Jones frontal-face detector fails to find the face in many frames, we need a robust method for tracking the face’s movement through these “gap frames” with no detection. This is where we use optical flow.

Optical flow is a technique that attempts to determine the motion of pixels across two images.
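A toy way to see what optical flow recovers is block matching: take a patch from the first frame (say, inside the last detected face box) and search for its best match nearby in the second frame. Real optical-flow methods such as Lucas-Kanade are far more refined; everything below is a simplified, self-contained illustration:

```python
import numpy as np

def block_match(frame1, frame2, y, x, size=16, search=8):
    """Find the displacement of the patch frame1[y:y+size, x:x+size]
    in frame2 by exhaustive SSD search within +/- search pixels."""
    patch = frame1[y:y+size, x:x+size].astype(float)
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = frame2[y+dy:y+dy+size, x+dx:x+dx+size].astype(float)
            err = np.sum((cand - patch) ** 2)
            if err < best_err:
                best_err, best = err, (dy, dx)
    return best  # (dy, dx) motion of the patch

# A textured frame and a copy shifted 3 px right and 2 px down
rng = np.random.default_rng(4)
frame1 = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
frame2 = np.roll(np.roll(frame1, 2, axis=0), 3, axis=1)

print(block_match(frame1, frame2, y=24, x=24))  # -> (2, 3)
```

Applying this (or a proper flow method) to the last known face region gives a motion estimate that carries the bounding box across the gap frames until the detector fires again.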

Improvements and future directions

The Viola-Jones algorithm with Haar cascades is relatively fast and performs well, but is limited to frontal faces. In the future, a generative model similar to Facebook’s DeepFace approach (not necessarily using Deep Learning) would be an appropriate direction.

For face classification, it is clear that LBPH classifiers are rather limited and should be retired going forward. A better approach would be to use convolutional neural networks.

See Yann LeCun’s CVPR 2015 keynote slides: “What’s Wrong With Deep Learning?” for a thorough review of modern approaches to object detection.