This chapter does not appear in the book.
One of the claims to fame of the Kinect sensor is its depth processing capabilities, including the generation of depth maps. It's possible to implement similar functionality on a PC with two ordinary webcams (after they've been calibrated). The picture on the right shows the left and right images from the cameras being rectified, using the calibration information to undistort and align the pictures. Those images are then transformed into a grayscale disparity map, a 3D point cloud, and an anaglyph picture.
The disparity map indicates that the user's coffee cup is closest to the cameras since it's colored white; the user is a light gray, and so a bit further away; and the background is a darker gray, the furthest of the three.
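DepthViewer's own depth-processing code isn't listed here, but the core idea can be sketched with OpenCV's block matcher. The fragment below is only an illustration, written in Python rather than whatever language DepthViewer actually uses; the file names and matcher parameters (numDisparities, blockSize) are assumptions. It turns an already-rectified left/right pair into a grayscale disparity map in which nearer objects appear brighter.

import cv2

# load the already-rectified pair (file names are placeholders)
left = cv2.imread("rectifiedLeft.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rectifiedRight.png", cv2.IMREAD_GRAYSCALE)

# block matching; numDisparities must be a multiple of 16
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)      # fixed-point values, scaled by 16

# stretch into the 0-255 range so larger disparities (nearer objects) are whiter
disp8 = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX, dtype=cv2.CV_8U)
cv2.imwrite("disparityMap.png", disp8)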
It's possible to click on the disparity map to retrieve depth information (in millimeters). The next image shows the complete DepthViewer GUI, with a red dot and a number marked on the map stating that the coffee cup is 614 mm away from the camera.
Unfortunately, this information isn't particularly accurate (the actual distance is nearer 900 mm), for reasons explained later.
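For reference, the depth of a point is inversely proportional to its disparity: Z = f*T/d, where f is the focal length, T the baseline between the cameras, and d the disparity. OpenCV's reprojectImageTo3D() applies this relationship to every pixel using the 4x4 reprojection matrix Q produced during calibration. The following sketch (not DepthViewer's actual click handler; the file names and clicked coordinates are invented) shows how a depth value in millimeters could be read off at a clicked pixel, assuming the chessboard square size used for calibration was given in millimeters.

import cv2
import numpy as np

left = cv2.imread("rectifiedLeft.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rectifiedRight.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0   # real-valued disparities

Q = np.load("Q.npy")        # assumed file holding the reprojection matrix from calibration
points3D = cv2.reprojectImageTo3D(disparity, Q)   # (X, Y, Z) for every pixel

x, y = 320, 240             # hypothetical click position on the disparity map
print("Depth at (%d, %d): %.0f mm" % (x, y, points3D[y, x, 2]))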
A point cloud is a 3D representation of the depth information, stored in the popular PLY data format, which allows it to be loaded (and manipulated) by various 3D tools. The image below shows two screenshots of the point cloud from the top of this page loaded into MeshLab.
The image on the right shows the point cloud rotated to the left so that the z-axis (the depth information) is more visible.
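The PLY format itself is simple enough that the conversion can be sketched in a few lines. The fragment below is an illustrative Python version, not DepthViewer's own exporter: it colors each 3D point with the corresponding pixel of the left rectified image, drops pixels with no valid disparity, and writes an ASCII PLY file that MeshLab can open. The file names, matcher settings, and Q matrix are the same assumptions as in the earlier sketches.

import cv2
import numpy as np

left = cv2.imread("rectifiedLeft.png")                     # color version, for per-point colors
grayL = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
grayR = cv2.imread("rectifiedRight.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disp = stereo.compute(grayL, grayR).astype(np.float32) / 16.0

Q = np.load("Q.npy")
points = cv2.reprojectImageTo3D(disp, Q)

mask = disp > disp.min()                                   # discard pixels with no valid disparity
verts = points[mask]
colors = cv2.cvtColor(left, cv2.COLOR_BGR2RGB)[mask]

with open("cloud.ply", "w") as f:
    f.write("ply\nformat ascii 1.0\n")
    f.write("element vertex %d\n" % len(verts))
    f.write("property float x\nproperty float y\nproperty float z\n")
    f.write("property uchar red\nproperty uchar green\nproperty uchar blue\n")
    f.write("end_header\n")
    for (X, Y, Z), (r, g, b) in zip(verts, colors):
        f.write("%f %f %f %d %d %d\n" % (X, Y, Z, r, g, b))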
The anaglyph in the image at the top of this page is created by encoding the left and right rectified images using red and cyan filters and merging them into a single picture. The 3D effect becomes clear when the image is viewed through color-coded anaglyph glasses. An enlarged version of the anaglyph appears in the image below, along with an example of suitable glasses.
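One common way to build such an anaglyph is simply to take the red channel from the left rectified image and the green and blue channels (the cyan half) from the right one. A minimal Python/OpenCV sketch, again with placeholder file names and not DepthViewer's own code:

import cv2

left = cv2.imread("rectifiedLeft.png")     # OpenCV loads images in BGR channel order
right = cv2.imread("rectifiedRight.png")

anaglyph = right.copy()                    # keep blue and green (cyan) from the right eye
anaglyph[:, :, 2] = left[:, :, 2]          # replace the red channel with the left eye's
cv2.imwrite("anaglyph.png", anaglyph)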
The quality of the disparity map, point cloud, and anaglyph depends on the undistortion and rectification mapping carried out on the left and right input images. This mapping is generated during an earlier calibration phase, when a large series of paired images is processed by DepthViewer. These image pairs are collected using a separate application, called SnapPics, which deals with the two webcams separately from the complex tasks involved in depth processing.
The calibration technique supported by OpenCV requires the user to hold a chessboard picture. The next image shows one of the calibration image pairs.
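The OpenCV calls that turn those chessboard pairs into the undistortion and rectification mapping can be summarized as: find the chessboard corners in each pair, calibrate each camera, run stereoCalibrate() to relate the two cameras, then stereoRectify() and initUndistortRectifyMap() to produce the maps that remap() applies to live frames. The condensed Python sketch below is only meant to show the shape of that pipeline; the board dimensions, square size, and file naming scheme are assumptions, and error handling is omitted.

import cv2
import glob
import numpy as np

cols, rows, square = 9, 6, 25.0            # inner corners and square size (mm), assumed
objp = np.zeros((rows * cols, 3), np.float32)
objp[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * square

objpoints, imgptsL, imgptsR = [], [], []
pairs = zip(sorted(glob.glob("left*.png")), sorted(glob.glob("right*.png")))
for fnL, fnR in pairs:
    imgL = cv2.imread(fnL, cv2.IMREAD_GRAYSCALE)
    imgR = cv2.imread(fnR, cv2.IMREAD_GRAYSCALE)
    okL, cornersL = cv2.findChessboardCorners(imgL, (cols, rows))
    okR, cornersR = cv2.findChessboardCorners(imgR, (cols, rows))
    if okL and okR:                        # keep only pairs where both views see the board
        objpoints.append(objp)
        imgptsL.append(cornersL)
        imgptsR.append(cornersR)

size = imgL.shape[::-1]                    # (width, height)

# intrinsics for each camera, then the geometry relating the two cameras
_, M1, d1, _, _ = cv2.calibrateCamera(objpoints, imgptsL, size, None, None)
_, M2, d2, _, _ = cv2.calibrateCamera(objpoints, imgptsR, size, None, None)
_, M1, d1, M2, d2, R, T, E, F = cv2.stereoCalibrate(
    objpoints, imgptsL, imgptsR, M1, d1, M2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)

# rectification transforms and the lookup maps used by remap() on live frames
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(M1, d1, M2, d2, size, R, T)
mapLx, mapLy = cv2.initUndistortRectifyMap(M1, d1, R1, P1, size, cv2.CV_32FC1)
mapRx, mapRy = cv2.initUndistortRectifyMap(M2, d2, R2, P2, size, cv2.CV_32FC1)

np.save("Q.npy", Q)                        # reprojection matrix, reused by reprojectImageTo3D()

# at run time: rectifiedLeft = cv2.remap(frameLeft, mapLx, mapLy, cv2.INTER_LINEAR)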
In summary, depth processing consists of three stages:

1. Use SnapPics to collect a series of paired chessboard images from the two webcams.
2. Calibrate the cameras in DepthViewer using those pairs, which produces the undistortion and rectification mapping.
3. Apply that mapping to the cameras' images in DepthViewer to generate the disparity map, point cloud, and anaglyph.
I'll explain these stages in more detail during the course of this chapter. For more information on the underlying maths, I recommend chapters 11 and 12 of Learning OpenCV by Gary Bradski and Adrian Kaehler, O'Reilly 2008.