We have all heard a lot about touch interaction recently. In this talk I’m going to describe a new way to interact with thin-screen devices. We have developed a device we are calling the BiDi Screen, short for bi-directional, which supports seamless transition from on-screen multi-touch to hover-based gestural interaction, among other features, in an LCD-thin package.
Here's a quick teaser to illustrate the capabilities I'm describing. <wait for multi- to hover part to pass> Here you see a user pulling her hands away from the screen to rotate and zoom a 3-D model. We also show the use of 3D gestures to navigate a 3D world. We support these modes by creating an array of virtual cameras on an LCD using a technique known as Spatial Heterodyning. Because we're using an optical technique, we also enable dynamic relighting applications, where real-world lighting is transferred to a rendered scene.
We are inspired by the next generation of multi-touch devices that rely on optical sensors embedded in an LCD matrix typically used for display. We also take inspiration from developments in commercializing depth-sensitive cameras, and from the explosion of multi-touch interaction devices in consumer electronics and media, with their ability to provide a smooth and intuitive user experience. What if we could combine all of these features into a single device?
This device would of course support multi-touch on-screen interaction, but because it can measure the distance to objects in the scene, a user's hands can be tracked in a volume in front of the screen, without gloves or other fiducials.
Since we are adapting LCD technology we can fit a BiDi screen into laptops and mobile devices.
So here is a preview of our quantitative results. I'll explain this in more detail later on, but you can see we're able to accurately distinguish the depth of a set of resolution targets. We show above a portion of the views from our virtual cameras, a synthetically refocused image, and the depth map derived from it.
You may be wondering at this point how you can build a thin device that enables touch and 3D gesture interaction with bare hands, and still displays images without interference.
Recall that one of our inspirations was this new class of optical multi-touch device. At the top you can see a prototype that Sharp Microelectronics has published. These devices are basically arrays of naked phototransistors. Like a document scanner, they are able to capture a sharp image of objects in contact with the surface of the screen. But as objects move away from the screen, without any focusing optics, the images captured by this device are blurred.
Our observation is that by moving the sensor plane a small distance from the LCD in an optical multi-touch device, we enable mask-based light-field capture. We use the LCD screen to display the desired masks, multiplexing between images displayed for the user and masks displayed to create a virtual camera array. I'll explain more about the virtual camera array in a moment, but suffice it to say that once we have measurements from the array we can extract depth.
Thus the ideal BiDi screen consists of a normal LCD panel separated by a small distance from a bare sensor array. This format creates a single device that spatially collocates a display and capture surface.
In order to see what's going on here, it's useful to consider the case where we display a pattern of pinhole masks on the LCD. This essentially creates an array of virtual cameras, each of which has a slightly different view of the objects in front of the screen. Pinholes, however, allow very little light through to the sensor layer. Some of us (Lanman and Raskar) have shown in previous work that with a technique called Spatial Heterodyning, a tiled-broadband mask, such as the MURA code shown here, has equivalent resolution properties to a pinhole array but allows 50 times more light through to the sensor.
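For the curious, the standard MURA construction from coded-aperture imaging can be sketched in a few lines. This is a generic Gottesman–Fenimore-style MURA generator, a sketch only — not necessarily the exact tiling used in our prototype:

```python
import numpy as np

def mura_mask(p):
    """Build a p-by-p MURA aperture pattern for prime p.
    Entries are 1 (open cell) or 0 (opaque cell)."""
    # C[i] = +1 if i is a quadratic residue mod p, else -1
    residues = {(k * k) % p for k in range(1, p)}
    C = np.array([1 if i in residues else -1 for i in range(p)])
    A = np.zeros((p, p), dtype=int)
    for i in range(p):
        for j in range(p):
            if i == 0:
                A[i, j] = 0          # first row opaque
            elif j == 0:
                A[i, j] = 1          # first column (below row 0) open
            elif C[i] * C[j] == 1:
                A[i, j] = 1          # open where residue signs agree
    return A

mask = mura_mask(5)
print(mask)
# Roughly half the cells are open, which is why a tiled-broadband mask
# passes far more light than a single pinhole per tile.
```

Tiling this pattern across the LCD gives the broadband mask; the near-flat magnitude spectrum of the MURA is what makes the later spectral decoding well-conditioned.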
With this virtual camera array we're able to capture depth information, which can be used for a wide range of purposes. We show this simple Luke Skywalker-like interaction to demonstrate the abilities of the BiDi screen, but much more is possible.
This sounds somewhat complicated, I know. In our thinking about this topic we considered a number of alternative solutions. Touch and gesture interaction is a popular field, with many techniques and a variety of design goals. I'll show you why I think the BiDi screen has much to offer over these alternatives. (From top left, clockwise: Jeff Han's multitouch screen used by King on CNN, the Apple iPhone, gSpeak by Oblong Industries, the Stanford Multi-Camera Array, the Canesta depth camera, the ThinSight by Microsoft Research, and the Microsoft Surface.)
To start, let's consider the simplest case: traditional touch technology. Resistive and capacitive touch screens have been the staple of touch interaction for years, and the techniques are well understood and cheap. As they stand commercially, neither can sense off-screen gesture. It may be possible to use arrays of capacitive-sensing RF antennas to detect objects with sufficient resolution for gesture interaction, but in this mode calibration becomes a significant issue. It is important to note that, in contrast to optical multi-touch, neither resistive nor capacitive touch techniques are currently on a technology trend that will lead to gesture interaction.
More sophisticated methods can be imagined to support our goals. One such approach would be to put a real camera array behind the screen. This would duplicate the measurements the BiDi screen makes, but would create dark spots by blocking the backlight. The Microsoft SecondLight uses a pair of cameras far behind the screen with a switchable diffuser, which adds significant depth to the device. Other near-field optical techniques exist as well, such as the Microsoft Research ThinSight and the work shown by Jeff Han, but they do not address off-screen interaction. Another option would be to place the cameras to the side of the screen. In this scenario there will be a transition region between touch and gesture that is uncovered by the cameras. A multi-modal approach could help to fill this gap, but would still leave some region of gestural interaction uncovered. An extreme fisheye lens might cover the entire screen, but would add distortion that makes object correspondence for depth from stereo difficult. Additional cameras might be turned in to face the screen, but would add to the thickness of the device. Importantly, none of these solutions is poised to ride an existing technology trend, whereas manufacturers are already building LCDs with embedded light sensors.
Employing sensors specifically designed to measure depth is another option. Time-of-flight depth cameras can be placed behind the screen and coupled with projectors or other types of display, such as the Microsoft TouchLight. This approach generally yields a deep device not suitable for applications that require portability. These approaches cannot replicate the relighting demo that I showed earlier.
To achieve collocated display and capture with a virtual array of cameras in a thin form factor, the BiDi screen uses an LCD both as a spatial light modulator for mask-based light-field capture and to display images to the user. A depth map of the scene in front of the screen is extracted from the light field using a depth-from-focus technique, enabling gesture recognition. In order to more fully explain this process I'll briefly present some background information. Some of this may seem a little unrelated, but bear with me and I'll tie it together at the end.
It is sufficient to consider a light field as the set of light rays traveling from objects in a scene to our sensor. We'll consider the flatland case to make visualizing it simple. A ray striking the sensor on the x-axis at an angle theta is represented by a point in the x-theta parameterization of the light field on the right. A standard sensor integrates light rays striking it from all angles, measuring a line in x-theta space. A sensor array consisting of many sensors will measure a set of lines in x-theta space.
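To make the parameterization concrete, here is a toy discretized version (illustrative random data, not a real capture): each bare sensor pixel sums one column of the x-theta grid, i.e. a line of constant x.

```python
import numpy as np

# Toy flatland light field: L[t, x] holds the radiance of the ray arriving
# at sensor position x from angle bin t (illustrative random data).
n_theta, n_x = 8, 16
rng = np.random.default_rng(0)
L = rng.random((n_theta, n_x))

# A bare, lens-less sensor pixel at x integrates rays over all angles --
# one line of constant x in x-theta space:
sensor = L.sum(axis=0)

print(sensor.shape)  # (16,): one angle-integrated measurement per x
```

The key limitation is visible here: the angular structure of L is collapsed away, which is exactly what the heterodyne mask will let us recover.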
Here we consider what happens when we take the Fourier transform in x-theta space. The set of rays emitted from a real-world scene will produce some function in x-theta space, and this light field will have some spectrum when transformed. The type of measurement shown here is actually a projection in x-theta space onto the x-axis. By the Fourier slice theorem, this projection has a Fourier transform pair in a slice along the f-x axis in the frequency domain. The important point here is that because of the type of measurement made by a standard light sensor, we are only able to measure a slice in the frequency domain along the f-x axis. This limits us to measuring only this region of the spectrum of the light field.
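This slice relationship is easy to check numerically. A small sketch with synthetic data (not from our device): the 1-D FFT of the x-axis projection equals the f-theta = 0 row of the 2-D FFT.

```python
import numpy as np

# Fourier slice theorem in flatland, checked numerically.
rng = np.random.default_rng(1)
L = rng.random((32, 64))          # axis 0: theta, axis 1: x

projection = L.sum(axis=0)        # what a bare sensor array measures
slice_1d = np.fft.fft(projection)
slice_2d = np.fft.fft2(L)[0, :]   # the f_theta = 0 slice of the 2-D spectrum

print(np.allclose(slice_1d, slice_2d))  # True
```

So a plain sensor only ever samples that one slice of the light-field spectrum; everything off the f-x axis is invisible to it.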
Another interesting property of the light field is the skew transform that is applied to a set of rays that propagate through free space. Here we see a single ray passing through a mask and hitting a sensor at x. The ray is plotted in two light-field spaces, one for the mask and one for the sensor. As we add rays we can see why the skew occurs.
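In the paraxial approximation this skew is just a shear matrix acting on (x, theta); a minimal sketch with assumed toy numbers:

```python
import numpy as np

# Free-space propagation over distance d shears light-field coordinates:
# x' = x + d * theta (paraxial small-angle approximation), theta' = theta.
d = 2.0
shear = np.array([[1.0, d],
                  [0.0, 1.0]])

ray = np.array([0.5, 0.1])        # position x, angle theta (radians)
x_new, theta_new = shear @ ray
print(x_new, theta_new)           # 0.7 0.1: position shifts, angle unchanged
```

This is why the mask-plane and sensor-plane light fields differ by a skew, which we will need to account for after the spectral copies are made.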
Another important property used in Spatial Heterodyning is the fact that convolving an arbitrary function with a delta function creates a copy of that function at the location of the delta function.
Recall that the light field in the x-theta plane has some spectrum when transformed. A critical property of the mask we choose for light-field decoding is that its transform contains a series of delta functions.
The name Spatial Heterodyning is inspired by AM radio broadcasts, in which a voice signal is multiplied (or modulated) with a high-frequency carrier to shift it in the frequency domain. We accomplish a similar multiplication each time a ray passes through our mask. Recall from signal processing that multiplication in the spatial domain is convolution in the frequency domain.
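Here is a one-dimensional sketch of that modulation, with toy signal and carrier frequencies (not our actual mask spectrum): multiplying by the carrier places copies of the baseband spectrum at the carrier frequency.

```python
import numpy as np

# Heterodyning in 1-D: baseband content at 3 cycles, carrier at 40 cycles.
n = 256
x = np.arange(n)
baseband = np.cos(2 * np.pi * 3 * x / n)
carrier = np.cos(2 * np.pi * 40 * x / n)

# Multiplication in space == convolution in frequency, so the product's
# spectrum has peaks at 40 +/- 3 cycles (plus their mirrored bins).
spectrum = np.abs(np.fft.fft(baseband * carrier))
peaks = np.sort(np.argsort(spectrum)[-4:])  # the four strongest bins
print(peaks)
```

The mask plays the role of the carrier: each delta in its spectrum makes one shifted copy of the scene's light-field spectrum.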
So to tie everything together: when the rays of the scene pass through our mask, their frequency spectrum is convolved with the series of deltas in the mask's spectrum, creating copies of the scene spectrum. When we take into account the skew incurred by propagation through free space between the mask and sensor, we can see something interesting.
Now, different regions of each spectral copy lie on the measurable f-x axis. By measuring these regions and rearranging the measured sections, we can reconstruct the original light-field spectrum, from which we may obtain the light field itself. Note that the light field must be band-limited in order for this technique to work, as the spectral copies cannot overlap. In practice this means we must limit the angle of incident light into the sensor.
So armed with this technique, we went about building a prototype. One immediate stumbling block is the relative unavailability of large-area, high-resolution light sensor arrays. Though these will be a commodity in the near future, they are rare today. To compensate, we replace the sensor array in our device with a diffuser/camera pair. The camera images light cast on the diffuser, simulating a sensor array. This is not ideal, as it creates a thick device, but it allows us to validate our design. Here you can see the cameras, LCD, and diffuser frame for our prototype device.
The camera/diffuser pair has one convenient advantage: it provides an angle-limiting property almost automatically. If a ray enters the diffuser at a shallow angle it will be diffused and reach the camera; a ray entering at a steep angle will never reach the camera. With a sensor array, various angle-limiting materials can be used to fulfill this function.
The diffuser also provides a nice surface to light with an array of backlight LEDs. Here you see the cameras we used (two Point Grey Flea2s); two were needed to cover the area of the screen with sufficient resolution. And here you see the frame we built to hold an LCD screen and diffuser.
I'll step through the elements of our processing pipeline in order to demonstrate how each works over the course of a frame capture. With the backlights off, the mask pattern is displayed on the screen. Raw data is captured from the sensor; note the high-frequency modulation in the hand photo on the left. Spatial heterodyne decoding is applied, giving us an array of virtual cameras, each with a slightly different view of the scene. Synthetic aperture refocusing is used to obtain a stack of images focused at different depths; we run through them in a video here. A maximum-contrast operator is applied to each pixel in this focal stack to find the image with the sharpest focus. A depth map is obtained from the maximum-contrast operator, and hands can be tracked in the depth map.
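The depth-from-focus step at the end of the pipeline can be sketched as follows. This is a minimal stand-in: a squared discrete Laplacian serves as the contrast operator, and the focal stack is synthetic; the prototype's exact operator may differ.

```python
import numpy as np

def depth_from_focus(stack):
    """Pick, per pixel, the focal-stack slice with the highest local contrast.
    stack: (n_slices, H, W) array. Returns an (H, W) map of slice indices."""
    contrast = []
    for img in stack:
        # Squared discrete Laplacian as a simple sharpness measure.
        lap = (-4 * img
               + np.roll(img, 1, 0) + np.roll(img, -1, 0)
               + np.roll(img, 1, 1) + np.roll(img, -1, 1))
        contrast.append(lap ** 2)
    return np.argmax(np.array(contrast), axis=0)

# Synthetic focal stack: only slice 2 contains a sharp step edge.
H, W = 8, 8
stack = np.zeros((4, H, W))
stack[2, :, W // 2:] = 1.0
depth = depth_from_focus(stack)
print(depth[4, W // 2])  # 2: the edge is sharpest in slice 2
```

In the real pipeline the stack comes from synthetic aperture refocusing of the decoded light field, and the resulting index map is the depth map used for hand tracking.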
Finally, we render the scene based on the input we just received, and turn on the backlight.
We show some further analysis to understand what is possible with the BiDi screen. In the case of pinholes, we can see that the resolution of our array of virtual cameras decreases with distance (the size of a pixel projected into space grows).
The measurable resolution for an object at a given distance from the screen is shown in the top plot. This accounts for diffraction and geometric blur as well. The gold bars represent the measured performance of our prototype. The measured resolution also depends on the separation between the sensor and the mask. For a given sensor and display pixel size (that of our prototype), we see that the optimal separation of mask and sensor is about 2.2 cm.
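A back-of-the-envelope version of the pinhole geometry behind this falloff (illustrative numbers only, not the prototype's measured values, and ignoring diffraction and geometric blur):

```python
# With a pinhole mask a distance d above the sensor, a sensor pixel of
# pitch p subtends an object-side footprint of roughly p * z / d at depth z,
# so the resolvable feature size grows linearly with distance.
def projected_pixel_size(pixel_pitch_mm, separation_mm, depth_mm):
    return pixel_pitch_mm * depth_mm / separation_mm

# Assumed illustrative parameters (hypothetical pitch and separation):
for z in (100, 300, 500):  # depths in mm in front of the screen
    print(z, projected_pixel_size(0.25, 25.0, z), "mm")
```

The full analysis in the plot also folds in diffraction, which is why there is an optimal mask-sensor separation rather than "smaller is always sharper".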
Returning to this analysis, we see in practice how the resolution of distant objects drops off (red arrow on targets). The plot in the bottom left shows the output of the maximum contrast operator (depth from focus) at the corresponding locations in the depth map on the right.
Here I show the user manipulating more objects. Notice how the manipulation is started with a touch, and that her hand is tracked in free space in front of the screen to move and zoom the models. Remember that these simple interactions are meant to demonstrate just the tip of what is possible with this technology.
Here the user uses 3D gestures to navigate a 3D world. Hand movement into and out of the screen moves her view forward and back. The speed at which the view turns up, down, left, and right is determined by the hand position over the screen. You may notice the logo shown here is for the wrong conference. We're pleased to announce that a technical paper based on this work has been accepted to SIGGRAPH Asia 2009.
The approach we lay out here has some clear limitations. The device requires a separation between LCD and sensor, which necessarily increases the thickness of the screen somewhat over a normal LCD. While the device's sensitivity to room lighting is a plus for enabling lighting-aware interaction, it becomes difficult or impossible to use the screen in low lighting conditions unless there is an internal lighting source; this is a problem common to many optical capture techniques. Time multiplexing of our display and capture frames causes an undesirable flicker. High-refresh-rate LCDs coming onto the market will make it simple to overcome this problem.
In the future, we will consider using masks that respond to the scene content to optimize the imaging properties of the display. With higher resolution devices we can facilitate novel video chat experiences with mixed reality rendering and live background subtraction. Look out for the technical paper we wrote on this topic, accepted to SIGGRAPH Asia 2009. Hopefully we’ll see you there.
BiDi Screen A Thin, Depth-Sensing LCD for 3D Interaction using Light Fields Matthew Hirsch Douglas Lanman Henry Holtzman Ramesh Raskar
BiDi Screen Depth and Lighting Aware Interactive Display Matthew Hirsch MIT Media Lab Douglas Lanman Brown University Ramesh Raskar MIT Media Lab Henry Holtzman MIT Media Lab
Alternatives to capture depth from an LCD <ul><li>Adapted Touch </li></ul><ul><ul><li>Capacitive </li></ul></ul><ul><ul><li>Resistive </li></ul></ul><ul><ul><li>Optical </li></ul></ul><ul><li>Camera arrays </li></ul><ul><ul><li>Behind screen </li></ul></ul><ul><ul><li>To sides </li></ul></ul><ul><li>Depth Sensors/Cameras </li></ul>
Adapting Traditional Touch? <ul><li>Resistive touch screen </li></ul><ul><ul><li>Confined to screen </li></ul></ul><ul><li>Capacitive </li></ul><ul><ul><li>Not lighting dependent </li></ul></ul><ul><ul><li>Calibration problems for gesture </li></ul></ul><ul><li>No existing technology trend for gesture </li></ul>?
Camera Arrays <ul><li>Cameras behind screen </li></ul><ul><ul><li>Interfere with backlight </li></ul></ul><ul><ul><li>Expensive, Large </li></ul></ul><ul><ul><li>Han, SecondLight, ThinSight </li></ul></ul><ul><ul><li>No tech curve </li></ul></ul><ul><li>Cameras to side </li></ul><ul><ul><li>Transition region </li></ul></ul><ul><ul><li>Second modality </li></ul></ul>?
Theory: Depth from light-field capture <ul><li>LCD used for Spatial Heterodyne Light Field Capture and Display </li></ul><ul><ul><li>Pinholes or tiled broadband masks </li></ul></ul><ul><ul><li>Separate Sensor and Mask </li></ul></ul><ul><li>Fourier Refocusing </li></ul><ul><li>Depth from focus </li></ul><ul><li>Blob tracking (gesture) </li></ul>
Theory: Spatial Heterodyning with MURA (slide diagram: a ray through the mask at u reaching the sensor at s maps to u + (d_m / d_ref)·s; received light-field spectrum L_received(f_u, f_s))
Theory: Lightfield (slide diagram: rays parameterized by position x and angle θ; the sensor integrates these rays)
Theory: Lightfield Frequency Domain (slide diagram: the light field in (x, θ) and its Fourier transform in (f_x, f_θ); Fourier Slice Theorem)
Theory: LF Skew in Free-Space Propagation (slide diagram: light-field coordinates (x, θ) at the mask plane and at the sensor plane)
Limitations <ul><li>Requires separation between display and sensor (adds thickness) </li></ul><ul><ul><li>2.5 cm for 50 cm range in our prototype </li></ul></ul><ul><ul><li>750mm for 8 cm range in iPhone-like device </li></ul></ul><ul><li>Sensitive to room lighting or requires its own light source </li></ul><ul><ul><li>This is true of many optical systems </li></ul></ul><ul><li>Time multiplexing of display/capture </li></ul><ul><ul><li>Requires fast capture and screen refresh to stay below flicker fusion rate </li></ul></ul><ul><ul><li>240 Hz LCDs coming to market </li></ul></ul>
Conclusions <ul><li>Future Work </li></ul><ul><ul><li>Dynamic Masks </li></ul></ul><ul><ul><ul><li>Change frequency characteristics to match scene </li></ul></ul></ul><ul><ul><li>Video capture / Video chat (higher resolution) </li></ul></ul><ul><li>SIGGRAPH Asia 2009 Paper </li></ul><ul><ul><li>BiDi Screen: Depth and Lighting Aware Interaction and Display </li></ul></ul>
Conclusions <ul><li>Enable multitouch and gesture interaction on a thin display screen </li></ul><ul><ul><li>Walk-up interaction does not require gloves / fiducials </li></ul></ul><ul><li>Capture depth using array of virtual cameras </li></ul><ul><li>Thin portable devices possible with area sensor </li></ul>