

Posted by Florian Schweiger

We've developed a system to capture, edit and render light fields that can be used as realistic virtual reality backgrounds and potentially in virtual TV production. Our light fields are easy to capture, require very little processing, and are compatible with tools and workflows established in the broadcasting industry.

Light field background with animated foreground character

What is a Light Field?

A light field is a collection of light rays so dense that arbitrary views can be recreated from it. Unlike a 360° panorama, which only captures light rays that arrive at a single point (the camera position), a light field contains many(!) more rays travelling through space all over the place. A 360° panorama only allows you to look around by turning your head (three degrees of freedom: pan, pitch and roll) but not to move anywhere other than the exact spot where the camera was - as if your head were stuck inside a huge balloon fixed to your shoulders.

A light field, however, gives you six degrees of freedom (turning your head as above, plus moving forwards/backwards, left/right and up/down). The rendered view adapts to your movements, allowing you to see view-dependent details that are vital for a realistic portrayal of an environment, such as parallax (closer objects moving more relative to head motion than those far away), reflections and occlusions. In virtual reality, these additional degrees of freedom make all the difference!

Typically, a light field is captured with a planar array of cameras, all pointing in the same direction (or, equivalently, with an array of lenslets in front of a single camera that split the image into multiple views).

This planar array then acts as a virtual window through which the scene can be viewed by recomposing and interpolating between light rays that were actually captured.

As long as the lenses are packed sufficiently close together, arbitrary rays falling through that window can be interpolated from the source data. Otherwise, aliasing becomes an issue.

Single-camera Light Fields

A light field camera on a rig, capturing the scene in EastEnders' Queen Vic.

360° camera moving on a horizontal circle, capturing about five frames per degree.

Aiming for a simpler system that fits with established broadcasting workflows, we took a slightly different approach and decided to capture our light fields with a single camera. To be fair, we use a 360° camera, so there are multiple lenses involved, but the recorded panoramic images are each from a single viewpoint - they just happen to have a larger field of view.

We use a motorised rig that slowly moves the camera around a horizontal circle. A complete revolution takes 30 seconds, during which we record exposure-locked video at 60 frames per second. This results in a dataset that contains 1800 frames, or five frames per degree. A typical diameter for the capture circle is one metre, which means that we effectively capture views less than 2mm apart. We can then synthesise views for virtual camera positions anywhere inside the capture circle from this dense set of source views.
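
As a quick sanity check on those numbers, here is a rough back-of-the-envelope calculation (in Python, purely illustrative) using the figures quoted above:

```python
import math

# Capture parameters quoted above
revolution_time_s = 30.0     # one full revolution of the rig
frame_rate_fps = 60.0        # exposure-locked video
circle_diameter_m = 1.0      # typical capture circle diameter

frames_per_revolution = revolution_time_s * frame_rate_fps        # 1800 frames
frames_per_degree = frames_per_revolution / 360.0                 # 5 frames per degree

circumference_m = math.pi * circle_diameter_m                     # ~3.14 m
view_spacing_mm = circumference_m / frames_per_revolution * 1000  # ~1.75 mm

print(f"{frames_per_revolution:.0f} frames, {frames_per_degree:.1f} per degree, "
      f"adjacent views ~{view_spacing_mm:.2f} mm apart")
```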

An illustration of synthesising virtual views within the capture circle by sampling source views.

Synthesising virtual views within the capture circle by sampling source views (not to scale)

The simplicity of our setup comes, of course, with a few trade-offs. The most obvious limitation is that, because the camera moves, the scene itself must be perfectly still. Otherwise, ghosting artefacts would appear in views synthesised from inconsistent data. While this excludes certain use cases, it is rarely an issue for our purpose of creating backdrops for virtual reality that can then be populated with animated characters. And actually, there is a way to deal with limited background motion that we'll describe below.

Another difference from a full light field with six degrees of freedom (6 DoF) is that our array of source views is only one-dimensional. All captured viewpoints lie on a line (albeit a bent one: the capture circle), but there aren't any viewpoints in our dataset that would cover the vertical direction. Consequently, a viewer only has five truly free degrees of freedom - three to turn their head and two to move in the plane of the capture circle. When moving up or down, however, they will not perceive vertical parallax as they would in real life. This is generally acceptable because horizontal parallax tends to be more important for human stereo vision due to the horizontal offset of our eyes. Besides, if you think about it, horizontal is often the dominant direction of motion for humans in most everyday situations (particularly during seated VR experiences) and for most tracking shots in TV production.

The lack of vertical parallax also causes objects in a rendered view to appear at a slightly incorrect height. This becomes most noticeable when the viewer moves forward, rightly expecting closer objects to grow in apparent height faster than distant ones. Without vertical parallax, however, the height of all objects changes at the same rate, irrespective of their depth in the scene, which leads to straight lines visibly bending. More on these depth-dependent vertical distortions and how to mitigate them in a moment.

Processing and Rendering

From the panoramic video we captured, it only takes a few simple processing steps to get to a light field dataset that is ready for rendering.

The first step is to trim the video to contain exactly one complete revolution. This is achieved by minimising the pixel difference between the start and end frames. Next, we resample the panoramic video to a more efficient format. We use a cubemap representation that only contains the cube faces needed for rendering: Upward, (radially) outward, and downward.
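
As a rough sketch of the trimming step, here is how the loop point could be found by minimising the pixel difference against the first frame (Python with NumPy; `load_frames` and the nominal revolution length of 1800 frames are assumptions for illustration, not our actual pipeline code):

```python
import numpy as np

def find_loop_end(frames, expected_length=1800, search_window=30):
    """Return the index of the frame that best matches frame 0, so that
    frames[:index] spans exactly one revolution.
    `frames` is an (N, H, W, C) uint8 array of panoramic video frames."""
    start = frames[0].astype(np.float32)
    best_idx, best_err = expected_length, np.inf
    # Only search around the nominal revolution length
    for idx in range(expected_length - search_window,
                     expected_length + search_window + 1):
        err = np.mean((frames[idx].astype(np.float32) - start) ** 2)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx

# frames = load_frames("capture_360.mp4")   # hypothetical loader
# one_revolution = frames[:find_loop_end(frames)]
```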

A graphic showing the three cube faces of the cubemap on the spherical panorama around the viewer; the upward, outward and downward faces in a vertical cubemap layout; an image illustrating this taken in a church - a vertical image where the floor appears at the bottom, the view straight ahead is in the middle of the image, and the ceiling is at the top of the image.

Left to right: Format conversion from spherical (equirectangular) to cubemap, cubemap layout, an example frame

We have implemented our renderer in the Unity game engine as a set of shader materials that can be applied to surfaces that then display the light field of a captured scene. In the simplest case, that surface is the inside of a large sphere onto which the light field is projected (effectively assuming constant scene depth). While this leads to generally coherent images that include view-dependent effects, the lack of vertical parallax mentioned earlier causes distortions that vary with the depth of objects in the scene. In the example below, this is most visible in objects close to the camera that undergo a sort of barrel distortion (e.g. the pavement and the cornice above the pub windows). This becomes even more noticeable when the virtual camera moves towards nearby objects or when foreground objects are placed in the world that don't exhibit these distortions.

To mitigate this unwanted effect, we manually add a crude 3D model in the Unity Editor and apply the light field material to it. The purpose of this proxy geometry is not to capture details of the environment but merely to distinguish closer structures (such as buildings) from the background. It also allows foreground objects (e.g. animated characters) to interact with the background light field, for example, to pass behind objects. In practice, a ground plane and potentially several primitives (e.g. cuboids or planes) representing dominant structures are sufficient.

A street scene outside the Queen Vic pub, showing the effect without proxy geometry (described above).

Without proxy geometry

A street scene outside the Queen Vic pub, showing the effect with proxy geometry (described above).

With proxy geometry

A computer generated graphic showing the locations of structures in the Queen Vic street scene above.

Proxy geometry for the scene above
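
To illustrate how even a single proxy plane provides the per-pixel depth used later for vertical sampling, here is a minimal ray-plane intersection sketch (Python; not our Unity shader code, which does the equivalent on the GPU):

```python
import numpy as np

def ray_plane_depth(ray_origin, ray_dir, plane_point, plane_normal):
    """Distance along the ray to a proxy plane, or None if the ray is
    (nearly) parallel to the plane or the hit lies behind the origin."""
    denom = np.dot(ray_dir, plane_normal)
    if abs(denom) < 1e-6:
        return None
    t = np.dot(plane_point - ray_origin, plane_normal) / denom
    return t if t > 0 else None

# A viewer 1.6 m above the ground plane, looking slightly downwards
eye = np.array([0.0, 1.6, 0.0])
direction = np.array([0.0, -0.2, 1.0])
direction /= np.linalg.norm(direction)
depth = ray_plane_depth(eye, direction,
                        plane_point=np.array([0.0, 0.0, 0.0]),
                        plane_normal=np.array([0.0, 1.0, 0.0]))
print(f"Ground plane hit ~{depth:.1f} m along the ray")   # ~8.2 m
```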

The rendering process is as simple as looking up the correct rays from our source views. For every pixel that needs to be rendered in the virtual camera (shown in blue below), we pick the source view (in orange) that lies in the direction of the corresponding ray. The scene depth determines the vertical pixel position to sample in the chosen source view.

Notice how acute the angle is between the virtual and source rays (even in the illustration, which exaggerates the distance between virtual and source cameras)? This means that the sample position doesn't change much, even if the depth of the proxy geometry is fairly inaccurate (within several centimetres in practice).

Illustrations showing the source view selection and vertical sampling based on proxy geometry.

Left to right: Source view selection (top view, not to scale), vertical sampling based on proxy geometry (side view, not to scale)
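
To make the lookup described above more concrete, here is a heavily simplified sketch of the two steps - choosing a source view by intersecting the horizontal ray with the capture circle, then computing the vertical angle from the proxy depth (Python with illustrative numbers; the real implementation lives in our Unity shaders):

```python
import numpy as np

R = 0.5            # capture circle radius in metres (1 m diameter)
N_VIEWS = 1800     # source views per revolution (5 per degree)

def source_view_index(cam_xz, ray_xz):
    """Intersect the horizontal ray with the capture circle and return
    the index of the nearest source view in that direction."""
    a = np.dot(ray_xz, ray_xz)
    b = 2.0 * np.dot(cam_xz, ray_xz)
    c = np.dot(cam_xz, cam_xz) - R * R
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)   # positive root
    hit = cam_xz + t * ray_xz
    angle = np.arctan2(hit[1], hit[0]) % (2 * np.pi)
    return int(round(angle / (2 * np.pi) * N_VIEWS)) % N_VIEWS

def vertical_angle(source_pos, scene_point):
    """Elevation of the scene point (placed using the proxy depth) as seen
    from the chosen source view; this drives the vertical sample position."""
    d = scene_point - source_pos
    return np.arctan2(d[1], np.linalg.norm(d[[0, 2]]))

# Virtual camera 0.2 m off-centre, looking slightly downwards along +z
cam = np.array([0.2, 1.6, 0.0])
ray = np.array([0.0, -0.1, 1.0])
ray /= np.linalg.norm(ray)
view = source_view_index(cam[[0, 2]], ray[[0, 2]])
scene_point = cam + 5.0 * ray                 # 5 m depth from proxy geometry
src_angle = 2 * np.pi * view / N_VIEWS
src = np.array([R * np.cos(src_angle), cam[1], R * np.sin(src_angle)])
print(view, np.degrees(vertical_angle(src, scene_point)))
```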

Additional Features

Light Field Editing

Since the source data in our system is conventional video, any image or video processing technique can be applied to alter the appearance of a light field. This ranges from relatively simple modifications, for example, colour grading or gamma correction, to more complex operations, such as rotoscoping to hide imperfections or even adding scene elements. The key advantage of our system is that the processing can happen offline and can therefore be arbitrarily complex. The example below shows how style transfer applied to the source video completely changes the appearance of the rendered scene while preserving all view-dependent effects, such as reflections on the TV screens, occlusions and parallax. Another advantage of using a plain video representation is that we can exploit the high compression ratios of modern video encoders to greatly reduce the amount of data that has to be delivered to the end user.

Illustration showing how altering an original video results in a new dataset for the rendering process.

Altering the original video (1) results in a new dataset (2) for the rendering process (3)
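
As a minimal example of this kind of offline edit, the snippet below applies a simple gamma adjustment to every frame of the source video with OpenCV (the file names are placeholders; the same pattern applies to heavier operations such as style transfer):

```python
import cv2
import numpy as np

GAMMA = 0.8  # example grade; any per-frame edit fits in this loop
lut = (np.clip((np.arange(256) / 255.0) ** GAMMA, 0, 1) * 255).astype(np.uint8)

reader = cv2.VideoCapture("lightfield_source.mp4")       # placeholder path
fps = reader.get(cv2.CAP_PROP_FPS)
size = (int(reader.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT)))
writer = cv2.VideoWriter("lightfield_graded.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

while True:
    ok, frame = reader.read()
    if not ok:
        break
    writer.write(cv2.LUT(frame, lut))   # identical edit on every source view

reader.release()
writer.release()
```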

Light field background - style transfer switching

A light field rendered from stylised source video

Looped Light Fields

We mentioned that our system generally requires static scenes to prevent ghosting artefacts. Yet, it can still support certain types of background motion as long as the following criteria are met:

  1. The background motion can be repeated periodically.
  2. The dynamic object remains in view (of the cubemapped video) over a full capture rotation.
  3. The dynamic object can be approximated with planes.

Examples of motion that can be repeated periodically without raising suspicion include traffic or pedestrians walking in the distance, trees swaying in the wind, or flowing water, to name just a few. The looping can either run in one direction (e.g. cars driving past, over and over) or in a back-and-forth, yoyo-like pattern to avoid discontinuities (useful, for example, for treetops in the wind). The most suitable candidates are true background activity - movement that is not distinctive enough to stand out.

When the dynamic object is flat (such as a TV screen or the surface of a river), or when it is far enough in the background to be approximated sufficiently accurately by a plane, it can be warped from each source view that saw it during the capture process into the virtual camera's view.

In our Unity implementation, dynamic objects are assigned a shader material that does exactly that: it samples the source views based on when they captured the dynamic object, rather than the spatially closest ones used for static scene parts (as described in Processing and Rendering above).
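
The frame selection logic boils down to indexing: static geometry picks a source view by direction, while dynamic geometry cycles through the frames that had the object in shot. Here is a small sketch of that indexing (Python, with a made-up frame range; the one-directional and yoyo variants correspond to the looping options mentioned above):

```python
import math

def static_sample_index(view_angle_rad, n_views=1800):
    """Static geometry: pick the source view closest to the required
    ray direction (the spatial lookup used by the main renderer)."""
    return int(round(view_angle_rad / (2 * math.pi) * n_views)) % n_views

def dynamic_sample_index(render_frame, loop_start, loop_end, yoyo=False):
    """Dynamic geometry: cycle through the source frames that saw the
    object, either forwards only or back and forth (yoyo)."""
    length = loop_end - loop_start
    if not yoyo:
        return loop_start + render_frame % length
    phase = render_frame % (2 * length)
    return loop_start + (phase if phase < length else 2 * length - phase - 1)

# Example: a dynamic object visible in source frames 400-699 of the capture
for f in (0, 1, 299, 300, 598, 599, 600):
    print(f, dynamic_sample_index(f, 400, 700, yoyo=True))
```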

An animation of static scene parts being sampled from the spatially closest source views, while dynamic scene elements loop over the source views in which they were captured.

Static scene parts are sampled from the spatially closest source views (in blue), while dynamic scene elements loop over the source views that had them in shot (in red).

The proxy geometry for the scene in the video - showing the structures in the scene - walls, a desk, a TV on the wall.

Proxy geometry for static (cyan) and dynamic (green) scene parts 

Light field backgrounds - proxy geometry

What's Next?

This light field system is currently a prototype. It runs on standard gaming PC hardware and produces output resolutions suitable for virtual reality applications.

The requirements for virtual TV production are very different. While the resolution must be significantly higher to meet broadcast standards, and the range of camera motion needs to be larger than for a seated VR experience, the range of viewing directions is generally more limited in a TV studio setup. Our system could accommodate this by increasing the radius of the capture circle and reducing the capture speed (to maintain dense source view spacing), while limiting the capture range to only the portion of the capture circle that will eventually face into the studio, towards the set. A more flexible yet more complex approach would be to work with full 6-DoF light fields. With recent advances in neural rendering, the spacing between source cameras could be significantly increased to meet the motion range and viewing direction requirements without sacrificing visual quality.
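
As a rough illustration of the trade-off between capture radius and capture speed (taking the ~1.75 mm view spacing of the current rig as the target; the numbers are only indicative):

```python
import math

FPS = 60.0                   # exposure-locked frame rate of the capture
TARGET_SPACING_M = 0.00175   # roughly the current spacing between adjacent views

def revolution_time_s(radius_m, fps=FPS, spacing_m=TARGET_SPACING_M):
    """Rotation period needed to keep the same spacing between adjacent
    source views when the capture circle gets larger."""
    return 2 * math.pi * radius_m / (fps * spacing_m)

for radius in (0.5, 1.0, 2.0):
    print(f"radius {radius:.1f} m -> {revolution_time_s(radius):.0f} s per revolution")
```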

Acknowledgements

Some of this work was part of the . More on that project in our .
