BBC

Research & Development

Posted by Alia Sheikh on

The BBC Research & Development visual computing team have been collaborating with BBC 6 Music to use cutting-edge technology to support their recent annual T-shirt Day event, where listeners wearing their favourite band T-shirts can request songs.

Our brief was to create short, eye-catching videos for social media to communicate the concept of a T-shirt being used to request a song. To show how music feels, we really wanted to create something dynamic and maybe even a little surreal. We decided to create smooth moving shots to suggest the journey a song takes to reach our listeners.

It is unusual to create complicated moving shots for radio as the studios tend to be relatively small spaces, so we decided this might be a perfect use-case for Neural Radiance Fields (NeRFs).

NeRFs are neural networks that can represent real-life scenes and let us move very freely around those scenes long after they are captured. They are trained using a set of 2D images of the environment (which can be created from simple handheld phone video) and, once processed, allow us to make whatever new visual paths (or moving shots) we want through that space. NeRFs are not new - we’ve been investigating them since 2020 - but recent advances in novel view synthesis algorithms mean that both the quality of the output and the speed at which we can create NeRFs have drastically improved. We expect the technology to continue to advance rapidly - for example, right now NeRFs cannot make video of moving subjects - but we were curious to see if we could create usable content now.
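
To make that a little more concrete, below is a toy sketch (in Python with NumPy) of the core idea: a learned function maps a 3D position and viewing direction to a colour and a density, and a new image is produced by sampling that function along each camera ray and compositing the samples. The function here is a stand-in for a trained network rather than any of the models we actually used, and all names and values are purely illustrative.

```python
import numpy as np

def toy_radiance_field(points, view_dir):
    """Stand-in for the trained neural network: maps 3D points to an RGB
    colour and a volume density. (This toy ignores the viewing direction;
    a real NeRF uses it to learn reflections and other view-dependent
    effects, which is one reason the capture needs so many angles.)"""
    rgb = 0.5 * (np.sin(points) + 1.0)                  # fake colour in [0, 1]
    density = np.exp(-np.linalg.norm(points, axis=-1))  # fake density
    return rgb, density

def render_ray(origin, direction, near=0.1, far=4.0, n_samples=64):
    """Volume-render one camera ray: sample the field along the ray and
    alpha-composite the samples from front to back."""
    t = np.linspace(near, far, n_samples)        # depths along the ray
    points = origin + t[:, None] * direction     # 3D sample positions
    rgb, density = toy_radiance_field(points, direction)

    delta = np.diff(t, append=far)                # spacing between samples
    alpha = 1.0 - np.exp(-density * delta)        # opacity of each sample
    # transmittance: how much light survives to reach each sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)   # composited pixel colour

# The colour of one pixel for a camera at the origin looking down -z:
print(render_ray(np.zeros(3), np.array([0.0, 0.0, -1.0])))
```

Training adjusts that learned function until rays rendered this way reproduce the frames of the capture video, which is why the capture needs to see the scene from as many angles as possible.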

A few test shoots and many hours of processing, rendering and re-rendering our assets later, we had the videos below ready to go out on social media.

They tread an interesting line between showing frozen ‘moments’ and a journey through these spaces, and we learned something new about the process of capturing or processing NeRFs while making each video.

Our guide to making NeRFs

Neural Radiance Fields - Capturing NeRFs

Capturing a video for our NeRF with a selfie stick.

  • Select a subject. Any people in shot mustn’t move (or ideally even blink) for 90 seconds.
  • Record video of the subject(s) and the space from all angles, trying not to cast a shadow or create any reflections (a selfie stick may be helpful). It does not matter if the video is horizontal or vertical or even if it’s well framed, but do set your camera up to avoid motion blur and don’t move too fast.
  • Use the video to train a neural network that can recreate the scene. There are several different models available which can do this; we chose one that combines elements from several recent models to produce high-quality renders - see the sketch after this list for how these steps fit together. (You will find out at the end of this step if your capture was good enough.)
  • Decide what camera framing and motion you want to export through the scene.
  • Review the resulting video.
  • If the video turned out as you expected, congratulations! You can now put it into your edit.
  • If the video didn’t turn out the way you wanted, either re-export a different path (relatively fast and easy), retrain your neural network (slow and annoying) or maybe even go back and re-capture your video (sometimes the fastest way, but definitely annoying).
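
For reference, the sketch below shows roughly what these steps look like when driven from a script using nerfstudio, the toolkit mentioned later in this post. The exact commands and flags vary between nerfstudio versions, the model and file paths here are placeholders, and your own setup may differ, so treat this as an outline of the pipeline rather than a recipe.

```python
import subprocess

capture = "studio_capture.mp4"   # the ~90 second handheld phone video
workdir = "tshirt_day_nerf"      # placeholder project directory

# 1. Extract stills from the video and estimate the camera pose for each
#    frame (structure-from-motion) - the 'up to an hour to process' step.
subprocess.run(["ns-process-data", "video",
                "--data", capture,
                "--output-dir", f"{workdir}/processed"], check=True)

# 2. Train the neural network on those posed images - the slow, overnight
#    part of the process. (nerfacto is nerfstudio's default model;
#    substitute whichever model you have chosen.)
subprocess.run(["ns-train", "nerfacto",
                "--data", f"{workdir}/processed"], check=True)

# 3. Design a camera path in the nerfstudio viewer, save it as JSON, then
#    render the new moving shot along that path. (ns-train prints the real
#    location of config.yml when training finishes.)
subprocess.run(["ns-render", "camera-path",
                "--load-config", f"{workdir}/config.yml",
                "--camera-path-filename", f"{workdir}/slow_push_in.json",
                "--output-path", f"{workdir}/slow_push_in.mp4"], check=True)
```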

Someone at a desk editing the video from the shoot on a computer; a notepad on the desk represents the space seen in the video, and Post-it notes on the notepad represent the positions of band members in the scene.

Using post-it notes to help storyboard and plan out 3D camera fly-through paths.

Each video took a couple of minutes to shoot, up to an hour to process, approximately 9 hours to train the network, half an hour to select our camera path, and about 4 hours to export a usable final video. As someone pointed out, this process can be a lot like developing a roll of film and having to wait to see how it comes out.

We decided to export our videos as quite slow journeys through the space - this made it much easier to vary the speed of our camera path in a video editor later.
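
Exporting a slow journey effectively means oversampling the camera path: the pose is interpolated over many more frames than the final shot needs, so sections can be sped up in the edit rather than slowed down (which would mean inventing frames). Below is a minimal, tool-agnostic sketch of that kind of interpolation; the coordinates and frame counts are made up for illustration.

```python
import numpy as np

def look_at(position, target, up=np.array([0.0, 0.0, 1.0])):
    """Build a camera-to-world rotation that points the camera at `target`."""
    forward = target - position
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Columns are the camera's x, y and z axes (camera looks down -z).
    return np.stack([right, true_up, -forward], axis=1)

def push_in_path(start, end, target, n_frames):
    """Interpolate camera positions from `start` to `end`, always looking at
    `target`. Exporting many frames (a slow move) leaves room to speed the
    shot up later in the edit without losing smoothness."""
    poses = []
    for s in np.linspace(0.0, 1.0, n_frames):
        position = (1.0 - s) * start + s * end
        pose = np.eye(4)                      # 4x4 camera-to-world matrix
        pose[:3, :3] = look_at(position, target)
        pose[:3, 3] = position
        poses.append(pose)
    return poses

# A high, distant start pushing in towards the subject (made-up coordinates),
# exported as a 20-second move at 25 fps even if the final shot lasts only 5.
path = push_in_path(start=np.array([4.0, 4.0, 3.0]),
                    end=np.array([0.5, 0.5, 1.6]),
                    target=np.array([0.0, 0.0, 1.6]),
                    n_frames=25 * 20)
print(len(path), "camera poses")
```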

Neural Radiance Fields - 3D Images

Screen recording where images from the capture video are aligned in the 3D space - and the subsequent NeRF, once training is complete.


The first NeRF we captured was of Mary Anne Hobbs, who was extremely patient in helping us break new ground. We placed a panel light in the studio to try to ensure that Mary Anne was captured more clearly than the background. We weren't sure what camera path we wanted, so we comprehensively captured the whole studio, including behind Mary Anne and facing all sides of the room, from high and low angles. As a result, when stills were extracted from the 90-second video, much of the input was not of Mary Anne but of the studio and all the details within it.

This meant the neural network had a lot of visual information to store, not only of Mary Anne but also of all the details of the room, and it rendered them all to approximately the same quality. This gave us a lot of flexibility in the choice of final camera path, but our actual subject, Mary Anne Hobbs, was rendered at a lower resolution than would usually pass an editorial standard.

We concluded from this first experiment that it is important to have an idea of the kind of camera path you wish to take before capturing data for the NeRF. This includes what you want to focus on as well as the range of viewing angles you want to see, so as to focus the network’s attention on the regions of interest.


The venue was the single largest space we have worked in to date, so capturing the Dutch band there also presented some interesting challenges.

To make sure we had options when editing, we tried two methods: one where we stood close to each member of the band and captured a short video of each of them at very close range, and another using a large monopod to get a very wide range of camera angles and capture the entire band at once.

For the short-range video, the goal was to avoid any movement of the subject that would blur the result and to focus on only one band member at a time to capture fine detail. This gave us very high-quality video, but we didn't get to see much of the studio, and stitching the individual captures together would have introduced more cuts than we really wanted.

Capturing the entire studio and band at once allowed us to plan expansive, unbroken camera moves, but, as expected, at the expense of detail on any one part of the scene. It did mean that we could experiment with some very interesting camera paths, such as this overhead-to-face-on shot, which begins in an area where the NeRF has very little information, leading to some very surreal images at the start:

NeRFs - Individual portraits

Ultimately, the video we chose was a simple push-in from a high, faraway vantage point to a close-up of the lead singer, something which would normally require a crane or a drone:

NeRFs - Camera paths

We wanted to experiment with making a NeRF ‘come to life’ and did this by moving an additional phone camera into place towards the end of the capture, then asking the band to ‘unfreeze’ and play the chorus of the song. It was important to make sure that the final frame in the NeRF dataset matched the first frame of the video as closely as possible. We still had issues in the edit going from a relatively low-resolution NeRF to a high-resolution video, so we added a distortion effect to soften the transition between the two:

NeRF - Transition

With these takeaways from both shoots, we repeated the process with another 6 Music host, Huw Stephens. We set up panel lighting and the selfie stick as before, though this time we went in knowing we wanted to see Huw’s face, his hand on the fader and his T-shirt in sharp detail, and to fly from above into the microphone. Consequently, we focused the capture on angles facing him directly, close-ups of his face, and angles over his head. We avoided capturing angles facing away from him or behind him, relying instead on the background being rendered from moments of disocclusion in the shots of Huw himself.

The resulting NeRF was much crisper, with much greater resolution on our host’s face and T-shirt. We could still use the non-physical camera paths NeRFs are capable of, but within the tighter space around Huw that we had captured. We effectively prioritised quality over quantity, as more of the network’s capacity was dedicated to the parts of the scene we actually wanted to see in the final render, namely Huw.

We took these lessons into the final shoots of the listener at home and at the tram stop, both captured in a small space with most of the capture focused on the subject. Our later NeRFs are of markedly better quality.


Why these collaborations are useful

It is extremely valuable to us as a research department to do these sorts of production tests. They tell us how close current technology is to being applied in the industry, but also allow us to accelerate our understanding, innovate on existing technology and prioritise work on the developments we think will be most useful.

Screenshot of the scene of the listener in their bedroom while being worked on in nerfstudio.

Using nerfstudio to make NeRF content.

For example, we currently plan camera moves with nerfstudio, which allows us easy access to many of the current state-of-the-art methods for producing NeRFs. We then edit our exported video inside a Non-Linear Editor (NLE). However, it would be interesting to consider where in the production workflow the path through a NeRF might best be decided and controlled. There are integrations which allow some control of the NeRF inside animation software or virtual production technologies. What would the editing process be like if the camera angle could be controlled inside an NLE instead? That is the point at which you can see the creative choice in context, so it would be incredibly powerful: it would allow us to re-make a shot to work better with the rest of the edit, and to see the effect of that change swiftly.

Working with production colleagues helps us understand which creative possibilities spark the most editorial interest and what the future might hold. Right now, the processing cost of making a moving NeRF is prohibitive, but this may change quickly, and if that happens, we can see many practical and creative uses for this technology.
