Authors:
(1) Han Jiang, HKUST and Equal contribution (hjiangav@connect.ust.hk);
(2) Haosen Sun, HKUST and Equal contribution (hsunas@connect.ust.hk);
(3) Ruoxuan Li, HKUST and Equal contribution (rliba@connect.ust.hk);
(4) Chi-Keung Tang, HKUST (cktang@cs.ust.hk);
(5) Yu-Wing Tai, Dartmouth College (yu-wing.tai@dartmouth.edu).
Table of Links
2. Related Work
2.1. NeRF Editing and 2.2. Inpainting Techniques
2.3. Text-Guided Visual Content Generation
3.1. Training View Pre-processing
4. Experiments and 4.1. Experimental Setups
5. Conclusion and 6. References
ABSTRACT
Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. To edit 3D scenes represented by NeRF, and with the advent of generative models, this paper proposes Inpaint4DNeRF, which capitalizes on state-of-the-art stable diffusion models (e.g., ControlNet [39]) to directly generate the completed background content, whether the scene is static or dynamic. This generative approach to NeRF inpainting has two key advantages. First, after rough mask propagation, we can individually generate a small subset of completed images with plausible content that fills in the previously occluded regions; these are called seed images, and simple 3D geometry proxies can be derived from them. Second, the remaining problem reduces to enforcing 3D multiview consistency among all completed images, which is now guided by the seed images and their 3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF baseline framework is general and readily extends to 4D dynamic NeRFs, where temporal consistency is naturally handled in the same way as multiview consistency.
1. INTRODUCTION
Recent developments in Neural Radiance Fields (NeRF) [18] and its dynamic variants, including [20, 25, 28], have shown great potential for modeling scenes in 3D and 4D. This representation is well suited to scene editing from straightforward user inputs such as text prompts, eliminating the need for detailed modeling and animation. One important task in scene editing is generative inpainting, which refers to generating plausible content that is consistent with the background scene. Generative inpainting has a wide range of potential applications, including digital art creation and VR/AR.
Although several recent works have addressed inpainting and text-guided content generation on NeRFs, they have various limitations in directly generating, from text input, novel content that blends seamlessly with the background. Specifically, [4, 24, 41] allow users to edit the appearance of an existing object in the NeRF based on a text prompt, but their generations are limited by the original target object, so they cannot handle substantial geometry changes. In [19, 31] a given NeRF is inpainted by removing the target object and inferring the background, but their inpainting task is not generative, as the inferred background is not matched to any other user input. Recent generative works [12, 21, 26, 32] produce new 3D or 4D content from text input, but the generation is not conditioned on the existing background. While [27] can generate content around the target object without removing it, generative inpainting on NeRF should allow users to remove target objects while filling the exposed region, previously totally or partially occluded in the given scene, with plausible 3D content.
While recent diffusion models [22] may largely solve generative inpainting in 2D, simply extending them to the higher dimensions of static or dynamic NeRFs introduces extensive challenges, including the need for new, more complex network structures and for sufficient higher-dimensional training data. On the other hand, a NeRF is a continuous multiview estimate of its training images, which provides a natural connection between 2D images and NeRF and enables us to propagate 2D diffusion inpainting results to the underlying scene.
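To make this connection concrete, the following is a minimal sketch of inpainting a single NeRF training view with an off-the-shelf Stable Diffusion inpainting pipeline from the HuggingFace diffusers library; the checkpoint name, file paths, and prompt are illustrative placeholders rather than the paper's exact configuration.

```python
# Minimal per-view 2D inpainting sketch with HuggingFace diffusers.
# Checkpoint, file names, and prompt below are placeholder assumptions,
# not the configuration prescribed by the paper.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

view = Image.open("training_view_000.png").convert("RGB")  # one NeRF training view
mask = Image.open("target_mask_000.png").convert("L")       # white = region to inpaint

inpainted = pipe(
    prompt="a wooden bench in the park",  # user text prompt
    image=view,
    mask_image=mask,
).images[0]
inpainted.save("seed_view_000.png")  # a candidate seed image
```

Running this independently on each view produces plausible but mutually inconsistent completions, which is precisely the multiview-consistency problem the rest of the paper addresses.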
Thus, in this paper, we propose Inpaint4DNeRF, the first work on text-guided generative NeRF inpainting with diffusion models, which naturally extends to inpainting 4D dynamic NeRFs. Given a text prompt and a target foreground region, specified by the user via text or on one or more individual images, our method first uses stable diffusion to inpaint a few independent training views. We then treat these inpainted views as seed images and infer a coarse geometry proxy from them. Strongly guided by the seed images and their geometry proxies, the remaining views are constrained to be consistent with the seed images when refined with stable diffusion; this is followed by NeRF finetuning with progressive updates on the training views to reach final multiview convergence. For dynamic scenes, once generative inpainting is achieved on a single static frame, that frame serves as the seed frame from which the edited information is propagated to the other frames, naturally inpainting the scene in 4D without bells and whistles. In summary, our generative Inpaint4DNeRF is a text-guided approach for inpainting the relevant background after removing static foreground objects specified by a user-supplied prompt, while meticulously maintaining multiview consistency.
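The overall flow described above can be summarized in the following high-level sketch. All helper names (propagate_masks, inpaint_with_diffusion, build_geometry_proxy, refine_view, finetune_nerf) are hypothetical placeholders standing in for the paper's components, not an actual API.

```python
# High-level sketch of the Inpaint4DNeRF pipeline for a static scene.
# Every helper function here is a hypothetical placeholder for one of the
# stages described in the text; this is not the authors' implementation.

def inpaint4dnerf(nerf, training_views, user_mask, prompt, seed_ids):
    # 1. Roughly propagate the user-specified target mask to all training views.
    masks = propagate_masks(user_mask, training_views, nerf)

    # 2. Independently inpaint a small subset of views to obtain seed images.
    seeds = {i: inpaint_with_diffusion(training_views[i], masks[i], prompt)
             for i in seed_ids}

    # 3. Derive a coarse 3D geometry proxy from the seed images.
    proxy = build_geometry_proxy(seeds, nerf)

    # 4. Refine the remaining views with stable diffusion, constrained by the
    #    seed images and the proxy so they stay multiview-consistent.
    for i in range(len(training_views)):
        if i not in seed_ids:
            training_views[i] = refine_view(training_views[i], masks[i],
                                            prompt, seeds, proxy)
        else:
            training_views[i] = seeds[i]

    # 5. Finetune the NeRF with progressive updates on the edited training
    #    views until the generated content converges across views.
    return finetune_nerf(nerf, training_views)
```

For the dynamic case, the same idea applies across time: the inpainted frame acts as a seed frame whose content is propagated to the remaining frames, so temporal consistency is handled analogously to multiview consistency.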
To summarize, Inpaint4DNeRF makes the following contributions. First, harnessing the power of recent diffusion models for image inpainting, we can directly generate text-guided content with new geometry while remaining consistent with the context given by the unmasked background. Second, our approach infers and refines the other views from the initially inpainted seed images to achieve multiview consistency across all given views, with temporal consistency naturally enforced in the dynamic case.
This paper is available on arXiv under a CC 4.0 license.