
License: arXiv.org perpetual non-exclusive license
arXiv:2308.06962v2 [cs.CV] 19 Dec 2023

Color-NeuS: Reconstructing Neural Implicit Surfaces with Color

Licheng Zhong$^{\star}$  Lixin Yang$^{1,2,\star}$  Kailin Li$^{1}$  Haoyu Zhen$^{1}$  Mei Han$^{3}$  Cewu Lu$^{1,2,\dagger}$
$^{1}$Shanghai Jiao Tong University  $^{2}$Shanghai Qi Zhi Institute  $^{3}$National University of Singapore
{zlicheng, siriusyang, kailinli, anye_zhen, lucewu}@sjtu.edu.cn {hanmei}@u.nus.edu
Abstract

The reconstruction of object surfaces from multi-view images or monocular video is a fundamental issue in computer vision. However, much of the recent research concentrates on reconstructing geometry through implicit or explicit methods. In this paper, we shift our focus towards reconstructing mesh in conjunction with color. We remove the view-dependent color from neural volume rendering while retaining volume rendering performance through a relighting network. Mesh is extracted from the signed distance function (SDF) network for the surface, and color for each surface vertex is drawn from the global color network. To evaluate our approach, we conceived an in-hand object scanning task featuring numerous occlusions and dramatic shifts in lighting conditions. We gathered several videos for this task, and our results surpass those of existing methods capable of reconstructing mesh alongside color. Additionally, our method’s performance was assessed using public datasets, including DTU, BlendedMVS, and OmniObject3D. The results indicate that our method performs well across all these datasets. Project page: colmar-zlicheng.github.io/color_neus.

$^{\star}$ These authors contributed equally. $^{\dagger}$ Cewu Lu is the corresponding author. He is a member of the Qing Yuan Research Institute and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, and Shanghai Qi Zhi Institute, China.

1 Introduction

The endeavor of reconstructing 3D objects from 2D images is a pivotal and ongoing challenge in the domains of computer vision and graphics. Previously, the structure-from-motion (SFM) method [20, 21] was widely used to reconstruct 3D objects from 2D images. However, it often struggled with scenes lacking texture or those with repetitive patterns, leading to ambiguous correspondences and incorrect depth estimations. Another limitation of SFM was its inability to effectively handle occlusions. When an object in the scene was partially obscured, it would often lead to errors in reconstruction. A further limitation of SFM lay in its dependency on point cloud representation, which falls short of accomplishing a fully dense reconstruction. Recently, the landscape of this field has been evolving, with a burgeoning interest in implicit neural surfaces learned via volume rendering [23], which can represent surfaces at pixel-level detail. This line of work builds on the neural radiance field (NeRF) [18].

Figure 1: Color-NeuS decouples the view-dependent component from the color network and employs a relighting network to account for the view-dependent aspect. This allows Color-NeuS to reconstruct a 3D implicit surface with color, while maintaining its ability for 2D volume rendering.

Pioneering works like NeRF [18] and its successors [36, 4, 1, 16, 30] have convincingly demonstrated the power of neural networks in representing continuous 3D scenes. This is achieved by learning a mapping from 3D coordinates to volume density and view-dependent color, leading to highly effective novel view synthesis. Subsequently, NeuS [23] extended this concept, leveraging signed distance functions (SDF) to reconstruct a neural implicit surface. However, NeuS’s scope remains limited to the reconstruction of a mesh without vertex color. This limitation arises because the color of each point in NeuS’s neural volume rendering is determined by both its position and the viewing ray direction. In the context of mesh reconstruction, where the view direction is absent, NeuS therefore cannot assign a specific color value to each point. In addition, most NeRF relighting works focus on 2D image rendering rather than textured surface reconstruction. To address these limitations, in this paper, we present a novel method for the reconstruction of neural implicit surfaces incorporating view-independent color.

We propose Color-NeuS, a NeuS-compatible approach that facilitates the reconstruction of 3D surfaces and global vertex colors, independent of the viewpoint. Simultaneously, it upholds NeuS’s robust functionalities in rendering 2D images and executing 3D surface reconstruction. To this end, we replace the single view-dependent color component in the neural volume rendering with the combination of a view-independent global color variable and a view-dependent relighting effect, as shown in Fig. 1. This not only enables our model to be trained for standard volume rendering but also paves the way for the learning of global vertex colors. Importantly, Color-NeuS naturally handles substantial reflection on the object’s surface, and can deal with intermittent occlusion during dynamic interaction between objects (see Fig. 7).

The effectiveness of Color-NeuS has been extensively validated on various datasets, including the DTU [13], BlendedMVS [32], and OmniObject3D [29] datasets. To demonstrate the superiority of our method, we conducted comparative evaluations with laser scanning and well-established methods such as COLMAP [20, 21] and HHOR [12]. The results reveal that Color-NeuS successfully reconstructs the object’s surface while extracting a reasonable texture. To underscore its practical application, we apply Color-NeuS to a challenging real-world task: hand-held object scanning [12, 10]. As part of this process, we collect a dataset for validation, which includes 3D scans of objects, videos of moving objects held in hand, and object-centric camera transformations.

Our contributions can be summarized as follows:

  • We propose a novel method for the reconstruction of a neural implicit surface with color that can be applied to any NeuS-like models.

  • We decouple the view-dependent process in neural volume rendering, enabling the handling of occlusion and reflection while obtaining global vertex colors.

  • We devise a challenging task of in-hand object scanning for the reconstruction of mesh with color and compile a real-world dataset for this task.

2 Related Work

Neural Radiance Field. Following the pioneering work of NeRF [18], a variety of studies have emerged in the realm of neural implicit fields. Notable among them, NeRF-- [26], NoPe-NeRF [2], BARF [15], and GNeRF [17] investigate methods to estimate camera pose while training the implicit field. Concurrently, efforts such as Pixel-NeRF [36], MVSNeRF [4], and IBRNet [24] focus on the generalization of neural radiance fields. In parallel, research projects like KiloNeuS [7], NeX [28], NeRV [22], NeRD [3], PhySG [37], and NeRFactor [38] focus on illumination and relighting in neural fields. Our method is compatible with neural volume rendering and retains the capacity for novel view synthesis.

Surface Reconstruction. Implicit Differentiable Renderer (IDR) [33] represents geometry as the zero level set of a Signed Distance Function (SDF) by leveraging implicit geometric regularization (IGR) [8]. NeuS [23] amalgamates the SDF field and volume rendering to reconstruct surfaces. Both VolSDF [34] and UNISURF [19] combine implicit surface representation with volume rendering: VolSDF [34] disentangles appearance from geometry, while UNISURF [19] formulates implicit surface models and radiance fields in a cohesive manner. PET-NeuS [25], an extension of NeuS [23], introduces new components such as a unique positional encoding type, tri-plane representation, and learnable convolution operations. However, these methodologies do not factor in the global view-independent color of the surface. Our method focuses on the extraction of surface color from a global color network while maintaining geometry learning and image rendering through a relighting network.

In-hand Object Scanning. ObMan [11] presents an end-to-end learnable model that employs a unique contact loss, which encourages physically plausible hand-object configurations. For the in-hand object scanning task, IHOI [35] reconstructs from a single RGB image, taking advantage of the estimated hand pose. BundleSDF [27], on the other hand, estimates the object pose from a sequence of RGB-D input images while reconstructing the implicit surface represented by the Signed Distance Function (SDF). HHOR [12] also utilizes SDF to represent the object surface, but it reconstructs the object in tandem with the hand, where the object is firmly held. A recent work [9] reconstructs the object from an RGB sequence and estimates the camera pose simultaneously by using an occupancy field to represent the surface. However, it assumes that the light source is distant and the direction of light remains unchanged. In contrast, our method can handle arbitrary lighting conditions to model the object appearance.

3 Preliminary

We first introduce Neural Implicit Surface (NeuS) [23], which is the basis of our method. Given a camera with known intrinsic parameters, we can represent a ray in the camera coordinate system as

$$\bm{p}(z) = \bm{o} + z\bm{d}, \tag{1}$$

where $\bm{o}$ and $\bm{d}$ are the origin and the direction of the ray, respectively, and $z$ is the distance from the origin to the point on the ray. Then, an MLP network $\mathcal{G}$ is used to encode $\bm{p}$ to its signed distance function (SDF) value $s(\bm{p})$ and feature vector $f(\bm{p})$, as

$$[s(\bm{p}), f(\bm{p})] = \mathcal{G}(\bm{p}). \tag{2}$$

With the position $\bm{p}$, direction $\bm{d}$, feature vector $f(\bm{p})$ and gradient $g(\bm{p})$ as input, another MLP $\mathcal{M}$ outputs the color of the query point, as

$$c(\bm{p},\bm{d}) = \mathcal{M}(\bm{p}, \bm{d}, f(\bm{p}), g(\bm{p})), \tag{3}$$

where $g(\bm{p}) = \nabla s(\bm{p})$ is the gradient of the SDF at point $\bm{p}$. Finally, the color of the query pixel $C$ is obtained by integrating the color along the ray,

$$C = \int_{z_n}^{z_f} w(z)\, c(\bm{p},\bm{d})\, dz, \tag{4}$$
$$w(z) = \exp\Big(-\int_{z_n}^{z} \sigma(t)\, dt\Big) \cdot \sigma(z), \tag{5}$$
$$\sigma(\bm{p}) = \frac{\alpha e^{-\alpha s(\bm{p})}}{\left(1 + e^{-\alpha s(\bm{p})}\right)^{2}}, \tag{6}$$

where $\sigma(\bm{p})$ designates the density of $\bm{p}$ with $\alpha \in \mathbb{R}^{1}$ acting as a learnable parameter. The term $w(z)$ refers to the weight assigned to the color at point $z$. Additionally, $z_n$ and $z_f$ represent the near and far plane of the camera, respectively.

Based on the input $\bm{d}$ in Eq. 4, the output per-vertex color is view-dependent. Consequently, the original NeuS avoids incorporating color, focusing solely on the reconstruction of surface shape.
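As a rough illustration of Eqs. 4 to 6, the sketch below approximates the ray integral with a plain Riemann sum in standard-library Python. The callables `sdf` and `color`, the sample count `n`, and the fixed value of `alpha` are assumptions made for illustration; in the method they correspond to learned networks and a learnable parameter, and NeuS itself uses a more careful discretization than this one.

```python
import math

def logistic_density(s, alpha):
    """Eq. 6: density as a function of the SDF value s and sharpness alpha."""
    e = math.exp(-alpha * s)
    return alpha * e / (1.0 + e) ** 2

def render_ray(o, d, sdf, color, alpha, z_near, z_far, n=2000):
    """Riemann-sum approximation of Eqs. 4 and 5 along p(z) = o + z * d.

    sdf(p) and color(p, d) are hypothetical stand-ins for the networks.
    Returns the pixel color and the accumulated weight along the ray.
    """
    dz = (z_far - z_near) / n
    accumulated = 0.0            # running integral of sigma from z_near to z
    pixel = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for k in range(n):
        z = z_near + (k + 0.5) * dz
        p = [o[i] + z * d[i] for i in range(3)]      # Eq. 1
        sigma = logistic_density(sdf(p), alpha)      # Eq. 6
        w = math.exp(-accumulated) * sigma           # Eq. 5
        c = color(p, d)
        for i in range(3):
            pixel[i] += w * c[i] * dz                # Eq. 4
        weight_sum += w * dz
        accumulated += sigma * dz
    return pixel, weight_sum
```

Passing a color callable that ignores `d` makes the rendered pixel view-independent, which is the situation the following sections examine.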

4 Method

Our goal is to extract both the color and the geometry of objects,

$$c_g(\bm{x}), \quad \bm{x} \in \mathcal{S}, \tag{7}$$

where $\mathcal{S} = \left\{\bm{x} \in \mathbb{R}^{3} \mid s(\bm{x}) = 0\right\}$ represents the object surface (a collection of mesh vertices) that is characterized by a set of points with zero-level signed distance value.

Property 1.

([23, Sec. 3.1]) NeuS possesses an advantageous property wherein the standard deviation of the density $\sigma(\bm{p})$ is dictated by the trainable parameter $1/\alpha$, which progressively approaches zero as the network training reaches convergence. $\blacksquare$

As implied by Property 1, the density $\sigma(\bm{p})$ of the neural radiance field substantially condenses on the surface $\mathcal{S}$. This provides a focal point for us to extract the surface (per-vertex) color of the object once network training reaches convergence.
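One way to see Property 1 concretely: viewed as a function of the signed distance $s$, the density of Eq. 6 is the probability density of a logistic distribution centred at $s = 0$ with scale $1/\alpha$. The short derivation below relies only on that standard fact; the symbol $\Phi_\alpha$ for the sigmoid is introduced here for illustration and does not appear in the original text.

```latex
\begin{align}
\sigma(s) &= \frac{\alpha e^{-\alpha s}}{\left(1 + e^{-\alpha s}\right)^{2}}
           = \frac{d}{ds}\,\Phi_\alpha(s),
\qquad \Phi_\alpha(s) = \frac{1}{1 + e^{-\alpha s}}, \\
\int_{-\infty}^{\infty} \sigma(s)\, ds &= \Phi_\alpha(\infty) - \Phi_\alpha(-\infty) = 1, \\
\operatorname{Std}[s] &= \frac{\pi}{\sqrt{3}} \cdot \frac{1}{\alpha}
\;\longrightarrow\; 0 \quad \text{as } 1/\alpha \to 0 .
\end{align}
```

The total mass stays fixed while the spread shrinks in proportion to $1/\alpha$, so the density concentrates on the zero level set $s = 0$.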

4.1 Naive Solution

One solution that might seem intuitive is to simply expunge the view-dependent term in Eq. 3, as:

$$c(\bm{p}) = \mathcal{M}(\bm{p}, f(\bm{p}), g(\bm{p})). \tag{8}$$

Regrettably, this procedure can impair the learned geometry and appearance within the neural radiance field. This is attributed to the incapacity of the neural radiance field to accurately express light variations of points across different directions when a view-dependent term is absent. Furthermore, such an approach may also culminate in the breakdown of the SDF field, subsequently triggering fragmentation of the surface. The qualitative results of the naive solution are displayed in Fig. 6.

Figure 2: Illustration of the three solutions: (a) Naive Solution: Utilize an SDF network to learn the implicit geometry, complemented with a global color network to grasp view-independent color; (b) Intermediate Solution: Employ a color network to learn view-dependent color and extract the surface color with a direction specified as per vertex normal; (c) Color-NeuS (Our Solution): Make use of a global color network to learn view-independent color, and deploy a relight network to compensate for variations with the view direction. During inference, only the global view-independent color will be used.

4.2 Intermediate Solution

The intermediate (interm.) solution is to utilize a constant direction as an input to the color network in order to extract color information from the surface. For instance, in HHOR [12], the vertex color is defined based on the surface’s normal direction. For any $\bm{x} \in \mathcal{S}$, the normal direction is given by $-g(\bm{x})$. Consequently, the vertex color can be extracted using

$$c(\bm{x}) = \mathcal{M}(\bm{x}, -g(\bm{x}), f(\bm{x}), g(\bm{x})). \tag{9}$$

Yet, given the considerable variation in an object’s surface color under changing lighting conditions, it may be insufficient to assign a singular ‘direction color’ as a representation of the object’s true color. The inherent complexity associated with light-surface interactions calls for a more nuanced approach to color extraction.

4.3 Color-NeuS

In our method Color-NeuS, we propose a workflow to disassociate the global color from the view-dependent formulation, all the while preserving the appropriate geometry and obtaining a reasonable view-independent color for each vertex. Specifically, we substitute the learning of a single view-dependent color component (as indicated in Eq. 3) with the learning of a view-independent global color variable (Sec. 4.3.1) coupled with a view-dependent relighting effect (Sec. 4.3.2).

4.3.1 Removing View-Dependence

Our proposed solution begins with the removal of the view-dependent term from Eq. 3, which transitions the model to a global color network, as:

$$c_g(\bm{p}) = \mathcal{M}_g(\bm{p}, f(\bm{p}), g(\bm{p})). \tag{10}$$

This formulation matches the naive solution presented in Eq. 8, an ideal starting point for extracting the view-independent vertex color. However, solely employing this equation (ignoring the view-dependence) to optimize the network to convergence will violate Property 1. Consequently, this contravention can prevent the density $\sigma(\bm{p})$ from condensing onto the object surface, inadvertently affecting the learned geometry.

Therefore, our proposed solution (Color-NeuS) is to separate the inference procedure from optimization (training). During optimization, we re-integrate the global color with a residual term that incorporates the view direction. The outcome of this integration meets the view requirement set forth in the volume rendering formulation Eq. 4, and preserves Property 1 throughout the optimization phase.

During inference, once the density $\sigma(\bm{p})$ has condensed to the object surface, the shape of the object ($s(\bm{p}) = 0$) can be guaranteed. Additionally, $c_g(\bm{p})$ acquires the capability to express the vertex color. Consequently, we can extract the vertex color on the surface (where $\bm{x} \in \mathcal{S}$, that is, $s(\bm{x}) = 0$) during inference using:

$$c_g(\bm{x}) = \mathcal{M}_g(\bm{x}, f(\bm{x}), g(\bm{x})). \tag{11}$$
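The inference step of Eq. 11 can be sketched as follows. This is a minimal illustration under stated assumptions: `sdf`, `feature`, and `global_color` are hypothetical callables standing in for $s$, $f$, and $\mathcal{M}_g$, and the gradient $g(\bm{x})$ is estimated with central differences purely to keep the example self-contained (a real implementation would more likely use automatic differentiation).

```python
def sdf_gradient(sdf, x, eps=1e-4):
    """Central-difference estimate of g(x), the gradient of the SDF at x."""
    grad = []
    for i in range(3):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    return grad

def extract_vertex_colors(vertices, sdf, feature, global_color):
    """Eq. 11: evaluate c_g(x) = M_g(x, f(x), g(x)) at every mesh vertex.

    No view direction is passed, so each vertex receives a single
    view-independent color.
    """
    return [global_color(x, feature(x), sdf_gradient(sdf, x)) for x in vertices]
```

The vertices would typically come from a mesh extracted at the zero level set of the SDF; the extraction step itself is outside the scope of this sketch.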

4.3.2 Coupling Relighting Effect

To maintain the model’s performance after removing view-dependence in $c_g(\bm{x})$, we introduce a relighting network to compensate for the discarded view-dependent term. The relighting network functions with respect to the position, direction, and view-independent color, generating a small view-dependent color adjustment for each point. This can be expressed as:

$$c_r(\bm{p},\bm{d}) = \mathcal{R}_g(c_g(\bm{p}), \bm{p}, \bm{d}, g(\bm{p})). \tag{12}$$

Subsequently, the ultimate color of each point is computed by integrating the relighting effect with the global color, as:

$$c(\bm{p},\bm{d}) = \Psi\left(\Psi^{-1}(c_g(\bm{p})) + c_r(\bm{p},\bm{d})\right), \tag{13}$$

where $\Psi$ and $\Psi^{-1}$ represent the sigmoid and inverse sigmoid function, respectively. At present, the color is a fusion of a global color (view-independent) and a relighting effect (view-dependent). The $c(\bm{p},\bm{d})$ is then incorporated in Eq. 4 to compute the volume integral $C$.
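The composition in Eq. 13 adds the relighting term in logit space and maps the result back through the sigmoid, so the final color stays inside $(0, 1)$. A minimal per-channel sketch follows; the clamping constant `eps` is an assumption added here to keep the inverse sigmoid finite and is not part of the original formulation.

```python
import math

def sigmoid(v):
    """Psi in Eq. 13."""
    return 1.0 / (1.0 + math.exp(-v))

def inverse_sigmoid(c, eps=1e-6):
    """Psi^{-1} in Eq. 13 (the logit); eps clamping is an illustrative assumption."""
    c = min(max(c, eps), 1.0 - eps)
    return math.log(c / (1.0 - c))

def compose_color(c_global, c_relight):
    """Eq. 13: c = Psi(Psi^{-1}(c_g) + c_r), applied per color channel."""
    return [sigmoid(inverse_sigmoid(g) + r) for g, r in zip(c_global, c_relight)]
```

With a zero relighting term the global color is returned unchanged, which is consistent with using only $c_g$ at inference time.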

4.4 Optimization

The rendered color $C$ along the sampled rays can be computed using Eqs. 4, 5 and 6. Letting $\widehat{C}$ denote the ground-truth color, we define the color loss as:

$$\mathcal{L}_c = \frac{1}{N_r}\sum_{i=1}^{N_r}\|\widehat{C}_i - C_i\|_2^2, \tag{14}$$

where $N_r$ denotes the number of sampled rays.

To ensure that the optimized neural representation is a valid SDF, we impose the eikonal regularization [8] on the SDF prediction, as:

$$\mathcal{L}_e = \frac{1}{N_r N_p}\sum_{i,j}^{N_r,N_p}\left(\left\|\nabla s(\bm{p}_{i,j})\right\|_2 - 1\right)^2, \tag{15}$$

where $N_p$ is the number of sampling points on each ray.

To draw the global color closer to the actual color, we constrain the mean value of the relighting color $c_r$ to be zero, as:

$$\mathcal{L}_r = \frac{1}{N_r N_p}\sum_{i,j=1}^{N_r,N_p} {c_r}_{i,j}(\bm{p},\bm{d}). \tag{16}$$

This strategy prompts the global color network to learn colors under average lighting conditions. In other words, this minimizes the impact of the relighting network on the global color. This loss term is necessary because we cannot directly supervise the global color, for the reasons mentioned in the naive solution (Sec. 4.1).
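The three terms of Eqs. 14 to 16 can be written out directly on nested Python lists, as in the sketch below. The data layout (rays as the outer list, samples per ray as the inner list) is an assumption for illustration. Eq. 16 is stated as a mean of the relighting color, and the text does not say how that vector is reduced to a scalar penalty, so the sketch stops at the per-channel mean rather than guessing a norm.

```python
import math

def color_loss(rendered, target):
    """Eq. 14: mean squared L2 distance between rendered and ground-truth pixel colors."""
    total = 0.0
    for c, c_hat in zip(rendered, target):
        total += sum((a - b) ** 2 for a, b in zip(c_hat, c))
    return total / len(rendered)

def eikonal_loss(gradients):
    """Eq. 15: gradients[i][j] is the SDF gradient at sample j of ray i."""
    total, count = 0.0, 0
    for ray in gradients:
        for g in ray:
            norm = math.sqrt(sum(v * v for v in g))
            total += (norm - 1.0) ** 2
            count += 1
    return total / count

def relight_mean(relight):
    """Eq. 16: per-channel mean of c_r over all rays and samples.

    relight[i][j] is the relighting color at sample j of ray i.
    """
    sums, count = [0.0, 0.0, 0.0], 0
    for ray in relight:
        for c_r in ray:
            for k in range(3):
                sums[k] += c_r[k]
            count += 1
    return [s / count for s in sums]
```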

In the context of object reconstruction, an intuitive method involves using object (foreground) segmentation to eliminate background elements. In scenarios where foreground segmentation is available and free of object-scene occlusion, we recommend incorporating a mask loss $\mathcal{L}_m$:

$$\mathcal{L}_m = \frac{1}{N_r}\sum_{N_r} \mathrm{BCE}(M, \hat{O}), \tag{17}$$

where $\hat{O} = \sum_{j} w_j$ is the cumulative weight along a camera ray, and $M$ is a binary mask that signifies whether a ray is within the boundaries of the object segmentation.

In summary, our training loss is calculated as per Eq. 18, where $\lambda_c$, $\lambda_e$, $\lambda_r$, and $\lambda_m$ are hyperparameters.

$$\mathcal{L} = \lambda_c\mathcal{L}_c + \lambda_e\mathcal{L}_e + \lambda_r\mathcal{L}_r + \lambda_m\mathcal{L}_m. \tag{18}$$
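A small sketch of Eqs. 17 and 18 is given below, using the standard definition of binary cross-entropy. The clamping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the default weights are the values reported in Sec. 5.1 for datasets with usable masks; passing `lam_m=0.0` corresponds to the setting used when no mask loss is applied.

```python
import math

def mask_loss(masks, opacities, eps=1e-7):
    """Eq. 17: mean binary cross-entropy between mask M and accumulated weight O_hat.

    masks[i] is 0 or 1 for ray i; opacities[i] is the summed weight along ray i.
    """
    total = 0.0
    for m, o in zip(masks, opacities):
        o = min(max(o, eps), 1.0 - eps)   # clamp for numerical safety (assumption)
        total -= m * math.log(o) + (1.0 - m) * math.log(1.0 - o)
    return total / len(masks)

def total_loss(l_c, l_e, l_r, l_m, lam_c=1.0, lam_e=0.1, lam_r=1.0, lam_m=0.1):
    """Eq. 18: weighted sum of the color, eikonal, relighting and mask terms."""
    return lam_c * l_c + lam_e * l_e + lam_r * l_r + lam_m * l_m
```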

In addition to optimizing the network parameters mentioned above, for our in-the-wild dataset we also optimize the camera pose following NeRF-- [26]. As in GNeRF [17], we utilize a continuous 6D vector $\bm{r} \in \mathbb{R}^{6}$ to represent 3D rotations, which has been demonstrated to be more suitable for learning [39]. The joint optimization can be expressed as:

$$\Theta^{*}, \Pi^{*} = \operatorname*{arg\,min}_{\Theta, \Pi} \mathcal{L}(\Theta, \Pi), \tag{19}$$

where $\Theta$ and $\Pi$ refer to the network parameters and camera pose, respectively.
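For readers unfamiliar with the continuous 6D rotation representation, the sketch below shows the usual way such a vector is mapped to a rotation matrix: the two 3D halves are orthonormalized with a Gram-Schmidt step and the third axis is obtained by a cross product. Treating the three resulting vectors as matrix columns is a convention assumed here; the text does not state which convention the authors use.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotation_from_6d(r):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    r[:3] and r[3:] must be non-zero and not parallel.
    Returns the matrix as a list of rows, with b1, b2, b3 as its columns.
    """
    a1, a2 = r[:3], r[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])   # remove the b1 component
    b3 = _cross(b1, b2)                                      # completes a right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because every 6D input (outside the degenerate cases noted in the docstring) yields a valid rotation, the pose can be updated by unconstrained gradient steps on $\bm{r}$.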

5 Experiment

5.1 Implementation Details

We randomly sample 1024 rays per batch with 8 random images and train our model for 100k iterations on a single NVIDIA A10 GPU. Following NeuS, the learning rate is first linearly warmed up from $0$ to $5\times 10^{-4}$ in the first 5k iterations, and then controlled by the cosine decay schedule to the minimum learning rate of $2.5\times 10^{-5}$. For all datasets, we set $\lambda_c$ to $1.0$, $\lambda_e$ to $0.1$, and $\lambda_r$ to $1.0$. For datasets where no object segmentation is available (OmniObject3D) or where object-scene occlusion occurs (IHO-Video), we set $\lambda_m$ to $0.0$. For the other datasets, we set $\lambda_m$ to $0.1$. If a segmentation mask is available, we apply a sampling strategy so that the proportion of sampled rays falling inside the mask gradually increases from 50% to 80% during training.
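The learning-rate schedule described above can be sketched as a function of the iteration index. The constants are taken from this section; the exact form of the cosine decay (a half-cosine from the peak to the minimum over the remaining iterations) is an assumption, since the text only names the schedule.

```python
import math

def learning_rate(step, total_steps=100_000, warmup_steps=5_000,
                  peak_lr=5e-4, min_lr=2.5e-5):
    """Linear warm-up from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)     # hold min_lr if training runs past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```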

Figure 3: Illustration of the monocular recording setup of the IHO-Video, HOS task. The footage was captured using an uncalibrated mobile phone camera (default 1080p-30fps). This simple setup allows the quick deployment of our method, enabling users to collect data on their objects of interest with ease.
Figure 4: Illustration of the hand-held scanning system based on a Structured-Light 3D scanner. To fully scan the object, the subject must separately scan and then align its top and bottom surfaces. The scanned meshes serve as the ground truth in the HOS task.
Figure 5: Results on IHO Video.
Figure 6: Results on BlendedMVS dataset and DTU dataset.

5.2 Empirical Evaluation - Hand-held Object Scan

We initiate the evaluation by deploying Color-NeuS to a real-world challenging task: ‘hand-held object scanning’ (HOS). In HOS, to reveal the object appearance from all directions, the hand must continuously rotate and flip the object, thereby causing continual variation in occlusion and shadow effects. Our performance evaluation is conducted in comparison with Color-NeuS’s alternative solutions. To evaluate these solutions both quantitatively and qualitatively, we have curated an original validation set termed ‘In-Hand Object Video’ (IHO-Video).

IHO-Video Dataset. This dataset includes four video sequences, each documenting a hand-held object under a monocular recording setup (as illustrated in Fig. 3). The footage was captured using an iPhone with resolution set to 1920×1080. Notably, these videos are characterized by their distinct dynamic attribute where objects and hands move in coordination, resulting in significant instances of dynamic occlusion and shadow interference. Furthermore, surface reflections that fluctuate as the objects move add an additional layer of complexity. To extract object masks for each frame, we employ Track-Anything [31], a method grounded in Segment-Anything [14] for segmentation and XMem [5] for instance tracking. To obtain camera poses, we utilize the off-the-shelf Structure-from-Motion toolkit COLMAP [20, 21]. Additionally, we used a professional hand-held Structured Light 3D scanner (SHINING 3D EinScan Pro 2X 2020) to obtain the ground-truth 3D object mesh, as shown in Fig. 4.

Figure 7: Examples of occlusion and reflection phenomena in our IHO-Video dataset. The first row showcases the images used for training. The gray area represents the background elements that are eliminated by the mask. The second row showcases the output meshes in the image-aligned view. As illustrated, Color-NeuS consistently demonstrates robust performance in reconstructing both the shape and texture of the objects.
Figure 8: This illustration demonstrates that, among the four methods we applied to the pink peach sequence, only our approach successfully reconstructs the correct geometry of the peach rod.

HOS Task Evaluation. We compared the results of Color-NeuS with those derived from structured-light scanning (see Fig. 4) and COLMAP’s dense reconstruction. Additionally, our solution was compared with both the naive and intermediate solutions discussed in Secs. 4.1 and 4.2. The respective results are depicted in Fig. 5.

When scanning with structured light, the additional fill-in light results in color renderings that diverge from those observed in the video footage. The results of the Poisson reconstruction stemming from COLMAP are also unsatisfactory due to the imperfect camera poses and low-resolution point cloud representation. In addition, the naive solution was unable to accurately learn the object’s geometry. The intermediate solution, despite successfully learning the correct geometry, had difficulty dealing with occlusion and reflection, resulting in artifacts such as dark spots appearing on the object mesh. In contrast, Color-NeuS adeptly managed such disturbances. Fig. 7 showcases several examples of occlusion and reflection. In spite of these challenges, Color-NeuS consistently exhibits robust performance in modeling the shape and appearance of the object surfaces. Furthermore, its geometric reconstruction results align well with the ground truth (see Tab. 3), showcasing its efficacy.

Our method has demonstrated considerable enhancement in object geometry learning. As evident in Fig. 8, other methods such as structured-light scanning, COLMAP, and NeuS fall short in accurately learning the structure of the peach rod. In contrast, our proposed Color-NeuS reconstructs it correctly. This superior performance is attributed to Color-NeuS’s ability to discern view-independent color, preventing the misinterpretation of the rod as a black section on the surface of the peach body.

5.3 Evaluation on Public Datasets

OmniObject3D. [29] OmniObject3D is a comprehensive 3D object dataset characterized by an extensive vocabulary and a large number of high-quality, real-scanned 3D objects. It includes 6,000 scanned objects drawn from 190 everyday categories. Furthermore, it also provides object-centric, photorealistic multi-view images rendered using Blender [6], thus facilitating a wide range of experiments and analyses. Due to the dataset’s extensive size, in this paper we report results on two subsets of the rendered images, namely the ‘doll’ set and the ‘toy_animals’ set.

Figure 9: Results on OmniObject3D dataset.

DTU. [13] The DTU dataset comprises a broad diversity of materials, appearances, and geometric structures, posing significant challenges for reconstruction algorithms, including handling non-Lambertian surfaces and delicate, thin structures. Consistent with the NeuS [23] method, we employ 15 scenes from this dataset for our experiments.

BlendedMVS. [32] Following NeuS [23], we also conducted tests on 8 challenging scenes drawn from the low-resolution subset of the BlendedMVS dataset. In this dataset, each image has a resolution of 768×576 pixels and is accompanied by a corresponding mask.

HOD. [12] The Hand-held Object Dataset (HOD) contains 35 objects, divided into two subsets named Sculptures and Daily Objects. Because our task differs from that of HHOR (which reconstructs the hand and object in a tightly held pose), we use only the ‘Giuliano’ sequence from the Sculptures subset for evaluation.

We compare Color-NeuS with the original NeuS [23] on the BlendedMVS [32] and DTU [13] datasets. As shown in Fig. 6, Color-NeuS can extract surface color while maintaining the ability to learn geometry. In Fig. 9, we evaluate Color-NeuS on the OmniObject3D [29] dataset. Beyond learning geometry, thanks to the design of the relighting network, Color-NeuS also retains the ability to perform novel view synthesis. See our supplementary materials for more results.

Our approach was also contrasted with that of HHOR [12] on the HOD [12] dataset. HHOR uses the intermediate solution for color extraction. As evidenced in Fig. 10 and Tab. 2, Color-NeuS demonstrates the ability to reconstruct the surface color with fewer impurities and enhanced geometrical accuracy.

Figure 10: Results on Giuliano sequence in HOD dataset.

5.4 Quantitative Results

In this section, we employ the Chamfer Distance on OmniObject3D [29], Giuliano (a sequence of HOD [12]), and IHO-Video as a quantitative measure to evaluate the efficacy of our method. It is pertinent to mention that the geometric aspect of our approach is founded on NeuS [23]. We will demonstrate that the geometric learning capability of our method is at least on par with NeuS. In other words, our modifications will not degrade the performance inherent to NeuS. From a theoretical standpoint, the application of our technique to enhancements of NeuS, such as PET-NeuS [25], should yield superior geometric outcomes.

| doll ID | 002 | 008 | 049 | 074 | 085 |
|---|---|---|---|---|---|
| NeuS | 0.35 | 2.31 | 0.55 | 0.42 | 0.97 |
| Ours | 0.27 | 1.91 | 0.53 | 0.40 | 0.74 |

Table 1: Chamfer Distance on OmniObject3D.

| Giuliano | HHOR | Ours |
|---|---|---|
| Chamfer Distance | 8.25 | 5.89 |

Table 2: Chamfer Distance on Giuliano sequence.

| IHO Video | COLMAP | Naive | NeuS | Ours |
|---|---|---|---|---|
| ghost bear | 7.84 | 197.65 | 1.18 | 1.32 |
| game box | 3.54 | 12.57 | 0.78 | 0.55 |
| pink peach | 12.69 | 450.20 | 19.39 | 8.13 |
| drink | 6.68 | 60.52 | 11.29 | 8.60 |

Table 3: Chamfer Distance on IHO Video.

We first normalize each mesh to unit size ($1\,\mathrm{m}$), then use the point-to-point iterative closest point (ICP) algorithm to align the reconstructed mesh with the ground truth. In all tables, the Chamfer Distance (CD) is defined on squared distances and reported in units of $\mathrm{cm}^2$. Results are shown in Tabs. 1, 2 and 3.
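The metric described here can be sketched as follows. This is a minimal NumPy illustration, not the authors' evaluation code. It assumes both meshes have already been sampled into point sets and aligned with ICP, that "unit size" means scaling the longest bounding-box side to 1 m, and that the two directional terms are summed; the text does not specify these choices. The function names are hypothetical.

```python
import numpy as np


def normalize_to_unit(points: np.ndarray) -> np.ndarray:
    """Center a point set and scale its longest bounding-box side to 1 (metre)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    return (points - center) / scale


def chamfer_distance_sq_cm2(a_m: np.ndarray, b_m: np.ndarray) -> float:
    """Symmetric Chamfer Distance on squared distances, in cm^2.

    a_m, b_m: (N, 3) and (M, 3) point sets in metres, assumed already aligned.
    """
    a, b = a_m * 100.0, b_m * 100.0  # metres -> centimetres
    # Pairwise squared distances, shape (N, M); brute force, fine for small sets.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Assumed convention: sum of the two directional mean nearest-neighbour terms.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

For larger point sets, the brute-force pairwise matrix would normally be replaced by a nearest-neighbour search structure; the definition of the metric stays the same.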

5.5 Ablation Study

In the relighting network, the gradient of the Signed Distance Function (SDF), denoted as $g(\bm{p})$, is incorporated as input. This decision’s efficacy is demonstrated in an ablation study, the results of which can be viewed in Fig. 11. We believe that this gradient input offers accurate geometric information to the network, consequently leading to a faithful color distribution.
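As a rough illustration of the quantity fed to the relighting network: for a true signed distance function, the gradient at a surface point is the unit surface normal there. The sketch below uses a hypothetical analytic sphere SDF and central finite differences. In a NeuS-style model the gradient would typically come from automatic differentiation of the SDF network, so this is only a stand-in under that assumption.

```python
import numpy as np


def sphere_sdf(p: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Signed distance to an origin-centred sphere (hypothetical stand-in for the SDF network)."""
    return np.linalg.norm(p, axis=-1) - radius


def sdf_gradient(sdf, p: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Central finite-difference estimate of the SDF gradient g(p) for points p of shape (N, 3)."""
    grads = np.zeros_like(p, dtype=float)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        grads[:, axis] = (sdf(p + offset) - sdf(p - offset)) / (2.0 * eps)
    return grads


p = np.array([[0.0, 0.0, 1.0], [2.0, 0.0, 0.0]])
g = sdf_gradient(sphere_sdf, p)
print(g)  # approximately the outward radial directions at each point
```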

Figure 11: Ablation study.

In another ablation study, we investigated the significance of using the inverse sigmoid function. In Eq. 13, we initially apply the inverse sigmoid function, denoted by $\Psi^{-1}$, to the RGB value. Subsequently, we introduce the relight color, followed by a sigmoid function, $\Psi$. An alternative approach would be to directly restrict the relighted color to the interval [0, 1]:

$$c(\bm{p},\bm{d}) = \mathrm{clamp}\big(c_{g}(\bm{p}) + \Psi(c_{r}(\bm{p},\bm{d})) - 0.5,\ 0,\ 1\big). \tag{20}$$

The outcomes are displayed in Fig. 11. The findings suggest that refraining from employing the inverse sigmoid function tends to result in marginally darker color outputs.
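To make the two variants concrete, the sketch below evaluates both on toy values. It assumes, based on the description of Eq. 13 in this section, that the inverse-sigmoid variant has the form $c = \Psi(\Psi^{-1}(c_g) + c_r)$; the clamp variant follows Eq. 20. The small epsilon that keeps the inverse sigmoid finite is an assumption, not something taken from the paper, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def inverse_sigmoid(y, eps=1e-6):
    """Logit function; eps (an assumption) keeps the log finite at 0 and 1."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))


def relight_inverse_sigmoid(c_g, c_r):
    """Assumed form of Eq. 13: add the relight term in logit space, then squash."""
    return sigmoid(inverse_sigmoid(c_g) + c_r)


def relight_clamp(c_g, c_r):
    """Eq. 20: add a zero-centred sigmoid offset in RGB space, then clamp to [0, 1]."""
    return np.clip(c_g + sigmoid(c_r) - 0.5, 0.0, 1.0)


c_g = np.array([0.9, 0.5, 0.1])   # toy global (view-independent) color
c_r = np.array([2.0, 0.0, -2.0])  # toy relight network output
print(relight_inverse_sigmoid(c_g, c_r))
print(relight_clamp(c_g, c_r))
```

With a zero relight output, both variants return the global color unchanged. The logit-space form always stays strictly inside (0, 1), whereas the clamp form saturates at the interval boundary and has zero gradient once the sum leaves [0, 1].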

6 Conclusion

In this paper, we introduce Color-NeuS, a novel approach to 3D implicit textured surface reconstruction which is compatible with any NeuS-like model. By isolating the view-dependent element from the neural radiance field and employing a relighting network to sustain volume rendering, Color-NeuS is able to effectively retrieve surface color while accurately reconstructing the surface detail. We put Color-NeuS to the test on a demanding in-hand object scanning task using our personally collected sequences as well as several public datasets. The results demonstrate Color-NeuS’s capability to reconstruct neural implicit surfaces with accurate color representation.

Acknowledgments. This work was supported by the National Key R&D Program of China (No. 2021ZD0110704), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Qi Zhi Institute, and Shanghai Science and Technology Commission (21511101200).

References

  • Barron et al. [2021] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Bian et al. [2023] Wenjing Bian, Zirui Wang, Kejie Li, Jiawang Bian, and Victor Adrian Prisacariu. NoPe-NeRF: Optimising neural radiance field with no pose prior. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Boss et al. [2021] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P.A. Lensch. NeRD: Neural reflectance decomposition from image collections. In International Conference on Computer Vision (ICCV), 2021.
  • Chen et al. [2021] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In International Conference on Computer Vision (ICCV), pages 14124–14133, 2021.
  • Cheng and Schwing [2022] Ho Kei Cheng and Alexander G. Schwing. XMem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European Conference on Computer Vision (ECCV), 2022.
  • Community [2018] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  • Esposito et al. [2022] Stefano Esposito, Daniele Baieri, Stefan Zellmann, André Hinkenjann, and Emanuele Rodolà. Kiloneus: Implicit neural representations with real-time global illumination. arXiv preprint arXiv:2206.10885, 2022.
  • Gropp et al. [2020] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In International Conference on Machine Learning (ICML), 2020.
  • Hampali et al. [2023a] Shreyas Hampali, Tomas Hodan, Luan Tran, Lingni Ma, Cem Keskin, and Vincent Lepetit. In-hand 3d object scanning from an rgb sequence. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
  • Hampali et al. [2023b] Shreyas Hampali, Tomas Hodan, Luan Tran, Lingni Ma, Cem Keskin, and Vincent Lepetit. In-hand 3d object scanning from an rgb sequence. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
  • Hasson et al. [2019] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Huang et al. [2022] Di Huang, Xiaopeng Ji, Xingyi He, Jiaming Sun, Tong He, Qing Shuai, Wanli Ouyang, and Xiaowei Zhou. Reconstructing hand-held objects from monocular video. In SIGGRAPH Asia Conference Proceedings, 2022.
  • Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 406–413, 2014.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  • Lin et al. [2021] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In International Conference on Computer Vision (ICCV), 2021.
  • Liu et al. [2022] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Meng et al. [2021] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. GNeRF: GAN-based Neural Radiance Field without Posed Camera. In International Conference on Computer Vision (ICCV), 2021.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
  • Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In International Conference on Computer Vision (ICCV), 2021.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  • Srinivasan et al. [2021] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Wang et al. [2021a] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Conference on Neural Information Processing Systems (NeurIPS), 2021a.
  • Wang et al. [2021b] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021b.
  • Wang et al. [2023] Yiqun Wang, Ivan Skorokhodov, and Peter Wonka. PET-NeuS: Positional encoding triplanes for neural surfaces. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Wang et al. [2021c] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021c.
  • Wen et al. [2023] Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-dof tracking and 3d reconstruction of unknown objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Wizadwongsa et al. [2021] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. NeX: Real-time view synthesis with neural basis expansion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Wu et al. [2023] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. OmniObject3D: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-NeRF: Point-based neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Yang et al. [2023] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track Anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
  • Yao et al. [2020] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Yariv et al. [2020] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
  • Yariv et al. [2021] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Ye et al. [2022] Yufei Ye, Abhinav Gupta, and Shubham Tulsiani. What’s in your hands? 3d reconstruction of generic objects in hands. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Zhang et al. [2021a] Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. PhySG: Inverse rendering with spherical gaussians for physics-based material editing and relighting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021a.
  • Zhang et al. [2021b] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, and Jonathan T. Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Trans. Graph., 2021b.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Supplementary Material

7 Additional details

Figure 12: Relight network architecture.

In NeRF [18], positional encoding, denoted by $\gamma(\cdot)$, is utilized to allow the network to capture high-frequency details. In Color-NeuS, positional encoding is applied to the spatial location $\bm{p}$ with 6 frequencies within the SDF network, and to the view direction $\bm{d}$ with 4 frequencies in the relight network. We adopt a similar architecture for the SDF network and global color network as found in NeuS [23]. The architecture of the relight network, comprising 4 hidden layers each with a hidden size of 256, is depicted in Fig. 12. The global color information is fed into the final layer.
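For reference, a NeRF-style positional encoding can be sketched as below. This is a minimal illustration under stated assumptions: frequencies $2^k$ for $k = 0, \dots, L-1$, no factor of $\pi$, and no concatenation of the raw input. Implementations differ on the last two points, and the text here fixes only the number of frequencies (6 for $\bm{p}$, 4 for $\bm{d}$).

```python
import numpy as np


def positional_encoding(x: np.ndarray, num_freqs: int) -> np.ndarray:
    """Map (N, D) inputs to (N, 2 * num_freqs * D) sin/cos features.

    Assumed convention: frequencies 2**k for k = 0..num_freqs-1, no factor of pi,
    raw input not concatenated.
    """
    freqs = 2.0 ** np.arange(num_freqs)             # (L,)
    scaled = x[:, None, :] * freqs[None, :, None]   # (N, L, D)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, L, 2D)
    return feats.reshape(x.shape[0], -1)


p = np.zeros((5, 3))  # spatial locations, 6 frequencies (SDF network)
d = np.zeros((5, 3))  # view directions, 4 frequencies (relight network)
print(positional_encoding(p, 6).shape)  # (5, 36)
print(positional_encoding(d, 4).shape)  # (5, 24)
```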

8 More Results

In this section, we present additional results on public datasets. Specifically, the quantitative results related to OmniObject3D are shown in Tab. 4, while the qualitative outcomes for BlendedMVS and DTU are depicted in Fig. 13 and Fig. 14, respectively.

| toy_animals ID | 001 | 005 | 016 | 019 | 059 |
|---|---|---|---|---|---|
| NeuS | 0.53 | 12.05 | 4.36 | 3.43 | 1.09 |
| Ours | 0.91 | 8.66 | 2.47 | 1.06 | 1.18 |

Table 4: More results of Chamfer Distance on OmniObject3D.
Figure 13: More results on the BlendedMVS dataset.
Figure 14: More results on the DTU dataset.

9 Rendering Quality

To demonstrate that our method retains the capability of volume rendering and can perform new view synthesis, as described in Sec. 5.3, we conducted tests on the OmniObject3D dataset [29]. The results are presented in Tab. 5 and Fig. 15. For each sequence, we allocated 90% of the images for training and the remaining 10% for testing. The training configurations were identical to those used in the main paper, and we ensured that our method, NeuS [23], and NeRF [18] were trained under the same settings. The results reveal that our method can achieve rendering performance comparable to that of NeRF [18] and NeuS [23].
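PSNR values such as those in Tab. 5 are conventionally computed from the mean squared error between a rendered image and its held-out counterpart. The sketch below is a minimal illustration, assuming images are float arrays in [0, 1] (peak value 1.0). The hold-out rule shown (every 10th image) is a hypothetical choice, since the text states only the 90%/10% ratio.

```python
import numpy as np


def psnr(rendered: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((rendered - target) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))


def split_train_test(num_images: int, test_every: int = 10):
    """Hypothetical split: hold out every `test_every`-th image (10% for test_every=10)."""
    idx = np.arange(num_images)
    test = idx[::test_every]
    train = np.setdiff1d(idx, test)
    return train, test


target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)   # uniform error of 0.1 gives an MSE of 0.01
print(psnr(rendered, target))        # about 20 dB
train_idx, test_idx = split_train_test(100)
print(len(train_idx), len(test_idx))
```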

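| Metric | Method | doll 002 | doll 008 | doll 037 | doll 049 | doll 062 | doll 074 | doll 085 | toy_animals 001 | toy_animals 005 | toy_animals 016 | toy_animals 019 | toy_animals 059 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSNR | NeRF | 37.45 | 36.27 | 36.71 | 36.77 | 37.79 | 37.13 | 37.21 | 37.76 | 39.64 | 36.70 | 40.35 | 35.84 | 37.47 |
| PSNR | NeuS | 37.87 | 35.95 | 37.10 | 37.74 | 37.65 | 37.27 | 37.76 | 37.88 | 39.34 | 35.96 | 40.27 | 36.07 | 37.57 |
| PSNR | Ours | 37.47 | 35.20 | 36.67 | 37.43 | 37.43 | 37.23 | 37.28 | 37.82 | 39.01 | 35.69 | 39.55 | 35.66 | 37.20 |
| SSIM | NeRF | 0.983 | 0.975 | 0.987 | 0.979 | 0.979 | 0.987 | 0.985 | 0.985 | 0.987 | 0.980 | 0.989 | 0.982 | 0.983 |
| SSIM | NeuS | 0.986 | 0.975 | 0.989 | 0.983 | 0.979 | 0.989 | 0.988 | 0.986 | 0.987 | 0.978 | 0.990 | 0.984 | 0.984 |
| SSIM | Ours | 0.985 | 0.973 | 0.988 | 0.983 | 0.979 | 0.989 | 0.987 | 0.985 | 0.986 | 0.977 | 0.989 | 0.984 | 0.984 |

Table 5: Quantitative results for new view synthesis on the OmniObject3D dataset.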

Figure 15: New view synthesis (2D render) results on the OmniObject3D dataset.