Color-NeuS: Reconstructing Neural Implicit Surfaces with Color

Licheng Zhong$^{1\star}$ Lixin Yang$^{1,2\star}$ Kailin Li$^{1}$ Haoyu Zhen$^{1}$ Mei Han$^{3}$ Cewu Lu$^{1,2\dagger}$

$^{1}$ Shanghai Jiao Tong University $^{2}$ Shanghai Qi Zhi Institute $^{3}$ National University of Singapore

{zlicheng, siriusyang, kailinli, anye_zhen, lucewu}@sjtu.edu.cn {hanmei}@u.nus.edu

Abstract

The reconstruction of object surfaces from multi-view images or monocular video is a fundamental problem in computer vision. However, much of the recent research concentrates on reconstructing geometry through implicit or explicit methods. In this paper, we shift our focus towards reconstructing a mesh in conjunction with color. We remove the view-dependent color from neural volume rendering while retaining volume rendering performance through a relighting network. The mesh is extracted from the signed distance function (SDF) network for the surface, and the color of each surface vertex is drawn from the global color network. To evaluate our approach, we conceived an in-hand object scanning task featuring numerous occlusions and dramatic shifts in lighting conditions. We gathered several videos for this task, and our results surpass those of existing methods capable of reconstructing a mesh alongside color. Additionally, we assessed our method on public datasets, including DTU, BlendedMVS, and OmniObject3D. The results indicate that our method performs well across all these datasets. Project page: colmar-zlicheng.github.io/color_neus.

$\star$ These authors contributed equally. $\dagger$ Cewu Lu is the corresponding author. He is a member of Qing Yuan Research Institute and MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, and Shanghai Qi Zhi Institute, China.

1 Introduction

Reconstructing 3D objects from 2D images is a pivotal and ongoing challenge in computer vision and graphics. Previously, the structure-from-motion (SFM) method [20, 21] was widely used to reconstruct 3D objects from 2D images. However, it often struggled with scenes lacking texture or those with repetitive patterns, leading to ambiguous correspondences and incorrect depth estimations. Another limitation of SFM was its inability to effectively handle occlusions: when an object in the scene was partially obscured, reconstruction errors often followed. A further limitation of SFM lay in its dependency on a point cloud representation, which falls short of a fully dense reconstruction. Recently, the field has been evolving, with a burgeoning interest in implicit neural surfaces learned via volume rendering [23], which can represent surfaces at pixel-level detail and build on the neural radiance field (NeRF) [18].

Figure 1: Color-NeuS decouples the view-dependent component from the color network and employs a relighting network to account for the view-dependent aspect. This allows Color-NeuS to reconstruct a 3D implicit surface with color, while maintaining its ability for 2D volume rendering.

Pioneering works like NeRF [18] and its successors [36, 4, 1, 16, 30] have convincingly demonstrated the power of neural networks in representing continuous 3D scenes. This is achieved by learning a mapping from 3D coordinates to volume density and view-dependent color, leading to highly effective novel view synthesis. Subsequently, NeuS [23] extended this concept, leveraging signed distance functions (SDF) to reconstruct a neural implicit surface. However, NeuS's scope remains limited to the reconstruction of a mesh without vertex color. This limitation arises because the color of each point in NeuS's neural volume rendering is determined by both its position and the viewing ray direction. In mesh reconstruction, where the view direction is absent, NeuS therefore cannot assign a specific color value to each point. In addition, most NeRF relighting works focus on 2D image rendering rather than textured surface reconstruction. To address these limitations, in this paper, we present a novel method for the reconstruction of neural implicit surfaces incorporating view-independent color.

We propose Color-NeuS, a NeuS-compatible approach that facilitates the reconstruction of a 3D surface and global vertex colors, independent of the viewpoint. Simultaneously, it upholds NeuS's robust functionalities in rendering 2D images and executing 3D surface reconstruction. To this end, we replace the single view-dependent color component in the neural volume rendering with the combination of a view-independent global color variable and a view-dependent relighting effect, as shown in Fig. 1. This not only enables our model to be trained for standard volume rendering but also paves the way for the learning of global vertex colors. Importantly, Color-NeuS naturally handles substantial reflection on the object's surface, and can deal with intermittent occlusion during dynamic interaction between objects (see Fig. 7).

The effectiveness of Color-NeuS has been extensively validated on various datasets, including DTU [13], BlendedMVS [32], and OmniObject3D [29]. To demonstrate the superiority of our method, we conducted comparative evaluations with laser scanning and well-established methods such as COLMAP [20, 21] and HHOR [12]. The results reveal that Color-NeuS successfully reconstructs the object's surface while extracting a reasonable texture. To underscore its practical application, we apply Color-NeuS to a challenging real-world task: hand-held object scanning [12, 10]. As part of this process, we collect a dataset for validation, which includes 3D scans of objects, videos of moving objects held in hand, and object-centric camera transformations.

Our contributions can be summarized as follows:

• We propose a novel method for the reconstruction of a neural implicit surface with color that can be applied to any NeuS-like model.
• We decouple the view-dependent process in neural volume rendering, enabling the handling of occlusion and reflection while obtaining global vertex colors.
• We devise a challenging task of in-hand object scanning for the reconstruction of mesh with color and compile a real-world dataset for this task.

2 Related Work

Neural Radiance Field. Following the pioneering work of NeRF [18], a variety of studies have emerged in the realm of neural implicit fields. Notable among them, NeRF $--$ [26], NoPe-NeRF [2], BARF [15], and GNeRF [17] investigate methods to estimate camera pose while training the implicit field. Concurrently, efforts such as Pixel-NeRF [36], MVSNeRF [4], and IBRNet [24] focus on the generalization of neural radiance fields. In parallel, research projects like KiloNeuS [7], NeX [28], NeRV [22], NeRD [3], PhySG [37], and NeRFactor [38] focus on illumination and relighting in neural fields. Our method is compatible with neural volume rendering and retains the capacity for novel view synthesis.

Surface Reconstruction. Implicit Differentiable Renderer (IDR) [33] represents geometry as the zero level set of a Signed Distance Function (SDF) by leveraging implicit geometric regularization (IGR) [8]. NeuS [23] amalgamates the SDF field and volume rendering to reconstruct surfaces. Both VolSDF [34] and UNISURF [19] combine implicit surface representation with volume rendering: VolSDF [34] disentangles appearance from geometry, while UNISURF [19] formulates implicit surface models and radiance fields in a cohesive manner. PET-NeuS [25], an extension of NeuS [23], introduces new components such as a unique positional encoding type, tri-plane representation, and learnable convolution operations. However, these methods do not factor in the global view-independent color of the surface. Our method focuses on the extraction of surface color from a global color network while maintaining geometry learning and image rendering through a relighting network.

In-hand Object Scanning. ObMan [11] presents an end-to-end learnable model that employs a unique contact loss, which encourages physically plausible hand-object configurations. For the in-hand object scanning task, IHOI [35] reconstructs from a single RGB image, taking advantage of the estimated hand pose. BundleSDF [27], on the other hand, estimates the object pose from sequential RGB-D input images while reconstructing the implicit surface represented by the Signed Distance Function (SDF). HHOR [12] also utilizes an SDF to represent the object surface, but it reconstructs the object in tandem with the hand, where the object is firmly held. A recent work [9] reconstructs the object from an RGB sequence and estimates the camera pose simultaneously by using an occupancy field to represent the surface. However, it assumes that the light source is distant and that the direction of light remains unchanged. In contrast, our method can handle arbitrary lighting conditions when modeling the object appearance.

3 Preliminary

We first introduce Neural Implicit Surface (NeuS) [23], which is the basis of our method. Given a camera with known intrinsic parameters, we can represent a ray in the camera coordinate system as

\bm{p}(z)=\bm{o}+z\bm{d},

(1)

where $\bm{o}$ and $\bm{d}$ are the origin and the direction of the ray, respectively, and $z$ is the distance from the origin to the point on the ray. Then, an MLP network $\mathcal{G}$ is used to encode $\bm{p}$ to its signed distance function (SDF) value $s(\bm{p})$ and feature vector $f(\bm{p})$, as

[s(\bm{p}),f(\bm{p})]=\mathcal{G}(\bm{p}).

(2)

With the position $\bm{p}$, direction $\bm{d}$, feature vector $f(\bm{p})$ and gradient $g(\bm{p})$ as input, another MLP $\mathcal{M}$ outputs the color of the query point, as

c(\bm{p},\bm{d})=\mathcal{M}(\bm{p},\bm{d},f(\bm{p}),g(\bm{p})),

(3)

where $g(\bm{p})=\nabla s(\bm{p})$ is the gradient of the SDF at point $\bm{p}$. Finally, the color of the query pixel $C$ is obtained by integrating the color along the ray,

C=\int_{z_{n}}^{z_{f}}w(z)c(\bm{p},\bm{d})dz,

(4)

w(z)=\exp\Big(-\int_{z_{n}}^{z}\sigma(t)\,dt\Big)\,\sigma(z),

(5)

\sigma(\bm{p})=\frac{\alpha e^{-\alpha s(\bm{p})}}{(1+e^{-\alpha s(\bm{p})})^{2}},

(6)

where $\sigma(\bm{p})$ designates the density at $\bm{p}$, with $\alpha\in\mathbb{R}^{1}$ acting as a learnable parameter. The term $w(z)$ is the weight assigned to the color of the point at distance $z$ along the ray. Additionally, $z_{n}$ and $z_{f}$ represent the near and far plane of the camera, respectively.
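As a numerical illustration of Eqs. 4 to 6, the following sketch evaluates the density and the rendering weights along a single ray. It is only a toy under stated assumptions: the SDF network $\mathcal{G}$ is replaced by the analytic SDF of a unit sphere, the color is constant, the integrals are approximated with the trapezoidal rule, and `alpha` is fixed rather than learned. The helper names are hypothetical, and the sketch follows the equations as written here rather than the discrete opacity formulation used in the released NeuS code.

```python
import numpy as np

def logistic_density(s, alpha):
    """Eq. 6: density as a function of the signed distance s."""
    # The expression is even in s, so using |s| avoids overflow for s < 0.
    e = np.exp(-alpha * np.abs(s))
    return alpha * e / (1.0 + e) ** 2

def trapezoid(y, z):
    """Trapezoidal rule for samples y (first axis) taken at positions z."""
    dz = np.diff(z).reshape((-1,) + (1,) * (y.ndim - 1))
    return np.sum(0.5 * (y[1:] + y[:-1]) * dz, axis=0)

def render_weights(z, sdf_values, alpha):
    """Eq. 5: w(z) = exp(-integral of sigma from z_n to z) * sigma(z)."""
    sigma = logistic_density(sdf_values, alpha)
    steps = 0.5 * (sigma[1:] + sigma[:-1]) * np.diff(z)
    accumulated = np.concatenate([[0.0], np.cumsum(steps)])
    return np.exp(-accumulated) * sigma

# Toy scene (assumption): unit sphere at the origin, ray starting at z = -3.
o = np.array([0.0, 0.0, -3.0])
d = np.array([0.0, 0.0, 1.0])
z = np.linspace(0.5, 5.5, 2001)          # z_n = 0.5, z_f = 5.5
p = o + z[:, None] * d                   # Eq. 1
sdf = np.linalg.norm(p, axis=1) - 1.0    # analytic SDF instead of the MLP G
w = render_weights(z, sdf, alpha=64.0)

point_color = np.tile([0.8, 0.2, 0.1], (z.size, 1))  # constant c(p, d)
C = trapezoid(w[:, None] * point_color, z)           # Eq. 4
z_peak = z[np.argmax(w)]                             # location of the largest weight
```

In this toy setup the ray enters the sphere at $z=2$, and the largest weight falls close to that crossing, which is the behavior the surface extraction relies on.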

Because the color $c(\bm{p},\bm{d})$ integrated in Eq. 4 takes the view direction $\bm{d}$ as input (Eq. 3), the output per-vertex color is view-dependent. Consequently, the original NeuS avoids incorporating color, focusing solely on the reconstruction of surface shape.

4 Method

Our goal is to extract both the color and the geometry of objects,

c_{g}(\bm{x}),\bm{x}\in\mathcal{S},

(7)

where $\mathcal{S}=\left\{\bm{x}\in\mathbb{R}^{3}|s(\bm{x})=0\right\}$ represents the object surface (a collection of mesh vertices) that is characterized by a set of points with zero-level signed distance value.

Property 1.

([23, Sec. 3.1]) NeuS possesses an advantageous property wherein the standard deviation of the density $\sigma(\bm{p})$ is dictated by the trainable parameter $1/\alpha$, which progressively approaches zero as the network training reaches convergence. $\blacksquare$

As implied by Property 1, the density $\sigma(\bm{p})$ of the neural radiance field substantially condenses on the surface $\mathcal{S}$. This provides a focal point for extracting the surface (per-vertex) color of the object once the network training reaches convergence.
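Property 1 can be made concrete by reading Eq. 6 as a function of the signed distance $s=s(\bm{p})$. The relations below are standard facts about the logistic distribution rather than statements taken from the text: the density is the derivative of a sigmoid with scale $1/\alpha$, it integrates to one, and its standard deviation equals $1/\alpha$ up to the constant factor $\pi/\sqrt{3}$.

```latex
\sigma(s)=\frac{\alpha e^{-\alpha s}}{(1+e^{-\alpha s})^{2}}
         =\frac{d}{ds}\,\frac{1}{1+e^{-\alpha s}},
\qquad
\int_{-\infty}^{\infty}\sigma(s)\,ds=1,
\qquad
\operatorname{Std}[s]=\frac{\pi}{\sqrt{3}\,\alpha}.
```

As $1/\alpha\to 0$, the standard deviation vanishes and the density concentrates on the zero level set $s=0$, which is the surface $\mathcal{S}$.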

4.1 Naive Solution

One solution that might seem intuitive is to simply expunge the view-dependent term in Eq. 3, as:

c(\bm{p})=\mathcal{M}(\bm{p},f(\bm{p}),g(\bm{p})).

(8)

Regrettably, this procedure can impair the learned geometry and appearance within the neural radiance field. Without a view-dependent term, the neural radiance field cannot accurately express how the light at a point varies across directions. Furthermore, such an approach may also cause the SDF field to break down, fragmenting the surface. The qualitative results of the naive solution are displayed in Fig. 6.

4.2 Intermediate Solution

The intermediate (interm.) solution is to use a constant direction as an input to the color network in order to extract color information from the surface. For instance, in HHOR [12], the vertex color is defined based on the surface's normal direction. For any $\bm{x}\in\mathcal{S}$, the normal direction is given by $-g(\bm{x})$. Consequently, the vertex color can be extracted using

c(\bm{x})=\mathcal{M}(\bm{x},-g(\bm{x}),f(\bm{x}),g(\bm{x})).

(9)

Yet, given the considerable variation in an object’s surface color under changing lighting conditions, it may be insufficient to assign a singular ‘direction color’ as a representation of the object’s true color. The inherent complexity associated with light-surface interactions calls for a more nuanced approach to color extraction.

4.3 Color-NeuS

In our method Color-NeuS, we propose a workflow to disassociate the global color from the view-dependent formulation, all the while preserving the appropriate geometry and obtaining a reasonable view-independent color for each vertex. Specifically, we substitute the learning of a single view-dependent color component (as indicated in Eq. 3) with the learning of a view-independent global color variable (Sec. 4.3.1) coupled with a view-dependent relighting effect (Sec. 4.3.2).

4.3.1 Removing View-Dependence

Our proposed solution begins with the removal of the view-dependent term from Eq. 3, which transitions the model to a global color network, as:

c_{g}(\bm{p})=\mathcal{M}_{g}(\bm{p},f(\bm{p}),g(\bm{p})).

(10)

This formulation matches the naive solution presented in Eq. 8 and is an ideal starting point for extracting the view-independent vertex color. However, optimizing the network to convergence with this equation alone (ignoring the view-dependence) will violate Property 1. Consequently, this violation can prevent the density $\sigma(\bm{p})$ from condensing onto the object surface, inadvertently affecting the learned geometry.

Therefore, our proposed solution (Color-NeuS) is to separate the inference procedure from optimization (training). During optimization, we re-integrate the global color with a residual term that incorporates the view direction. The outcome of this integration meets the view requirement set forth in the volume rendering formulation Eq. 4, and preserves Property 1 throughout the optimization phase.

During inference, once the density $\sigma(\bm{p})$ has condensed onto the object surface, the shape of the object ($s(\bm{p})=0$) can be guaranteed. Additionally, $c_{g}(\bm{p})$ acquires the capability to express the vertex color. Consequently, we can extract the vertex color on the surface (where $\bm{x}\in\mathcal{S}$, that is, $s(\bm{x})=0$) during inference using:

c_{g}(\bm{x})=\mathcal{M}_{g}(\bm{x},f(\bm{x}),g(\bm{x})).

(11)

4.3.2 Coupling Relighting Effect

To maintain the model's performance after removing view-dependence in $c_{g}(\bm{x})$, we introduce a relighting network to compensate for the discarded view-dependent term. The relighting network takes the view-independent color, position, direction, and SDF gradient as input and generates a small view-dependent color adjustment for each point. This can be expressed as:

c_{r}(\bm{p},\bm{d})=\mathcal{R}_{g}(c_{g}(\bm{p}),\bm{p},\bm{d},g(\bm{p})).

(12)

Subsequently, the ultimate color of each point is computed by integrating the relighting effect with the global color, as:

c(\bm{p},\bm{d})=\Psi(\Psi^{-1}(c_{g}(\bm{p}))+c_{r}(\bm{p},\bm{d})),

(13)

where $\Psi$ and $\Psi^{-1}$ represent the sigmoid and inverse sigmoid function, respectively. At this point, the color is a fusion of a global color (view-independent) and a relighting effect (view-dependent). The color $c(\bm{p},\bm{d})$ is then incorporated in Eq. 4 to compute the volume integral $C$.
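The composition in Eq. 13 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the small `eps` clip guarding the logarithm is an added assumption, and the two networks are replaced by fixed example values.

```python
import numpy as np

def sigmoid(x):
    """Psi in Eq. 13."""
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(c, eps=1e-6):
    """Psi^{-1} in Eq. 13; eps (an assumption) keeps the logarithm finite."""
    c = np.clip(c, eps, 1.0 - eps)
    return np.log(c) - np.log1p(-c)

def compose_color(c_global, c_relight):
    """Eq. 13: add the relighting residual in logit space, then squash to (0, 1)."""
    return sigmoid(inverse_sigmoid(c_global) + c_relight)

c_g = np.array([0.8, 0.4, 0.1])    # stand-in for the global color network output
c_r = np.array([0.5, -0.3, 0.0])   # stand-in for the relighting network output

c_train = compose_color(c_g, c_r)            # view-dependent color fed into Eq. 4
c_vertex = compose_color(c_g, np.zeros(3))   # zero residual recovers the global color
```

Because the residual is added before the sigmoid, the composed color always stays inside $(0,1)$, and a zero residual leaves the global color unchanged.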

4.4 Optimization

The rendered color $C$ along the sampled rays can be computed using Eqs. 4, 5 and 6. Letting $\widehat{C}$ denote the ground truth color, we define the color loss as:

\mathcal{L}_{c}=\frac{1}{N_{r}}\sum_{i=1}^{N_{r}}\|\widehat{C}_{i}-C_{i}\|_{2}^{2},

(14)

where $N_{r}$ denotes the number of sampled rays.

To ensure that the optimized neural representation is a valid SDF, we impose the eikonal regularization [8] on the SDF prediction, as:

\mathcal{L}_{e}=\frac{1}{N_{r}N_{p}}\sum_{i,j}^{N_{r},N_{p}}(\left\|\nabla s(\bm{p}_{i,j})\right\|_{2}-1)^{2},

(15)

where $N_{p}$ is the number of sampling points on each ray.

In a bid to draw the global color closer to the actual color, we impose a constraint on the mean value of the relight color $c_{r}$ to be zero, as:

\mathcal{L}_{r}=\frac{1}{N_{r}N_{p}}\sum_{i,j=1}^{N_{r},N_{p}}{c_{r}}_{i,j}(\bm{p},\bm{d}).

(16)

This strategy prompts the global color network to learn colors under average lighting conditions. In other words, it minimizes the impact of the relighting network on the global color. This loss term is necessary because we cannot directly supervise the global color, for the reasons given for the naive solution (Sec. 4.1).

In the context of object reconstruction, an intuitive method involves using object (foreground) segmentation to eliminate background elements. In scenarios where foreground segmentation is available and free of object-scene occlusion, we recommend incorporating a mask loss $\mathcal{L}_{m}$:

\mathcal{L}_{m}=\frac{1}{N_{r}}\sum_{N_{r}}\mathrm{BCE}(M,\hat{O}),

(17)

where $\hat{O}=\sum_{j}w_{j}$ is the sum of the weights along a camera ray, and $M$ is a binary mask that signifies whether a ray falls within the object segmentation.

In summary, our training loss is calculated as per Eq. 18, where $\lambda_{c}$, $\lambda_{e}$, $\lambda_{r}$, and $\lambda_{m}$ are hyperparameters.

\mathcal{L}=\lambda_{c}\mathcal{L}_{c}+\lambda_{e}\mathcal{L}_{e}+\lambda_{r}\mathcal{L}_{r}+\lambda_{m}\mathcal{L}_{m}.

(18)
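The four loss terms of Eqs. 14 to 18 can be sketched with NumPy as follows. This is an illustrative sketch on toy arrays, not the training code. Two points are assumptions: Eq. 16 does not specify how the three color channels of $c_{r}$ are reduced to a scalar, so the sketch penalizes the absolute value of the overall mean, and the `eps` clip in the mask loss is added for numerical safety. The default weights are those reported in Sec. 5.1.

```python
import numpy as np

def color_loss(C_pred, C_gt):
    """Eq. 14: mean over rays of the squared L2 color error."""
    return np.mean(np.sum((C_gt - C_pred) ** 2, axis=-1))

def eikonal_loss(sdf_grad):
    """Eq. 15: sdf_grad has shape (N_r, N_p, 3)."""
    return np.mean((np.linalg.norm(sdf_grad, axis=-1) - 1.0) ** 2)

def relight_loss(c_r):
    """Eq. 16 averages c_r over rays and points. Averaging over the color
    channels too and taking the absolute value is an assumption."""
    return np.abs(np.mean(c_r))

def mask_loss(mask, opacity, eps=1e-6):
    """Eq. 17: binary cross-entropy between the mask M and accumulated weight O."""
    o = np.clip(opacity, eps, 1.0 - eps)
    return np.mean(-(mask * np.log(o) + (1.0 - mask) * np.log(1.0 - o)))

def total_loss(l_c, l_e, l_r, l_m, lam_c=1.0, lam_e=0.1, lam_r=1.0, lam_m=0.1):
    """Eq. 18 with the weights from Sec. 5.1 (lam_m = 0 when no usable mask exists)."""
    return lam_c * l_c + lam_e * l_e + lam_r * l_r + lam_m * l_m

# Toy inputs (all values are made up for illustration).
rng = np.random.default_rng(0)
N_r, N_p = 4, 8
C_gt = rng.uniform(size=(N_r, 3))
sdf_grad = np.zeros((N_r, N_p, 3))
sdf_grad[..., 2] = 1.0                      # unit-norm gradients: eikonal term is zero
c_r = rng.normal(size=(N_r, N_p, 3))
c_r -= c_r.mean()                           # zero-mean residual: relight term is zero
mask = np.array([1.0, 1.0, 0.0, 0.0])
opacity = np.array([0.9, 0.8, 0.1, 0.2])

loss = total_loss(color_loss(C_gt, C_gt), eikonal_loss(sdf_grad),
                  relight_loss(c_r), mask_loss(mask, opacity))
```

With a perfect color prediction, unit gradients, and a zero-mean residual, only the mask term contributes to `loss` in this toy example.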

In addition to optimizing the network parameters mentioned above, for our in-the-wild dataset we also optimize the camera pose, following NeRF $--$ [26]. As in GNeRF [17], we use a continuous 6D vector $\bm{r}\in\mathbb{R}^{6}$ to represent 3D rotations, which has been demonstrated to be more suitable for learning [39]. The joint optimization can be expressed as:

\Theta^{*},\Pi^{*}=\operatorname*{arg\,min}_{\Theta,\Pi}\mathcal{L}(\Theta,\Pi),

(19)

where $\Theta$ and $\Pi$ refer to the network parameters and camera pose, respectively.
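The continuous 6D rotation vector $\bm{r}$ mentioned above has to be mapped to a rotation matrix before it can transform rays. A minimal sketch of the Gram-Schmidt construction commonly used with this representation is given below. The function name is hypothetical, and placing the basis vectors as matrix columns is an assumed convention, since the text does not state one.

```python
import numpy as np

def rotation_from_6d(r):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    The first three entries give the first basis vector, the last three are
    orthogonalized against it, and the third vector is their cross product.
    """
    a1, a2 = r[:3], r[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                     # right-handed completion
    return np.stack([b1, b2, b3], axis=1)     # basis vectors as columns (assumption)

r = np.array([1.0, 2.0, 0.5, -0.3, 1.0, 2.0])   # arbitrary unconstrained parameters
R = rotation_from_6d(r)
R_identity = rotation_from_6d(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```

Any 6D vector whose two halves are not parallel yields a valid rotation, so the pose parameters can be updated by gradient descent without an explicit orthogonality constraint.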

5 Experiment

5.1 Implementation Details

We randomly sample 1024 rays per batch with 8 random images and train our model for 100k iterations on a single NVIDIA A10 GPU. Following NeuS, the learning rate is first linearly warmed up from $0$ to $5\times 10^{-4}$ in the first 5k iterations, and then controlled by a cosine decay schedule down to the minimum learning rate of $2.5\times 10^{-5}$. For all datasets, we set $\lambda_{c}$ to $1.0$, $\lambda_{e}$ to $0.1$, and $\lambda_{r}$ to $1.0$. For datasets where no object segmentation is available (OmniObject3D) or where object-scene occlusion occurs (IHO-Video), we set $\lambda_{m}$ to $0.0$. For the other datasets, we set $\lambda_{m}$ to $0.1$. If a segmentation mask is available, we apply a sampling strategy in which the proportion of rays falling inside the mask gradually increases from 50% to 80% during training.
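The learning-rate schedule described above can be written as a small function. This is a sketch that only reproduces the stated endpoints (linear warm-up to $5\times 10^{-4}$ over 5k iterations, cosine decay to $2.5\times 10^{-5}$ at 100k iterations); the function name and the exact interpolation formula are assumptions.

```python
import math

def learning_rate(step, total_steps=100_000, warmup_steps=5_000,
                  peak_lr=5e-4, min_lr=2.5e-5):
    """Linear warm-up from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rates at the start, end of warm-up, midpoint of the decay, and the end.
schedule = [learning_rate(s) for s in (0, 5_000, 52_500, 100_000)]
```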

5.2 Empirical Evaluation - Hand-held Object Scan

We initiate the evaluation by deploying Color-NeuS to a real-world challenging task: ‘hand-held object scanning’ (HOS). In HOS, to reveal the object appearance from all directions, the hand must continuously rotate and flip the object, thereby causing continual variation in occlusion and shadow effects. Our performance evaluation is conducted in comparison with Color-NeuS’s alternative solutions. To evaluate these solutions both quantitatively and qualitatively, we have curated an original validation set termed ‘In-Hand Object Video’ (IHO-Video).

IHO-Video Dataset. This dataset includes four video sequences, each documenting a hand-held object under a monocular recording setup (as illustrated in Fig. 3). The footage was captured using an iPhone at a resolution of 1920 $\times$ 1080. Notably, in these videos the objects and hands move together, resulting in significant dynamic occlusion and shadow interference. Furthermore, surface reflections that fluctuate as the objects move add another layer of complexity. To extract object masks for each frame, we employ Track-Anything [31], a method grounded in Segment-Anything [14] for segmentation and XMem [5] for instance tracking. To obtain camera poses, we use the off-the-shelf Structure-from-Motion toolkit COLMAP [20, 21]. Additionally, we used a professional hand-held structured-light 3D scanner (SHINING 3D EinScan Pro 2X 2020) to obtain the ground-truth 3D object mesh, as shown in Fig. 4.

HOS Task Evaluation. We compared the results of Color-NeuS with those derived from structured light scanning (see Fig. 4) and COLMAP's dense reconstruction. Additionally, our solution was compared with both the naive and intermediate solutions discussed in Secs. 4.1 and 4.2. The respective results are depicted in Fig. 6.

When scanning with structured light, the additional fill-in light results in color renderings that diverge from those observed in the video footage. The results of the Poisson reconstruction stemming from COLMAP are also unsatisfactory due to the imperfect camera poses and low-resolution point cloud representation. Moreover, the naive solution was unable to accurately learn the object's geometry. The interm. solution, despite successfully learning the correct geometry, had difficulty dealing with occlusion and reflection, resulting in artifacts such as dark spots on the object mesh. In contrast, Color-NeuS adeptly managed such disturbances. Fig. 7 showcases several examples of occlusion and reflection. In spite of these challenges, Color-NeuS consistently exhibits robust performance in modeling the shape and appearance of the object surfaces. Furthermore, its geometric reconstruction results align well with the ground truth (see Tab. 3).

Our method has demonstrated considerable enhancement in object geometry learning. As evident in Fig. 8, traditional methods such as structured-light scanning, COLMAP, and NeuS fall short in accurately learning the structure of the peach rod. In contrast, our proposed Color-NeuS exhibits commendable proficiency. This superior performance is attributed to Color-NeuS’s ability to discern view-independent color, preventing the misinterpretation of the pod as a black section on the surface of the peach body.

5.3 Evaluation on Public Datasets

OmniObject3D. [29] OmniObject3D is a comprehensive 3D object dataset characterized by an extensive vocabulary and a large number of high-quality, real-scanned 3D objects. It includes 6,000 scanned objects drawn from 190 everyday categories. Furthermore, it also provides object-centric, photorealistic multi-view images rendered using Blender [6], thus facilitating a wide range of experiments and analyses. Due to this dataset’s extensive size, in this paper, we show the effect on two subsets from the rendered images, namely, the ‘doll’ set and the ‘toy_animals’ set.

DTU. [13] The DTU dataset comprises a broad diversity of materials, appearances, and geometric structures, posing significant challenges for reconstruction algorithms, including handling non-Lambertian surfaces and delicate, thin structures. Consistent with the NeuS [23] method, we employ 15 scenes from this dataset for our experiments.

BlendedMVS. [32] As in NeuS [23], we also conducted tests on 8 challenging scenes drawn from the low-resolution subset of the BlendedMVS dataset. In this dataset, each image has a resolution of $768\times 576$ pixels and is accompanied by a corresponding mask.

HOD. [12] The Hand-held Object Dataset (HOD) contains 35 objects, divided into two subsets named Sculptures and Daily Objects. Because our task differs from that of HHOR (which reconstructs hand and object in a tightly held pose), we only use the 'Giuliano' sequence in the Sculptures subset for evaluation.

We compare Color-NeuS with the original NeuS [23] on the BlendedMVS [32] and DTU [13] datasets. As shown in Fig. 6, Color-NeuS can extract surface color while maintaining the ability to learn geometry. In Fig. 9, we evaluate Color-NeuS on the OmniObject3D [29] dataset. Beyond the ability to learn geometry, owing to the design of the relighting network, Color-NeuS also maintains the ability for novel view synthesis. See our supplementary materials for more results.

Our approach was also contrasted with that of HHOR [12] on the HOD [12] dataset. HHOR uses the intermediate solution for color extraction. As evidenced in Fig. 10 and Tab. 2, Color-NeuS demonstrates the ability to reconstruct the surface color with fewer impurities and enhanced geometrical accuracy.

5.4 Quantitative Results

In this section, we employ the Chamfer Distance on OmniObject3D [29], Giuliano (a sequence of HOD [12]), and IHO-Video as a quantitative measure to evaluate the efficacy of our method. It is pertinent to mention that the geometric aspect of our approach is founded on NeuS [23]. We will demonstrate that the geometric learning capability of our method is at least on par with NeuS. In other words, our modifications will not degrade the performance inherent to NeuS. From a theoretical standpoint, the application of our technique to enhancements of NeuS, such as PET-NeuS [25], should yield superior geometric outcomes.

doll ID	002	008	049	074	085
NeuS	0.35	2.31	0.55	0.42	0.97
Ours	0.27	1.91	0.53	0.40	0.74

Table 1: Chamfer Distance on OmniObject3D.

Giuliano	HHOR	Ours
Chamfer Distance	8.25	5.89

Table 2: Chamfer Distance on Giuliano sequence.

IHO Video	COLMAP	Naive	NeuS	Ours
ghost bear	7.84	197.65	1.18	1.32
game box	3.54	12.57	0.78	0.55
pink peach	12.69	450.20	19.39	8.13
drink	6.68	60.52	11.29	8.60

Table 3: Chamfer Distance on IHO Video.

We first normalize each mesh to unit size ($1\,\mathrm{m}$), then use the point-to-point iterative closest point (ICP) algorithm to align the reconstructed mesh with the ground truth. In all the tables, the Chamfer Distance (CD) is defined based on squared distance and reported in $\mathrm{cm}^{2}$. Results are shown in Tabs. 1, 2 and 3.
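A squared-distance Chamfer Distance of the kind described above can be sketched as follows. Several details are assumptions because the text does not fix them: the normalization scales the longest bounding-box side to 1, the two directional terms are averaged per point set and then summed, the inputs are point samples that are already aligned (the ICP step is omitted), and the brute-force pairwise computation is only suitable for small point sets.

```python
import numpy as np

def normalize_unit(points):
    """Center the cloud and scale its longest bounding-box side to 1 (assumed convention)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (points - 0.5 * (lo + hi)) / np.max(hi - lo)

def chamfer_distance_cm2(a, b):
    """Symmetric Chamfer Distance on squared distances, for inputs in meters.

    Returns the value in cm^2 (1 m^2 = 1e4 cm^2).
    """
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (len(a), len(b))
    cd_m2 = d2.min(axis=1).mean() + d2.min(axis=0).mean()
    return cd_m2 * 1e4

# Toy data (made up): a 3x3x3 grid and a copy shifted by 1 cm along x.
axis = np.array([0.0, 1.0, 2.0])
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
reference = normalize_unit(grid)
shifted = reference + np.array([0.01, 0.0, 0.0])

cd_same = chamfer_distance_cm2(reference, reference)
cd_shift = chamfer_distance_cm2(reference, shifted)
```

In this toy case every point's nearest neighbor is its own shifted copy, so each direction contributes $1\,\mathrm{cm}^{2}$.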

5.5 Ablation Study

In the relighting network, the gradient of the Signed Distance Function (SDF), denoted as $g(\bm{p})$ , is incorporated as input. This decision’s efficacy is demonstrated in an ablation study, the results of which can be viewed in Fig. 11. We believe that this gradient input offers accurate geometric information to the network, consequently leading to a faithful color distribution.

In another ablation study, we investigated the significance of using the inverse sigmoid function. In Eq. 13, we initially apply the inverse sigmoid function, denoted by $\Psi^{-1}$ , to the RGB value. Subsequently, we introduce the relight color, followed by a sigmoid function, $\Psi$ . An alternative approach would be to directly restrict the relighted color to the interval [0, 1]:

c(\bm{p},\bm{d})=\mathrm{clamp}(c_{g}(\bm{p})+\Psi(c_{r}(\bm{p},\bm{d}))-0.5,0,1).

(20)

The outcomes are displayed in Fig. 11. The findings suggest that refraining from employing the inverse sigmoid function tends to result in marginally darker color outputs.
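For reference, the two composition rules compared in this ablation (Eq. 13 and Eq. 20) can be placed side by side. The sketch below only contrasts the formulas on made-up values and does not reproduce the reported color difference; the function names and the `eps` clip are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(c, eps=1e-6):
    c = np.clip(c, eps, 1.0 - eps)   # eps is an added numerical safeguard
    return np.log(c) - np.log1p(-c)

def compose_logit(c_g, c_r):
    """Eq. 13: residual added in logit space."""
    return sigmoid(inverse_sigmoid(c_g) + c_r)

def compose_clamp(c_g, c_r):
    """Eq. 20: residual squashed to (-0.5, 0.5), added, then clamped to [0, 1]."""
    return np.clip(c_g + sigmoid(c_r) - 0.5, 0.0, 1.0)

c_g = np.array([0.9, 0.5, 0.1])            # example global color
zero = np.zeros(3)
bright = np.full(3, 2.0)                   # a strongly positive residual

same_logit = compose_logit(c_g, zero)      # both rules return c_g for a zero residual
same_clamp = compose_clamp(c_g, zero)
out_logit = compose_logit(c_g, bright)     # stays strictly below 1
out_clamp = compose_clamp(c_g, bright)     # first channel saturates at exactly 1
```

Both rules leave the global color unchanged when the residual is zero; they differ in how they behave near the ends of the color range, where the clamp saturates and the logit form compresses smoothly.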

6 Conclusion

In this paper, we introduce Color-NeuS, a novel approach to 3D implicit textured surface reconstruction that is compatible with any NeuS-like model. By isolating the view-dependent element from the neural radiance field and employing a relighting network to sustain volume rendering, Color-NeuS is able to effectively retrieve surface color while accurately reconstructing surface detail. We put Color-NeuS to the test on a demanding in-hand object scanning task using our own collected sequences as well as several public datasets. The results demonstrate Color-NeuS's capability to reconstruct neural implicit surfaces with accurate color representation.

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2021ZD0110704), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Qi Zhi Institute, and Shanghai Science and Technology Commission (21511101200).

References

Barron et al. [2021] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Bian et al. [2023] Wenjing Bian, Zirui Wang, Kejie Li, Jiawang Bian, and Victor Adrian Prisacariu. NoPe-NeRF: Optimising neural radiance field with no pose prior. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Boss et al. [2021] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, and Hendrik P.A. Lensch. NeRD: Neural reflectance decomposition from image collections. In International Conference on Computer Vision (ICCV), 2021.
Chen et al. [2021] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In International Conference on Computer Vision (ICCV), pages 14124–14133, 2021.
Cheng and Schwing [2022] Ho Kei Cheng and Alexander G. Schwing. XMem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European Conference on Computer Vision (ECCV), 2022.
Community [2018] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
Esposito et al. [2022] Stefano Esposito, Daniele Baieri, Stefan Zellmann, André Hinkenjann, and Emanuele Rodolà. KiloNeuS: Implicit neural representations with real-time global illumination. arXiv preprint arXiv:2206.10885, 2022.
Gropp et al. [2020] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. In International Conference on Machine Learning (ICML), 2020.
Hampali et al. [2023a] Shreyas Hampali, Tomas Hodan, Luan Tran, Lingni Ma, Cem Keskin, and Vincent Lepetit. In-hand 3d object scanning from an rgb sequence. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
Hampali et al. [2023b] Shreyas Hampali, Tomas Hodan, Luan Tran, Lingni Ma, Cem Keskin, and Vincent Lepetit. In-hand 3d object scanning from an rgb sequence. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023b.
Hasson et al. [2019] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Huang et al. [2022] Di Huang, Xiaopeng Ji, Xingyi He, Jiaming Sun, Tong He, Qing Shuai, Wanli Ouyang, and Xiaowei Zhou. Reconstructing hand-held objects from monocular video. In SIGGRAPH Asia Conference Proceedings, 2022.
Jensen et al. [2014] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 406–413, 2014.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
Lin et al. [2021] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-adjusting neural radiance fields. In International Conference on Computer Vision (ICCV), 2021.
Liu et al. [2022] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Meng et al. [2021] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. GNeRF: GAN-based Neural Radiance Field without Posed Camera. In International Conference on Computer Vision (ICCV), 2021.
Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020.
Oechsle et al. [2021] Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In International Conference on Computer Vision (ICCV), 2021.
Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Schönberger et al. [2016] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
Srinivasan et al. [2021] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Wang et al. [2021a] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Conference on Neural Information Processing Systems (NeurIPS), 2021a.
Wang et al. [2021b] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021b.
Wang et al. [2023] Yiqun Wang, Ivan Skorokhodov, and Peter Wonka. PET-NeuS: Positional encoding triplanes for neural surfaces. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Wang et al. [2021c] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF$--$: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021c.
Wen et al. [2023] Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Müller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-dof tracking and 3d reconstruction of unknown objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Wizadwongsa et al. [2021] Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. NeX: Real-time view synthesis with neural basis expansion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Wu et al. [2023] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. OmniObject3D: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Xu et al. [2022] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-NeRF: Point-based neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Yang et al. [2023] Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track Anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
Yao et al. [2020] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Yariv et al. [2020] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
Yariv et al. [2021] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
Ye et al. [2022] Yufei Ye, Abhinav Gupta, and Shubham Tulsiani. What’s in your hands? 3d reconstruction of generic objects in hands. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Yu et al. [2021] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Zhang et al. [2021a] Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. PhySG: Inverse rendering with spherical gaussians for physics-based material editing and relighting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021a.
Zhang et al. [2021b] Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, and Jonathan T. Barron. NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Trans. Graph., 2021b.
Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Color-NeuS: Reconstructing Neural Implicit Surfaces with Color

Supplementary Material

7 Additional details

In NeRF [18], positional encoding, denoted by $\gamma(\cdot)$, allows the network to capture high-frequency details. In Color-NeuS, positional encoding is applied to the spatial location $\bm{p}$ with 6 frequencies in the SDF network, and to the view direction $\bm{d}$ with 4 frequencies in the relight network. For the SDF network and the global color network, we adopt architectures similar to those of NeuS [23]. The relight network, which comprises 4 hidden layers with a hidden size of 256 each, is depicted in Fig. 12. The global color information is fed into its final layer.
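As an illustration of the encoding described above, the following is a minimal NumPy sketch, not the released implementation. It assumes the frequencies are powers of two ($2^{0},\dots,2^{L-1}$) without the $\pi$ factor and that the raw input is concatenated with the encoded features, as is common in NeuS-style code; both choices, along with the name `positional_encoding`, are assumptions made for this example.

```python
import numpy as np

def positional_encoding(x, num_freqs, include_input=True):
    """Encode x of shape (..., D) with sin/cos features at num_freqs octaves.

    Assumption: frequencies are 2^0 .. 2^(num_freqs-1) with no pi factor,
    and the raw input is prepended when include_input is True.
    """
    x = np.asarray(x, dtype=np.float64)
    freqs = 2.0 ** np.arange(num_freqs)            # (L,)
    scaled = x[..., None, :] * freqs[:, None]      # (..., L, D)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (..., L, 2D)
    feats = feats.reshape(*x.shape[:-1], -1)       # (..., 2 * L * D)
    if include_input:
        feats = np.concatenate([x, feats], axis=-1)
    return feats

# Five dummy samples: positions at the origin, view directions along +z.
p = np.zeros((5, 3))
d = np.tile([0.0, 0.0, 1.0], (5, 1))

enc_p = positional_encoding(p, num_freqs=6)  # input to the SDF network
enc_d = positional_encoding(d, num_freqs=4)  # input to the relight network
print(enc_p.shape, enc_d.shape)              # (5, 39) (5, 27)
```

Under these assumptions, a 3D location maps to $3 + 2\cdot 6\cdot 3 = 39$ features and a view direction to $3 + 2\cdot 4\cdot 3 = 27$ features; dropping the raw input would give 36 and 24 instead.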

8 More Results

In this section, we present additional results on public datasets. Specifically, the quantitative results on OmniObject3D are shown in Tab. 4, while the qualitative results for BlendedMVS and DTU are depicted in Fig. 13 and Fig. 14, respectively.

toy_animals ID	001	005	016	019	059
NeuS	0.53	12.05	4.36	3.43	1.09
Ours	0.91	8.66	2.47	1.06	1.18

Table 4: Additional Chamfer Distance results on OmniObject3D.
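For readers unfamiliar with the metric in Tab. 4, the sketch below shows one common form of the Chamfer Distance between two point sets: the average of the two directed mean nearest-neighbor distances. The exact protocol behind the reported numbers (point sampling, units, outlier thresholds) is not specified in this section, so this brute-force version and the name `chamfer_distance` are illustrative assumptions only.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Assumption: mean Euclidean nearest-neighbor distance in each direction,
    averaged over the two directions. Brute force, O(N * M) memory.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    a_to_b = dist.min(axis=1).mean()  # each point in a to its nearest in b
    b_to_a = dist.min(axis=0).mean()  # each point in b to its nearest in a
    return 0.5 * (a_to_b + b_to_a)

# Toy example: the second set is the first shifted by one unit along z.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = pts + np.array([0.0, 0.0, 1.0])
cd = chamfer_distance(pts, shifted)
print(cd)  # 1.0
```

For meshes, the point sets would typically be sampled from the reconstructed and ground-truth surfaces, and a spatial index such as a KD-tree would replace the dense distance matrix for large point counts.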

9 Rendering Quality

To demonstrate that our method retains the capability of volume rendering and can perform novel view synthesis, as described in Sec. 5.3, we conducted tests on the OmniObject3D dataset [29]. The results are presented in Tab. 5 and Fig. 15. For each sequence, we used $90\%$ of the images for training and the remaining $10\%$ for testing. The training configurations were identical to those in the main paper, and our method, NeuS [23], and NeRF [18] were all trained under the same settings. Our method achieves rendering performance comparable to that of NeRF [18] and NeuS [23]: its mean PSNR of 37.20 is within 0.4 dB of both baselines (37.47 and 37.57), and its mean SSIM of 0.984 matches that of NeuS.

		doll ID							toy_animals ID
		002	008	037	049	062	074	085	001	005	016	019	059	Mean
PSNR	NeRF	37.45	36.27	36.71	36.77	37.79	37.13	37.21	37.76	39.64	36.70	40.35	35.84	37.47
	NeuS	37.87	35.95	37.10	37.74	37.65	37.27	37.76	37.88	39.34	35.96	40.27	36.07	37.57
	Ours	37.47	35.20	36.67	37.43	37.43	37.23	37.28	37.82	39.01	35.69	39.55	35.66	37.20
SSIM	NeRF	0.983	0.975	0.987	0.979	0.979	0.987	0.985	0.985	0.987	0.980	0.989	0.982	0.983
	NeuS	0.986	0.975	0.989	0.983	0.979	0.989	0.988	0.986	0.987	0.978	0.990	0.984	0.984
	Ours	0.985	0.973	0.988	0.983	0.979	0.989	0.987	0.985	0.986	0.977	0.989	0.984	0.984

Table 5: Quantitative results for novel view synthesis on the OmniObject3D dataset.
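The evaluation protocol above can be sketched as follows. This is a hedged example rather than the evaluation code used for Tab. 5: it assumes images are normalized to $[0, 1]$ and that the $10\%$ test split holds out every tenth frame, neither of which is stated in this section. The names `psnr` and `split_train_test` are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def split_train_test(num_images, test_fraction=0.1):
    """Hold out every k-th image for testing (assumed scheme, k = 1 / test_fraction)."""
    step = int(round(1.0 / test_fraction))
    test_idx = list(range(0, num_images, step))
    held_out = set(test_idx)
    train_idx = [i for i in range(num_images) if i not in held_out]
    return train_idx, test_idx

train_idx, test_idx = split_train_test(100)
print(len(train_idx), len(test_idx))  # 90 10

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
target = np.full((4, 4, 3), 0.5)
pred = target + 0.1
value = psnr(pred, target)
print(round(value, 2))  # 20.0
```

SSIM is usually computed with an existing library implementation rather than reimplemented, so it is omitted from this sketch.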