# Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images

## Abstract

We propose an end-to-end deep learning architecture that generates a 3D shape in triangular mesh form from a single color image. Limited by the nature of deep neural networks, previous methods usually represent 3D shapes as volumes or point clouds, and converting these into more usable mesh models is non-trivial. Unlike existing methods, our network represents the 3D mesh in a graph-based convolutional neural network and produces correct geometry by progressively deforming an ellipsoid, leveraging perceptual features extracted from the input image. We adopt a coarse-to-fine strategy to stabilize the deformation process, and define various mesh-related losses that capture properties at different levels to guarantee visually appealing and physically accurate 3D geometry. Extensive experiments show that our method not only qualitatively produces mesh models with better details, but also achieves higher 3D shape estimation accuracy than the prior art.

**Keywords**: 3D shape generation · Graph convolutional neural network · Mesh reconstruction · Coarse-to-fine · End-to-end framework

## Introduction

Inferring 3D shape from a single perspective is a fundamental human vision capability, but is extremely challenging for computer vision. Recently, great success has been achieved in generating 3D shapes from a single color image using deep learning techniques [6, 9]. Using convolutional layers on regular grids or multi-layer perceptrons, the estimated 3D shape, as the output of the neural network, is represented as a volume [6] or a point cloud [9]. However, both representations lose important surface details, and reconstructing a surface model from them (Figure 1) is non-trivial. A mesh is more desirable for many practical applications because it is lightweight, capable of modeling shape details, and easy to deform for animation, to name a few.

In this paper, we proceed along the direction of single-image reconstruction and propose an algorithm that extracts a 3D triangular mesh from a single color image. Rather than synthesizing the mesh directly, our model learns to deform a mesh from a mean shape to the target geometry. This benefits us in several ways. First, deep networks are better at predicting residuals, such as spatial deformations, than structured outputs, such as graphs. Second, a series of deformations can be cascaded to progressively refine the shape, which also allows controlling the trade-off between the complexity of the learning model and the quality of the results. Finally, it provides the opportunity to encode any prior knowledge, such as topology, into the initial mesh. As a pioneering study, in this work we specifically study objects that can be approximated by a genus-0 3D mesh, using an ellipsoid of fixed size as initialization. In practice, we find that most commonly seen categories, such as cars, airplanes, and tables, fit this setting well. To achieve this goal, there are several inherent challenges.

The first challenge is how to represent a mesh model, which is essentially irregular, in a neural network, while still being able to effectively extract shape details from a given color image represented on a 2D regular grid. This requires integrating knowledge from the two data modalities. On the 3D geometry side, we build a graph-based fully convolutional network (GCN) [3, 8, 18] directly on the mesh model, where the vertices and edges of the mesh are directly represented as nodes and connections of the graph. Network features encoding information about the 3D shape are stored at each vertex. Through forward propagation, the convolutional layers enable feature exchange across neighboring nodes and eventually regress the 3D position of each vertex. On the 2D image side, we use a VGG-16-like architecture to extract features, as it has proven successful on many tasks [10, 20]. To bridge the two, we design a perceptual feature pooling layer that allows each node of the GCN to pool image features from its 2D projection on the image, which can be readily obtained assuming a known camera intrinsic matrix. The perceptual feature pooling is performed once after every few convolutions (i.e., the deformation blocks described in Section 3.4) using the updated 3D positions, so that image features from the correct locations are effectively integrated with the 3D shape.

Given the graph representation, the next challenge is how to effectively update the vertex positions toward the ground truth. In practice, we observe that a network trained to directly predict a mesh with a large number of vertices is prone to making mistakes at the beginning and hard to fix later. One reason is that a vertex cannot effectively retrieve features from other vertices that are many edges away, i.e., it has a limited receptive field. To solve this problem, we design a graph unpooling layer which allows the network to start with a small number of vertices and increase it during forward propagation. With fewer vertices at the beginning, the network learns to distribute the vertices to the most representative locations, and then adds local details as the number of vertices increases. In addition to graph unpooling, we use a deep GCN enhanced by shortcut connections [13] as the backbone of our architecture, which provides a larger receptive field for global context and more movement steps.

Representing the shape as a graph also benefits the learning process. The known connectivity allows us to define higher-order loss functions across neighboring nodes, which is important for regularizing 3D shapes. Specifically, we define a surface normal loss to favor smooth surfaces; an edge loss to encourage uniform distribution of mesh vertices for high recall; and a Laplacian loss to prevent mesh faces from intersecting each other. All of these losses are essential for generating a good-quality mesh model, and none of them can be trivially defined without the graph representation.

The contributions of this paper are threefold. First, we propose a novel end-to-end neural network architecture that generates a 3D mesh model from a single RGB image. Second, we design a projection layer that incorporates perceptual image features into the 3D geometry represented by the GCN. Third, our network predicts 3D geometry in a coarse-to-fine fashion, which is more reliable and easier to learn.

## Related Work

3D reconstruction has been studied in depth in the literature based on multi-view geometry (MVG) [12]. The main research directions include structure from motion (SfM) [27] for large-scale, high-quality reconstruction and simultaneous localization and mapping (SLAM) [4] for navigation. Although very successful in these scenarios, such methods are subject to two limitations: 1) the coverage that multiple views can provide, and 2) the appearance of the target to be reconstructed. The former means that MVG cannot reconstruct unseen parts of the target, so it usually takes a long time to collect enough views for a good reconstruction; the latter means that MVG cannot reconstruct non-Lambertian (e.g., reflective or transparent) or textureless objects. These limitations motivate learning-based approaches.

Learning-based approaches usually consider a single or a few images, since they rely heavily on shape priors that can be learned from data. Early work can be traced back to Hoiem et al. [14] and Saxena et al. [25]. More recently, with the success of deep learning architectures and the release of large 3D shape datasets such as ShapeNet [5], learning-based approaches have made great progress. Huang et al. [15] and Su et al. [29] retrieve shape components from a large dataset, assemble them, and deform the assembled shape to fit the observed image. However, shape retrieval from images is itself an ill-posed problem. To avoid this issue, Kar et al. [16] learn a 3D deformable model for each object category and capture the shape variations across different images. However, the reconstruction is limited to popular categories, and the results often lack details. Another line of research learns 3D shape directly from a single image. Limited by popular grid-based deep learning architectures, most works [6, 11] output 3D voxels, which typically have low resolution due to the memory constraints of modern GPUs. Recently, Tatarchenko et al. [30] proposed an octree representation which allows reconstruction at higher resolution within a limited memory budget. Even so, 3D voxels are still not a popular shape representation in the gaming and film industries. To avoid the drawbacks of voxel representations, Fan et al. [9] proposed to generate point clouds from a single image. A point cloud representation has no local connectivity between points, so the point positions have a very large degree of freedom. Consequently, the generated point cloud usually does not lie close to the surface and cannot be used to directly recover a 3D mesh. Besides these typical 3D representations, there is interesting work [28] that uses so-called "geometry images" to represent 3D shapes; their network is thus a 2D convolutional neural network performing image-to-image mapping. Our work is mostly related to two recent works, [17] and [24]. However, the former uses only simple silhouette supervision, so it does not perform well on complex objects such as cars and lamps; the latter requires a large model repository to generate a combined model.

Our base network is a graph neural network [26]; this structure has been used for shape analysis [31]. Meanwhile, graph-based methods directly apply convolutions [2, 22, 23] on surface manifolds for shape analysis. To the best of our knowledge, these architectures have never been used for 3D reconstruction from a single image, even though graphs and surface manifolds are natural representations of meshes. For a comprehensive understanding of graph neural networks, graph-based methods, and their applications, please refer to the survey [3].

## Method

### Preliminary: Graph-based Convolution

We first provide some background on graph-based convolution; a more in-depth introduction can be found in [3]. A 3D mesh is a collection of vertices, edges, and faces that defines the shape of a 3D object; it can be represented by a graph M = (V, E, F), where V = {v_i}_{i=1}^N is the set of N vertices in the mesh, E = {e_i}_{i=1}^E is the set of E edges, each connecting two vertices, and F = {f_i}_{i=1}^N are the feature vectors attached to the vertices. A graph-based convolutional layer is defined on an irregular graph as:

f_p^{l+1} = w_0 f_p^l + Σ_{q∈N(p)} w_1 f_q^l    (1)

where f_p^l ∈ R^{d_l} and f_p^{l+1} ∈ R^{d_{l+1}} are the feature vectors on vertex p before and after the convolution, and N(p) is the set of neighboring vertices of p; w_0 and w_1 are learnable d_l × d_{l+1} parameter matrices applied to all vertices. Note that w_1 is shared by all edges, so (1) works on nodes with different vertex degrees. In our case, the attached feature vector f_p is the concatenation of the 3D vertex coordinates, the features encoding the 3D shape, and the features learned from the input color image (if they exist). Running convolutions updates the features, which is equivalent to applying a deformation.
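
To make this concrete, below is a minimal NumPy sketch of one such graph convolutional layer; the function name and the neighbor-list format are our own illustrative choices, not the paper's released implementation.

```python
import numpy as np

def graph_conv(features, neighbors, w0, w1):
    """One graph convolution as in Eq. (1): each vertex p gets w0 applied
    to its own feature plus w1 applied to the sum of its neighbors'
    features (w1 is shared across all edges)."""
    out = features @ w0                     # self term: w0 * f_p
    for p, nbrs in enumerate(neighbors):
        for q in nbrs:
            out[p] += features[q] @ w1      # neighbor term: sum_q w1 * f_q
    return out
```

Because `w1` is shared, the same layer applies to vertices of any degree, exactly as noted above.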

### System Overview

Our model is an end-to-end deep learning framework that takes a single color image as input and produces a 3D mesh model in camera coordinates. An overview of the framework is shown in Figure 2. The whole network consists of an image feature network and a cascaded mesh deformation network. The image feature network is a 2D CNN that extracts perceptual features from the input image, and the mesh deformation network uses these perceptual features to progressively deform an ellipsoidal mesh into the desired 3D model. The cascaded mesh deformation network is a graph-based convolutional network (GCN) containing three deformation blocks interleaved with two graph unpooling layers. Each deformation block takes an input graph representing the current mesh model, with 3D shape features attached to the vertices, and produces new vertex positions and features. The graph unpooling layers increase the number of vertices to raise the capacity for handling details, while still maintaining the triangular mesh topology. Starting from a mesh with few vertices, our model learns to gradually deform it and add details to the coarse mesh model. To train the network to produce stable deformations and generate accurate meshes, we extend the chamfer distance loss used by Fan et al. [9] with three other mesh-specific losses: a surface normal loss, a Laplacian regularization loss, and an edge length loss. The rest of this section describes these components in detail.

### Initial Ellipsoid

Our model does not require any prior knowledge of the 3D shape and always deforms from the same initial ellipsoid of average size, placed at a common location in camera coordinates. The ellipsoid is centered 0.8 m in front of the camera, with three-axis radii of 0.2 m, 0.2 m, and 0.4 m. The mesh model is generated by the implicit surface algorithm in Meshlab [7] and contains 156 vertices. We use this ellipsoid to initialize our input graph, where the initial features contain only the 3D coordinates of each vertex.

### Mesh Deformation Block

The structure of a mesh deformation block is shown in Figure 3(a). In order to generate a 3D mesh model consistent with the object shown in the input image, the deformation block needs to pool features (P) from the input image. This is done in conjunction with the image feature network and a perceptual feature pooling layer, given the vertex locations (C_{i-1}) of the current mesh model. The pooled perceptual features are then concatenated with the 3D shape features attached to the vertices of the input graph (F_{i-1}) and fed into a series of graph-based residual layers (G-ResNet). The G-ResNet produces the new coordinates (C_i) and 3D shape features (F_i) of each vertex as the output of the mesh deformation block.

**Perceptual feature pooling layer.** We use the VGG-16 architecture up to the conv5_3 layer as the image feature network, since it is widely used. Given the 3D coordinates of a vertex, we use the camera intrinsics to compute its 2D projection on the input image plane, and then pool the feature from the four nearby pixels using bilinear interpolation. In particular, we concatenate the features extracted from the 'conv3_3', 'conv4_3', and 'conv5_3' layers, which results in a total dimension of 1280. This perceptual feature is then concatenated with the 128-dimensional 3D feature from the input mesh, giving an overall dimension of 1408. This is illustrated in Figure 3(b). Note that since no shape feature has been learned at the beginning, in the first block the perceptual features are concatenated with the 3-dimensional vertex coordinates instead.
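
The pooling step can be sketched as follows; the pinhole projection and bilinear weights are standard, while the function names and the (H, W, C) tensor layout are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def project(vertex, K):
    """Pinhole projection of a camera-space 3D vertex to pixel coordinates
    (u, v) using the known intrinsic matrix K."""
    x, y, z = vertex
    return K[0, 0] * x / z + K[0, 2], K[1, 1] * y / z + K[1, 2]

def bilinear_pool(feat_map, u, v):
    """Pool a feature vector from an (H, W, C) feature map at continuous
    (u, v) by bilinear interpolation over the four nearby pixels."""
    H, W, _ = feat_map.shape
    u, v = float(np.clip(u, 0, W - 1)), float(np.clip(v, 0, H - 1))
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat_map[v0, u0] + du * (1 - dv) * feat_map[v0, u1] +
            (1 - du) * dv * feat_map[v1, u0] + du * dv * feat_map[v1, u1])
```

In the full model this sampling is applied to the 'conv3_3', 'conv4_3', and 'conv5_3' feature maps and the results are concatenated per vertex.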

**G-ResNet.** After obtaining the 1408-dimensional features representing both 3D shape and 2D image information for each vertex, we design a graph-based convolutional neural network that predicts the new position and 3D shape feature of each vertex. This requires efficient exchange of information between vertices. However, as defined in (1), each convolution only enables feature exchange between neighboring vertices, which severely limits the efficiency of information exchange. This is equivalent to the small receptive field problem in 2D CNNs.

To solve this problem, we build a very deep network with shortcut connections [13] and denote it G-ResNet (Fig. 3(a)). In this work, the G-ResNet in every block has the same structure, consisting of 14 graph residual convolutional layers with 128 channels. The G-ResNet block produces new 128-dimensional 3D features. In addition to the feature output, there is a branch that applies an extra graph convolutional layer to the last-layer features and outputs the 3D coordinates of the vertices.
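
One residual unit of such a stack can be sketched as below; this is a minimal illustration built from the graph convolution of Eq. (1), and the layer grouping, activation choice, and names are assumptions, not the released implementation.

```python
import numpy as np

def gconv(f, neighbors, w0, w1):
    # Graph convolution of Eq. (1): self term plus summed neighbor term.
    out = f @ w0
    for p, nbrs in enumerate(neighbors):
        for q in nbrs:
            out[p] += f[q] @ w1
    return out

def g_resnet_unit(f, neighbors, w0a, w1a, w0b, w1b):
    """One graph residual unit: two graph convolutions with a ReLU in
    between, plus an identity shortcut that eases optimizing a deep stack
    of such layers."""
    h = np.maximum(gconv(f, neighbors, w0a, w1a), 0.0)  # ReLU
    h = gconv(h, neighbors, w0b, w1b)
    return f + h  # shortcut connection
```

The shortcut lets each unit learn a residual on top of the current vertex features, which is what makes the 42-layer network trainable (see the ablation study).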

### Graph Unpooling Layer

The unpooling layer is designed to increase the number of vertices in the GCN. It allows us to start with a mesh with fewer vertices and add more only when necessary, which reduces memory cost and produces better results. A straightforward approach is to add one vertex at the center of each triangle and connect it with the three triangle vertices (Figure 4(b), face-based). However, this causes imbalanced vertex degrees, i.e., numbers of edges per vertex. Inspired by the vertex-adding strategy of the mesh subdivision algorithms prevalent in computer graphics, we instead add a vertex at the midpoint of each edge and connect it with the two endpoints of that edge (Fig. 4(a)). The 3D feature of a newly added vertex is set to the average of its two neighbors. We also connect the three vertices added on the same triangle (dashed lines). Consequently, we create 4 new triangles for each triangle in the original mesh, and the number of vertices increases by the number of edges in the original mesh. This edge-based unpooling upsamples the vertices uniformly, as shown in Figure 4(b), edge-based.
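
A minimal sketch of this edge-based unpooling follows; the data layout and function name are our own, and the paper's implementation may differ in detail.

```python
def unpool(vertices, features, faces):
    """Edge-based graph unpooling: add one vertex at the midpoint of every
    edge (its feature is the average of the two endpoints) and split each
    triangle into four, preserving the triangular topology."""
    vertices = [list(v) for v in vertices]
    features = [list(f) for f in features]
    midpoint = {}  # undirected edge (i, j) -> index of its new midpoint vertex

    def mid(i, j):
        key = (min(i, j), max(i, j))
        if key not in midpoint:
            midpoint[key] = len(vertices)
            vertices.append([(a + b) / 2 for a, b in zip(vertices[i], vertices[j])])
            features.append([(a + b) / 2 for a, b in zip(features[i], features[j])])
        return midpoint[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        # one center triangle (the dashed one) plus three corner triangles
        new_faces += [[ab, bc, ca], [a, ab, ca], [b, ab, bc], [c, bc, ca]]
    return vertices, features, new_faces
```

Because each midpoint is shared by the (at most two) faces meeting at an edge, the vertex count grows by exactly the number of edges, as stated above.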

### Losses

We define four kinds of losses to constrain the properties of the output shape and the deformation procedure to guarantee appealing results. We adopt the chamfer loss [9] to constrain the locations of the mesh vertices, a normal loss to enforce the consistency of surface normals, a Laplacian regularization to maintain the relative locations between neighboring vertices during deformation, and an edge length regularization to prevent outliers. These losses are applied with equal weights on the intermediate and final meshes.

Unless otherwise stated, until the end of this section we use p for a vertex in the predicted mesh, q for a vertex in the ground-truth mesh, and N(p) for the neighboring vertices of p.

**Chamfer loss.** The chamfer distance measures the distance from each point to the nearest point in the other set:

l_c = Σ_p min_q ||p − q||_2^2 + Σ_q min_p ||p − q||_2^2

It is reasonably good at regressing the vertices to their correct positions, but not sufficient on its own to produce nice 3D meshes (see the results of Fan et al. [9] in Figure 1).
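
A compact NumPy version of the chamfer loss (brute-force nearest neighbors, sufficient for illustration):

```python
import numpy as np

def chamfer_loss(P, Q):
    """Symmetric chamfer distance between point sets P and Q: for each
    point, find its nearest point in the other set and sum the squared
    distances both ways."""
    # pairwise squared distances, shape (|P|, |Q|)
    d = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)
    return float(d.min(axis=1).sum() + d.min(axis=0).sum())
```

In practice an efficient nearest-neighbor structure would replace the O(|P|·|Q|) distance matrix for large point sets.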

**Normal loss.** We further define a loss on the surface normals to characterize higher-order properties:

l_n = Σ_p Σ_{k∈N(p)} ||⟨p − k, n_q⟩||_2^2

where q is the closest ground-truth vertex to p found when computing the chamfer loss, k is a neighboring vertex of p, ⟨·,·⟩ is the inner product of two vectors, and n_q is the surface normal observed from the ground truth.

Essentially, this loss requires the edge between a vertex and its neighbors to be perpendicular to the normal observed from the ground truth. This loss is nonzero unless the vertex and its neighbors lie on a plane. However, optimizing this loss is equivalent to forcing the normal of a locally fitted tangent plane to be consistent with the observation, which works reasonably well in our experiments. Moreover, this normal loss is fully differentiable and easy to optimize.
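
The normal loss can be sketched as follows; here each vertex's nearest ground-truth normal n_q is assumed to be precomputed (e.g., during the chamfer matching), and the function name is illustrative.

```python
import numpy as np

def normal_loss(verts, neighbors, nearest_gt_normal):
    """Surface normal loss: penalize the squared inner product between each
    edge (p - k) and the ground-truth normal n_q of p's nearest
    ground-truth vertex, pushing edges into the observed tangent plane."""
    loss = 0.0
    for p, nbrs in enumerate(neighbors):
        n_q = nearest_gt_normal[p]  # normal of the matched ground-truth vertex
        for k in nbrs:
            loss += float(np.dot(verts[p] - verts[k], n_q)) ** 2
    return loss
```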

**Regularization.** Even with the chamfer loss and normal loss, the optimization easily gets stuck in local minima. More specifically, the network may produce some large deformations to favor local consistency, which is especially harmful when the estimation is far from the ground truth and leads to flying vertices (Figure 5).

**Laplacian regularization.** To handle these problems, we first propose a Laplacian term to prevent the vertices from moving too freely, which avoids self-intersection of the mesh. The Laplacian term serves as a local detail-preserving operator that encourages neighboring vertices to have the same movement. In the first deformation block, it acts like a surface smoothness term, since the input to that block is a smooth ellipsoid; starting from the second block, it prevents the 3D mesh model from deforming too much, so that only fine-grained details are added to the mesh model. To calculate this loss, we first define the Laplacian coordinate of each vertex p as

δ_p = p − Σ_{k∈N(p)} k / ||N(p)||

and the Laplacian regularization is then defined as:

l_lap = Σ_p ||δ'_p − δ_p||_2^2

where δ'_p and δ_p are the Laplacian coordinates of a vertex after and before a deformation block.
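
A sketch of the Laplacian coordinates and the regularization (helper names are illustrative):

```python
import numpy as np

def laplacian_coords(verts, neighbors):
    """delta_p = p minus the mean of p's neighbors (Laplacian coordinate)."""
    return np.array([verts[p] - np.mean([verts[k] for k in nbrs], axis=0)
                     for p, nbrs in enumerate(neighbors)])

def laplacian_loss(verts_before, verts_after, neighbors):
    """Penalize change of Laplacian coordinates across a deformation block,
    so neighboring vertices are encouraged to move together."""
    diff = (laplacian_coords(verts_after, neighbors)
            - laplacian_coords(verts_before, neighbors))
    return float(np.sum(diff ** 2))
```

Note that a rigid translation of the whole mesh leaves every δ_p unchanged, so the term penalizes only relative movements, not the overall deformation.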

**Edge length regularization.** To penalize flying vertices, which usually cause long edges, we add an edge length regularization loss:

l_loc = Σ_p Σ_{k∈N(p)} ||p − k||_2^2

The total loss is a weighted sum of all four losses, l_all = l_c + λ1 l_n + λ2 l_lap + λ3 l_loc, where λ1 = 1.6 × 10^-4, λ2 = 0.3, and λ3 = 0.1 are hyperparameters balancing the losses, fixed for all experiments.
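
The edge length term and the weighted total can be sketched as below; the function names are ours, while the λ values are the ones stated above.

```python
import numpy as np

def edge_length_loss(verts, neighbors):
    """l_loc: sum of squared edge lengths, penalizing the long edges that
    flying vertices create."""
    return float(sum(np.sum((verts[p] - verts[k]) ** 2)
                     for p, nbrs in enumerate(neighbors) for k in nbrs))

def total_loss(l_c, l_n, l_lap, l_loc, lam1=1.6e-4, lam2=0.3, lam3=0.1):
    """l_all = l_c + lam1*l_n + lam2*l_lap + lam3*l_loc with the paper's
    fixed weights."""
    return l_c + lam1 * l_n + lam2 * l_lap + lam3 * l_loc
```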

## Experiments

In this section we evaluate our model extensively. Besides comparing the reconstruction accuracy against previous 3D shape generation works, we also analyze the importance of each component of our model. Qualitative results on both synthetic and real-world images further show that our model generates triangular meshes with smooth surfaces that still preserve the details depicted in the input images.

### Experimental Setup

**Data.** We use the dataset provided by Choy et al. [6]. The dataset contains rendered images of 50k models belonging to 13 object categories of ShapeNet [5], a collection of 3D CAD models organized according to the WordNet hierarchy. The models are rendered from various camera viewpoints, and the camera intrinsic and extrinsic matrices are recorded. For a fair comparison, we use the same training/testing split as Choy et al. [6].

**Evaluation metrics.** We adopt standard 3D reconstruction metrics. We first uniformly sample points from the results and the ground truth. We calculate precision and recall by checking the percentage of points in the prediction or the ground truth that can find a nearest neighbor from the other within a certain threshold τ. The F-score [19] is then calculated as the harmonic mean of precision and recall. We also report the chamfer distance (CD) and the Earth Mover's Distance (EMD). For F-score, larger is better; for CD and EMD, smaller is better.
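
This F-score computation can be sketched as follows; whether τ thresholds the distance or the squared distance varies between implementations, so we assume squared distance here.

```python
import numpy as np

def f_score(pred, gt, tau):
    """F-score at threshold tau: harmonic mean of precision (fraction of
    predicted points with a ground-truth point within tau) and recall
    (fraction of ground-truth points with a predicted point within tau)."""
    # pairwise squared distances, shape (|pred|, |gt|)
    d = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    precision = float(np.mean(d.min(axis=1) < tau))
    recall = float(np.mean(d.min(axis=0) < tau))
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```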

On the other hand, we note that the commonly used evaluation metrics for shape generation may not fully reflect shape quality. They usually capture occupancy or point-wise distance rather than surface properties such as continuity, smoothness, and high-level details, for which there are few standard evaluation metrics in the literature. Therefore, we encourage the reader to pay attention to the qualitative results for a better understanding of these aspects.

**Baselines.** We compare the proposed method with the latest single-image reconstruction methods. Specifically, we compare with two state-of-the-art methods: Choy et al. [6] (3D-R2N2), which generates 3D volumes, and Fan et al. [9] (PSG), which generates point clouds. Since the metrics are defined on point clouds, we can evaluate PSG directly on its output, evaluate our method by uniformly sampling points on the surface, and evaluate 3D-R2N2 by uniformly sampling points from the mesh created with the Marching Cubes [21] method.

We also compare with the Neural 3D Mesh Renderer (N3MR) [17], which is, to the best of our knowledge, the only deep learning based mesh generation model with publicly available code. For a fair comparison, we train their model with the same data and for the same amount of time.

**Training and runtime.** Our network receives an input image of size 224 × 224 and an initial ellipsoid with 156 vertices and 462 edges. The network is implemented in TensorFlow and optimized with Adam using a weight decay of 1e-5. The batch size is 1; the total number of training epochs is 50; the learning rate is initialized to 3e-5 and drops to 1e-5 after 40 epochs. The total training time is 72 hours on an Nvidia Titan X. During testing, our model takes 15.58 ms to generate a mesh with 2466 vertices.

### Comparison with Prior Art

Table 1 shows the F-scores of different methods under different thresholds. Our approach outperforms the other methods in all but one category. Notably, under the smaller threshold τ, our results are significantly better than the others in all categories, showing at least a 10% improvement in F-score. N3MR does not perform well, with results about 50% worse than ours, presumably because their model can only learn from the limited silhouette signal in the image and lacks explicit handling of the 3D mesh.

We also show the CD and EMD of all categories in Table 2. Our approach outperforms the others in most categories and achieves the best mean score. The main competitor is PSG, which generates point clouds and has the largest degree of freedom; this freedom leads to smaller CD and EMD, but without proper regularization it does not necessarily lead to better mesh models. To demonstrate this, we present qualitative results to analyze why our approach outperforms the others. Figure 8 shows visual results. To compare the quality of the mesh models, we convert volumes and point clouds into meshes using standard approaches [21, 1]. As can be seen, the 3D volume results produced by 3D-R2N2 lack details due to the low resolution, e.g., the missing legs in the chair example in row 4 of Figure 8. We tried the octree-based solution [30] to increase the volume resolution, but found it hard to recover surface-level details comparable to our model. PSG produces sparse 3D point clouds, from which recovering a mesh is non-trivial. This is because its chamfer loss acts like a regression loss, giving too much freedom to the point cloud. N3MR produces only very rough shapes, which may be sufficient for some rendering tasks but fails to recover complex objects such as chairs and tables. In contrast, our model does not suffer from these problems, thanks to the mesh representation, the integration of perceptual features, and the well-defined losses used during training. Our results are not limited by resolution and contain both smooth continuous surfaces and local details.

### Ablation Study

We now conduct controlled experiments to analyze the importance of each component in our model. Table 3 reports the performance of models obtained by removing one component from the full model. As before, we believe that these commonly used evaluation metrics do not necessarily reflect the quality of the recovered 3D geometry. For example, the model without edge length regularization achieves the best performance in all cases but actually produces the worst meshes (Figure 5). Therefore, we use the qualitative results in Figure 5 to show the contribution of each component of our system.

**Graph unpooling.** We first remove the graph unpooling layers, so that each block has the same number of vertices as the last block of our full model. We observe that the deformation tends to make mistakes at the beginning which are hard to fix later, leading to obvious artifacts on some parts of the objects.

**G-ResNet.** We remove the shortcut connections in G-ResNet, making it a regular GCN. As can be seen in Table 3, there is a huge performance gap on all four measurements, which indicates a failure to optimize the chamfer distance. The main reason is presumably the degradation problem observed in very deep 2D convolutional neural networks, where adding more layers to a suitably deep model leads to higher training error (and thus higher testing error) [13]. In fact, our network has 42 graph convolutional layers, so it is not surprising that this phenomenon is also observed in our deep graph neural network experiments.

**Loss terms.** Besides the chamfer loss, we also evaluate the function of each additional loss term. As can be seen in Figure 5, removing the normal loss severely damages the surface smoothness and local details, e.g., on the back of the chair; removing the Laplacian term leads to intersecting geometry due to local topology changes, e.g., at the handle of the chair; and removing the edge length term causes flying vertices and faces, which completely destroys the surface characteristics. These results show that all the components proposed in this work contribute to the final performance.

**Number of deformation blocks.** We now analyze the effect of the number of blocks. Figure 6 (left) shows the mean F-score(τ) and CD with respect to the number of blocks. The results show that adding blocks helps, but the benefit saturates as more blocks are added; for example, in our experiments we found that 4 blocks result in too many vertices and edges, which greatly slows down our method even though it provides slightly better accuracy on the evaluation metrics. Therefore, we use 3 blocks in all our experiments to achieve the best balance between performance and efficiency. Figure 6 (right) shows the output of the model after each deformation block. Notice how the mesh is densified with more vertices and new details are added.

### Reconstructing Real-World Images

Following Choy et al. [6], we test our network on the Online Products dataset and on Internet images for qualitative evaluation on real images. We run the model trained on the ShapeNet dataset directly on real images without fine-tuning and show the results in Figure 7. As can be seen, our model trained on synthetic data generalizes well to real-world images across various categories.

## Conclusion

We have proposed an approach for extracting a 3D triangular mesh from a single image. We exploit the key advantages that the mesh representation brings, including surface normal constraints and information propagation along edges, and solve the key issues required for success, including the use of perceptual features extracted from the image as guidance. We carefully design our network architecture and propose a very deep cascaded graph convolutional neural network with shortcut connections. Trained end-to-end with the chamfer loss, the normal loss, and the regularization terms, our network refines the mesh progressively. Our results are significantly better than the previous state of the art using other shape representations such as 3D volumes or 3D point clouds. Therefore, we believe the mesh representation is the next focus in this direction, and we hope the key components found in our work can support follow-up works that further advance direct 3D mesh reconstruction from a single image.

**Future work.** Our method only generates meshes with the same topology as the initial mesh. In the future, we will extend our approach to more general cases, such as scene-level reconstruction, and learn multi-view reconstruction from multiple images.