CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

1HSE University       2Yandex Research       

Abstract

This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling: given a differentiable rasterization and shading pipeline, it recovers a complex texture through gradient guidance. However, in practice, this solution in conjunction with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization, such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out-of-the-box. In particular, we were able to replace the implicit texture parameterization with an explicit one, which improved the procedure. In the experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.

Method Overview

CasTex method results
Overall pipeline. Our method consists of two stages. In the first stage, given a 3D mesh and a prompt, our method employs a differentiable rendering pipeline to generate random views of the model under various lighting. We update the randomly initialized model texture using Score Distillation Sampling (SDS). In the second stage, we refine the texture from the first stage using SDS with a super-resolution diffusion model. For the same view and lighting, we render two frames: a frame with the fixed texture taken from the first stage and a frame with the current texture. Using the former frame as the condition, we back-propagate the SDS gradients through the latter frame.
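The texture update in the first stage can be sketched as a standard SDS step: render a view, add noise at a sampled timestep, query the diffusion model's noise prediction, and push the residual back onto the texture. The sketch below is a toy illustration with NumPy, not the paper's implementation: `render_fn` and `denoiser` are hypothetical stand-ins (the actual method uses a differentiable rasterizer with physically-based shading and a pixel-space cascaded diffusion model), and an identity render is assumed so the image gradient maps directly onto the texture.

```python
import numpy as np

def sds_update(texture, render_fn, denoiser, t, alpha_bar, lr=0.1, rng=None):
    """One toy Score Distillation Sampling step.

    texture   : array being optimized
    render_fn : texture -> image; here assumed to be a pass-through,
                so the image-space gradient doubles as the texture gradient
    denoiser  : (noisy_image, t) -> predicted noise (hypothetical model)
    alpha_bar : cumulative noise schedule value at timestep t
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    img = render_fn(texture)
    eps = rng.standard_normal(img.shape)
    # Forward diffusion: noise the rendered view at timestep t.
    noisy = np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = denoiser(noisy, t)
    w = 1.0 - alpha_bar                 # a common SDS weighting choice
    grad_img = w * (eps_pred - eps)     # SDS skips the model's Jacobian
    return texture - lr * grad_img      # gradient step on the texture
```

In the real pipeline the gradient is back-propagated through rasterization and shading rather than applied directly, and the second stage repeats the same update with a super-resolution model conditioned on the first-stage render.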

Comparison with competing methods

Qualitative comparison. We synthesized textures for models from the Objaverse dataset using the proposed approach and a number of recent competing methods. Our method generates seamless textures with softer colors compared with latent diffusion-based approaches.

Human evaluation

User preference study. Human preference study comparing our method with competing optimization-based and back-projection baselines.

Automated Metrics

Quantitative comparison. Comparison of FID and KID scores for different text-to-texture generation methods. Notably, even with the smallest diffusion model, our method outperforms optimization-based baselines. As expected, the super-resolution stage improves the results in both settings. Surprisingly, in our evaluation, the older back-projection-based method outperforms the optimization-based method and is only rivaled by our largest setup. We use an NVIDIA A100 80GB for time measurements and generate textures with a resolution of 1024×1024 pixels.

BibTeX


      @article{aliev2025castex,
        title={CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading},
        author={Aliev, Mishan and Baranchuk, Dmitry and Struminsky, Kirill},
        journal={arXiv preprint arXiv:2504.06856},
        year={2025}
      }