CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

1HSE University       2Yandex Research       

Abstract

This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling: given a differentiable rasterization and shading pipeline, it recovers a complex texture through gradient guidance. However, in practice, this solution in conjunction with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization, such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out-of-the-box. In particular, we were able to replace the implicit texture parameterization with an explicit one, which improved the procedure. In the experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.

Method Overview

CasTex method results
Overall pipeline. Our method consists of two stages. In the first stage, given a 3D mesh and a prompt, our method employs a differentiable rendering pipeline to generate random views of the model under various lighting. We update the randomly initialized model texture using Score Distillation Sampling (SDS). In the second stage, we refine the texture from the first stage using SDS with a super-resolution diffusion model. For the same view and lighting, we render two frames: a frame with the fixed texture taken from the first stage and a frame with the current texture. Using the former frame as the condition, we back-propagate the SDS gradients through the latter frame.
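The texture update in the first stage can be sketched as a standard SDS step: render a view, add noise at a sampled timestep, query the diffusion model's noise prediction, and push the residual back onto the texture. The sketch below is a toy illustration with NumPy, not the paper's implementation: `render_fn` and `denoiser` are hypothetical stand-ins (the actual method uses a differentiable rasterizer with physically-based shading and a pixel-space cascaded diffusion model), and an identity render is assumed so the image gradient maps directly onto the texture.

```python
import numpy as np

def sds_update(texture, render_fn, denoiser, t, alpha_bar, lr=0.1, rng=None):
    """One toy Score Distillation Sampling step.

    texture   : array being optimized
    render_fn : texture -> image; here assumed to be a pass-through,
                so the image-space gradient doubles as the texture gradient
    denoiser  : (noisy_image, t) -> predicted noise (hypothetical model)
    alpha_bar : cumulative noise schedule value at timestep t
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    img = render_fn(texture)
    eps = rng.standard_normal(img.shape)
    # Forward diffusion: noise the rendered view at timestep t.
    noisy = np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = denoiser(noisy, t)
    w = 1.0 - alpha_bar                 # a common SDS weighting choice
    grad_img = w * (eps_pred - eps)     # SDS skips the model's Jacobian
    return texture - lr * grad_img      # gradient step on the texture
```

In the real pipeline the gradient is back-propagated through rasterization and shading rather than applied directly, and the second stage repeats the same update with a super-resolution model conditioned on the first-stage render.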

Comparison with competing methods

Qualitative comparison. We synthesized textures for models from the Objaverse dataset using the proposed approach and a number of recent competing methods. Our method generates seamless textures with softer colors compared with latent diffusion-based approaches.

Human evaluation

User preference study. Human preference study comparing our method with competing optimization-based and back-projection baselines.

Automated Metrics

Quantitative comparison. Comparison of FID and KID scores for different text-to-texture generation methods. Notably, even with the smallest diffusion model, our method outperforms optimization-based baselines. As expected, the super-resolution stage improves the results in both settings. Surprisingly, in our evaluation, the older back-projection-based method outperforms the optimization-based method and is only rivaled by our largest setup. We use an NVIDIA A100 80GB for time measurements and generate textures with a resolution of 1024×1024 pixels.

BibTeX


      @article{aliev2025castex,
        title={CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading},
        author={Aliev, Mishan and Baranchuk, Dmitry and Struminsky, Kirill},
        journal={arXiv preprint arXiv:2504.06856},
        year={2025}
      }