SwiftSketch is a diffusion-based model for object sketching that offers several key benefits.
Human-drawn sketch datasets present a tradeoff: large-scale collections like QuickDraw (50M sketches) lack a professional appearance, while professionally curated datasets are often small and domain-specific, such as the dataset by Berger et al., which focuses on portraits, and OpenSketch, which contains product design sketches.
To address this, we introduce a synthetic image-sketch dataset. We use SDXL to generate the images, and the corresponding vector sketches are produced by our optimization-based technique, ControlSketch.
Our synthetic sketches are provided in vector format, maintaining high fidelity to the input image while exhibiting a professional, yet natural and abstract style.
The ControlSketch dataset comprises 35,000 pairs of images and their corresponding sketches in SVG format, spanning 100 object categories. Download the data
We utilize 15 categories of the ControlSketch dataset to train a generative model Mθ that learns to efficiently produce a vector sketch from an input image. The model generates a new sketch by progressively denoising randomly sampled Gaussian noise ST ∼ 𝒩(0, I) conditioned on the image embedding.
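For illustration, a minimal PyTorch-style sketch of this sampling loop might look as follows. The names (`model`, `image_embed`, `n_strokes`, `n_points`) and the use of a diffusers-style scheduler configured for x0-prediction are assumptions, not the exact implementation.

```python
import torch

@torch.no_grad()
def sample_sketch(model, image_embed, scheduler, n_strokes=32, n_points=4,
                  steps=50, device="cuda"):
    """Generate a vector sketch by iteratively denoising Gaussian noise.

    `model` is assumed to predict the clean control points S0 given the noised
    points S_t, the timestep t, and the CLIP-based image embedding. The
    scheduler is assumed to be configured with prediction_type="sample".
    """
    # S_T ~ N(0, I): (x, y) control points for all strokes.
    s_t = torch.randn(1, n_strokes * n_points, 2, device=device)

    scheduler.set_timesteps(steps, device=device)
    for t in scheduler.timesteps:
        # The network predicts the clean signal S0 at every step.
        s0_pred = model(s_t, t, image_embed)
        # Step the sampler toward the predicted clean signal.
        s_t = scheduler.step(s0_pred, t, s_t).prev_sample

    return s_t  # final control points, ready to render as cubic Bezier strokes
```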
The training of the model follows the standard conditional diffusion framework, with task-specific modifications to address the characteristics of vector data and the image-to-sketch task. In our case, the model learns to denoise the set of (x, y) coordinates that define the strokes in the sketch. At each training iteration, an image I is passed through a frozen CLIP image encoder, followed by a lightweight CNN, to produce the image embedding Ie. The corresponding vector sketch S0 is noised based on the sampled timestep t and noise ε, forming St (with 𝓡(St) illustrating the rasterized noised sketch, which is not used in training). The network Mθ, a transformer decoder, receives the noised signal St and is tasked with predicting the clean signal S0, conditioned on the image embedding Ie and the timestep t (fed through the cross-attention mechanism). The network is trained with two loss functions: one based on the distance between the control points and the other on the similarity of the rasterized sketches.
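A simplified sketch of one such training iteration is shown below. The helper names (`clip_encoder`, `cnn_head`, `rasterize`, the loss weight `w_raster`) are illustrative assumptions; in particular, the raster loss presumes a differentiable rasterizer (e.g., diffvg).

```python
import torch
import torch.nn.functional as F

def training_step(model, clip_encoder, cnn_head, scheduler, rasterize,
                  image, s0, w_raster=0.5):
    """One SwiftSketch-style training iteration (simplified sketch).

    `s0` holds the clean (x, y) Bezier control points of the ground-truth
    sketch; `rasterize` is assumed to be a differentiable rasterizer.
    """
    # Image embedding: frozen CLIP image encoder followed by a lightweight CNN.
    with torch.no_grad():
        clip_feats = clip_encoder(image)
    image_embed = cnn_head(clip_feats)

    # Sample a timestep and noise, then noise the clean control points.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (s0.shape[0],), device=s0.device)
    eps = torch.randn_like(s0)
    s_t = scheduler.add_noise(s0, eps, t)

    # The transformer decoder predicts the clean signal S0, conditioned on the
    # image embedding and timestep through cross-attention.
    s0_pred = model(s_t, t, image_embed)

    # Loss 1: distance between predicted and ground-truth control points.
    loss_points = F.mse_loss(s0_pred, s0)
    # Loss 2: similarity of the rasterized sketches.
    loss_raster = F.mse_loss(rasterize(s0_pred), rasterize(s0))
    return loss_points + w_raster * loss_raster
```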
[Figure: input images alongside their generated sketches, rendered with 12, 17, 22, 27, and 32 strokes.]
Generated sketches are visualized progressively, with the stroke count shown on top. Early strokes capture the
object’s contour and key features, while later strokes add finer details.
Given an input image depicting an object, our goal is to generate a corresponding sketch that maintains high fidelity to the input while preserving a natural sketch-like appearance. Following common practice in the field, we define a sketch as a set of n black strokes, where each stroke is a two-dimensional cubic Bézier curve. We optimize the set of strokes using the standard SDS-based optimization pipeline with two key enhancements: an improved stroke initialization process and the introduction of spatial control.
For stroke initialization, we use the image attention map extracted with DDIM inversion. To emphasize critical areas while ensuring comprehensive object coverage, the object area is divided into k equal-area regions using a weighted K-Means method that accounts for both attention weights and pixel locations. We then distribute the n points across regions based on their attention values, while ensuring a minimum allocation per region, and derive the initial strokes from these points.
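One way this initialization could be approximated is sketched below, using scikit-learn's K-Means with attention-derived sample weights. The function and parameter names (`init_stroke_points`, `min_per_region`, the proportional allocation rule) are assumptions for illustration, not the exact method.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_stroke_points(attention, object_mask, n_strokes, k=8, min_per_region=1):
    """Distribute initial stroke locations over the object (illustrative sketch).

    `attention` is the per-pixel attention map (e.g., from DDIM inversion) and
    `object_mask` a binary mask of the object. Pixel locations are clustered
    into k regions with an attention-weighted K-Means, and the n strokes are
    allocated to regions in proportion to their attention mass.
    """
    ys, xs = np.nonzero(object_mask)
    coords = np.stack([xs, ys], axis=1).astype(np.float32)   # pixel locations
    weights = attention[ys, xs].astype(np.float64) + 1e-8    # attention weights

    # Weighted K-Means over pixel locations, weighting samples by attention.
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(coords, sample_weight=weights)

    # Allocate strokes per region proportionally to its attention mass,
    # while guaranteeing a minimum allocation per region.
    region_attn = np.array([weights[labels == r].sum() for r in range(k)])
    alloc = np.maximum(min_per_region,
                       np.round(n_strokes * region_attn / region_attn.sum()).astype(int))

    points = []
    for r in range(k):
        region_coords = coords[labels == r]
        region_w = weights[labels == r]
        # Sample initial points inside the region, favoring high-attention pixels.
        idx = np.random.choice(len(region_coords), size=alloc[r],
                               p=region_w / region_w.sum())
        points.append(region_coords[idx])
    return np.concatenate(points)[:n_strokes]  # centers of the initial strokes
```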
At each optimization step, the rasterized sketch is noised based on t and ε, then fed into a depth-ControlNet text-to-image diffusion model instead of a standard text-to-image diffusion model. The model predicts the noise ε̂ conditioned on the image caption and the depth map of the image, and this predicted noise is used for the SDS loss. We balance the weighting between the spatial and textual conditions to achieve an optimal trade-off between semantic fidelity, derived from the caption (ensuring the sketch is recognizable), and geometric fidelity, derived from the depth map, which governs the accuracy of the spatial structure.
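A minimal sketch of this SDS step with diffusers-style components is shown below. The function name, the specific guidance and conditioning scales, and the omission of the usual timestep weighting w(t) are simplifying assumptions; `control_scale` stands in for the spatial (depth) weighting and `guidance_scale` for the textual one.

```python
import torch
import torch.nn.functional as F

def controlsketch_sds_loss(unet, controlnet, scheduler, sketch_latents, text_emb,
                           depth_map, guidance_scale=7.5, control_scale=0.8):
    """SDS loss with a depth-ControlNet backbone (illustrative sketch).

    `sketch_latents` are VAE latents of the rasterized sketch, `text_emb` the
    caption embeddings (unconditional and conditional stacked for CFG), and
    `depth_map` the depth conditioning image.
    """
    t = torch.randint(50, 950, (sketch_latents.shape[0],), device=sketch_latents.device)
    eps = torch.randn_like(sketch_latents)
    noisy = scheduler.add_noise(sketch_latents, eps, t)

    with torch.no_grad():
        latent_in = torch.cat([noisy] * 2)   # duplicated for classifier-free guidance
        t_in = torch.cat([t] * 2)
        # The depth ControlNet injects spatial residuals into the frozen UNet;
        # conditioning_scale weights the spatial condition against the text.
        down_res, mid_res = controlnet(
            latent_in, t_in, encoder_hidden_states=text_emb,
            controlnet_cond=torch.cat([depth_map] * 2),
            conditioning_scale=control_scale, return_dict=False)
        noise_pred = unet(
            latent_in, t_in, encoder_hidden_states=text_emb,
            down_block_additional_residuals=down_res,
            mid_block_additional_residual=mid_res).sample
        eps_uncond, eps_text = noise_pred.chunk(2)
        eps_hat = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    # SDS: nudge the sketch latents so the predicted noise matches the sampled noise.
    grad = eps_hat - eps
    target = (sketch_latents - grad).detach()
    return 0.5 * F.mse_loss(sketch_latents, target, reduction="sum")
```

In this formulation, raising `control_scale` tightens the sketch's agreement with the depth map (geometric fidelity), while the caption-driven guidance keeps the sketch semantically recognizable.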