Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics in which the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a pipeline powered by visual generative models that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows that were previously infeasible for flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task.
Most contemporary icon platforms distribute abstract icons as flattened monochrome or duotone SVGs, typically represented as a single merged path or a collection of compound paths. This flattening discards part-level structure and introduces substantial challenges for downstream editing, restyling, and animation.
Let the input icon be a set of Bézier paths \(\mathcal{P} = \{P_1, \ldots, P_N\}\). To avoid structural ambiguity, we rasterize the icon into a binary silhouette \(I \in \{0,1\}^{H \times W}\). We then seek to decompose \(I\) into visible semantic masks \(\mathcal{V} = \{V_1, \ldots, V_K\}\) satisfying \(I = \bigcup_k V_k\), recover their complete amodal shapes \(\mathcal{A} = \{A_1, \ldots, A_K\}\) where \(V_k \subseteq A_k\), determine a layering permutation \(\{A_1, \ldots, A_K\} \to \{A_1^*, \ldots, A_K^*\}\), and finally vectorize the ordered layers back into editable Bézier paths \(\{A_1^*, \ldots, A_K^*\} \xrightarrow{\text{vec}} \{\hat{P}_1, \ldots, \hat{P}_K\}\).
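The decomposition constraints above can be made concrete on toy binary masks. The following is a minimal NumPy sketch with a hypothetical 4×4 silhouette; the masks are illustrative, not outputs of the pipeline:

```python
import numpy as np

# Toy 4x4 binary silhouette I (hypothetical example).
I = np.array([[0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 1],
              [0, 1, 1, 0]], dtype=bool)

# Two visible semantic masks V_1, V_2 that partition the silhouette.
V1 = np.zeros((4, 4), dtype=bool); V1[:, :2] = I[:, :2]   # left visible part
V2 = np.zeros((4, 4), dtype=bool); V2[:, 2:] = I[:, 2:]   # right visible part

# Amodal mask A_1 extends the visible part into the occluded region.
A1 = V1.copy(); A1[1:3, 2] = True   # part 1 continues under part 2

# Constraint 1: visible masks cover the silhouette, I = union_k V_k.
assert np.array_equal(V1 | V2, I)
# Constraint 2: each visible mask is contained in its amodal mask, V_k <= A_k.
assert (V1 <= A1).all()
```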
SemLayer addresses these challenges through a three-stage pipeline: (1) Semantic-aware Generative Segmentation that leverages diffusion models to decompose the icon into semantic parts via controllable colorization; (2) Amodal Layer Completion that recovers the full shape of each semantic part, including regions occluded by other parts; and (3) Layer Ordering via integer linear programming that determines the spatial arrangement of layers. Finally, the ordered layers are vectorized through a curve reuse strategy that preserves original Bézier segments and repairs missing regions.
Semantic part segmentation of abstract icons is challenging due to their high degree of abstraction and minimal color cues. General-purpose segmentation models such as SAM often fail under these conditions, confusing strokes with filled regions and producing unstable masks. We instead reformulate segmentation as a colorization task: given a monochrome icon, our model generates a colorized rendering in which each distinct color corresponds to a semantic part. To preserve structural integrity while enabling semantic colorization, we adopt EasyControl as the backbone, which enforces explicit structural conditioning in diffusion transformers.
We train on triplets \((p, I_{\text{cond}}, I_{\text{tgt}})\) from our custom-built dataset, where \(p\) is the text prompt, \(I_{\text{cond}}\) is the binary silhouette, and \(I_{\text{tgt}}\) is the colorized target. Let \(x_0\) denote the clean latent of \(I_{\text{tgt}}\) and \(z_c\) the condition latent encoded from \(I_{\text{cond}}\). For a randomly sampled timestep \(t \in [0, 1]\), we draw noise \(\varepsilon \sim \mathcal{N}(0, \mathbf{I})\) and construct the noisy latent \(z_n = t\,\varepsilon + (1-t)\,x_0\). The condition and noisy tokens are passed jointly through a Transformer with conditional LoRA modules and causal conditional attention. We optimize a flow-matching objective:
\[\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,\varepsilon \sim \mathcal{N}(0, \mathbf{I})} \left\| v_\theta(z_n, t, z_c) - (\varepsilon - x_0) \right\|_2^2\]
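A single Monte Carlo sample of this objective can be sketched in a few lines. This is a minimal NumPy illustration; `v_theta` is a hypothetical stand-in for the conditional diffusion Transformer, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_theta, x0, z_c, rng):
    """One Monte Carlo sample of the flow-matching objective.

    v_theta : callable predicting a velocity from (z_n, t, z_c);
              hypothetical stand-in for the conditional DiT.
    x0      : clean latent of the colorized target.
    z_c     : condition latent of the binary silhouette.
    """
    t = rng.uniform(0.0, 1.0)               # timestep t ~ U[0, 1]
    eps = rng.standard_normal(x0.shape)     # noise eps ~ N(0, I)
    z_n = t * eps + (1.0 - t) * x0          # noisy latent interpolation
    target = eps - x0                       # velocity target dz_n/dt
    return np.mean((v_theta(z_n, t, z_c) - target) ** 2)

x0 = rng.standard_normal((4, 8, 8))   # dummy clean latent
z_c = rng.standard_normal((4, 8, 8))  # dummy condition latent

# Sanity check: since z_n - x0 = t * (eps - x0), this oracle predictor
# recovers the target exactly and drives the loss to ~0.
oracle = lambda z_n, t, z_c: (z_n - x0) / t
loss = flow_matching_loss(oracle, x0, z_c, rng)
```

The oracle check follows directly from the interpolation: \(z_n - x_0 = t(\varepsilon - x_0)\), so dividing by \(t\) reproduces the regression target.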
At inference, given a binary silhouette \(I\), the model generates a colorized segmentation \(I \xrightarrow{\text{seg}} \{\hat{V}_1, \ldots, \hat{V}_K\}\), where \(\hat{V}_k\) denotes the colorized rendering of the \(k\)-th semantic part. We extract binary masks by thresholding each color channel, providing clean semantic part segmentations. Each \(\hat{V}_k\) is enforced to be a single connected component; if a semantic object is split into multiple disjoint fragments by occlusion, the fragments are handled independently in the subsequent amodal completion stage.
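The mask-extraction step can be sketched as follows. This is a simplified NumPy version that groups exact color values as a stand-in for the per-channel thresholding described above; `masks_from_colorized` and its parameters are illustrative names:

```python
import numpy as np

def masks_from_colorized(img, min_area=1):
    """Split a colorized segmentation image into one binary mask per color.

    img : (H, W, 3) uint8 colorized rendering; white (255,255,255) is
          treated as background. Returns {color_tuple: boolean mask}.
    Simplified sketch: exact color grouping instead of channel thresholds.
    """
    flat = img.reshape(-1, 3)
    masks = {}
    for color in np.unique(flat, axis=0):
        c = tuple(int(v) for v in color)
        if c == (255, 255, 255):          # skip background
            continue
        m = np.all(img == color, axis=-1)
        if m.sum() >= min_area:           # drop tiny spurious color blobs
            masks[c] = m
    return masks

# Toy image with two colored parts on a white background.
img = np.full((4, 4, 3), 255, dtype=np.uint8)
img[:2, :2] = (255, 0, 0)
img[2:, 2:] = (0, 0, 255)
masks = masks_from_colorized(img)   # two 2x2 part masks
```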
Given the semantic part segmentations, we next recover the complete amodal shape of each part. We build upon pix2gestalt, a latent-diffusion-based amodal completion model originally designed for natural images, and fine-tune it on our custom icon-domain dataset to bridge the substantial domain gap.
The model takes as input an occluded image \(x_{\text{occ}}\) and its visible-region mask \(m_{\text{vis}}\). Conditioning is provided through two streams: (1) a CLIP image embedding \(c_{\text{clip}} = \text{CLIP}(x_{\text{occ}})\) encoding high-level semantics, and (2) a concatenated latent input \(\tilde{z}_t = \text{concat}(z_t, z_{\text{occ}}, m_{\text{vis}})\). The UNet iteratively denoises from random noise while attending to \(c_{\text{clip}}\), yielding the completed amodal shape \(x_{\text{whole}}\). We minimize the standard noise-prediction objective:
\[\mathcal{L} = \left\| \varepsilon - \hat{\varepsilon} \right\|_2^2, \quad \hat{\varepsilon} = \epsilon_\theta(\tilde{z}_t, t, c_{\text{clip}})\]
To handle cases where a single semantic object is split into multiple disjoint visible fragments, we train on each fragment independently while sharing the same ground-truth target, enforcing many-to-one completion behavior. At inference, we apply IoU-based merging with \(\tau = 0.7\): any pair of completed shapes \((A_i, A_j)\) whose IoU exceeds \(\tau\) is merged into a single amodal layer.
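The merging rule can be sketched as a greedy pairwise procedure. This is a minimal NumPy illustration of the IoU criterion above; the exact merging procedure used in the pipeline may differ:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def merge_completions(shapes, tau=0.7):
    """Greedily merge completed amodal shapes whose IoU exceeds tau.

    shapes : list of boolean masks. Pairs above the threshold are unioned
             until no pair qualifies (sketch of the merging rule).
    """
    shapes = [s.copy() for s in shapes]
    merged = True
    while merged:
        merged = False
        for i in range(len(shapes)):
            for j in range(i + 1, len(shapes)):
                if iou(shapes[i], shapes[j]) > tau:
                    shapes[i] |= shapes[j]   # union j into i
                    del shapes[j]
                    merged = True
                    break
            if merged:
                break
    return shapes
```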
Given the completed amodal masks \(\mathcal{A} = \{A_1, \ldots, A_K\}\), we determine a plausible layering order via integer linear programming (ILP). For each part \(k\), the completion process recovers extra pixels \(E_k = A_k \setminus I\) that should be occluded, and we define a fill region \(F_k\) as the solid area enclosed by the outermost contour of \(A_k\). For each ordered pair \((i, j)\), we introduce a binary variable \(x_{ij} = 1\) if part \(i\) is above part \(j\), with constraints \(x_{ij} + x_{ji} = 1\) and transitivity \(x_{ij} + x_{jk} + x_{ki} \le 2\). The objective balances rewarding correct occlusion of extra regions (\(y_i\)) and penalizing incorrect occlusion of visible regions (\(z_i\)):
\[\max_{x, y, z} \sum_i y_i - \lambda \sum_i z_i\]
where \(\lambda = 1\). The solved permutation \(\pi^*\) produces the ordered layers \(\{A_1^*, \ldots, A_K^*\}\), which are then vectorized into the final editable icon representation.
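Because icons contain few parts, the ordering objective can be illustrated by exhaustively scoring permutations rather than solving the ILP. The sketch below is a simplified per-pixel surrogate under stated assumptions: the indicator variables \(y_i, z_i\) are replaced by pixel counts, and \(V_k \cup E_k\) is used as a proxy for the fill region \(F_k\); `order_score` and `best_order` are illustrative names:

```python
import numpy as np
from itertools import permutations

def order_score(order, V, E, lam=1.0):
    """Score a top-to-bottom layer order against the ordering objective.

    order : part indices, first = topmost layer.
    V[k]  : visible mask of part k; E[k] : extra (should-be-occluded) pixels.
    Rewards extra pixels that end up occluded by layers above, and
    penalizes visible pixels that do (per-pixel surrogate of the ILP).
    """
    above = np.zeros_like(V[0], dtype=bool)   # union of layers already placed
    score = 0.0
    for k in order:
        score += np.logical_and(E[k], above).sum()        # correctly occluded extras
        score -= lam * np.logical_and(V[k], above).sum()  # wrongly occluded visibles
        above |= V[k] | E[k]   # proxy for the fill region F_k of this layer
    return score

def best_order(V, E, lam=1.0):
    """Exhaustive search over layer orders (feasible for small K)."""
    return max(permutations(range(len(V))), key=lambda o: order_score(o, V, E, lam))
```

For two parts where part 0's extra region lies beneath part 1's visible region, the search correctly places part 1 on top.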
We introduce two purpose-built datasets. SemLayer-Segmentation is constructed from two sources: (1) 4,920 real-world SVGs curated from LayerPeeler, filtered to contain 4–10 paths each with a single closed contour, abstracted by removing fills and rendering black strokes, followed by manual quality verification; and (2) 3,647 synthetic icons generated using GPT-4o for color-agnostic structural descriptions and gpt-image-1 for stroke-based icon synthesis, with human inspection for quality. In total, the segmentation dataset contains 8,567 training samples. SemLayer-Completion is constructed using the LayerPeeler collection: for each sample, two distinct icons are selected as object and occluder, with the occluder resized and positioned to create meaningful occlusions via parallelized rejection sampling. Each sample provides an occluded composite image, the full occluded object, and a binary visible-region mask, yielding 50,000 training triplets. A key advantage of our icon setting is that SVG layers are complete by construction, so occluded shapes serve as exact ground truth rather than heuristic approximations. For evaluation, we curate an additional set of 48 high-quality real-world SVG icons with rich semantic structures and meaningful part-level occlusions, ensuring no overlap with any training samples.
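The rejection-sampling step of the completion dataset can be sketched as follows. This is an illustrative NumPy version on binary masks: the shift range, the acceptance band `min_frac`/`max_frac`, and the function name are assumptions, not the paper's exact parameters:

```python
import numpy as np

def compose_occlusion(obj, occ, rng, min_frac=0.1, max_frac=0.5, tries=100):
    """Rejection-sample an occluder placement over an object icon.

    obj, occ : boolean masks of object and occluder (same H x W).
    Accepts a shifted occluder hiding between min_frac and max_frac of the
    object; returns (visible_mask, occluder_mask), or None if all tries fail.
    """
    h, w = obj.shape
    for _ in range(tries):
        dy = int(rng.integers(-h // 2, h // 2 + 1))
        dx = int(rng.integers(-w // 2, w // 2 + 1))
        shifted = np.zeros_like(occ)
        ys, xs = np.nonzero(occ)
        ys, xs = ys + dy, xs + dx
        keep = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
        shifted[ys[keep], xs[keep]] = True
        hidden = np.logical_and(obj, shifted).sum() / max(obj.sum(), 1)
        if min_frac <= hidden <= max_frac:   # meaningful partial occlusion
            return np.logical_and(obj, ~shifted), shifted
    return None
```

Because SVG layers are complete by construction, the un-occluded `obj` mask serves directly as exact amodal ground truth for the accepted composite.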
We evaluate SemLayer across two sequential stages. Stage 1 measures segmentation performance; Stage 2 measures completion quality with a fixed completion module applied to Stage 1 outputs.
| Segmentation Model | Seg. mIoU (%) \(\uparrow\) | Seg. PQ (%) \(\uparrow\) | Seg. mIoU\(_R\) (%) \(\uparrow\) | Seg. PQ\(_R\) (%) \(\uparrow\) | Comp. mIoU (%) \(\uparrow\) | Comp. CD (pix) \(\downarrow\) |
|---|---|---|---|---|---|---|
| gpt-image-1 | 25.4 | 6.20 | 57.2 | 39.3 | 60.9 | 71.4 |
| SAM2 | 51.1 | 26.2 | 62.2 | 37.8 | 69.2 | 61.7 |
| SAM2* | 79.3 | 59.4 | 85.3 | 78.0 | 80.7 | 49.1 |
| Ours | 84.3 | 76.1 | 86.4 | 78.3 | 85.2 | 46.6 |
With the segmentation model fixed, we evaluate different completion models with identical inputs.
| Completion Model | mIoU (%) \(\uparrow\) | CD (pix) \(\downarrow\) |
|---|---|---|
| gpt-image-1 | 10.7 | 98.6 |
| MP3D | 70.5 | 79.4 |
| MP3D-finetuned | 75.3 | 68.9 |
| Ours | 85.2 | 46.6 |
Recovering semantically structured layers enables applications that are difficult with flattened vector graphics. Our semantic-layered representation separates merged components into independent primitives, enabling local and semantically meaningful manipulation.