T2I Dataset Sizes
- Stock (lean_and_long): 1.2B rows (disco-feature-store, ~44GB per shard)
- Express Templates: 441.7K template rendering images
- Synthetic Text: 38.7M synthetic text overlays
- Augmented Design: 57K design templates
- Multipage: 637K multi-page documents
| Source | Rows | S3 Bucket | Config |
| --- | --- | --- | --- |
| Stock (lean_and_long gen6) | ~1.2B | disco-feature-store/features-main/lean_and_long/gen6/train/ | img_1024px_20251204.yaml |
| Express Template Rendering | 441,700 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Synthetic Text | 38,700,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Augmented Design | 57,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Multipage Combined | 637,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
Note: The 1.2B stock dataset is the backbone of T2I training. It lives in the disco-feature-store bucket as partitioned parquets (~44GB per shard), with images stored in QOI format on the mldp-image bucket. This is the same dataset used in production GR6 training; THD's 162K stock images are a tiny filtered subset of it. All images have SCAP v2 JSON captions.
IE Dataset Summary
- Total IE rows: 3.21M across 18 parquets (v3.0 + v3.1)
- Largest task: scs_filtered_batch1 (1.0M rows)
- Task types: 18 (depth, edge, colorize, ocr, ...)
- Caption model: Qwen3-32B (Nov 2025)
- vs THD IE: 78x larger (3.21M vs 41K THD)
[Chart: IE Dataset Size by Task (3.21M total)]

IE parquet schema:
| Field | Description |
| --- | --- |
| edit_instruction | Natural language editing instruction |
| reference_images | Array of source image S3 paths |
| target_image | Result image S3 path or hash |
| before_caption | SCAP v2 JSON of source |
| after_caption | SCAP v2 JSON of target |
| dataset_source | Task type identifier |
| ie_score / dists_score / edge_sim_score | Quality metrics |
| is_bad_caption / is_same_caption / same_ar | Filter flags |
Training filters: is_same_caption=False, is_bad_caption=False, length(after_caption)>500, same_ar=True
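The IE training filters above can be sketched as a simple row predicate. A minimal illustration over synthetic rows (field names follow the schema documented above; the values are made up, and real shards live on S3):

```python
# Synthetic stand-in rows for an IE parquet shard.
rows = [
    {"is_same_caption": False, "is_bad_caption": False,
     "after_caption": "x" * 600, "same_ar": True},   # passes every filter
    {"is_same_caption": True, "is_bad_caption": False,
     "after_caption": "x" * 600, "same_ar": True},   # fails is_same_caption
    {"is_same_caption": False, "is_bad_caption": False,
     "after_caption": "short", "same_ar": True},     # fails caption length
]

def keep(row):
    """IE training filters: is_same_caption=False, is_bad_caption=False,
    length(after_caption)>500, same_ar=True."""
    return (not row["is_same_caption"]
            and not row["is_bad_caption"]
            and len(row["after_caption"]) > 500
            and row["same_ar"])

filtered = [r for r in rows if keep(r)]
print(len(filtered))  # -> 1 (only the first row survives)
```

In the real pipeline the same predicate would be pushed down into the parquet scan rather than applied row by row.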
MultiRef Dataset Summary
- Total MultiRef rows: 721K across 10 glow_worm parquets
- Character consistency shots: 169K
- Layered (layered + augmented): 174K
- vs THD MultiRef: 180x larger (721K vs 4K THD)
[Chart: MultiRef Dataset Sizes (721K total)]

MultiRef parquet schema:
| Field | Description |
| --- | --- |
| reference_images | Array of reference image S3 paths (2-4 refs) |
| target_image | Composite result S3 path |
| edit_instruction | Detailed composition instruction |
| before_caption / after_caption | SCAP v2 JSON |
| num_reference | Number of reference images |
| dataset_source | Source dataset identifier |
S3 bucket: hz-os-data/glow-worm/editing/multi-reference/
Training filters: num_reference 2-4, edit_instruction IS NOT NULL, is_same_caption=False
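The MultiRef filters differ from the IE ones in that they are a range check plus a NULL check. A minimal sketch over synthetic rows (field names from the schema above; the instruction strings are invented):

```python
# Synthetic stand-in rows for a MultiRef parquet shard.
rows = [
    {"num_reference": 3, "edit_instruction": "place the lamp on the desk",
     "is_same_caption": False},                      # passes
    {"num_reference": 5, "edit_instruction": "merge the two scenes",
     "is_same_caption": False},                      # fails: >4 refs
    {"num_reference": 2, "edit_instruction": None,
     "is_same_caption": False},                      # fails: NULL instruction
]

# MultiRef filters: num_reference 2-4, edit_instruction IS NOT NULL,
# is_same_caption=False.
kept = [
    r for r in rows
    if 2 <= r["num_reference"] <= 4
    and r["edit_instruction"] is not None
    and not r["is_same_caption"]
]
print(len(kept))  # -> 1
```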
Training Strategy: The model uses a 50:50 GPU allocation (28 GPUs THD : 28 GPUs non-THD). The YAML comment mentions a "recommended 80-20" split, but the actual config allocates equal GPU groups per task. Non-THD data teaches the model general image understanding (depth, edges, colorization, object extraction, text editing), while THD data teaches Home Depot-specific product knowledge.
Key Insight: Despite the 50:50 GPU split, non-THD datasets are 78-180x larger by row count, so THD data is heavily cycled/repeated during training while non-THD data is sampled far more sparsely. Source: 20260330_lite_thd.yaml line 1847 + GPU group allocation lines 1758-1770 (branch: joshuamunoz/multiref).
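The cycling asymmetry can be made concrete with back-of-envelope arithmetic: under an equal GPU split with comparable sampling throughput per side, each THD example is revisited roughly (non-THD rows / THD rows) times as often as a non-THD example. Row counts below are taken from the comparison table in this doc; the "equal throughput" assumption is a simplification.

```python
# (THD rows, non-THD rows) per task, from the comparison table.
datasets = {
    "IE single-ref": (40_964, 3_208_657),
    "MultiRef":      (4_017, 721_129),
}

for name, (thd, non_thd) in datasets.items():
    # Approximate relative repeat rate of a THD example vs a non-THD example.
    print(f"{name}: THD examples cycled ~{non_thd / thd:.0f}x more often")
```

This reproduces the 78x and 180x ratios quoted above.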
| Metric | THD | Non-THD | Ratio |
| --- | --- | --- | --- |
| T2I Data | 224.7K (62K HD + 162K stock) | ~1.2B stock + 39M synthetic + 1M other | ~5,400x |
| IE Single-Ref | 40,964 | 3,208,657 | 78x |
| MultiRef | 4,017 | 721,129 | 180x |
| IE Task Types | 1 (THD video frames) | 18 (depth, edge, colorize, ocr, stylize...) | 18x variety |
| IE Sources | YT + Brightcove + Missions | Stock, MADA, Express, Abaka, OSVT | 5x+ sources |
| MultiRef Types | 1 (vendor scene+tool) | 6 (research, tryon, character, layered...) | 6x variety |
| S3 Buckets | foundry-thd, adobe-xingtail | clio, disco, hz-os, myedit, genie | More distributed |
| Caption Model | GPT-4 (March 2026) | Qwen3-32B (Nov 2025) | Different models |
| GPU Allocation | 28 GPUs (50%) | 28 GPUs (50%) | 1:1 equal |
| T2I GPUs | 12 GPUs (512+1024+2048) | 12 GPUs (512+1024+2048) | 1:1 |
| IE GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1 |
| MultiRef GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1 |
| Data Cycling | Heavy repeat (small data) | Sparse sampling (large data) | THD sees same data ~100x more |
| YAML Comment | "Use recommended 80-20 for THD vs non-THD" | Actual config is 50:50 | Mismatch |
Why adding THD data has outsized impact:
1. MultiRef is most starved: 4K THD vs 721K non-THD = 180x gap. Even 1K new THD triplets = 25% increase in THD multiref data.
2. IE has moderate gap: 41K THD vs 3.2M non-THD = 78x. THD IE frames are all from HD-specific video sources.
3. T2I is least starved: 225K THD-adjacent vs 1.2B general stock. More T2I data helps but marginal returns are lower.
4. Non-THD teaches general skills: depth estimation, edge detection, colorization, text editing, style transfer - THD data doesn't need to cover these.
5. THD uniqueness: Product placement in home scenes, tool-in-use compositions, branded product swaps - these can ONLY come from THD data.
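Point 1 above can be checked with a quick worked example: the same batch of new THD rows moves the needle very differently per task. Current THD counts are from the comparison table; the 1,000-row addition is hypothetical and purely illustrative.

```python
# Current THD row counts per task, from the comparison table.
current_thd = {"T2I": 224_700, "IE": 40_964, "MultiRef": 4_017}
new_rows = 1_000  # hypothetical batch of new THD rows per task

for task, n in current_thd.items():
    # Relative growth of the THD pool for that task.
    print(f"{task}: +{new_rows} rows = +{100 * new_rows / n:.1f}%")
```

MultiRef grows by ~25% (matching the "1K new THD triplets = 25% increase" claim above), IE by ~2.4%, and T2I by under half a percent, which is why MultiRef additions have the most leverage.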