T2I Dataset Sizes
- Stock (lean_and_long): 1.2B rows (disco-feature-store, ~44GB per shard)
- Express Templates: 441.7K template rendering images
- Synthetic Text: 38.7M synthetic text overlays
- Augmented Design: 57K design templates
- Multipage: 637K multi-page documents
| Source | Rows | S3 Bucket | Config |
| --- | --- | --- | --- |
| Stock (lean_and_long gen6) | ~1.2B | disco-feature-store/features-main/lean_and_long/gen6/train/ | img_1024px_20251204.yaml |
| Express Template Rendering | 441,700 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Synthetic Text | 38,700,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Augmented Design | 57,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
| Multipage Combined | 637,000 | clio.corp.adobe.com/njindal_parquet/scratch/ | img_1024px |
Note: The 1.2B stock dataset is the backbone of T2I training. It lives in the disco-feature-store bucket as partitioned parquets (~44GB per shard), with images stored in QOI format on the mldp-image bucket. This is the same dataset used in production GR6 training; THD's 162K stock images are a tiny filtered subset of it. All images have SCAP v2 JSON captions.
IE Dataset Summary
- Total IE rows: 3.21M across 18 parquets (v3.0 + v3.1)
- Largest task: scs_filtered_batch1 (1.0M rows)
- Task types: 18 (depth, edge, colorize, ocr, ...)
- Caption model: Qwen3-32B (Nov 2025)
- vs THD IE: 78x larger (3.21M vs 41K THD)
[Chart: IE Dataset Size by Task (3.21M total)]

IE parquet schema:
| Field | Description |
| --- | --- |
| edit_instruction | Natural language editing instruction |
| reference_images | Array of source image S3 paths |
| target_image | Result image S3 path or hash |
| before_caption | SCAP v2 JSON of source |
| after_caption | SCAP v2 JSON of target |
| dataset_source | Task type identifier |
| ie_score / dists_score / edge_sim_score | Quality metrics |
| is_bad_caption / is_same_caption / same_ar | Filter flags |
Training filters: is_same_caption=False, is_bad_caption=False, length(after_caption)>500, same_ar=True
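The IE training filters above can be sketched as a simple row predicate. A minimal illustration over synthetic rows (field names follow the schema documented above; the values are made up, and real shards live on S3):

```python
# Synthetic stand-in rows for an IE parquet shard.
rows = [
    {"is_same_caption": False, "is_bad_caption": False,
     "after_caption": "x" * 600, "same_ar": True},   # passes every filter
    {"is_same_caption": True, "is_bad_caption": False,
     "after_caption": "x" * 600, "same_ar": True},   # fails is_same_caption
    {"is_same_caption": False, "is_bad_caption": False,
     "after_caption": "short", "same_ar": True},     # fails caption length
]

def keep(row):
    """IE training filters: is_same_caption=False, is_bad_caption=False,
    length(after_caption)>500, same_ar=True."""
    return (not row["is_same_caption"]
            and not row["is_bad_caption"]
            and len(row["after_caption"]) > 500
            and row["same_ar"])

filtered = [r for r in rows if keep(r)]
print(len(filtered))  # -> 1 (only the first row survives)
```

In the real pipeline the same predicate would be pushed down into the parquet scan rather than applied row by row.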
MultiRef Dataset Summary
- Total MultiRef rows: 721K across 10 glow_worm parquets
- Character consistency shots: 169K
- Layered (layered + augmented): 174K
- vs THD MultiRef: 180x larger (721K vs 4K THD)
[Chart: MultiRef Dataset Sizes (721K total)]

MultiRef parquet schema:
| Field | Description |
| --- | --- |
| reference_images | Array of reference image S3 paths (2-4 refs) |
| target_image | Composite result S3 path |
| edit_instruction | Detailed composition instruction |
| before_caption / after_caption | SCAP v2 JSON |
| num_reference | Number of reference images |
| dataset_source | Source dataset identifier |
S3 bucket: hz-os-data/glow-worm/editing/multi-reference/
Training filters: num_reference 2-4, edit_instruction IS NOT NULL, is_same_caption=False
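The MultiRef filters differ from the IE ones in that they are a range check plus a NULL check. A minimal sketch over synthetic rows (field names from the schema above; the instruction strings are invented):

```python
# Synthetic stand-in rows for a MultiRef parquet shard.
rows = [
    {"num_reference": 3, "edit_instruction": "place the lamp on the desk",
     "is_same_caption": False},                      # passes
    {"num_reference": 5, "edit_instruction": "merge the two scenes",
     "is_same_caption": False},                      # fails: >4 refs
    {"num_reference": 2, "edit_instruction": None,
     "is_same_caption": False},                      # fails: NULL instruction
]

# MultiRef filters: num_reference 2-4, edit_instruction IS NOT NULL,
# is_same_caption=False.
kept = [
    r for r in rows
    if 2 <= r["num_reference"] <= 4
    and r["edit_instruction"] is not None
    and not r["is_same_caption"]
]
print(len(kept))  # -> 1
```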
Training Strategy: The model uses a 50:50 GPU allocation (28 GPUs THD : 28 GPUs non-THD). The YAML comment mentions a "recommended 80-20" split, but the actual config allocates equal GPU groups per task. Non-THD data teaches the model general image understanding (depth, edges, colorization, object extraction, text editing), while THD data teaches Home Depot-specific product knowledge.
Key Insight: Despite the 50:50 GPU split, non-THD datasets are 78-180x larger by row count, so THD data is heavily cycled/repeated during training while non-THD data is sampled far more sparsely. Source: 20260330_lite_thd.yaml line 1847 + GPU group allocation lines 1758-1770 (branch: joshuamunoz/multiref).
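The cycling asymmetry can be made concrete with back-of-envelope arithmetic: under an equal GPU split with comparable sampling throughput per side, each THD example is revisited roughly (non-THD rows / THD rows) times as often as a non-THD example. Row counts below are taken from the comparison table in this doc; the "equal throughput" assumption is a simplification.

```python
# (THD rows, non-THD rows) per task, from the comparison table.
datasets = {
    "IE single-ref": (40_964, 3_208_657),
    "MultiRef":      (4_017, 721_129),
}

for name, (thd, non_thd) in datasets.items():
    # Approximate relative repeat rate of a THD example vs a non-THD example.
    print(f"{name}: THD examples cycled ~{non_thd / thd:.0f}x more often")
```

This reproduces the 78x and 180x ratios quoted above.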
| Metric | THD | Non-THD | Ratio |
| --- | --- | --- | --- |
| T2I Data | 224.7K (62K HD + 162K stock) | ~1.2B stock + 39M synthetic + 1M other | ~5,400x |
| IE Single-Ref | 40,964 | 3,208,657 | 78x |
| MultiRef | 4,017 | 721,129 | 180x |
| IE Task Types | 1 (THD video frames) | 18 (depth, edge, colorize, ocr, stylize...) | 18x variety |
| IE Sources | YT + Brightcove + Missions | Stock, MADA, Express, Abaka, OSVT | 5x+ sources |
| MultiRef Types | 1 (vendor scene+tool) | 6 (research, tryon, character, layered...) | 6x variety |
| S3 Buckets | foundry-thd, adobe-xingtail | clio, disco, hz-os, myedit, genie | More distributed |
| Caption Model | GPT-4 (March 2026) | Qwen3-32B (Nov 2025) | Different models |
| GPU Allocation | 28 GPUs (50%) | 28 GPUs (50%) | 1:1 equal |
| T2I GPUs | 12 GPUs (512+1024+2048) | 12 GPUs (512+1024+2048) | 1:1 |
| IE GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1 |
| MultiRef GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1 |
| Data Cycling | Heavy repeat (small data) | Sparse sampling (large data) | THD sees same data ~100x more |
| YAML Comment | "Use recommended 80-20 for THD vs non-THD" | Actual config is 50:50 | Mismatch |
Why adding THD data has outsized impact:
1. MultiRef is most starved: 4K THD vs 721K non-THD = 180x gap. Even 1K new THD triplets = 25% increase in THD multiref data.
2. IE has moderate gap: 41K THD vs 3.2M non-THD = 78x. THD IE frames are all from HD-specific video sources.
3. T2I is least starved: 225K THD-adjacent vs 1.2B general stock. More T2I data helps but marginal returns are lower.
4. Non-THD teaches general skills: depth estimation, edge detection, colorization, text editing, style transfer - THD data doesn't need to cover these.
5. THD uniqueness: Product placement in home scenes, tool-in-use compositions, branded product swaps - these can ONLY come from THD data.
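Point 1 above can be checked with a quick worked example: the same batch of new THD rows moves the needle very differently per task. Current THD counts are from the comparison table; the 1,000-row addition is hypothetical and purely illustrative.

```python
# Current THD row counts per task, from the comparison table.
current_thd = {"T2I": 224_700, "IE": 40_964, "MultiRef": 4_017}
new_rows = 1_000  # hypothetical batch of new THD rows per task

for task, n in current_thd.items():
    # Relative growth of the THD pool for that task.
    print(f"{task}: +{new_rows} rows = +{100 * new_rows / n:.1f}%")
```

MultiRef grows by ~25% (matching the "1K new THD triplets = 25% increase" claim above), IE by ~2.4%, and T2I by under half a percent, which is why MultiRef additions have the most leverage.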