The data behind the directory.
Two views, side by side: the full population of 1,601,155 candidate pairs (everything the embedding can produce), and the curated 9,304 pairs surfaced as pages. The curated set is a deliberate editorial slice — these numbers tell you how it differs from the underlying population, and where the biases come from.
Where pairs sit, before any curation
Across all candidate pairs, 99.04% are neutral (no strong signal either way). The "interesting" pairs — classic, complementary, substitute — together account for just 0.96%. The curated set below cherry-picks from these.
After stratified ranking, the curated set deliberately over-represents classic and complementary pairs. This is an editorial choice, not a property of the data.
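One way to picture a stratified cut is a per-quadrant quota followed by a top-N selection within each quadrant. The sketch below is illustrative only: the page does not describe the actual ranking procedure, so the `stratified_top` helper, the tuple layout, and the quota values are all assumptions.

```python
from collections import defaultdict

def stratified_top(pairs, quotas):
    """Pick the highest-scoring pairs per quadrant, up to a quota each.

    pairs  : iterable of (name, quadrant, score) tuples (hypothetical layout)
    quotas : dict mapping quadrant -> maximum number of pairs to keep
    """
    by_quadrant = defaultdict(list)
    for pair in pairs:
        by_quadrant[pair[1]].append(pair)

    selected = []
    for quadrant, limit in quotas.items():
        # Rank within the stratum only, so a large stratum cannot crowd out a small one.
        ranked = sorted(by_quadrant.get(quadrant, []), key=lambda p: p[2], reverse=True)
        selected.extend(ranked[:limit])
    return selected
```

Because each quadrant is ranked separately, raising the quota for one quadrant changes the mix of the curated set without touching the underlying scores.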
Min / median / p95 / p99 / max for each cosine measure. The population row is the ground truth; the curated row shows how aggressive the cut is. Negative cosines occur in all three populations, and the curated minimum is still negative for Blended core and Tastes like chem. Most users never see them because curated scores sit far above the population median.
| Sibling | Set | Min | Median | p95 | p99 | Max |
|---|---|---|---|---|---|---|
| Cooks with (cooc) | Population | -0.180 | 0.091 | 0.221 | 0.289 | 0.642 |
| Cooks with (cooc) | Curated | 0.040 | 0.336 | 0.415 | 0.465 | 0.642 |
| Blended (core) | Population | -0.088 | 0.336 | 0.531 | 0.636 | 0.931 |
| Blended (core) | Curated | -0.031 | 0.625 | 0.795 | 0.843 | 0.931 |
| Tastes like (chem) | Population | -0.139 | 0.107 | 0.234 | 0.335 | 0.810 |
| Tastes like (chem) | Curated | -0.084 | 0.381 | 0.568 | 0.654 | 0.810 |
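A row of this table can be reproduced from a list of scores with a small helper. This is a minimal sketch, assuming the scores for one sibling are available as a plain Python list; `summarize` and the nearest-rank percentile rule are illustrative choices, and the page's build may use a different interpolation.

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list; p is an integer 0..100."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    # Integer ceiling of p * n / 100, clamped to at least rank 1.
    rank = max(1, (p * len(sorted_vals) + 99) // 100)
    return sorted_vals[rank - 1]

def summarize(scores):
    """Return the min / median / p95 / p99 / max summary used in the table."""
    s = sorted(scores)
    return {
        "min": s[0],
        "median": percentile(s, 50),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "max": s[-1],
    }
```

Running `summarize` once on the population scores and once on the curated scores gives the two rows for a sibling.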
2,000 of the curated pairs plotted on (cooks-with, tastes-like). Sample is deterministic across builds. Dashed lines mark the quadrant thresholds (cooc ≥ 0.30, chem ≥ 0.40). The full 1.6M population would be one dense blob near the origin — this is the editorial selection on top.
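The two thresholds split the plane into four quadrants. The sketch below uses the thresholds stated above; the mapping of quadrants to labels is inferred from the verdicts shown elsewhere on this page (high on both axes reads as Classic, high cooc only as Complementary) and the Substitute assignment for the remaining quadrant is an assumption. The `quadrant` function name is hypothetical.

```python
COOC_MIN = 0.30  # "cooks with" threshold from the caption
CHEM_MIN = 0.40  # "tastes like" threshold from the caption

def quadrant(cooc, chem):
    """Assign a pair to a quadrant from its two cosine scores."""
    high_cooc = cooc >= COOC_MIN
    high_chem = chem >= CHEM_MIN
    if high_cooc and high_chem:
        return "Classic"
    if high_cooc:
        return "Complementary"
    if high_chem:
        return "Substitute"  # assumed label for high chem, low cooc
    return "Neutral"
```

Scores displayed on the page are rounded to two decimals, so a displayed value that sits exactly on a threshold may fall on either side once the unrounded score is used.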
Score distributions — population (top row) vs curated (bottom row)
Same x-axis range in each column so you can read the curation cut directly. Each sibling has its own range because the geometries differ. Out-of-range counts: cooc 0 · core 0 · chem 0.
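Cooks with (cooc), population: all 1,601,155 pairs. Range [-0.20, 0.65].
Cooks with (cooc), curated: 9,304 pairs. Same range, so heights compare directly.
Blended (core), population: all 1,601,155 pairs. Range [-0.10, 0.95].
Blended (core), curated: 9,304 pairs. Same range, so heights compare directly.
Tastes like (chem), population: all 1,601,155 pairs. Range [-0.15, 0.85].
Tastes like (chem), curated: 9,304 pairs. Same range, so heights compare directly.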
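Sharing one x-axis range between the population and curated panels amounts to binning both score lists against the same fixed edges and counting anything that falls outside. A minimal sketch, assuming equal-width bins; the `histogram` helper and the bin count are illustrative, not taken from the page's build.

```python
def histogram(scores, lo, hi, bins=20):
    """Count scores into equal-width bins over [lo, hi].

    Returns (counts, out_of_range), where out_of_range is the number of
    scores that fall outside the fixed range and are left out of the plot.
    """
    if bins <= 0 or hi <= lo:
        raise ValueError("need bins > 0 and hi > lo")
    counts = [0] * bins
    out_of_range = 0
    width = (hi - lo) / bins
    for x in scores:
        if x < lo or x > hi:
            out_of_range += 1
            continue
        # A score exactly equal to hi belongs to the last bin.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return counts, out_of_range
```

Calling it with the same `lo` and `hi` for both sets keeps the bin edges aligned, and a zero `out_of_range` confirms that the chosen range covers every score.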
Number of curated pairs each ingredient appears in. NOT a measure of 'culinary centrality' — it's a measure of which ingredients survived the stratified top-9,304 cut. Asian ingredients dominate because the source corpus has dense cooc clusters around Asian dishes; reranking the curation tier would shift this list.
The 543 flavor modes
Median 81 members; range 13–254.
Broad themes the embedding settled on.
- Processed deli meats cheeses and condiments: 254
- Mediterranean savory herbs and cheeses: 243
- Chinese savory pantry staples: 240
- East Asian savory pantry staples: 239
- Mediterranean savory pantry staples: 222
- Mediterranean savory pantry staples: 218
- East Asian savory pantry staples: 211
- East Asian processed condiments and sauces: 209
Tight clusters — often near-perfect substitution sets.
The paper labels each mode with an axis: either a continuous flavor property (sweet_score, nova_level, fatty_score…) or one of the emergent factor axes (F_0, F_3, …). Count per property tells you which axes were most prolific in the clustering.
Hand-curated list of obvious classic pairings. Each row shows the actual cosine scores and the quadrant the model assigns. Look for surprises — these reveal where the embedding's recipe corpus is weak.
| Canon pair | Cooks with | Tastes like | Blended | Verdict |
|---|---|---|---|---|
| Tomato + Basil | 0.32 | 0.15 | 0.47 | Complementary |
| Lemon + Garlic | 0.26 | 0.13 | 0.45 | Neutral |
| Beef + Rosemary | 0.22 | 0.06 | 0.39 | Neutral |
| Chocolate + Strawberry | 0.31 | 0.32 | 0.38 | Complementary |
| Peanut + Chocolate | 0.15 | 0.18 | 0.35 | Neutral |
| Apple + Cinnamon | 0.39 | 0.22 | 0.46 | Complementary |
| Coffee + Cream | 0.34 | 0.34 | 0.31 | Complementary |
| Bacon + Egg | 0.11 | 0.24 | 0.44 | Neutral |
| Lime + Cilantro | Not in vocabulary | | | |
| Honey + Thyme | 0.26 | 0.10 | 0.31 | Neutral |
| Tomato + Mozzarella cheese | 0.27 | 0.23 | 0.51 | Neutral |
| Chicken + Lemon | 0.14 | 0.09 | 0.28 | Neutral |
| Pork + Apple | 0.10 | 0.10 | 0.36 | Neutral |
| Miso + Ginger | 0.18 | 0.22 | 0.46 | Neutral |
| Soy sauce + Ginger | 0.47 | 0.23 | 0.55 | Complementary |
| Vanilla + Chocolate | 0.49 | 0.40 | 0.51 | Complementary |
| Mint + Lamb | 0.25 | 0.18 | 0.42 | Neutral |
| Fish + Lemon | 0.22 | 0.13 | 0.42 | Neutral |
| Onion + Garlic | 0.45 | 0.53 | 0.73 | Classic |
| Olive oil + Garlic | 0.40 | 0.28 | 0.69 | Complementary |
| Dill + Salmon | 0.27 | 0.17 | 0.39 | Neutral |
| Caraway + Rye flour | Not in vocabulary | | | |
| Fennel + Sausage | 0.13 | 0.13 | 0.44 | Neutral |
| Maple syrup + Bacon | 0.16 | 0.22 | 0.49 | Neutral |
| Banana + Peanut | 0.20 | 0.06 | 0.34 | Neutral |
The corpus is heavily East-Asian-weighted (39% of recipes). Western European classics like lemon+garlic and beef+rosemary score weaker than intuition expects — this is the corpus bias showing through, not a model failure.
Pearson r over the full 1.6M pair population. Higher = the two siblings say similar things about pair strength. Lower = the three-lens framing has real signal.
All three correlate moderately (0.54–0.62) — they don't collapse into one another, which justifies showing all three on every pair page. Blended (Core) correlates highest with both, as expected: it's designed to blend the signals.
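For reference, Pearson r between two siblings is the covariance of their scores divided by the product of their standard deviations. A minimal standard-library sketch, assuming the two score lists are aligned so that position i refers to the same pair in both; `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)
```

Computing it for each of the three sibling pairs (cooc vs chem, cooc vs core, chem vs core) yields the three coefficients discussed above.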
Outlier pairs in the curated set
Often near-substitutes: things that always appear together because they're regional cousins.
The strongest aroma-chemistry matches in the curated set.
Largest |cooc − chem| gap. Where one lens says yes and the other says no.
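The disagreement ranking is a sort on the absolute difference between the two scores. A minimal sketch; the `largest_gaps` helper and the tuple layout are assumptions, and the sample rows reuse scores from the canon-pair table purely as illustration rather than the actual outlier list.

```python
def largest_gaps(pairs, k=10):
    """Return the k pairs with the largest |cooc - chem| gap.

    pairs: iterable of (name, cooc, chem) tuples (hypothetical layout).
    """
    return sorted(pairs, key=lambda p: abs(p[1] - p[2]), reverse=True)[:k]

# Illustrative rows only, taken from the canon-pair table on this page.
CANON_SAMPLE = [
    ("Soy sauce + Ginger", 0.47, 0.23),
    ("Vanilla + Chocolate", 0.49, 0.40),
    ("Onion + Garlic", 0.45, 0.53),
    ("Apple + Cinnamon", 0.39, 0.22),
]

top_two = largest_gaps(CANON_SAMPLE, k=2)
```

Here `top_two` holds Soy sauce + Ginger (gap 0.24) followed by Apple + Cinnamon (gap 0.17), both cases where the cooking lens is stronger than the chemistry lens.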
The Epicure corpus is heavily skewed: East Asian recipes make up ~38% of the source data. This shapes every cooc-based statistic on this page.
| Region | Recipes | Share | Modes | Traditions |
|---|---|---|---|---|
| East Asian | 1,549,034 | 67.5% | 77 | Chinese, Korean |
| Western Atlantic | 198,086 | 8.6% | 7 | American, British, German, Scandinavian |
| Mediterranean | 164,107 | 7.1% | 41 | Italian, French, Iberian, Greek, Levantine, North African, Turkish |
| Eastern European | 154,479 | 6.7% | 1 | Russian, Ukrainian, Polish, Hungarian, Georgian |
| Southeast Asian | 107,964 | 4.7% | 6 | Thai, Vietnamese, Filipino, Indonesian, Malay |
| South Asian | 47,462 | 2.1% | 8 | Indian, Pakistani, Sri Lankan, Bangladeshi |
| Latin American | 40,618 | 1.8% | 14 | Mexican, Caribbean, Brazilian, Peruvian, Colombian |
| Japanese | 33,923 | 1.5% | 2 | Japanese |
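The Share column is each region's recipe count divided by the total of the rows listed in this table. The sketch below recomputes it from the Recipes column; the variable names are illustrative.

```python
# Recipe counts copied from the Recipes column of the table above.
RECIPES = {
    "East Asian": 1_549_034,
    "Western Atlantic": 198_086,
    "Mediterranean": 164_107,
    "Eastern European": 154_479,
    "Southeast Asian": 107_964,
    "South Asian": 47_462,
    "Latin American": 40_618,
    "Japanese": 33_923,
}

total = sum(RECIPES.values())  # 2,295,673 recipes across the listed regions
shares = {region: round(100 * count / total, 1) for region, count in RECIPES.items()}
```

With these counts, `shares["East Asian"]` comes out at 67.5 and `shares["Japanese"]` at 1.5, matching the table.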
Rule-based classifier (regex over canonical names). 736 of 1,790 ingredients (41%) didn't match any category because the heuristics are conservative, so this figure understates real coverage.
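A conservative regex classifier of this kind can be sketched as an ordered list of patterns where the first match wins and everything else stays unmatched. The categories and patterns below are hypothetical examples, not the rules used to produce the figures above.

```python
import re

# Hypothetical rules: (category, compiled pattern). Word boundaries keep the
# matching conservative, so "boiled" does not trigger the "oil" rule.
RULES = [
    ("cheese", re.compile(r"\bcheese\b")),
    ("oil", re.compile(r"\boil\b")),
    ("flour", re.compile(r"\bflour\b")),
]

def classify(name):
    """Return the first matching category for a canonical name, or None."""
    lowered = name.lower()
    for category, pattern in RULES:
        if pattern.search(lowered):
            return category
    return None

def unmatched_count(names):
    """Count names that no rule matched."""
    return sum(1 for name in names if classify(name) is None)
```

Names such as "Basil" carry no category keyword and return `None`, which is how a keyword-based rule set can leave a large share of the vocabulary unmatched.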
Distribution of how many curated pairs each ingredient appears in. Of 1,790 ingredients, 1,671 appear in at least one curated pair; 119 are absent from the curated set (they exist in the vocab but no pair survived the cut).
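Both numbers come from one pass over the curated pairs: count how often each ingredient appears, then list the vocabulary entries that never appear. A minimal sketch, assuming pairs are stored as two-element tuples of ingredient names; `pair_degrees` is an illustrative name.

```python
from collections import Counter

def pair_degrees(pairs, vocab):
    """Count curated pairs per ingredient and list ingredients with none.

    pairs : iterable of (ingredient_a, ingredient_b) tuples
    vocab : iterable of all ingredient names
    """
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    absent = [name for name in vocab if name not in degree]
    return degree, absent
```

Applied to the curated set, the length of `absent` would correspond to the 119 ingredients reported above, and the values of `degree` give the plotted distribution.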
Mode cohesion — how tight are the clusters?
Mean intra-mode cosine across all member pairs. Higher = members really do cluster together. Range: 0.10 (loosest single mode) to 0.61 (tightest); median 0.20.
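Mean intra-mode cosine can be computed by averaging the cosine similarity over every unordered pair of members. A minimal sketch, assuming each member is a non-zero embedding vector given as a list of floats; `mode_cohesion` is an illustrative name.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mode_cohesion(vectors):
    """Mean cosine over all unordered member pairs of one mode."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("a mode needs at least two members")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

The number of pairs grows quadratically with mode size, so a mode with 254 members contributes 32,131 cosines to its average.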
Members are very similar in the mode's own sibling.
- core · East Asian soy and spice staples: 0.61
- core · Aged and semi-aged cheeses: 0.60
- core · Chinese savory pantry staples: 0.59
- core · Chinese savory umami pantry: 0.58
- core · South Asian spice blends and seeds: 0.58
- core · Latin American dried chiles: 0.57
- core · East and Southeast Asian noodles and condiments: 0.57
- core · Chinese braising pantry ingredients: 0.57
Members are nominally clustered but spread out — broader themes or noisier groupings.
- cooc · East Asian roots and exotic mushrooms: 0.10
- chem · Wholesome seeds and natural sweeteners: 0.11
- cooc · Savory-sweet pantry staples and fortified beverages: 0.12
- cooc · Processed convenience pantry products: 0.12
- cooc · East Asian herbal teas and cooling desserts: 0.12
- cooc · Savory condiments and spiced sauces: 0.12
- chem · Earthy grain and seafood staples: 0.12
- cooc · Sweet dessert and confection bases: 0.12
What these numbers don't tell you
- Recipe-language bias: the corpus is multilingual but uneven. Chinese and other Asian recipes are over-represented in the dense cooc clusters, which inflates Asian ingredients in every cooc-driven stat above.
- Ingredient-cut bias: the vocab uses base entries (chicken, beef, pork) without cut-level subdivisions. "Chicken breast" and "chicken thigh" both fold into "chicken." Cuisines with fine-grained cut vocabulary lose detail.
- Chemistry coverage: Chem is from FlavorDB, which catalogs aroma compounds for ~1,500 ingredients globally. Anything outside FlavorDB has a sparse Chem profile — particularly fermented and processed ingredients.
- No quantitative recipe weight: the cooc graph treats "1 cup of garlic" and "1 clove of garlic" identically. It counts co-presence, not relative importance.
- Quadrant thresholds are pragmatic: cooc ≥ 0.30, chem ≥ 0.40 are calibrated to ~70th percentiles per axis, not derived from theory. Adjusting them reshapes every quadrant count.
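The co-presence caveat is easiest to see in code. The sketch below counts how many recipes contain both ingredients of a pair and discards quantities entirely; it assumes each recipe is a list of ingredient names and is not the pipeline that built the cooc graph. The `cooccurrence_counts` name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(recipes):
    """Count recipes in which each unordered ingredient pair co-occurs.

    recipes: iterable of ingredient-name lists. Quantities are not part of
    the input, and repeated mentions within one recipe count once.
    """
    counts = Counter()
    for ingredients in recipes:
        present = sorted(set(ingredients))  # presence only, in a stable order
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts
```

A recipe with a cup of garlic and a recipe with a single clove each add exactly one to every pair that includes garlic.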