Search and Ranking Systems

Staff Engineer @ BloomReach — Oct 2013 to Mar 2018
Location: Bengaluru, India

Patents and Blog

Overview
Led the design and implementation of an automated synonym generation pipeline for BloomReach Discovery, processing 100 million+ product descriptions and 30 million+ queries weekly to extract high-quality synonym pairs for e-commerce search. As deep learning started emerging, designed deep NNs to match user queries with document text & images.

E-commerce search faces messy, short queries (“nj devils tees”), shifting language (“jorts”), and ambiguity (“frozen” the temperature vs. Frozen the movie). Synonyms aren’t just dictionary swaps; they’re a controlled expansion of user intent.

Goal. Bridge query–catalog mismatch without flooding results. Boost recall while preserving precision and rank quality.

Synonym Types.

Variants: spelling/lemma/plural (“adapter ↔ adaptor”, “shoe ↔ shoes”).
Equivalents: true alternates (“graph paper ↔ grid paper”).
Hierarchy: hypernym/hyponym (“desktop pc ↔ computers”).
Backstops: closely related substitutes when inventory is thin (“paper plates ↔ plastic plates”).

Data Scale. Weekly mining over ~100M product descriptions, billions of web lines, and ~30M queries to capture new vocabulary, trends, and seasonality.

Pipeline (Mining, not per-query).

Candidate extraction: phrase pairs from catalog, crawl, and logs.
Representations:
- Context windows / embeddings (“a word by the company it keeps”).
- Behavioral signals: clicked/viewed/bought product distributions after a query.
- Document vectors: TF-IDF-weighted phrase contexts.
- Session neighbors: queries issued together or sequentially.
Similarity scoring:
- Symmetric measures (e.g., Jensen–Shannon divergence on click dists) for equivalence.
- Asymmetric measures for hypernym/hyponym directionality.
- Heuristics (edit distance, morphology) to nudge borderline cases.
Decisioning:
- Phrase-specific thresholds (relative to known baselines like “shoe↔shoes”).
- Supervised model over multi-score features to grade and type pairs.
- Guardrails so low-confidence expansions don’t bubble to the top after sorts (price/popularity).

Application.

Build a typed synonym graph (variant/equivalent/hierarchy/related) with confidence scores.
Apply class-specific rules at query time to expand cautiously.
Refresh weekly to reflect new products, slang, and events.

Outcomes.

Higher findability and better long-tail recall with minimal precision loss.
Resilient to language drift; fewer manual thesaurus edits.
Honest note: Like any ML system, it has “good and bad days,” so monitoring and iteration matter.