Building a Unified Multimodal Content Engine: A Step-by-Step Guide for Travel Platforms

Introduction

In the competitive travel industry, connecting visual content, such as hotel images, with textual reviews lets users discover properties through both what they see and what other guests have experienced. Agoda, for example, has built a multimodal content system that unifies over 700 million images and multilingual guest reviews under a shared topic taxonomy, so that search and exploration work across both kinds of content. In this guide, you’ll learn how to recreate such a system, step by step, with offline enrichment and low-latency serving.

Source: www.infoq.com

What You Need

Step-by-Step Guide

Step 1: Define a Unified Topic Taxonomy

Begin by creating a shared topic taxonomy that bridges images and reviews. This taxonomy should cover the key aspects of a hotel experience, such as cleanliness, the pool, or room atmosphere.

Each topic should be definable in both visual and textual terms. For example, “cleanliness” might be detected in images via object recognition (e.g., a tidy bed) and in reviews via keyword extraction (e.g., “spotless”). Use domain experts and A/B testing to validate your taxonomy.
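One way to make the "definable in both terms" requirement concrete is to store each topic with a visual definition and a textual one, and to flag topics that lack either. The sketch below is illustrative only: the `Topic` class, the `TAXONOMY` entries, and the `undefined_modalities` helper are assumptions made for this example, not part of any published system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """One taxonomy entry, defined in both visual and textual terms."""
    name: str
    visual_labels: frozenset[str]   # labels an image model might emit (hypothetical)
    text_keywords: frozenset[str]   # words or phrases that signal the topic in reviews


# Illustrative entries only; a real taxonomy would come from domain experts.
TAXONOMY = {
    "cleanliness": Topic(
        name="cleanliness",
        visual_labels=frozenset({"tidy bed"}),
        text_keywords=frozenset({"spotless"}),
    ),
    "pool": Topic(
        name="pool",
        visual_labels=frozenset({"pool"}),
        text_keywords=frozenset({"great pool"}),
    ),
}


def undefined_modalities(topic: Topic) -> list[str]:
    """Return the modalities in which a topic has no definition yet."""
    missing = []
    if not topic.visual_labels:
        missing.append("visual")
    if not topic.text_keywords:
        missing.append("textual")
    return missing
```

Running `undefined_modalities` over every entry before release is a cheap way to catch topics that would be reachable from only one modality.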

Step 2: Ingest and Preprocess Multimodal Data with Offline Enrichment

Aggregate all hotel images and multilingual reviews into a unified data lake. For images, deduplicate and resize to a standard format. For reviews, detect the language, translate where necessary, and tokenize. Then run an offline enrichment pipeline that applies your taxonomy to each data point, tagging every image and review with the topics it covers.

Store enriched metadata alongside raw data for indexing.
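The preprocessing and enrichment described in this step might look roughly like the following. This is a deliberately small sketch: it removes only byte-identical images (near-duplicate detection would need perceptual hashing or embeddings), it tags English text only, and the `TOPIC_KEYWORDS` table and both function names are hypothetical stand-ins for a real taxonomy and pipeline.

```python
import hashlib
import re

# Hypothetical keyword lists per topic; in practice these come from the taxonomy.
TOPIC_KEYWORDS = {
    "cleanliness": ("spotless", "clean"),
    "pool": ("pool",),
}


def dedupe_images(images: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first image id seen for each distinct byte content."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for image_id, content in images.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[image_id] = content
    return unique


def enrich_review(review_id: str, text: str) -> dict:
    """Attach taxonomy topics to a review using whole-word keyword matching."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    topics = sorted(
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(keyword in tokens for keyword in keywords)
    )
    # Raw text and enriched metadata are kept side by side for indexing.
    return {"id": review_id, "text": text, "topics": topics}
```

Because this runs offline, a slower but more accurate tagger (for instance a classifier instead of keyword matching) could replace `enrich_review` without affecting query latency.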

Step 3: Align Features Across Modalities

To enable cross-modal retrieval, images and reviews must be represented in a shared embedding space. Train or fine-tune a multimodal encoder (e.g., CLIP or ViLBERT) on your enriched dataset so that image embeddings and text embeddings for the same topic cluster together. A common way to do this is a two-tower architecture, the design CLIP itself uses, in which images and reviews are encoded separately and trained with a contrastive loss on matching image-review pairs (e.g., images of a pool and reviews mentioning “great pool”). This step ensures that a search for “cozy atmosphere” can retrieve both images of cozy rooms and reviews describing a cozy vibe.
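To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of matching image-review pairs, where row i of each input is assumed to form a pair. It shows the loss only, not the encoders or the training loop; a real system would compute this in a deep learning framework, and the `temperature` default is an arbitrary illustrative value.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sims[i][j] is the scaled similarity between image i and review j.
    sims = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Each image should pick out its own review among all reviews in the batch...
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # ...and each review should pick out its own image among all images.
    text_to_image = sum(
        cross_entropy([sims[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

The loss is small when each image is most similar to its own review and large when the pairs are mismatched, which is what pulls same-topic images and reviews together in the shared space.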

Step 4: Build a Unified Index for Multimodal Retrieval

Construct an index that supports querying across both images and reviews using the shared embedding space.

Ensure the index scales to the full corpus (over 700 million images plus the corresponding reviews). Partition by geography or hotel cluster so that each query searches only a fraction of the data, which reduces latency.
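The partitioning idea can be sketched as an index that stores image and review embeddings together but groups them by a partition key, so a query scans one partition rather than the whole corpus. The `PartitionedIndex` class below is a hypothetical illustration that uses exact brute-force cosine search; at the scale discussed here, each partition would need an approximate nearest-neighbor structure instead.

```python
import heapq
import math
from collections import defaultdict


class PartitionedIndex:
    """Brute-force cosine search over items grouped by a partition key."""

    def __init__(self):
        # partition key -> list of (item_id, modality, unit_vector)
        self._partitions = defaultdict(list)

    @staticmethod
    def _unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, partition, item_id, modality, embedding):
        """Store an image or review embedding under its partition."""
        self._partitions[partition].append(
            (item_id, modality, self._unit(embedding))
        )

    def search(self, partition, query, k=5):
        """Return up to k (score, item_id, modality) tuples, best first."""
        q = self._unit(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), item_id, modality)
            for item_id, modality, vec in self._partitions.get(partition, [])
        )
        return heapq.nlargest(k, scored)
```

Since images and reviews share one embedding space, a single `search` call returns both modalities ranked together; the `modality` field lets the caller split or interleave them for display.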

Step 5: Implement Low-Latency Serving

Deploy a serving layer that handles real-time user queries with sub-second latency. Because enrichment runs offline (Step 2), the request path only needs to embed the query and look it up in the index.
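One technique that can help keep the request path fast is memoizing query embeddings, since popular queries tend to repeat. Caching is an assumption of this sketch rather than something the guide prescribes, and `cached_embedding` uses a toy letter-count vector as a placeholder for a real encoder call. The `timed` helper simply checks a call against the sub-second budget.

```python
import time
from functools import lru_cache

LATENCY_BUDGET_SECONDS = 1.0  # the "sub-second" target


@lru_cache(maxsize=10_000)
def cached_embedding(query: str) -> tuple[float, ...]:
    """Placeholder for an expensive encoder call; results are memoized per query."""
    counts = [0.0] * 26
    for ch in query.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return tuple(counts)


def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds, within_budget)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < LATENCY_BUDGET_SECONDS
```

In production the per-request timings would typically be exported to a monitoring system and tracked as percentiles rather than checked one call at a time.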

Step 6: Validate and Iterate

Test your system with real user interactions, and decide up front which metrics you will track.

Continuously update your taxonomy as travel trends change (e.g., post-pandemic emphasis on hygiene). Retrain models periodically with new data and feedback loops (e.g., user clicks as positive signals).
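If user clicks are treated as positive signals, two simple metrics can be computed directly from search logs. The choice of click-through rate and mean reciprocal rank here is an assumption made for illustration, as is the session format (a dict with the `shown` results in rank order and the `clicked` item ids).

```python
def click_through_rate(sessions):
    """Share of search sessions in which the user clicked at least one result."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["clicked"]) / len(sessions)


def mean_reciprocal_rank(sessions):
    """Average of 1/rank of the first clicked result; no click counts as 0."""
    if not sessions:
        return 0.0
    total = 0.0
    for s in sessions:
        clicked = set(s["clicked"])
        for rank, item in enumerate(s["shown"], start=1):
            if item in clicked:
                total += 1.0 / rank
                break
    return total / len(sessions)
```

Comparing these numbers before and after a taxonomy change or a model retrain gives a rough signal of whether the update helped.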

Tips for Success
