GEM: Optimized Framework for Engaging Multimodal Digital Content Generation

NU 2023-124

INVENTORS

  • Venkatramanan Subrahmanian* (McCormick School of Engineering, Computer Science)
  • Chongyang Gao
  • Natalia Denisenko
  • Soroush Vosoughi
  • Yiren Jian

SHORT DESCRIPTION

For digital advertisers and social media platforms, GEM generates engaging image-text pairs using an iterative, engagement-guided algorithm. It improves image-text coherence, boosts user engagement, and streamlines content production through automation.

BACKGROUND

Online advertising struggles with content that fails to capture user attention. Most current generation systems optimize for image realism or text fluency rather than for engagement. This gap limits campaign impact and raises production costs.

ABSTRACT

GEM is a framework that produces engaging multimodal image-text pairs. It combines a pre-trained engagement discriminator with continuous prompt learning on Stable Diffusion models. The system iteratively refines its outputs, using a language model and CLIP similarity to keep image and text aligned. Experimental results and human evaluations show improved engagement and image-text alignment compared with existing methods. GEM offers a practical solution for creating captivating digital content.
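
For illustration only, the sketch below shows one way the engagement-guided loop described above could be organized; it is not the claimed implementation. The caption generator, image generator, and engagement discriminator are caller-supplied placeholders, and only the CLIP image-text alignment score is computed concretely (via the Hugging Face transformers library).

    # Illustrative sketch of an engagement-guided refinement loop.
    # Assumptions: gen_text, gen_image, and engagement are caller-supplied
    # stand-ins for the language model, diffusion model, and pre-trained
    # engagement discriminator; only the CLIP scoring is concrete here.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    _clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    _proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_alignment(image, text) -> float:
        """Cosine similarity between CLIP image and text embeddings."""
        inputs = _proc(text=[text], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = _clip(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return float((img * txt).sum())

    def refine(topic, gen_text, gen_image, engagement, rounds=5, alpha=0.5):
        """Return the highest-scoring (image, text) pair over several proposals.

        gen_text(topic) -> caption, gen_image(caption) -> PIL image, and
        engagement(image, text) -> float are placeholders for the generative
        models and engagement discriminator described in the abstract.
        """
        best, best_score = None, float("-inf")
        for _ in range(rounds):
            text = gen_text(topic)      # propose (or revise) a caption
            image = gen_image(text)     # render a matching image
            score = engagement(image, text) + alpha * clip_alignment(image, text)
            if score > best_score:
                best, best_score = (image, text), score
        return best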

MARKET OPPORTUNITY

The burgeoning market for generative AI in marketing and advertising is valued at $2.8 billion in 2025 and is projected to surge to $12.5 billion by 2030, expanding at a compound annual growth rate (CAGR) of 34.9%. This growth is fueled by the intense competition for consumer attention and rising customer acquisition costs within the $750 billion global digital advertising industry. The primary market includes digital marketing agencies and in-house brand teams who are under constant pressure to improve campaign performance and return on investment. (Source: Statista, "Digital Advertising - Worldwide.")

DEVELOPMENT STAGE

TRL-4 – Prototype Validated in Lab: Key functions including engagement optimization and image-text alignment have been demonstrated in controlled experimental settings.

APPLICATIONS

  • Generate multimodal image-text ads from a given topic for online advertising.
  • Create engaging post content tailored for social media companies.
  • Produce datasets of potential phishing messages for training detection systems.

ADVANTAGES

  • Explicit engagement optimization: Prioritizes content that captures user attention.
  • Efficient continuous prompt learning: Reduces computational demand by freezing the pretrained model's parameters (see the sketch after this list).
  • Iterative refinement for alignment: Enhances coherence between image and text outputs.
  • Open-vocabulary capability: Generates engaging content across diverse topics.
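
As a rough sketch (not the claimed method), the following shows how continuous prompt learning with frozen parameters can be set up: the pretrained encoder is frozen and only a small set of soft prompt vectors is trained, for example against an engagement objective. The class name, token count, and embedding dimension are illustrative, and the encoder is assumed to accept embedding inputs directly.

    # Illustrative soft-prompt module: the pretrained encoder is frozen and
    # only the learnable prompt vectors receive gradient updates.
    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, encoder: nn.Module, n_tokens: int = 8, dim: int = 768):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():   # freeze pretrained weights
                p.requires_grad_(False)
            # the only trainable parameters: continuous prompt embeddings
            self.prompt = nn.Parameter(0.02 * torch.randn(n_tokens, dim))

        def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
            # token_embeds: (batch, seq_len, dim); prepend the soft prompt
            soft = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
            return self.encoder(torch.cat([soft, token_embeds], dim=1))

    # Only the prompt vectors would be passed to the optimizer, e.g.:
    # optimizer = torch.optim.Adam([model.prompt], lr=1e-3)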

IP STATUS

US Patent Application Publication US20250111569A1 (pending)
