Interactive Extended Reality Worlds with JSON Data Generated by Large Language Models

Recent LLM-XR systems mostly drive experiences by generating and compiling code on the fly. In complex scenes with thousands of artifacts and properties, models struggle to select the right context, inflating token usage while still missing crucial constraints, such as keeping cups on tables. Hallucinated or malformed scripts often fail to compile or crash at runtime, so teams add multi-pass “builder–inspector” loops that increase latency and hurt responsiveness. Other approaches stream control signals for animation but do not integrate deeply with XR scene graphs, sensors, or prefabs. Overall, today’s pipelines are brittle, slow, costly, and cluttered with irrelevant context, yielding inconsistent results and frustrating user interactions.

Researchers at George Washington University (GW) and Penn State University have developed LLMER, a novel framework that turns spoken requests into structured JSON rather than fragile runtime code. A two-stage LLM wrapper first classifies the task and selects only the essential context from the scene via a curated library, then emits JSON that conforms to a predefined schema. Prebuilt modules (a Virtual Object Creator, an Animation Library, and a Reality Fusion Engine) execute that JSON to spawn objects, animate them, and bind them to the real environment. An interactive voice avatar transcribes speech with Whisper and replies through text-to-speech. In preliminary studies, LLMER reduced token consumption by over 80% and task completion time by about 60%.
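As a rough illustration of this pipeline, the Python sketch below shows a two-stage wrapper that first classifies the request and filters the scene context, then emits a JSON command that is checked and routed to prebuilt modules. It is a minimal sketch under stated assumptions, not the LLMER implementation: the call_llm stub, the JSON field names, and the module handlers are invented for the example.

  import json

  # Hypothetical fields a command must carry; illustrative only.
  REQUIRED_FIELDS = {"action", "object", "position"}

  def call_llm(prompt: str) -> str:
      """Stand-in for a real LLM call; returns canned output so the sketch runs offline."""
      if "classify" in prompt:
          return json.dumps({"task": "create_object", "context_keys": ["table_01"]})
      return json.dumps({"action": "spawn", "object": "mug",
                         "position": {"anchor": "table_01", "offset": [0.0, 0.02, 0.0]}})

  def stage_one(user_request: str, scene_index: dict) -> dict:
      """Classify the task and pull only the scene entries the task needs."""
      routing = json.loads(call_llm(f"classify: {user_request}"))
      context = {k: scene_index[k] for k in routing["context_keys"] if k in scene_index}
      return {"task": routing["task"], "context": context}

  def stage_two(user_request: str, plan: dict) -> dict:
      """Ask the model for a JSON command constrained to the filtered context."""
      command = json.loads(call_llm(f"emit JSON for: {user_request} | {json.dumps(plan)}"))
      missing = REQUIRED_FIELDS - set(command)
      if missing:
          raise ValueError(f"LLM output missing fields: {missing}")
      return command

  def execute(command: dict) -> None:
      """Route validated JSON to a prebuilt module instead of compiling generated code."""
      handlers = {"spawn": lambda c: print("VirtualObjectCreator ->", c),
                  "animate": lambda c: print("AnimationLibrary ->", c)}
      handlers[command["action"]](command)

  if __name__ == "__main__":
      scene = {"table_01": {"type": "table", "position": [1.0, 0.0, 0.5]}}
      plan = stage_one("put a mug on the table", scene)
      execute(stage_two("put a mug on the table", plan))

Because the model's output is data rather than executable code, a malformed response fails the field check above instead of crashing the scene.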

 

Figure: An illustration of the LLMER system for enhancing immersive user-XR interaction

 

Advantages

  • Uses structured JSON instead of runtime code, cutting token usage by roughly 80% (see the sketch after this list).
  • Two-stage context filtering reduces task completion time by about 60%.
  • Prebuilt execution modules reduce crashes, lower latency, and give more consistent results.
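The first advantage rests on validating model output against a fixed schema before it ever touches the scene. The snippet below is a hypothetical sketch using the third-party jsonschema package; the COMMAND_SCHEMA fields are invented for illustration and are not LLMER's published format.

  from jsonschema import validate, ValidationError

  # Hypothetical command schema for illustration only.
  COMMAND_SCHEMA = {
      "type": "object",
      "properties": {
          "action": {"enum": ["spawn", "animate", "bind"]},
          "object": {"type": "string"},
          "position": {
              "type": "object",
              "properties": {
                  "anchor": {"type": "string"},
                  "offset": {"type": "array", "items": {"type": "number"},
                             "minItems": 3, "maxItems": 3},
              },
              "required": ["anchor"],
          },
      },
      "required": ["action", "object"],
  }

  def safe_execute(command: dict) -> bool:
      """Reject malformed LLM output before it reaches the XR runtime."""
      try:
          validate(instance=command, schema=COMMAND_SCHEMA)
      except ValidationError as err:
          print("Rejected command:", err.message)
          return False
      print("Dispatching to prebuilt module:", command["action"])
      return True

  safe_execute({"action": "spawn", "object": "mug",
                "position": {"anchor": "table_01", "offset": [0, 0.02, 0]}})
  safe_execute({"action": "compile", "object": "mug"})  # fails the schema check

In this pattern, malformed output is rejected at validation time rather than surfacing as a compile failure or runtime crash.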

Applications

  • Voice-driven scene building: spawn tables, mugs, and other objects at spoken positions instantly.
  • On-demand AR/VR training: virtual labs, safety drills, and medical demonstrations.
  • Rapid prototyping for games, design, retail showrooms, and virtual try-ons.

Patent Information: