Recent LLM-XR systems mostly drive experiences by generating and compiling code on the fly. In complex scenes with thousands of artifacts and properties, models struggle to pick the right context, inflating token usage and still missing crucial constraints—like keeping cups on tables. Hallucinated or malformed scripts often fail to compile or crash at runtime, so teams add multi-pass “builder–inspector” loops that increase latency and hurt responsiveness. Other approaches stream control signals for animation, but don’t fuse deeply with XR scene graphs, sensors, or prefabs. Overall, today’s pipelines are brittle, slow, expensive, and context-noisy, yielding inconsistent results and frustrating user interactions.
GW and Penn State University researchers have developed a novel framework, LLMER, that turns spoken requests into structured JSON instead of fragile runtime code. A two-stage LLM wrapper first classifies the task and selects only the essential context from the scene via a curated library, then emits JSON that conforms to a predefined schema. Prebuilt modules (a Virtual Object Creator, an Animation Library, and a Reality Fusion Engine) execute that JSON to spawn objects, animate them, and bind them to real environments. A voice avatar handles Whisper transcription and TTS responses. In preliminary studies, LLMER cut token consumption by over 80% and task completion time by about 60%.
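To make the data flow concrete, here is a minimal Python sketch of the two-stage idea. It is an illustration under stated assumptions, not the published implementation: the field and module names (the "actions" array, "create_and_animate", and so on) are hypothetical, and the two stages, which are LLM calls in the real system, are mocked with simple stand-ins.

```python
# Hypothetical sketch of a two-stage request-to-JSON pipeline in the spirit of LLMER.
# All schema details below are illustrative assumptions, not the paper's schema.
import json


def stage_one(user_request: str, scene_index: dict) -> dict:
    """Stage 1: classify the task and select only the scene context it needs."""
    # The real system uses an LLM for classification and context selection;
    # this stand-in only mimics the shape of its output.
    task = "create_and_animate" if "spawn" in user_request else "query"
    relevant = {k: v for k, v in scene_index.items() if k in user_request}
    return {"task": task, "context": relevant}


def stage_two(plan: dict) -> str:
    """Stage 2: emit schema-conformant JSON for the prebuilt execution modules."""
    payload = {
        "module": "VirtualObjectCreator",  # assumed module identifier
        "actions": [
            {"op": "spawn", "prefab": "cup", "anchor": "table_01"},
            {"op": "animate", "clip": "hover", "target": "cup"},
        ],
        "context": plan["context"],
    }
    return json.dumps(payload, indent=2)


# Example: a tiny scene index and a spoken request already transcribed to text.
scene_index = {"table_01": {"type": "table", "supports_objects": True}}
print(stage_two(stage_one("spawn a cup on table_01", scene_index)))
```

Because the downstream modules only ever receive JSON that validates against the schema, malformed output can be rejected or retried cheaply instead of failing at compile time or crashing in the running scene.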
Figure: An illustration of the LLMER system for enhancing immersive user-XR interaction