Write a PREreview

Filter Before Mixing: Per-Modality Denoising for Multimodal RL with Application to Health Management

by Tsuyoshi Okita

Posted: May 12, 2026
Server: Preprints.org
DOI: 10.20944/preprints202605.0736.v1

Multimodal reinforcement learning agents must fuse signals with vastly different noise profiles—yet existing architectures, whether monolithic (π0, DreamerV3) or modular (MSDP, VTDexManip), allow noise from unreliable modalities to contaminate reliable ones at the point of fusion. We propose filter-before-mixing: each modality’s representation is independently refined by a per-modality Flow Matching module before spectral-domain fusion via a Fourier Neural Operator (FNO), with a residual gate ensuring that refinement is never harmful. The resulting architecture, FreamerV1 (Filter-before-mixing dreamer), has 93M parameters (0.4M trainable). On MiniGrid, FreamerV1 reaches 100% success at 5000 episodes, surpassing the 94% encoder-only baseline which degrades to 78% due to catastrophic forgetting. On Crafter (no language modality), it scores 16.0%, exceeding DreamerV3 (14.5%). On PAMAP2 wearable sensors—where no pre-trained encoder exists—the foundation encoder achieves 2.4× higher reward and 16× lower variance than a vanilla MLP, confirming that the filter-before-mixing advantage grows with encoder noise.
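To make the data flow described in the abstract concrete, here is a minimal NumPy sketch of the filter-before-mixing idea. It is an illustration under stated assumptions, not the preprint's implementation. The function names (`smooth`, `residual_gate`, `spectral_fuse`) are hypothetical. The moving-average `smooth` is only a stand-in for the per-modality Flow Matching module. The gate values are set by hand rather than learned. The fusion step is a single linear mix of low-frequency Fourier modes, not a full Fourier Neural Operator.

```python
import numpy as np

def smooth(z, k=5):
    """Placeholder denoiser (moving average); stands in for a learned refiner."""
    kernel = np.ones(k) / k
    return np.convolve(z, kernel, mode="same")

def residual_gate(z, refined, gate):
    """Blend raw and refined features; gate=0 returns z unchanged."""
    gate = np.clip(gate, 0.0, 1.0)
    return z + gate * (refined - z)

def spectral_fuse(features, weights, n_modes):
    """Mix modalities in the Fourier domain, keeping the lowest n_modes.

    features: array of shape (n_modalities, dim)
    weights:  array of shape (n_modalities,), assumed mixing coefficients
    """
    spec = np.fft.rfft(features, axis=-1)        # (n_modalities, dim // 2 + 1)
    mixed = np.tensordot(weights, spec, axes=1)  # weighted sum over modalities
    mixed[n_modes:] = 0.0                        # discard high-frequency modes
    return np.fft.irfft(mixed, n=features.shape[-1])

# Two toy "modalities" observing the same signal with different noise levels.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False))
vision = clean + 0.05 * rng.normal(size=64)  # reliable modality
sensor = clean + 0.80 * rng.normal(size=64)  # noisy modality

# Filter each modality on its own first (gate closed for the reliable one),
# then fuse. Noise is handled before the modalities ever interact.
filtered = np.stack([
    residual_gate(vision, smooth(vision), gate=0.0),
    residual_gate(sensor, smooth(sensor), gate=1.0),
])
fused = spectral_fuse(filtered, np.array([0.5, 0.5]), n_modes=8)
```

With the gate at 0 the reliable modality passes through untouched, which is one simple way to read the abstract's claim that refinement should never be harmful. How the actual gate is parameterized and trained is not stated in the abstract.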

You can write a PREreview of Filter Before Mixing: Per-Modality Denoising for Multimodal RL with Application to Health Management. A PREreview is a review of a preprint and can vary from a few sentences to a lengthy report, similar to a journal-organized peer-review report.
