MPG learns a new task from limited demonstrations by decomposing the reward function into two synergistic components: a multi-task discriminator that transfers knowledge across related tasks, and a proximity function that provides corrective guidance when the agent deviates from expert behavior. Together, these components produce a generalizable and informative reward for effective few-shot IRL.
Our approach learns a two-part reward function: a multi-task discriminator approximates the expert distribution, and a proximity reward estimates distance to expert states for corrective guidance.
To generalize beyond few-shot demonstrations, our key insight is twofold: demonstrations from other tasks can be leveraged to infer the expert distribution of a new task, and online interaction can be used to estimate temporal distance to this distribution, providing corrective guidance when the agent deviates.
We propose a demonstration-conditioned discriminator that predicts whether a state-action pair is expert for a given task. Trained on both target and multi-task demonstrations, it transfers shared structure across tasks to recognize expert behavior under diverse intra-task variations. The discriminator score d(s, a) is used directly as a reward to encourage expert-consistent behavior.
To provide informative rewards in non-expert states, we introduce a proximity function p(s) that estimates temporal distance to the expert state distribution. It is trained with temporal consistency constraints inspired by quasimetric learning, anchoring expert states at zero proximity and enforcing p(s_t) ≤ p(s_{t+1}) + ζ along policy transitions.
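The two constraints on the proximity function can be read as penalty terms. The following is a minimal sketch under stated assumptions, not the authors' training objective: the name `proximity_loss`, the squared-hinge form, the `State` type, and the default `zeta` are all hypothetical choices made for illustration.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]  # hypothetical state representation


def proximity_loss(
    p: Callable[[State], float],
    expert_states: Sequence[State],
    policy_transitions: Sequence[Tuple[State, State]],
    zeta: float = 1.0,
) -> float:
    """Illustrative loss: anchor term plus squared-hinge consistency penalty."""
    # Anchor: expert states should have zero proximity.
    anchor = sum(p(s) ** 2 for s in expert_states) / max(len(expert_states), 1)
    # Consistency: penalize violations of p(s_t) <= p(s_{t+1}) + zeta.
    violations = [
        max(0.0, p(s) - p(s_next) - zeta) for s, s_next in policy_transitions
    ]
    consistency = sum(v ** 2 for v in violations) / max(len(violations), 1)
    return anchor + consistency


# Toy 1-D chain: the expert state is the origin, the policy walks toward it.
expert = [(0.0,)]
steps = [((3.0,), (2.0,)), ((2.0,), (1.0,)), ((1.0,), (0.0,))]
good = proximity_loss(lambda s: abs(s[0]), expert, steps)        # 0.0
bad = proximity_loss(lambda s: 3.0 * abs(s[0]), expert, steps)   # 4.0
```

A proximity that grows by at most ζ per step incurs no penalty, while one that grows three times as fast is penalized on every transition. Note that p ≡ 0 also minimizes this sketch, so a practical objective needs an additional term that pushes proximity up away from expert states; the text above does not specify that term.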
The full reward integrates expert recognition and proximity guidance:
R̃(s_t, a_t, s_{t+1}) = d(s_t, a_t) + λ [p(s_t) − p(s_{t+1})]
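As a small worked example of this formula, the sketch below computes the reward for a single transition from precomputed scores. The function names `mpg_reward` and `shaping_return`, and the numeric values, are illustrative assumptions; in practice d and p would be learned networks.

```python
from typing import Sequence


def mpg_reward(d_score: float, p_curr: float, p_next: float, lam: float = 1.0) -> float:
    """Discriminator score plus lam times the decrease in proximity."""
    return d_score + lam * (p_curr - p_next)


def shaping_return(proximities: Sequence[float], lam: float = 1.0) -> float:
    """Undiscounted sum of the proximity terms along one trajectory."""
    return sum(
        lam * (p_curr - p_next)
        for p_curr, p_next in zip(proximities[:-1], proximities[1:])
    )


# Moving one step closer to the expert distribution (3 -> 2) earns a bonus.
closer = mpg_reward(d_score=0.25, p_curr=3.0, p_next=2.0, lam=0.5)   # 0.75
# Moving one step away (2 -> 3) is penalized by the same amount.
farther = mpg_reward(d_score=0.25, p_curr=2.0, p_next=3.0, lam=0.5)  # -0.25
```

Because the proximity term is a difference of consecutive values, its undiscounted sum over a trajectory telescopes to λ [p(s_0) − p(s_T)], so the total proximity reward depends only on how much closer the agent ends up than where it started.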
The policy is optimized with standard RL (PPO) using this decomposed reward. At each iteration, the agent collects transitions, updates the discriminator and proximity function, and improves the policy.
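The per-iteration order described above can be sketched as a plain loop. This is a skeleton under stated assumptions: every callable passed to `train` is a hypothetical placeholder (the page does not describe the actual interfaces), and the PPO update is represented only by `update_policy`.

```python
from typing import Callable, List, Tuple

Transition = Tuple[object, object, object]  # (s_t, a_t, s_{t+1}); hypothetical layout


def train(
    collect: Callable[[], List[Transition]],
    update_discriminator: Callable[[List[Transition]], None],
    update_proximity: Callable[[List[Transition]], None],
    reward_fn: Callable[[object, object, object], float],
    update_policy: Callable[[List[Transition], List[float]], None],
    num_iterations: int,
) -> None:
    """Run the collect / update-reward / improve-policy cycle."""
    for _ in range(num_iterations):
        # 1. Roll out the current policy in the target environment.
        transitions = collect()
        # 2. Refresh both reward components on the new data.
        update_discriminator(transitions)
        update_proximity(transitions)
        # 3. Relabel the batch with the current learned reward.
        rewards = [reward_fn(s, a, s_next) for s, a, s_next in transitions]
        # 4. Policy improvement step (PPO in the text above).
        update_policy(transitions, rewards)
```

Relabeling rewards after the reward components are updated is one reasonable ordering; the text above does not say whether the batch is scored before or after the update.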
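FM-IRL reflects a practical robotics scenario: multi-task expert demonstrations are often available, but interaction with multi-task training environments is difficult to obtain. The agent must learn a new target task using only a few demonstrations of that task, expert demonstrations from other tasks, and online interaction with the target environment.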
Unlike meta-IRL, FM-IRL does not require access to multi-task environments. Unlike standard imitation learning, MPG learns a reward function that supports online policy optimization and recovery from unseen states.
We evaluate MPG on navigation and manipulation tasks with substantial intra-task variation:
(a) Maze2D (D4RL): The agent navigates to a fixed goal location among four fixed objects. Intra-task variation comes from the agent's random starting position.
(b) Block Stacking: The agent picks up a block of one color and places it on a block of another color, with random initial block positions creating variation.
(c) FactorWorld: A multi-task benchmark of robot manipulation tasks with variations in object position, table position, distractor objects, and arm position.
Evaluation environments: Maze2D, Block Stacking, and FactorWorld.
MPG consistently outperforms all baselines across Maze2D, Block Stacking, and seven FactorWorld tasks, achieving an average 35.4% success rate improvement over the next best method. We compare against BC, SQIL, GAIL, MT-AIRL, PEMIRL, GoalPro, and DVD.
MPG comparison against all baselines. Dashed lines (BC, SQIL) denote performance at convergence.
MPG vs. IRL baselines: Although GAIL and MT-AIRL have access to the same data resources, they struggle to leverage heterogeneous demonstrations under scarce supervision and substantial task variation. PEMIRL underperforms due to the high sample cost of meta-training within a fixed budget.
MPG vs. IL baselines: BC is a strong baseline under limited demonstrations, but cannot improve through online interaction. SQIL relies on an extremely sparse reward signal and performs poorly on most tasks.
MPG vs. discriminator/proximity baselines: GoalPro and DVD fail to learn effective policies in most tasks, highlighting the importance of combining generalizable expert recognition with dense proximity guidance in non-expert states.
Performance improves with more target demonstrations and with greater task diversity in the multi-task dataset, though gains plateau once a moderate level of diversity is reached.
Effect of the number of target demonstrations.
Effect of the number of tasks in the multi-task dataset.
We ablate the two parts of our reward function by training policies with either a discriminator-only reward or a proximity-only reward. While each component individually provides benefits, combining both yields the best performance. Removing the multi-task dataset (no multi-task data) also degrades performance, indicating that information from related tasks is important for generalization.
We further illustrate the two components in a simple empty Minigrid environment (below). The red flag indicates the goal and arrows show the expert demonstration. Lighter colors correspond to higher rewards. The two parts of our reward function provide complementary and informative learning signals: the discriminator generalizes to certain goal-reaching paths (e.g., directly above the goal), while the proximity function provides a smooth gradient in non-expert regions.
Ablation on Lever Pull.
Discriminator reward heatmap.
Proximity reward heatmap.