Generalize and Guide: Decomposing Rewards for Few-Shot Inverse Reinforcement Learning

Ziyi Liu¹, Grace Zhang¹
¹University of Southern California

Reinforcement Learning Conference (RLC) 2026
RSS RC4Robotics Workshop 2025

Abstract

Inverse reinforcement learning (IRL) provides a powerful framework for learning from demonstrations. However, real-world tasks often exhibit substantial natural variations (e.g., picking up mugs with varying shapes), making it impractical to collect demonstrations that fully specify a new task under every possible scenario. In practice, while demonstrations for the target task are limited, it is often easier to obtain datasets of heterogeneous but related behaviors. This motivates the problem of few-shot IRL with multi-task demonstrations (FM-IRL), where an agent must learn a new task with substantial variations from only a limited number of target-task demonstrations, together with sufficient demonstrations of related tasks and online agent experience.

We introduce Multitask discriminator Proximity-Guided IRL (MPG), which learns two complementary reward components: (1) a generalizable discriminator that transfers shared structure across related tasks to identify expert behavior in a new task, and (2) a proximity function that measures how far a state deviates from expert behavior and provides corrective guidance during exploration. We demonstrate the effectiveness of our method on multiple challenging navigation and manipulation tasks under significant variations (e.g., object configurations, table layouts, and initial robot poses), resulting in an average 35.4% increase in success rate over the next best method.

MPG overview


Multitask Discriminator Proximity-Guided IRL (MPG) Method

MPG learns a new task from limited demonstrations by decomposing the reward function into two synergistic components: a multi-task discriminator that transfers knowledge across related tasks, and a proximity function that provides corrective guidance when the agent deviates from expert behavior. Together, these components produce a generalizable and informative reward for effective few-shot IRL.

Our approach learns a two-part reward function: a multi-task discriminator approximates the expert distribution, and a proximity reward estimates distance to expert states for corrective guidance.

To generalize beyond few-shot demonstrations, our key insight is twofold: demonstrations from other tasks can be leveraged to infer the expert distribution of a new task, and online interaction can be used to estimate temporal distance to this distribution, providing corrective guidance when the agent deviates.

1. Multi-Task Discriminator

We propose a demonstration-conditioned discriminator that predicts whether a state-action pair is expert for a given task. Trained on both target and multi-task demonstrations, it transfers shared structure across tasks to recognize expert behavior under diverse intra-task variations. The discriminator score d(s, a) is used directly as a reward to encourage expert-consistent behavior.
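A minimal sketch of what a demonstration-conditioned discriminator could look like, assuming a linear logistic model, mean-pooled demonstration features as the task context, and a binary cross-entropy update. The class name `DemoConditionedDiscriminator`, the helper `demo_context`, the pooling scheme, and the sigmoid output are all illustrative assumptions; the source does not specify the architecture or the loss.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def demo_context(demo):
    """Mean-pool a demonstration, given as a list of (state, action) tuples,
    into one context vector (an assumed way to condition on the task)."""
    pairs = [list(s) + list(a) for s, a in demo]
    return [sum(col) / len(pairs) for col in zip(*pairs)]


class DemoConditionedDiscriminator:
    """Hypothetical linear discriminator over [state, action, demo context]."""

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * (state_dim + action_dim)  # (s, a) plus pooled context
        self.w = [rng.uniform(-0.01, 0.01) for _ in range(in_dim)]
        self.b = 0.0

    def score(self, state, action, context):
        # d(s, a) for the task described by `context`, in (0, 1).
        x = list(state) + list(action) + list(context)
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, state, action, context, label, lr=0.5):
        # One SGD step on binary cross-entropy: label 1 = expert, 0 = policy.
        x = list(state) + list(action) + list(context)
        err = self.score(state, action, context) - label
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err


# Toy usage: the expert always takes action +1, the policy takes -1.
demo = [((0.0,), (1.0,)), ((1.0,), (1.0,))]
ctx = demo_context(demo)
disc = DemoConditionedDiscriminator(state_dim=1, action_dim=1)
for _ in range(100):
    for s in (0.0, 1.0):
        disc.update((s,), (1.0,), ctx, label=1)
        disc.update((s,), (-1.0,), ctx, label=0)
print(disc.score((0.0,), (1.0,), ctx), disc.score((0.0,), (-1.0,), ctx))
```

In this toy setup the expert-like pair receives the higher score, which is the property the reward relies on. A real implementation would presumably use a neural network and a learned demonstration encoder.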

2. Proximity Function

To provide informative rewards in non-expert states, we introduce a proximity function p(s) that estimates temporal distance to the expert state distribution. It is trained with temporal consistency constraints inspired by quasimetric learning, anchoring expert states at zero proximity and enforcing p(s_t) ≤ p(s_{t+1}) + ζ along policy transitions.
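The two constraints can be sketched as penalty terms, assuming squared penalties and a hinge on the consistency constraint. The function name `proximity_loss`, the penalty form, and the default step cost `zeta=1.0` are assumptions made for illustration. The source does not give the full objective, and this sketch omits whatever term keeps p from collapsing to zero everywhere.

```python
def proximity_loss(p, expert_states, transitions, zeta=1.0):
    """Penalty form of the two constraints on a proximity function.

    p             : callable mapping a state to a non-negative float
                    (hypothetical proximity model)
    expert_states : sequence of states taken from expert demonstrations
    transitions   : sequence of (s_t, s_next) pairs from policy rollouts
    zeta          : allowed one-step increase in temporal distance
    """
    # Anchor term: expert states should have zero proximity.
    anchor = sum(p(s) ** 2 for s in expert_states) / max(len(expert_states), 1)

    # Consistency term: penalize violations of p(s_t) <= p(s_next) + zeta.
    violations = [max(0.0, p(s) - p(s_next) - zeta) for s, s_next in transitions]
    consistency = sum(v ** 2 for v in violations) / max(len(transitions), 1)

    return anchor + consistency


# Toy usage on a 1-D chain where state 0 is the only expert state.
expert = [0]
rollout = [(3, 2), (2, 1), (1, 0)]
print(proximity_loss(lambda s: float(abs(s)), expert, rollout))      # consistent
print(proximity_loss(lambda s: 2.0 * abs(s), expert, rollout))       # violates
```

With `p(s) = |s|`, each step changes the distance by exactly `zeta`, so neither term is penalized. Doubling the estimate overshoots the one-step bound on every transition.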

3. Combined MPG Reward

The full reward integrates expert recognition and proximity guidance:

R̃(s_t, a_t, s_{t+1}) = d(s_t, a_t) + λ [p(s_t) − p(s_{t+1})]
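The combined reward is simple to compute once d and p are available. The sketch below treats both as plain callables, which is an assumption, and the names `mpg_reward` and `trajectory_rewards` are illustrative. The toy example sets d to zero to isolate the proximity term, whose undiscounted sum along a trajectory telescopes to λ [p(s_0) − p(s_T)].

```python
def mpg_reward(d, p, s, a, s_next, lam=1.0):
    """Combined reward: d(s_t, a_t) + lam * (p(s_t) - p(s_{t+1})).

    d : callable (state, action) -> float, the discriminator score
    p : callable state -> float, the proximity (temporal distance) estimate
    """
    return d(s, a) + lam * (p(s) - p(s_next))


def trajectory_rewards(d, p, states, actions, lam=1.0):
    # states has one more entry than actions: s_0, ..., s_T.
    return [
        mpg_reward(d, p, states[t], actions[t], states[t + 1], lam)
        for t in range(len(actions))
    ]


# Toy usage: walk from state 3 to the expert state 0 on a 1-D chain.
states = [3, 2, 1, 0]
actions = [-1, -1, -1]
no_discriminator = lambda s, a: 0.0      # isolate the proximity term
distance = lambda s: float(abs(s))       # stand-in for a learned p

rewards = trajectory_rewards(no_discriminator, distance, states, actions, lam=0.5)
print(rewards, sum(rewards))
```

Each step toward the expert state earns a positive proximity reward, and a step away would earn a negative one. This is the corrective guidance described above.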

The policy is optimized with standard RL (PPO) using this decomposed reward. At each iteration, the agent collects transitions, updates the discriminator and proximity function, and improves the policy.
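The iteration order described above can be written as a short skeleton. Every callable passed in (`collect`, `update_discriminator`, `update_proximity`, `relabel`, `update_policy`) is a hypothetical placeholder for components the source does not detail. The PPO update in particular is left abstract.

```python
def train_mpg(num_iterations, collect, update_discriminator,
              update_proximity, relabel, update_policy):
    """Skeleton of one possible MPG training loop (all callables are assumed).

    collect()                    -> batch of transitions from the current policy
    update_discriminator(batch)  -> fits d on expert vs. policy data
    update_proximity(batch)      -> fits p with the temporal constraints
    relabel(batch)               -> per-transition MPG rewards
    update_policy(batch, rewards)-> one policy improvement step, e.g. PPO
    """
    for _ in range(num_iterations):
        batch = collect()
        update_discriminator(batch)
        update_proximity(batch)
        rewards = relabel(batch)
        update_policy(batch, rewards)


# Toy usage: record the order in which the components are called.
log = []
train_mpg(
    num_iterations=2,
    collect=lambda: log.append("collect") or ["transition"],
    update_discriminator=lambda batch: log.append("discriminator"),
    update_proximity=lambda batch: log.append("proximity"),
    relabel=lambda batch: log.append("relabel") or [0.0] * len(batch),
    update_policy=lambda batch, rewards: log.append("policy"),
)
print(log)
```

Relabeling after both reward components are updated means the policy step always sees the most recent reward. Whether the original implementation orders the steps exactly this way is not stated in the source.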



Problem Setting: FM-IRL

FM-IRL reflects a practical robotics scenario: multi-task expert demonstrations are often available, but interaction with multi-task training environments is difficult to obtain. The agent must learn a new target task using only:

  1. A small set of target-task demonstrations
  2. A larger multi-task demonstration dataset from related tasks
  3. Online interaction with the target-task environment

Unlike meta-IRL, FM-IRL does not require access to multi-task environments. Unlike standard imitation learning, MPG learns a reward function that supports online policy optimization and recovery from unseen states.
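As a rough illustration of the three inputs, the container below groups them in one place. The class `FMIRLProblem`, its field names, and the demonstration format are hypothetical. The point is only that the related tasks appear as data, with no environment handle attached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Transition = Tuple[Sequence[float], Sequence[float]]  # (state, action)
Demo = List[Transition]


@dataclass
class FMIRLProblem:
    """Hypothetical container for the inputs of the FM-IRL setting."""

    target_demos: List[Demo]                # a small set for the new task
    multitask_demos: Dict[str, List[Demo]]  # larger dataset, keyed by related task
    make_target_env: Callable[[], object]   # online interaction, target task only

    def summary(self):
        return {
            "target_demos": len(self.target_demos),
            "related_tasks": len(self.multitask_demos),
            "related_demos": sum(len(v) for v in self.multitask_demos.values()),
        }


# Toy usage with made-up task names and one-step demonstrations.
step = ((0.0,), (1.0,))
problem = FMIRLProblem(
    target_demos=[[step]],
    multitask_demos={"task_a": [[step], [step]], "task_b": [[step]]},
    make_target_env=lambda: object(),
)
print(problem.summary())
```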



Environments

We evaluate MPG on navigation and manipulation tasks with substantial intra-task variation:

(a) Maze2D (D4RL): The agent navigates to a fixed goal location among four fixed objects. Intra-task variation comes from the agent's random starting position.

(b) Block Stacking: The agent picks up a block of one color and places it on a block of another color, with random initial block positions creating variation.

(c) FactorWorld: A multi-task benchmark of robot manipulation tasks with variations in object position, table position, distractor objects, and arm position.

Maze2D, Block Stacking, and FactorWorld environments.



Results

Quantitative Comparison

MPG consistently outperforms all baselines across Maze2D, Block Stacking, and seven FactorWorld tasks, achieving an average 35.4% success rate improvement over the next best method. We compare against BC, SQIL, GAIL, MT-AIRL, PEMIRL, GoalPro, and DVD.

Results on Maze2D, Block Stacking, Door Open, Door Unlock, Plate Slide Back, and Button Press Wall, comparing MPG against all baselines. Dashed lines (BC, SQIL) denote performance at convergence.

MPG vs. IRL baselines: Although GAIL and MT-AIRL have access to the same data, they struggle to leverage heterogeneous demonstrations when target-task supervision is scarce and task variation is substantial. PEMIRL underperforms due to the high sample cost of meta-training within a fixed budget.

MPG vs. IL baselines: BC is a strong baseline under limited demonstrations, but cannot improve through online interaction. SQIL relies on an extremely sparse reward signal and performs poorly on most tasks.

MPG vs. discriminator/proximity baselines: GoalPro and DVD fail to learn effective policies in most tasks, highlighting the importance of combining generalizable expert recognition with dense proximity guidance in non-expert states.

Data Efficiency Analysis

Performance improves with more target demonstrations and with greater task diversity in the multi-task dataset, though gains plateau once a moderate level of diversity is reached.

Effect of the number of target demonstrations.

Effect of the number of tasks in the multi-task dataset.

Ablation Study

We ablate the two parts of our reward function by training policies with either a discriminator-only reward or a proximity-only reward. While each component individually provides benefits, combining both yields the best performance. Removing the multi-task dataset (no multi-task data) also degrades performance, indicating that information from related tasks is important for generalization.

We further illustrate the two components in a simple empty Minigrid environment (below). The red flag indicates the goal and arrows show the expert demonstration. Lighter colors correspond to higher rewards. The two parts of our reward function provide complementary and informative learning signals: the discriminator generalizes to certain goal-reaching paths (e.g., directly above the goal), while the proximity function provides a smooth gradient in non-expert regions.
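To build intuition for the smooth gradient in non-expert regions, the sketch below computes the exact temporal distance to a set of expert cells in an empty grid using multi-source breadth-first search. This is not the learned proximity function. It is the kind of quantity a learned p would approximate, under the assumptions of 4-connected moves and unit step cost. The function name `temporal_distance_grid` and the grid layout are illustrative.

```python
from collections import deque


def temporal_distance_grid(width, height, expert_cells):
    """Steps from every cell to the nearest expert cell in an empty grid.

    expert_cells : iterable of (x, y) cells visited by the expert
    Returns a height x width list of lists indexed as dist[y][x].
    """
    dist = [[None] * width for _ in range(height)]
    queue = deque()
    for x, y in expert_cells:
        if dist[y][x] is None:
            dist[y][x] = 0          # expert states are anchored at zero
            queue.append((x, y))

    # Multi-source BFS: each cell gets its shortest step count to any expert cell.
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((nx, ny))
    return dist


# Toy usage: the expert walks along the bottom row of a 4x4 grid.
expert_path = [(0, 3), (1, 3), (2, 3), (3, 3)]
for row in temporal_distance_grid(4, 4, expert_path):
    print(row)
```

Neighboring cells differ by at most one step, so the distance satisfies the one-step consistency constraint with ζ = 1. The reward term λ [p(s_t) − p(s_{t+1})] is then positive for any move that reduces the distance to the expert path.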

Ablation on Lever Pull.

Discriminator reward heatmap.

Proximity reward heatmap.


