Efficient RL via Disentangled
Environment and Agent Representations

Carnegie Mellon University
ICML 2023

In robotics and AI, the question of how a machine perceives and interacts with its environment is a complex puzzle. While humans and animals have an innate "sense of self" that allows them to navigate and manipulate the world efficiently, most robotic systems struggle with this concept. They often require massive amounts of data to learn even simple tasks, and their adaptability across different environments or tasks is limited.

What if robots could have a more nuanced understanding of themselves in relation to their surroundings? What if they could distinguish between their "inner-self" and the "outer-environment," much like biological entities do? Motivated by these questions, our paper seeks to tackle the following question:
Can we learn and leverage the distinction between inner-self and outer-environment to improve visual RL?


Agents that are aware of the separation between themselves and their environments can leverage this understanding to form effective representations of visual input. We propose an approach for learning such structured representations for RL algorithms, using visual knowledge of the agent, such as its shape or mask, which is often inexpensive to obtain. This is incorporated into the RL objective using a simple auxiliary loss. We show that our method, Structured Environment-Agent Representations (SEAR), outperforms state-of-the-art model-free approaches over 18 different challenging visual simulation environments spanning 5 different robots.

How do we model agent-environment decoupling in visual RL?

Rather than encoding the input image into a single latent variable, we introduce an additional latent, Z^R, that represents agent-specific visual information, alongside an environment-centric latent.
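As a minimal sketch of this idea, the encoder below maps an image to a shared feature vector and then to two separate latents, one agent-centric and one environment-centric. The weights, dimensions, and linear layers here are illustrative stand-ins, not the paper's actual convolutional architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, w_shared, w_agent, w_env):
    """Toy two-headed encoder: shared features feed two latent heads."""
    h = np.tanh(image.flatten() @ w_shared)  # shared visual features
    z_r = h @ w_agent                        # agent-centric latent Z^R
    z_e = h @ w_env                          # environment-centric latent
    return z_r, z_e

# Illustrative shapes: an 8x8 "image", 32-d shared features, 16-d latents.
image = rng.standard_normal((8, 8))
w_shared = rng.standard_normal((64, 32))
w_agent = rng.standard_normal((32, 16))
w_env = rng.standard_normal((32, 16))
z_r, z_e = encode(image, w_shared, w_agent, w_env)
```

The key structural choice is that downstream losses can supervise z_r and z_e separately, even though they share an encoder backbone.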

How do we obtain robot-specific visual supervision?

We can directly get robot masks from a simulator.

Alternatively, we can fine-tune a segmentation model. Many off-the-shelf segmentation models, such as Segment Anything, can also be used to provide robot-specific visual information.

Robot masks are a natural and readily-available form of robot-specific visual information.

How do we incorporate this into a visual RL algorithm?

We augment the RL loss with agent-centric and environment-centric losses. In particular, we reconstruct a robot mask from the agent-centric latent and reconstruct the input image from the environment-centric latent. We supervise our input image encoder jointly with an RL loss, a robot mask reconstruction loss, and an image reconstruction loss.
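The joint objective described above can be sketched as a weighted sum of three terms. This is a simplified illustration: the loss weights (`beta_mask`, `beta_img`) are hypothetical hyperparameter names, and both reconstruction terms are written as plain MSE for brevity.

```python
import numpy as np

def sear_loss(pred_mask, true_mask, pred_image, true_image,
              rl_loss, beta_mask=1.0, beta_img=1.0):
    """Augment an RL loss with the two reconstruction terms.

    pred_mask  - robot mask decoded from the agent-centric latent
    pred_image - input image decoded from the environment-centric latent
    rl_loss    - scalar loss from the underlying RL algorithm
    """
    mask_recon = np.mean((pred_mask - true_mask) ** 2)    # agent-centric term
    img_recon = np.mean((pred_image - true_image) ** 2)   # environment-centric term
    return rl_loss + beta_mask * mask_recon + beta_img * img_recon
```

Gradients from all three terms flow back through the shared image encoder, which is how the auxiliary losses shape the representation used by the policy.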

Experimental Setup

We train SEAR on 18 tasks spanning 5 robots across 4 simulation suites, in single-task, transfer, and multi-task settings.

How does SEAR perform in single-task settings?



SEAR matches or exceeds baselines in single-task settings.

What about transfer learning?

SEAR seems to learn representations useful for transfer learning.

How does SEAR perform for multi-task learning?

While SEAR matches baselines, more work is needed to improve SEAR for the multi-task setting.

How do noisy mask labels impact SEAR's performance?

We generate noisy mask labels in simulation by randomly dropping pixels or downsampling the mask.
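The two corruptions above can be sketched as simple array operations on a binary mask. The dropout probability and downsampling factor here are illustrative defaults, not the values used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_pixels(mask, p=0.2):
    """Randomly zero out a fraction p of the mask's pixels."""
    keep = rng.random(mask.shape) >= p
    return mask * keep

def downsample(mask, factor=2):
    """Coarsen the mask: subsample, then repeat pixels back to full size."""
    small = mask[::factor, ::factor]
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
```

Both corruptions keep the mask the same shape as the clean label, so the training pipeline is unchanged; only the supervision signal degrades.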

While noisy masks hurt performance, SEAR is still able to outperform baselines.

Key Takeaways

  • Takeaway 1: Decoupled representation boosts performance.
  • Takeaway 2: SEAR can help with transfer.
  • Takeaway 3: Masks are readily available from simulators or off-the-shelf segmentation models.
  • Takeaway 4: SEAR can be easily added to any visual RL approach.


@inproceedings{gmelin2023efficient,
  title = {Efficient {RL} via Disentangled Environment and Agent Representations},
  author = {Gmelin, Kevin and Bahl, Shikhar and Mendonca, Russell and Pathak, Deepak},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {11525--11545},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato,
            Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/gmelin23a/gmelin23a.pdf},
  url = {https://proceedings.mlr.press/v202/gmelin23a.html},
}


We would like to thank Alexander C. Li and Murtaza Dalal for fruitful discussions. This work is supported by a Sony Faculty Research Award and NSF IIS-2024594.