Sim Scene Configuration

This page explains the simulation scene API in python/rcs/envs/scenes.py and shows how the example scene configs in python/rcs/envs/configs.py fit together.

If you are new to the frame conventions, read RCS Conventions first.

world frame
└── root frame
    ├── all composed scene objects live here
    └── shared base frame
        (same kinematic node as root frame)
        (common robot-coordinate frame)
        ├── i-th robot base
        │   └── i-th robot attachment_site
        ├── j-th robot base
        │   └── j-th robot attachment_site
        └── ...

The simple mental model

When building an RCS sim scene, it helps to think about five frames:

  1. World frame

    • The global MuJoCo frame.

    • This is the outermost reference for the whole scene.

  2. Root frame

    • The scene-local frame for the composed robot setup.

    • In the usual composed setup, all robot-scene objects are placed relative to this frame.

  3. Shared base frame

    • The common coordinate frame used for all robot actions and observations.

    • It is attached to the same kinematic node as the root frame, but represents the common robot coordinate convention.

  4. i-th robot base frame

    • The base frame of one specific robot.

    • Low-level kinematics and Cartesian commands are expressed here.

  5. i-th robot attachment_site

    • The end-effector mounting frame for that robot, before any optional tcp_offset.

    • Use this for wrist-mounted objects, cameras, and tools.

A good rule of thumb is:

  • want the outer global reference -> world frame

  • want the composed scene placement frame -> root frame

  • want one common coordinate frame for all robots -> shared base frame

  • want per-robot kinematics -> i-th robot base frame

  • want wrist or tool mounting -> i-th robot attachment_site

If needed, extra world_frame_objects can still be placed directly in world coordinates.

How the main placement frames work

For composed robot scenes, SimEnvCreator.create_model() combines three scene-placement transforms in this order:

robot2world = root_frame_to_world * shared_base_frame_to_root_frame * robot_to_shared_base_frame[robot_name]

In simple terms:

  • root_frame_to_world places the whole rig into the MuJoCo world

  • shared_base_frame_to_root_frame defines the shared robot coordinate frame relative to the root frame

  • robot_to_shared_base_frame places each robot relative to that shared robot frame
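The same composition can be sketched with plain homogeneous transforms. The numpy matrices below are illustrative stand-ins for rcs.common.Pose, and the numeric offsets are made up for the example:

```python
import numpy as np

def pose(tx=0.0, ty=0.0, tz=0.0):
    """Build a 4x4 homogeneous transform with translation only (illustration)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

# Stand-ins for the three scene-placement transforms (values are invented)
root_frame_to_world = pose(tz=0.8)              # rig sits on a 0.8 m table
shared_base_frame_to_root_frame = pose(tz=0.1)  # shared base lifted by a mount
robot_to_shared_base_frame = {"left": pose(ty=0.3), "right": pose(ty=-0.3)}

# Same composition order as SimEnvCreator.create_model()
robot2world = (
    root_frame_to_world
    @ shared_base_frame_to_root_frame
    @ robot_to_shared_base_frame["left"]
)
print(robot2world[:3, 3])  # translation [0.0, 0.3, 0.9]
```

With identity rotations the translations simply add up, which makes it easy to sanity-check where each robot base should land.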

Single-robot intuition

For a simple single-arm scene, all three can often be identity transforms:

root_frame_to_world = rcs.common.Pose()
shared_base_frame_to_root_frame = rcs.common.Pose()
robot_to_shared_base_frame = {"robot": rcs.common.Pose()}

That means the robot base, shared base, root frame, and world origin all coincide.

Dual-arm intuition

The FR3 duo example in python/rcs/envs/configs.py uses these frames more meaningfully:

  • root_frame_to_world keeps the whole duo rig aligned with the world

  • shared_base_frame_to_root_frame lifts the shared base to the duo mount height

  • robot_to_shared_base_frame offsets the left and right robots from the center

That is why the dual-arm example can expose one common action frame while still placing each robot correctly.

SimEnvCreatorConfig keys

This is the main top-level config for the scene API.

  • robot_cfgs: maps robot names to SimRobotConfig. Define one robot or multiple named robots such as "left" and "right".

  • sim_cfg: MuJoCo runtime settings (SimConfig). Controls realtime, async control, frequency, and convergence behavior.

  • control_mode: action representation, for example ControlMode.CARTESIAN_TQuat.

  • task_cfg: optional task-specific config. Adds pick/place or other task logic.

  • scene: base scene XML path or scene key, usually from SCENE_PATHS[...].

  • gripper_cfgs: optional per-robot gripper config. Add one gripper per robot.

  • camera_cfgs: optional camera config dictionary. Defines resolution, type, and frame rate for named cameras.

  • max_relative_movement: relative action limit. Caps the per-step Cartesian delta.

  • relative_to: relative action reference, usually RelativeTo.LAST_STEP or RelativeTo.NONE.

  • robot_to_shared_base_frame: per-robot offset relative to the shared base frame. Used for multi-robot layouts.

  • add_gravcomp: adds gravity compensation to the composed scene. Often useful for manipulation scenes.

  • wrapper_cfg: wrapper behavior flags such as binary gripper, home-on-reset, and depth output.

  • headless: GUI toggle; True runs without a GUI.

  • shared_base_frame_to_root_frame: offset from the shared base frame to the root frame. Moves the shared command origin inside the rig.

  • root_frame_to_world: offset from the root frame to the MuJoCo world. Places the whole setup in the room.

  • alternative_combined_robot_mjcf: use a pre-combined robot MJCF instead of composing robots one by one. For advanced custom scenes.

  • world_frame_objects: objects placed directly in world coordinates, such as loose props and room-fixed assets.

  • root_frame_objects: objects placed in root-frame coordinates, such as tables, mounts, and fixtures that should move with the rig.

  • robot_frame_objects: objects attached in a robot attachment-site frame, such as wrist mounts and end-effector payloads.

  • camera_adds: cameras to add to the scene, such as fixed overhead cameras or wrist cameras.

  • gripper_offsets: pose offsets for mounted grippers. Used to align visual or tool frames.

  • _original_cfg: internal helper used after prefixing. Usually ignore this in user code.

What usually lives inside the nested configs

The scene config mostly wires together three lower-level config types.

robot_cfgs: dict[str, SimRobotConfig]

Each robot entry usually defines things such as:

  • robot type

  • kinematic model path

  • attachment_site

  • tcp_offset

  • joint names and actuator names

  • base link name

  • degrees of freedom, joint limits, and q_home

The single-arm and dual-arm examples in python/rcs/envs/configs.py are good templates for this.
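To make the field list concrete, here is a rough dataclass sketch of what such an entry carries. This is not the real SimRobotConfig signature; the field names are assumptions based on the list above, and the values are placeholders:

```python
from dataclasses import dataclass

# Illustrative sketch only -- field names and types are assumptions,
# not the real SimRobotConfig API.
@dataclass
class RobotConfigSketch:
    robot_type: str
    mjcf_path: str            # kinematic model path
    attachment_site: str      # end-effector mounting frame
    tcp_offset: list[float]   # optional tool-center-point offset
    joints: list[str]         # joint names
    actuators: list[str]      # actuator names
    base_link: str
    q_home: list[float]       # home joint configuration

cfg = RobotConfigSketch(
    robot_type="fr3",
    mjcf_path="models/fr3.xml",
    attachment_site="attachment_site",
    tcp_offset=[0.0, 0.0, 0.1],
    joints=[f"joint{i}" for i in range(1, 8)],
    actuators=[f"actuator{i}" for i in range(1, 8)],
    base_link="base",
    q_home=[0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.785],
)
print(len(cfg.joints))  # 7 joints for a 7-DoF arm
```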

gripper_cfgs: dict[str, SimGripperConfig]

Each gripper entry usually defines:

  • gripper type

  • gripper joint names and actuator name

  • min/max width or actuator range

  • collision geometry settings

  • callback timing

camera_cfgs: dict[str, SimCameraConfig]

Each camera entry usually defines:

  • camera identifier

  • camera type

  • resolution

  • frame rate

A useful pattern is:

  • camera_cfgs defines the camera runtime properties

  • camera_adds defines where that camera is placed in the scene
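A minimal sketch of that split, using plain dicts rather than the real config classes (the keys and values here are illustrative):

```python
# Runtime properties for each named camera (stand-in for camera_cfgs)
camera_cfgs = {
    "bird_eye": {"resolution": (640, 480), "frame_rate": 30},
    "wrist": {"resolution": (424, 240), "frame_rate": 60},
}

# Placement for the same names (stand-in for camera_adds)
camera_adds = {
    "bird_eye": {"fovy": 60.0, "offset": "root-frame pose"},
    "wrist": {"fovy": 60.0, "offset": "attachment-site pose", "robot_name": "robot"},
}

# Cameras added without xml_path must also appear in camera_cfgs,
# so a quick consistency check is to compare the key sets
missing = set(camera_adds) - set(camera_cfgs)
assert not missing, f"camera_adds without runtime config: {missing}"
```

The shared dictionary key is what ties a camera's runtime properties to its placement, which is why mismatched names are a common failure mode.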

The most important nested configs

WrapperConfig

WrapperConfig controls behavior of the environment wrappers around the raw simulation.

  • binary_gripper: if True, gripper commands are treated as open/close instead of continuous width.

  • home_on_reset: if True, the robot returns home during reset.

  • include_depth: if True, camera wrappers include depth images. These are metric depth values scaled by BaseCameraSet.DEPTH_SCALE = 1000 and stored as uint16, so they are effectively in millimeters.

CameraAdderConfig

CameraAdderConfig describes how a camera is added to the scene.

  • xml_path: optional camera XML asset to insert directly.

  • fovy: camera field of view, used when creating the camera directly.

  • offset: camera pose offset.

  • attachment_site: attachment site to use if the camera is mounted on a robot.

  • robot_name: if set, mount the camera on that robot; otherwise add it as a scene camera.

The important frame detail is:

  • if robot_name is not set, offset is interpreted in the root frame and then moved into world by root_frame_to_world

  • if robot_name is set, offset is interpreted relative to that robot’s attachment site

Easy examples

Minimal single-robot scene

This is the basic shape used by EmptyWorldFR3 in python/rcs/envs/configs.py:

cfg = SimEnvCreatorConfig(
    robot_cfgs={"robot": robot_cfg},
    sim_cfg=SimConfig(async_control=False, realtime=True, frequency=1),
    control_mode=ControlMode.CARTESIAN_TQuat,
    scene=SCENE_PATHS["empty_world"],
    gripper_cfgs={"robot": gripper_cfg},
    camera_cfgs={"bird_eye": bird_eye_cfg, "wrist": wrist_cfg},
    robot_to_shared_base_frame={"robot": rcs.common.Pose()},
    shared_base_frame_to_root_frame=rcs.common.Pose(),
    root_frame_to_world=rcs.common.Pose(),
)

What this means in plain language:

  • there is one robot named robot

  • it uses Cartesian tquat actions

  • the base scene is the empty world

  • there is one gripper on the robot

  • there are two cameras

  • all high-level frames start at the same origin

Dual-arm scene

This is the important part of the EmptyWorldFR3Duo example:

robot_cfgs = {"left": robot_cfg_left, "right": robot_cfg_right}

robot_to_shared_base_frame = {
    "left": DEFAULT_TRANSFORMS["FR3_DUOMOUNT_LEFT_ROBOT"],
    "right": DEFAULT_TRANSFORMS["FR3_DUOMOUNT_RIGHT_ROBOT"],
}

shared_base_frame_to_root_frame = DEFAULT_TRANSFORMS["FR3_DUOMOUNT_HEIGHT_OFFSET"]
root_frame_to_world = rcs.common.Pose()

In plain language:

  • the shared base frame sits at the logical center of the duo setup

  • the left and right robot bases are offset from that center

  • the whole setup can still be moved together by changing root_frame_to_world

Object placement: which dictionary should I use?

world_frame_objects

Use this when the object should stay fixed in the room.

world_frame_objects = {
    "cube": (OBJECT_PATHS["green_cube"], rcs.common.Pose(translation=np.array([0.5, 0.0, 0.2]))),
}

Example meaning: place a cube at a fixed world position.

root_frame_objects

Use this when the object belongs to the rig and should move together with it.

root_frame_objects = {
    "duo_mount": (OBJECT_PATHS["fr3_duo_mount"], DEFAULT_TRANSFORMS["FR3_DUOMOUNT_BASE"]),
}

Example meaning: the duo mount is part of the setup, not a free world object.

robot_frame_objects

Use this when the object should be attached to one robot’s tool frame.

robot_frame_objects = {
    "left": {
        "left_d405_mount": (
            OBJECT_PATHS["robotiq_d405_mount"],
            DEFAULT_TRANSFORMS["FR3_ROBOTIQ_WRIST_D405_MOUNT"],
        )
    }
}

Example meaning: attach a wrist mount to the left robot only.

Camera depth units

When depth is enabled, the camera wrapper exposes depth images as scaled metric depth:

  • sim depth is first converted to meters in python/rcs/camera/sim.py

  • camera frames use BaseCameraSet.DEPTH_SCALE = 1000

  • depth is then stored as uint16

So in practice:

  • divide by 1000 to get meters

  • or read the values directly as millimeters

Example:

  • depth[y, x] == 1500 means the point is about 1.5 m away from the camera
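That conversion can be sketched directly; the array below is a fake depth image standing in for what the camera wrapper returns:

```python
import numpy as np

DEPTH_SCALE = 1000  # matches BaseCameraSet.DEPTH_SCALE

# Fake uint16 depth image as the camera wrapper would expose it:
# every pixel reads 1500, i.e. 1500 mm
depth = np.full((4, 4), 1500, dtype=np.uint16)

# Divide by the scale to recover metric depth in meters
depth_m = depth.astype(np.float32) / DEPTH_SCALE
print(depth_m[0, 0])  # -> 1.5 (meters)
```

Casting to float before dividing matters: integer division on the raw uint16 array would truncate sub-millimeter detail to whole meters.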

Camera placement

Fixed scene camera

This pattern from EmptyWorldFR3 adds an overhead camera:

camera_adds = {
    "bird_eye": CameraAdderConfig(
        fovy=60.0,
        offset=rcs.common.Pose(
            translation=np.array([0.271, 0.0, 2.080]),
            quaternion=np.array([0.0060, -0.0060, -0.7067, 0.7074]),
        ),
    )
}

Because robot_name is not set, this pose is interpreted in the root frame.

Wrist camera

This pattern mounts a camera to a robot:

camera_adds = {
    "wrist": CameraAdderConfig(
        fovy=60.0,
        offset=some_pose,
        robot_name="robot",
    )
}

Because robot_name is set, offset is interpreted relative to that robot’s attachment site.

Common mistakes

  1. Mixing up world and root frame

    • If the whole rig should move together, use root_frame_to_world or root_frame_objects, not world_frame_objects.

  2. Using the wrong frame for camera offsets

    • Scene cameras use root-frame offsets.

    • Robot-mounted cameras use attachment-site offsets.

  3. Forgetting matching camera names

    • If a camera is added without xml_path, its name must also exist in camera_cfgs.

  4. Putting wrist-mounted assets into world objects

    • Use robot_frame_objects for things that should follow the robot wrist.

  5. Using alternative_combined_robot_mjcf without the expected prefixes

    • The docstring in scenes.py requires names like robot{robot_name}.

A practical workflow

When building a new scene, this usually works well:

  1. Start with one robot and identity transforms.

  2. Add root_frame_objects for mounts or fixtures.

  3. Add world_frame_objects only for room-fixed props.

  4. Add robot_frame_objects for wrist payloads.

  5. Add cameras with camera_adds.

  6. Only then introduce non-trivial shared_base_frame_to_root_frame and robot_to_shared_base_frame offsets.

That order keeps the frame reasoning much easier.
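Put together, the workflow tends to produce a config in this shape. This is a sketch assembled from the earlier examples; robot_cfg, bird_eye_cfg, mount_pose, cube_pose, wrist_pose, and bird_eye_pose are placeholders you would define yourself:

```python
cfg = SimEnvCreatorConfig(
    # step 1: one robot, identity transforms everywhere
    robot_cfgs={"robot": robot_cfg},
    robot_to_shared_base_frame={"robot": rcs.common.Pose()},
    shared_base_frame_to_root_frame=rcs.common.Pose(),
    root_frame_to_world=rcs.common.Pose(),
    control_mode=ControlMode.CARTESIAN_TQuat,
    scene=SCENE_PATHS["empty_world"],
    # step 2: rig-fixed mounts and fixtures
    root_frame_objects={"mount": (OBJECT_PATHS["fr3_duo_mount"], mount_pose)},
    # step 3: room-fixed props
    world_frame_objects={"cube": (OBJECT_PATHS["green_cube"], cube_pose)},
    # step 4: wrist payloads
    robot_frame_objects={
        "robot": {"wrist_mount": (OBJECT_PATHS["robotiq_d405_mount"], wrist_pose)}
    },
    # step 5: cameras (runtime properties plus placement)
    camera_cfgs={"bird_eye": bird_eye_cfg},
    camera_adds={"bird_eye": CameraAdderConfig(fovy=60.0, offset=bird_eye_pose)},
)
# step 6: only now replace the identity transforms above
# with real shared-base and per-robot offsets
```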