Wheelchair Policy Goal Spec
This page is the contract for the wheelchair project. It defines the deliverable, the minimum physical validity requirements, the staged milestones, and the experiment dimensions the training loop is allowed to change.
Objective
Train a Unitree G1 policy that can manipulate a wheelchair in a physically meaningful way, starting from stable standing while holding the chair and progressing to commanded wheelchair locomotion.
The final target behavior is:
- Stand while holding the chair.
- Move the chair forward on command.
- Move the chair backward on command.
- Turn the chair left on command.
- Turn the chair right on command.
Single-policy control is preferred. A staged family of policies is acceptable during development, but the final deliverable should aim to collapse to one command-conditioned controller unless that clearly blocks progress.
Deliverable
The deliverable is complete when all of the following exist:
- A best checkpoint or checkpoint lineage with exact task IDs and training commands recorded.
- Deterministic evaluation rollouts showing stand, forward, backward, left turn, and right turn behavior.
- A short evaluation summary with the key scalar metrics and the failure cases that still remain.
- Documentation of the curriculum path that produced the result, including which scaffolds were temporary and which behavior transferred.
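Picking the "best checkpoint" is not the same as taking the latest one. The following is a minimal sketch of selection by downstream transfer score, assuming each saved checkpoint has already been evaluated on both its own stage and the next stage. The names `CheckpointEval` and `select_checkpoint` are hypothetical and not part of the project harness; the scores are the values reported for the first `M1g` run under Current Status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEval:
    name: str
    same_stage_score: float   # m0_score on the stage the checkpoint was trained on
    downstream_score: float   # m0_score after immediate transfer to the next stage


def select_checkpoint(evals):
    """Return the checkpoint with the best downstream score.

    Ties go to the earlier entry, because max() keeps the first maximum it sees.
    """
    if not evals:
        raise ValueError("no checkpoints to select from")
    return max(evals, key=lambda e: e.downstream_score)


# Scores reported for the first M1g run (same-stage M1g, downstream M2a).
m1g_run = [
    CheckpointEval("model_9900.pt", 0.125, 0.166015625),
    CheckpointEval("model_9901.pt", 0.193359375, 0.125),
]

best = select_checkpoint(m1g_run)
print(best.name)
```

Ranking the same two checkpoints by `same_stage_score` would pick the other one, which is why the lineage record should state which metric selected each checkpoint.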
Physical Validity Rules
The final evaluated behavior must satisfy these rules:
- The wheelchair must be collidable during final evaluation.
- The final evaluation must not use a fixed-base wheelchair, an X-rail, or a no-collision wheelchair.
- Allowed steady contact is hands on handles. Torso, pelvis, legs, and non-hand arm links should not be used as support against the chair.
- The robot should not rely on chair interpenetration, non-physical resets, or hidden kinematic helpers that are absent from the target task.
- Temporary scaffolds are allowed during curriculum phases if they are explicitly marked as non-deliverable.
Milestones
| Milestone | Goal | Minimum pass condition |
|---|---|---|
| `M0` | Clean fixed-chair stand | Holds upright for full episode with hands attached and low invalid chair contact. |
| `M1` | Clean free-chair hold | Holds chair with collisions enabled while chair is braked or heavily damped; no torso-bracing exploit. |
| `M2` | Lightly damped hold | Same as `M1`, but with less chair damping and no collapse in deterministic playback. |
| `M3` | Forward creep | Moves chair forward slowly on command without obvious bracing exploit. |
| `M4` | Forward walking | Tracks a stronger forward command with stable gait and chair control. |
| `M5` | Backward control | Moves chair backward on command without collapse or turning exploit. |
| `M6` | Left/right turning | Produces controllable left and right turns with the chair. |
| `M7` | Unified controller | One command-conditioned policy covers stand, forward, backward, left, and right. |
Acceptance Criteria
These are the default pass gates for phase advancement unless a run shows a clearly better criterion is needed.
For standing and hold phases:
- `time_out` should be high enough that most deterministic eval episodes finish cleanly.
- `bad_orientation` and `base_height` resets should be rare.
- `wheelchair_invalid_contact` should stay near zero in the collidable task.
- Rollouts should not show the torso or hips using the chair as a support surface.
For `M0` specifically, the fixed-chair hard-attachment scaffold currently treats the same-side `*_wrist_yaw_link` as an acceptable handle-contact body alongside the rubber hand. This is deliberate: the hard attachment masks direct hand-handle collision, so strict hand-only contact at this phase produces a false invalid-contact failure at reset. Later free-chair phases should restore stricter contact expectations.
For motion phases:
- The commanded wheelchair motion should match the observed chair motion directionally and approximately in magnitude.
- Backward motion should not be secretly solved by turning and rolling forward in world frame.
- Turning should not be secretly solved by sliding sideways or exploiting constraint geometry.
- All four wheels should stay plausibly grounded unless a specific maneuver makes brief unloading unavoidable.
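One way to make the directional checks above measurable is to project the chair's world-frame velocity into the chair's own frame and compare the forward component against the command. This is a minimal sketch under assumptions: it assumes the command is expressed in the chair body frame, that yaw is measured about the vertical axis, and that `lateral_tol` is a hypothetical tolerance. The function names are illustrative and do not come from the task code.

```python
import math


def chair_frame_velocity(vx_world, vy_world, chair_yaw):
    """Rotate a planar world-frame velocity into the chair body frame."""
    c, s = math.cos(chair_yaw), math.sin(chair_yaw)
    forward = c * vx_world + s * vy_world
    lateral = -s * vx_world + c * vy_world
    return forward, lateral


def motion_matches_command(cmd_forward, vx_world, vy_world, chair_yaw,
                           lateral_tol=0.05):
    """True when the chair rolls in the commanded direction without side slip.

    cmd_forward is the commanded chair-frame forward speed (negative = backward).
    lateral_tol is an assumed bound on sideways speed in m/s.
    """
    forward, lateral = chair_frame_velocity(vx_world, vy_world, chair_yaw)
    same_direction = cmd_forward * forward > 0.0
    return same_direction and abs(lateral) <= lateral_tol


# Backward command, but the chair has been turned around and rolls forward:
# the world-frame displacement looks backward, the chair-frame motion is not.
print(motion_matches_command(-0.10, -0.10, 0.0, math.pi))
```

A magnitude check against the commanded speed and a wheel-contact check would still be needed on top of this; the sketch only covers the direction and side-slip items.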
For the final controller:
- The same checkpoint should handle stand, forward, backward, left, and right commands in deterministic playback.
- The policy should remain physically valid under collisions-enabled evaluation.
- The result should be stable enough that the remaining work is gait polish and robustness, not basic task completion.
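The scalar part of the standing and hold gates could be encoded roughly as follows. This is a sketch only: the spec says "high enough", "rare", and "near zero" without giving numbers, so every threshold in `HoldGate` is a hypothetical placeholder, and the metric keys follow the `*_rate` names used in the status log. The rollout inspection for torso or hip bracing is not captured by this check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoldGate:
    # Hypothetical thresholds; the spec does not fix numeric values.
    min_time_out_rate: float = 0.9
    max_bad_orientation_rate: float = 0.05
    max_base_height_rate: float = 0.05
    max_invalid_contact_rate: float = 0.01


def hold_gate_failures(metrics, gate=HoldGate()):
    """Return the names of failed checks; an empty list means the gate passes."""
    failures = []
    if metrics["time_out_rate"] < gate.min_time_out_rate:
        failures.append("time_out_rate")
    if metrics["bad_orientation_rate"] > gate.max_bad_orientation_rate:
        failures.append("bad_orientation_rate")
    # base_height_rate is not reported for every run, so it is optional here.
    if metrics.get("base_height_rate", 0.0) > gate.max_base_height_rate:
        failures.append("base_height_rate")
    if metrics["invalid_contact_rate"] > gate.max_invalid_contact_rate:
        failures.append("invalid_contact_rate")
    return failures


# Deterministic eval of the first braked M1 probe, as reported in Current Status.
probe = {
    "invalid_contact_rate": 0.0,
    "bad_orientation_rate": 1.0,
    "time_out_rate": 0.0,
}
print(hold_gate_failures(probe))
```

Returning the list of failed checks rather than a single boolean keeps the iteration log informative about why a phase was not advanced.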
Allowed Scaffolds
These are acceptable temporary scaffolds during the curriculum:
- Fixed wheelchair root for initial standing.
- Braked wheelchair or strong damping for early free-chair hold.
- Ground-plane lock for height, roll, and pitch.
- Reduced command set, such as stand-only or forward-only.
- Warm-starting from an earlier checkpoint when the task changes.
These are not acceptable as final evaluation conditions:
- No-collision wheelchair.
- Fixed-base wheelchair.
- X-rail constraint.
- Manually hidden support geometry that the final task does not have.
What The Loop May Iterate On
The auto-research loop is allowed to vary:
- Reward shaping: invalid chair contact, hand-handle load, chair root drift, chair yaw/lateral motion, robot orientation, base height, action smoothness, joint deviation, energy, and command tracking.
- Observation set: chair root state, chair velocity, hand-handle relative pose, hand-handle relative velocity, handle wrench, and related direct chair state terms.
- Curriculum structure: fixed to braked to damped to free, command ranges, phase order, and advancement thresholds.
- Reset and initialization: robot pose, arm pose, chair pose, reset noise, and attachment startup procedure.
- Action parameterization: joint action scales, arm and wrist freedom, waist freedom, and whether specific joints are temporarily constrained.
- PPO settings: full resume versus actor-only warm start, critic reset, exploration std, learning rate, clip range, env count, rollout length, and minibatch structure.
- Contact rules: allowed-contact body lists, invalid-contact penalty weights, and whether the task should read contact directly or through derived penalties.
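The curriculum-structure dimension is easier to reason about when the chair dynamics of each stage are written down as an explicit ladder. The sketch below lists the damping and wheel/caster drive stiffness values reported for the bridge stages in Current Status and reports which parameters change between adjacent rungs. `ChairStage` and `changed_params` are hypothetical names for illustration, not project code, and the stage labels are shortened from the full task IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChairStage:
    name: str
    linear_damping: float
    angular_damping: float
    drive_stiffness: float


# Values as reported in the status log for each stage.
LADDER = [
    ChairStage("M1d", 0.25, 0.25, 1.75),
    ChairStage("M1e", 0.20, 0.20, 1.4),
    ChairStage("M1f", 0.175, 0.175, 1.2),
    ChairStage("M1g", 0.1625, 0.1625, 1.2),
    ChairStage("M2a", 0.15, 0.15, 1.2),
    ChairStage("M2", 0.15, 0.15, 1.0),
]

PARAMS = ("linear_damping", "angular_damping", "drive_stiffness")


def changed_params(prev, nxt):
    """List the physical parameters that differ between two stages."""
    return [p for p in PARAMS if getattr(prev, p) != getattr(nxt, p)]


for prev, nxt in zip(LADDER, LADDER[1:]):
    print(f"{prev.name} -> {nxt.name}: {changed_params(prev, nxt)}")
```

A rung that changes a single parameter group makes it easier to attribute a transfer drop to that change.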
What The Loop Should Record Each Iteration
Every iteration should leave behind:
- A one-line hypothesis.
- The exact task/config change.
- The exact train command.
- The checkpoint lineage used to initialize it.
- A short deterministic evaluation result.
- A decision: continue, branch, revert, or promote to next milestone.
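The six items above map naturally onto a small record type. The following is a sketch of one possible shape, assuming a JSON line per iteration is an acceptable storage format; the class and field names are hypothetical, and the example values are placeholders rather than real commands or results.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Decision(str, Enum):
    CONTINUE = "continue"
    BRANCH = "branch"
    REVERT = "revert"
    PROMOTE = "promote"


@dataclass
class IterationRecord:
    hypothesis: str        # one-line hypothesis
    config_change: str     # exact task/config change
    train_command: str     # exact train command, copied verbatim
    init_lineage: list     # checkpoints used to initialize the run, oldest first
    eval_result: dict      # short deterministic evaluation result
    decision: Decision

    def to_json(self):
        payload = asdict(self)
        payload["decision"] = self.decision.value
        return json.dumps(payload, sort_keys=True)


# Placeholder values only; a real record would carry the actual command and metrics.
record = IterationRecord(
    hypothesis="<one-line hypothesis>",
    config_change="<task/config diff>",
    train_command="<exact train command>",
    init_lineage=["<source checkpoint>"],
    eval_result={"m0_score": 0.0},
    decision=Decision.CONTINUE,
)
print(record.to_json())
```

Keeping `decision` as an enum restricts the log to the four outcomes the spec allows.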
Example Experiment Types
These are valid examples of loop iterations:
- Retrain `M0` from scratch with collidable chair and invalid-contact penalty active from the first step.
- Compare fixed-chair standing with and without per-axis handle-force penalties.
- Compare free-chair hold with heavy damping versus pose tether versus braked chair.
- Add chair-state observations to a stable standing checkpoint and test whether transfer to free-chair hold improves.
- Keep the same task but change only resume mode: full PPO continuation versus actor-only warm start with critic reset.
- Introduce forward creep before full forward walking to see whether command tracking needs a smaller first motion target.
- Split turning into its own phase before merging into a unified controller.
Current Status
The current retained M0 solution is no longer the original direct-observation branch.
- The direct-observation `M0` loop was superseded. It could improve the fixed-chair score somewhat, but it kept inheriting the fixed-chair bracing exploit from the `900`-dim warm-start source.
- The retained `M0` solution is now the observed-state branch: `Unitree-G1-29dof-Wheelchair-Scratch-M0-CollidableStand-Observed`.
- That branch warm-started from the relaxed attached standing checkpoint and reached deterministic `M0` eval with: `m0_score = 1.0`, `clean_hold_rate = 1.0`, `invalid_contact_rate = 0.0`.
- The next milestone is `Phase 1A` damped release on the same observed-state branch.
- The first bounded `Phase 1A` transfer from the clean observed `M0` checkpoint did not reintroduce torso bracing, but it failed through bilateral handle invalid-contact once the chair started moving.
- A temporary early-release scaffold that allows same-side `*_wrist_yaw_link` handle contact, matching the `M0` logic, removed the fake handle-contact blow-up and produced a usable `Phase 1A` branch: `clean_hold_rate = 0.8046875`, `invalid_contact_rate = 0.1953125`, `time_out_rate = 0.875`.
- The dominant remaining `Phase 1A` failure after that change is no longer handle semantics; it is wheelchair base contact plus release-phase drift and mild balance loss.
- A follow-up base-only invalid-contact penalty on top of the relaxed-handle `Phase 1A` branch was a regression. It improved neither drift nor stability and dropped deterministic release eval to: `clean_hold_rate = 0.421875`, `invalid_contact_rate = 0.578125`, `time_out_rate = 0.75`.
- A second follow-up that tightened the chair pose and velocity tethering on top of the relaxed-handle branch was also a regression. It overconstrained the release phase, increased wheelchair-base contact again, and dropped deterministic release eval to: `clean_hold_rate = 0.4453125`, `invalid_contact_rate = 0.5546875`, `time_out_rate = 0.796875`.
- The retained `Phase 1A` baseline is therefore still the relaxed-handle observed branch. The current evidence says the next useful lever is not stronger tethering or sharper invalid-contact penalties; it is a lighter release-phase shaping change that reduces drift without pushing the robot back into the chair.
- The first physically relevant `M1` branch is now a collidable braked-chair task with the same temporary same-side wrist-yaw handle allowance used by `M0`. This removed the earlier handle invalid-contact failure entirely, but the first bounded probe still failed purely through orientation instability: `Unitree-G1-29dof-Wheelchair-Scratch-M1-BrakedHold-Observed-RelaxedHandle` with deterministic eval `invalid_contact_rate = 0.0`, `bad_orientation_rate = 1.0`, `time_out_rate = 0.0`, `m0_score = -0.75`.
- A follow-up `M1` variant that kept the same collidable relaxed-handle scaffold but switched to the stronger stationary-chair reward set improved stability materially without reintroducing invalid contact: `Unitree-G1-29dof-Wheelchair-Scratch-M1-BrakedStationary-Observed-RelaxedHandle` with deterministic eval `invalid_contact_rate = 0.0`, `bad_orientation_rate = 0.9296875`, `time_out_rate = 0.0703125`, `clean_hold_rate = 0.0703125`, `m0_score = -0.626953125`.
- The first follow-up `M1` stability sweep kept the same two-hand collidable stationary branch and tried two narrow changes. Neither helped enough to retain:
  - reduced arm and wrist action freedom: `bad_orientation_rate = 0.9453125`, `time_out_rate = 0.0546875`, `clean_hold_rate = 0.0546875`, `m0_score = -0.654296875`
  - stronger upright and low-motion regularization: `bad_orientation_rate = 0.9296875`, `time_out_rate = 0.0703125`, `clean_hold_rate = 0.0703125`, `m0_score = -0.6328125`
- A same-task full PPO resume from the two-hand stationary collidable checkpoint also did not improve the retained result: `bad_orientation_rate = 0.9375`, `time_out_rate = 0.0625`, `clean_hold_rate = 0.0625`, `m0_score = -0.640625`.
- The first materially better physically valid post-`M0` branch came from changing the scaffold, not the reward weights. A temporary one-hand collidable stationary braked `M1` variant breaks the two-arm closed chain by attaching only the left hand: `Unitree-G1-29dof-Wheelchair-Scratch-M1-BrakedStationary-Observed-LeftHand-RelaxedHandle`.
- The first bounded one-hand run from the same two-hand source checkpoint became the new retained `M1` scaffold with deterministic eval: `bad_orientation_rate = 0.2109375`, `invalid_contact_rate = 0.0078125`, `time_out_rate = 0.7890625`, `clean_hold_rate = 0.78125`, `m0_score = 0.6061033082008361`.
- Continuing that one-hand branch for another short same-task training block stayed physically clean, but it drifted slightly on deterministic eval rather than improving the retained checkpoint. Saved checkpoints from that continuation scored:
  - `model_9800.pt`: `bad_orientation_rate = 0.2265625`, `invalid_contact_rate = 0.0078125`, `time_out_rate = 0.7734375`, `clean_hold_rate = 0.765625`, `m0_score = 0.595680835545063`
  - `model_9836.pt`: `bad_orientation_rate = 0.2421875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.75`, `clean_hold_rate = 0.75`, `m0_score = 0.5625`
- The retained best physically valid post-`M0` branch is therefore still the first one-hand collidable stationary braked `M1` checkpoint, not the later continuation. The current evidence is that the main blocker on the two-hand physical branch is the closed-chain attachment geometry, not missing contact penalties.
- The next successful curriculum step keeps the retained left-hand hard attachment and reintroduces the right hand as a bounded soft assist instead of a second hard joint: `Unitree-G1-29dof-Wheelchair-Scratch-M1b-BrakedStationary-Observed-LeftHardRightSoft-RelaxedHandle`.
- That `M1b` stage preserves the same `585`-dim observation space as the retained one-hand branch, so the one-hand checkpoint can be evaluated there directly. That immediate transfer is the current best physically valid free-chair hold result so far: `bad_orientation_rate = 0.1015625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.8984375`, `clean_hold_rate = 0.8984375`, `m0_score = 0.822265625`.
- A short same-stage warm-start continuation from the same checkpoint did not improve that immediate-transfer result. Deterministic eval after 20 iterations gave:
  - `model_9800.pt`: `bad_orientation_rate = 0.125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.8828125`, `clean_hold_rate = 0.875`, `m0_score = 0.7890625`
  - `model_9806.pt`: `bad_orientation_rate = 0.171875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.828125`, `clean_hold_rate = 0.828125`, `m0_score = 0.69921875`
- The retained best `M1`/early-`M2` scaffold is therefore now the immediate-transfer `M1b` result, not the continuation. The evidence so far says the right direction is staged second-hand reintroduction with bounded compliance, while naïve continued PPO updates on that stage still destabilize orientation.
- The next promotion attempt introduced an explicit damped dynamic-chair stage: `Unitree-G1-29dof-Wheelchair-Scratch-M2-DampedStationary-Observed-LeftHardRightSoft-RelaxedHandle`. This keeps the retained `M1b` left-hard/right-soft grip scaffold, but swaps the braked chair for a lightly damped dynamic chair with `linear_damping = 0.15`, `angular_damping = 0.15`, and stronger chair-stationary shaping.
- Immediate transfer of the retained `M1b` checkpoint into that `M2` stage was physically clean but materially worse on deterministic eval: `bad_orientation_rate = 0.515625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.484375`, `clean_hold_rate = 0.484375`, `m0_score = 0.09765625`.
- A short `20`-iteration warm-start continuation on the same `M2` stage did not recover that drop. Both saved checkpoints evaluated to the same deterministic result:
  - `model_9800.pt`: `bad_orientation_rate = 0.515625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.484375`, `clean_hold_rate = 0.484375`, `m0_score = 0.09765625`
  - `model_9806.pt`: `bad_orientation_rate = 0.515625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.484375`, `clean_hold_rate = 0.484375`, `m0_score = 0.09765625`
- The current read is that this first damped-chair `M2` promotion is a failed branch, not a retained milestone. It removes invalid contact cleanly, but the stability drop is too large, and short warm-start PPO updates did not move it. The retained scaffold therefore remains the immediate-transfer `M1b` result until a gentler `M2` transition is found.
- The next bridge attempt inserted a medium-damped dynamic-chair stage instead of jumping directly from braked `M1b` to light-damped `M2`: `Unitree-G1-29dof-Wheelchair-Scratch-M1c-MediumDampedStationary-Observed-LeftHardRightSoft-RelaxedHandle`. This stage keeps the retained `M1b` left-hard/right-soft scaffold and reward shaping unchanged, and only reduces the chair damping partway to the failed `M2` values.
- Immediate transfer of the retained `M1b` checkpoint into `M1c` was materially better than the failed `M2` jump while staying physically clean: `bad_orientation_rate = 0.375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.625`, `clean_hold_rate = 0.625`, `m0_score = 0.34375`.
- A short `20`-iteration warm-start continuation on `M1c` improved that bridge stage modestly without reintroducing any invalid contact:
  - `model_9800.pt`: `bad_orientation_rate = 0.3515625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.65625`, `clean_hold_rate = 0.6484375`, `m0_score = 0.392578125`
  - `model_9806.pt`: `bad_orientation_rate = 0.34375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.65625`, `clean_hold_rate = 0.65625`, `m0_score = 0.3984375`
- The current read is that `M1c` is the best dynamic-chair bridge so far, and it clearly narrows the gap from `M1b` to a moving chair better than the old `M2` attempt. But it is still materially worse than the retained braked `M1b` scaffold, so it should be treated as a provisional intermediate stage rather than a promoted new baseline.
- A longer same-stage continuation from the retained `M1c` `model_9806.pt` did not preserve the improved online training statistics in deterministic eval. The saved `model_9845.pt` checkpoint regressed to: `bad_orientation_rate = 0.3828125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.6171875`, `clean_hold_rate = 0.6171875`, `m0_score = 0.330078125`. So the retained `M1c` result remains the earlier short-run `model_9806.pt`, not the longer continuation.
- Changing only the resume mode on `M1c` helped. A bounded full-PPO resume from the retained `M1c` `model_9806.pt` produced a better deterministic bridge checkpoint: `model_9825.pt` with `bad_orientation_rate = 0.328125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.671875`, `clean_hold_rate = 0.671875`, `m0_score = 0.42578125`. This is the current retained `M1c` checkpoint. The evidence is that optimizer state matters on this stage; actor-only warm starts were leaving some performance on the table.
- Re-testing the lighter-damped `M2` stage from that improved retained `M1c` checkpoint helped the raw transfer a little but still did not make `M2` promotable. Immediate transfer of `model_9825.pt` into `M2` reached: `bad_orientation_rate = 0.4921875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5078125`, `clean_hold_rate = 0.5078125`, `m0_score = 0.138671875`. That is better than the earlier `M2` transfer from the weaker `M1b` source, but still materially behind `M1c`.
- A short `20`-iteration warm-start continuation on `M2` from the improved `M1c` `model_9825.pt` regressed again instead of consolidating the gain. The saved `model_9844.pt` checkpoint evaluated to: `bad_orientation_rate = 0.5390625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.4609375`, `clean_hold_rate = 0.4609375`, `m0_score = 0.056640625`. So the light-damped `M2` task is still not the next retained milestone. The current best ladder is `M1b` braked hold, then `M1c` medium-damped bridge, with `M2` still blocked by stage design rather than simple checkpoint quality.
- To separate light damping from the stronger `M2` stationary reward shaping, a second light-damped hold probe was added: `Unitree-G1-29dof-Wheelchair-Scratch-M2-LightDampedHold-Observed-LeftHardRightSoft-RelaxedHandle`. This uses the same light-damped chair as `M2`, but keeps the `M1b`/`M1c` hold reward scaffold unchanged.
- Immediate transfer from the retained `M1c model_9825.pt` into that isolated light-damped hold task matched the prior `M2` transfer: `bad_orientation_rate = 0.4921875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5078125`, `clean_hold_rate = 0.5078125`, `m0_score = 0.138671875`.
- A short `20`-iteration warm-start continuation on the isolated light-damped hold task also failed to retain the gain: `model_9844.pt` with `bad_orientation_rate = 0.5078125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.4921875`, `clean_hold_rate = 0.4921875`, `m0_score = 0.111328125`. This suggests the main cliff is the chair dynamics/damping drop itself, not only the stronger `M2` stationary reward weights.
- To narrow that dynamics cliff further, a midpoint bridge stage was added between `M1c` and the light-damped tasks: `Unitree-G1-29dof-Wheelchair-Scratch-M1d-TransitionDampedHold-Observed-LeftHardRightSoft-RelaxedHandle`. It keeps the same left-hard/right-soft hold scaffold and reward shaping as `M1b`/`M1c`, but uses a transition wheelchair with `linear_damping = 0.25`, `angular_damping = 0.25`, and wheel/caster drive stiffness `1.75`.
- Immediate transfer from the retained `M1c model_9825.pt` into `M1d` was better than both light-damped branches while staying physically clean: `bad_orientation_rate = 0.4453125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5546875`, `clean_hold_rate = 0.5546875`, `m0_score = 0.220703125`. That is still materially behind retained `M1c`, but it shows the dynamics cliff is at least partly smoothable with a finer damping ladder.
- A short `20`-iteration warm-start continuation on `M1d` did not consolidate that gain. The saved `model_9844.pt` checkpoint regressed to: `bad_orientation_rate = 0.484375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.515625`, `clean_hold_rate = 0.515625`, `m0_score = 0.15234375`.
- Changing only the continuation mode on `M1d` helped, just as it had on `M1c`. A bounded full-PPO resume from the retained `M1c model_9825.pt` produced a better `M1d` checkpoint: `model_9844.pt` with `bad_orientation_rate = 0.4375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5625`, `clean_hold_rate = 0.5625`, `m0_score = 0.234375`. That is a modest gain over the raw `M1d` transfer, but it is the current retained `M1d` result and shows that optimizer state still matters on the transition stages.
- Using that stronger retained `M1d` checkpoint directly on the old light-damped hold stage still did not lift the real blocker. Immediate transfer into `Unitree-G1-29dof-Wheelchair-Scratch-M2-LightDampedHold-Observed-LeftHardRightSoft-RelaxedHandle` came back at: `bad_orientation_rate = 0.5078125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.4921875`, `clean_hold_rate = 0.4921875`, `m0_score = 0.111328125`. So the `0.25 -> 0.15` dynamics gap was still too large.
- To narrow that remaining gap, a second finer transition stage was added: `Unitree-G1-29dof-Wheelchair-Scratch-M1e-LightTransitionDampedHold-Observed-LeftHardRightSoft-RelaxedHandle`. It keeps the same hold scaffold and reward shaping, but uses a lighter transition wheelchair with `linear_damping = 0.20`, `angular_damping = 0.20`, and wheel/caster drive stiffness `1.4`.
- Immediate transfer from the retained `M1d model_9844.pt` into `M1e` improved over the failed light-damped stage while staying physically clean: `bad_orientation_rate = 0.46875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.53125`, `clean_hold_rate = 0.53125`, `m0_score = 0.1796875`. That is still below retained `M1d`, but it is meaningfully better than the old `M2` raw transfer and confirms that the ladder can still be smoothed further by tightening the dynamics step.
- A bounded full-PPO continuation on `M1e` did not retain that improvement. Deterministic eval of the saved checkpoints regressed below the raw transfer:
  - `model_9850.pt`: `bad_orientation_rate = 0.484375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.515625`, `clean_hold_rate = 0.515625`, `m0_score = 0.15234375`
  - `model_9863.pt`: `bad_orientation_rate = 0.5`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5`, `clean_hold_rate = 0.5`, `m0_score = 0.125`
- Changing only the continuation mode on `M1e` helped, just as it had on `M1d`. A short model-only continuation from the retained `M1d model_9844.pt` produced a better `M1e` checkpoint:
  - `model_9850.pt`: `bad_orientation_rate = 0.484375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.515625`, `clean_hold_rate = 0.515625`, `m0_score = 0.15234375`
  - `model_9863.pt`: `bad_orientation_rate = 0.453125`, `base_height_rate = 0.0`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.546875`, `clean_hold_rate = 0.546875`, `m0_score = 0.20703125`

  The retained `M1e` checkpoint is now `model_9863.pt`.
- Re-testing the fully light-damped hold stage from that retained `M1e model_9863.pt` still did not lift the real blocker. Immediate transfer into `Unitree-G1-29dof-Wheelchair-Scratch-M2-LightDampedHold-Observed-LeftHardRightSoft-RelaxedHandle` came back at: `bad_orientation_rate = 0.5078125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.4921875`, `clean_hold_rate = 0.4921875`, `m0_score = 0.111328125`. So the `0.20 -> 0.15` dynamics gap was still too large.
- To narrow that final remaining gap, a third and finer transition stage was added: `Unitree-G1-29dof-Wheelchair-Scratch-M1f-FineTransitionDampedHold-Observed-LeftHardRightSoft-RelaxedHandle`. It keeps the same hold scaffold and reward shaping, but uses a finer transition wheelchair with `linear_damping = 0.175`, `angular_damping = 0.175`, and wheel/caster drive stiffness `1.2`.
- Immediate transfer from the retained `M1e model_9863.pt` into `M1f` stayed physically clean but was still only a midpoint result: `bad_orientation_rate = 0.4921875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5078125`, `clean_hold_rate = 0.5078125`, `m0_score = 0.138671875`.
- A short model-only continuation on `M1f` from that retained `M1e` checkpoint did retain the new rung. The saved `model_9882.pt` checkpoint evaluated to: `bad_orientation_rate = 0.4375`, `base_height_rate = 0.0078125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5625`, `clean_hold_rate = 0.5625`, `m0_score = 0.228515625`. This is the current retained `M1f` checkpoint.
- Using that stronger retained `M1f model_9882.pt` directly on the fully light-damped hold stage helped only slightly. Immediate transfer into `Unitree-G1-29dof-Wheelchair-Scratch-M2-LightDampedHold-Observed-LeftHardRightSoft-RelaxedHandle` reached: `bad_orientation_rate = 0.5`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5`, `clean_hold_rate = 0.5`, `m0_score = 0.125`. That is marginally better than the old `0.111328125` light-damped transfer, but still not a promotable `M2` result.
- A short model-only continuation on the same `M2` stage from retained `M1f model_9882.pt` did not retain the slight gain. One saved checkpoint failed to emit a valid eval result file, and the surviving deterministic eval regressed:
  - `model_9900.pt`: no valid metrics file emitted by the benchmark wrapper
  - `model_9901.pt`: `bad_orientation_rate = 0.515625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.484375`, `clean_hold_rate = 0.484375`, `m0_score = 0.09765625`
- The current read is now concrete. `M1d`, `M1e`, and `M1f` all became usable retained bridge rungs once continuation mode was softened to model-only, but the fully light-damped `M2` stage is still blocked even with the improved source checkpoints. The next likely lever is another task-design change at the light-damped boundary, not more continuation on the current `M2` setup.
- To isolate that remaining `M1f -> M2` cliff, two split boundary variants were added:
  - `Unitree-G1-29dof-Wheelchair-Scratch-M2a-LightBodyDampedHold-Observed-LeftHardRightSoft-RelaxedHandle` changes only the chair body damping to the `M2` level (`linear_damping = 0.15`, `angular_damping = 0.15`) while keeping the retained `M1f` wheel-drive stiffness `1.2`.
  - `Unitree-G1-29dof-Wheelchair-Scratch-M2b-SoftDriveTransitionHold-Observed-LeftHardRightSoft-RelaxedHandle` keeps the retained `M1f` body damping (`0.175`) while dropping only the wheel/caster drive stiffness to the `M2` level (`1.0`).
- Immediate transfer from retained `M1f model_9882.pt` into those split variants showed the boundary is asymmetric:
  - `M2a` raw transfer: `bad_orientation_rate = 0.484375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.515625`, `clean_hold_rate = 0.515625`, `m0_score = 0.15234375`
  - `M2b` raw transfer: `bad_orientation_rate = 0.46875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.53125`, `clean_hold_rate = 0.53125`, `m0_score = 0.1796875`

  The stiffness drop alone is therefore less damaging than the body-damping drop alone.
- A short model-only continuation on `M2b` did not retain the raw-transfer gain. The saved `model_9901.pt` checkpoint evaluated to: `bad_orientation_rate = 0.4765625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.53125`, `clean_hold_rate = 0.5234375`, `m0_score = 0.173828125`. So `M2b` is useful as a probe, but its retained best is still the immediate-transfer result rather than the continuation.
- A short model-only continuation on `M2a` did help a little, but not enough to beat the stronger `M2b` raw transfer. Direct deterministic eval of the saved checkpoints came back at:
  - `model_9900.pt`: `bad_orientation_rate = 0.484375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.515625`, `clean_hold_rate = 0.515625`, `m0_score = 0.15234375`
  - `model_9901.pt`: `bad_orientation_rate = 0.4765625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5234375`, `clean_hold_rate = 0.5234375`, `m0_score = 0.166015625`
- The current read is now narrower. The wheel-drive stiffness drop is not the main blocker at the light-damped boundary; the chair body-damping drop is. The strongest boundary result below retained `M1f` is now the raw `M2b` transfer at `m0_score = 0.1796875`. The next useful lever is therefore another finer damping rung or a redesigned body-damping transition, not more same-task continuation on `M2`.
- A finer body-damping rung was then added directly below `M1f`: `Unitree-G1-29dof-Wheelchair-Scratch-M1g-BodyTransitionDampedHold-Observed-LeftHardRightSoft-RelaxedHandle`. This stage keeps the retained `M1f` wheel-drive stiffness `1.2` and lowers only the chair body damping partway to `M2a`, using `linear_damping = 0.1625` and `angular_damping = 0.1625`.
- Immediate transfer from retained `M1f model_9882.pt` into `M1g` was physically clean and matched the earlier best `M2b` probe: `bad_orientation_rate = 0.46875`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.53125`, `clean_hold_rate = 0.53125`, `m0_score = 0.1796875`.
- A short model-only continuation on `M1g` did retain that rung locally. The saved checkpoints evaluated to:
  - `model_9900.pt`: `bad_orientation_rate = 0.5`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5`, `clean_hold_rate = 0.5`, `m0_score = 0.125`
  - `model_9901.pt`: `bad_orientation_rate = 0.4609375`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5390625`, `clean_hold_rate = 0.5390625`, `m0_score = 0.193359375`

  So the retained same-stage `M1g` checkpoint is `model_9901.pt`.
- But that same-stage improvement did not improve the true body-damping boundary. Immediate transfer from retained `M1g model_9901.pt` into `Unitree-G1-29dof-Wheelchair-Scratch-M2a-LightBodyDampedHold-Observed-LeftHardRightSoft-RelaxedHandle` fell back to: `bad_orientation_rate = 0.5`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5`, `clean_hold_rate = 0.5`, `m0_score = 0.125`.
- That changes the lesson from this branch. Same-stage bridge improvement is not sufficient as a retention criterion by itself, because `M1g` improved its own deterministic score while failing to improve downstream transfer into `M2a`. Future bridge-stage acceptance should therefore use downstream-stage transfer as a gate, not only same-stage deterministic eval.
- The bridge harness was then upgraded to support `--evaluate-all-checkpoints`, so bounded bridge runs can score every saved checkpoint and keep the one with the best downstream transfer metric rather than automatically taking the latest checkpoint. Re-scoring the existing `M1g` run with that rule changed the retained result:
  - `model_9900.pt` was the best downstream-transfer checkpoint, not `model_9901.pt`
  - same-stage `M1g`: `m0_score = 0.125`
  - downstream `M2a`: `bad_orientation_rate = 0.4765625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5234375`, `clean_hold_rate = 0.5234375`, `m0_score = 0.166015625`
- A fresh downstream-aware `M1g` continuation from retained `M1f model_9882.pt` with lower exploration (`policy_std = 0.005`) improved the real body-damping boundary further. The selected checkpoint from run `2026-05-26_05-52-09_m1g_bridge_std005_from_m1f_9882` was `model_9900.pt`, with:
  - same-stage `M1g`: `bad_orientation_rate = 0.4765625`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.5234375`, `clean_hold_rate = 0.5234375`, `m0_score = 0.166015625`
  - downstream `M2a`: `bad_orientation_rate = 0.453125`, `invalid_contact_rate = 0.0`, `time_out_rate = 0.546875`, `clean_hold_rate = 0.546875`, `m0_score = 0.20703125`
- That M1g std=0.005 model_9900.pt checkpoint is now the best pre-M2a bridge source we have seen. It beats the older raw M2b probe (0.1796875), the earlier M2a continuation from M1f (0.166015625), and the first same-stage-selected M1g result.
- Using that downstream-selected M1g model_9900.pt as the source for a short low-noise M2a continuation lifted the actual light-body-damped stage itself. In run 2026-05-26_05-57-38_m2a_from_m1g9900_std005, the best selected checkpoint was model_9900.pt with: bad_orientation_rate = 0.4453125, invalid_contact_rate = 0.0, time_out_rate = 0.5546875, clean_hold_rate = 0.5546875, m0_score = 0.220703125. The later saved model_9919.pt regressed to m0_score = 0.15234375, confirming again that earlier downstream-clean checkpoints can be better than the latest checkpoint on these delicate bridge stages.
- The current retained ladder is therefore no longer just a sequence of stage-local optima. The best known path is now:
  - retained M1f model_9882.pt
  - downstream-selected M1g std=0.005 model_9900.pt
  - retained M2a model_9900.pt from m2a_from_m1g9900_std005 with zero invalid chair contact throughout.
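The downstream-aware selection rule described above can be sketched as follows. This is a minimal illustration, not the project's harness: the function names `iteration` and `select_checkpoint` and the dictionary layout are hypothetical, and only the two m0_score values are taken from the M2a run reported above.

```python
from typing import Dict


def iteration(name: str) -> int:
    """Extract the training iteration from a name like 'model_9900.pt'."""
    return int(name.split("_")[1].split(".")[0])


def select_checkpoint(downstream_scores: Dict[str, float]) -> str:
    """Return the checkpoint with the highest downstream score.

    Ties go to the earlier iteration (an assumption; the real harness
    may break ties differently).
    """
    if not downstream_scores:
        raise ValueError("no checkpoints were scored")
    ordered = sorted(downstream_scores, key=iteration)
    # max() returns the first maximal item, so earlier iterations win ties.
    return max(ordered, key=lambda name: downstream_scores[name])


# m0_score values reported for run 2026-05-26_05-57-38_m2a_from_m1g9900_std005.
m2a_scores = {
    "model_9900.pt": 0.220703125,
    "model_9919.pt": 0.15234375,
}

best = select_checkpoint(m2a_scores)
latest = max(m2a_scores, key=iteration)
print(f"best by downstream score: {best}, latest saved: {latest}")
```

Comparing `best` with `latest` makes the difference from a latest-checkpoint heuristic explicit: here the two disagree, which is the case the log describes.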
- Starting from that retained M2a model_9900.pt, the remaining wheel-drive stiffness drop into full Unitree-G1-29dof-Wheelchair-Scratch-M2-LightDampedHold-Observed-LeftHardRightSoft-RelaxedHandle was then retried with the downstream-aware checkpoint-selection harness and lower exploration (policy_std = 0.005).
- In run 2026-05-26_06-03-42_m2_from_m2a9900_std005, the selected checkpoint was model_9900.pt, not the later model_9919.pt. The retained M2 result came back at: bad_orientation_rate = 0.4609375, invalid_contact_rate = 0.0, time_out_rate = 0.5390625, clean_hold_rate = 0.5390625, m0_score = 0.193359375. The later checkpoint regressed to: bad_orientation_rate = 0.46875, invalid_contact_rate = 0.0, time_out_rate = 0.53125, clean_hold_rate = 0.53125, m0_score = 0.1796875.
- This is the first retained full-M2 checkpoint on the current observed left-hard/right-soft ladder. It materially beats the earlier raw M2 transfer (0.125) and the failed older continuation branch (0.09765625), while keeping invalid chair contact at zero.
- But M2 is still weaker than the retained M2a source (0.220703125). So the wheel-drive stiffness boundary is no longer a hard failure, but it is still the next optimization target. The current retained ladder is now:
  - retained M1f model_9882.pt
  - downstream-selected M1g std=0.005 model_9900.pt
  - retained M2a model_9900.pt
  - retained M2 model_9900.pt
- The next branch should start from retained M2 model_9900.pt and begin the first real motion curriculum step on top of the current physically clean hold scaffold, rather than revisiting old bridge rungs or latest-checkpoint heuristics.
- The first motion-stage branch on top of retained M2 model_9900.pt was Unitree-G1-29dof-Wheelchair-Scratch-M3-CreepForward-Observed-LeftHardRightSoft-RelaxedHandle. This stage keeps the physically clean left-hard/right-soft hold scaffold, drops the stationary-chair objective, and introduces a small forward wheelchair command (0.10 m/s) with direct chair-motion shaping.
- The original motion-stage scalar was too forgiving. forward_motion_score had been using a clipped positive-only forward-velocity ratio, so high-survival rail runs could still look good even when the wheelchair was moving backward. The evaluator was then corrected to use a signed symmetric forward-velocity ratio together with lateral/yaw penalties. After that fix, old M3 and rail results had to be re-read.
- Under the corrected directional metric, the retained free-chair M3 branch is still not a usable forward-motion milestone. Raw transfer from retained M2 model_9900.pt into M3 came back at: forward_motion_score = -0.013570901006460152, clean_hold_rate = 0.484375, time_out_rate = 0.484375, wheelchair_forward_velocity_mean = 0.0001815104780253023, wheelchair_forward_velocity_ratio_symmetric = 0.0023294897258020943. A bounded free-chair M3 continuation improved only slightly: model_9999.pt with forward_motion_score = 0.025361138582229645, clean_hold_rate = 0.515625, time_out_rate = 0.515625, wheelchair_forward_velocity_mean = 0.003041791496798396, wheelchair_forward_velocity_ratio_symmetric = 0.030903536826372147. So the chair was still barely moving.
- The first rail curriculum branch, Unitree-G1-29dof-Wheelchair-Scratch-M3a-RailCreepForward-Observed-LeftHardRightSoft-RelaxedHandle, was then tested to simplify the motion problem. It solved survival on the rail, but after the scoring fix it was clearly a failed branch: the selected same-stage checkpoint model_9950.pt had forward_motion_score = -0.15196037504938428, clean_hold_rate = 0.9921875, time_out_rate = 0.9921875, wheelchair_forward_velocity_mean = -0.07597053050994873, wheelchair_forward_velocity_ratio_symmetric = -0.5507431030273438. Downstream transfer from that rail checkpoint back into free-chair M3 was still effectively zero-motion: forward_motion_score = -0.01093803327530615, clean_hold_rate = 0.4921875, wheelchair_forward_velocity_mean = 0.0001280088904313743.
- A second rail curriculum branch, Unitree-G1-29dof-Wheelchair-Scratch-M3b-RailDenseForward-Observed-LeftHardRightSoft-RelaxedHandle, widened the forward-velocity well and made dense chair progress dominate the rail stage. That improved the training signal, but it still did not produce a valid forward-motion milestone. The selected same-stage checkpoint model_9999.pt still moved backward on the rail: forward_motion_score = 0.04777805805206303, clean_hold_rate = 0.984375, time_out_rate = 0.984375, wheelchair_forward_velocity_mean = -0.07038901746273041, wheelchair_forward_velocity_ratio_symmetric = -0.5341796875. Its downstream transfer into free-chair M3 was slightly positive but still tiny: forward_motion_score = 0.015163969621062322, clean_hold_rate = 0.515625, wheelchair_forward_velocity_mean = 0.0014841918600723147, wheelchair_forward_velocity_ratio_symmetric = 0.015274673700332642.
- The current motion-stage read is therefore straightforward. The retained hold scaffold through M2 is physically clean, but the first forward-motion curriculum is still blocked. The corrected metric shows both rail branches failed to produce real forward chair motion, and the best free-chair M3 result is still only marginally above zero. The next useful lever is not more reward nudging on the same rail tasks; it should be a different motion-stage scaffold or command structure that cannot hide behind backward or near-stationary solutions.
- A third rail probe then tested whether the failure was mostly the shape of the motion reward itself: Unitree-G1-29dof-Wheelchair-Scratch-M3c-RailSignedForward-Observed-LeftHardRightSoft-RelaxedHandle. This branch removed the soft exponential chair-velocity matching term entirely and replaced it with strictly directional shaping: strong positive wheelchair_forward_progress, linear wheelchair_backward_velocity_l1, and the same rail constraint.
- That probe was a clean negative result. The rail stage itself still settled into backward motion:
  - selected same-stage checkpoint model_9900.pt: forward_motion_score = -0.09549030592315827, clean_hold_rate = 0.9921875, time_out_rate = 0.9921875, wheelchair_forward_velocity_mean = -0.06506837904453278, wheelchair_forward_velocity_ratio_symmetric = -0.49263304471969604
  Later checkpoints such as model_9999.pt stayed fully stable on the rail but still moved backward: forward_motion_score = -0.147451005372568, wheelchair_forward_velocity_mean = -0.06927454471588135.
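The scoring fix described above (clipped positive-only ratio replaced by a signed symmetric ratio) can be illustrated with a small sketch. The function names, the clipping bounds of 1.0, and the sample velocities are assumptions for illustration; the evaluator's actual aggregation across environments is not shown in this log and may differ.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def positive_only_ratio(v_forward: float, v_cmd: float) -> float:
    """Old-style ratio: backward motion is clipped to zero, so it is not penalized."""
    return clip(v_forward / v_cmd, 0.0, 1.0)


def signed_symmetric_ratio(v_forward: float, v_cmd: float) -> float:
    """Corrected-style ratio: backward motion maps to a negative value."""
    return clip(v_forward / v_cmd, -1.0, 1.0)


V_CMD = 0.10  # forward command in m/s used by the M3 stage

# Illustrative chair velocities in m/s (not measured data).
samples = [-0.08, 0.0, 0.05]

old = [positive_only_ratio(v, V_CMD) for v in samples]
new = [signed_symmetric_ratio(v, V_CMD) for v in samples]

for v, o, n in zip(samples, old, new):
    print(f"v = {v:+.2f} m/s  positive-only = {o:.2f}  signed = {n:+.2f}")
```

Under the positive-only form a backward-moving chair and a stationary chair score the same, which is how a high-survival rail run could look acceptable while moving the wrong way.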
- Downstream transfer from M3c back into free-chair M3 also did not improve. The best selected downstream checkpoint was again model_9900.pt with: forward_motion_score = -0.011835230141878147, clean_hold_rate = 0.484375, time_out_rate = 0.484375, wheelchair_forward_velocity_mean = 0.00040248059667646885, wheelchair_forward_velocity_ratio_symmetric = 0.004518650472164154. That is effectively the same as the raw retained M2 -> M3 transfer and confirms that reward-shape cleanup alone is not enough on the current left-hard/right-soft rail bridge.
- The motion-stage blocker is therefore narrower now. The clean hold ladder through M2 is still valid, but the current M3 family does not bridge into actual chair propulsion. The next useful branch should borrow more aggressively from the older minimal successful motion scaffolds rather than keep iterating inside the current M3a/M3b/M3c structure. The most likely levers are a more minimal motion reward set, larger action authority, and possibly a temporarily stronger motion-phase hand constraint.
- A fourth motion probe then borrowed more directly from the older minimal successful scaffolds: Unitree-G1-29dof-Wheelchair-Scratch-M3d-GroundLockHeavyDampedForward-Observed-LeftHardRightSoft-RelaxedHandle. This branch replaced the rail with a ground-lock and heavy planar damping scaffold, increased leg/waist/arm/wrist action authority substantially, relaxed base-height and orientation terminations, strengthened the right soft hand attachment, and reduced the reward set to a sparse motion core around forward progress, backward penalty, lateral/yaw penalties, lean bias, and low-weight hand geometry.
- The bounded M3d run was stable on its own constrained stage, but it exposed a new failure mode instead of solving motion. The selected same-stage checkpoint was model_9900.pt with: forward_motion_score = 0.2117772144381888, clean_hold_rate = 0.6484375, time_out_rate = 1.0, bad_orientation_rate = 0.0, wheelchair_forward_velocity_mean = 0.02916671335697174, but also invalid_contact_rate = 0.3515625, dominated by wheelchair_base_robot_contact = 0.34375 and wheelchair_right_handle_invalid_contact = 0.140625. So the constrained stage itself was already learning contact abuse instead of a clean push.
- Downstream transfer from M3d back into the real free-chair M3 did not help. The best selected downstream checkpoint was again model_9900.pt with: forward_motion_score = 0.005037643201649188, clean_hold_rate = 0.5, time_out_rate = 0.5, bad_orientation_rate = 0.5, wheelchair_forward_velocity_mean = 0.0013121002120897174, and invalid_contact_rate = 0.0. Later checkpoints model_9950.pt and model_9999.pt were worse downstream. This makes M3d a discard: it is weaker than the earlier bounded free-chair M3 continuation (0.025361138582229645) and weaker than the denser rail branch M3b (0.015163969621062322) on the actual forward-motion metric.
- The current motion-stage diagnosis is now sharper. Minimal reward cleanup, larger action authority, and a ground-lock/heavy-damping scaffold can produce a stable constrained-stage gait, but under the current observed left-hard/right-soft setup they still do not transfer into real free-chair propulsion and they reopen chair-contact exploitation. The next branch should change the motion-stage constraint structure itself rather than keep tuning within M3b/M3c/M3d.
- That next branch was M3e: Unitree-G1-29dof-Wheelchair-Scratch-M3e-GroundLockHeavyDampedForward-Observed-BothHard-RelaxedHandle. It keeps the same minimal M3d motion scaffold but replaces the left-hard/right-soft grip with both-hand hard attachment during the constrained motion bridge. This was a useful correction. Same-stage M3e became almost perfectly clean:
  - selected checkpoint model_9950.pt
  - forward_motion_score = 0.28799832941731435
  - clean_hold_rate = 0.9921875
  - time_out_rate = 1.0
  - bad_orientation_rate = 0.0
  - invalid_contact_rate = 0.0078125
  So the right-hand drift/bracing failure from M3d was largely removed.
- But forcing transfer from M3e back into the older free-chair soft-right M3 task still underperformed. The best selected downstream checkpoint was model_9950.pt with: forward_motion_score = 0.013585472479462624, clean_hold_rate = 0.5078125, time_out_rate = 0.5078125, bad_orientation_rate = 0.4921875, and invalid_contact_rate = 0.0. That is better than M3d, but still weaker than the earlier bounded free-chair continuation (0.025361138582229645). The important lesson is that the old left-hard/right-soft free-chair target had become the wrong curriculum target once the cleaner both-hard motion scaffold was introduced.
- The next stage therefore promoted the cleaner grip into the freer motion task itself: Unitree-G1-29dof-Wheelchair-Scratch-M3f-FreeYawHeavyDampedForward-Observed-BothHard-RelaxedHandle. M3f removes the ground-plane clamp from M3e while keeping the both-hard grip and heavy planar damping. This is the first strong positive result on the current motion ladder. Starting from retained M3e model_9950.pt, the bounded M3f continuation selected model_10049.pt with: forward_motion_score = 0.3852109075058252, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.014253754168748856, wheelchair_lateral_velocity_abs_mean = 0.0014836001209914684, and wheelchair_yaw_velocity_abs_mean = 0.004958887584507465. This makes M3f the new retained motion rung. It is materially better than every prior M3/M3a/M3b/M3c/M3d branch and, more importantly, it stays fully stable and contact-clean while the chair is freer than in the old ground-locked bridges.
- The motion curriculum has now changed shape. The retained path is no longer “left-hard/right-soft bridge back into the old free-chair M3 task.” The better path is:
  - retained M2 model_9900.pt
  - retained M3e model_9950.pt for clean both-hard constrained motion
  - retained M3f model_10049.pt for free-yaw heavy-damped both-hard forward motion
  The next useful lever is to start reducing the remaining heavy planar damping on M3f, not to return to the older soft-right motion family.
- A first attempt to reduce that planar damping was: Unitree-G1-29dof-Wheelchair-Scratch-M3g-FreeYawMediumDampedForward-Observed-BothHard-RelaxedHandle. M3g relaxed the retained M3f scaffold from heavy planar damping to medium planar damping (y_velocity_scale = 0.25, yaw_velocity_scale = 0.25) while also strengthening the lateral/line/yaw penalties. The bounded continuation from retained M3f model_10049.pt selected model_10100.pt with: forward_motion_score = 0.37746714847162366, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.011438323184847832, wheelchair_lateral_velocity_abs_mean = 0.007954314351081848, and wheelchair_yaw_velocity_abs_mean = 0.026653502136468887. This stayed fully stable and contact-clean, but it was still a slight regression from retained M3f on the primary motion score and a clear regression in lateral/yaw cleanliness, so M3g was discarded rather than promoted into the retained ladder.
- The next probe isolated the damping change instead of changing both dynamics and reward shaping at once: Unitree-G1-29dof-Wheelchair-Scratch-M3h-FreeYawIntermediateDampedForward-Observed-BothHard-RelaxedHandle. M3h keeps the retained M3f reward scaffold exactly the same and only relaxes the planar damping to an intermediate step (y_velocity_scale = 0.15, yaw_velocity_scale = 0.15). Starting from retained M3f model_10049.pt, the bounded continuation selected model_10148.pt with: forward_motion_score = 0.377978881332092, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.012244169600307941, wheelchair_lateral_velocity_abs_mean = 0.004626037552952766, and wheelchair_yaw_velocity_abs_mean = 0.015049846842885017. This is still a small same-task regression from retained M3f on pure forward-motion score, but unlike M3g it stayed close to M3f while materially increasing chair freedom. Because it is fully stable, fully contact-clean, and substantially cleaner than M3g, M3h is retained as the next motion rung.
- The retained motion ladder is now:
  - retained M2 model_9900.pt
  - retained M3e model_9950.pt
  - retained M3f model_10049.pt
  - retained M3h model_10148.pt
  The next useful lever is no longer another broad damping jump. It should be either a downstream test from M3h into a lighter rung or another narrowly isolated release of the planar damping, using M3h rather than M3f as the source.
- That next isolated release was: Unitree-G1-29dof-Wheelchair-Scratch-M3i-FreeYawMediumDampedForward-Observed-BothHard-RelaxedHandle. M3i starts from retained M3h and keeps the same reward scaffold, the same observations, and the same both-hard grip. The only change is another planar-damping release to y_velocity_scale = 0.25 and yaw_velocity_scale = 0.25. Starting from retained M3h model_10148.pt, the bounded continuation selected model_10200.pt with: forward_motion_score = 0.3813771064276807, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.012850762344896793, wheelchair_lateral_velocity_abs_mean = 0.007191838696599007, and wheelchair_yaw_velocity_abs_mean = 0.02351665124297142. This is a real improvement over retained M3h on the primary motion score while preserving full stability and zero invalid contact. It is still somewhat less laterally clean than retained M3f, but because the chair is now freer and the policy remained physically valid, M3i is retained as the next rung.
- The retained motion ladder is now:
  - retained M2 model_9900.pt
  - retained M3e model_9950.pt
  - retained M3f model_10049.pt
  - retained M3h model_10148.pt
  - retained M3i model_10200.pt
  The next useful experiment is to keep the same observation/reward scaffold again and either test a still lighter free-yaw damping rung from M3i, or start introducing explicit backward/turn command structure on top of this cleaner forward-motion ladder.
- That first lighter free-yaw damping jump from M3i was: Unitree-G1-29dof-Wheelchair-Scratch-M3j-FreeYawLightBridgeDampedForward-Observed-BothHard-RelaxedHandle. M3j kept the retained M3i scaffold unchanged and only relaxed the planar damping further to y_velocity_scale = 0.40 and yaw_velocity_scale = 0.40. Starting from retained M3i model_10200.pt, the bounded continuation selected model_10200.pt with: forward_motion_score = 0.37235095321666456, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.008854018524289131, wheelchair_lateral_velocity_abs_mean = 0.012296464294195175, and wheelchair_yaw_velocity_abs_mean = 0.039986491203308105. This stayed physically valid, but it regressed from retained M3i on the primary motion score, forward velocity, and lateral/yaw cleanliness, so M3j was discarded rather than promoted into the retained ladder.
- The finer bridge between retained M3i and discarded M3j was: Unitree-G1-29dof-Wheelchair-Scratch-M3k-FreeYawTransitionDampedForward-Observed-BothHard-RelaxedHandle. M3k kept the retained M3i reward and observation scaffold unchanged and only relaxed the planar damping partway to the failed M3j jump (y_velocity_scale = 0.325, yaw_velocity_scale = 0.325). Starting from retained M3i model_10200.pt, the bounded continuation selected model_10299.pt with: forward_motion_score = 0.3819064769661054, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.00948390457779169, wheelchair_lateral_velocity_abs_mean = 0.009322939440608025, and wheelchair_yaw_velocity_abs_mean = 0.030022375285625458. Relative to retained M3i, this is only a narrow primary-score improvement (0.3819064769661054 vs 0.3813771064276807) and it gives up some forward-velocity and yaw/lateral cleanliness. But it does so while keeping the chair freer than M3i, and it remains fully stable and fully contact-clean, so M3k is retained as the next bridge rung rather than discarded.
- The retained forward-motion ladder is now:
  - retained M2 model_9900.pt
  - retained M3e model_9950.pt
  - retained M3f model_10049.pt
  - retained M3h model_10148.pt
  - retained M3i model_10200.pt
  - retained M3k model_10299.pt
  The next useful lever is no longer another big damping jump. It should be either one more small free-yaw damping release from M3k, or the first explicit backward/turn command branch on top of this now-cleaner forward ladder.
- That next smaller free-yaw damping release from retained M3k was: Unitree-G1-29dof-Wheelchair-Scratch-M3l-FreeYawLightTransitionDampedForward-Observed-BothHard-RelaxedHandle. M3l kept the retained M3k reward and observation scaffold unchanged and only relaxed the planar damping again to y_velocity_scale = 0.35 and yaw_velocity_scale = 0.35. Starting from retained M3k model_10299.pt, the bounded continuation selected model_10398.pt with: forward_motion_score = 0.38274107103934507, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_forward_velocity_mean = 0.007933719083666801, wheelchair_lateral_velocity_abs_mean = 0.009734446182847023, and wheelchair_yaw_velocity_abs_mean = 0.03142866492271423. Relative to retained M3k, this is again only a narrow primary-score improvement (0.38274107103934507 vs 0.3819064769661054). It gives up some raw forward velocity and a bit of yaw/lateral cleanliness, but it does so at still-freer chair dynamics while staying fully stable and fully contact-clean. So M3l is retained as the next bridge rung, with the caveat that this branch is now clearly in diminishing-return territory.
- The retained forward-motion ladder is now:
  - retained M2 model_9900.pt
  - retained M3e model_9950.pt
  - retained M3f model_10049.pt
  - retained M3h model_10148.pt
  - retained M3i model_10200.pt
  - retained M3k model_10299.pt
  - retained M3l model_10398.pt
  The next useful branch should stop treating freer forward-only damping release as the only lever. The cleaner next move is either the first explicit backward/turn command-conditioned stage on top of M3l, or a motion-stage redesign that rewards more actual chair speed instead of marginal score gains from trading forward speed against lateral/yaw behavior.
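The damping-release probes above can be tabulated to make the diminishing-return pattern visible. This is only a bookkeeping sketch: the tuple layout and variable names are hypothetical, the numbers are the forward_motion_score and planar damping scale values reported above, and M3f is left out because its heavy-damping scale values are not stated in this log.

```python
# (rung, planar damping scale, forward_motion_score, retained) as reported above.
# M3g also changed the lateral/line/yaw penalties, so it is not a pure damping probe.
RUNGS = [
    ("M3g", 0.25, 0.37746714847162366, False),
    ("M3h", 0.15, 0.377978881332092, True),
    ("M3i", 0.25, 0.3813771064276807, True),
    ("M3j", 0.40, 0.37235095321666456, False),
    ("M3k", 0.325, 0.3819064769661054, True),
    ("M3l", 0.35, 0.38274107103934507, True),
]

retained = [rung for rung in RUNGS if rung[3]]
retained_names = [name for name, _, _, _ in retained]

# Score gain between consecutive retained rungs.
gains = {
    f"{prev[0]}->{cur[0]}": cur[2] - prev[2]
    for prev, cur in zip(retained, retained[1:])
}

for step, gain in gains.items():
    print(f"{step}: {gain:+.6f}")
```

Each retained step releases more damping than the one before it, while the score gains after M3i stay below one thousandth, which matches the diminishing-return caveat attached to M3l.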
- The evaluator now exposes a command-aligned linear motion metric, command_motion_score, for signed wheelchair command tasks. It keeps the same stability/contact penalties as forward_motion_score, but replaces the raw forward-velocity term with a command-aligned ratio so a physically good backward policy is not scored as failure just because its wheelchair velocity is negative in world X.
- The first explicit backward branch from retained M3l was: Unitree-G1-29dof-Wheelchair-Scratch-M4a-FreeYawLightTransitionDampedBackward-Observed-BothHard-RelaxedHandle. M4a keeps the retained M3l free-yaw damping scaffold, flips the commanded X velocity to -1.0, zeroes the forward-progress reward, switches wheelchair_backward_velocity to the linear form, and biases the robot lean target slightly backward. Zero-shot transfer from retained M3l model_10398.pt already produced real backward chair motion, but through obvious failure modes: command_motion_score = 0.25261443648487325, clean_hold_rate = 0.03125, time_out_rate = 0.0546875, bad_orientation_rate = 0.8984375, invalid_contact_rate = 0.890625.
- A bounded continuation on M4a selected model_10497.pt with: command_motion_score = 0.2613088373094797, clean_hold_rate = 0.1015625, time_out_rate = 0.125, bad_orientation_rate = 0.84375, invalid_contact_rate = 0.75, wheelchair_command_aligned_velocity_ratio = 0.4343257546424866, wheelchair_forward_velocity_mean = -0.4343257546424866, wheelchair_lateral_velocity_abs_mean = 0.0511283352971077, and wheelchair_yaw_velocity_abs_mean = 0.1161910742521286. This confirms that backward chair motion transfers directionally from the forward ladder, but the branch is still not physically usable. The continuation improves survival and reduces invalid contact compared with zero-shot transfer, yet it still relies heavily on bad orientation and chair-body contact. So M4a is not retained as a milestone; it is the first diagnostic backward branch, and the next backward iteration needs stronger stability/contact shaping rather than more damping release.
- A stricter signed-motion metric was then added: physical_command_motion_score. Unlike the earlier command_motion_score, it weights clean_hold_rate, time_out_rate, bad_orientation_rate, base_height_rate, and invalid_contact_rate much more heavily. The old scalar was too generous for backward pulling: it could score a fast but physically bad rollout as progress simply because the wheelchair moved backward at the commanded speed.
- The next backward branch was: Unitree-G1-29dof-Wheelchair-Scratch-M4b-FreeYawLightTransitionDampedBackward-Stabilized-Observed-BothHard-RelaxedHandle. M4b removed the duplicated raw backward-speed reward, kept the signed command-tracking term as the only dense backward objective, and restored light posture/contact shaping (flat_orientation_l2, base_height, robot_xy_velocity, robot_yaw_velocity, stronger wheelchair_invalid_contact). On a fair 64 env / 300 step deterministic eval, the best saved checkpoint was model_10400.pt with: physical_command_motion_score = -0.11300523318350317, clean_hold_rate = 0.0, time_out_rate = 0.0, bad_orientation_rate = 0.578125, base_height_rate = 0.03125, invalid_contact_rate = 0.484375, wheelchair_command_aligned_velocity_ratio = 0.5938286781311035. The useful diagnostic is where the contact lives: wheelchair_base_robot_contact = 658.8596, wheelchair_left_rear_wheel_robot_contact = 255.9626, wheelchair_right_rear_wheel_robot_contact = 21.5525, while both handle invalid-contact sensors stayed at 0.0.
- The fair 64 env / 300 step comparison against the original M4a model_10497.pt showed that M4b did not actually beat it as a branch. M4a model_10497.pt scored physical_command_motion_score = -0.10337315350770951 with the same clean_hold_rate = 0.0 and time_out_rate = 0.0. M4b did reduce base-height failures and removed the small left-handle invalid-contact leak, but it did not solve the real blocker. The dominant failure mode is still torso/chair-base plus left-rear-wheel contact while the robot collapses backward into the chair.
- A second probe moved the backward task earlier in the dynamics ladder: Unitree-G1-29dof-Wheelchair-Scratch-M4c-FreeYawHeavyDampedBackward-Stabilized-Observed-BothHard-RelaxedHandle. Zero-shot transfer from retained M3f model_10049.pt came back worse: physical_command_motion_score = -0.16158021707087755, bad_orientation_rate = 0.796875, invalid_contact_rate = 0.625, with the same dominant invalid-contact pattern (wheelchair_base_robot_contact plus wheelchair_left_rear_wheel_robot_contact). So stepping earlier to heavier free-yaw damping did not fix backward pulling either.
- The pre-M4d backward read after M4a/M4b/M4c was: backward chair motion transferred directionally from the retained forward ladder, but there was still no retained physically valid backward milestone. The meaningful lesson from M4a/M4b/M4c was that the blocker was not missing sign information in the reward, and it was not primarily handle contact. The blocker was geometric/postural collapse into the chair base and left rear wheel during pullback. So the next backward branch had to target that specific failure mode directly, likely with a changed backward manipulation scaffold or clearance/separation shaping rather than another raw speed reward change.
- The first branch that directly targeted that failure mode was: Unitree-G1-29dof-Wheelchair-Scratch-M4d-FreeYawHeavyDampedBackward-Creep-Observed-BothHard-RelaxedHandle. M4d keeps the retained M3f heavy-damped both-hard scaffold, but changes the backward curriculum in three specific ways:
  - it slows the commanded backward speed down to a fixed creep target of -0.14 m/s
  - it removes the extra raw backward-speed shaping and keeps signed command tracking as the dense motion term
  - it adds explicit robot-frame chair-separation shaping through wheelchair_robot_standoff, penalizing drift of the wheelchair root away from its nominal XY offset in the robot root frame
  This is the first backward branch that became physically clean instead of collapsing into the chair. Zero-shot transfer from retained M3f model_10049.pt on a full 64 env / 600 step deterministic eval produced: physical_command_motion_score = 1.2724524709396063, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.7269237637519836, wheelchair_forward_velocity_mean = -0.13679799437522888, wheelchair_lateral_velocity_abs_mean = 0.003717011772096157, and wheelchair_yaw_velocity_abs_mean = 0.009016682393848896. So the retained forward M3f checkpoint already transfers cleanly into slow backward creep when the task is eased enough and the chair-separation geometry is made explicit.
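The chair-separation shaping introduced in M4d (drift of the wheelchair root from its nominal XY offset, measured in the robot root frame) can be sketched as below. This is a hedged illustration rather than the project's reward term: the function name `standoff_error`, the yaw-only frame transform, the squared-error form, and the 0.6 m nominal offset are all assumptions.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def standoff_error(robot_xy: Vec2, robot_yaw: float, chair_xy: Vec2,
                   nominal_offset_xy: Vec2) -> float:
    """Squared XY drift of the chair from its nominal offset, in the robot frame.

    Only the robot's yaw is used for the frame change (roll and pitch are
    ignored, which is an assumption of this sketch).
    """
    dx = chair_xy[0] - robot_xy[0]
    dy = chair_xy[1] - robot_xy[1]
    c, s = math.cos(robot_yaw), math.sin(robot_yaw)
    # Rotate the world-frame offset by -yaw to express it in the robot frame.
    rel_x = c * dx + s * dy
    rel_y = -s * dx + c * dy
    ex = rel_x - nominal_offset_xy[0]
    ey = rel_y - nominal_offset_xy[1]
    return ex * ex + ey * ey


NOMINAL = (0.6, 0.0)  # hypothetical: chair root 0.6 m ahead of the robot root

# Robot at (1, 2) facing +Y; a chair exactly at the nominal standoff has zero error.
at_nominal = standoff_error((1.0, 2.0), math.pi / 2, (1.0, 2.6), NOMINAL)
# The chair pulled 0.2 m toward the robot is penalized.
too_close = standoff_error((1.0, 2.0), math.pi / 2, (1.0, 2.4), NOMINAL)
print(at_nominal, too_close)
```

A penalty of this shape grows as the chair closes in on the robot during pullback, which is the collapse pattern the M4a/M4b/M4c contact diagnostics pointed at.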
- A bounded low-noise continuation on M4d from retained M3f model_10049.pt then wrote three checkpoints: model_10050.pt, model_10100.pt, and model_10148.pt. All three stayed fully clean on the same 64 env / 600 step deterministic eval, and each slightly improved on the zero-shot baseline. The best saved checkpoint was model_10148.pt with: physical_command_motion_score = 1.282831170875579, command_motion_score = 1.1443229076452552, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.7383108139038086, wheelchair_forward_velocity_mean = -0.1339568942785263, wheelchair_lateral_velocity_abs_mean = 0.003613825421780348, and wheelchair_yaw_velocity_abs_mean = 0.008656003512442112. The gain over zero-shot is small, but it is real and consistent, so M4d model_10148.pt is retained as the first physically valid backward curriculum rung. It is not yet the final M5 backward-control milestone, because the commanded speed is still only a slow creep and the dynamics are still the easier heavy-damped branch, but it is the first backward stage that is worth building on instead of discarding.
- The next backward rung was a pure command-difficulty increase on top of retained M4d: Unitree-G1-29dof-Wheelchair-Scratch-M4e-FreeYawHeavyDampedBackward-Moderate-Observed-BothHard-RelaxedHandle. M4e keeps the same heavy-damped both-hard scaffold and the same explicit wheelchair_robot_standoff shaping, but increases the fixed commanded backward speed from -0.14 m/s to -0.25 m/s and slightly widens the command-tracking standard deviation to 0.10. Zero-shot transfer from retained M4d model_10148.pt stayed physically very strong: physical_command_motion_score = 1.106793893314898, clean_hold_rate = 0.984375, time_out_rate = 0.984375, bad_orientation_rate = 0.015625, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5650618672370911, wheelchair_forward_velocity_mean = -0.14822953939437866, wheelchair_lateral_velocity_abs_mean = 0.004035579971969128, and wheelchair_yaw_velocity_abs_mean = 0.010057952255010605. So the harder backward command did not reopen the old chair-contact failure mode; it only cost a small amount of stability and tracking.
- A short low-noise M4e continuation from retained M4d model_10148.pt then wrote two checkpoints: model_10150.pt and model_10197.pt. model_10150.pt regressed slightly by introducing a small base-contact leak (invalid_contact_rate = 0.015625), so it was not retained. The later checkpoint, model_10197.pt, recovered zero invalid contact and slightly improved the harder backward task over the zero-shot baseline: physical_command_motion_score = 1.108896442782134, command_motion_score = 0.9297696615569295, clean_hold_rate = 0.984375, time_out_rate = 0.984375, bad_orientation_rate = 0.015625, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5706008672714233, wheelchair_forward_velocity_mean = -0.14852306246757507, wheelchair_lateral_velocity_abs_mean = 0.003994424361735582, and wheelchair_yaw_velocity_abs_mean = 0.010136200115084648. This is still not the final backward-control milestone, because the branch remains slightly less stable than retained M4d and is still on the easier heavy-damped dynamics. But it is a valid retained curriculum rung: the policy stays contact-clean at the faster backward command and does not collapse back into the chair.
- The retained backward ladder is now:
  - retained M4d model_10148.pt for physically clean backward creep
  - retained M4e model_10197.pt for the first faster backward pull on the same clean scaffold
  The next useful branch is still not “final backward control.” The right next lever is another command/dynamics increase from M4e, while preserving the chair-separation shaping that removed the old base and rear-wheel collapse.
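The command-aligned ratio behind command_motion_score, introduced earlier for signed command tasks, can be sketched in a few lines. The function name and the clipping to [-1, 1] are assumptions, and the evaluator's real formula (for example per-environment clipping or averaging) is not given in this log and may differ; the M4a values are used only as a consistency check.

```python
def command_aligned_ratio(v_forward: float, v_cmd: float) -> float:
    """Velocity along the commanded direction, as a fraction of the command.

    Positive means the chair moves the way the command asks, regardless of
    whether the command points forward or backward in world X.
    """
    if v_cmd == 0.0:
        raise ValueError("a zero command has no direction to align with")
    ratio = v_forward / v_cmd
    return max(-1.0, min(1.0, ratio))


# M4a model_10497.pt: commanded X velocity -1.0, measured mean -0.4343257546424866.
m4a_ratio = command_aligned_ratio(-0.4343257546424866, -1.0)

# A chair drifting forward under a backward command scores negative.
wrong_way = command_aligned_ratio(0.05, -0.25)
print(m4a_ratio, wrong_way)
```

For the M4a checkpoint this reproduces the reported wheelchair_command_aligned_velocity_ratio, which is why a backward-moving chair is no longer scored as a failure just for having negative world-X velocity.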
- That next branch released dynamics instead of increasing the backward command again: Unitree-G1-29dof-Wheelchair-Scratch-M4f-FreeYawIntermediateDampedBackward-Moderate-Observed-BothHard-RelaxedHandle. M4f keeps the retained M4e backward command at -0.25 m/s, keeps the same both-hard grip and the same explicit wheelchair_robot_standoff shaping, and only relaxes the planar damping from the heavy branch to the retained forward M3h level (y_velocity_scale = 0.15, yaw_velocity_scale = 0.15). Zero-shot transfer from retained M4e model_10197.pt was already strong and fully clean: physical_command_motion_score = 1.1096096923574805, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5538078546524048, wheelchair_forward_velocity_mean = -0.1447066366672516, wheelchair_lateral_velocity_abs_mean = 0.012959773652255535, and wheelchair_yaw_velocity_abs_mean = 0.03162518888711929. The important part is not the raw score bump. It is that the backward policy stayed fully stable and fully contact-clean after the first real damping release.
- A short low-noise continuation on M4f from retained M4e model_10197.pt then wrote two checkpoints: model_10200.pt and model_10226.pt. Both stayed fully clean, but model_10200.pt was slightly better: physical_command_motion_score = 1.1096278823912142, command_motion_score = 0.9126209668815137, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5522634387016296, wheelchair_forward_velocity_mean = -0.14592473208904266, wheelchair_lateral_velocity_abs_mean = 0.012861572206020355, and wheelchair_yaw_velocity_abs_mean = 0.03229823708534241. The gain over zero-shot is extremely small, but it remains a valid retained rung because it preserves full physical cleanliness at the lighter damping level.
- The retained backward ladder is now:
  - retained M4d model_10148.pt for physically clean backward creep
  - retained M4e model_10197.pt for the first faster backward pull on heavy damping
  - retained M4f model_10200.pt for the same -0.25 m/s backward task after the first damping release
  The next useful branch is to keep the command fixed and continue releasing dynamics one rung at a time before asking for a still faster backward pull.
- The next backward rung was that exact dynamics-only release: Unitree-G1-29dof-Wheelchair-Scratch-M4g-FreeYawMediumDampedBackward-Moderate-Observed-BothHard-RelaxedHandle. M4g keeps the retained M4f command at -0.25 m/s, keeps the same both-hard grip and the same wheelchair_robot_standoff shaping, and only relaxes planar damping one more step to the retained forward M3i level (y_velocity_scale = 0.25, yaw_velocity_scale = 0.25). Zero-shot transfer from retained M4f model_10200.pt was already physically valid: physical_command_motion_score = 1.0627775263041257, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5247573852539062, wheelchair_forward_velocity_mean = -0.1451890766620636, wheelchair_lateral_velocity_abs_mean = 0.023683326318860054, and wheelchair_yaw_velocity_abs_mean = 0.056736476719379425. So M4g is a valid rung even before continuation, but it gives up some yaw/lateral cleanliness compared with retained M4f.
- A short low-noise continuation on M4g from retained M4f model_10200.pt then wrote two checkpoints: model_10200.pt and model_10229.pt. Both stayed fully stable and fully contact-clean. The later checkpoint was slightly better and is retained: physical_command_motion_score = 1.0778401739895345, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.536535382270813, wheelchair_forward_velocity_mean = -0.14621871709823608, wheelchair_lateral_velocity_abs_mean = 0.022800607606768608, and wheelchair_yaw_velocity_abs_mean = 0.05446473881602287. The score gain over the zero-shot M4g transfer is small, but it is real, and the branch stays fully physically valid at the freer damping level. The retained backward ladder is now:
  - retained M4d model_10148.pt for physically clean backward creep
  - retained M4e model_10197.pt for the faster heavy-damped backward pull
  - retained M4f model_10200.pt for the first damping release at -0.25 m/s
  - retained M4g model_10229.pt for the next medium-damped backward rung on the same clean scaffold
  The next useful move is the same pattern again: keep the backward command fixed and release dynamics one more rung before increasing backward speed.
- The next backward rung was that next dynamics-only release: Unitree-G1-29dof-Wheelchair-Scratch-M4h-FreeYawTransitionDampedBackward-Moderate-Observed-BothHard-RelaxedHandle. M4h keeps the retained M4g command at -0.25 m/s, keeps the same both-hard grip and the same wheelchair_robot_standoff shaping, and only relaxes planar damping one more step to the retained forward M3k level (y_velocity_scale = 0.325, yaw_velocity_scale = 0.325). Zero-shot transfer from retained M4g model_10229.pt stayed fully physically valid: physical_command_motion_score = 1.0293966956436635, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5074198842048645, wheelchair_forward_velocity_mean = -0.14734360575675964, wheelchair_lateral_velocity_abs_mean = 0.03193458169698715, and wheelchair_yaw_velocity_abs_mean = 0.07378971576690674. This is a small same-task regression versus retained M4g, but it is still a real physically valid backward rung at freer chair dynamics.
- A short low-noise continuation on M4h from retained M4g model_10229.pt then wrote two checkpoints: model_10250.pt and model_10258.pt. Both stayed fully stable and fully contact-clean. The later checkpoint was slightly better and is retained: physical_command_motion_score = 1.0570705771446227, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5301052927970886, wheelchair_forward_velocity_mean = -0.14599697291851044, wheelchair_lateral_velocity_abs_mean = 0.03068336471915245, and wheelchair_yaw_velocity_abs_mean = 0.07100684940814972. M4h still does not beat retained M4g on the primary same-task score, but it recovers part of the zero-shot drop and preserves full physical cleanliness at the freer damping level. The retained backward ladder is now:
  - retained M4d model_10148.pt for physically clean backward creep
  - retained M4e model_10197.pt for the faster heavy-damped backward pull
  - retained M4f model_10200.pt for the first damping release at -0.25 m/s
  - retained M4g model_10229.pt for the medium-damped backward rung
  - retained M4h model_10258.pt for the next transition-damped backward rung at the same command
  The next useful move is the same one again: keep the backward command fixed and release dynamics one more rung before increasing backward speed or mixing in turning commands.
- The next backward rung was the light-transition release to the retained forward M3l damping level: Unitree-G1-29dof-Wheelchair-Scratch-M4i-FreeYawLightTransitionDampedBackward-Moderate-Observed-BothHard-RelaxedHandle. M4i keeps the retained M4h command at -0.25 m/s, keeps the same both-hard grip and the same wheelchair_robot_standoff shaping, and only relaxes planar damping one more step to y_velocity_scale = 0.35 and yaw_velocity_scale = 0.35. Zero-shot transfer from retained M4h model_10258.pt exposed the first backward contact leak on this ladder: physical_command_motion_score = 1.039323188364506, clean_hold_rate = 0.984375, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.015625, with the entire leak concentrated in wheelchair_base_robot_contact. So M4i was initially beyond the clean backward boundary, but only slightly.
- A short low-noise continuation on M4i from retained M4h model_10258.pt then recovered that leak cleanly. The selected checkpoint model_10287.pt came back at: physical_command_motion_score = 1.0621339753270151, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_backward_velocity_ratio = 0.5296800136566162, wheelchair_forward_velocity_mean = -0.14203692972660065, wheelchair_lateral_velocity_abs_mean = 0.031298667192459106, and wheelchair_yaw_velocity_abs_mean = 0.07465916872024536. That makes M4i a valid retained rung after continuation. It is slightly freer than retained M4h, fully stable, and fully contact-clean again. The retained backward ladder is now:
  - retained M4d model_10148.pt for physically clean backward creep
  - retained M4e model_10197.pt for the faster heavy-damped backward pull
  - retained M4f model_10200.pt for the first damping release at -0.25 m/s
  - retained M4g model_10229.pt for the medium-damped backward rung
  - retained M4h model_10258.pt for the transition-damped backward rung
  - retained M4i model_10287.pt for the light-transition backward rung after contact recovery
  The next useful move is to decide whether the backward ladder can absorb one final dynamics release cleanly, or whether this is the point to stop releasing damping and start mixing in turn command structure.
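The damping release schedule on this ladder is easier to audit as data. The sketch below collects the planar damping scales and retained checkpoints quoted above for M4f through M4i. The list layout and the helper name `damping_steps` are illustrative assumptions, not the project's actual task configuration format.

```python
# Hypothetical summary of the backward damping ladder.
# Values are copied from this log; the structure itself is an assumption.
BACKWARD_LADDER = [
    # (rung, y_velocity_scale, yaw_velocity_scale, retained checkpoint)
    ("M4f", 0.15, 0.15, "model_10200.pt"),
    ("M4g", 0.25, 0.25, "model_10229.pt"),
    ("M4h", 0.325, 0.325, "model_10258.pt"),
    ("M4i", 0.35, 0.35, "model_10287.pt"),
]


def damping_steps(ladder):
    """Return the increase in y_velocity_scale between consecutive rungs."""
    return [
        (curr[0], round(curr[1] - prev[1], 3))
        for prev, curr in zip(ladder, ladder[1:])
    ]


steps = damping_steps(BACKWARD_LADDER)
for rung, delta in steps:
    print(f"{rung}: +{delta} planar velocity scale")
```

The release steps shrink from rung to rung, and the first zero-shot contact leak still appeared at M4i, the smallest step on the ladder.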
- The first turn branch started from the retained free-yaw both-hard motion scaffold: Unitree-G1-29dof-Wheelchair-Scratch-M6a-FreeYawHeavyDampedLeftTurn-Observed-BothHard-RelaxedHandle. M6a keeps the retained M3f chair dynamics, sets lin_vel_x = 0, lin_vel_y = 0, ang_vel_z = +0.35, enables the new wheelchair yaw-tracking reward, and suppresses forward-motion-specific shaping. Zero-shot transfer from retained M3f model_10049.pt was already a valid left-turn milestone with the corrected turn gate: physical_turn_motion_score = 0.6742888854518624, turn_motion_score = 1.3318987463135272, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = 0.887657642364502, wheelchair_command_aligned_yaw_ratio_clipped = 0.8880758881568909, wheelchair_yaw_tracking_mean = 0.3497166037559509, and wheelchair_yaw_velocity_mean = 0.5226268768310547. So M6a is retained immediately. Left turning exists on the current ladder.
- The mirrored right-turn task was then added as Unitree-G1-29dof-Wheelchair-Scratch-M6b-FreeYawHeavyDampedRightTurn-Observed-BothHard-RelaxedHandle. M6b is intentionally the exact mirror of M6a except for the command sign: ang_vel_z = -0.35. This branch exposed a real asymmetry. Zero-shot transfer from retained M3f model_10049.pt stayed physically clean, but the chair still turned left: wheelchair_yaw_velocity_mean = 0.35613349080085754, wheelchair_command_aligned_yaw_ratio_symmetric = -0.6341801881790161, and turn_motion_score = -0.29364724527113134. A short low-noise continuation from the same source improved rollout cleanliness but not signed turning. The best checkpoint from that run, model_10050.pt, still failed the actual task: physical_turn_motion_score = 0.0456698986813967, turn_motion_score = -0.22634734716266397, clean_hold_rate = 0.96875, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = -0.5703282356262207, and wheelchair_yaw_velocity_mean = 0.32910603284835815. So M6b is not retained as a right-turn milestone. It is a stable wrong-sign turn.
- That wrong-sign behavior is not just a bad warm start from one checkpoint. The same right-turn task was probed from the cleaner motion and hold sources as well:
  - retained M3e model_9950.pt: wheelchair_command_aligned_yaw_ratio_symmetric = -0.6749926209449768
  - retained M4i model_10287.pt: wheelchair_command_aligned_yaw_ratio_symmetric = -0.571495771408081
  - retained M2 model_9900.pt: wheelchair_command_aligned_yaw_ratio_symmetric = -0.6318390965461731
  All of them stayed mostly or fully stable and contact-clean, and all of them still turned left under the negative yaw command. That means the immediate blocker for right turning is not simply a bad source checkpoint. The current both-hard heavy-damped turn scaffold is itself left-biased under the mirrored command.
- The turn evaluator had to be corrected before retaining any right-turn result. The original physical_turn_motion_score could still stay moderately positive when the rollout was stable and contact-clean even if the chair turned the wrong way. The corrected gate now multiplies the physical turn base score by wheelchair_command_aligned_yaw_ratio_clipped, so wrong-sign yaw cannot win selection just by surviving cleanly. This matters for future automation too. The wheelchair observed policy already includes signed velocity_commands, so the right-turn failure is not explained by a missing yaw command observation. The next right-turn branch should change the turn scaffold itself, most likely by changing grip asymmetry or handle-assist structure, not by blindly running more of the current mirrored both-hard task.
- That next scaffold change was exactly the missing mirrored asymmetry: Unitree-G1-29dof-Wheelchair-Scratch-M6c-FreeYawHeavyDampedRightTurn-Observed-RightHardLeftSoft-RelaxedHandle. M6c keeps the same heavy-damped right-turn command as M6b, but replaces the failing both-hard grip with the mirror of the successful left-turn asymmetry: the right hand stays as the hard attachment and the left hand becomes a bounded soft assist with mirrored handle-position and axis-alignment shaping. This immediately fixed the sign problem. Zero-shot transfer from retained M3f model_10049.pt came back fully stable, fully contact-clean, and directionally correct: physical_turn_motion_score = 0.6617681152270161, turn_motion_score = 1.322267762525007, clean_hold_rate = 1.0, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = 0.712121307849884, wheelchair_command_aligned_yaw_ratio_clipped = 0.7412710785865784, wheelchair_yaw_tracking_mean = 0.6739866733551025, and wheelchair_yaw_velocity_mean = -0.27772974967956543. So M6c is retained immediately as the first valid right-turn milestone.
- The current turn read is now much cleaner:
  - retained M6a for left turn on the both-hard heavy-damped scaffold
  - discarded M6b because the mirrored both-hard right-turn scaffold stayed wrong-sign
  - retained M6c because the mirrored right-hard/left-soft scaffold fixes the sign without losing stability
  The important lesson is that turning on this wheelchair is not symmetric under the both-hard grip. Left turn works on the symmetric scaffold, but right turn needs the mirrored asymmetric grip structure. That is now concrete evidence, not speculation.
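The corrected turn gate described above amounts to a single multiplication. The sketch below is a minimal illustration, assuming the alignment ratio is the achieved yaw rate divided by the commanded yaw rate and that it is clipped to [0, 1] so wrong-sign yaw maps to zero. The function names, the ratio definition, the clipping range, and the example base score of 0.8 are all assumptions; the actual evaluator may compute each of these differently.

```python
def clip_aligned_yaw_ratio(yaw_velocity, yaw_command, lo=0.0, hi=1.0):
    """Achieved / commanded yaw rate, clipped to [lo, hi] (assumed definition)."""
    if yaw_command == 0.0:
        return 0.0  # no turn was commanded, so there is nothing to align with
    return max(lo, min(hi, yaw_velocity / yaw_command))


def gated_physical_turn_score(base_score, yaw_velocity, yaw_command):
    """Multiply the physical base score by the clipped alignment ratio."""
    return base_score * clip_aligned_yaw_ratio(yaw_velocity, yaw_command)


# Illustrative inputs only: a stable, contact-clean rollout with base score 0.8.
correct_sign = gated_physical_turn_score(0.8, yaw_velocity=-0.28, yaw_command=-0.35)
wrong_sign = gated_physical_turn_score(0.8, yaw_velocity=0.33, yaw_command=-0.35)
print(correct_sign, wrong_sign)
```

Under this gate a rollout that survives cleanly but turns against the command scores zero, which is the property the corrected evaluator needed before any right-turn result could be retained.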
- The next question was whether that retained M6c right-hard/left-soft scaffold could serve as a single fixed grip structure for a future unified controller. Three direct probes answered that. The first was a forward probe on the same heavy-damped motion stage: Unitree-G1-29dof-Wheelchair-Scratch-M3m-FreeYawHeavyDampedForward-Observed-RightHardLeftSoft-RelaxedHandle. Zero-shot transfer from retained M3f model_10049.pt stayed upright, but it was not a valid forward milestone: physical_command_motion_score = 0.6925527523155324, clean_hold_rate = 0.921875, time_out_rate = 1.0, invalid_contact_rate = 0.078125, with the entire leak concentrated in wheelchair_left_handle_invalid_contact, and wheelchair_forward_velocity_ratio = 0.05961471050977707. So the fixed right-hard/left-soft scaffold is poor for forward propulsion.
- The second probe was a left-turn task on the same fixed right-hard/left-soft scaffold: Unitree-G1-29dof-Wheelchair-Scratch-M6d-FreeYawHeavyDampedLeftTurn-Observed-RightHardLeftSoft-RelaxedHandle. Zero-shot transfer from retained M3f model_10049.pt stayed physically clean, but it flipped the sign just like the earlier failed right-turn both-hard branch: physical_turn_motion_score = 0.018247194337265026, turn_motion_score = -0.32261592620052393, clean_hold_rate = 1.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = -0.7156335711479187, and wheelchair_yaw_velocity_mean = -0.28035983443260193. So the fixed right-hard/left-soft scaffold is not a valid left-turn scaffold either.
- The third probe was a backward task on that same fixed right-hard/left-soft scaffold: Unitree-G1-29dof-Wheelchair-Scratch-M4j-FreeYawHeavyDampedBackward-Moderate-Observed-RightHardLeftSoft-RelaxedHandle. This one did work. Zero-shot transfer from retained M4i model_10287.pt came back fully stable and contact-clean: physical_command_motion_score = 1.315110251214355, command_motion_score = 1.1925020365975796, clean_hold_rate = 1.0, time_out_rate = 1.0, invalid_contact_rate = 0.0, and wheelchair_command_aligned_velocity_ratio_symmetric = 0.8068245649337769. So the fixed right-hard/left-soft scaffold is strong for backward pulling, just like it is for right turning.
- That gives a clean scaffold split:
  - both-hard heavy-damped is a strong retained scaffold for forward and left turn
  - right-hard/left-soft heavy-damped is a strong retained scaffold for backward and right turn
  - no single fixed grip structure tested so far supports all four commands cleanly
  This means the next unified-controller branch should not assume one fixed hard/soft asymmetry. The most defensible next scaffold is either:
  - a command-conditioned attachment/stiffness scaffold that changes dominance with the command sign, or
  - a new neutral two-soft-hands scaffold that is explicitly tested across all four commands before mixed-command training starts.
- That neutral two-soft-hands branch was then tested directly on the same heavy-damped motion stage. Four zero-shot probes answered it cleanly. The forward probe was Unitree-G1-29dof-Wheelchair-Scratch-M3n-FreeYawHeavyDampedForward-Observed-BothSoft-RelaxedHandle. Zero-shot transfer from retained M3f model_10049.pt stayed fully stable and fully contact-clean: physical_command_motion_score = 0.9078378646518104, clean_hold_rate = 1.0, time_out_rate = 1.0, invalid_contact_rate = 0.0, and wheelchair_forward_velocity_ratio = 0.3246929943561554. But it also carried a very large lateral-offset bias: wheelchair_lateral_offset_norm = 0.9989199042320251. So the neutral scaffold can translate forward without falling apart, but it is not especially clean.
- The backward probe was Unitree-G1-29dof-Wheelchair-Scratch-M4k-FreeYawHeavyDampedBackward-Moderate-Observed-BothSoft-RelaxedHandle. Zero-shot transfer from retained M4i model_10287.pt was the strongest result on this neutral scaffold: physical_command_motion_score = 1.3269853260484525, clean_hold_rate = 1.0, time_out_rate = 1.0, invalid_contact_rate = 0.0, and wheelchair_command_aligned_velocity_ratio_symmetric = 0.6706080436706543. So the neutral scaffold is fully viable for backward pulling.
- The turning probes were the real discriminator. Left turn on Unitree-G1-29dof-Wheelchair-Scratch-M6e-FreeYawHeavyDampedLeftTurn-Observed-BothSoft-RelaxedHandle stayed fully stable and contact-clean, but it produced almost no yaw authority: physical_turn_motion_score = 0.005771564438546955, clean_hold_rate = 1.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = 0.007619310170412064, and wheelchair_yaw_velocity_mean = 0.0026667648926377296. Right turn on Unitree-G1-29dof-Wheelchair-Scratch-M6f-FreeYawHeavyDampedRightTurn-Observed-BothSoft-RelaxedHandle failed the same way: physical_turn_motion_score = 0.003245790785507999, clean_hold_rate = 1.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = 0.003883844008669257, and wheelchair_yaw_velocity_mean = -0.001359348651021719. So the neutral two-soft scaffold is not a viable fixed turning scaffold in either direction. It preserves stability by effectively refusing to generate yaw.
- That resolves the neutral-scaffold question.
  - both-soft heavy-damped can support translation, especially backward
  - both-soft heavy-damped does not provide usable left-turn or right-turn control
  - the fixed neutral scaffold therefore does not solve the unified-controller problem
  The next unified branch should not keep adding fixed grip variants. It should move to a command-conditioned attachment or stiffness scaffold that changes left/right dominance with the requested motion.
- A narrower follow-up tested exactly that next idea, but only on turning before widening it into a full mixed-command scaffold. Two heavy-damped zero-shot probes used bounded soft attachments on both hands, with the left/right spring gains selected from the commanded yaw sign at runtime:
  - Unitree-G1-29dof-Wheelchair-Scratch-M6g-FreeYawHeavyDampedLeftTurn-Observed-CommandConditionedSoft-RelaxedHandle
  - Unitree-G1-29dof-Wheelchair-Scratch-M6h-FreeYawHeavyDampedRightTurn-Observed-CommandConditionedSoft-RelaxedHandle
  The left-turn probe stayed almost fully stable, but it still produced essentially no yaw: physical_turn_motion_score = 0.006829837649564176, clean_hold_rate = 0.984375, invalid_contact_rate = 0.015625, wheelchair_command_aligned_yaw_ratio_symmetric = 0.005293993279337883, and wheelchair_yaw_velocity_mean = 0.0018529020017012954. The tiny contact leak was concentrated entirely in wheelchair_base_robot_contact. The right-turn probe stayed fully stable and fully contact-clean, but it failed the same way: physical_turn_motion_score = 0.00034842319505137, clean_hold_rate = 1.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = -0.00770591339096427, and wheelchair_yaw_velocity_mean = 0.002697076415643096. So command-conditioned stiffness on an all-soft grip is still too weak. It preserves stability, but it does not create usable turning authority.
- That means the next unified-controller branch should skip further soft-only variants. The likely next mechanism is a command-conditioned hybrid scaffold that changes the attachment class itself, not just the spring gain: for example a dominant hard or spherical attachment on the commanded turn side plus a bounded assist on the opposite hand. The turning blocker is no longer “which side should dominate”; it is that soft-only dominance does not transmit enough yaw authority into the chair.
- The remaining open question on turning was whether the successful right-turn asymmetry was just incidental to that one task family, or whether the hard/soft attachment class itself really determines the turn sign. Two direct probes on the existing left-hard/right-soft scaffold answered that cleanly:
  - Unitree-G1-29dof-Wheelchair-Scratch-M6i-FreeYawHeavyDampedLeftTurn-Observed-LeftHardRightSoft-RelaxedHandle
  - Unitree-G1-29dof-Wheelchair-Scratch-M6j-FreeYawHeavyDampedRightTurn-Observed-LeftHardRightSoft-RelaxedHandle
  On M6i, zero-shot transfer from retained M3f model_10049.pt was fully stable and fully contact-clean, and it produced a very strong correct-sign left turn: physical_turn_motion_score = 0.7484071274287998, turn_motion_score = 1.4161572684533892, clean_hold_rate = 1.0, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = 1.0, and wheelchair_yaw_velocity_mean = 0.8289926052093506. On M6j, the exact same scaffold flipped back to the wrong sign under the mirrored right-turn command: physical_turn_motion_score = 0.0, turn_motion_score = -0.6045880594581831, clean_hold_rate = 0.984375, invalid_contact_rate = 0.0, wheelchair_command_aligned_yaw_ratio_symmetric = -0.9685173630714417, and wheelchair_yaw_velocity_mean = 0.4204327166080475.
- That completes the asymmetric turn table:
  - both-hard heavy-damped: strong left turn, wrong-sign right turn
  - right-hard/left-soft heavy-damped: strong right turn, wrong-sign left turn
  - left-hard/right-soft heavy-damped: strong left turn, wrong-sign right turn
  - soft-only heavy-damped: stable but essentially no turn in either direction
  So the next unified-controller branch is no longer ambiguous. It should be a command-conditioned hybrid attachment policy that switches the dominant hard/soft side with the turn command sign. The turn problem is not just reward shaping; it is attachment topology.
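The turn table suggests a simple selection rule for the proposed command-conditioned hybrid. The following is a hypothetical sketch only: no such selector appears in this log, the names `AttachmentTopology` and `select_topology` are invented for illustration, and the mapping just encodes which fixed scaffold produced a correct-sign turn in the probes above. The both-hard fallback for a zero yaw command is also an assumption, chosen because the retained backward ladder runs on that grip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachmentTopology:
    """Attachment class per hand: 'hard' or 'soft' (illustrative labels)."""
    left_hand: str
    right_hand: str


LEFT_HARD_RIGHT_SOFT = AttachmentTopology(left_hand="hard", right_hand="soft")
RIGHT_HARD_LEFT_SOFT = AttachmentTopology(left_hand="soft", right_hand="hard")
BOTH_HARD = AttachmentTopology(left_hand="hard", right_hand="hard")


def select_topology(ang_vel_z: float) -> AttachmentTopology:
    """Pick the scaffold whose fixed-grip probe turned with the commanded sign."""
    if ang_vel_z > 0.0:
        return LEFT_HARD_RIGHT_SOFT  # left turn: correct sign on M6i
    if ang_vel_z < 0.0:
        return RIGHT_HARD_LEFT_SOFT  # right turn: correct sign on M6c
    return BOTH_HARD  # assumption: no yaw command, keep the translation grip


print(select_topology(+0.35))
print(select_topology(-0.35))
```

How such a switch would actually be applied inside the simulator is left open by this sketch.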
- A first direct attempt to encode that mixed topology inside one task did not fail as a policy result; it failed mechanically during environment bring-up and was removed. The discarded task was: Unitree-G1-29dof-Wheelchair-Scratch-M6k-FreeYawHeavyDampedMixedTurn-Observed-SplitHybrid-RelaxedHandle. The design was:
  - even environment ids: left-hard/right-soft
  - odd environment ids: right-hard/left-soft
  - command sign split by environment parity so both left and right turn commands existed in one batch
  The reason for the env-parity split was an Isaac Lab reset-order constraint: reset-mode events run before the command manager resamples commands, so the hard attachment side cannot be switched from the freshly sampled yaw sign in the ordinary reset event path.
  In practice, M6k never reached deterministic evaluation. On repeated runs at 1, 2, 16, and 64 environments, the process exited during environment setup before the manager/observation summary and before any AUTORESEARCH_METRIC or output file was written. The same evaluator and source checkpoint completed normally on the neighboring fixed task M6a, so this was not a shared evaluator failure. The last emitted M6k log lines were the base-environment setup banner plus the standard /World/envs/.../Robot rigid-body-property warnings, then silent exit. So M6k is discarded as a task-construction/runtime failure, not as evidence about mixed-command policy quality. The lesson is narrower: the current env-parity mixed-turn scaffold is not robust inside the present Isaac Lab manager/event path. The next unified-controller branch should stay inside the mechanically proven task family and use a different command-conditioned hybrid mechanism instead of parity-split hard/soft reset events.
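For reference, the env-parity design that M6k attempted can be sketched in plain Python. This shows the intended assignment only. It is not Isaac Lab code, the function name `parity_split` is hypothetical, and both the 0.35 yaw magnitude (borrowed from M6a and M6b) and the pairing of even ids with the positive yaw command are assumptions.

```python
def parity_split(num_envs, yaw_magnitude=0.35):
    """Assign grip topology and yaw command sign from environment id parity."""
    assignments = []
    for env_id in range(num_envs):
        if env_id % 2 == 0:
            # Even ids: left-hard/right-soft, paired here with a left turn.
            assignments.append((env_id, "left-hard/right-soft", +yaw_magnitude))
        else:
            # Odd ids: right-hard/left-soft, paired here with a right turn.
            assignments.append((env_id, "right-hard/left-soft", -yaw_magnitude))
    return assignments


plan = parity_split(4)
for env_id, grip, yaw_cmd in plan:
    print(env_id, grip, yaw_cmd)
```

Because the grip side depends only on the environment index, it is known before any command is sampled, which is how the design tried to sidestep the reset-order constraint described above.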
- The next unified-turn probe stayed entirely inside the mechanically proven both-hard heavy-damped scaffold and only mixed the command sign: Unitree-G1-29dof-Wheelchair-Scratch-M6l-FreeYawHeavyDampedMixedTurn-Observed-BothHard-RelaxedHandle. This is the first clean fixed-scaffold mixed left/right turn task. Zero-shot transfer from retained M3f model_10049.pt was fully stable and fully contact-clean: physical_turn_motion_score = 0.3606618200187657, turn_motion_score = 0.5175286232959478, clean_hold_rate = 1.0, time_out_rate = 1.0, invalid_contact_rate = 0.0, and wheelchair_command_aligned_yaw_ratio_symmetric = 0.10960268974304199. So the both-hard scaffold does not collapse under mixed-sign yaw commands. The failure mode is narrower: it still carries a strong left-turn bias, but it is no longer a pure wrong-sign branch.
- A matched 50-iteration model-only continuation on M6l from the same retained M3f model_10049.pt produced a small real improvement and is worth retaining as the current mixed-sign baseline:
  - run: unitree_g1_29dof_wheelchair_scratch_m6l_freeyaw_heavydamped_mixedturn_observed_both_hard_relaxedhandle/2026-05-26_13-05-14_mixedturn_bothhard_from_m3f10049_modelonly_50it
  - retained checkpoint: model_10098.pt
  - deterministic eval: physical_turn_motion_score = 0.3666192910436557, turn_motion_score = 0.5391623704694211, clean_hold_rate = 1.0, time_out_rate = 1.0, invalid_contact_rate = 0.0, and wheelchair_command_aligned_yaw_ratio_symmetric = 0.12723684310913086
  This is still far from a good unified turn controller, but it is the first retained mixed-sign turn rung that stays fully stable and fully contact-clean while showing nonzero bidirectional command alignment on one fixed attachment topology.
- A follow-up reward-shaped branch tried to improve that same mixed-sign task without changing the scaffold itself. The discarded task was: Unitree-G1-29dof-Wheelchair-Scratch-M6m-FreeYawHeavyDampedMixedTurn-Observed-BothHard-CommandShape-RelaxedHandle. It kept the same both-hard heavy-damped runtime scaffold as M6l, and only changed training rewards so that the hand-handle pose and axis-alignment shaping followed the commanded yaw sign. As expected, zero-shot evaluation was identical to M6l, because the runtime policy weights were unchanged. The meaningful check was the same matched 50-iteration model-only continuation from M3f model_10049.pt:
  - run: unitree_g1_29dof_wheelchair_scratch_m6m_freeyaw_heavydamped_mixedturn_observed_both_hard_command_shape_relaxedhandle/2026-05-26_13-11-32_mixedturn_bothhard_commandshape_from_m3f10049_modelonly_50it
  - evaluated checkpoint: model_10098.pt
  - deterministic eval: physical_turn_motion_score = 0.3541501714724594, turn_motion_score = 0.519261335162446, clean_hold_rate = 0.984375, time_out_rate = 0.984375, bad_orientation_rate = 0.015625, invalid_contact_rate = 0.0, and wheelchair_command_aligned_yaw_ratio_symmetric = 0.11556387692689896
  So this first reward-only command-conditioned branch did not beat retained M6l. It slightly regressed the mixed-sign turn score, slightly reduced symmetric yaw alignment, and reintroduced a small orientation failure rate. The conclusion is specific: a fixed both-hard scaffold can support a clean mixed-sign turn baseline, but the first command-conditioned single-side reward shaping pass was not a useful lever. The retained mixed-turn rung stays M6l model_10098.pt.
- I then tested whether the mixed-sign both-hard baseline simply needed more continuation rather than a new mechanism. A longer bounded model-only continuation from retained M3f model_10049.pt was launched on M6l: unitree_g1_29dof_wheelchair_scratch_m6l_freeyaw_heavydamped_mixedturn_observed_both_hard_relaxedhandle/2026-05-26_13-21-47_mixedturn_bothhard_from_m3f10049_modelonly_200it. I stopped it after the 50/100/150 checkpoint ladder was already written, because that was enough to answer the selection question. The deterministic eval results were:
  - model_10050.pt: physical_turn_motion_score = 0.3518808914450896, wheelchair_command_aligned_yaw_ratio_symmetric = 0.1114727109670639, clean_hold_rate = 1.0, invalid_contact_rate = 0.0
  - model_10100.pt: physical_turn_motion_score = 0.35305543622110797, wheelchair_command_aligned_yaw_ratio_symmetric = 0.09481938183307648, clean_hold_rate = 0.984375, invalid_contact_rate = 0.0
  - model_10150.pt: physical_turn_motion_score = 0.3413121377635986, wheelchair_command_aligned_yaw_ratio_symmetric = 0.08664043247699738, clean_hold_rate = 0.984375, invalid_contact_rate = 0.015625
  None of those beat the earlier retained short-run checkpoint M6l model_10098.pt, which stayed at physical_turn_motion_score = 0.3666192910436557 and wheelchair_command_aligned_yaw_ratio_symmetric = 0.12723684310913086 with clean_hold_rate = 1.0 and invalid_contact_rate = 0.0. So the longer same-task continuation is not the right lever. M6l does improve slightly from zero-shot, but then drifts backward with more training. The retained mixed-sign turn rung remains the short 50-iteration M6l model_10098.pt result, and the next useful branch should change task structure again rather than just train M6l longer.
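The selection applied to this checkpoint ladder can be written down explicitly. The sketch below assumes a rule of "require full cleanliness, then maximize physical_turn_motion_score". That rule is inferred from how checkpoints are compared in this entry and is not taken from the actual selection script; it is also stricter than the log always applies, since M4e was retained at clean_hold_rate = 0.984375. Metric values are copied from the results above.

```python
# Deterministic eval results copied from this log entry.
CANDIDATES = {
    "M6l model_10098.pt (retained, 50it run)": {
        "physical_turn_motion_score": 0.3666192910436557,
        "clean_hold_rate": 1.0,
        "invalid_contact_rate": 0.0,
    },
    "model_10050.pt": {
        "physical_turn_motion_score": 0.3518808914450896,
        "clean_hold_rate": 1.0,
        "invalid_contact_rate": 0.0,
    },
    "model_10100.pt": {
        "physical_turn_motion_score": 0.35305543622110797,
        "clean_hold_rate": 0.984375,
        "invalid_contact_rate": 0.0,
    },
    "model_10150.pt": {
        "physical_turn_motion_score": 0.3413121377635986,
        "clean_hold_rate": 0.984375,
        "invalid_contact_rate": 0.015625,
    },
}


def select_checkpoint(candidates):
    """Assumed rule: keep fully clean rollouts, then take the best physical score."""
    clean = {
        name: metrics
        for name, metrics in candidates.items()
        if metrics["clean_hold_rate"] == 1.0
        and metrics["invalid_contact_rate"] == 0.0
    }
    if not clean:
        return None
    return max(clean, key=lambda name: clean[name]["physical_turn_motion_score"])


best = select_checkpoint(CANDIDATES)
print(best)
```

Under this rule the earlier short-run checkpoint stays selected, matching the conclusion above.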
- The next task-structure branch was: Unitree-G1-29dof-Wheelchair-Scratch-M6n-FreeYawHeavyDampedMixedTurn-RapidResample-Observed-BothHard-RelaxedHandle. M6n kept the exact same both-hard heavy-damped scaffold as retained M6l, but changed the command schedule so the yaw command resampled every 2.0 s inside the episode instead of staying fixed for the whole rollout. The point was to force sign-switch experience instead of letting the policy treat left and right turns as separate static episodes. I launched a matched 50-iteration model-only continuation from retained M6l model_10098.pt: unitree_g1_29dof_wheelchair_scratch_m6n_freeyaw_heavydamped_mixedturn_rapidresample_observed_both_hard_relaxedhandle/2026-05-26_13-38-42_rapidresample_from_m6l10098_modelonly_50it and evaluated the resulting model_10100.pt back on the standard downstream M6l mixed-turn task. The downstream deterministic eval came back at: physical_turn_motion_score = 0.3448920971138997, turn_motion_score = 0.5191973022185267, clean_hold_rate = 0.984375, time_out_rate = 0.984375, base_height_rate = 0.015625, invalid_contact_rate = 0.0, and wheelchair_command_aligned_yaw_ratio_symmetric = 0.11264941841363907. So the rapid intra-episode sign-switch curriculum also failed to beat retained M6l model_10098.pt. It stayed contact-clean, but it regressed both the mixed-turn physical score and the symmetric yaw-alignment score on the standard task. The conclusion is again narrow: the unified-turn blocker is not just lack of command switching within the episode. The retained mixed-turn rung still stays M6l model_10098.pt, and the next branch needs a different control or topology mechanism rather than more schedule tweaks on the same both-hard scaffold.
- I then tightened the evaluator instead of guessing from the aggregate mixed-turn score. eval_wheelchair_m0.py now emits turn_metrics_by_command_sign, splitting the turn metrics into left_turn_command and right_turn_command subsets. I reran the retained mixed-turn baseline M6l model_10098.pt on the standard M6l task with the full episode horizon (64 envs, 500 steps). The split result is decisive:
  - left_turn_command: physical_turn_motion_score = 0.6922063695049961, turn_motion_score = 1.3198288734070955, wheelchair_command_aligned_yaw_ratio_symmetric = 0.8864503502845764, wheelchair_yaw_velocity_mean = 0.459000825881958, clean_hold_rate = 1.0, invalid_contact_rate = 0.0
  - right_turn_command: physical_turn_motion_score = 0.07233190653455447, turn_motion_score = -0.19419106543064119, wheelchair_command_aligned_yaw_ratio_symmetric = -0.585963785648346, wheelchair_yaw_velocity_mean = 0.3227216303348541, clean_hold_rate = 1.0, invalid_contact_rate = 0.0
  So M6l is not a uniformly weak mixed-sign controller. It is effectively the retained left-turn specialist plus a still-broken right-turn branch on the same both-hard scaffold. The aggregate retained score physical_turn_motion_score = 0.3666192910436557 is mostly just the average of one good sign and one wrong sign. That narrows the next experiment substantially: the next unified-turn branch should target right-turn sign correction directly, not perturb the whole mixed-turn task globally.
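A split like turn_metrics_by_command_sign can be computed by grouping per-environment results on the sign of the commanded yaw rate. The sketch below is a hypothetical reimplementation for illustration: it does not reproduce eval_wheelchair_m0.py, the per-environment record layout is an assumption, and the toy records are made-up numbers rather than evaluation output.

```python
from statistics import mean


def split_by_command_sign(records, key="physical_turn_motion_score"):
    """Average `key` separately over left-turn and right-turn commanded envs."""
    groups = {"left_turn_command": [], "right_turn_command": []}
    for rec in records:
        if rec["ang_vel_z"] > 0.0:
            groups["left_turn_command"].append(rec[key])
        elif rec["ang_vel_z"] < 0.0:
            groups["right_turn_command"].append(rec[key])
        # Zero yaw commands belong to neither subset and are skipped.
    return {name: mean(vals) for name, vals in groups.items() if vals}


# Toy per-environment records (illustrative values only).
records = [
    {"ang_vel_z": 0.35, "physical_turn_motion_score": 0.7},
    {"ang_vel_z": 0.35, "physical_turn_motion_score": 0.6},
    {"ang_vel_z": -0.35, "physical_turn_motion_score": 0.1},
    {"ang_vel_z": -0.35, "physical_turn_motion_score": 0.0},
]
split = split_by_command_sign(records)
print(split)
```

Applied to the retained M6l numbers, the two per-sign scores 0.692 and 0.072 average to about 0.382, close to the 0.367 aggregate, which is consistent with reading the aggregate as one good sign averaged with one wrong sign.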
-
I tested that exact mirrored branch next instead of guessing: a mixed-sign turn task on the retained right-turn scaffold, Unitree-G1-29dof-Wheelchair-Scratch-M6o-FreeYawHeavyDampedMixedTurn-Observed-RightHardLeftSoft-RelaxedHandle. M6o kept the same heavy-damped turn dynamics and symmetric yaw-command range as M6l, but replaced the both-hard grip with the retained right-hard/left-soft turn topology from M6c. I ran a matched 50-iteration model-only continuation from retained M3f model_10049.pt: logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6o_freeyaw_heavydamped_mixedturn_observed_right_hard_left_soft_relaxedhandle/2026-05-26_13-56-11_mixedturn_righthardleftsoft_from_m3f10049_modelonly_50it and evaluated the saved checkpoints with the split-turn metrics. The later checkpoint model_10098.pt was slightly better than model_10050.pt, but the result simply mirrored M6l instead of solving it:
  - aggregate: physical_turn_motion_score = 0.3394886074186614, turn_motion_score = 0.4556515691103414, wheelchair_command_aligned_yaw_ratio_symmetric = -0.02971457690000534, clean_hold_rate = 1.0, invalid_contact_rate = 0.0
  - left_turn_command: physical_turn_motion_score = 0.03331444319650689, turn_motion_score = -0.3082868215395137, wheelchair_command_aligned_yaw_ratio_symmetric = -0.7320753931999207, wheelchair_yaw_velocity_mean = -0.2430439293384552
  - right_turn_command: physical_turn_motion_score = 0.7535063807269622, turn_motion_score = 1.3776461312081665, wheelchair_command_aligned_yaw_ratio_symmetric = 0.8179622292518616, wheelchair_yaw_velocity_mean = -0.3072325587272644
So M6o is the exact mirror image of retained M6l: strong right turn, wrong-sign left turn, fully stable, and fully contact-clean. It does not beat retained M6l as a unified mixed-turn controller, so M6o is discarded as a fixed-scaffold mixed branch and removed from code. The lesson is now concrete: neither fixed both-hard nor fixed right-hard/left-soft can support symmetric mixed-sign turning. The next unified-turn branch needs command-conditioned topology or a similar sign-aware mechanism, not another fixed attachment scaffold.
-
I then moved to the first command-conditioned mixed-turn scaffold that stays inside the mechanically proven task family instead of switching attachment topology at reset time: Unitree-G1-29dof-Wheelchair-Scratch-M6p-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedSoft-RelaxedHandle. M6p replaces the fixed both-hard grip with two bounded soft hand-handle attachments whose dominant side follows the commanded yaw sign at runtime:
  - left-turn command: left side dominant, right side assist
  - right-turn command: right side dominant, left side assist
  - near-zero yaw: both sides neutral
The zero-shot transfer from retained M6l model_10098.pt was immediately useful because it changed the sign behavior without reopening posture failure:
  - aggregate: physical_turn_motion_score = 0.09044851200650537, turn_motion_score = 0.6346616450930015, wheelchair_command_aligned_yaw_ratio_symmetric = 0.11344851553440094, clean_hold_rate = 0.921875, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.078125
  - left_turn_command: physical_turn_motion_score = 0.04624660266167488, turn_motion_score = 0.5735906860558317, wheelchair_command_aligned_yaw_ratio_symmetric = 0.0562153197824955, wheelchair_yaw_velocity_mean = 0.010243130847811699
  - right_turn_command: physical_turn_motion_score = 0.14185192968349883, turn_motion_score = 0.708367930725217, wheelchair_command_aligned_yaw_ratio_symmetric = 0.18252304196357727, wheelchair_yaw_velocity_mean = -0.02506144717335701
That is the first mixed-sign branch here where both yaw-command signs stayed physically valid and produced the correct yaw sign on one shared runtime mechanism. But authority was still extremely weak, and invalid contact was still real.
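The three-way rule above (left dominant, right dominant, neutral) can be sketched as a small gain selector. This is only an illustration of the idea, not the project's helper: the SoftGains type, the gain values, the deadband, and the sign convention (positive yaw command = left turn) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SoftGains:
    stiffness: float
    damping: float

# Hypothetical gain presets; the real values are not stated in this log.
DOMINANT = SoftGains(stiffness=400.0, damping=40.0)
ASSIST = SoftGains(stiffness=100.0, damping=10.0)
NEUTRAL = SoftGains(stiffness=200.0, damping=20.0)

def select_handle_gains(yaw_command, deadband=0.05):
    """Pick per-side soft-attachment gains from the commanded yaw sign.

    Assumption: positive yaw command = left turn.
    """
    if yaw_command > deadband:
        return {"left": DOMINANT, "right": ASSIST}
    if yaw_command < -deadband:
        return {"left": ASSIST, "right": DOMINANT}
    return {"left": NEUTRAL, "right": NEUTRAL}

if __name__ == "__main__":
    for cmd in (0.4, -0.4, 0.0):
        print(cmd, select_handle_gains(cmd))
```

Because the selection depends only on the current command, the same rule can be re-evaluated every step without changing the attachment topology at reset time.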
-
I then ran a bounded 50-iteration model-only continuation on M6p from retained M6l model_10098.pt: logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6p_freeyaw_heavydamped_mixedturn_observed_command_conditioned_soft_relaxedhandle/2026-05-26_14-13-42_mixedturn_commandsoft_from_m6l10098_modelonly_50it and evaluated both saved checkpoints. The later checkpoint model_10147.pt was the better one:
  - aggregate: physical_turn_motion_score = 0.11287512941327117, turn_motion_score = 0.680165586457588, wheelchair_command_aligned_yaw_ratio_symmetric = 0.1318761706352234, clean_hold_rate = 0.984375, time_out_rate = 1.0, bad_orientation_rate = 0.0, invalid_contact_rate = 0.015625
  - left_turn_command: physical_turn_motion_score = 0.056229882400059016, turn_motion_score = 0.6117976468289271, wheelchair_command_aligned_yaw_ratio_symmetric = 0.06599722802639008, wheelchair_yaw_velocity_mean = 0.01132544968277216
  - right_turn_command: physical_turn_motion_score = 0.17932415988238903, turn_motion_score = 0.7626786550041288, wheelchair_command_aligned_yaw_ratio_symmetric = 0.21138525009155273, wheelchair_yaw_velocity_mean = -0.02904720976948738
Training therefore improved M6p in the right direction: higher aggregate physical turn score, better symmetric yaw alignment, and much lower invalid contact than zero-shot. But it still does not beat retained M6l model_10098.pt as a mixed-turn milestone, and its left-turn authority is still far below the retained left-turn branch. So M6p is worth keeping as the first command-conditioned sign-correct unified-turn foothold, but not as the retained mixed-turn controller. The next useful branch should strengthen command-conditioned authority rather than go back to another fixed scaffold.
-
I then tested exactly that stronger command-conditioned branch: Unitree-G1-29dof-Wheelchair-Scratch-M6q-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RelaxedHandle. M6q keeps the same M6p runtime mechanism, but makes the commanded dominant side much stiffer and the assist side much weaker. Zero-shot from retained M6p model_10147.pt immediately showed the right tradeoff: yaw authority increased substantially, but the right-turn side reopened contact and posture failures:
  - aggregate zero-shot: physical_turn_motion_score = 0.180478867037474, turn_motion_score = 0.7685924386605621, wheelchair_command_aligned_yaw_ratio_symmetric = 0.4062581956386566, clean_hold_rate = 0.765625, bad_orientation_rate = 0.140625, invalid_contact_rate = 0.234375
A short bounded continuation from M6p model_10147.pt improved the branch materially. The better checkpoint was model_10150.pt:
  - aggregate: physical_turn_motion_score = 0.21757397106793339, turn_motion_score = 0.7585547987371685, wheelchair_command_aligned_yaw_ratio_symmetric = 0.3074813783168793, clean_hold_rate = 0.828125, time_out_rate = 0.953125, bad_orientation_rate = 0.015625, invalid_contact_rate = 0.171875
  - left_turn_command: physical_turn_motion_score = 0.18983607529271296, wheelchair_command_aligned_yaw_ratio_symmetric = 0.24410340189933777, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.21480752041858034, wheelchair_command_aligned_yaw_ratio_symmetric = 0.38397204875946045, clean_hold_rate = 0.6896551847457886, invalid_contact_rate = 0.3103448152542114
This is the first branch here where both left and right turn commands have clearly positive physical-turn scores on one shared mechanism. But it is not retainable as a milestone because the right-turn half is still too dirty.
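The per-sign rates above are easier to read as episode counts. The sketch below recovers the simplest fraction behind a reported rate, under the assumption that each rate is a float32-rounded ratio of whole episodes out of at most 64 envs (the env count used in the earlier split eval). The function name is hypothetical, and the recovered denominators are reduced fractions, so they are only a lower bound on the true subset size.

```python
from fractions import Fraction

def rate_to_count(rate, max_envs=64):
    """Return (numerator, denominator) of the closest fraction to `rate`
    whose denominator does not exceed `max_envs`."""
    frac = Fraction(rate).limit_denominator(max_envs)
    return frac.numerator, frac.denominator

if __name__ == "__main__":
    # Per-sign clean_hold_rate values quoted for M6q model_10150.pt.
    print("left :", rate_to_count(0.9428571462631226))   # 33 of 35
    print("right:", rate_to_count(0.6896551847457886))   # 20 of 29
```

Read this way, the split is consistent with 35 left-turn and 29 right-turn episodes (35 + 29 = 64), so a one-episode change on the right-turn side moves its rates by roughly 0.034. That is an inference from the printed rates, not something the evaluator reports directly.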
-
I then checked whether that remaining M6q failure could be cleaned up with reward-side pressure instead of another mechanism change: Unitree-G1-29dof-Wheelchair-Scratch-M6r-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-ContactClean-RelaxedHandle. M6r keeps the M6q mechanism intact and only increases wheelchair_invalid_contact and wheelchair_robot_standoff penalties. That did not solve the real problem. The better checkpoint was model_10150.pt:
  - aggregate: physical_turn_motion_score = 0.2020381042806937, turn_motion_score = 0.7551943441852926, wheelchair_command_aligned_yaw_ratio_symmetric = 0.3301670253276825, clean_hold_rate = 0.8125, time_out_rate = 0.875, bad_orientation_rate = 0.078125, invalid_contact_rate = 0.171875
  - left_turn_command: physical_turn_motion_score = 0.19608352782752075, wheelchair_command_aligned_yaw_ratio_symmetric = 0.2514706254005432, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.15149955009310193, wheelchair_command_aligned_yaw_ratio_symmetric = 0.4251454472541809, clean_hold_rate = 0.6551724076271057, bad_orientation_rate = 0.17241379618644714, invalid_contact_rate = 0.3103448152542114
So M6r confirmed the failure surface rather than fixing it: stronger command-conditioned soft dominance can create usable bidirectional yaw authority, but simply turning up invalid-contact and standoff penalties does not buy back the right-turn cleanliness. The blocker is now narrower than before: the next mixed-turn branch should keep the sign-conditioned authority idea from M6q, but solve the right-turn contact geometry with a different physical scaffold rather than more penalty weight on the same setup.
-
I also tested a narrower reward-only follow-up on top of retained M6p model_10147.pt: Unitree-G1-29dof-Wheelchair-Scratch-M6s-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedSoft-DominantGeometry-RelaxedHandle. M6s kept the original M6p soft command-conditioned turn scaffold, but replaced the shared hand-handle geometry shaping with commanded dominant-side-only geometry rewards. The idea was to preserve the clean M6p runtime mechanism while giving the commanded turn side a clearer spatial target without jumping all the way to the stronger M6q authority settings. The bounded 50-iteration model-only continuation run was: logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6s_freeyaw_heavydamped_mixedturn_observed_command_conditioned_soft_dominant_geometry_relaxedhandle/2026-05-26_14-58-24_mixedturn_commandsoft_dominantgeom_from_m6p10147_modelonly_50it and produced one saved checkpoint, model_10150.pt. Compared against retained M6p model_10147.pt, M6s model_10150.pt regressed slightly on the aggregate mixed-turn gate:
  - retained M6p model_10147.pt: physical_turn_motion_score = 0.11287512941327117, turn_motion_score = 0.680165586457588, wheelchair_command_aligned_yaw_ratio_symmetric = 0.1318761706352234, clean_hold_rate = 0.984375, invalid_contact_rate = 0.015625
  - M6s model_10150.pt: physical_turn_motion_score = 0.11120020173864448, turn_motion_score = 0.6665303901769223, wheelchair_command_aligned_yaw_ratio_symmetric = 0.12793606519699097, clean_hold_rate = 0.96875, invalid_contact_rate = 0.03125
The sign split makes the failure mode more specific:
  - left_turn_command improved a little on authority (physical_turn_motion_score = 0.07252853426676785 vs 0.056229882400059016) but reopened contact (invalid_contact_rate = 0.02857142873108387 vs 0.0)
  - right_turn_command got worse on authority (physical_turn_motion_score = 0.158437060450738 vs 0.17932415988238903) while staying just as dirty as retained M6p (invalid_contact_rate = 0.03448275849223137)
Aggregate invalid contact also shifted the failure surface the wrong way:
  - retained M6p: wheelchair_base_robot_contact = 0.015625, wheelchair_right_handle_invalid_contact = 0.015625
  - M6s: wheelchair_base_robot_contact = 0.03125, wheelchair_left_handle_invalid_contact = 0.015625
So M6s did not solve the mixed-turn bottleneck. Dominant-side-only geometry shaping is not enough on top of the original soft scaffold. I discarded M6s from code. The next mixed-turn branch should continue from the M6q lesson instead: preserve sign-conditioned authority, but change the right-turn physical scaffold or contact geometry rather than just reweighting geometry rewards inside the weaker M6p mechanism.
-
I then restored the stronger M6q mixed-turn scaffold into the live codebase because it had only existed in saved run artifacts, not in the current source tree: Unitree-G1-29dof-Wheelchair-Scratch-M6q-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RelaxedHandle. The first reconstruction pass exposed a real runtime bug in the helper path: ScratchM6q... was calling _apply_command_conditioned_turn_soft_handle_scaffold(...) with explicit dominant/assist gain overrides, but the current helper implementation only accepted the original fixed signature. That meant the task could compile but not instantiate. After restoring the missing helper parameters, I reran the deterministic zero-shot eval from retained M6p model_10147.pt and recovered the historical M6q zero-shot branch exactly:
  - aggregate zero-shot: physical_turn_motion_score = 0.180478867037474, turn_motion_score = 0.7685924386605621, wheelchair_command_aligned_yaw_ratio_symmetric = 0.4062581956386566, clean_hold_rate = 0.765625, bad_orientation_rate = 0.140625, invalid_contact_rate = 0.234375
  - left_turn_command: physical_turn_motion_score = 0.1849697791430749, invalid_contact_rate = 0.02857142873108387
  - right_turn_command: physical_turn_motion_score = -0.008921457944580906, invalid_contact_rate = 0.48275861144065857
So the code restoration is now mechanically verified: the live M6q implementation reproduces the historical zero-shot behavior and remains the correct stronger shared-turn scaffold to branch from.
-
The next bounded follow-up on top of that restored scaffold was: Unitree-G1-29dof-Wheelchair-Scratch-M6u-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-DominantGeometry-RelaxedHandle. M6u kept the stronger M6q command-conditioned soft dominance and reintroduced dominant-side-only hand-handle geometry shaping, with the specific goal of cleaning up the right-turn contact leak without giving back the both-sign authority. As expected, zero-shot from M6p model_10147.pt was identical to zero-shot M6q because only the reward changed. The real test was a matched bounded 50-iteration model-only continuation: logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6u_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_dominant_geometry_relaxedhandle/2026-05-26_16-07-39_mixedturn_commandstrongsoft_dominantgeom_from_m6p10147_modelonly_50it which produced two saved checkpoints:
  - model_10150.pt: physical_turn_motion_score = 0.203279605285046, turn_motion_score = 0.7342491181567311, wheelchair_command_aligned_yaw_ratio_symmetric = 0.32976973056793213, clean_hold_rate = 0.796875, bad_orientation_rate = 0.03125, invalid_contact_rate = 0.203125 with right_turn_command physical_turn_motion_score = 0.14647181343215585 and right_turn_command invalid_contact_rate = 0.41379308700561523
  - model_10196.pt: physical_turn_motion_score = 0.19345759608092344, turn_motion_score = 0.7403149347752332, wheelchair_command_aligned_yaw_ratio_symmetric = 0.31351807713508606, clean_hold_rate = 0.828125, bad_orientation_rate = 0.09375, invalid_contact_rate = 0.140625 with right_turn_command physical_turn_motion_score = 0.13699392383603276 and right_turn_command invalid_contact_rate = 0.27586206793785095
This is a real improvement over zero-shot M6q because the right-turn half becomes physically positive instead of wrong-sign. But it still does not beat the historical bounded M6q continuation on the actual tradeoff we care about. Historical M6q model_10150.pt stayed stronger on authority (physical_turn_motion_score = 0.21757397106793339, right-turn 0.21480752041858034) while landing at a comparable aggregate cleanliness point (clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875). M6u buys back some right-turn contact, but it does so by giving back too much turn authority and reintroducing more bad orientation on the later checkpoint. So M6u is discarded from code. The retained lesson is narrower: preserving M6q sign-conditioned authority is correct, but dominant-side-only geometry shaping is still not the mechanism that resolves the right-turn scaffold cleanly.
-
I then fixed the mixed-turn evaluator itself so the diagnosis path could be trusted. The per-handle/per-sensor breakdown mode in scripts/rsl_rl/eval_wheelchair_m0.py had been preallocating filter tensors from the static config lists, but the live relaxed-handle sensors were returning a different filter count at runtime. That caused the diagnosis run to crash with a 31 != 30 tensor-width mismatch instead of telling us what was actually colliding. I changed that evaluator path to size breakdown buffers from the live sensor shapes and to label them from the runtime filter expressions when available. After that fix, the historical bounded M6q model_10150.pt breakdown became usable and showed the real right-turn leak:
  - wheelchair_right_handle_invalid_contact was dominated by right_wrist_pitch_link = 27904.484375
  - the next contributors were much smaller: right_wrist_roll_link = 237.5985107421875, right_elbow_link = 124.99657440185547, pelvis = 77.53118896484375
  - wheelchair_base_robot_contact was also mostly hand/wrist spill: right_rubber_hand = 380.507568359375, right_wrist_pitch_link = 353.1176452636719, right_wrist_yaw_link = 320.6406555175781
Based on that, I ran one more bounded probe: Unitree-G1-29dof-Wheelchair-Scratch-M6v-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-DominantWrist-RelaxedHandle. M6v kept the strong-soft M6q scaffold, started from the historical bounded M6q model_10150.pt, and added a dominant-side wrist pitch/roll deviation term with the explicit goal of reducing the right_wrist_pitch_link handle intrusion without giving away two-sign turn authority. The bounded continuation run was: unitree_rl_lab/logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6v_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_dominant_wrist_relaxedhandle/2026-05-26_16-34-03_mixedturn_commandstrongsoft_dominantwrist_from_m6q10150_modelonly_50it and produced model_10199.pt as the only new saved checkpoint. Deterministic eval on M6v model_10199.pt came back at:
  - aggregate: physical_turn_motion_score = 0.21214263804737943, turn_motion_score = 0.7641600643284618, wheelchair_command_aligned_yaw_ratio_symmetric = 0.32313433289527893, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875, bad_orientation_rate = 0.046875, base_height_rate = 0.078125
  - left_turn_command: physical_turn_motion_score = 0.17840703906227545, wheelchair_command_aligned_yaw_ratio_symmetric = 0.24167948961257935, clean_hold_rate = 0.9142857193946838, invalid_contact_rate = 0.08571428805589676
  - right_turn_command: physical_turn_motion_score = 0.21401638972764173, wheelchair_command_aligned_yaw_ratio_symmetric = 0.42144185304641724, clean_hold_rate = 0.7241379022598267, invalid_contact_rate = 0.27586206793785095
Relative to historical bounded M6q model_10150.pt, this did exactly what the targeted wrist penalty was supposed to do on the weak side:
  - right-turn invalid contact improved: 0.3103448152542114 -> 0.27586206793785095
  - right-turn clean hold improved: 0.6896551847457886 -> 0.7241379022598267
  - and the dominant leak body was cut almost in half: wheelchair_right_handle_invalid_contact -> right_wrist_pitch_link 27904.484375 -> 13843.240234375
But it was still not a win overall. Aggregate physical_turn_motion_score regressed slightly (0.21757397106793339 -> 0.21214263804737943), aggregate invalid_contact_rate stayed exactly flat at 0.171875, and the contact mostly shifted into adjacent geometry rather than disappearing: wheelchair_base_robot_contact -> right_rubber_hand rose to 735.692138671875, while right_wrist_roll_link also rose to 469.6550598144531. So M6v confirmed a useful physical fact: the dominant-side wrist term can reduce the specific right_wrist_pitch_link handle strike, but by itself it does not solve the mixed-turn scaffold. It just trades that leak for neighboring hand/base contact and degrades the left-turn half enough to lose overall. I discarded M6v from code. The retained outcome of this round is the evaluator fix plus a narrower diagnosis: the next M6q-family branch should change the dominant-side physical contact scaffold, not just add more reward pressure on the same soft geometry.
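The evaluator fix described in this entry comes down to sizing the breakdown buffers from what the sensor actually returns, not from the static config. A minimal plain-Python sketch of that pattern follows; it is not the code in eval_wheelchair_m0.py. The function names, the list-of-rows data layout, and the generic fallback labels are assumptions for illustration.

```python
def make_breakdown_buffers(live_rows, runtime_filter_exprs=None):
    """Allocate per-filter accumulators sized from the live sensor output.

    live_rows: one row per env, each row holding one force value per filter
               body as returned at runtime (assumed layout).
    runtime_filter_exprs: optional labels read from the live sensor; used
               only if their count matches the live width.
    """
    width = len(live_rows[0]) if live_rows else 0
    if runtime_filter_exprs is not None and len(runtime_filter_exprs) == width:
        labels = list(runtime_filter_exprs)
    else:
        # Fall back to generic labels instead of trusting a static list.
        labels = [f"filter_{i}" for i in range(width)]
    return [0.0] * width, labels

def accumulate(totals, live_rows):
    """Add each env's per-filter force into the running totals."""
    for row in live_rows:
        if len(row) != len(totals):
            raise ValueError(f"width mismatch: {len(row)} != {len(totals)}")
        for i, value in enumerate(row):
            totals[i] += value
    return totals

if __name__ == "__main__":
    static_config_names = [f"body_{i}" for i in range(30)]  # stale static list
    live_rows = [[1.0] * 31, [2.0] * 31]                    # runtime width is 31
    totals, labels = make_breakdown_buffers(live_rows, static_config_names)
    accumulate(totals, live_rows)
    print(len(totals), labels[0], totals[0])
```

With the buffers sized this way, a 31-wide live sensor no longer collides with a 30-entry static list; the mismatch degrades to generic labels instead of a crash.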
-
I then tested a more structural command-conditioned hybrid branch: Unitree-G1-29dof-Wheelchair-Scratch-M6w-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedHybrid-RelaxedHandle. The intent was to use a hard dominant-side hand-handle joint on the commanded turn side and keep a soft assist on the other side, switching that topology with the yaw-command sign. The zero-shot transfer from retained M6q model_10150.pt immediately showed that this was not a real control improvement even before training: physical_turn_motion_score = 0.0035975821035016385, turn_motion_score = 0.4383784196572379, wheelchair_command_aligned_yaw_ratio_symmetric = -0.1019144207239151, clean_hold_rate = 1.0, and invalid_contact_rate = 0.0. So M6w was physically clean but essentially non-turning and wrong-sign. I still ran a bounded continuation attempt from retained M6q model_10150.pt, but that branch exposed a more serious issue: reset-time USD joint mutation on this scaffold triggered Isaac/PhysX hierarchy/xformstack errors during training instead of producing a valid checkpoint ladder. I stopped that run and discarded M6w from code. The lesson is specific: command-conditioned reset-time hard/soft joint rewriting is not currently a mechanically valid mixed-turn scaffold in this environment stack.
-
The next branch kept the proven runtime mechanism from M6q and changed only the right-turn dominant-side strength: Unitree-G1-29dof-Wheelchair-Scratch-M6x-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedAsymmetricSoft-RelaxedHandle. M6x kept the left-turn dominant side at the retained M6q strong-soft values, but softened the right-turn dominant side to reduce the specific right wrist/base intrusion seen in the M6q breakdown. Zero-shot transfer from retained M6q model_10150.pt came back at:
  - aggregate: physical_turn_motion_score = 0.1523773732366186, turn_motion_score = 0.6750587449409067, wheelchair_command_aligned_yaw_ratio_symmetric = 0.1960894614458084, clean_hold_rate = 0.875, invalid_contact_rate = 0.125, bad_orientation_rate = 0.0
  - left_turn_command: physical_turn_motion_score = 0.1895894598691404, wheelchair_command_aligned_yaw_ratio_symmetric = 0.24422724545001984, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.1128378109060148, wheelchair_command_aligned_yaw_ratio_symmetric = 0.13799212872982025, clean_hold_rate = 0.7931034564971924, invalid_contact_rate = 0.20689654350280762
That was directionally plausible: it was cleaner than retained M6q, especially on the right-turn side, but it gave back a lot of turn authority. I then ran a matched bounded 50-iteration model-only continuation from retained M6q model_10150.pt: unitree_rl_lab/logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6x_freeyaw_heavydamped_mixedturn_observed_command_conditioned_asymmetric_soft_relaxedhandle/2026-05-26_17-10-40_mixedturn_asymmetricsoft_from_m6q10150_modelonly_50it The run produced one new checkpoint, model_10199.pt, and it regressed relative to its own zero-shot start:
  - aggregate: physical_turn_motion_score = 0.13453308225478522, turn_motion_score = 0.6417798833455891, wheelchair_command_aligned_yaw_ratio_symmetric = 0.18083100020885468, clean_hold_rate = 0.84375, invalid_contact_rate = 0.15625, bad_orientation_rate = 0.0
  - left_turn_command: physical_turn_motion_score = 0.17649054176165396, wheelchair_command_aligned_yaw_ratio_symmetric = 0.22434625029563904, clean_hold_rate = 0.9142857193946838, invalid_contact_rate = 0.08571428805589676
  - right_turn_command: physical_turn_motion_score = 0.09141481387408519, wheelchair_command_aligned_yaw_ratio_symmetric = 0.1283126175403595, clean_hold_rate = 0.7586206793785095, invalid_contact_rate = 0.24137930572032928
Compared with retained historical bounded M6q model_10150.pt, M6x is slightly cleaner in aggregate (0.15625 vs 0.171875 invalid-contact rate) but materially weaker on the real turn objective (0.1345 vs 0.2176 physical turn score), and it loses most of that gap on the right-turn half. So M6x is not retained and was removed from code. The useful conclusion is narrower than before: softening the weak-side dominant attachment can buy some cleanliness, but on this scaffold it buys it by giving away too much two-sign turn authority.
-
I then tested the opposite-side support hypothesis directly: Unitree-G1-29dof-Wheelchair-Scratch-M6y-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-LeftAssist-RelaxedHandle. M6y kept the retained M6q dominant-side strength intact, but increased the left-side assist used during right-turn episodes. The intent was to preserve right-turn authority while letting the off-side hand help stabilize the torso and keep the dominant right wrist out of the chair. That branch failed immediately on zero-shot transfer from retained M6q model_10150.pt, so I did not run a continuation. Deterministic zero-shot eval came back at:
  - aggregate: physical_turn_motion_score = 0.17580272792821708, turn_motion_score = 0.7473227627575397, wheelchair_command_aligned_yaw_ratio_symmetric = 0.3668012022972107, clean_hold_rate = 0.78125, invalid_contact_rate = 0.21875, bad_orientation_rate = 0.140625, base_height_rate = 0.046875
  - left_turn_command: physical_turn_motion_score = 0.17390093406513527, wheelchair_command_aligned_yaw_ratio_symmetric = 0.22297799587249756, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.05574707816595314, wheelchair_command_aligned_yaw_ratio_symmetric = 0.5403808951377869, clean_hold_rate = 0.5862069129943848, invalid_contact_rate = 0.41379308700561523, bad_orientation_rate = 0.3103448152542114
The right-turn sign stayed positive, but the physical base score collapsed because the branch immediately reintroduced large invalid contact, bad orientation, and base-height failures on the weak side. So the increased opposite-side assist is not the fix either. I discarded M6y from code without spending a train budget on it.
-
The next geometry-only probe changed the dominant right palm grip point itself: Unitree-G1-29dof-Wheelchair-Scratch-M6z-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightGripOutboardRaised-RelaxedHandle. M6z kept the retained M6q strong-soft gains exactly as they were, but moved the right palm grip point farther forward, outward, and upward on the right_rubber_hand body during mixed-turn control. The goal was to reduce the right-turn right_wrist_pitch_link handle strike without weakening right-turn authority the way M6x did. This branch did not produce a usable zero-shot eval result. The rollout never wrote /tmp/m6z_zeroshot_eval.json, and the active Kit log kit_20260526_172942.log reported repeated Hang detected events during the same run instead of completing the deterministic evaluation. I stopped the probe and removed M6z from code. So the current lesson is that this larger right-palm grip relocation is not just unproven; in the present stack it is mechanically unstable enough to hang the zero-shot rollout before metrics are emitted.
-
I then tested a smaller geometry-only right-side scaffold change: Unitree-G1-29dof-Wheelchair-Scratch-M6aa-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightHandleForwardUp-RelaxedHandle. M6aa kept the retained M6q strong-soft gains unchanged and did not move the robot hand grip point. Instead it shifted only the right handle target point slightly forward and upward in right_handle_frame local coordinates: right_wheelchair_body_local_positions = [[0.008, 0.0, 0.012]]. The intent was to keep the successful M6q authority mechanism intact while making the dominant right-turn alignment a little less wrist-pitch-heavy than the original contact geometry. This branch also failed before producing a usable zero-shot eval. The smoke run never wrote /tmp/m6aa_smoke_eval.json, the Isaac log isaaclab_2026-05-26_17-40-20.log stopped at environment/reward-manager initialization, and the live Kit log kit_20260526_174007.log showed: SimulationApp.close: Closing application followed roughly two minutes later by: Hang detected and a declined crash dialog. So M6aa reproduced the same failure class as M6z: a geometry-only right-side handle-target change that is mechanically unstable enough to hang shutdown before deterministic eval metrics are emitted. I removed M6aa from code. The mixed-turn lesson is tighter now: large right-palm relocation and smaller right-handle-target relocation both fail mechanically in the current stack, so the next M6q-family branch should stay away from more right-side geometry rewiring and instead change a different part of the right-turn physical scaffold.
-
I then tested whether the dominant weak-side leak was really just a handle-contact classification issue: Unitree-G1-29dof-Wheelchair-Scratch-M6ab-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-PitchRelaxedHandle. M6ab kept the retained M6q strong-soft scaffold unchanged, but temporarily relaxed handle validity to allow same-side wrist_pitch contact at both handles in addition to the already-allowed rubber_hand and wrist_yaw links. The idea was to see whether the right-turn failure was mostly a bookkeeping problem around right_wrist_pitch_link touching the handle rather than a real base/torso contact problem. Deterministic zero-shot eval from retained M6q model_10150.pt came back at:
  - aggregate: physical_turn_motion_score = 0.19427912372222916, turn_motion_score = 0.7051787175238131, wheelchair_command_aligned_yaw_ratio_symmetric = 0.38512715697288513, clean_hold_rate = 0.8125, invalid_contact_rate = 0.1875, bad_orientation_rate = 0.046875, base_height_rate = 0.0625
  - left_turn_command: physical_turn_motion_score = 0.17755345293315306, wheelchair_command_aligned_yaw_ratio_symmetric = 0.24824529886245728, clean_hold_rate = 0.9142857193946838, invalid_contact_rate = 0.08571428805589676
  - right_turn_command: physical_turn_motion_score = 0.16123353407766905, wheelchair_command_aligned_yaw_ratio_symmetric = 0.5503294467926025, clean_hold_rate = 0.6551724076271057, invalid_contact_rate = 0.3448275923728943, bad_orientation_rate = 0.13793103396892548
The useful detail is in the sensor split. M6ab did suppress the specific handle invalid-contact terms hard: wheelchair_right_handle_invalid_contact dropped to 4182.55810546875 mean force and 0.125 rate, while wheelchair_left_handle_invalid_contact dropped to 595.6119384765625 mean force and 0.09375 rate. But the aggregate branch still lost to retained M6q because the contact just migrated into wheelchair_base_robot_contact, which stayed large at 6263.53466796875 mean force with 0.1875 rate. So M6ab proved the weak side was not primarily misclassified handle contact. It was a real physical collapse into the chair base once the wrist-pitch link stopped being counted against the handles. I discarded M6ab from code without spending a training budget on it.
-
I then tested a narrow reward-side cleanup targeted only at the surviving weak-side sensors: Unitree-G1-29dof-Wheelchair-Scratch-M6ac-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightTurnContactClean-RelaxedHandle. M6ac kept the retained M6q strong-soft attachment scaffold intact and added one extra reward term only during right-turn-command episodes: a filtered invalid-contact penalty on wheelchair_base_robot_contact and wheelchair_right_handle_invalid_contact. I ran a matched 50-iteration model-only continuation from retained M6q model_10150.pt: unitree_rl_lab/logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6ac_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_rightturn_contactclean_relaxedhandle/2026-05-26_17-55-28_mixedturn_rightturncontact_from_m6q10150_modelonly_50it The run produced model_10150.pt and model_10199.pt; deterministic eval of the later checkpoint came back at:
  - aggregate: physical_turn_motion_score = 0.2022933866306722, turn_motion_score = 0.7843994962051511, wheelchair_command_aligned_yaw_ratio_symmetric = 0.38512715697288513, clean_hold_rate = 0.796875, invalid_contact_rate = 0.203125, bad_orientation_rate = 0.078125, base_height_rate = 0.078125
  - left_turn_command: physical_turn_motion_score = 0.17755345293315306, wheelchair_command_aligned_yaw_ratio_symmetric = 0.24824529886245728, clean_hold_rate = 0.9142857193946838, invalid_contact_rate = 0.08571428805589676
  - right_turn_command: physical_turn_motion_score = 0.16123353407766905, wheelchair_command_aligned_yaw_ratio_symmetric = 0.5503294467926025, clean_hold_rate = 0.6551724076271057, invalid_contact_rate = 0.3448275923728943, bad_orientation_rate = 0.13793103396892548
Relative to retained historical bounded M6q model_10150.pt, M6ac did not solve the actual tradeoff. Aggregate physical_turn_motion_score regressed (0.2023 vs 0.2176), aggregate invalid_contact_rate got worse (0.203125 vs 0.171875), and the right-turn half still stayed materially worse on both cleanliness and physical base score. In other words, extra right-turn-only penalty pressure on the already-identified weak-side sensors was not enough to clean the scaffold without also suppressing useful turn behavior. I discarded M6ac from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then tested a narrower physical-scaffold change inside the same M6q mechanism: Unitree-G1-29dof-Wheelchair-Scratch-M6ad-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightForceCapped-RelaxedHandle. M6ad kept the retained M6q command-conditioned strong-soft scaffold and changed only one thing: during right-turn-command episodes, the dominant right-hand soft attachment kept the same stiffness and damping, but its max_force was lowered from 1000.0 to 700.0. The intent was to preserve the useful two-sign authority from M6q while reducing the specific right-turn chair intrusion without touching rewards or geometry. Deterministic zero-shot transfer from retained M6q model_10150.pt came back at:
  - aggregate: physical_turn_motion_score = 0.19002076231086962, turn_motion_score = 0.6953225670315612, wheelchair_command_aligned_yaw_ratio_symmetric = 0.2520219683647156, clean_hold_rate = 0.875, invalid_contact_rate = 0.125, base_height_rate = 0.015625
  - left_turn_command: physical_turn_motion_score = 0.186151182969205, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.1883712823501326, clean_hold_rate = 0.7931034564971924, invalid_contact_rate = 0.20689654350280762
  That zero-shot probe was cleaner than retained M6q, especially on the right-turn half, so I spent one matched bounded continuation budget from the same retained source:
  unitree_rl_lab/logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6ad_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_rightforcecapped_relaxedhandle/2026-05-26_18-21-17_mixedturn_rightforcecapped_from_m6q10150_modelonly_50it
  The saved checkpoints were model_10150.pt and model_10199.pt. Deterministic eval selected model_10150.pt as the better one:
  - aggregate: physical_turn_motion_score = 0.18106631314477534, turn_motion_score = 0.7072188844904302, wheelchair_command_aligned_yaw_ratio_symmetric = 0.27011001110076904, clean_hold_rate = 0.8125, invalid_contact_rate = 0.1875, base_height_rate = 0.03125
  - left_turn_command: physical_turn_motion_score = 0.19664195016651886, clean_hold_rate = 0.8857142925262451, invalid_contact_rate = 0.11428571492433548
  - right_turn_command: physical_turn_motion_score = 0.16070869259009296, clean_hold_rate = 0.7241379022598267, invalid_contact_rate = 0.27586206793785095
  Relative to retained historical bounded M6q model_10150.pt, M6ad did not hold its zero-shot cleanliness advantage once trained. Aggregate physical_turn_motion_score regressed (0.1811 vs 0.2176), aggregate clean_hold_rate regressed (0.8125 vs 0.828125), and aggregate invalid_contact_rate was still slightly worse (0.1875 vs 0.171875). The right-turn half was marginally cleaner than retained M6q (clean_hold_rate 0.7241 vs 0.6897, invalid_contact_rate 0.2759 vs 0.3103), but it gave back physical turn score (0.1607 vs 0.2148). So lowering only the right-turn dominant-side force cap was not a useful lever. I discarded M6ad from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then checked whether the surviving weak-side contact was an underdamped dominant-side problem rather than a force-cap problem: Unitree-G1-29dof-Wheelchair-Scratch-M6ae-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightDamped-RelaxedHandle. M6ae kept the retained M6q strong-soft scaffold and changed only one thing: during right-turn-command episodes, the dominant right-hand soft attachment kept the same stiffness and max force, but increased damping from 300.0 to 500.0. The hypothesis was that the right-turn wrist/base spill might be an overshoot problem that could be reduced without giving back turn authority. That branch failed immediately on deterministic zero-shot transfer from retained M6q model_10150.pt, so I did not spend a train budget on it. The zero-shot eval came back at:
  - aggregate: physical_turn_motion_score = 0.13812577034261808, clean_hold_rate = 0.6875, invalid_contact_rate = 0.3125, bad_orientation_rate = 0.125, time_out_rate = 0.828125
  - left_turn_command: physical_turn_motion_score = 0.17982053125865288, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = -0.07719530538038262, clean_hold_rate = 0.37931033968925476, invalid_contact_rate = 0.6206896305084229, bad_orientation_rate = 0.27586206793785095
  So increasing only the right-turn dominant-side damping was not stabilizing. It actually collapsed the right-turn half back to wrong-sign behavior with much worse contact and posture. I discarded M6ae from code without training.
- I then tested the opposite-side support hypothesis in the narrower direction: Unitree-G1-29dof-Wheelchair-Scratch-M6af-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightLowAssist-RelaxedHandle. M6af kept the retained M6q dominant-side strength intact and changed only the right-turn off-side assist, reducing it from 250/20/80 to 100/10/30 on the left hand during right-turn-command episodes. The idea was that the off-side hand might be dragging the robot torso into the chair base while the dominant right side was trying to turn. Deterministic zero-shot transfer from retained M6q model_10150.pt came back at:
  - aggregate: physical_turn_motion_score = 0.19930134763460872, clean_hold_rate = 0.8125, invalid_contact_rate = 0.1875, bad_orientation_rate = 0.0625, time_out_rate = 0.890625
  - left_turn_command: physical_turn_motion_score = 0.17590923426254665, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.15644299743092036, clean_hold_rate = 0.6551724076271057, invalid_contact_rate = 0.3448275923728943, bad_orientation_rate = 0.13793103396892548
  This kept the right-turn sign physically positive, but it still did not beat retained M6q model_10150.pt on the real tradeoff. Aggregate physical_turn_motion_score regressed (0.1993 vs 0.2176), aggregate clean_hold_rate regressed (0.8125 vs 0.828125), and aggregate invalid_contact_rate worsened (0.1875 vs 0.171875). The right-turn half also stayed worse than retained M6q on both cleanliness and physical turn score. So weakening only the right-turn off-side assist was not a useful lever either. I discarded M6af from code without training.
- I then tested a true physical-control-point branch instead of another spring retune: Unitree-G1-29dof-Wheelchair-Scratch-M6ag-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightWristYaw-RelaxedHandle. M6ag kept the retained M6q strong-soft gains unchanged and changed only the weak-side control scaffold. I measured the existing retained palm grip point relative to right_wrist_yaw_link and used that local offset ([0.09564, 0.00072, 0.00502]) to drive the right handle from right_wrist_yaw_link instead of right_rubber_hand, while also allowing right_wrist_yaw_link as valid right-handle contact and updating the handle-state observations/rewards to use the mixed left-hand / right-wrist control pair. The task compiled and instantiated cleanly. In the Isaac logs, the reset/interval soft-attachment terms resolved right_wrist_yaw_link correctly, and the mixed wheelchair_handle_state observation resolved body names ['left_rubber_hand', 'right_wrist_yaw_link']. But the deterministic zero-shot eval from retained M6q model_10150.pt never produced metrics:
  - eval process stayed alive at full CPU
  - /tmp/wheelchair_bridge_eval_5ruv86xi.txt remained empty
  - Kit logged SimulationApp.close: Closing application without emitting the usual benchmark JSON
  So M6ag was not a scored failure like M6ae or M6af; it was a mechanical hang after the right-side control-point swap. I stopped the eval, removed M6ag from code, and kept retained M6q model_10150.pt as the active mixed-turn checkpoint. The useful lesson is narrow: swapping the weak side from right_rubber_hand to right_wrist_yaw_link is mechanically unstable in the current stack even when the local grip point is measured from the retained geometry.
- I then tried an explicit easier right-turn curriculum rung rather than another contact-geometry retune: Unitree-G1-29dof-Wheelchair-Scratch-M6ah-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightModerate-RelaxedHandle. M6ah kept the retained M6q strong-soft runtime scaffold exactly the same and only reduced the right-turn command range, changing ang_vel_z from symmetric (-0.35, 0.35) to (-0.25, 0.35). The intent was to create a real bridge rung: keep the shared mixed-turn mechanism, make the weak-side demand slightly easier, and then score any saved checkpoints back on the standard symmetric M6q task rather than judging them only on the easier rung. Zero-shot from retained M6q model_10150.pt behaved like a real curriculum rung should:
  - on the easier M6ah task: physical_turn_motion_score = 0.20418132327252228, clean_hold_rate = 0.875, invalid_contact_rate = 0.125
  - but on downstream standard M6q it was still just the retained baseline: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  I then ran a bounded low-noise 50-iteration model-only continuation from retained M6q model_10150.pt:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6ah_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_rightmoderate_relaxedhandle/2026-05-26_18-59-25_mixedturn_rightmoderate_from_m6q10150_modelonly_std005_50it
  I selected checkpoints by downstream M6q physical_turn_motion_score, not same-task score. The better checkpoint on the easier rung was model_10199.pt:
  - M6ah same-task: physical_turn_motion_score = 0.19036498903991295, clean_hold_rate = 0.84375, invalid_contact_rate = 0.15625
  - M6q downstream: physical_turn_motion_score = 0.2007210309635142, clean_hold_rate = 0.796875, invalid_contact_rate = 0.203125
  Even the earlier saved checkpoint model_10150.pt was worse downstream: physical_turn_motion_score = 0.19389275615147705, clean_hold_rate = 0.78125, invalid_contact_rate = 0.21875. So M6ah did answer the curriculum question cleanly: easing only the right-turn command demand makes the branch look better on its own task, but that improvement does not transfer back to the real symmetric mixed-turn objective. I discarded M6ah from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then tested a sign-specific authority branch instead of another mirrored gain tweak: Unitree-G1-29dof-Wheelchair-Scratch-M6ai-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedRightHarder-RelaxedHandle. M6ai kept the retained M6q left-turn scaffold unchanged, but made only the right-turn dominant side a stronger soft approximation of the successful fixed right-hard / left-soft topology. Zero-shot from retained M6q model_10150.pt was directionally promising on the new task:
  - M6ai same-task: physical_turn_motion_score = 0.1957863270006181, clean_hold_rate = 0.890625, invalid_contact_rate = 0.109375
  - the right-turn half became materially cleaner than retained M6q on its own scaffold: clean_hold_rate = 0.8275861740112305, invalid_contact_rate = 0.17241379618644714 versus retained M6q right-turn clean_hold_rate = 0.6896551847457886, invalid_contact_rate = 0.3103448152542114
  Because that looked like a real scaffold improvement, I ran a bounded low-noise 50-iteration model-only continuation from retained M6q model_10150.pt:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6ai_freeyaw_heavydamped_mixedturn_observed_command_conditioned_right_harder_relaxedhandle/2026-05-26_19-16-38_mixedturn_rightharder_from_m6q10150_modelonly_std005_50it
  I selected checkpoints by downstream standard M6q physical_turn_motion_score, not by the easier same-task result. The best downstream checkpoint was model_10150.pt, and it still lost to retained M6q:
  - downstream M6q from M6ai model_10150.pt: physical_turn_motion_score = 0.19398082470007244, clean_hold_rate = 0.828125, invalid_contact_rate = 0.140625
  - retained M6q model_10150.pt: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  The later saved checkpoint model_10199.pt regressed harder downstream: physical_turn_motion_score = 0.18075072226731204, clean_hold_rate = 0.734375, invalid_contact_rate = 0.265625. So M6ai confirmed a narrower lesson: making the right-turn dominant side harder does clean the branch locally, but it still does not transfer into a better shared mixed-turn controller on the real symmetric M6q task. I discarded M6ai from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then tested the cleaner M6ai branch as an explicit curriculum source instead of judging it only as an alternate scaffold. The idea was simple: start from the same-task-cleaner M6ai model_10150.pt and continue directly on the standard symmetric M6q task, then score checkpoints on standard M6q rather than on the easier M6ai rung. The bounded low-noise 50-iteration model-only continuation was:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6q_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_relaxedhandle/2026-05-26_19-27-56_mixedturn_m6q_from_m6ai10150_modelonly_std005_50it
  It produced two saved checkpoints:
  - model_10150.pt: physical_turn_motion_score = 0.18110951605476558, clean_hold_rate = 0.75, invalid_contact_rate = 0.25
  - model_10199.pt: physical_turn_motion_score = 0.20292924602862325, clean_hold_rate = 0.796875, invalid_contact_rate = 0.203125
  The better of those, model_10199.pt, still lost to retained historical M6q model_10150.pt:
  - transferred M6q from M6ai model_10199.pt: physical_turn_motion_score = 0.20292924602862325, clean_hold_rate = 0.796875, invalid_contact_rate = 0.203125
  - retained M6q model_10150.pt: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  The sign split stayed consistent with the aggregate loss: the transferred run helped neither side enough to offset the drop in cleanliness, and the right-turn half still lagged the retained branch on both authority and contact. So the curriculum hypothesis did not hold up. A cleaner M6ai rung does not transfer back into a better standard M6q mixed-turn controller, and I discarded this M6ai -> M6q transfer run.
- I then tested whether the source policy itself was the blocker by probing standard symmetric M6q directly from the discarded mixed right-strong branch: M6o model_10098.pt. This did not justify a continuation budget. Deterministic zero-shot eval of M6o model_10098.pt on the standard M6q task came back at: physical_turn_motion_score = 0.18877694330328737, clean_hold_rate = 0.796875, invalid_contact_rate = 0.203125. The sign split was still weak on the right-turn half:
  - left_turn_command: physical_turn_motion_score = 0.1795978052718013, clean_hold_rate = 0.9428571462631226, invalid_contact_rate = 0.05714285746216774
  - right_turn_command: physical_turn_motion_score = 0.11044519296102995, clean_hold_rate = 0.6206896305084229, invalid_contact_rate = 0.37931033968925476
  That is worse than retained M6q model_10150.pt on both the aggregate gate and the weak-side turn half, so I discarded the M6o -> M6q source-policy hypothesis without training.
- I then tested the cleanest discarded M6u checkpoint as a curriculum source: M6u model_10196.pt. Zero-shot transfer onto standard M6q was the closest source-policy alternative so far: physical_turn_motion_score = 0.19345759608092344, clean_hold_rate = 0.828125, invalid_contact_rate = 0.140625. That matched retained M6q on clean_hold_rate and improved aggregate invalid contact, so I spent one bounded low-noise 50-iteration model-only continuation budget:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6q_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_relaxedhandle/2026-05-26_19-43-41_mixedturn_m6q_from_m6u10196_modelonly_std005_50it
  The checkpoint selector kept the early saved checkpoint model_10200.pt; the later model_10245.pt regressed harder. The selected downstream result was:
  - transferred M6q model_10200.pt: physical_turn_motion_score = 0.19439021422986044, clean_hold_rate = 0.828125, invalid_contact_rate = 0.15625
  - retained M6q model_10150.pt: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  The weak-side split stayed consistent with the aggregate loss:
  - transferred right_turn_command: physical_turn_motion_score = 0.14930035061665134, clean_hold_rate = 0.6206896305084229, invalid_contact_rate = 0.3448275923728943
  - retained right_turn_command: physical_turn_motion_score = 0.21480752041858034, clean_hold_rate = 0.6896551847457886, invalid_contact_rate = 0.3103448152542114
  So M6u confirmed the same lesson from a different angle: a cleaner source policy can reduce aggregate contact, but it still does not recover the missing right-turn authority on the real shared M6q scaffold. I discarded the M6u -> M6q transfer run and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then tested whether a one-sided curriculum rung would help the retained shared scaffold recover right-turn authority without changing the mixed-turn task itself. The branch was: Unitree-G1-29dof-Wheelchair-Scratch-M6aj-FreeYawHeavyDampedRightTurnOnly-Observed-CommandConditionedStrongSoft-RelaxedHandle. M6aj kept the retained M6q strong-soft scaffold unchanged and only fixed the yaw command to the right-turn value -0.35 on the same heavy-damped chair dynamics. I ran a bounded low-noise 50-iteration model-only continuation from retained M6q model_10150.pt and scored every saved checkpoint back on the real symmetric mixed-turn M6q task:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6aj_freeyaw_heavydamped_rightturnonly_observed_command_conditioned_strong_soft_relaxedhandle/2026-05-26_19-54-49_rightturnonly_m6aj_from_m6q10150_modelonly_std005_50it
  The branch was bad even on its own task. The selected same-task M6aj model_10199.pt only reached: physical_turn_motion_score = 0.03729120536368411, clean_hold_rate = 0.5625, invalid_contact_rate = 0.4375, with the dominant contact leaking into both wheelchair_base_robot_contact and wheelchair_right_handle_invalid_contact. Downstream transfer back onto symmetric M6q was cleaner than the retained checkpoint but still weaker on the primary gate:
  - transferred M6q model_10199.pt: physical_turn_motion_score = 0.21264488661933387, clean_hold_rate = 0.859375, invalid_contact_rate = 0.140625
  - retained M6q model_10150.pt: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  The split confirms the same tradeoff as the earlier source-policy probes. The transferred branch bought back some cleanliness, especially on the right-turn half (invalid_contact_rate = 0.27586206793785095 versus retained 0.3103448152542114), but it still gave back too much turn authority to beat retained M6q on the real objective. So I discarded M6aj from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
- I then tested a narrower reward-only branch instead of another topology change: Unitree-G1-29dof-Wheelchair-Scratch-M6ak-FreeYawHeavyDampedMixedTurn-Observed-CommandConditionedStrongSoft-RightTurnRightGeometry-RelaxedHandle. M6ak kept the retained M6q strong-soft attachment scaffold and heavy-damped chair dynamics unchanged, and only added extra right-hand handle-position and axis-alignment shaping during right-turn-command episodes. The intent was to help the weak right-turn half without perturbing the already-good left-turn half. I ran a bounded low-noise 50-iteration model-only continuation from retained M6q model_10150.pt:
  logs/rsl_rl/unitree_g1_29dof_wheelchair_scratch_m6ak_freeyaw_heavydamped_mixedturn_observed_command_conditioned_strong_soft_rightturn_rightgeom_relaxedhandle/2026-05-26_20-11-19_mixedturn_rightturnrightgeom_from_m6q10150_modelonly_std005_50it
  The checkpoint selector still chose the earliest saved checkpoint, model_10150.pt; the later model_10199.pt regressed both same-task and downstream. On the real downstream gate, the selected result was:
  - downstream M6q from M6ak model_10150.pt: physical_turn_motion_score = 0.20529377062368684, clean_hold_rate = 0.8125, invalid_contact_rate = 0.1875
  - retained M6q model_10150.pt: physical_turn_motion_score = 0.21757397106793339, clean_hold_rate = 0.828125, invalid_contact_rate = 0.171875
  The weak-side split made the failure mode explicit. The added right-turn geometry reward did not buy back right-turn authority; it reduced it:
  - downstream right_turn_command from M6ak model_10150.pt: physical_turn_motion_score = 0.15259989048398542, clean_hold_rate = 0.6206896305084229, invalid_contact_rate = 0.37931033968925476
  - retained right_turn_command from M6q model_10150.pt: physical_turn_motion_score = 0.21480752041858034, clean_hold_rate = 0.6896551847457886, invalid_contact_rate = 0.3103448152542114
  So M6ak answered the question cleanly: extra right-turn-only hand geometry shaping on top of retained M6q does not fix the weak side. It makes the branch more constrained, but not more effective. I discarded M6ak from code and kept retained M6q model_10150.pt as the active mixed-turn checkpoint.
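The keep-or-discard pattern that recurs through the entries above (select a branch's checkpoint by its downstream M6q score, then compare it with the retained checkpoint) can be sketched as follows. This is a minimal illustration, not the project's actual selector: the EvalResult container and both function names are hypothetical, and the rule that a candidate must improve the primary gate without regressing either cleanliness metric is an assumption read off the decisions recorded in this log. The numbers are the M6ah downstream results and the retained M6q baseline quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Aggregate deterministic-eval metrics for one checkpoint (hypothetical container)."""
    name: str
    physical_turn_motion_score: float
    clean_hold_rate: float
    invalid_contact_rate: float


def select_downstream_best(candidates):
    """Pick the checkpoint with the highest downstream physical_turn_motion_score."""
    return max(candidates, key=lambda r: r.physical_turn_motion_score)


def beats_retained(candidate, retained):
    """Assumed rule: the primary gate must improve and cleanliness must not regress."""
    return (
        candidate.physical_turn_motion_score > retained.physical_turn_motion_score
        and candidate.clean_hold_rate >= retained.clean_hold_rate
        and candidate.invalid_contact_rate <= retained.invalid_contact_rate
    )


# Retained M6q baseline and the two M6ah checkpoints scored downstream on M6q.
retained = EvalResult("M6q model_10150.pt", 0.21757397106793339, 0.828125, 0.171875)
m6ah_downstream = [
    EvalResult("M6ah model_10150.pt", 0.19389275615147705, 0.78125, 0.21875),
    EvalResult("M6ah model_10199.pt", 0.2007210309635142, 0.796875, 0.203125),
]

best = select_downstream_best(m6ah_downstream)
print(best.name, "beats retained:", beats_retained(best, retained))
# prints: M6ah model_10199.pt beats retained: False
```

Under the same assumed rule, the M6aj transfer (0.2126 on the primary gate with cleaner contact) is also rejected, which matches the decision recorded for that branch.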
Autoresearch Harness¶
The first codex-autoresearch loop targeted M0, not the later motion phases.
The original task was:
Unitree-G1-29dof-Wheelchair-Scratch-M0-CollidableStand-DirectObs
The retained M0 task is now:
Unitree-G1-29dof-Wheelchair-Scratch-M0-CollidableStand-Observed
Mechanical verify command:
conda run --no-capture-output -n isaaclab python scripts/autoresearch/benchmark_wheelchair_m0.py --metric m0_score
That command lives in unitree_rl_lab and does two things: it runs a short bounded training continuation on the requested task, then it evaluates the resulting checkpoint with a deterministic rollout. The primary score is m0_score, but the evaluator also records clean_hold_rate, invalid_contact_rate, bad_orientation_rate, and base_height_rate. Phase advancement should not be decided from m0_score alone; it is only the dense optimization signal for the loop.
The default verifier intentionally omits the per-handle invalid-contact filter breakdown. That breakdown is still available as a diagnosis-only path in the evaluator, but it is not part of the unattended loop because it is materially heavier than the aggregate metric path.
For the unattended background loop, use the metrics-only JSON variant instead so the runtime can keep m0_score as the primary metric while also enforcing acceptance gates on clean_hold_rate and invalid_contact_rate:
conda run --no-capture-output -n isaaclab python scripts/autoresearch/benchmark_wheelchair_m0.py --metrics-json-only
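A minimal sketch of how a loop runtime might apply those acceptance gates to the metrics-only output. It assumes the command emits a flat JSON object keyed by metric name, which this page does not confirm, and both thresholds are hypothetical placeholders because the page names the gated metrics but not their limits. The sample payload is illustrative, not a measured result.

```python
import json

# Hypothetical limits; the page names the gated metrics but not their values.
MIN_CLEAN_HOLD_RATE = 0.9
MAX_INVALID_CONTACT_RATE = 0.05


def evaluate_m0_metrics(raw_json,
                        min_clean=MIN_CLEAN_HOLD_RATE,
                        max_invalid=MAX_INVALID_CONTACT_RATE):
    """Return (m0_score, accepted) from a metrics JSON string.

    Assumes a flat JSON object keyed by metric name; the real layout may differ.
    """
    metrics = json.loads(raw_json)
    accepted = (
        metrics["clean_hold_rate"] >= min_clean
        and metrics["invalid_contact_rate"] <= max_invalid
    )
    return metrics["m0_score"], accepted


# Illustrative payload only, not output from a real run.
sample = '{"m0_score": 0.81, "clean_hold_rate": 0.95, "invalid_contact_rate": 0.02}'
score, accepted = evaluate_m0_metrics(sample)
print(score, accepted)
```

Keeping the gates separate from the primary score means a run with a high m0_score can still be rejected when it holds the chair uncleanly.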
For later bridge stages such as M1f, M1g, and M2a, the single-stage M0 wrapper is not sufficient. Use scripts/autoresearch/benchmark_wheelchair_bridge.py instead so the loop can score both same-stage hold quality and downstream transfer into the next damping rung, with downstream m0_score used as the primary metric.
For motion-stage work such as M3, the same bridge harness should use --primary-metric-key forward_motion_score. That metric is now directional and should be treated as the gate: backward rail motion or near-zero chair motion is not a pass even if the rollout survives cleanly.
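The directional gate can be illustrated with a small check. This is a sketch under stated assumptions: it assumes forward_motion_score is positive for forward chair motion and negative for backward motion, and the min_forward_score threshold is a hypothetical value rather than one taken from the harness.

```python
def passes_forward_gate(forward_motion_score, min_forward_score=0.1):
    """Directional gate sketch.

    Assumes positive scores mean forward chair motion and negative scores mean
    backward motion; min_forward_score is a hypothetical threshold.
    """
    return forward_motion_score >= min_forward_score


# Illustrative inputs only, not measured results.
for label, value in [("backward", -0.4), ("near-zero", 0.02), ("forward", 0.6)]:
    print(label, passes_forward_gate(value))
```

Clean survival is deliberately not an input here: a rollout that survives without moving the chair forward still fails the gate.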