Learning Rotations Online

One fact per step, three ways to use it — and a factor of two for respecting the geometry.

An unknown rotation $A \in SO(d)$ reveals itself one pair at a time: a unit vector $x$ goes in, $y = Ax$ comes out. An estimate $W$, starting at the identity, must absorb each pair as it arrives. Every update obeys the same instruction — make $W$ send $x$ to $y$ — but how much of the rotation structure it uses sets how fast $W$ converges.

Animation: three estimators learning a rotation from streamed pairs — One data stream, three updates (d = 3): each panel shows the estimate applied to a reference object, target pose ghosted in gray. Gradient descent (left) leaves the rotation group and the object deforms; projection (middle) keeps it rigid; the geodesic update (right) rotates straight onto the target and lands first.

Three ways to insert information

1 · Gradient descent — few assumptions

$$W \leftarrow W + (y - Wx)\,x^\top$$

The smallest change to $W$ that makes $Wx = y$ — a Kaczmarz step. It works for any matrix and knows nothing about rotations; $W$ drifts off $SO(d)$. Each step projects the error away from one random direction.

squared error × $1 - 1/d$ per step

2 · Gradient descent + projection

$$W \leftarrow \mathrm{polar}\left(W + (y - Wx)\,x^\top\right)$$

The same step, snapped back to the nearest rotation. The projection keeps only the rotational part of the correction — in closed form, a rotation through exactly half the angle between $Wx$ and $y$.

squared error × $1 - 3/(2d)$ per step

3 · A native update for rotations

$$W \leftarrow RW, \qquad R = I + K + \frac{K^2}{1+c}$$ $$K = yu^\top - uy^\top, \qquad u = Wx, \qquad c = u \cdot y$$

$R$ is the geodesic rotation carrying $Wx$ exactly onto $y$: the full angle, never leaving $SO(d)$. It corrects the error on both sides at once, doubling the contraction. Full derivation →

squared error × $1 - 2/d$ per step

Result

The three contractions sit in ratio 1 : 1.5 : 2 —
using the constraint doubles the rate; projecting after the fact recovers half the gain.

Convergence of the three updates at d = 128 against per-step contraction theory — 256 trials at d = 128, distance to the target on a log scale. Gradient descent rides its theory line $(1-1/d)^{t/2}$ exactly; the constrained methods run parallel to theirs after an early transient.

Implementation

Each update is a correction of rank at most two, so a step costs $O(d^2)$ — no SVD anywhere: the projection in update 2 uses its half-angle closed form, and the geodesic rotation collapses to two outer products. The figure reproduces in about two seconds on a laptop:

git clone https://github.com/yaroslavvb/learning-rotations
python large_d_collapse.py     # simulate + plot
python benchmark_d128.py       # equivalence vs SVD reference + timings