Implementing SLAM for a Differential-Drive Robot Using a Camera and Wheel Encoders (TurtleBot3 Example)
- Karan Bhakuni
Ever watched a small robot roll into a room and thought, how does it know where it is? That's the heart of SLAM (Simultaneous Localization and Mapping). We estimate our robot's pose while also building a map of the space. The trick is that both goals depend on each other, so we solve them together.
In this post, we'll focus on a differential-drive robot with wheel encoder odometry plus a monocular camera. We'll keep the math honest, but we won't drown in symbols. TurtleBot3 is our running example because it's widely used in the US, ROS2-friendly, and easy to test in Gazebo before risking real hardware. If you need a quick reminder of the standard package layout and conventions, the TurtleBot3 ROS 2 documentation is a solid reference point.
The math we actually need: motion model, camera measurements, and uncertainty

Before we fuse anything, we need to agree on "where measurements live." In ROS2, we usually talk about frames like map, odom, base_link, and camera_link. In plain words: base_link is the robot body, camera_link is the camera, odom is our smooth but drifting local frame from wheel motion, and map is the corrected global frame after SLAM and loop closure.
SLAM is mostly an uncertainty bookkeeping problem. Every sensor number comes with noise. So we carry not only a pose, but also "how sure we are" about it. That's why probabilistic filters show up everywhere.
Differential-drive wheel odometry, turning encoder ticks into (x, y, yaw)
Encoders give ticks for left and right wheels. We convert ticks into wheel travel distance. Let:
r = wheel radius
N = ticks per wheel revolution
ΔticksL, ΔticksR = tick increments in one timestep
b = track width (distance between wheel contact points)
Wheel travel distances become:
dL = (2πr) (ΔticksL / N)
dR = (2πr) (ΔticksR / N)
From those, differential-drive kinematics give a body motion:
d = (dR + dL) / 2
dθ = (dR - dL) / b
If our pose is (x, y, θ), the midpoint update is:
x' = x + d cos(θ + dθ/2)
y' = y + d sin(θ + dθ/2)
θ' = θ + dθ
That's the whole "predict" step in one breath. It feels clean, but it drifts. Slip, unequal wheel radii, wrong b, and rough flooring all push the pose away from reality. Turning in place is especially unforgiving because tiny errors in dR and dL become big errors in dθ.
Quick check: if we spin on carpet and the robot "walks" sideways in the map, it's usually wheel slip or a miscalibrated track width b.
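The tick-to-pose pipeline above fits in a few lines of Python. This is a minimal sketch; the wheel radius, tick count, and track width are assumed example values, so calibrate them against your own robot before trusting the output.

```python
import math

# Assumed example parameters; calibrate these on your robot.
WHEEL_RADIUS = 0.033   # r, meters
TICKS_PER_REV = 4096   # N, encoder ticks per wheel revolution
TRACK_WIDTH = 0.160    # b, meters between wheel contact points

def odom_update(pose, dticks_l, dticks_r):
    """Midpoint odometry update: pose is (x, y, theta) in the odom frame."""
    x, y, theta = pose
    # Tick increments -> wheel travel distances
    d_l = 2.0 * math.pi * WHEEL_RADIUS * dticks_l / TICKS_PER_REV
    d_r = 2.0 * math.pi * WHEEL_RADIUS * dticks_r / TICKS_PER_REV
    # Differential-drive kinematics: body travel and heading change
    d = (d_r + d_l) / 2.0
    dtheta = (d_r - d_l) / TRACK_WIDTH
    # Midpoint integration of the pose
    x += d * math.cos(theta + dtheta / 2.0)
    y += d * math.sin(theta + dtheta / 2.0)
    new_theta = theta + dtheta
    # Wrap heading to (-pi, pi]
    new_theta = math.atan2(math.sin(new_theta), math.cos(new_theta))
    return (x, y, new_theta)
```

Feeding equal tick increments on both wheels should move the pose straight ahead; opposite increments should rotate it in place, which is a quick sanity test before wiring this into a ROS2 node.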
What the camera gives us, features, matches, and a visual motion estimate
A monocular camera doesn't measure depth directly, but it's great at tracking how the scene moves across frames. Visual odometry (VO) starts by detecting repeatable points (features), matching them between frames, and estimating the camera's motion that best explains those matches.
In practice, we'll see ORB features, optical flow tracking, and full visual SLAM pipelines inspired by ORB-SLAM3. The math underneath is a "projection story": a 3D point in the world projects to pixel coordinates through camera intrinsics. Our observation model is basically:
z_k = h(x_k, landmark) + noise
Here, z_k is an image measurement (like a feature pixel location), x_k is our pose, and h(·) is the camera projection. The "noise" grows when frames blur, lighting shifts, textures repeat, or people walk through view.
The big catch with monocular VO is scale ambiguity. The camera can tell direction and rotation well, but the translation scale can float. That's where encoders help. Wheel odometry provides a real-world distance prior, so the camera motion stops "breathing" in and out.
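Both ideas, the projection model h(·) and the encoder scale prior, are small enough to sketch directly. The intrinsics below are assumed example values; in practice they come from your /camera_info calibration.

```python
import numpy as np

# Assumed example intrinsics (fx, fy, cx, cy); use your /camera_info values.
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(point_cam):
    """h(.): project a 3D point in the camera frame to pixel coordinates."""
    X, Y, Z = point_cam
    assert Z > 0, "point must be in front of the camera"
    u = K[0, 0] * X / Z + K[0, 2]
    v = K[1, 1] * Y / Z + K[1, 2]
    return np.array([u, v])

def rescale_vo(t_vo, encoder_distance):
    """Fix monocular scale ambiguity: stretch the unit-scale VO translation
    so its length matches the wheel-odometry distance for the same interval."""
    s = encoder_distance / np.linalg.norm(t_vo)
    return s * t_vo
```

A point on the optical axis projects to the principal point (cx, cy), and after rescaling, the VO translation has exactly the length the encoders reported, which is the "distance prior" mentioned above in its simplest form.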
Fusing encoder motion and camera updates with EKF-SLAM, the two-step recipe
Extended Kalman Filter SLAM (EKF-SLAM) explains fusion clearly, even if many modern systems use pose graphs later. We track a state x and a covariance P. For the simplest case, x contains robot pose (x, y, θ). In landmark-based EKF-SLAM, x also includes landmark positions.
Predict with wheel odometry:
x_k|k-1 = f(x_k-1, u_k)
P_k|k-1 = F P_k-1 Fᵀ + Q
u_k is our wheel input (d, dθ). F is the Jacobian of f, which linearizes motion around the current estimate. Q is motion noise, meaning how much we distrust encoders for that step.
Update with the camera measurement:
innovation y = z_k - h(x_k|k-1)
S = H P Hᵀ + R
K = P Hᵀ S⁻¹
x_k = x_k|k-1 + K y
P_k = (I - K H) P
H is the Jacobian of the camera measurement model. R is measurement noise, meaning how much we distrust the camera for that update.
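The two-step recipe can be sketched for the simplest useful case: a pose-only state (x, y, θ), wheel odometry in the predict step, and the camera treated as a VO source that reports a pose measurement, so h(x) = x and H = I. This is a simplification of the landmark-based form above; with pixel measurements, h would be the camera projection and H its Jacobian.

```python
import numpy as np

def ekf_predict(x, P, u, Q):
    """Predict with wheel odometry u = (d, dtheta); state x = (x, y, theta)."""
    d, dth = u
    th = x[2]
    x_pred = x + np.array([d * np.cos(th + dth / 2.0),
                           d * np.sin(th + dth / 2.0),
                           dth])
    # F: Jacobian of the motion model with respect to the state
    F = np.array([[1.0, 0.0, -d * np.sin(th + dth / 2.0)],
                  [0.0, 1.0,  d * np.cos(th + dth / 2.0)],
                  [0.0, 0.0,  1.0]])
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def ekf_update(x, P, z, R):
    """Update with a VO pose measurement z = (x, y, theta): h(x) = x, H = I."""
    H = np.eye(3)
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new
```

Note how the gain K does the "who do we trust" arithmetic: with a large P and a small R, the update snaps toward the camera measurement, and the covariance shrinks accordingly.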
If you want a concrete ROS2 reference implementation on TurtleBot3, the turtlebot3-ekf-slam example project is a helpful "read the source" companion.
EKF fusion reduces drift, but loop closure is the real drift killer. When we recognize a previously seen place, we add a constraint that says, "These two moments happened at the same location." Then the system adjusts past poses to make the full story consistent.
Picking a SLAM approach for camera plus encoders, what works in real robots
As of March 2026, ROS 2 Humble is still a common baseline for TurtleBot3 in labs and classrooms. For camera plus encoders, we usually end up on one of two paths.
First, we can run a visual SLAM system where the camera drives the pose estimate, and encoders stabilize motion and scale. That's close to the ORB-SLAM3 family of pipelines. Second, we can treat the camera as a visual odometry source, fuse it with wheel odometry (often via robot_localization), then hand that improved odom into a mapping back-end.
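For the second path, a robot_localization EKF config makes the fusion concrete. This is a minimal sketch, not a tuned setup: the VO topic name /vo/pose is an assumption, and the boolean config vectors (which follow robot_localization's x, y, z, roll, pitch, yaw, then velocities, then accelerations order) should match what your sensors actually report.

```yaml
ekf_filter_node:
  ros__parameters:
    frequency: 30.0
    two_d_mode: true
    odom_frame: odom
    base_link_frame: base_link
    world_frame: odom
    # Wheel odometry: trust the planar velocities and yaw rate
    odom0: /odom
    odom0_config: [false, false, false,
                   false, false, false,
                   true,  true,  false,
                   false, false, true,
                   false, false, false]
    # Visual odometry pose (topic name is an assumption; match your VO node)
    pose0: /vo/pose
    pose0_config: [true,  true,  false,
                   false, false, true,
                   false, false, false,
                   false, false, false,
                   false, false, false]
```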
If we're looking for ROS2 integration patterns around ORB-SLAM3, the orb_slam3_ros2 wrapper repository gives a realistic idea of what topics and dependencies look like.
When EKF-SLAM is enough, and when we should switch to pose graphs
EKF-SLAM shines when we're learning, when the map is small, or when we track a modest number of landmarks. However, EKF-SLAM can get heavy as landmarks grow, and linearization errors can cause instability.
Pose graphs scale better. Instead of keeping every landmark in one giant filter, we store poses as nodes and constraints as edges. Loop closure becomes clean: when we revisit a place, we add an edge between "now" and "then," then the optimizer spreads the correction across the whole trajectory.
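The "spread the correction" idea is easiest to see in one dimension. The toy graph below has four poses connected by drifting odometry edges, plus one loop-closure edge saying pose 3 is the same place as pose 0; solving the least-squares problem distributes the accumulated 0.1 m of drift over all edges. (Real pose-graph back-ends solve a nonlinear version of this with 2D/3D poses and information-weighted edges.)

```python
import numpy as np

# Toy 1-D pose graph. Odometry edges (i, j, measured displacement p_j - p_i),
# with a little drift baked in so the loop does not quite close:
odom_edges = [(0, 1, 1.05), (1, 2, 1.05), (2, 3, -2.0)]
# Loop closure: pose 3 is recognized as the same place as pose 0.
loop_edges = [(0, 3, 0.0)]

def optimize(n_poses, edges):
    """Linear least squares over 1-D pose values, anchoring p0 = 0."""
    rows, rhs = [], []
    anchor = np.zeros(n_poses)
    anchor[0] = 1.0
    rows.append(anchor); rhs.append(0.0)          # prior: p0 = 0
    for i, j, meas in edges:
        row = np.zeros(n_poses)
        row[j], row[i] = 1.0, -1.0                # residual: (p_j - p_i) - meas
        rows.append(row); rhs.append(meas)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol

poses_no_loop = optimize(4, odom_edges)            # drift stays at the end
poses_with_loop = optimize(4, odom_edges + loop_edges)
```

Without the loop edge, pose 3 ends up 0.1 away from the start; with it, the optimizer pulls pose 3 most of the way back and nudges every intermediate pose slightly, exactly the behavior described above.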
Common failure cases and the math knobs that fix them
Most "SLAM is broken" moments are just mismatched noise settings or bad data.
Wheel slip during turns: increase Q for rotation, and calibrate b carefully.
Blurry frames or low light: increase R for camera updates, and slow the robot down.
Repeating textures (hallways, tiles): tighten outlier rejection, and require better feature geometry.
Moving people: mask dynamic regions if possible, or reduce trust in those measurements.
Scale jumps in monocular VO: enforce stronger encoder priors, or add an IMU later.
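The first two knobs above are just functions of the current step. The sketch below builds a per-step Q that grows with distance and (more steeply) with rotation, and an R that inflates when frame quality drops; the gain constants and the blur score are assumptions to tune, not recommended values.

```python
import numpy as np

def motion_noise(d, dtheta, k_d=0.05, k_th=0.3):
    """Per-step motion noise Q for (x, y, theta): grows with distance and
    grows faster with rotation, since turns are where slip bites hardest.
    Gains k_d and k_th are assumed starting points; tune them per robot."""
    var_xy = (k_d * abs(d)) ** 2 + (0.1 * k_th * abs(dtheta)) ** 2
    var_th = (k_th * abs(dtheta)) ** 2 + (0.05 * k_d * abs(d)) ** 2
    return np.diag([var_xy, var_xy, var_th])

def camera_noise(base_std, sharpness):
    """Inflate measurement noise R when frames degrade. sharpness in (0, 1],
    where 1 means a crisp, well-lit frame (how you score it is up to you)."""
    std = base_std / max(sharpness, 0.1)
    return np.diag([std ** 2] * 3)
```

Plugged into the EKF recipe, a turning step yields a larger heading variance in Q (so the filter distrusts encoders during the turn), and a blurry frame yields a larger R (so the camera update barely moves the estimate).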
Implementing it on TurtleBot3, a clear ROS2 plan from sensors to a live map
Many TurtleBot3 SLAM tutorials are LiDAR-first, but the plumbing ideas still transfer. Our goal is simple: publish correct odometry, publish camera images with calibration, keep the TF tree consistent, then let a SLAM or fusion node do the math.
Data plumbing in ROS2, topics, transforms, and time sync we must get right
We need a minimum checklist that we can verify in RViz2:
Encoders should produce /odom as nav_msgs/Odometry, with reasonable twist and pose increments.
The TF tree must include map -> odom -> base_link -> camera_link.
Camera drivers should publish /image and /camera_info with correct intrinsics.
Time matters as much as frames. In Gazebo, we set use_sim_time so all nodes agree on the clock. On hardware, we keep timestamps consistent and avoid large delays in the camera pipeline. If TF or time slips, the math can be perfect and the map will still smear.
A simple build path, start in Gazebo, then move to the real TurtleBot3
We can keep this process calm and repeatable.
Start in Gazebo with the TurtleBot3 model set, then verify /odom and camera topics. Next, run a visual odometry or visual SLAM node and confirm we get a pose estimate that moves when the robot moves. After that, fuse wheel and camera motion, either inside the SLAM system or with a dedicated EKF. Then we visualize the trajectory and frames in RViz2 and drive a loop to test loop closure behavior.
On real hardware, we repeat the same steps but move slower and improve lighting. Good calibration pays off fast. If we also run a LiDAR-based baseline like SLAM Toolbox or Cartographer, we can compare trajectories and spot camera failure modes sooner.
Conclusion
SLAM for a differential-drive robot with a camera and wheel encoders comes down to a loop we can reason about: we predict motion with encoders, we correct with camera measurements, and we track uncertainty using Q and R. Once we add loop closure, drift stops being permanent and becomes something we can pull back into place.
Our best next step is practical: calibrate wheel radius and track width, verify TF frames, then tune noise values while watching the trajectory and covariance in RViz2. Test in simulation first, then bring the same setup onto the real TurtleBot3 with steady speeds and stable lighting.