The Odyssey: Building an Autonomous “Follow-Me” Rover
TL;DR
Our experiences, successes and failures trying to build an autonomous small-scale “follow-me” vehicle. If you are looking for the code, scroll to the Conclusion section.
Why Am I Doing This
We were in Japan, cherry blossoms were still in bloom. There were so many things to do and see but our 5-year old was a bit preoccupied with a “robot dog.” Not one we had seen in Akihabara Electric Town. No, it was an autonomous buggy she saw following a jogger in San Francisco the day before we left. She wanted one and wouldn’t let it go.
Perhaps this would be a fun project that we could do together? Me: “maybe we’ll try to build our own robot when we get back home.” I knew that I would do most of the work. I wasn’t sure exactly what we were going to do at the time.
Initial Approach
The first steps were easy. I had built model RC cars growing up. I'm somewhat of an engineer. So check. I can solder. Check.
The real challenge: making a somewhat complicated project fun while remaining within the attention span of a 5-year old (who can be very results-oriented and easily distracted).
My top quick-and-dirty options at this time were:
- GPS to GPS: one on the car, one in hand
- Sonar: transmitter in hand, receivers on the car
- Infrared (IR): like sonar but with IR
- Radio Frequency Direction-Finding
Some of these options seemed straightforward. After some searching, it seemed that someone had already worked on the sonar approach:
First thought: great, this project is going to be easy! I found another example that appeared to be ready to go:
Fabulous!
Incidentally, I had always wanted an excuse to use a Raspberry Pi (RPi) on a real project. Most examples I found were Arduino-based. But I had also found an alternative approach (below) that used the RPi with a NAVIO2 “hat”, which provided integrated GPS and IMU along with the pin-out, PWM, and SBUS interfaces we needed on a servo rail:
The open-source Python framework is called “burro”. The author adapted it from the Donkey Car project, where the goal is a sort of rover AI race/competition. We were not training our car to follow lines or run around a track, but I suspected that this framework might come in handy.
Chassis, Power and Sensors
The car itself was assembled in about a week. My daughter helped build the center and rear differentials, the front steering assembly and several shocks.
She got a taste of the small-scale vehicular mechanics and I wanted her to enjoy the physical process of assembling the rover. The software work that followed would be a bit beyond a 5-year old.
We found that many of these rover projects begin by making an “upper deck” to hold additional electronics. However, we weren’t so interested in the 3D-printing approach (more delay for a kid).
With a bit of plexiglass we constructed our own upper deck and added a cheap RPi case that could hold an external ALFA wifi adapter. The car would host its own network access point. Later we used the same material to mount sonar to the anti-sway bar attachment points on the front shock tower.
We chose a brushless motor with higher turns (effective 21.5T) for lower speed and higher torque: this rover needed to move at small-human speed with good low-end dynamic range. Our electronic speed control (ESC), unfortunately, provided a 6V BEC output, 1V too high for the NAVIO2 servo rail’s acceptable range.
To solve this, I bought and soldered in an external 5V BEC for the NAVIO2 and bypassed power to the servo rail. This worked for quite some time until it didn’t: a random collision caused a short circuit that fried both the ESC and the motor’s onboard electronics. I ripped everything out and replaced it with an ESC that provided exactly the 5V BEC output required. This new ESC also offered more control over top-end speed, torque, throttle punch, braking, and power curves. Super!
One note about power: high-torque steering servos can cause sudden amperage spikes that draw power away from the RPi. Basically, a sudden steering change can cause the RPi to lose power and reboot. To avoid this, I added a second low-profile battery to keep the power sources isolated.
GPS: Our First Failure
We pursued the GPS idea first. GPS was already available on the NAVIO2 hat. When it was time to test the GPS system, I started up the NAVIO2, plugged in the external antenna, and it acquired a dozen satellites. I plotted the incoming tracks on Leaflet (a JavaScript map library). But the tracks were all over the place! As in sometimes close and sometimes across the street.
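If you want to eyeball the jitter yourself, dumping fixes into GeoJSON is enough for Leaflet to render. Here is a minimal sketch that assumes you already have a stream of (lat, lon) fixes from the GPS driver; the `get_fix()` helper mentioned in the comment is hypothetical:

```python
import json

def track_to_geojson(fixes, path="track.geojson"):
    """Write a list of (lat, lon) fixes to a GeoJSON LineString for Leaflet.

    Note that GeoJSON expects [lon, lat] ordering, which is easy to get backwards.
    """
    feature = {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [[lon, lat] for lat, lon in fixes],
        },
        "properties": {"name": "rover GPS track"},
    }
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": [feature]}, f)

# fixes = [get_fix() for _ in range(100)]   # get_fix() is your own GPS polling helper
# track_to_geojson(fixes)
```

On the Leaflet side, `L.geoJSON()` renders the file directly, which made the scatter painfully obvious.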
This wouldn’t do. Commercial GPS has a best-case accuracy of roughly 3 meters, and our car would have zig-zagged around chasing random tracks. GPS also needs a clear line of sight to the sky to acquire satellites. It occurred to me that the “San Francisco jogger” guy wasn’t carrying anything at all to direct the car, and at the time he was running under an overhead concrete deck. GPS couldn’t have been his solution, so we abandoned it.
Sonar-based steering had similar problems: sonar requires a transmitter that must be pointed at the car at all times (see videos of the cooler or other similar solutions). Ditto for IR solutions. Radio-frequency direction-finding seems feasible, at least on the surface, until you see the physical footprint of the DF antennas that would have to be mounted on the vehicle:
Little kids don’t have the patience to run backwards pointing a thing at a car anyway. Little kids just want to run. Jogger guy certainly wasn’t facing backward with transmitter in hand. He was just running.
The simple approaches were not working. That left only one option that I knew of, and I was worried it would doom the whole project. So we took the intimidating leap into the world of GPUs, tensors, and machine learning.
End-to-End ML Computer Vision
Autonomous driving is the big topic du jour, so I hoped that we could leverage some existing, well-proven resources to stay focused. I wanted to avoid letting a complex subject of active research hijack the project and prolong reaching our goal.
I read up on pre-trained object-identification models and ran MobileNetSSD on the RPi. This model identifies humans, but it produced only about 1 frame per second (FPS), far too slow for car navigation. I also noticed that it dropped detections on many video frames. It was only reliable if you were close to the camera; back up too far and poof, nothing, no detection.
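For reference, that first experiment was essentially the standard OpenCV DNN recipe for the Caffe MobileNetSSD model, something along these lines (the model file names are simply whatever you downloaded them as):

```python
import cv2

# Standard Caffe MobileNetSSD files; class 15 is "person" in this model's label map.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect_people(frame, conf_threshold=0.4):
    h, w = frame.shape[:2]
    # The model expects 300x300 inputs, scaled and mean-subtracted.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        class_id = int(detections[0, 0, i, 1])
        if class_id == 15 and confidence > conf_threshold:
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
            boxes.append((x1, y1, x2, y2, float(confidence)))
    return boxes
```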
I added an Intel Neural Compute Stick 2 (NCS2) to improve frame rates to about 8–10 FPS. After reading more, I tried to hand the SSD detections off to a tracker. The trackers we tried in the python-opencv package didn’t really improve performance much. The other options, available in python-opencv-contrib, looked promising. However, Intel’s NCS2 SDK for Python requires you to replace your OpenCV install with their opencv shared object, which then conflicts with the python-opencv-contrib library, making those other trackers unusable.
I also began to suspect that our fisheye lens (175° FOV) was degrading object-detection accuracy. So I spent an enormous amount of time trying to correct for lens distortion:
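The de-warping itself followed the usual OpenCV fisheye recipe, roughly like this, assuming you have already run a checkerboard calibration to obtain the camera matrix `K` and distortion coefficients `D`:

```python
import cv2
import numpy as np

def build_undistort_maps(K, D, size):
    """Precompute remap tables once; K and D come from cv2.fisheye.calibrate()."""
    R = np.eye(3)
    # Reuse K as the new projection matrix to keep the original scale.
    return cv2.fisheye.initUndistortRectifyMap(K, D, R, K, size, cv2.CV_16SC2)

def undistort(frame, map1, map2):
    return cv2.remap(frame, map1, map2,
                     interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)
```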
This had no effect. Detecting persons at distance remained a problem. I lost hope in an integrated approach using existing components.
So I went a bit deeper down the rabbit hole. I was now going to try to train my own CNN and use the end-to-end ML approach. This means there is very little but the neural net between the sensors and the actuators; you train it as a system, end to end, against your problem. We went back to the burro project to get started because it already had categorical and regression CNN models that had been set up to train a rover auto-pilot for line following.
We would need to collect a lot of data for this to work. Before investing that time, I wanted all possible sensors installed up front, at least sonar or LIDAR, to avoid having to start over later. The burro project made it simple to integrate stereo sonar into the rover: we added two MaxBotix MB1240 EZ4 Ultrasonic Distance Sensors near the front shock tower.
I tried to incorporate the sonar and IMU readings and the camera frames into an end-to-end Keras/TensorFlow network. PyImageSearch.com is a fantastic resource for ramping up on the concepts and techniques.
I captured about 40,000 frames of training data over 5 weekends of my daughter running in front of the car. We integrated several CNN approaches into our test model, optimized learning rates, fused the 2D CNN with a linear vector of IMU and sonar readings, and used an eGPU to speed up the training process. It was “slightly operational.” The training data included the raw frames alongside frames that had been processed by the SSD running on the Intel NCS2 (see below). A sidecar file held the sonar and IMU readings.
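In spirit, the fused network was a small Keras functional-API model with two inputs: a convolutional branch for the frame and a dense branch for the IMU/sonar vector, concatenated before the steering and throttle heads. A minimal sketch of that idea (the layer sizes are illustrative, not our exact architecture):

```python
from tensorflow.keras import Input, layers, models

def build_model(img_shape=(120, 160, 3), sensor_dim=8):
    # Image branch: a small stack of strided convolutions.
    img_in = Input(shape=img_shape, name="frame")
    x = layers.Conv2D(24, 5, strides=2, activation="relu")(img_in)
    x = layers.Conv2D(32, 5, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(100, activation="relu")(x)
    x = layers.Dropout(0.2)(x)

    # Sensor branch: IMU + sonar readings as a flat vector.
    sensor_in = Input(shape=(sensor_dim,), name="imu_sonar")
    s = layers.Dense(16, activation="relu")(sensor_in)

    # Fuse the branches and emit steering / throttle regressions.
    merged = layers.Dense(50, activation="relu")(layers.concatenate([x, s]))
    steering = layers.Dense(1, activation="tanh", name="steering")(merged)
    throttle = layers.Dense(1, activation="sigmoid", name="throttle")(merged)

    model = models.Model(inputs=[img_in, sensor_in], outputs=[steering, throttle])
    model.compile(optimizer="adam", loss="mse")
    return model
```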
I trained dozens of models using raw images and others using the SSD frames. My hope was that the preprocessed blue boxes might better inform steering and throttle decisions. But it turned out the SSD-processed models didn’t noticeably outperform the raw-frame models. We continued to try different CNN layouts and iterated on dropout, batch normalization, learning rates, and removing layers to avoid overfitting. We added a learning-rate finder. None of it mattered.
I suspected we just didn’t have enough training data at just above 45,000 frames. The data-acquisition process had a natural limiter: coaxing a 5-year-old to participate in yet another 30-minute collection session in the company parking lot. You can acquire more frames by mirroring your own data, augmenting it, and shifting brightness. Street/line-following projects even have entire simulators (gym/gazebo) to help produce simulated training data. However, none of those environments involve a rover chasing a 5-year-old girl.
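The mirroring and brightness tricks are cheap to apply at training time. A sketch of the idea; note that a horizontal flip also has to negate the recorded steering angle:

```python
import cv2
import numpy as np

def augment(frame, steering):
    """Return a randomly augmented (frame, steering) training pair."""
    # Horizontal mirror: the steering label flips sign with the image.
    if np.random.rand() < 0.5:
        frame = cv2.flip(frame, 1)
        steering = -steering

    # Random brightness shift in HSV space.
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[:, :, 2] = np.clip(hsv[:, :, 2] * np.random.uniform(0.6, 1.4), 0, 255)
    frame = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    return frame, steering
```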
Our best model to date would follow her for a block or two and then take a hard left into a parked car or a bush. I couldn’t trust steering decisions except in very contrived situations, and I couldn’t trust autonomous throttle control in any conditions. We had used the sonar thus far as just a throttle kill switch for safety; it didn’t appear to enhance the CNN-based decisions in any way. It was difficult to differentiate what was working from what was not. Each model iteration might appear to work “marginally” better than the last, yet a minor configuration tweak could just as easily result in a massive performance regression.
Also, many CNN models failed to properly compile to the .bin format needed for the Intel NCS2. My experience with the NCS2 began to sour: it had scant documentation, byzantine input config files, no helpful examples, and spotty support.
I bought a Google Coral Edge TPU on a whim.
It was immediately obvious that Coral’s Keras/TensorFlow -> TensorFlow Lite -> uint8 quantization -> Edge TPU compilation workflow was much better thought through. The Coral could compile and run our custom CNN models where the Intel NCS2 had failed. Despite this new discovery, we still needed a better approach. The CNN training-and-improvement process was at once promising, even tantalizing, but ultimately just frustrating. The improvements and failures all occurred in a black box, and the massive amount of time needed for data collection and research was too much of an obstacle for us to overcome.
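For the curious, the conversion step looks roughly like this with a TensorFlow 2.x converter: full-integer post-training quantization followed by a pass through the `edgetpu_compiler` CLI. The `representative_frames` input is a placeholder for your own sample data:

```python
import tensorflow as tf

def quantize_for_edgetpu(model, representative_frames, out_path="model_quant.tflite"):
    """Post-training full-integer quantization so the Edge TPU can run the model."""
    def representative_dataset():
        for frame in representative_frames:
            # Each sample must match the model's input shape, batch dimension included.
            yield [frame[None, ...].astype("float32")]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    with open(out_path, "wb") as f:
        f.write(converter.convert())

# Then, on the command line:
#   edgetpu_compiler model_quant.tflite   # produces model_quant_edgetpu.tflite
```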
A Breakthrough
So we hit the “reset” button on the design. I had originally given up on MobileNetSSD due to its detection performance at distance: it was great at tracking people. Just not so much with small people. Or large people walking away and becoming small people. Along the way I had, of course, tried other object detectors. For example, I tried yolov3-tiny and was impressed, but it failed to properly convert for the NCS2 stick. A Keras port of the network did run on the Google Coral, but the delay/FPS was unacceptable (under 5 FPS, as it is still a 17-layer network). But perhaps I could give MobileNetSSD another look with the Coral TPU.
A few things that I had overlooked or dismissed up front vastly improved the SSD’s ability to meet our performance needs:
- Replacing the glass (a better lens): I ordered the Arducam M12 Lenses
- Using Google’s default v1 model, quantized and compiled for the Edge TPU
- Applying a SORT Kalman output filter
The improved lenses were a godsend, and I kicked myself for not identifying this problem much earlier.
The quantized TFLite model was speedy (50–60 ms per inference) even though the RPi 3B+ limits the accelerator to USB 2.0 transfer speeds. Watching the SORT tracker in action is really pretty amazing, and it gave us a lot of confidence that the vehicle could keep following us. Suddenly, with these pieces in place, everything actually worked. The rover would steer toward people as they walked around the room. It reminded my wife of the Terminator.
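Running the compiled model from Python is short. A minimal sketch using the `tflite_runtime` interpreter with the Edge TPU delegate; the model file name is a placeholder, and the output-tensor ordering shown is the usual SSD layout, worth double-checking for your particular model:

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path="ssd_mobilenet_v1_quant_edgetpu.tflite",   # placeholder name
    experimental_delegates=[load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def detect(frame_rgb_300, score_threshold=0.5):
    """frame_rgb_300 must already be a uint8 array shaped (300, 300, 3)."""
    interpreter.set_tensor(input_details[0]["index"],
                           np.expand_dims(frame_rgb_300, axis=0))
    interpreter.invoke()
    # Typical SSD output ordering: boxes, classes, scores, count.
    boxes = interpreter.get_tensor(output_details[0]["index"])[0]
    classes = interpreter.get_tensor(output_details[1]["index"])[0]
    scores = interpreter.get_tensor(output_details[2]["index"])[0]
    return [(box, int(cls), float(score))
            for box, cls, score in zip(boxes, classes, scores)
            if score > score_threshold]
```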
SSDs will acquire multiple object targets, so you have to choose what to follow. For target selection, we currently pick the closest person in the frame (largest box by area) if no other target is already being tracked, and then follow that person using SORT. We also discard very small detections from consideration: the car will take off after these, and they are usually false positives, read “not people.” It would be great to use facial recognition in the future, but, in practice, what we have now works well for a bunch of kids running around with a robot car.
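The selection logic itself is only a few lines. A sketch of the idea, assuming person detections in (x1, y1, x2, y2, score) form and the reference SORT implementation's `Sort.update()` interface:

```python
import numpy as np
from sort import Sort  # https://github.com/abewley/sort

tracker = Sort()
MIN_AREA = 40 * 40  # discard tiny boxes; they are usually not people

def pick_target(detections):
    """detections: list of (x1, y1, x2, y2, score) person boxes for one frame."""
    dets = np.array([d for d in detections
                     if (d[2] - d[0]) * (d[3] - d[1]) >= MIN_AREA])
    if len(dets) == 0:
        dets = np.empty((0, 5))
    # SORT assigns persistent track IDs across frames: rows of (x1, y1, x2, y2, id).
    tracks = tracker.update(dets)
    if len(tracks) == 0:
        return None
    # Follow the closest person, i.e. the largest box by area.
    areas = (tracks[:, 2] - tracks[:, 0]) * (tracks[:, 3] - tracks[:, 1])
    return tracks[int(np.argmax(areas))]
```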
We had solved the steering, target-selection, and tracking problems by this point. But we still needed an auto-throttle solution. Trusting a fairly powerful model car with autonomous throttle control around little kids is a dicey proposition, so the most important factors to me were safety and a smooth throttle response.
After iterating on the problem, what I ended up with was this (a rough code sketch follows the list):
- Use a PID controller whose setpoint is a constant, predefined target height in pixels (for a person); its output is a raw throttle value
- Run the raw throttle value through an averaging step function to smooth the throttle response (that is, use what burro already had)
- Then take the smoothed throttle output and apply a “sonar factor”: starting at a maximum sonar distance, use sonar readings to scale the throttle value down to zero as the target approaches the minimum safety distance (then apply the brake)
- Finally, use the Castle Creations ESC firmware programming to limit top-end power, reduce punch, enforce a gradual power curve, and increase braking power
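Here is the rough sketch of steps 1 through 3 promised above. The gains, distances, and the exponential moving average (standing in for burro's averaging step function) are illustrative, not our tuned values:

```python
TARGET_HEIGHT_PX = 140              # desired apparent height of the person in the frame
MIN_SAFE_CM, MAX_SONAR_CM = 60, 300
KP, KI, KD = 0.004, 0.0002, 0.001   # illustrative gains only

class AutoThrottle:
    def __init__(self, alpha=0.3):
        self.integral = 0.0
        self.prev_error = 0.0
        self.smoothed = 0.0
        self.alpha = alpha          # EMA weight, standing in for burro's smoother

    def update(self, box_height_px, sonar_cm, dt=0.05):
        # 1. PID on apparent person height: error > 0 means the target is too far away.
        error = TARGET_HEIGHT_PX - box_height_px
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = KP * error + KI * self.integral + KD * derivative

        # 2. Smooth the raw throttle so the car doesn't lurch.
        self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * raw

        # 3. Sonar factor: scale toward zero as the range closes on the safety minimum.
        if sonar_cm <= MIN_SAFE_CM:
            return 0.0              # caller should also apply the brake here
        factor = min(1.0, (sonar_cm - MIN_SAFE_CM) / (MAX_SONAR_CM - MIN_SAFE_CM))
        return max(0.0, min(1.0, self.smoothed * factor))
```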
After a lot of tuning of the PID coefficients, sonar thresholds, and throttle-averaging factor, we actually had a working solution!
Conclusion
We are now the proud owners of a totally functional, autonomous “follow-me” rover. That sentence makes my wife frown and raise her eyebrows. Nevertheless, we can activate it from a smartphone via the responsive web interface the rover serves over its access point, see the target box, sonar, and IMU readouts, and control auto-pilot modes.
The auto-throttle moderates acceleration nicely and I trust its decisions under most conditions. The rover is not as good with objects on the far left or right due to the square 300px frame crop required by MobileNetSSD. It can lose targets that exhibit sharp lateral movements — kids are known for their sharp lateral movements. The upgraded lenses perform really well under most conditions with the exception of direct sunlight and some deep shadows or mottled backgrounds.
We’ve taken the rover on several “walks” through the neighborhood, at dusk no less. And, yes, you can definitely go on a jog with the car. I’ve done it. But most importantly, my daughter finally has a robot dog. And I think she learned something really valuable about engineering, persistence, failure, and about how stubborn Daddy is.
For more information, see my repository:
Please send any comments or ideas; I would love to hear them.