Tesla Places Big Bet on Vision-Only Self-Driving
The latest update to Tesla’s self-driving technology ups the company’s stake in a bold bet that it can deliver autonomous vehicles using cameras alone. But despite the improving capabilities of vision-based self-driving, experts say the approach faces fundamental hurdles.
Last Saturday, Tesla rolled out the much-delayed version 9 of its “Full Self-Driving” (FSD) software, which gives Tesla vehicles limited ability to navigate autonomously. The package, which is already on sale as a $10,000 add-on, has been in beta testing with a select group of drivers since last October. But the latest update marks a significant shift by ditching input from radar sensors and relying solely on the car’s cameras.
This follows the announcement in May that Tesla will be removing radar altogether from its Model 3 and Model Y cars built in the US and suggests the company is doubling down on a strategy at odds with most other self-driving projects. Autonomous vehicles built by Alphabet subsidiary Waymo and GM-owned Cruise fuse input from cameras, radar and ultra-precise lidar and only ply streets pre-mapped using high-resolution 3D laser scans.
Tesla CEO Elon Musk has been vocal in his criticism of lidar due to its high cost and has instead advocated for a “pure vision” approach. That’s controversial due to the lack of redundancy that comes with relying on a single sensor. But the rationale is clear, says Kilian Weinberger, an associate professor at Cornell University who works on computer vision for autonomous vehicles.
“Cameras are dirt cheap compared to lidar,” he says. “By doing this they can put this technology into all the cars they’re selling. If they sell 500,000 cars all of these cars are driving around collecting data for them.”
Data is the lifeblood of the machine learning systems at the heart of self-driving technology. Tesla’s big bet, says Weinberger, is that the mountain of video its fleet amasses will help it reach full autonomy faster than competitors can with the smaller amounts of lidar data they collect from small fleets of more sensor-laden cars driven by employees.
Speaking at the Conference on Computer Vision and Pattern Recognition last month, Tesla’s AI chief Andrej Karpathy revealed the company had built a supercomputer, which he claimed was the fifth most powerful in the world, to process all this data. He also explained the decision to drop radar, saying that after training on more than 1.5 petabytes of video augmented with both radar data and human labeling the vision-only system now significantly outperforms their previous approach.
The justification for dropping radar does make sense, says Weinberger, and he adds that the gap between lidar and cameras has narrowed in recent years. Lidar’s big selling point is incredibly accurate depth sensing achieved by bouncing lasers off objects—but vision-based systems can also estimate depth, and their capabilities have improved significantly.
Weinberger and colleagues made a breakthrough in 2019 by converting camera-based depth estimations into the same kind of 3D point clouds used by lidar, significantly improving accuracy. Karpathy revealed that the company was using such a “pseudo-lidar” technique at the Scaled Machine Learning Conference last year.
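The core of the pseudo-lidar idea is back-projecting a camera-derived depth map into a lidar-style 3D point cloud using the camera’s geometry. A minimal sketch of that conversion is below; the function name, camera parameters, and toy depth map are illustrative assumptions, not Tesla’s or Weinberger’s actual implementation.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map into a 3D point cloud.

    depth: (H, W) array of estimated depths in meters.
    fx, fy: focal lengths in pixels; cx, cy: principal point (pinhole model).
    Returns an (H*W, 3) array of (X, Y, Z) points in the camera frame.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u runs along image columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # back-project horizontally
    y = (v - cy) * z / fy  # back-project vertically
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a flat wall 10 m away seen by a 4x4-pixel "camera".
cloud = depth_to_point_cloud(np.full((4, 4), 10.0),
                             fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3): one 3D point per pixel
```

Once the depth map is in this point-cloud form, the same 3D object detectors developed for lidar data can be run on it, which is what made the technique a breakthrough for camera-only accuracy.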
How you estimate depth is important, though. One approach compares images from two cameras spaced sufficiently far apart to triangulate the distance to objects. The other trains AI on huge numbers of images until it learns to pick up depth cues. Weinberger says the latter is probably Tesla’s approach, because the company’s front-facing cameras are too close together for triangulation.
The benefit of triangulation-based techniques is that measurements are based in physics, much like lidar, says Leaf Jiang, CEO of start-up NODAR, which develops camera-based 3D vision technology based on this approach. Inferring distance is inherently more vulnerable to mistakes in ambiguous situations, he says, for instance, distinguishing an adult at 50 meters from a child at 25 meters. “It tries to figure out distance based on perspective cues or shading cues, or whatnot, and that’s not always reliable,” he says.
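The triangulation Jiang describes rests on a simple geometric relation for a rectified stereo pair: depth Z = f · B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity (how far a feature shifts between the two images). A sketch, with made-up numbers for illustration:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth from a rectified stereo pair: Z = f * B / d (meters)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A feature shifted 20 px between two cameras 0.5 m apart,
# imaged with a 1000 px focal length:
print(stereo_depth(1000, 0.5, 20))  # 25.0 meters
```

Because the estimate follows directly from geometry, its error is bounded by how precisely disparity can be measured, which is the physics grounding Jiang contrasts with learned, single-camera depth inference.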
How you sense depth is only part of the problem, though. State-of-the-art machine learning simply recognizes patterns, which means it struggles with novel situations. Unlike a human driver, if it hasn’t encountered a scenario before it has no ability to reason about what to do. “Any AI system has no understanding of what’s actually going on,” says Weinberger.
The logic behind collecting ever more data is that you will capture more of the rare scenarios that could flummox your AI, but there’s a fundamental limit to this approach. “Eventually you have unique cases. And unique cases you can’t train for,” says Weinberger. “The benefits of adding more and more data are diminishing at some point.”
This is the so-called “long tail problem,” says Marc Pollefeys, a professor at ETH Zurich who has worked on camera-based self-driving, and it presents a major hurdle for going from the kind of driver assistance systems already common in modern cars to truly autonomous vehicles. The underlying technology is similar, he says. But while an automatic braking system designed to augment a driver’s reactions can afford to miss the occasional pedestrian, the margin for error when in complete control of the car is fractions of a percent.
Other self-driving companies try to get around this by reducing the scope for uncertainty. If you pre-map roads, you only need to focus on the small amount of input that doesn’t match, says Pollefeys. Similarly, the chance of three different sensors making the same mistake simultaneously is vanishingly small.
The scalability of such an approach is certainly questionable. But trying to go from a system that mostly works to one that almost never makes mistakes by simply pushing ever more data through a machine learning pipeline is “doomed to fail,” says Pollefeys.
“When we see that something works 99 percent of the time, we think it can’t be too hard to make it work 100 percent,” he says. “And that’s actually not the case. Making 10 times fewer mistakes is a gigantic effort.”
Videos posted by Tesla owners after the FSD update, showing their vehicles lurching out into highways or blind to concrete pillars in the middle of the road, demonstrate the gulf that still needs to be bridged and suggest that Musk’s prediction of full autonomy by the end of the year may have been overly optimistic.
But Pollefeys thinks it’s unlikely Tesla will abandon the narrative that full autonomy is close at hand. “A lot of people already paid for it [Tesla’s FSD package], so they have to keep the hope alive,” he says. “They’re stuck in that story.”
Tesla didn’t respond to an interview request.