SoC Runs AI Simultaneously on 14 Video Streams
Image processing specialist Ambarella has launched two new SoCs for single- and multiple-sensor security cameras, each with new AI capabilities enabled by the company’s CVflow AI accelerator engine. Both support 4K video encoding and advanced AI processing such as facial recognition or license plate recognition.
The CV5S SoC targets multi-sensor camera systems, encoding four imager channels of up to 8MP/4K resolution, each at 30 frames per second (fps), while performing advanced AI on each 4K image stream. It can handle up to 14 inputs. The SoC family doubles the encoding resolution and memory bandwidth of Ambarella’s previous generation of products while consuming 30 percent less power. It consumes <5 W and provides 12 eTOPS (GPU-equivalent TOPS, Ambarella’s measure of the amount of GPU horsepower required to run the same AI processing tasks).
The other new SoC, CV52S, targets single-sensor cameras and supports 4K resolution at 60 fps. Compared to previous generations of Ambarella SoCs, this new device quadruples AI performance, doubles CPU throughput and offers 50 percent more memory bandwidth. It consumes <3 W and provides 6 eTOPS.
The performance boost stems from migration to the 5-nm process node, along with improvements to, and enlargement of, Ambarella’s in-house CVflow AI accelerator block.
“You see all these startups coming from everywhere, saying they have the best AI performance per watt, and they may be right,” said Jerome Gigot, Ambarella’s senior director of marketing. “But that doesn’t make a camera, that doesn’t make a product. If you just have an AI accelerator, you just have an AI accelerator.”
Gigot noted that an imaging pipeline for 4K or 8K video is complex, handling a large amount of data, encoding big data volumes, transferring those data to a special block for AI processing while probably running a Linux stack on top. That’s difficult to achieve at low power budgets while maintaining video quality.
Alongside the CVflow AI accelerator, both the new SoCs include Ambarella’s image signal processor (ISP) that handles features like color processing, auto-exposure, auto white balance and noise filtering.
“This block we’ve been developing for 16 years,” said Gigot. “That’s why we think startups still have a long way to go. They could license [an ISP block from elsewhere] but then it’s not really integrated with the rest of the system in terms of memory access and everything else.”
The memory system is among the company’s key pieces of IP.
“We have one memory controller, and we orchestrate the whole thing so that when we get data on-chip, we try not to make any copies,” Gigot said. “We move pointers around, we don’t move data around. That’s only possible if you design the whole architecture from scratch, knowing exactly what the chip is going to do.”
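The zero-copy idea Gigot describes can be illustrated in software terms. The sketch below (not Ambarella’s code; all names are invented) passes a view into one shared buffer from stage to stage, so the pipeline moves a reference rather than copying frame data:

```python
# Illustrative sketch: pipeline stages share one buffer and pass a
# "pointer" (a memoryview) instead of copying data between them.
frame_buffer = bytearray(16)           # stands in for a shared on-chip buffer

def isp_stage(view: memoryview) -> memoryview:
    view[0] = 0xFF                     # operate in place; no copy is made
    return view                        # hand the reference to the next stage

def ai_stage(view: memoryview) -> int:
    return view[0]                     # reads the same underlying memory

result = ai_stage(isp_stage(memoryview(frame_buffer)))
print(result)                          # 255: both stages touched one buffer
```

Because both stages operate on the same underlying bytes, the write made by `isp_stage` is visible in `frame_buffer` itself, with no intermediate copies.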
The AI accelerator is a vector processor that can speed convolution and other common AI functions, or be used for classical computer vision workloads. Users can also choose to run parts of a neural network (such as the sorting algorithms in a single-shot detector network) on the on-chip dual-core Arm Cortex-A76 CPU instead.
The software stack allows applications to take advantage of coefficient sparsity, a technique whereby network coefficients with values that are close to zero are rounded down to zero. The approach can “prune” whole “branches” of calculations from the algorithm in order to vastly reduce computing requirements.
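Coefficient sparsification of the kind described above can be sketched as simple magnitude-based pruning (a generic illustration, not Ambarella’s tooling; the threshold is an arbitrary example):

```python
# Magnitude-based sparsification: coefficients whose absolute value
# falls below a threshold are set to exactly zero.
def sparsify(weights, threshold=0.05):
    return [0.0 if abs(w) < threshold else w for w in weights]

w = [0.8, 0.01, -0.03, 0.5, 0.002, -0.6]
sparse = sparsify(w)
print(sparse)                          # [0.8, 0.0, 0.0, 0.5, 0.0, -0.6]
print(sparse.count(0.0) / len(sparse)) # 0.5, i.e. half the coefficients pruned
```

Hardware that recognizes exact zeros can then skip those coefficients entirely, which is where the compute savings come from.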
Sparsification “is a really effective technique for us because when there’s a zero coefficient, in our architecture we don’t do the operation, we have a skip [function],” he said. “So we don’t calculate the result for that coefficient. It takes us pretty much zero cycles.”
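The skip behavior Gigot mentions can be modeled in software as a multiply-accumulate loop that bypasses zero coefficients (a toy sketch, not the hardware’s actual datapath):

```python
# Zero-skip dot product: coefficients that are exactly zero cost no
# multiply, mirroring the hardware "skip" behavior described above.
def sparse_dot(coeffs, activations):
    total, macs = 0.0, 0
    for c, a in zip(coeffs, activations):
        if c == 0.0:
            continue                   # skip: no operation performed
        total += c * a
        macs += 1
    return total, macs

total, macs = sparse_dot([0.8, 0.0, 0.0, 0.5], [1.0, 2.0, 3.0, 4.0])
print(total, macs)                     # 2.8 2 -> only 2 of 4 multiplies ran
```

With 50 to 80 percent of coefficients zeroed, as the article notes, the majority of multiplies can be elided this way.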
The process typically identifies 50 to 80 percent of the coefficients as targets for sparsification, Gigot said. Some minor retraining is usually required after sparsification in order to regain the prediction accuracy lost during the process. According to Gigot, retraining can usually bring accuracy to within 1 percent of the original model, an acceptable tradeoff for most customers, especially given a model size reduction of up to five-fold. Ambarella is also working on sparsification and quantization tools that are more architecture-aware.
With the ability to accept up to 14 video streams, then perform AI on those streams simultaneously, will customers run multiple neural networks simultaneously? Will some sort of multiplexing scheme be required?
Yes to both, Gigot replied. “The CVflow is a very fast vector engine, a very fast convolution engine. Everything is time-multiplexed. We have different paths in hardware so we can parallelize operations, but we don’t tie it to a specific network [which is] totally different to batch processing on a GPU.”
Batch processing, a technique often employed by large GPUs, groups images and processes them in parallel with the network’s parameters already loaded in memory. That approach reduces computing cost by avoiding switches between operations.
For smaller engines like CVflow, bigger neural networks must be broken into chunks to be processed since the chip’s memory can’t store all the parameters at once. Consecutive chunks may originate from the same neural network, or another network, or another channel input. Typical hardware utilization on CVflow is between 70 and 80 percent, Gigot said, adding that switching networks/channels does not affect efficiency.
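The chunked, time-multiplexed scheduling described above can be sketched as a round-robin queue over per-network, per-channel work items (a toy model; the network names, layer names, and scheduling policy are invented for illustration):

```python
# Toy time-multiplexing scheduler: each network/channel pair is split
# into chunks sized for on-chip memory, and consecutive chunks on the
# engine may come from different networks or camera channels.
from collections import deque

queues = {
    "face_recognition/cam0": deque(["conv_chunk1", "conv_chunk2", "fc_chunk"]),
    "plate_reader/cam3":     deque(["conv_chunk1", "conv_chunk2"]),
}

schedule = []
while any(queues.values()):
    for name, q in queues.items():     # round-robin across networks/channels
        if q:
            schedule.append((name, q.popleft()))

for step in schedule:
    print(step)
```

This interleaving is the sense in which the engine is not tied to a specific network: each time slice simply executes whatever chunk is next, regardless of which model or input channel it belongs to.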
The CV5S and CV52S are expected to begin sampling in October 2021.