FPGA comes back into its own as edge computing and AI catch fire
The saturation of mobile devices and ubiquitous connectivity has steeped the world in an array of wireless connectivity, from the growing terrestrial and non-terrestrial cellular infrastructure and supporting fiber and wireless backhaul networks to the massive IoT ecosystem with newly developed protocols and SoCs to support the billions of sensor nodes intended to send data to the cloud.
By 2025, the global datasphere is expected to approach 175 zettabytes per year. What’s more, the number of connected devices is anticipated to reach 50 billion by 2030. However, the traditional distributed sensing scheme with the cloud-based centralized processing of data has severe limitations in security, power management, and latency — the end-to-end (E2E) latencies for ultra-reliable low-latency communications found in 5G standards are on the order of tens of milliseconds. This has led to a demand to drive data processing to the edge, disaggregating computational (and storage) resources to reduce the massive overhead that comes with involving the entire signal chain in uplink and downlink transmissions. This, in turn, increases the agility and scalability of a network.
New advances in machine learning (ML) and deep neural networks (DNNs) with artificial intelligence promise to provide this insight at the edge, but these solutions come with a huge computational burden that cannot be satisfied with conventional software and embedded processor approaches. Additionally, the design of hyper-specialized, application-specific ICs (ASICs) is failing as shrinking process geometries drive the cost of development and production out of the realm of edge devices. Moreover, the lack of reconfigurability of ASICs severely limits any potential system upgrades. Traditional FPGA approaches are typically too expensive and power-hungry for the densities demanded in new-generation edge applications.
The niche of edge computing burdens devices with the need for extremely low power operation, tight form factors, agility in the face of changing data sets, and the ability to evolve with changing AI capabilities via remote upgradeability, all at a reasonable price point. This is, in fact, the natural domain of the FPGA, with its inherent excellence in accelerating compute-intensive tasks in a flexible, hardware-customizable platform. However, many of the available off-the-shelf FPGAs are geared toward data center applications in which power and cost profiles justify the bloat in FPGA technologies. Fortunately, there is a solution out there: The Efinix Titanium FPGA family squarely addresses the requirements of near-data computation with its advanced Quantum compute fabric, allowing for the flexible configuration of up to a million logic elements (LEs) and routing for high utilization ratios, regardless of the application.
The need to move data processing to the edge
In terms of connectivity, the past decade has, more or less, been dedicated to one of three things: bringing wireless connectivity around the world, increasing the strength and integrity of said connection, and ensuring that everything viable (from person to “thing”) is somehow connected. This has, in essence, been realized with next-generation 5G deployment — densifying the base cellular infrastructure and developing newer technologies to optimize data throughput, capacity, coverage, and latency requirements — along with the IoT revolution, wherein physical objects are outfitted with sensing capabilities and/or tags. These technological developments have already had profound societal implications whereby wireless connectivity is an intrinsic part of daily life. The ability to remotely monitor, track, and even control objects with sensors and actuators has almost become an assumed capability for all things, from home appliances to complex industrial machinery. However, this massive increase in device density has led to some very apparent bottlenecks.
Cloud-centric IoT extracts, accumulates, and processes the massive amounts of sensor data from IoT nodes at public/private clouds, leading to significant latencies. The various topologies for backhaul access (from the edge device to the gateway, then back to the cloud via a fiber or wireless connection) introduce three main bottlenecks in terms of:
- Power budget
Traditional IoT is often defined by highly power-constrained end devices sending small payloads to internet-connected gateways at low to medium throughputs via star or mesh topologies. These multi-hop architectures fail to meet the low-latency requirements of many critical time-sensitive applications, from public safety and medical to industrial automation and more. Protocols defined for low-latency, medium-throughput, time-synchronized connections such as WirelessHART, ISA 100.11a, IEEE 802.11ac, and LTE-M have round-trip latencies that go down to only 10 ms with direct gateway access; however, the typical latency falls at several hundred milliseconds.1 This is just within the realm of IoT. If we shift our focus toward cellular networks, the smallest allowable E2E latency in a 5G-based network for high-voltage electricity distribution is 5 ms, and this goes up to 10 ms for discrete automation applications.2 However, established advanced manufacturing techniques utilizing a hardwired Ethernet-based (e.g., EtherNet/IP, Profinet IO, EtherCAT, etc.) or fieldbus-based (e.g., Profibus, Foundation Fieldbus, CAN, etc.) technology with time-sensitive networking have to reliably achieve sub-millisecond cycle times, sub-microsecond latency, and extremely low jitter for plant operations.3 The closed-loop sense-to-actuation time of these applications stands at less than 1 µs, with a maximum transaction error rate of less than 10⁻⁹. These are numbers that traditional wireless networks struggle to compete with.
Wireless connectivity requires either asynchronous or synchronous communications. For reliable data transmission, scheduled transmissions are necessary. But this takes up significant power because the device is unable to operate in the desirable sleep or low-power modes that allow for increased battery lifetime. Moreover, bringing data to the cloud via a gateway and/or a multi-hop transmission with the smart placement of sensor nodes not only diminishes security but also increases hardware costs. Security is a major objective in cellular generations beyond 5G (6G and beyond), wherein the mass collection of user information from data service providers has often led to data-leakage incidents.4 Complete anonymization and the untraceability of data can be realized with the decentralization of compute-intensive tasks.
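The latency argument above comes down to simple budget arithmetic. The following is a minimal sketch, assuming a round trip crosses every link twice and is processed once. The per-link and processing figures are hypothetical values chosen for illustration; only the 5 ms and 10 ms budgets come from the text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Link:
    name: str
    one_way_ms: float  # illustrative one-way transit time for this link


def round_trip_ms(links: Sequence[Link], processing_ms: float) -> float:
    """Each link is crossed twice (uplink and downlink); processing happens once."""
    return 2 * sum(link.one_way_ms for link in links) + processing_ms


def meets_budget(links: Sequence[Link], processing_ms: float, budget_ms: float) -> bool:
    """True when the modeled round trip fits inside the latency budget."""
    return round_trip_ms(links, processing_ms) <= budget_ms


# Hypothetical figures, not measurements.
CLOUD_LINKS = [Link("node to gateway", 5.0), Link("gateway to cloud", 20.0)]
CLOUD_PROCESSING_MS = 30.0
EDGE_LINKS = [Link("node to edge device", 1.0)]
EDGE_PROCESSING_MS = 1.0

# E2E budgets quoted in the text for 5G-based networks.
BUDGETS_MS = {"high-voltage distribution": 5.0, "discrete automation": 10.0}

if __name__ == "__main__":
    paths = {
        "cloud": (CLOUD_LINKS, CLOUD_PROCESSING_MS),
        "edge": (EDGE_LINKS, EDGE_PROCESSING_MS),
    }
    for path_name, (links, proc) in paths.items():
        total = round_trip_ms(links, proc)
        for use_case, budget in BUDGETS_MS.items():
            verdict = "ok" if meets_budget(links, proc, budget) else "too slow"
            print(f"{path_name:5s} {total:6.1f} ms vs {use_case} ({budget} ms): {verdict}")
```

With these assumed numbers the cloud path misses both budgets while the edge path fits both, which is the qualitative point of the paragraph. Real deployments would need measured values.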
The basic requirements for bringing intelligence to edge devices
There is a growing consensus to expand the computational infrastructure from the data center to the edge. Concepts such as federated learning shift standard centralized ML approaches from the data center to mobile phones with collaborative learning over a shared prediction model, decoupling the ability to do ML from the requirement of storing data in the cloud.5 Advanced DNNs are being developed and are evolving every day to better enable edge-based processing capabilities. Successfully bringing intelligence to edge devices diverges from the classical AI examples such as personalized shopping, AI-powered assistants, or predictive analysis in manufacturing facilities. Edge/fog computing examples include autonomous vehicle control, the remote control of robotics that requires complex feedback mechanisms, and even smart grid end devices that use ML to better manage energy resources from renewables and the grid based on predictive analysis of native energy usage. For applications such as these, the major deciding factors to successfully implement AI include:
- Low power consumption
Comparing the popular AI chipsets for IoT/edge nodes
The AI chipset market has consistently experienced massive growth: its $7.6 billion market value in 2020 is anticipated to grow to $57.8 billion by 2026.6 Leading AI hardware ranges from hyper-specialized to general-purpose solutions:
- Highly customized ASICs and SoCs
- Programmable FPGA solutions
- General-purpose GPUs and CPUs
General-purpose GPUs and CPUs often follow the von Neumann architecture wherein an instruction fetch cannot occur at the same time as a data operation, causing instructions to be executed sequentially. In multi-processor solutions such as vector CPUs and multi-core GPUs, this is somewhat bypassed but does require more data sharing across cores and increases latency. This software-managed parallelism must optimally distribute the workload between processing elements, or a computational load and communication imbalance can occur. This characteristic doesn’t lend itself to custom data types and specific hardware optimizations. In terms of efficiency for latency, power, parallel processing, and flexibility/reconfigurability, FPGAs are inherently better than GPUs. First, while CPUs and GPUs must process data in a specific manner (e.g., SIMD, SIMT execution models), FPGAs and ASICs essentially directly implement the software algorithm in hardware whereby the logic elements simply complete the software instructions. Furthermore, this very same quality allows for more FPGA power savings and reconfigurability: one can choose to change the nature of the dataflow through the hardware, whereas it is hard-coded in ASICs, SoCs, GPUs, and CPUs.
In terms of popular AI chipsets, ASICs have taken the lead, with FPGAs following. However, in terms of the major parameters for intelligent computing at the edge, ASICs fall short. This is particularly true for cost: IoT deployments can vary between tens of nodes to hundreds of thousands of nodes. ASICs are notoriously difficult to create, requiring years of development and a massive capital expenditure of tens of millions of dollars just to produce — a cost that is typically only justifiable en masse with millions to billions of units. Moreover, AI development is continuously dynamic, with hundreds of existing topologies and their respective neural networks improving significantly within months. Over time, as new models with different features and layers emerge, companies would likely desire to embrace these changes. This calls for a low-cost flexible, reconfigurable platform that can be rapidly prototyped and deployed.
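The cost argument can be made concrete with a back-of-the-envelope amortization. This is a minimal sketch, assuming a simple model in which total cost is a one-time engineering charge plus a fixed per-unit price. All dollar figures are hypothetical placeholders, except that the ASIC up-front cost is set in the "tens of millions" range mentioned above.

```python
import math
from typing import Optional


def unit_cost(nre_usd: float, per_unit_usd: float, units: int) -> float:
    """Effective cost per device once the one-time charge is spread over the volume."""
    if units <= 0:
        raise ValueError("units must be positive")
    return nre_usd / units + per_unit_usd


def break_even_units(asic_nre: float, asic_unit: float,
                     fpga_nre: float, fpga_unit: float) -> Optional[int]:
    """Smallest volume at which the ASIC's total cost drops to the FPGA's.

    Returns None when the ASIC's per-unit price is not lower than the FPGA's,
    since a higher up-front cost can then never be recovered.
    """
    if asic_unit >= fpga_unit:
        return None
    volume = (asic_nre - fpga_nre) / (fpga_unit - asic_unit)
    return max(0, math.ceil(volume))


if __name__ == "__main__":
    # Hypothetical inputs: $20M ASIC development, $2 ASIC part,
    # $100k FPGA design effort, $10 FPGA part.
    for volume in (1_000, 100_000, 10_000_000):
        asic = unit_cost(20_000_000, 2.0, volume)
        fpga = unit_cost(100_000, 10.0, volume)
        print(f"{volume:>10,} units: ASIC ${asic:,.2f}/unit, FPGA ${fpga:,.2f}/unit")
    print("break-even volume:", break_even_units(20_000_000, 2.0, 100_000, 10.0))
```

Under these assumed inputs the crossover sits in the millions of units, consistent with the claim that ASIC costs are only justifiable at very large volumes. The model ignores respins, yield, and time to market.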
How the various popular AI chipsets meet the needs for intelligent edge applications
Why conventional FPGAs can’t deliver intelligence to the edge
While FPGAs are continually growing more competitive in the AI chipset market that has been traditionally dominated by ASICs and GPUs, these platforms have mostly been utilized for prototyping and developing an ASIC or for use in the public and private cloud for applications such as web search, image classification, and translation. These devices are typically expensive, power-hungry, and large to meet the performance required to run complex AI algorithms. The primary purpose of an FPGA is programmability, wherein the hardware fabric is composed of programmable LEs and programmable routing via switch blocks. With this fabric, a user can essentially connect any LE to one of many wiring tracks through a programmable switch. Scaling this technology has generally been met with the tactic of increasing the density of LEs and ensuring the routing switches have the routability to cover this increase. This painstaking and expensive process involves the use of teams of engineers to optimize FPGA routing and IC designers to decrease size where feasible, allowing for incremental increases in density while also pigeonholing the FPGA into costly, power-hungry applications outside the edge.
Almost 10 years ago, co-founders Sammy Cheung and Tony Ngai predicted this very situation and started Efinix with a vision of creating an FPGA technology that would deliver on the true potential of FPGAs to address the needs of the emerging edge market. Today, the Efinix Titanium family of devices stands alone in the marketplace, delivering the compute requirements of edge AI in a power envelope and footprint that make them a natural fit for the most demanding edge applications. This is due in large part to their innovative Quantum compute fabric that consists of reconfigurable tiles, or exchangeable logic and routing (XLR) cells, and that does away with traditional routing and allows LEs to be smaller and used more flexibly. The devices integrate memory blocks and high-speed DSP blocks (multiplier blocks) and span a logic density range from 36,000 to 1 million LEs. This fundamental difference in architecture allows for a remarkably high utilization when compared with traditional FPGAs, regardless of the end application. Efinix’s FPGA technology largely diverges from the conventional FPGA by accomplishing high density in a small device package with low power consumption while also maintaining all the flexibilities that come with FPGAs. Altogether, these features make this solution truly disruptive, one that is ahead of the competition when it comes to edge/fog computing.
Quantum core fabric versus traditional FPGA fabric
A closer look: How the Titanium FPGA family addresses the basic requirements of edge computing
Cost-effectiveness, size, and power advantages
The small, 16-nm process enables device form factors as small as a 0.5-mm pitch, 5.5 × 5.5-mm BGA to be readily integrated into an edge node. Outside of real estate considerations, the split-off from the conventional fabric of the FPGA also allows for the Titanium family of FPGAs to come in at a low price point. This, in turn, allows for additional cost savings that come with edge computing over centralized cloud-based processing while simultaneously lowering the barriers for implementing an FPGA.
IoT nodes will also inevitably require low battery consumption and often utilize energy-harvesting techniques to minimize node maintenance. The desirable sleep modes that are often found in low-power wireless modulation schemes are not typically seen with edge computing, as data processing is done as often as possible. However, energy-efficient power schemes can be employed by using parallelism to slow internal clock frequencies for lower dynamic power. This diverges from the bottlenecks found with sequential processors that only employ spatial parallelism wherein the typical solution of throwing in more processor cores will only burn power, because the batch processing of in-memory data cannot provide consistent processing performance for dynamic, incoming data streams from the I/O channels. FPGAs offer both spatial and temporal parallelism and therefore employ not only data parallelism but also task and pipeline parallelism.7 This allows for more variety in efficient dataflows that reduce the impact of memory on power consumption (e.g., spatial and temporal mapping of operations on LEs to reduce off-chip memory access by reusing data stored in LE memory blocks).9
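The trade between parallelism and clock frequency can be illustrated with the standard first-order CMOS dynamic power relation, P = a · C · V² · f. The sketch below assumes that N parallel lanes each run at f/N to hold throughput constant, that switched capacitance grows in proportion to the lane count plus some overhead, and that the slower clock permits a lower supply voltage. All numeric inputs are hypothetical and are not Titanium figures.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def parallel_power_w(activity: float, capacitance_f: float, freq_hz: float,
                     lanes: int, scaled_voltage_v: float,
                     overhead: float = 0.0) -> float:
    """Power of `lanes` copies of the datapath, each clocked at freq_hz / lanes.

    Total throughput matches the single-lane design. Switched capacitance is
    assumed to scale with the lane count, plus a fractional `overhead` for
    the extra distribution and merge logic.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    total_cap = capacitance_f * lanes * (1.0 + overhead)
    return dynamic_power_w(activity, total_cap, scaled_voltage_v, freq_hz / lanes)


if __name__ == "__main__":
    # Hypothetical baseline: one lane at 400 MHz and 1.0 V.
    base = dynamic_power_w(0.2, 1e-9, 1.0, 400e6)
    # Hypothetical alternative: four lanes at 100 MHz, 0.7 V, 10% overhead.
    wide = parallel_power_w(0.2, 1e-9, 400e6, lanes=4,
                            scaled_voltage_v=0.7, overhead=0.1)
    print(f"single lane: {base * 1e3:.2f} mW, four lanes: {wide * 1e3:.2f} mW")
```

Note that in this model the saving comes from the quadratic voltage term: with the voltage left unchanged and no overhead, the parallel design dissipates the same dynamic power as the baseline.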
The architectural advantage: Flexibility and reconfigurability
The ultimate challenge for edge applications is finding a suitable algorithm for the specific application and mapping it efficiently to hardware. Oftentimes, the networks (e.g., DNNs, CNNs, etc.) are complex and computation-, memory-, and power-hungry, and so they require access to specialized hardware accelerators with optimized memory for the execution of algorithms over a consistent data stream while maintaining a small power envelope. By mapping workloads onto Titanium FPGAs, users can take advantage of the inherent small size, low cost, and high utilization to deliver intelligence to the edge. For companies just entering the field or for veterans making the switch, this does not have to be a complex process. Titanium-embedded RISC-V processors allow designers to run the kernel of their algorithm and rapidly innovate in the Edge Vision SoC framework.
Edge Vision SoC design flow
The RISC-V cores in the Titanium are “soft” in that they are instantiated when needed in the FPGA fabric rather than being hardened into the silicon. This keeps them flexible so that they can be customized as needed during development of the application. During compilation, the Efinity software dynamically decides whether to use the XLR cell as routing or logic, and it optimizes silicon resource use specifically for the characteristics of each design. This way, a designer can implement as many cores as needed with the required software-defined hardware acceleration.
This is the fundamental concept behind the Efinix Quantum accelerator, wherein “sockets” with all the data inputs and outputs pre-defined are readily available to be instantiated and can be programmed in software with standard calls. Software engineers can then easily pull out hot spots in their code as areas they want to target for acceleration. More specifically, within each socket, a designer can create a small piece of hardware to accelerate, for instance, the convolution for an AI algorithm and place it in the accelerator framework. Pieces of the algorithm can be moved back into the RISC-V software when needed or out into a hardware accelerator “socket” if performance is required. This fluid approach to hardware/software system partitioning is fast and easy. The end result: standard calls to a standard hardware accelerator wherein software algorithms are easily written and debugged calling small hardware accelerators that optimize system performance. This approach keeps the design concept in the software, where algorithms can be rapidly debugged, tuned, and evolved.
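The partitioning idea can be sketched in plain software. The following is a conceptual illustration only and is not the Efinix accelerator API: the `AcceleratorSockets` class, its method names, and the stand-in "hardware" function are all hypothetical. It shows a kernel being called through one stable entry point while its implementation moves between software and an accelerator.

```python
from typing import Callable, Dict, List

Kernel = Callable[[List[int], List[int]], List[int]]


def conv1d_software(signal: List[int], taps: List[int]) -> List[int]:
    """Reference sliding dot product (CNN-style convolution, no kernel flip)."""
    n_out = len(signal) - len(taps) + 1
    return [sum(signal[i + j] * taps[j] for j in range(len(taps)))
            for i in range(n_out)]


def conv1d_hardware_stub(signal: List[int], taps: List[int]) -> List[int]:
    """Stand-in for an accelerator; a real one would drive hardware registers."""
    return conv1d_software(signal, taps)


class AcceleratorSockets:
    """Hypothetical registry: each named kernel has a software fallback and an
    optional hardware implementation that can be attached or detached."""

    def __init__(self) -> None:
        self._software: Dict[str, Kernel] = {}
        self._hardware: Dict[str, Kernel] = {}

    def register_software(self, name: str, fn: Kernel) -> None:
        self._software[name] = fn

    def attach_hardware(self, name: str, fn: Kernel) -> None:
        if name not in self._software:
            raise KeyError(f"no software reference registered for {name!r}")
        self._hardware[name] = fn

    def detach_hardware(self, name: str) -> None:
        self._hardware.pop(name, None)

    def backend(self, name: str) -> str:
        return "hardware" if name in self._hardware else "software"

    def call(self, name: str, signal: List[int], taps: List[int]) -> List[int]:
        impl = self._hardware.get(name, self._software[name])
        return impl(signal, taps)


if __name__ == "__main__":
    sockets = AcceleratorSockets()
    sockets.register_software("conv1d", conv1d_software)
    print(sockets.backend("conv1d"), sockets.call("conv1d", [1, 2, 3, 4], [1, 0, -1]))
    sockets.attach_hardware("conv1d", conv1d_hardware_stub)  # hot spot moved out
    print(sockets.backend("conv1d"), sockets.call("conv1d", [1, 2, 3, 4], [1, 0, -1]))
```

The calling code does not change when the hot spot moves, which is the property the paragraph describes. Keeping the software version registered also gives a reference against which an accelerator's output can be checked.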
The Quantum core fabric found in Titanium FPGAs has the additional benefit of an intrinsic ability to ease congestion by allocating XLR cells for routing rather than logic. All of these factors combined with the cost-effectiveness of the Titanium FPGAs allow a designer to rapidly design and debug in the largest device for the original prototype, and then make the switch to the smallest device that still meets basic requirements for the end of development and production, thereby optimizing performance, power, size, and cost.
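The "prototype large, ship small" step amounts to a resource-fit search. Below is a minimal sketch under stated assumptions: the three catalog entries are hypothetical devices, not real part numbers. Their LE and I/O extremes echo the family ranges quoted in this article, while the DSP counts and the 20% logic headroom are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Device:
    name: str
    logic_elements: int
    dsp_blocks: int
    io_pins: int


@dataclass(frozen=True)
class Requirements:
    logic_elements: int  # LEs used by the design after prototyping
    dsp_blocks: int
    io_pins: int


def smallest_fit(devices: Sequence[Device], req: Requirements,
                 headroom: float = 0.2) -> Optional[Device]:
    """Return the device with the fewest LEs that still covers the design,
    keeping `headroom` spare logic for later updates. None if nothing fits."""
    needed_les = req.logic_elements * (1.0 + headroom)
    fits = [d for d in devices
            if d.logic_elements >= needed_les
            and d.dsp_blocks >= req.dsp_blocks
            and d.io_pins >= req.io_pins]
    return min(fits, key=lambda d: d.logic_elements, default=None)


# Hypothetical catalog for illustration only.
CATALOG = [
    Device("small", 36_000, 100, 146),
    Device("medium", 180_000, 400, 200),
    Device("large", 1_000_000, 1_300, 268),
]

if __name__ == "__main__":
    design = Requirements(logic_elements=40_000, dsp_blocks=120, io_pins=150)
    choice = smallest_fit(CATALOG, design)
    print("selected:", choice.name if choice else "no device fits")
```

A production flow would take the resource counts from the synthesis and place-and-route reports rather than from hand-entered numbers.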
With edge computing in its early stages, the ability to interconnect with other devices is an important system-level attribute for design reuse. With Titanium, users can take advantage of the FPGA’s inherent ability to connect to virtually any device through the many I/Os (146 to 268). These I/O pins are configurable to many standards and can add some level of bridging — an amount of flexibility that is enormously difficult to achieve with other processing engines or dedicated, application-specific standard parts.
Titanium FPGAs meet all the requirements for bringing intelligence to the edge rapidly
Leveraging Titanium to serve embedded AI applications at the edge
The IoT applications that can benefit most from edge processing often overlap with applications requiring reliable, low-latency communications. The use cases for bringing complex processing to the edge while maintaining relatively low power consumption are numerous, and with time, more applications will crop up as this technology matures.
In medical applications for tele-surgery, there must be little time delay between the surgeon/controller and the equipment. For this application, a shared network architecture with both cloud and edge computing is absolutely necessary so that robotic machine-learning algorithms are employed for all actuated tools or surgical robots to improve dexterity of human-manipulated end effectors for precise haptic feedback. This falls under the umbrella of the internet of robotic things, wherein approaches for programming a robot include imitation learning or reinforcement learning. While much of this complex field will be performed on the cloud, given the surgeon’s distant geographical location, pre-cached electronic medical records and relevant surgery history such as previously recorded robot motions can be stored locally. With this, the edge-based AI engine can allow a robot to query its local model when there is low confidence in the task to be performed. Pattern-recognition algorithms can also locally process 3D video and images, highlight relevant features such as abnormalities, and annotate images with relevant anatomical data while minimizing data bandwidth consumed by such an operation.9
Robots in industrial applications, meanwhile, typically perform repetitive tasks with few degrees of freedom and little variance in motions. However, these robots can be rapidly trained to successfully perform a task and change motions when small deviations occur to help prevent a plant’s downtime. Moreover, human-robot interactions can occur without risk to human life. Collaborative robots such as autonomous mobile robots and automated guided vehicles for plant-floor monitoring/maintenance combine machine vision and robotics, requiring little delay between the real-time 3D mapping of the environment and a robot’s movements using algorithms such as simultaneous localization and mapping to prevent collisions in dynamic environments. Both these applications require high compute capabilities while consuming low power.
The Titanium FPGA family is uniquely poised for these applications and more, wherein a user can develop code on the processor as normal and steadily eliminate timing bottlenecks through the flexible XLR hardware acceleration until the required near-real-time system performance is realized. This type of iterative improvement to optimize parameters such as performance, latency, and power, regardless of end application, is nearly impossible with ASIC, GPU, and CPU solutions.
Medical wearables can transmit critical information gathered locally from patient data without the need for frequent transmissions. With this type of technology, quick and effective diagnosis becomes available on-site. Needless to say, wearables take size and power constraints to the extreme. But here, the Titanium Ti60 offers a unique combination of high compute capability in a small form factor with over 62,000 LEs, 160 DSP blocks, and 146 I/Os in a 3.5 × 3.4-mm WLCSP package. With low operating and standby power, this Titanium FPGA is a natural fit for the stringent size and power requirements of wearable applications.
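The low-confidence fallback described above can be expressed as a small decision rule. This is a minimal sketch, assuming the controller reports a confidence score between 0 and 1 and the locally cached model is a simple lookup of previously recorded motions. The function name, the 0.8 threshold, and the task labels are hypothetical.

```python
from typing import Dict, Optional, Tuple


def resolve_motion(task: str, proposed_motion: str, confidence: float,
                   local_model: Dict[str, str],
                   threshold: float = 0.8) -> Tuple[Optional[str], str]:
    """Pick the motion to execute and report where it came from.

    - confident enough: use the controller's own proposal
    - otherwise: consult the locally cached model of recorded motions
    - nothing cached: return no motion and signal that the decision
      must be escalated (for example to the remote operator)
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    if confidence >= threshold:
        return proposed_motion, "controller"
    cached = local_model.get(task)
    if cached is not None:
        return cached, "local-model"
    return None, "escalate"


if __name__ == "__main__":
    # Hypothetical cache of previously recorded robot motions.
    recorded = {"suture": "recorded-suture-path-17"}
    print(resolve_motion("suture", "planned-path-a", 0.95, recorded))
    print(resolve_motion("suture", "planned-path-a", 0.40, recorded))
    print(resolve_motion("clamp", "planned-path-b", 0.40, recorded))
```

Resolving the low-confidence case against a local cache avoids a round trip to the cloud for that decision, which is the latency benefit the paragraph points to.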
Machine vision for process automation has often relied on ML whereby intelligent cameras equipped with MIPI CSI-2 sensors and a strong memory bandwidth are used to accomplish visual-, pixel-, or feature-based inspections. Imperfections such as scratches and roughness can be ascertained via a suitable ML algorithm (e.g., decision tree, Naïve Bayes, etc.) to train the classifier for both fault detection and classification. The FPGA can provide both image and audio processing by running an inferencing engine based upon a trained neural network. Here, the large amounts of internal block memory in the Titanium FPGAs allow for a majority of the activity to stay on-chip, thereby reducing the time- and power-consuming off-chip memory accesses. These very same characteristics can be applied to vision applications that require AI, such as increasing the quality of video conferencing, rapid human detection/facial recognition for video doorbells, and even pedestrian/obstacle recognition in autonomous driving applications.
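The benefit of keeping activity on-chip can be estimated by counting external memory reads for a sliding-window image filter. The sketch below assumes a k × k window applied to every valid pixel position and compares a naive design that refetches each window from external memory with a line-buffered design that holds the previous k − 1 rows in on-chip block memory. The 640 × 480 frame and 3 × 3 window are illustrative choices, not figures from the article.

```python
def offchip_reads_naive(height: int, width: int, k: int) -> int:
    """Every output pixel refetches its full k x k window from external memory."""
    out_h, out_w = height - k + 1, width - k + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("window larger than image")
    return out_h * out_w * k * k


def offchip_reads_line_buffered(height: int, width: int, k: int) -> int:
    """Each input pixel is fetched once and then reused from on-chip rows."""
    if height < k or width < k:
        raise ValueError("window larger than image")
    return height * width


def line_buffer_bytes(width: int, k: int, bytes_per_pixel: int = 1) -> int:
    """On-chip storage needed to hold the k - 1 previous rows."""
    return (k - 1) * width * bytes_per_pixel


if __name__ == "__main__":
    h, w, k = 480, 640, 3
    naive = offchip_reads_naive(h, w, k)
    buffered = offchip_reads_line_buffered(h, w, k)
    print(f"naive reads:         {naive:,}")
    print(f"line-buffered reads: {buffered:,}")
    print(f"reduction factor:    {naive / buffered:.2f}x")
    print(f"on-chip buffer:      {line_buffer_bytes(w, k):,} bytes")
```

For this assumed frame size, a little over a kilobyte of on-chip storage cuts external reads by roughly a factor of nine, which illustrates why ample block memory reduces both time and power spent on off-chip accesses.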
Autonomous and remotely controlled drones and robots can be found in a massive array of potential applications, from mail/package delivery to the aforementioned tele-surgery and industrial robotics use cases. These applications require rapid response to recognize and avoid various obstacles. Other significant considerations of these applications are knowledge sharing, immersive training, and remote control/assistance via AR/VR devices. Typically, AR/VR devices require low power and large amounts of video aggregation as well as computation capability. The hardened 2.5-Gb MIPI functionality in most Titanium FPGAs helps reduce power, while the embedded memory and DSP blocks allow for the accumulation and processing of massive amounts of data for AR/VR systems.
FPGAs that can finally serve mainstream use cases
The Titanium FPGA family forges a path for companies to finally leverage the inherent flexibility, processing, and performance benefits of FPGAs at the power-, size-, and cost-constrained edge. The edge presents the ultimate challenge to hardware acceleration, wherein compute-intensive algorithms must perform optimally with extremely low-power operation while also addressing the need to be agile in the face of changing datasets and evolving AI capabilities for device longevity. Instead of blindly following the march of other FPGA companies toward the data center, where power and cost profiles make it easy to justify bloated FPGA technologies, Efinix addresses all the requirements of edge computing through Titanium.
- Dang, S., Amin, O., Shihada, B. et al. What should 6G be? Nat Electron 3, 20–29 (2020). https://doi.org/10.1038/s41928-019-0355-6
- Biookaghazadeh, S.; Zhao, M.; Fengbo, R. Are FPGAs Suitable for Edge Computing?
- Capra, M.; Bussolino, B.; Marchisio, A.; Shafique, M.; Masera, G.; Martina, M. An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks. Future Internet 2020, 12, 113. https://doi.org/10.3390/fi12070113
- Chowriappa, A., Wirz, R., Ashammagari, A.R. et al. Prediction from expert demonstrations for safe tele-surgery. Int. J. Autom. Comput. 10, 487–497 (2013). https://doi.org/10.1007/s11633-013-0746-5
The post FPGA comes back into its own as edge computing and AI catch fire appeared first on EETimes.