Arm is best known for its mobile CPUs and GPUs, but the company is also keen to show its expertise in emerging fields. This week, Arm unveiled its new Project Trillium machine learning processor (MLP), three months after first revealing its existence.
The MLP exists to allow smartphones and tablets to perform machine learning independently, without needing to connect to a server somewhere. While a smartphone has limited performance, there are many advantages to this approach: less bandwidth and power need to be spent on sending and receiving data, data doesn't leave the device so it is more secure, latency is minimal, and so on. While Arm are initially targeting mobile devices, they hope the solution will one day scale from tiny Internet of Things devices to massive data centres.
The Project Trillium MLP was designed by engineers from Arm's CPU and GPU teams, who were given free rein to design a new architecture not bound by convention. They decided to emphasise three major points in its construction:
- Efficient convolutions (a mathematical operation for the uniform mixing of data, which takes the bulk of computation time)
- Efficient data movement (as moving data requires more time than the actual computation)
- Sufficient programmability (e.g. allowing new architectures, operators and topology to be used)
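To make the first point concrete, here is a minimal sketch of the convolution used in neural networks, written in plain Python. It is an illustration only and says nothing about how Arm's hardware implements the operation; the function name `conv2d` and the sample data are made up for this example. Each output value needs one multiply-accumulate per kernel element, which is why convolutions dominate the computation time.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.

    This is the "valid" form used in neural networks (strictly speaking a
    cross-correlation, as the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # One multiply-accumulate per kernel element.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 3x4 input and 2x2 kernel, chosen only for illustration.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
result = conv2d(image, kernel)
print(result)
```

Even this tiny example performs 24 multiply-accumulates (six outputs, four each); a real network layer performs millions.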
The MLP uses up to sixteen compute engines to perform the necessary processing, while a microcontroller and direct memory access engine are in charge of scheduling. In order to maintain sufficient levels of performance without using a lot of energy or space, Arm are targeting quantised 8-bit data types that are common in machine learning.
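As a rough illustration of what quantised 8-bit data means, the sketch below maps floating-point weights onto signed 8-bit integers using a single shared scale factor. Arm has not said which quantisation scheme the MLP uses, so the symmetric scheme here, along with the names `quantise` and `dequantise`, is an assumption made for the example.

```python
def quantise(values, num_bits=8):
    """Map floats to signed integers with one shared scale (symmetric scheme).

    Assumption: a generic scheme for illustration, not Arm's documented method.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    quantised = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantised]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.11, 0.0, 0.35, -1.27]
q, scale = quantise(weights)
restored = dequantise(q, scale)
print(q, scale)
print(restored)
```

Each value now fits in one byte instead of four, at the cost of a rounding error of at most half the scale factor per value.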
The Arm MLP has quite a few tricks up its sleeve in order to operate more efficiently. For example, it can perform lossless compression of feature maps by masking repeating zeroes in a given block, generally yielding compression ratios of 3:1. This allows the MLP to reduce the amount of SRAM needed, with Arm aiming to provide just 1MB for its sixteen compute engines.
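One simple way to picture zero masking is a bitmask that records which positions in a block are non-zero, followed by only the non-zero values. The sketch below follows that idea; the exact format Arm uses has not been published, so the block size, the one-bit-per-value mask and the function names are all assumptions.

```python
def compress_block(block):
    """Split a block into a 1-bit-per-value mask and its non-zero values."""
    mask = [1 if v != 0 else 0 for v in block]
    nonzero = [v for v in block if v != 0]
    return mask, nonzero


def decompress_block(mask, nonzero):
    """Rebuild the original block exactly, so the scheme is lossless."""
    values = iter(nonzero)
    return [next(values) if bit else 0 for bit in mask]


def compressed_size_bits(mask, nonzero, value_bits=8):
    """Size in bits: one mask bit per position plus the stored values."""
    return len(mask) + value_bits * len(nonzero)


# Hypothetical 16-value block of 8-bit activations, mostly zero.
block = [0, 0, 7, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 12, 0, 0]
mask, nonzero = compress_block(block)
original_bits = 8 * len(block)
packed_bits = compressed_size_bits(mask, nonzero)
print(original_bits, packed_bits, original_bits / packed_bits)
```

The saving depends entirely on how many zeroes the block contains: a block with no zeroes would actually grow by the size of the mask.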
Compression is also used to minimise the bandwidth used for weights. If the weights are well-trained, they tend to have a lot of zeroes in the later layers of a network, so the Project Trillium MLP masks these zeroes to achieve a compression ratio of 4:1. To save time, the MLP can also skip over less important computations that wouldn't significantly affect the final result.
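Arm has not detailed how the MLP decides which computations to skip. As one plausible illustration, the sketch below skips any multiplication whose weight is at or below a small threshold, then compares the result with the exact answer. The threshold rule, the function name `dot_with_skipping` and the sample values are assumptions for this example, not a description of the hardware.

```python
def dot_with_skipping(inputs, weights, threshold=0.0):
    """Dot product that skips terms whose weight magnitude is <= threshold.

    Assumption: a simple magnitude test stands in for whatever rule the
    hardware actually applies.
    """
    total = 0.0
    skipped = 0
    for x, w in zip(inputs, weights):
        if abs(w) <= threshold:
            skipped += 1  # no multiply-accumulate performed for this term
            continue
        total += x * w
    return total, skipped


# Hypothetical inputs and weights, chosen only for illustration.
inputs = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, 0.0, 0.001, -0.25]

exact = sum(x * w for x, w in zip(inputs, weights))
approx, skipped = dot_with_skipping(inputs, weights, threshold=0.01)
print(exact, approx, skipped)
```

Here half of the multiplications are skipped while the result shifts only slightly; with a threshold of zero, only exact zeroes are skipped and the result is unchanged.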
The programmability requirement of the Arm MLP is satisfied through the use of a programmable layer engine (PLE). The PLE allows the processor to add new operators and also supports the Android neural network API (NNAPI) and Arm's own neural network SDK (Arm NN). It also uses pooling, activations and compression to accelerate common tasks and thereby improve performance and efficiency once more.
The Arm MLP should be released in mid-2018 and will be available on old 16nm and bleeding-edge 7nm fabrication nodes. Developers will be able to write code using Arm NN for Android and Embedded Linux, running on their existing CPUs and GPUs, and have it begin working on these new MLPs as soon as they're available.
Are you interested in mobile machine learning development? Let us know in the comments.
Body images credit: Arm.