Rohan's blog

Decoding AI, one token at a time.

Gen AI at the edge with ONNX

As someone whose day job is to build Azure’s edge infrastructure products for enterprises and developers, it’s becoming clear to me that, while the initial hype around Gen AI has focused largely on running complex models on cutting-edge, massive, and expensive datacenter hardware, there is tremendous potential in deploying some version of generative (and, frankly, non-generative) AI models at the edge.

In this post, I’ll describe the need for AI at the edge and the challenges that come with it, then dive into an example of how the ONNX (Open Neural Network Exchange) ecosystem can be leveraged to deploy Gen AI models at the edge.

The need for AI at the edge

Why on earth would I run the AI model on a resource-constrained edge device instead of the cloud?

The cloud is awesome – OpEx instead of CapEx, a true service-oriented operating model, a managed control plane for infrastructure and services, flexibility and scalability, etc. But there are some scenarios where it’s simply better, faster, cheaper, or more compliant to run the AI model directly on a device at the edge, such as a laptop, IoT device, or mobile phone. These scenarios include:

  1. Real-time applications at the edge: If the data doesn’t have to leave the local edge network and make its way over the internet to a cloud server and back for each inferencing request, the application will presumably run a lot faster. This matters more for some applications (think: factory robots that need to react in real time) and less for others (like a support chatbot).
  2. Data gravity and costs: Especially in enterprise scenarios, if the volume of data being generated makes it prohibitively expensive to send all of it to the cloud for inferencing, the economics generally point towards solutions that do the processing locally. Think: manufacturing plants generating tons of IoT data every second, where the data can’t be reduced enough before upload without losing fidelity.
  3. Privacy and governance of data: Users who are conscious of data privacy are far more likely to adopt AI solutions that don’t send private data outside of their devices. Apple’s announcement at WWDC 2024 was heavily focused on AI with data privacy, where the company laid out how it will guarantee that its customers’ data will only be processed either directly on the customer’s device or in an explicitly private datacenter that Apple assumes responsibility for. Enterprise requirements are often dictated by governance and compliance constraints that impose restrictions on data locality. Examples include governments that don’t want their data to leave geographical boundaries and industries like healthcare, where the privacy and security of personal health information is protected by laws like HIPAA in the United States.

All of this points to a clear need to extend the benefits of AI (and all its future enhancements and applications) to edge devices that are closer to the user and the data.

Challenges with running AI models at the edge

AI models in the cloud are generally backed by flexible, scalable, and powerful platform infrastructure. This influences the models (and their exact configurations) that are offered to cloud customers. Models are configured and tested to capture value in an unconstrained environment, and the accuracy and speed of the inferencing operation that runs in the cloud are really only limited by how much the customer is willing to pay and the pace of technological innovation across the software and hardware stack. Edge devices, on the other hand, to varying extents, are characterized by:

  • Capacity constraints: Phones, laptops, and tiny ARM-based IoT devices doing real-time monitoring and alerting in manufacturing plants are very unlike beefy cloud servers. The available cores and memory are extremely limited and not easily scalable because of form-factor, cooling, and CapEx constraints. Gen AI inferencing models need to run with a footprint reduced to fit on these smaller systems while still being performant enough (a rough sizing sketch follows this list)
  • HUGE diversity of hardware: The edge includes a range of device types and hardware capabilities, making things even more complex from a model optimization and development perspective. As an example, an ISV needs a simple, consistent way to build and deploy optimal AI models across their target scenarios, which could include large retail outlet chains (deploying $10,000 “heavy edge” servers with A2 GPUs) as well as smaller shops (where laptops and phones may be the in-store device). The energy sector is another example of an industry that employs specialized hardware for AI tasks (ex: oil refineries use devices like Moxa for computer vision applications like leak detection, and these may have extremely limited or no hardware acceleration). Supporting the plethora of edge platforms out there would require expensive and complex custom effort
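
To make the capacity point concrete, here’s a rough back-of-envelope sketch (illustrative numbers of my own, not measurements) of how much memory just the model weights need, and why quantization is so important at the edge:

    # Rough, illustrative estimate of memory needed just to hold model weights.
    # The numbers are assumptions for illustration, not measured figures.
    def weight_footprint_gb(num_params_billions, bytes_per_param):
        return num_params_billions * 1e9 * bytes_per_param / (1024 ** 3)

    # A ~3.8B-parameter model (roughly Phi-3-mini scale):
    print(weight_footprint_gb(3.8, 2))    # ~7.1 GB in fp16 -- too big for many edge devices
    print(weight_footprint_gb(3.8, 0.5))  # ~1.8 GB with 4-bit weights -- far more manageable
    # Activations, KV cache, and runtime overhead add to this, so real requirements are higher.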

We need a way to bring the innovation happening in the AI model ecosystem to this disparate set of hardware in order to successfully support edge scenarios.

Using Olive and ONNX to deploy AI models on my PC

The good news is: there’s been some exciting momentum in the open-source industry around solving these problems to extend the benefits of AI to the edge.

Reducing model footprint with Olive

With Olive, I was able to reduce the size of the Phi-3-mini-4k-instruct model from over 7GB to ~2.5GB. The process takes just a few simple steps, which are outlined here, and Olive also has extensive documentation that can be found here.
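
For the curious, Olive workflows are driven by a configuration that describes the input model and the optimization passes to apply. Below is a minimal sketch of how such a workflow can be kicked off from Python; the config structure and pass names are my assumptions from memory (they vary between Olive versions), so treat the Olive docs as the source of truth:

    # Minimal sketch of driving an Olive optimization workflow from Python.
    # The config structure and pass names below are illustrative assumptions;
    # check the Olive documentation for the schema your installed version expects.
    from olive.workflows import run as olive_run

    config = {
        "input_model": {
            "type": "HfModel",
            "model_path": "microsoft/Phi-3-mini-4k-instruct",
        },
        "passes": {
            # e.g. convert to ONNX, then apply int4 weight quantization
            "conversion": {"type": "OnnxConversion"},
            "quantize": {"type": "OnnxMatMul4Quantizer"},
        },
        "output_dir": "phi3-mini-optimized",
    }

    olive_run(config)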

While this optimization naturally came with some constraints on the model, it makes it possible to build apps using Phi-3-mini on resource-constrained hardware, which is a huge unlock for the reasons described in the sections above.

Note: there is a LOT more to Olive than model footprint reduction. Deep dive for another day…

Gen AI chat (using Phi-3-mini) in my browser with ONNX Runtime

The ONNX ecosystem has taken a great stab at meshing with the existing AI tooling ecosystem and simplifying the journey of AI model development and deployment. There are multiple ways I’ve seen it explained – I find it best to frame it in terms of the AI model builders’ journey:

  • Enabling the data scientist to focus on building the AI model instead of tools, ecosystems, and deployment specifics. ONNX is a standardized file format that makes the different existing formats in the AI model ecosystem interoperable. It effectively enables data scientists to build the AI model using the tools and ecosystems of their choice, such as TensorFlow and PyTorch, without having to worry about whether it’s compatible with the rest of the software supply chain all the way to production deployment.
  • Enabling the engineer to focus on building and deploying the AI app, instead of on how to optimize it for the specific hardware platforms it will run on. The ONNX ecosystem includes ONNX Runtime, which enables engineers to build and deploy their AI apps on the target hardware platforms with the desired efficiency and accuracy in a scalable way. ONNX Runtime leverages “execution providers” to abstract away the complexities of specific hardware devices and capabilities (such as Nvidia CUDA GPUs, Intel GPUs, and CPUs). The IHV ecosystem contributes extensions/execution providers based on the specifics of their hardware, in turn enabling developers to scale support across a range of hardware capabilities (see the sketch after this list).
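
To make both halves of that story concrete, here’s a minimal sketch of the round trip on the Python side: export a toy PyTorch model to the ONNX format, then load it with ONNX Runtime and let the runtime pick an execution provider that the local hardware actually supports. The toy model is a stand-in of my own, purely for illustration:

    # Sketch: export a toy PyTorch model to the ONNX format, then run it with
    # ONNX Runtime. The tiny model here is a made-up stand-in; the
    # torch.onnx.export and onnxruntime.InferenceSession calls are the real APIs.
    import numpy as np
    import torch
    import onnxruntime as ort

    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
    torch.onnx.export(model, torch.randn(1, 16), "tiny.onnx",
                      input_names=["input"], output_names=["output"])

    # Execution providers abstract the hardware: ONNX Runtime walks the list and
    # uses the best one the device actually supports (e.g. CUDA, else plain CPU).
    providers = ort.get_available_providers()
    session = ort.InferenceSession("tiny.onnx", providers=providers)
    outputs = session.run(None, {"input": np.random.rand(1, 16).astype(np.float32)})
    print(outputs[0].shape)  # (1, 8)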

There’s a really cool (i.e., straightforward) example in the onnxruntime-inference-examples GitHub repo that I tried running on my PC. With a few simple commands, I was able to get a Gen AI chatbot working in my browser! Kudos to the contributors for making it so simple to get people pumped about the possibilities of AI at the edge!

Step 1: Clone the onnxruntime-inference-examples GitHub repo

Step 2: Run literally two npm commands

Step 3: Open up localhost in your browser… et voilà!

Step 4: Start chatting away (and testing the perf/accuracy boundaries of this optimized model)!

I should note that this did require a PC with 16GB of RAM to run in a usable way – my laptop with 8GB of RAM didn’t get much further than displaying the chat interface in my browser. This is likely just some assumption in the script or the model optimization (via Olive), but I would’ve loved for the minimum hardware requirements to be specified in the docs to save me the effort.

Multi-modal chat using Phi-3-vision on my PC

As someone who started out as a computer vision researcher (back when Hidden Markov Models were cool), I couldn’t possibly end this post without giving the phi-3-vision model a shot (running locally on my PC, of course).

Luckily, and unsurprisingly, the ONNX contributors have a handy tutorial for this model too in the onnxruntime docs. With a few simple steps, I executed this script from the onnxruntime-genai repo on GitHub, and within minutes I was chatting with an AI that could use images alongside natural language to understand context and answer my questions.
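
For reference, here’s roughly what the core of that script does, condensed from memory. The onnxruntime-genai method names and arguments below are assumptions on my part and may differ between versions – the actual example script in the repo is the source of truth:

    # Condensed, from-memory sketch of a multimodal chat turn with onnxruntime-genai.
    # Method names/arguments are assumptions -- defer to the actual Phi-3-vision
    # example script in the onnxruntime-genai repo.
    import onnxruntime_genai as og

    model = og.Model("path/to/phi-3-vision-onnx")      # folder holding the ONNX model files
    processor = model.create_multimodal_processor()
    tokenizer_stream = processor.create_stream()

    image = og.Images.open("shoe.jpg")                  # the image you want to chat about
    prompt = "<|user|>\n<|image_1|>\nWhat brand are these shoes?<|end|>\n<|assistant|>\n"
    inputs = processor(prompt, images=image)

    params = og.GeneratorParams(model)
    params.set_inputs(inputs)
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.generate_next_token()
        print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)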

Here’s a fun example.

Loading model...
Image Path (leave empty if no image):

I picked an image that shows a single shoe with the brand name barely visible.

Image credit: Imani Bahati (Unsplash)

And gave it the following prompt:

Prompt: I really like these shoes. Can you tell me what brand they are and who founded the company that made them? Also, can you recommend shoes from the same company that I should consider using for a half-marathon later this year?

And here’s what I got back:

Processing image and prompt...
Generating response...
The shoes in the image are from the Nike brand, which was founded by Phil Knight in 1964. Nike is known for its innovative approach to athletic footwear and apparel, focusing on performance and style.

For a half-marathon, you might consider the Nike Air Zoom Pegasus 36. This model is designed for long-distance running and offers excellent cushioning and support. It's also known for its durability and comfort, making it a great choice for a half-marathon. </s>

This blog ends here, but the AI era has only just begun…

In just a couple of days, I’ve been able to spin up a multi-modal Gen AI chatbot running locally on my system. The momentum is crazy, and I’m excited to continue exploring what the awesome AI community is building to make deploying AI everywhere even simpler. Stay tuned for more content like this.

Cheers!!

R