Yollama

Written on December 29th, 2023.

Ollama

I recently came across Ollama, a software project that lets you run LLM inferences locally. For the less tech-savvy, this means you can have a ChatGPT-like experience, unlimited and free of charge, with a few pros & cons, as long as you own a computer.

Ollama has the coolest name and the cutest logo. I was surprised by how easy it is to run AI inferences on your good old, regular laptop (provided it has a GPU and some RAM...). This is great because:

There are some pitfalls: the available models come under a mix of licenses, receive regular updates, and are trained on different datasets to optimize for different tasks, including code. Vision support also landed recently (you can send still pictures to a model such as Llava and get text back).
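To give a concrete idea, here is a minimal sketch of querying a locally running model through Ollama's HTTP API (assuming the default port 11434 and a model you have already pulled, such as llama2):

```python
import json
import urllib.request

def ask_ollama(prompt, model="llama2"):
    """Send a prompt to the local Ollama HTTP API and return the full answer."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With "stream": False, the whole completion comes back as one JSON object.
        return json.loads(resp.read().decode("utf-8"))["response"]

print(ask_ollama("Why is the sky blue?"))
```

For a vision model such as Llava, the same endpoint also accepts an images field containing base64-encoded pictures.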

Self-hosting

So I set out to experiment with it. Since Ollama provides a Docker image, and containers nowadays can - theoretically - access GPUs, I thought, being the cloud engineer that I am, that I would host my own Ollama. The idea was to run the Docker image in a Kubernetes cluster - because who doesn't have one lying around - and secure private remote access to it using CloudFlare's tunnels. This means you can query Ollama from your phone or tablet, not just from your computer, while retaining the benefits above.
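Once the tunnel is up, querying Ollama from another device looks almost the same as querying localhost, just through the tunnel's hostname - and, if you protect it with Cloudflare Access, with a service token attached. The hostname and the Access headers below are assumptions about one possible setup, not a prescription:

```python
import json
import urllib.request

# Hypothetical hostname exposed by the Cloudflare tunnel (adjust to your setup).
OLLAMA_URL = "https://ollama.example.com/api/generate"

payload = {"model": "llama2", "prompt": "Hello from my phone", "stream": False}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Cloudflare Access service token, if the tunnel sits behind Access.
        "CF-Access-Client-Id": "<client-id>",
        "CF-Access-Client-Secret": "<client-secret>",
    },
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8"))["response"])
```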

I quickly toyed with this idea here and used a Minikube cluster for actually running Ollama. It worked, but since I had not configured Minikube to use the GPU and the underlying VM had limited memory, inference was very, very slow. I did eventually get full responses to my metaphysical questions, though.

One thing to keep in mind is that a single Ollama process currently runs only one inference at a time. The only way to handle concurrency is to spawn more Ollama pods, or to queue queries (Ollama has an in-memory queue, though you could improve on that). But in order to run more processes you'll need more memory. In short, this solution is good for building your own personal assistant, available from anywhere, but it wouldn't scale into a product or service.
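To illustrate the queueing option, here is a small client-side sketch (entirely illustrative, not something Ollama ships) that serializes requests through a single worker thread, so callers never hit the one Ollama process concurrently:

```python
import queue
import threading

class SerialLLM:
    """Run at most one inference at a time, queueing everything else."""

    def __init__(self, infer):
        # `infer` is any blocking prompt -> answer callable,
        # e.g. the ask_ollama helper sketched earlier.
        self._infer = infer
        self._tasks = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            prompt, callback = self._tasks.get()
            try:
                callback(self._infer(prompt))
            finally:
                self._tasks.task_done()

    def submit(self, prompt, callback):
        """Enqueue a prompt; `callback` receives the answer when its turn comes."""
        self._tasks.put((prompt, callback))
```

Used as SerialLLM(ask_ollama), submit() returns immediately and the callback fires once the inference has had its turn.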

By the way, CloudFlare also has an AI product that you can use from their Workers platform. They offer a selection of open source models. It's currently free of charge - the product is in a beta phase. If you're not aiming for a local assistant but rather building an AI product, this solution may make more sense than running your own GPUs. One big benefit is that CloudFlare's GPUs run at the edge, as close as possible to your (global) customers. It is not ready for production yet, though.
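For comparison, calling one of those hosted models from outside a Worker goes through CloudFlare's REST API, roughly like this (the account ID and API token are placeholders, and the exact endpoint and model names may have changed since the beta):

```python
import json
import urllib.request

ACCOUNT_ID = "<your-account-id>"
API_TOKEN = "<your-api-token>"
MODEL = "@cf/meta/llama-2-7b-chat-int8"  # one of the open source models on offer

req = urllib.request.Request(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}",
    data=json.dumps({"prompt": "Why is the sky blue?"}).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8"))["result"]["response"])
```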

A Sublime plugin

The next idea to play with was a Sublime Text plugin. I had never ever written an ST plugin before, but having an Ollama prompt one Cmd+Shift+P away seemed handy, and I do know Python. So how complicated could it be?
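To give a flavour of the moving parts, here is a stripped-down sketch of such a plugin - an input panel that sends the prompt to the local Ollama API and inserts the answer at the cursor. It is illustrative only, not the actual Yollama code:

```python
import json
import urllib.request

import sublime
import sublime_plugin


class YollamaPromptCommand(sublime_plugin.TextCommand):
    """Ask for a prompt, query the local Ollama API, insert the answer."""

    def run(self, edit):
        self.view.window().show_input_panel(
            "Ollama prompt:", "", self.on_done, None, None
        )

    def on_done(self, prompt):
        # Run the HTTP call off the UI thread so Sublime stays responsive.
        sublime.set_timeout_async(lambda: self.query(prompt), 0)

    def query(self, prompt):
        payload = {"model": "llama2", "prompt": prompt, "stream": False}
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            answer = json.loads(resp.read().decode("utf-8"))["response"]
        # The original `edit` object is no longer valid by now, so go through
        # the built-in insert command instead.
        self.view.run_command("insert", {"characters": answer})
```

Dropped as a .py file into Sublime's Packages/User folder, the class above becomes available as the yollama_prompt command, which can then be bound to a key or exposed through the command palette.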

Turns out one does not simply learn the Sublime Text plugin APIs and concepts in 5 minutes. A few things I learned:

The very first version of this is out and cleverly called Yollama (you can imagine it's short for querY Ollama). While I'm not sure if I will put more time into this project, I do have some ideas of what to work on:

I know this sounds a lot like an already existing product, GitHub's Copilot. But as stated in the introduction, there are some advantages to running the models locally.

Give it a try, and happy inferences!