Yollama
Ollama
I recently came across Ollama, a software project that allows you to run LLM inferences locally. For the less tech-savvy, this means you can have a ChatGPT-like experience, unlimited and for free, with a few pros & cons, as long as you own a computer.
Ollama has the coolest name and the cutest logo. I was surprised by how easy it is to run AI inferences on your good old, regular laptop (provided it has a GPU and some RAM...). This is great because:
- It's free and open-source.
- It's good for privacy, since processing is done locally. This also prevents you from leaking your employer's intellectual property, or worse, any kind of credentials.
- Ollama can run various models.
- You can run it offline, should you be off-the-grid.
- On the downside, open-source models may not be as accurate (smart) as proprietary ones.
- You need a good-ish computer. A better GPU will get you responses faster, and more RAM will allow you to run more sophisticated models (models with more parameters, which are more likely to give you accurate responses).
- If you're on an unplugged laptop, intensive use will drain your battery.
- Also, keep in mind that an LLM's answers are the most probable ones given a particular training dataset, and thus may not always be entirely correct. This is not specific to Ollama and applies to other products, like ChatGPT.
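To make this concrete, here is a minimal sketch of what querying a local Ollama instance from Python looks like, assuming the default port (11434) and a model you have already pulled (the model name below is just an example):

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default and exposes a simple JSON API.
OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt, model="llama2"):
    """Send a prompt to the local Ollama server and return the full response text."""
    payload = json.dumps({
        "model": model,    # must be a model you pulled, e.g. with `ollama pull llama2`
        "prompt": prompt,
        "stream": False,   # ask for a single JSON object instead of a stream of chunks
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

if __name__ == "__main__":
    print(generate("Why is the sky blue?"))
```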
Self-hosting
So I set out to experiment with it. Since Ollama provides a Docker image, and containers can nowadays access GPUs - theoretically - I thought, being the cloud engineer I am, that I would host my own Ollama. The idea was to run the Docker image in a Kubernetes cluster - because who doesn't have one lying around - and secure private remote access to it using CloudFlare's tunnels. This means you can query Ollama from your phone or tablet, not just from your computer, while retaining the benefits above.
I quickly toyed with this idea here and used a Minikube cluster to actually run Ollama. It worked, but since I hadn't configured Minikube to use the GPU, and the underlying VM had limited memory, inference was very, very slow. But I eventually got the full responses to my metaphysical questions.
One thing to keep in mind is that a single Ollama process currently runs only one inference at a time. The only way to handle concurrency is to spawn more Ollama pods, or to queue queries (Ollama has an in-memory queue, though you could improve on that). But in order to run more processes you'll need more memory. In short, this solution is good for making your own personal assistant, available from anywhere, but it wouldn't scale to be a product or service.
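If you wanted to smooth that limitation out on the client side rather than rely on Ollama's internal queue, a single worker thread can serialize prompts. This is just an illustrative sketch of that idea, not something Ollama or Yollama ships; the `generate` helper is the same kind of HTTP call as shown earlier:

```python
import json
import queue
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt, model="llama2"):
    # One blocking request per prompt, same API call as before.
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# A single worker thread pulls prompts off a queue, so callers never
# hit the Ollama server with concurrent inferences.
jobs = queue.Queue()

def worker():
    while True:
        prompt, on_done = jobs.get()
        try:
            on_done(generate(prompt))
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Callers just enqueue work and get called back when their turn comes.
jobs.put(("Why is the sky blue?", print))
jobs.put(("What is the meaning of life?", print))
jobs.join()
```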
By the way, CloudFlare also has an AI product that you can use from their Workers platform. They offer a selection of open-source models. It's currently free of charge - the product is in a beta phase. If you're not aiming for a local assistant but rather building an AI product, this solution may make more sense than running your own GPUs. One big benefit is that CloudFlare's GPUs run at the edge, as close as possible to your (global) customers. It is not ready for production though.
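For a quick taste without deploying a Worker, Workers AI also exposes a REST endpoint. The sketch below shows roughly what a call looks like, but treat the exact path, model name and response shape as assumptions to verify against CloudFlare's documentation, since the product was still in beta:

```python
import json
import os
import urllib.request

# Both values come from your CloudFlare dashboard; they are placeholders here.
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
API_TOKEN = os.environ["CF_API_TOKEN"]

# Example model name from the beta catalog; check the docs for current names.
MODEL = "@cf/meta/llama-2-7b-chat-int8"
URL = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"

payload = json.dumps({"prompt": "Why is the sky blue?"}).encode()
request = urllib.request.Request(
    URL,
    data=payload,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())

# The model output is wrapped in CloudFlare's usual API envelope
# (assumed shape: {"result": {"response": ...}, "success": ...}).
print(body["result"]["response"])
```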
A Sublime plugin
The next idea to play with was a Sublime Text plugin. I had never written an ST plugin before, but having an Ollama prompt one `Cmd+Shift+P` away seemed handy, and I do know Python. So how complicated can it be?
Turns out one does not simply learn the Sublime Text plugin APIs and concepts in 5 minutes. A few things I learned:
- It's hard to do dependency management in a Sublime plugin unless you use Package Control (which I did not at the very beginning, this being more of a playground than a serious project). That means falling back on Python's `urllib` or `http.client`, which is more hassle.
- Even though plugins run in a separate process from the main Sublime process (which handles everything else, most notably the UI), you can't block the main Python thread: it freezes ST's interface. This means handling the HTTP request to Ollama in a separate thread, which has some implications.
- You need to call a `Command` to do many things (and you may have to implement your own); you can't just directly write code that would, say, append text to a buffer. The calling part happens with `sublime.run_command("name", arguments...)`. Implementing your own happens by subclassing one of the `sublime_plugin.*Command` classes, depending on what you aim to do. I assume this is so that ST can manage its own model and interface updates within concurrent-safe transactions, handle the `Cmd+Z` undo log, and so on. A rough sketch of how these pieces fit together follows this list.
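Here is that sketch: the general shape such a plugin can take, not Yollama's actual source. The command and panel names are made up, and the model is assumed to be pulled already:

```python
import json
import threading
import urllib.request

import sublime
import sublime_plugin

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


class AskOllamaCommand(sublime_plugin.WindowCommand):
    """Prompt the user for a question and show Ollama's answer in an output panel."""

    def run(self):
        # show_input_panel is non-blocking: it calls on_done with the typed text.
        self.window.show_input_panel("Ask Ollama:", "", self.on_done, None, None)

    def on_done(self, prompt):
        # Never block the plugin thread with network I/O: hand it off to a worker.
        threading.Thread(target=self.query, args=(prompt,), daemon=True).start()

    def query(self, prompt):
        payload = json.dumps({
            "model": "llama2",  # assumed to be available locally
            "prompt": prompt,
            "stream": False,
        }).encode("utf-8")
        request = urllib.request.Request(
            OLLAMA_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            answer = json.loads(response.read())["response"]
        # Hop back to the main thread before touching the UI.
        sublime.set_timeout(lambda: self.show(answer), 0)

    def show(self, answer):
        panel = self.window.create_output_panel("ollama")
        # Appending text goes through a command, not a direct buffer write.
        panel.run_command("append", {"characters": answer})
        self.window.run_command("show_panel", {"panel": "output.ollama"})
```

The `append` and `show_panel` calls illustrate the "everything goes through a command" point above, and `sublime.set_timeout` is what gets us back onto the main thread before any UI work.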
The very first version of this is out and cleverly called Yollama (you can imagine it's short for querY Ollama). While I'm not sure if I will put more time into this project, I do have some ideas of what to work on:
- Run inference on the current code selection, with a user-typed query prefix (e.g. "What does this code do?" or "Does this code seem wrong to you?")
- Auto-complete suggestions
- Handle API streaming so that response words show up as soon as they are computed (a streaming sketch follows this list)
- If the model replies with bits of code, enable Sublime's syntax highlighting on those parts in the output panel
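On the streaming point, Ollama's generate endpoint streams its response as one JSON object per line when `stream` is left on, each object carrying a small chunk of text. A minimal sketch of consuming that (outside Sublime, to keep it short; the model name is again just an example) could look like this:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def stream(prompt, model="llama2"):
    """Yield response chunks as Ollama produces them (streaming is the default)."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for line in response:  # one JSON object per line
            if not line.strip():
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["response"]

for piece in stream("Why is the sky blue?"):
    print(piece, end="", flush=True)
print()
```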
I know this sounds a lot like an already existing product, GitHub's Copilot. But as stated in the introduction, there are some advantages to running the models locally.
Give it a try, and happy inferences!