Opencode
Find a suitable model that fits in VRAM
https://ollama.com/search lists model sizes on disk, not VRAM usage.
These pages list some VRAM requirements: https://localllm.in/blog/ollama-vram-requirements-for-local-llms or https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918
https://www.canirun.ai/model/llama3.1-8b may also help, but Ollama may not offer every quantization listed there.
Also look for a model that supports tools (and thinking?) (https://www.canirun.ai/ may help).
The model weights take VRAM, but the context also consumes it. As stated here, we may have to increase the context size to 16k-32k.
https://opencode.ai/docs/fr/providers/#ollama
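To get a feel for how much VRAM the context itself costs, the KV cache can be approximated as 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. The dimensions below are assumed values for a small GQA model, not read from any specific checkpoint:

```shell
# KV cache ≈ 2 (K+V) × n_layers × n_kv_heads × head_dim × num_ctx × bytes/elem
# All model dimensions here are assumptions for illustration only.
n_layers=28 n_kv_heads=8 head_dim=128 num_ctx=32768 bytes=2   # fp16 cache
echo "$(( 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes / 1024 / 1024 )) MiB"
# prints "3584 MiB"
```

So a 32k context alone can cost a few GiB on top of the weights, which is why `ollama ps` reports a much larger size than the download.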
Increase model's context size
ollama run qwen3.5:2b
>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'
>>> /save qwen3.5:2b-32k
Created new model 'qwen3.5:2b-32k'
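The same result can be baked into a Modelfile instead of the interactive /set + /save dance. `FROM` and `PARAMETER num_ctx` are standard Modelfile directives; the model names are the ones used above, swap in whatever you actually pulled:

```shell
# Equivalent to "/set parameter num_ctx 32768" followed by "/save" in the REPL
cat > Modelfile <<'EOF'
FROM qwen3.5:2b
PARAMETER num_ctx 32768
EOF
# then build the derived model:
# ollama create qwen3.5:2b-32k -f Modelfile
```

This is easier to version-control than REPL commands if you tune several models.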
Check if it fits in VRAM
Run the model:
ollama run qwen3.5:2b-32k
Then check CPU vs. GPU usage; it should be 100% GPU to stay fast:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3.5:2b-32k 094e78c5fe51 5.1 GB 100% GPU 32768 4 minutes from now
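This check can be scripted. A minimal sketch that reads `ollama ps`-style output on stdin and assumes the PROCESSOR column reads exactly "100% GPU" when fully offloaded:

```shell
# Warn when the running model is not fully offloaded to the GPU.
# Reads `ollama ps` output on stdin.
check_gpu() {
  if grep -q '100% GPU'; then
    echo "fully offloaded to GPU"
  else
    echo "warning: some layers are on CPU, expect slow generation"
  fi
}
# usage against the live daemon:
# ollama ps | check_gpu
```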
You may have to disable the display to have all the VRAM available. You can also look at the Ollama server log to see how many layers are offloaded to the GPU:
journalctl -f -u ollama.service # then run: ollama run yourmodel
...
Mar 18 23:44:03 spacemarine ollama[1639]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Mar 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloading 28 repeating layers to GPU
Mar 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloaded 28/29 layers to GPU
...
Then you can try to force all the layers onto the GPU:
ollama run qwen2.5:7b-instruct-q4_K_M-8k
>>> /set parameter num_gpu 29
Set parameter 'num_gpu' to '29'
>>> /save qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu
Created new model 'qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu'
And finally check again with:
ollama ps
Configure Opencode
In ~/.config/opencode/config.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"name": "Ollama (spacemarine)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://192.168.0.24:11434/v1"
},
"models": {
"qwen3.5:2b-32k": {
"tools": true
}
}
}
}
}
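Before pointing Opencode at it, it's worth confirming that Ollama's OpenAI-compatible endpoint answers from the machine running Opencode. The host and port below are taken from the baseURL above; the "hi" message is just a placeholder:

```shell
# List the models Ollama exposes on its OpenAI-compatible API
curl -s http://192.168.0.24:11434/v1/models

# Minimal chat completion against the custom 32k model
curl -s http://192.168.0.24:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3.5:2b-32k", "messages": [{"role": "user", "content": "hi"}]}'
```

If these fail, check that Ollama listens on the LAN interface (OLLAMA_HOST) and not only on localhost.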
More
Quantization:
From https://smcleod.net/2024/07/understanding-ai/llm-quantisation-through-interactive-visualisations/
| Quant Type | Size | Quality | Performance (CUDA) | Performance (Metal) | Notes |
|---|---|---|---|---|---|
| IQ1_XS | Smallest | Unusable | Excellent | OK | Basically a jabbering idiot |
| Q2_K_S | Smallest | Unusable | Excellent | Excellent | Likely generates lots of errors, not very useful |
| Q2_K_M | Smallest | Very-Very-Low | Excellent | Excellent | Likely generates lots of errors, not very useful |
| IQ2_XXS | Very Small | Very-Low | Excellent | OK | Surprisingly usable for the GPU poor if you have CUDA |
| IQ2_XS | Very Small | Low | Very Good | Not Great | Surprisingly usable for the GPU poor if you have CUDA |
| Q3_K_S | Small | Low | Excellent | Excellent | Usable and quick but has had a few head injuries |
| Q4_0 | Small | Medium-Low | Excellent | Excellent | Legacy Quant Type - Not recommended |
| IQ3_XXS | Small | Medium-Low | Very Good | Poor | As good as Q4_K_S but smaller |
| Q4_K_S | Medium-Small | Medium-Low | Excellent | Excellent | You may as well use Q4_K_M, or IQ3_X(X)S if you have CUDA |
| Q5_1 | Medium | Medium-Low | Excellent | Excellent | Legacy Quant Type - Not recommended |
| Q4_K_M | Medium | Medium | Excellent | Excellent | Balanced mid range quant |
| Q5_K_S | Medium-Large | Medium | Excellent | Excellent | Slightly better than Q4_K_M |
| Q5_K_M | Medium-Large | Medium-High | Excellent | Excellent | A nice little upgrade from Q4_K_M |
| Q6_K | Large | Very-High | Very Good | Very Good | Best all-rounder, quality-to-size ratio for systems with enough VRAM |
| Q8_0 | Very Large | Overkill | Good | Good | Large file size, usually overkill and practically indistinguishable from full precision for inference |
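As a back-of-envelope check for the table above, the on-disk (and roughly in-VRAM) weight size is parameter count × effective bits per weight / 8. The 4.7 bits/weight figure below is an assumed approximation for Q4_K_M, not an exact value:

```shell
# Estimated weight size for a 7B model at ~4.7 effective bits/weight (assumed)
awk 'BEGIN { p = 7e9; bpw = 4.7; printf "%.1f GiB\n", p * bpw / 8 / 1024 / 1024 / 1024 }'
# prints "3.8 GiB"
```

Add the KV cache on top of that to see whether a given quant plus context fits your card.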