Opencode
Latest revision as of 18 March 2026, 23:50
Find a suitable model that fits in RAM
https://ollama.com/search lists model size on disk, not VRAM usage.
These pages list VRAM usage for some models: https://localllm.in/blog/ollama-vram-requirements-for-local-llms or https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918
https://www.canirun.ai/model/llama3.1-8b may also help, but Ollama may not give access to every quantization.
Also pick a model that can run tools (and possibly think); https://www.canirun.ai/ may help.
The model weights take VRAM, but the context does too. As stated at https://opencode.ai/docs/fr/providers/#ollama, the context size may need to be increased to 16k-32k.
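As a rough sanity check before downloading, VRAM needed is approximately the quantized weights plus the KV cache, which grows linearly with context length. A minimal sketch; the architecture numbers below are illustrative assumptions, read the real ones from your model's card:

```python
# Rough VRAM estimate: quantized weights + KV cache (ignores runtime overhead).
# All architecture numbers here are illustrative assumptions; in practice,
# read them from your model's card or config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    # One K and one V tensor per layer, fp16 elements by default.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

def vram_estimate_gb(weights_gb, n_layers, n_kv_heads, head_dim, n_ctx):
    kv_gb = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx) / 1024**3
    return weights_gb + kv_gb

# Hypothetical 8B-class model: ~4.9 GB of Q4 weights, 32 layers,
# 8 KV heads (GQA), head_dim 128, 32k context:
print(round(vram_estimate_gb(4.9, 32, 8, 128, 32768), 2))  # 8.9
```

This is why a model that fits at the default context can stop fitting at 32k.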
Increase the model's context size
ollama run qwen3.5:2b
>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'
>>> /save qwen3.5:2b-32k
Created new model 'qwen3.5:2b-32k'
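The same result can be produced non-interactively with a Modelfile, which is easier to version; a sketch using the model name from above:

```
# Modelfile — same context bump as the interactive /set + /save above
FROM qwen3.5:2b
PARAMETER num_ctx 32768
```

Then build it with `ollama create qwen3.5:2b-32k -f Modelfile`.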
Check if it fits in VRAM
Run the model:
ollama run qwen3.5:2b-32k
Then check the CPU vs GPU split: it should be 100% GPU to stay fast.
ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
qwen3.5:2b-32k 094e78c5fe51 5.1 GB 100% GPU 32768 4 minutes from now
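If you want to script this check, the `ollama ps` output can be parsed; a small sketch (the sample text is the output shown above; in practice you would capture it with `subprocess`):

```python
# Check whether a model runs fully on the GPU by parsing `ollama ps` output.
# Sample output copied from above; in practice:
#   ps_output = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
ps_output = """NAME              ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3.5:2b-32k    094e78c5fe51    5.1 GB    100% GPU     32768      4 minutes from now"""

def fully_on_gpu(ps_text, model):
    # Skip the header line, find the row for our model.
    for line in ps_text.splitlines()[1:]:
        if line.startswith(model):
            return "100% GPU" in line
    return False  # model not loaded at all

print(fully_on_gpu(ps_output, "qwen3.5:2b-32k"))  # True
```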
You may have to disable the display to make all the VRAM available. You can also look at the Ollama server log to see how many layers are offloaded to the GPU:
journalctl -f -u ollama.service # Then run: ollama run yourmodel
...
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: loading model tensors, this can take a while... (mmap = true)
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloading 28 repeating layers to GPU
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloaded 28/29 layers to GPU
...
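A quick way to pull the offload ratio out of that journal output is a regex; a sketch against the log format shown above:

```python
import re

# Extract "offloaded X/Y layers" from an Ollama load_tensors log line.
log_line = ("mars 18 23:44:03 spacemarine ollama[1639]: "
            "load_tensors: offloaded 28/29 layers to GPU")

def offload_ratio(line):
    m = re.search(r"offloaded (\d+)/(\d+) layers", line)
    return (int(m.group(1)), int(m.group(2))) if m else None

on_gpu, total = offload_ratio(log_line)
print(on_gpu, total)  # 28 29
```

If `on_gpu < total`, some layers run on the CPU and generation will be noticeably slower.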
Then you can try to force all the layers onto the GPU:
ollama run qwen2.5:7b-instruct-q4_K_M-8k
>>> /set parameter num_gpu 29
Set parameter 'num_gpu' to '29'
>>> /save qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu
Created new model 'qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu'
Finally, check again with:
ollama ps
Configure Opencode
In ~/.config/opencode/config.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"name": "Ollama (spacemarine)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://192.168.0.24:11434/v1"
},
"models": {
"qwen3.5:2b-32k": {
"tools": true
}
}
}
}
}
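A quick way to catch JSON typos before launching Opencode is to round-trip the file through a parser; a minimal sketch (the embedded text mirrors the config above — in practice, `json.load` the real file from `~/.config/opencode/config.json`):

```python
import json

# Validate the Opencode config shown above: it must be valid JSON and
# declare at least one model under the "ollama" provider.
config_text = """
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "name": "Ollama (spacemarine)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {"baseURL": "http://192.168.0.24:11434/v1"},
      "models": {"qwen3.5:2b-32k": {"tools": true}}
    }
  }
}
"""

cfg = json.loads(config_text)  # raises ValueError on syntax errors
models = cfg["provider"]["ollama"]["models"]
print(sorted(models))  # ['qwen3.5:2b-32k']
```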
More
Quantization:
From https://smcleod.net/2024/07/understanding-ai/llm-quantisation-through-interactive-visualisations/
| Quant Type | Size | Quality | Performance (CUDA) | Performance (Metal) | Notes |
|---|---|---|---|---|---|
| IQ1_XS | Smallest | Unusable | Excellent | OK | Basically a jabbering idiot |
| Q2_K_S | Smallest | Unusable | Excellent | Excellent | Likely generates lots of errors, not very useful |
| Q2_K_M | Smallest | Very-Very-Low | Excellent | Excellent | Likely generates lots of errors, not very useful |
| IQ2_XXS | Very Small | Very-Low | Excellent | OK | Surprisingly usable for the GPU poor if you have CUDA |
| IQ2_XS | Very Small | Low | Very Good | Not Great | Surprisingly usable for the GPU poor if you have CUDA |
| Q3_K_S | Small | Low | Excellent | Excellent | Usable and quick but has had a few head injuries |
| Q4_0 | Small | Medium-Low | Excellent | Excellent | Legacy Quant Type - Not recommended |
| IQ3_XXS | Small | Medium-Low | Very Good | Poor | As good as Q4_K_S but smaller |
| Q4_K_S | Medium-Small | Medium-Low | Excellent | Excellent | You may as well use Q4_K_M, or IQ3_X(X)S if you have CUDA |
| Q5_1 | Medium | Medium-Low | Excellent | Excellent | Legacy Quant Type - Not recommended |
| Q4_K_M | Medium | Medium | Excellent | Excellent | Balanced mid range quant |
| Q5_K_S | Medium-Large | Medium | Excellent | Excellent | Slightly better than Q4_K_M |
| Q5_K_M | Medium-Large | Medium-High | Excellent | Excellent | A nice little upgrade from Q4_K_M |
| Q6_K | Large | Very-High | Very Good | Very Good | Best all-rounder, quality-to-size ratio for systems with enough VRAM |
| Q8_0 | Very Large | Overkill | Good | Good | Large file size, usually overkill and practically indistinguishable from full precision for inference |
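To relate the table above back to disk/VRAM size, a model's quantized weight size is roughly parameters × bits-per-weight / 8. A sketch; the bits-per-weight figures below are approximate values commonly quoted for llama.cpp quants, treat them as ballpark, not exact:

```python
# Ballpark quantized-weight size: params * bits-per-weight / 8.
# The bpw figures are approximations commonly quoted for llama.cpp
# quant types; actual file sizes vary slightly per model.
BPW = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.50,
}

def quant_size_gb(n_params_billion, quant):
    return n_params_billion * 1e9 * BPW[quant] / 8 / 1024**3

# A 7B model at Q4_K_M lands around 4 GB of weights:
print(round(quant_size_gb(7, "Q4_K_M"), 1))  # 4.0
```

Remember this covers weights only; add the KV cache for your chosen context size on top.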