Opencode
Latest revision as of 18 March 2026, 23:50

Find a suitable model that fits in RAM

https://ollama.com/search lists model sizes on disk, not VRAM usage.

These pages list VRAM usage for some models: https://localllm.in/blog/ollama-vram-requirements-for-local-llms and https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918

https://www.canirun.ai/model/llama3.1-8b may also help, but Ollama may not offer every quantization listed there.

Also pick a model that can run tools (and, ideally, think); https://www.canirun.ai/ may help here too.


The model weights take VRAM, but the context does too. As stated in the Opencode documentation, we may have to increase the context size to 16k-32k.

https://opencode.ai/docs/fr/providers/#ollama
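A rough way to see why the context size matters: the KV cache grows linearly with num_ctx. The sketch below uses illustrative figures assumed for a Qwen2.5-7B-style model (28 layers, 4 KV heads of head_dim 128, fp16 cache); check your model's actual config before trusting the number.

```shell
# Back-of-the-envelope KV-cache size: 2 (K and V) * layers * kv_heads
# * head_dim * bytes per value * context length. All figures below are
# assumptions for a Qwen2.5-7B-style model, not read from Ollama.
layers=28; kv_heads=4; head_dim=128; bytes_per_val=2; ctx=32768
echo "$(( 2 * layers * kv_heads * head_dim * bytes_per_val * ctx / 1024 / 1024 )) MiB"
```

At 32k context that is on the order of 1.8 GiB on top of the weights, which is why a model that fits at 8k can spill to the CPU at 32k.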

Increase model's context size

ollama run qwen3.5:2b
>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'
>>> /save qwen3.5:2b-32k
Created new model 'qwen3.5:2b-32k'
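The same variant can also be created non-interactively with a Modelfile, which is handier for scripting; the file name and tag below just mirror the example above:

```shell
# Non-interactive alternative: bake num_ctx into a new tag via a Modelfile
cat > Modelfile <<'EOF'
FROM qwen3.5:2b
PARAMETER num_ctx 32768
EOF
# Guarded so the snippet degrades gracefully where Ollama is absent
command -v ollama >/dev/null \
  && ollama create qwen3.5:2b-32k -f Modelfile \
  || echo 'ollama not installed'
```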

Check if it fits in VRAM

Run the model:

ollama run qwen3.5:2b-32k

Then check CPU vs GPU usage; it should read 100% GPU to stay fast:

ollama ps
NAME              ID              SIZE      PROCESSOR    CONTEXT    UNTIL              
qwen3.5:2b-32k    094e78c5fe51    5.1 GB    100% GPU     32768      4 minutes from now
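To script this check, the PROCESSOR column can be parsed with awk. The printf below stands in for real output (with a made-up partially-offloaded line); in practice pipe `ollama ps` into the same awk:

```shell
# Warn about any loaded model that is not fully on the GPU.
# NR > 1 skips the header row; the sample line is hypothetical.
printf '%s\n%s\n' \
  'NAME ID SIZE PROCESSOR CONTEXT UNTIL' \
  'qwen3.5:2b-32k 094e78c5fe51 5.1 GB 48%/52% CPU/GPU 32768 4 minutes from now' \
  | awk 'NR > 1 && $0 !~ /100% GPU/ { print $1 " is partly on the CPU" }'
```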

You may have to disable the display to make all the VRAM available. You can also look at the Ollama server log to see how many layers were offloaded to the GPU:

journalctl -f -u ollama.service # then run: ollama run yourmodel

...
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: loading model tensors, this can take a while... (mmap = true)
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloading 28 repeating layers to GPU
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloaded 28/29 layers to GPU
...
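The `offloaded N/M layers` line can likewise be parsed to flag CPU spill; the echo below stands in for the journalctl stream:

```shell
# Count layers left on the CPU from the log line (sample input; in
# practice: journalctl -u ollama.service | awk '/offloaded/ ...')
echo 'load_tensors: offloaded 28/29 layers to GPU' \
  | awk '/offloaded/ { split($3, a, "/");
                       if (a[1] < a[2]) print a[2] - a[1] " layer(s) left on the CPU" }'
```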

Then you can try to force all the layers onto the GPU:

ollama run qwen2.5:7b-instruct-q4_K_M-8k
>>> /set parameter num_gpu 29
Set parameter 'num_gpu' to '29'
>>> /save qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu
Created new model 'qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu'

And finally check with:

ollama ps

Configure Opencode

In ~/.config/opencode/config.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "name": "Ollama (spacemarine)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://192.168.0.24:11434/v1"
      },
      "models": {
        "qwen3.5:2b-32k": {
          "tools": true
        }
      }
    }
  }
}
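Before pointing Opencode at the server, it is worth hitting the baseURL once by hand. Ollama exposes an OpenAI-compatible /v1/models endpoint; the host below is the one from the config above:

```shell
# One-shot reachability check of the baseURL from the config
# (IP and port assumed from the config above; adjust to your setup)
curl -s --max-time 5 http://192.168.0.24:11434/v1/models \
  || echo 'Ollama unreachable at that baseURL'
```

If this fails, check that Ollama listens on the network interface (OLLAMA_HOST) and not only on localhost.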

More

Quantization:

From https://smcleod.net/2024/07/understanding-ai/llm-quantisation-through-interactive-visualisations/

Quant Type | Size | Quality | Performance (CUDA) | Performance (Metal) | Notes
IQ1_XS | Smallest | Unusable | Excellent | OK | Basically a jabbering idiot
Q2_K_S | Smallest | Unusable | Excellent | Excellent | Likely generates lots of errors, not very useful
Q2_K_M | Smallest | Very-Very-Low | Excellent | Excellent | Likely generates lots of errors, not very useful
IQ2_XXS | Very Small | Very-Low | Excellent | OK | Surprisingly usable for the GPU poor if you have CUDA
IQ2_XS | Very Small | Low | Very Good | Not Great | Surprisingly usable for the GPU poor if you have CUDA
Q3_K_S | Small | Low | Excellent | Excellent | Usable and quick but has had a few head injuries
Q4_0 | Small | Medium-Low | Excellent | Excellent | Legacy quant type - not recommended
IQ3_XXS | Small | Medium-Low | Very Good | Poor | As good as Q4_K_S but smaller
Q4_K_S | Medium-Small | Medium-Low | Excellent | Excellent | You may as well use Q4_K_M, or IQ3_X(X)S if you have CUDA
Q5_1 | Medium | Medium-Low | Excellent | Excellent | Legacy quant type - not recommended
Q4_K_M | Medium | Medium | Excellent | Excellent | Balanced mid-range quant
Q5_K_S | Medium-Large | Medium | Excellent | Excellent | Slightly better than Q4_K_M
Q5_K_M | Medium-Large | Medium-High | Excellent | Excellent | A nice little upgrade from Q4_K_M
Q6_K | Large | Very-High | Very Good | Very Good | Best all-rounder, quality-to-size ratio for systems with enough VRAM
Q8_0 | Very Large | Overkill | Good | Good | Large file size, usually overkill and practically indistinguishable from full precision for inference
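As a rule of thumb, file size scales with bits per weight. The sketch below applies approximate bits-per-weight figures (assumed averages; real K-quant files vary per tensor) to a hypothetical 7B-parameter model:

```shell
# Rough on-disk size = params (billions) * bits-per-weight / 8, in GB.
# The bpw values are approximations, not read from any tool.
params_b=7
for q in 'Q4_K_M 4.85' 'Q5_K_M 5.69' 'Q6_K 6.59' 'Q8_0 8.50'; do
  set -- $q
  awk -v p="$params_b" -v bpw="$2" -v name="$1" \
    'BEGIN { printf "%s: ~%.1f GB\n", name, p * bpw / 8 }'
done
```

This matches the table's trend: each step up costs roughly another gigabyte on a 7B model, before the KV cache is even counted.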