Opencode
Latest revision as of 18 March 2026, 23:50

Find a suitable model that fits in RAM

https://ollama.com/search lists model sizes on disk, not VRAM usage.

These pages list VRAM usage for some models: https://localllm.in/blog/ollama-vram-requirements-for-local-llms and https://github.com/ollama/ollama/issues/6852#issuecomment-2440229918

https://www.canirun.ai/model/llama3.1-8b may also help, but Ollama may not offer every quantization listed there.

Also pick a model that can run tools (and, ideally, think); https://www.canirun.ai/ may help here too.


The model weights take VRAM, but the context does too. As stated in the Opencode documentation, we may have to increase the context size to 16k-32k.

https://opencode.ai/docs/fr/providers/#ollama
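A rough way to see why the context size matters: the KV cache grows linearly with num_ctx. The sketch below uses illustrative figures assumed for a Qwen2.5-7B-style model (28 layers, 4 KV heads of head_dim 128, fp16 cache); check your model's actual config before trusting the number.

```shell
# Back-of-the-envelope KV-cache size: 2 (K and V) * layers * kv_heads
# * head_dim * bytes per value * context length. All figures below are
# assumptions for a Qwen2.5-7B-style model, not read from Ollama.
layers=28; kv_heads=4; head_dim=128; bytes_per_val=2; ctx=32768
echo "$(( 2 * layers * kv_heads * head_dim * bytes_per_val * ctx / 1024 / 1024 )) MiB"
```

At 32k context that is on the order of 1.8 GiB on top of the weights, which is why a model that fits at 8k can spill to the CPU at 32k.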

Increase model's context size

ollama run qwen3.5:2b
>>> /set parameter num_ctx 32768
Set parameter 'num_ctx' to '32768'
>>> /save qwen3.5:2b-32k
Created new model 'qwen3.5:2b-32k'
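The same variant can also be created non-interactively with a Modelfile, which is handier for scripting; the file name and tag below just mirror the example above:

```shell
# Non-interactive alternative: bake num_ctx into a new tag via a Modelfile
cat > Modelfile <<'EOF'
FROM qwen3.5:2b
PARAMETER num_ctx 32768
EOF
# Guarded so the snippet degrades gracefully where Ollama is absent
command -v ollama >/dev/null \
  && ollama create qwen3.5:2b-32k -f Modelfile \
  || echo 'ollama not installed'
```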

Check if it fits in VRAM

Run the model:

ollama run qwen3.5:2b-32k

Then check CPU vs GPU usage; it should read 100% GPU to stay fast:

ollama ps
NAME              ID              SIZE      PROCESSOR    CONTEXT    UNTIL              
qwen3.5:2b-32k    094e78c5fe51    5.1 GB    100% GPU     32768      4 minutes from now
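To script this check, the PROCESSOR column can be parsed with awk. The printf below stands in for real output (with a made-up partially-offloaded line); in practice pipe `ollama ps` into the same awk:

```shell
# Warn about any loaded model that is not fully on the GPU.
# NR > 1 skips the header row; the sample line is hypothetical.
printf '%s\n%s\n' \
  'NAME ID SIZE PROCESSOR CONTEXT UNTIL' \
  'qwen3.5:2b-32k 094e78c5fe51 5.1 GB 48%/52% CPU/GPU 32768 4 minutes from now' \
  | awk 'NR > 1 && $0 !~ /100% GPU/ { print $1 " is partly on the CPU" }'
```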

You may have to disable the display to make all the VRAM available. You can also look at the Ollama server log to see how many layers were offloaded to the GPU:

journalctl -f -u ollama.service # then run: ollama run yourmodel

...
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: loading model tensors, this can take a while... (mmap = true)
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloading 28 repeating layers to GPU
mars 18 23:44:03 spacemarine ollama[1639]: load_tensors: offloaded 28/29 layers to GPU
...
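The `offloaded N/M layers` line can likewise be parsed to flag CPU spill; the echo below stands in for the journalctl stream:

```shell
# Count layers left on the CPU from the log line (sample input; in
# practice: journalctl -u ollama.service | awk '/offloaded/ ...')
echo 'load_tensors: offloaded 28/29 layers to GPU' \
  | awk '/offloaded/ { split($3, a, "/");
                       if (a[1] < a[2]) print a[2] - a[1] " layer(s) left on the CPU" }'
```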

Then you can try to force all the layers onto the GPU:

ollama run qwen2.5:7b-instruct-q4_K_M-8k
>>> /set parameter num_gpu 29
Set parameter 'num_gpu' to '29'
>>> /save qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu
Created new model 'qwen2.5:7b-instruct-q4_K_M-8k-29_num_gpu'

And finally check with:

ollama ps

Configure Opencode

In ~/.config/opencode/config.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "name": "Ollama (spacemarine)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://192.168.0.24:11434/v1"
      },
      "models": {
        "qwen3.5:2b-32k": {
          "tools": true
        }
      }
    }
  }
}
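Before pointing Opencode at the server, it is worth hitting the baseURL once by hand. Ollama exposes an OpenAI-compatible /v1/models endpoint; the host below is the one from the config above:

```shell
# One-shot reachability check of the baseURL from the config
# (IP and port assumed from the config above; adjust to your setup)
curl -s --max-time 5 http://192.168.0.24:11434/v1/models \
  || echo 'Ollama unreachable at that baseURL'
```

If this fails, check that Ollama listens on the network interface (OLLAMA_HOST) and not only on localhost.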

More

Quantization:

From https://smcleod.net/2024/07/understanding-ai/llm-quantisation-through-interactive-visualisations/

Quant Type | Size | Quality | Performance (CUDA) | Performance (Metal) | Notes
IQ1_XS | Smallest | Unusable | Excellent | OK | Basically a jabbering idiot
Q2_K_S | Smallest | Unusable | Excellent | Excellent | Likely generates lots of errors, not very useful
Q2_K_M | Smallest | Very-Very-Low | Excellent | Excellent | Likely generates lots of errors, not very useful
IQ2_XXS | Very Small | Very-Low | Excellent | OK | Surprisingly usable for the GPU poor if you have CUDA
IQ2_XS | Very Small | Low | Very Good | Not Great | Surprisingly usable for the GPU poor if you have CUDA
Q3_K_S | Small | Low | Excellent | Excellent | Usable and quick but has had a few head injuries
Q4_0 | Small | Medium-Low | Excellent | Excellent | Legacy quant type - not recommended
IQ3_XXS | Small | Medium-Low | Very Good | Poor | As good as Q4_K_S but smaller
Q4_K_S | Medium-Small | Medium-Low | Excellent | Excellent | You may as well use Q4_K_M, or IQ3_X(X)S if you have CUDA
Q5_1 | Medium | Medium-Low | Excellent | Excellent | Legacy quant type - not recommended
Q4_K_M | Medium | Medium | Excellent | Excellent | Balanced mid-range quant
Q5_K_S | Medium-Large | Medium | Excellent | Excellent | Slightly better than Q4_K_M
Q5_K_M | Medium-Large | Medium-High | Excellent | Excellent | A nice little upgrade from Q4_K_M
Q6_K | Large | Very-High | Very Good | Very Good | Best all-rounder, quality-to-size ratio for systems with enough VRAM
Q8_0 | Very Large | Overkill | Good | Good | Large file size, usually overkill and practically indistinguishable from full precision for inference
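As a rule of thumb, file size scales with bits per weight. The sketch below applies approximate bits-per-weight figures (assumed averages; real K-quant files vary per tensor) to a hypothetical 7B-parameter model:

```shell
# Rough on-disk size = params (billions) * bits-per-weight / 8, in GB.
# The bpw values are approximations, not read from any tool.
params_b=7
for q in 'Q4_K_M 4.85' 'Q5_K_M 5.69' 'Q6_K 6.59' 'Q8_0 8.50'; do
  set -- $q
  awk -v p="$params_b" -v bpw="$2" -v name="$1" \
    'BEGIN { printf "%s: ~%.1f GB\n", name, p * bpw / 8 }'
done
```

This matches the table's trend: each step up costs roughly another gigabyte on a 7B model, before the KV cache is even counted.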