Tool Calling Guide for Local LLMs

Tool calling is when an LLM is allowed to trigger specific functions (like “search my files,” “run a calculator,” or “call an API”) by emitting a structured request instead of guessing the answer in text. You use tool calls because they make outputs more reliable and up-to-date, and they let the model take real actions (query systems, validate facts, enforce schemas) rather than hallucinating.

In this tutorial, you will learn how to use local LLMs via tool calling, with math, story-writing, Python code, and terminal command examples. Inference is done locally via llama.cpp's llama-server, which exposes an OpenAI-compatible endpoint.

Our guide should work for nearly any model including:

Qwen3-Coder-Next Tutorial
GLM-4.7-Flash Tutorial

🔨 Tool Calling Setup

Our first step is to obtain the latest llama.cpp from GitHub. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

# Install build dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
# Clone and configure llama.cpp (set -DGGML_CUDA=OFF for a CPU-only build)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# Build the CLI, multimodal CLI, server, and GGUF-split tools
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
# Copy the binaries up next to the repo for convenience
cp llama.cpp/build/bin/llama-* llama.cpp

In a new terminal (if using tmux, detach with CTRL+B then D), we create some tools, like adding 2 numbers, executing Python code, executing Linux commands, and much more:
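Below is a minimal sketch of what these tools can look like in Python. The function names and schemas here (add_two_numbers, execute_python, execute_terminal) are illustrative stand-ins rather than the exact original code, so adapt them to your needs:

import subprocess

def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

def execute_python(code: str) -> str:
    """Run a Python snippet in a subprocess and capture its output."""
    result = subprocess.run(["python3", "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

def execute_terminal(command: str) -> str:
    """Run a shell command and capture its output."""
    result = subprocess.run(command, shell=True,
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

# JSON schemas describing the tools to the model (OpenAI "tools" format).
TOOLS = [
    {"type": "function", "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers together.",
        "parameters": {"type": "object",
                       "properties": {"a": {"type": "number"},
                                      "b": {"type": "number"}},
                       "required": ["a", "b"]}}},
    {"type": "function", "function": {
        "name": "execute_python",
        "description": "Execute Python code and return its printed output.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
    {"type": "function", "function": {
        "name": "execute_terminal",
        "description": "Execute a shell command and return its output.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]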

We then use the functions below (copy, paste, and execute them), which parse the function calls automatically and call the OpenAI-compatible endpoint; this works for any model:
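Here is a sketch of such a dispatch loop. It assumes llama-server is running on its default port 8080 and that the tool definitions above are in scope; the placeholder model name and dummy API key are assumptions, so match them to your launch command:

import json
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the key is unused locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

AVAILABLE_FUNCTIONS = {
    "add_two_numbers": add_two_numbers,
    "execute_python": execute_python,
    "execute_terminal": execute_terminal,
}

def run_with_tools(prompt, temperature=0.7, top_p=1.0):
    """Send a prompt, execute any tool calls, and return the final text."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = client.chat.completions.create(
            model="local-model",  # llama-server serves whatever model it loaded
            messages=messages,
            tools=TOOLS,
            temperature=temperature,
            top_p=top_p,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # no tool call: this is the final answer
        messages.append(message)  # keep the assistant turn in the history
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = AVAILABLE_FUNCTIONS[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })

The loop feeds each tool result back to the model and only returns once the model replies with plain text, which is the usual agentic pattern.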

Note: in this example we're using Devstral 2. When switching models, ensure you use the correct sampling parameters. You can view all of them in our guides here.

Now we'll showcase multiple methods of running tool calling for many different use cases below:

Writing a story:
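For example (a hypothetical prompt; no tool is needed here, so the model should answer directly in text):

print(run_with_tools("Write a short story about a llama that learns to use tools."))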

Mathematical operations:
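For example (a hypothetical prompt; the model should emit an add_two_numbers call rather than doing the arithmetic itself):

print(run_with_tools("What is 1923.71 + 9739.21? Use a tool."))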

Executing generated Python code:
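For example (a hypothetical prompt; the model writes the code and execute_python runs it):

print(run_with_tools("Write Python code that prints the first 10 Fibonacci numbers, then run it."))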

Executing arbitrary terminal commands:
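For example (a hypothetical prompt; the model should route this through execute_terminal):

print(run_with_tools("Use the terminal to list all .gguf files in the current directory."))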

🌠 Qwen3-Coder-Next Tool Calling

In a new terminal, we create the same tools as in the Tool Calling Setup: adding 2 numbers, executing Python code, executing Linux commands, and much more.

We then reuse the functions from the setup, which parse the function calls automatically and call the OpenAI-compatible endpoint for any LLM.

Now we'll showcase multiple methods of running tool calling for many different use cases below:

Executing generated Python code:
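For example, reusing the run_with_tools helper from the setup (the prompt is a hypothetical stand-in):

print(run_with_tools("Write and run Python code that prints today's date."))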

Executing arbitrary terminal commands:
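For example (a hypothetical prompt; the model should call execute_terminal):

print(run_with_tools("Create an empty file called qwen_test.txt in the current directory."))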

We confirm the file was created, and it was!

GLM-4.7-Flash + GLM-4.7 Tool Calling

We first download GLM-4.7 or GLM-4.7-Flash via some Python code, then launch it via llama-server in a separate terminal (for example, inside tmux). In this example we download the large GLM-4.7 model:
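A sketch of the download step using huggingface_hub; the repo id and quantization pattern below are assumptions, so substitute the GGUF repo and quant you actually want:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.7-GGUF",   # assumed repo name; adjust as needed
    local_dir="GLM-4.7-GGUF",
    allow_patterns=["*Q4_K_M*"],      # download a single quant to save disk space
)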

If you ran it successfully, you should see:

Now launch it via llama-server in a new terminal. Use tmux if you want:
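A sketch of the launch command; the model path, GPU-layer count, context size, and port are assumptions to adjust for your setup. The --jinja flag matters here, since it enables the model's chat template so llama-server can parse tool calls:

./llama.cpp/llama-server \
    --model GLM-4.7-GGUF/GLM-4.7-Q4_K_M.gguf \
    --jinja \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --port 8080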

And you will get:

Now, in a new terminal, we execute the Python code below (reminder: run the Tool Calling Setup first). We use GLM 4.7's optimal parameters of temperature = 0.7 and top_p = 1.0:
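As a sketch, we can wrap the run_with_tools helper from the setup with those parameters:

def run_glm(prompt):
    """Call the model with GLM 4.7's recommended sampling parameters."""
    return run_with_tools(prompt, temperature=0.7, top_p=1.0)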

Tool call for mathematical operations with GLM 4.7:
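For example (a hypothetical prompt; the model should emit an add_two_numbers call):

print(run_glm("What is 8923.11 + 132.89? Use a tool."))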

Tool call to execute generated Python code with GLM 4.7:
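For example (a hypothetical prompt; the model writes the code and execute_python runs it):

print(run_glm("Write and run Python code that sorts the list [3, 1, 2]."))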

📙 Devstral 2 Tool Calling

We first download Devstral 2 via some Python code, then launch it via llama-server in a separate terminal (for example, inside tmux):
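A sketch of the download step; as before, the repo id and quant pattern are assumptions:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/Devstral-2-GGUF",   # assumed repo name; adjust as needed
    local_dir="Devstral-2-GGUF",
    allow_patterns=["*Q4_K_M*"],
)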

If you ran it successfully, you should see:

Now launch it via llama-server in a new terminal. Use tmux if you want:
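A sketch launch command, with the same caveats as above (the model path and flags are assumptions; --jinja enables tool-call parsing):

./llama.cpp/llama-server \
    --model Devstral-2-GGUF/Devstral-2-Q4_K_M.gguf \
    --jinja \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --port 8080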

You will see the below if it succeeded:

We then call the model with the following message, using Devstral's suggested parameter of temperature = 0.15 only (reminder: run the Tool Calling Setup first):
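For example, a minimal sketch reusing run_with_tools from the setup (the prompt is a hypothetical stand-in; only the temperature is overridden):

print(run_with_tools("What is 1294.32 + 918.17? Use a tool.", temperature=0.15))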
