MCP Query Test

This repository tests how reliably and exactly an LLM chained with a simple MCP tool can answer questions with data.

The goal is to provide information for the following questions:

How reliable are LLMs and generic MCP tools at answering questions with data?
Can an LLM translate a natural language question into a proper data analysis query and get an accurate answer every time?

This repository contains:

A simple MCP Server
An MCP tool that builds a dynamic pandas pipeline based on its arguments.
A set of 12 tests that ask the same question and evaluate the answer the LLM provides after using the MCP tool.
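
To make the idea of an argument-driven pandas pipeline concrete, here is a minimal sketch. It is not the tool from this repository: the function name `run_pipeline`, its parameters, and the sample columns are all assumptions chosen for illustration.

```python
import pandas as pd


def run_pipeline(df, filters=None, group_by=None, agg_column=None, agg="sum"):
    """Hypothetical pipeline: equality filters first, then an optional aggregation."""
    result = df
    # Each filter keeps only the rows where the column equals the given value.
    for column, value in (filters or {}).items():
        result = result[result[column] == value]
    if agg_column is None:
        return result  # No aggregation requested: return the filtered rows.
    if group_by:
        return result.groupby(group_by)[agg_column].agg(agg)
    return result[agg_column].agg(agg)


# Made-up sample data, only to show the call shape.
df = pd.DataFrame({
    "country": ["Guatemala", "Guatemala", "Honduras"],
    "sector": ["private", "public", "private"],
    "amount": [100.5, 200.0, 50.0],
})

total = run_pipeline(
    df,
    filters={"country": "Guatemala", "sector": "private"},
    agg_column="amount",
)
by_country = run_pipeline(df, group_by="country", agg_column="amount")
```

With a design like this, the LLM never writes pandas code. It only has to choose the arguments, which is exactly the step the tests below put under pressure.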

Note: This repository is a simple test to explore and understand LLM behaviour when using MCP tools. The learnings shouldn't be generalized to every use case. It is a test codebase for technical exploration.

Running the tests

Clone the repository
Create the virtual environment

uv sync

Run the server

uv run mcp-pandas-query

Set the environment variables in a .env file

# Example .env content
EXAMPLE_BASE_URL=https://api.z.ai/api/coding/paas/v4
EXAMPLE_API_KEY=<buyme>
EXAMPLE_MODEL_NAME=glm-4.7

Run the tests (server must be running)

uv run pytest

Learnings

The tests show that, when using the tool, the LLM sometimes provides the right answer and sometimes a wrong one. In the run below, 6 of 13 tests passed.

tests/test_tool_calling.py::test_top_5_desembolsos PASSED                                                                                                         [  7%]
tests/test_tool_calling.py::test_calculo_de_desembolsos FAILED                                                                                                    [ 15%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_2 PASSED                                                                                                  [ 23%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_3 PASSED                                                                                                  [ 30%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_4 PASSED                                                                                                  [ 38%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_5 FAILED                                                                                                  [ 46%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_6 PASSED                                                                                                  [ 53%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_7 FAILED                                                                                                  [ 61%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_8 FAILED                                                                                                  [ 69%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_9 PASSED                                                                                                  [ 76%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_10 FAILED                                                                                                 [ 84%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_11 FAILED                                                                                                 [ 92%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_12 FAILED                                                                                                 [100%]
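
As a rough way to quantify "sometimes", the outcomes of the run above can be tallied. The snippet below is only a sketch: it copies the test names and statuses from the log, and the variable names are illustrative.

```python
import re

# Outcomes copied from the run above (progress percentages omitted).
LOG = """
tests/test_tool_calling.py::test_top_5_desembolsos PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_2 PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos_3 PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos_4 PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos_5 FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_6 PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos_7 FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_8 FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_9 PASSED
tests/test_tool_calling.py::test_calculo_de_desembolsos_10 FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_11 FAILED
tests/test_tool_calling.py::test_calculo_de_desembolsos_12 FAILED
"""

OUTCOME_RE = re.compile(r"::(\w+) (PASSED|FAILED)")

# Map each test name to its status.
outcomes = dict(OUTCOME_RE.findall(LOG))

# Keep only the 12 tests that repeat the same question.
repeated = {
    name: status
    for name, status in outcomes.items()
    if name.startswith("test_calculo_de_desembolsos")
}

passed = sum(status == "PASSED" for status in repeated.values())
total = len(repeated)
print(f"{passed}/{total} repeated runs passed ({passed / total:.1%})")
```

This is a single run, so the ratio is an observation and not an estimate of the model's general reliability.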

This means that even when the MCP tool has the logic to query the data and get a proper answer, the LLM does not always know how to use it properly. Moreover, it can call the tool several times, infer a result itself, and give the user a wrong answer.

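Looking at the failures, most of them return a wrong number. Two are different: test_calculo_de_desembolsos_10 returns the right number formatted with thousands separators, and test_calculo_de_desembolsos_12 returns an empty answer.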

================================================================================== short test summary info ==================================================================================
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos - AssertionError: assert '3534566249.57' in 'Veo que obtuve los datos por año. Déjame calcular el total sumando todos los montos:\n\nTotal de desembolsos al sector privado en Guatemala =...
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_5 - assert '3534566249.57' in '{"type": "text", "text": "3798672494.42"}'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_7 - AssertionError: assert '3534566249.57' in 'Calculando la suma de los montos de todos los desembolsos al sector privado de Guatemala:\n\n3927595178.04'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_8 - AssertionError: assert '3534566249.57' in '3534567249.57'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_10 - AssertionError: assert '3534566249.57' in '3,534,566,249.57'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_11 - AssertionError: assert '3534566249.57' in '3246567350.83'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_12 - AssertionError: assert '3534566249.57' in ''
=================================================================== 7 failed, 6 passed, 13 warnings in 871.17s (0:14:31) ====================================================================
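
The assertions in this log are substring checks of the form `assert '3534566249.57' in answer`, so an answer such as '3,534,566,249.57' fails on formatting alone. A possible way to separate formatting failures from genuinely wrong numbers is to parse the numbers before comparing. The sketch below is an assumption, not code from this repository: `extract_numbers` and `answer_contains` are hypothetical helpers, and they assume English-style formatting (comma for thousands, dot for decimals).

```python
import re
from decimal import Decimal

# A digit, then any mix of digits and commas, then an optional decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in text as a Decimal, ignoring thousands separators."""
    return [Decimal(match.replace(",", "")) for match in NUMBER_RE.findall(text)]


def answer_contains(text, expected):
    """True if the expected value appears among the numbers parsed from text."""
    return Decimal(expected) in extract_numbers(text)


expected = "3534566249.57"
print(answer_contains("3,534,566,249.57", expected))  # formatting only
print(answer_contains("3534567249.57", expected))     # genuinely wrong number
print(answer_contains("", expected))                  # empty answer
```

An answer written with Spanish-style separators (dot for thousands, comma for decimals) would need a different parser, so this check is only as good as its formatting assumption.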

The most interesting one in this example is test_calculo_de_desembolsos_8, whose error is small enough to go unnoticed: it returns 3534567249.57 instead of 3534566249.57. Where does that difference of a thousand come from?
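
The size of that discrepancy can be checked directly from the two values in the failure log for test_calculo_de_desembolsos_8. This is only a worked arithmetic check. It does not explain where the error comes from.

```python
from decimal import Decimal

expected = "3534566249.57"
returned = "3534567249.57"  # value shown in the failure log

# Exact decimal subtraction avoids floating-point rounding noise.
difference = Decimal(returned) - Decimal(expected)

# Character positions where the two strings disagree.
differing = [i for i, (a, b) in enumerate(zip(expected, returned)) if a != b]

print(difference)
print(differing)
```

The two strings differ in a single character, the thousands digit, which is why the wrong answer is so easy to overlook when reading it.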
