This repository tests the reliability and exactness of chaining an LLM with a simple MCP tool to answer questions with data.
The goal is to provide information for the following questions:
- How reliable are LLMs and generic MCP tools at answering questions with data?
- Can an LLM translate my natural language question into a proper data analysis query and get an accurate answer every time?
This repository contains:
- A simple MCP Server
- An MCP tool that builds a dynamic Pandas Pipeline based on arguments.
- A set of 12 tests that ask the same questions and evaluate the answer that the LLM provides after using the MCP tool.
Note: This repository is a simple test to explore and understand LLM behaviour when using MCP tools. The learnings shouldn't be generalized to every use case. It is a test codebase for technical exploration.
- Clone the repository
- Create the virtual environment
    uv sync
- Run the server
    uv run mcp-pandas-query
- Set the environment variables in a .env file
    # Example .env content
    EXAMPLE_BASE_URL=https://api.z.ai/api/coding/paas/v4
    EXAMPLE_API_KEY=<buyme>
    EXAMPLE_MODEL_NAME=glm-4.7
- Run the tests (server must be running)
    uv run pytest
The result of running the tests is that, when using the tool, the LLM sometimes provides the right answer and sometimes it provides a wrong answer.
tests/test_tool_calling.py::test_top_5_desembolsos PASSED [ 7%]
tests/test_tool_calling.py::test_calculo_de_desembolsos FAILED [ 15%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_2 PASSED [ 23%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_3 PASSED [ 30%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_4 PASSED [ 38%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_5 FAILED [ 46%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_6 PASSED [ 53%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_7 FAILED [ 61%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_8 FAILED [ 69%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_9 PASSED [ 76%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_10 FAILED [ 84%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_11 FAILED [ 92%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_12 FAILED [100%]
This means that even when the MCP tool has the logic to query the data and get a proper answer, the LLM does not always know how to use it properly. Moreover, it can call the tool several times, infer a result on its own, and give the user a wrong answer.
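To put a number on "sometimes right, sometimes wrong", the outcomes listed above can be tallied. The sketch below is only an illustration: it copies the verbose pytest lines from this run into a string, and the helper `tally` is a hypothetical name, not part of the repository. It also assumes the tests named `test_calculo_de_desembolsos*` are the ones that repeat the same question.

```python
import re
from collections import Counter

# Test names and outcomes copied from the pytest run shown above.
PYTEST_OUTPUT = """\
tests/test_tool_calling.py::test_top_5_desembolsos PASSED [ 7%]
tests/test_tool_calling.py::test_calculo_de_desembolsos FAILED [ 15%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_2 PASSED [ 23%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_3 PASSED [ 30%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_4 PASSED [ 38%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_5 FAILED [ 46%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_6 PASSED [ 53%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_7 FAILED [ 61%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_8 FAILED [ 69%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_9 PASSED [ 76%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_10 FAILED [ 84%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_11 FAILED [ 92%]
tests/test_tool_calling.py::test_calculo_de_desembolsos_12 FAILED [100%]
"""

# Captures the test function name and its outcome from one verbose pytest line.
LINE_RE = re.compile(r"::(?P<name>\w+) (?P<outcome>PASSED|FAILED)")


def tally(output: str, name_prefix: str = "") -> Counter:
    """Count PASSED/FAILED outcomes for tests whose name starts with name_prefix."""
    counts = Counter()
    for match in LINE_RE.finditer(output):
        if match.group("name").startswith(name_prefix):
            counts[match.group("outcome")] += 1
    return counts


all_tests = tally(PYTEST_OUTPUT)
same_question = tally(PYTEST_OUTPUT, "test_calculo_de_desembolsos")
pass_rate = same_question["PASSED"] / sum(same_question.values())

print(dict(all_tests))
print(f"same-question pass rate: {pass_rate:.0%}")
```

For this single run the tally matches the pytest summary (6 passed, 7 failed overall). A different run may give a different split, so the pass rate is a snapshot rather than a stable measurement.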
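Looking at the failures, we can see that most of them are wrong numbers given as the answer. Two are different: test_calculo_de_desembolsos_10 returned the right number formatted with thousands separators, and test_calculo_de_desembolsos_12 returned an empty answer.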
================================================================================== short test summary info ==================================================================================
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos - AssertionError: assert '3534566249.57' in 'Veo que obtuve los datos por año. Déjame calcular el total sumando todos los montos:\n\nTotal de desembolsos al sector privado en Guatemala =...
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_5 - assert '3534566249.57' in '{"type": "text", "text": "3798672494.42"}'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_7 - AssertionError: assert '3534566249.57' in 'Calculando la suma de los montos de todos los desembolsos al sector privado de Guatemala:\n\n3927595178.04'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_8 - AssertionError: assert '3534566249.57' in '3534567249.57'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_10 - AssertionError: assert '3534566249.57' in '3,534,566,249.57'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_11 - AssertionError: assert '3534566249.57' in '3246567350.83'
FAILED tests/test_tool_calling.py::test_calculo_de_desembolsos_12 - AssertionError: assert '3534566249.57' in ''
=================================================================== 7 failed, 6 passed, 13 warnings in 871.17s (0:14:31) ====================================================================
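The summary shows that the tests check for the expected value as a substring of the answer, which is why '3,534,566,249.57' counts as a failure even though the number is right. As a sketch only, a more tolerant check could extract the numbers from the answer and compare them as decimals. The names `extract_numbers` and `answer_contains` are hypothetical, not part of the repository, and the regex assumes a comma as thousands separator and a dot as decimal separator.

```python
import re
from decimal import Decimal

EXPECTED = Decimal("3534566249.57")

# Matches comma-grouped numbers (3,534,566,249.57) or plain ones (3534566249.57).
# Assumption: "," groups thousands and "." marks decimals; other locales are not handled.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_numbers(answer: str) -> list[Decimal]:
    """Return every number found in the answer, with thousands separators removed."""
    return [Decimal(token.replace(",", "")) for token in NUMBER_RE.findall(answer)]


def answer_contains(answer: str, expected: Decimal) -> bool:
    """True if any number in the answer equals the expected value exactly."""
    return expected in extract_numbers(answer)


# Answers taken from the failure summary above.
print(answer_contains("3,534,566,249.57", EXPECTED))   # formatting only
print(answer_contains("3534567249.57", EXPECTED))      # off by a thousand
print(answer_contains('{"type": "text", "text": "3798672494.42"}', EXPECTED))
print(answer_contains("", EXPECTED))                   # empty answer
```

With this check only the formatting-only case would pass. The wrong numbers and the empty answer would still fail, which is the behaviour the tests are meant to catch.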
The most interesting one in this example is test_calculo_de_desembolsos_8, where the error is small enough to pass unnoticed: it returns 3534567249.57 instead of 3534566249.57. Where does that difference of a thousand come from?
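The size of that error can be checked directly from the two values in the test summary. The short sketch below uses `Decimal` to avoid floating-point noise and also lists the character positions where the two strings differ. It only quantifies the gap and does not explain where it comes from.

```python
from decimal import Decimal

expected_text = "3534566249.57"   # value the test asserts
returned_text = "3534567249.57"   # value returned in test_calculo_de_desembolsos_8

# Exact decimal subtraction of the two amounts.
difference = Decimal(returned_text) - Decimal(expected_text)

# Positions (0-based) where the two equal-length strings disagree.
diff_positions = [
    index
    for index, (a, b) in enumerate(zip(expected_text, returned_text))
    if a != b
]

print(difference)
print(diff_positions)
```

The two strings differ in a single character, the thousands digit (6 versus 7), so the returned value is exactly 1000.00 higher than the expected one.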