Introduction
In the era of large language models (LLMs), deploying them efficiently requires robust serving infrastructure. MCP (Model Control Protocol) servers provide a streamlined way to interact with, scale, and query LLMs. In this post, we will walk through setting up an MCP server, configuring it, and using it to serve an LLM.
Prerequisites
Before setting up an MCP server, ensure you have the following:
- Python 3.8+
- The torch, transformers, fastapi, and uvicorn Python packages
- A suitable causal language model that you can load with transformers, such as OpenAI's GPT-2 or Meta's Llama 2 (the example below uses Llama 2)
Installing Dependencies
First, install the required dependencies:
pip install torch transformers fastapi uvicorn
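If you want to confirm the environment before writing any server code, a quick check along these lines can help (a minimal sketch; it only uses the packages installed above):

import torch
import transformers

# Report library versions and whether a CUDA-capable GPU is visible;
# a 7B model in float16 needs roughly 14 GB of GPU memory.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())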
Creating the MCP Server
Create a Python file named mcp_server.py and add the following code:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# The Llama 2 weights are gated on the Hugging Face Hub (request access first);
# the transformers-compatible checkpoint is published with the "-hf" suffix.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# float16 halves memory use; a GPU is strongly recommended for a 7B model.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)

class QueryRequest(BaseModel):
    prompt: str

def generate_response(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds the length of the generated continuation.
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

@app.post("/query")
async def query_model(request: QueryRequest):
    # Using a Pydantic model means the prompt is read from the JSON request body.
    if not request.prompt:
        raise HTTPException(status_code=400, detail="Prompt cannot be empty")
    response = generate_response(request.prompt)
    return {"response": response}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
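Before starting the HTTP layer, it can be worth sanity-checking model loading and generation on their own, for example from a separate script or Python shell (a sketch; generate_response is the function defined in mcp_server.py above):

# Importing mcp_server loads the tokenizer and model once; the
# __main__ guard keeps the server itself from starting during the import.
from mcp_server import generate_response

print(generate_response("Hello, AI!"))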
Running the MCP Server
Execute the following command to start the server:
python mcp_server.py
Your MCP server should now be running at http://localhost:8000.
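Alternatively, you can start the same app through uvicorn's command-line interface, which adds conveniences such as --reload for automatic restarts during development (mcp_server:app refers to the app object inside mcp_server.py):

uvicorn mcp_server:app --host 0.0.0.0 --port 8000

Once the server is up, FastAPI also serves auto-generated interactive docs at http://localhost:8000/docs, which is a convenient way to try the /query endpoint from a browser.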
Querying the Server
Use a simple Python client or curl to send queries:
import requests
response = requests.post("http://localhost:8000/query", json={"prompt": "Hello, AI!"})
print(response.json())
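For anything beyond a quick test, it is worth adding a timeout and basic error handling, since generation on a large model can take a while (a sketch; ask is just an illustrative wrapper, and the URL and payload match the endpoint defined above):

import requests

def ask(prompt: str) -> str:
    # Generation can be slow, especially on CPU, so allow a generous timeout.
    resp = requests.post(
        "http://localhost:8000/query",
        json={"prompt": prompt},
        timeout=120,
    )
    resp.raise_for_status()  # surface 4xx/5xx errors instead of parsing them
    return resp.json()["response"]

print(ask("Hello, AI!"))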
Or use curl:
curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d '{"prompt": "Hello, AI!"}'
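In either case, the server replies with the JSON object returned by query_model; the generated text itself will vary with the model and generation settings:

{"response": "<model-generated text>"}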