pulze-intent-v0.1 (model, dataset)

Intent-tuned LLM router that selects the best LLM for a user query.

Usage

Local

Fetch artifacts from Hugging Face:

huggingface-cli download pulze/intent-v0.1 --local-dir .dist --local-dir-use-symlinks=False

Start the services:

docker compose up -d --build
curl -s 127.0.0.1:8888/ \
    -X POST \
    -d '{"query":"give me instructions for making ramen at home"}' \
    -H 'Content-Type: application/json' | jq .

Output:

{
  "hits": [
    {
      "id": "0c571369-e985-41e1-b14b-3620c4bb40b5",
      "category": "writing_cooking_recipe",
      "similarity": 0.8069034
    },
    {
      "id": "9f44d3c0-95f5-43cc-a881-6e23adf9c68b",
      "category": "writing_cooking_recipe",
      "similarity": 0.778615
    },
    {
      "id": "21292586-73f3-4bf7-9ada-ae6917d4cd74",
      "category": "writing_cooking_recipe",
      "similarity": 0.77417636
    },
    {
      "id": "edd39535-f9b7-4188-a56e-055846d0ba23",
      "category": "writing_cooking_recipe",
      "similarity": 0.772714
    },
    {
      "id": "3cc563e1-1816-4ff1-8d2e-60fb9392c6de",
      "category": "writing_cooking_recipe",
      "similarity": 0.76833653
    },
    {
      "id": "15c9e10f-d217-418a-b367-a16e3cd3a541",
      "category": "writing_cooking_recipe",
      "similarity": 0.76015425
    },
    {
      "id": "ad33a141-269f-4a88-b99c-456cf67d9221",
      "category": "writing_cooking_recipe",
      "similarity": 0.75983727
    },
    {
      "id": "d6ee2a78-3b7a-44d6-9778-adc1e1f9a3db",
      "category": "writing_cooking_recipe",
      "similarity": 0.75918543
    },
    {
      "id": "afa1f32e-e69e-4d75-9a73-7a11b0259a24",
      "category": "writing_cooking_recipe",
      "similarity": 0.7565732
    },
    {
      "id": "5a569973-3934-49f6-8901-11f6b490a6cd",
      "category": "writing_cooking_recipe",
      "similarity": 0.7564193
    }
  ],
  "scores": [
    {
      "target": "gpt-3.5-turbo-0125",
      "score": 0.83
    },
    {
      "target": "command-r-plus",
      "score": 0.93
    },
    {
      "target": "llama-3-70b-instruct",
      "score": 0.95
    },
    {
      "target": "gpt-4-turbo-2024-04-09",
      "score": 0.96
    },
    {
      "target": "dbrx-instruct",
      "score": 0.91
    },
    {
      "target": "mixtral-8x7b-instruct",
      "score": 0.91
    },
    {
      "target": "mistral-small",
      "score": 0.9
    },
    {
      "target": "mistral-large",
      "score": 0.91
    },
    {
      "target": "mistral-medium",
      "score": 0.89
    },
    {
      "target": "claude-3-opus-20240229",
      "score": 0.91
    },
    {
      "target": "claude-3-sonnet-20240229",
      "score": 0.9
    },
    {
      "target": "command-r",
      "score": 0.88
    },
    {
      "target": "claude-3-haiku-20240307",
      "score": 0.89
    }
  ]
}
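
The hits array lists the nearest intent-category exemplars for the query, and the scores array gives the normalized quality score predicted for each candidate model on that intent. A client would typically route to the highest-scoring target. Below is a minimal Python sketch of that step; the endpoint and field names match the example above, but the helper itself is illustrative and not part of this repository:

import json
import urllib.request

def route(query: str, endpoint: str = "http://127.0.0.1:8888/") -> str:
    """Ask the router to score candidate models for a query and return the top target."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # pick the candidate with the highest predicted score
    best = max(result["scores"], key=lambda s: s["score"])
    return best["target"]

print(route("give me instructions for making ramen at home"))
# e.g. "gpt-4-turbo-2024-04-09" for the example response above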

Kubernetes

See this example.

Models

  • claude-3-haiku-20240307
  • claude-3-opus-20240229
  • claude-3-sonnet-20240229
  • command-r
  • command-r-plus
  • dbrx-instruct
  • gpt-3.5-turbo-0125
  • gpt-4-turbo-2024-04-09
  • llama-3-70b-instruct
  • mistral-large
  • mistral-medium
  • mistral-small
  • mixtral-8x7b-instruct

Data

Prompts and Intent Categories

Prompt and intent categories are derived from the GAIR-NLP/Auto-J scenario classification dataset.

Citation:

@article{li2023generative,
  title={Generative Judge for Evaluating Alignment},
  author={Li, Junlong and Sun, Shichao and Yuan, Weizhe and Fan, Run-Ze and Zhao, Hai and Liu, Pengfei},
  journal={arXiv preprint arXiv:2310.05470},
  year={2023}
}

Response Evaluation

Candidate model responses were evaluated pairwise using openai/gpt-4-turbo-2024-04-09, with the following prompt:

You are an expert, impartial judge tasked with evaluating the quality of responses generated by two AI assistants.

Think step by step, and evaluate the responses, <response1> and <response2> to the instruction, <instruction>. Follow these guidelines:
- Avoid any position bias and ensure that the order in which the responses were presented does not influence your judgement
- Do not allow the length of the responses to influence your judgement - a concise response can be as effective as a longer one
- Consider factors such as adherence to the given instruction, helpfulness, relevance, accuracy, depth, creativity, and level of detail
- Be as objective as possible

Make your decision on which of the two responses is better for the given instruction from the following choices:
If <response1> is better, use "1".
If <response2> is better, use "2".
If both answers are equally good, use "0".
If both answers are equally bad, use "0".

<instruction>
{INSTRUCTION}
</instruction>

<response1>
{RESPONSE1}
</response1>

<response2>
{RESPONSE2}
</response2>

Each pair of models plays two matches, with the positions of the respective responses swapped between them in the evaluation prompt. A model is considered the winner for a prompt only if it wins both matches.
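
As an illustration of the two-match rule, the helper below (hypothetical, not part of the evaluation code) takes the judge's verdicts from the original and the position-swapped match and returns a winner only when the same model wins both:

def match_winner(verdict_original: str, verdict_swapped: str) -> str | None:
    """Return "A" or "B" if that model wins both matches, else None (no winner).

    verdict_original: judge output with A as <response1> and B as <response2>
    verdict_swapped:  judge output with the positions swapped
    "1" = <response1> better, "2" = <response2> better, "0" = tie
    """
    if verdict_original == "1" and verdict_swapped == "2":
        return "A"   # A wins as response1 and again as response2
    if verdict_original == "2" and verdict_swapped == "1":
        return "B"
    return None      # ties or split verdicts: no winner for this prompt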

For each prompt, we then compute Bradley-Terry scores for the respective models using the same method as that used in the LMSYS Chatbot Arena Leaderboard. Finally, we normalize all scores to a scale from 0 to 1 for interoperability with other weighted ranking systems.
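
The fitting code is not reproduced here. As a rough sketch of the idea, the standard minorization-maximization update for Bradley-Terry strengths, followed by min-max normalization to [0, 1], looks like the snippet below; the LMSYS leaderboard fits the same model via logistic regression, so treat this as illustrative rather than the exact pipeline:

import numpy as np

def bradley_terry_scores(wins: np.ndarray, iters: int = 100) -> np.ndarray:
    """wins[i, j] = matches model i won against model j (per the two-match rule above)."""
    n = wins.shape[0]
    matches = wins + wins.T                  # total decided matches per pair
    strength = np.ones(n)
    for _ in range(iters):
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j and matches[i, j] > 0:
                    denom[i] += matches[i, j] / (strength[i] + strength[j])
        strength = np.where(denom > 0,
                            wins.sum(axis=1) / np.maximum(denom, 1e-12),
                            strength)
        strength /= strength.sum()           # fix the arbitrary scale
    # min-max normalize to 0-1 for interoperability with other weighted rankers
    return (strength - strength.min()) / (strength.max() - strength.min() + 1e-12)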

Model

The embedding model was produced by fine-tuning BAAI/bge-base-en-v1.5 on the intent categories from the dataset above, using contrastive learning with a cosine-similarity loss, and then merging the fine-tuned model with the base model at a 3:2 ratio.
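
A rough sketch of that recipe with sentence-transformers is shown below. The training pairs and output path are placeholders, and the 3:2 merge is assumed to be a simple linear interpolation of matching weights (0.6 fine-tuned, 0.4 base); the actual training data and merge procedure may differ:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# hypothetical (prompt, intent category) pairs: label 1.0 for the true category, 0.0 otherwise
train_examples = [
    InputExample(texts=["give me instructions for making ramen at home",
                        "writing_cooking_recipe"], label=1.0),
    InputExample(texts=["give me instructions for making ramen at home",
                        "code_generation"], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1, warmup_steps=100)

# merge fine-tuned and base weights at a 3:2 ratio (linear interpolation, an assumption)
base = SentenceTransformer("BAAI/bge-base-en-v1.5")
tuned_state, base_state = model.state_dict(), base.state_dict()
merged = {
    k: 0.6 * tuned_state[k] + 0.4 * base_state[k] if tuned_state[k].is_floating_point()
    else tuned_state[k]
    for k in tuned_state
}
base.load_state_dict(merged)
base.save("intent-embedder-merged")  # placeholder output path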