Evaluations SDK

The SDK enables you to evaluate your runs directly within your code using your own LLM agents.

You or your team can create these assertions (which we refer to as a checklist) via the dashboard.

Examples of evaluation assertions include:

  • Ensuring the LLM's response is at least 90% similar to an expected output (using cosine similarity)
  • Verifying that the LLM's response contains a specific name or location
  • Checking that the LLM's response is valid JSON

We are continuously expanding the range of available assertions.
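To make the first assertion type above concrete, here is a rough, conceptual sketch of a cosine-similarity check. This is not the platform's implementation; embed() is a placeholder for whatever embedding model is used behind the scenes.

# Conceptual sketch only, not the platform's implementation.
# embed() stands in for an embedding model (e.g. an OpenAI embedding endpoint).
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# similarity = cosine_similarity(embed(output), embed(ideal_output))
# The assertion passes when similarity >= 0.9 (at least 90% similar).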

Pre-requisites

  1. Create a testing dataset on the dashboard.
  2. Create a checklist on the dashboard.
  3. Ensure you have an LLM agent that can be tested against the dataset (currently, it should accept an array of messages).

The testing dataset will be utilized to conduct the evaluations.

Assuming your agent is configured as follows:

from openai import OpenAI
import lunary

client = OpenAI()
lunary.monitor(client)

@lunary.agent()
def my_llm_agent(input):
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=input
    )
    return res.choices[0].message.content
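For a quick sanity check, you can call the agent directly with an array of chat messages, the input format mentioned in the prerequisites:

# Example call with a list of chat messages
print(my_llm_agent([
    {"role": "user", "content": "What is the capital of France?"}
]))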

Differences from other tools

Our platform distinguishes itself in the LLM testing and evaluation space for several reasons:

  • Evaluations are managed via the dashboard rather than in code, which simplifies maintenance and fosters collaboration with non-technical team members. Although this approach offers less flexibility than custom code evaluations, our expanding set of blocks (including an upcoming custom code block) covers most requirements.
  • You can evaluate metrics that are not directly accessible in your code, such as OpenAI costs, and you also get a historical record of evaluations.
  • The platform is tightly integrated with features like Prompt Templates and Observability; for instance, you can test templates before deploying them.

Usage

We offer several methods to run tests from within your code.

The overarching concept is to run your agent on a testing dataset and then assess the data captured from your agents and LLM calls.

Follow this step-by-step guide to execute tests in your code:

1. Fetch the dataset

Retrieve the dataset you wish to evaluate. This provides a series of inputs on which to run your agent.

Python:
dataset = lunary.get_dataset("some-dataset")

JavaScript:
const dataset = await lunary.getDataset("some-dataset")

HTTP:
GET https://api.lunary.ai/v1/datasets/{dataset_slug}
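Based on the fields used later in this guide, each dataset item exposes at least an input and an ideal_output; a quick way to inspect them:

dataset = lunary.get_dataset("some-dataset")

for item in dataset:
    print(item.input)         # the prompt (an array of messages) to feed your agent
    print(item.ideal_output)  # the expected output used by checklist assertions
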
2. Run your agents

The next step is to run your LLM agent on the dataset items you fetched.

dataset = lunary.get_dataset("my_dataset")

for item in dataset:
    result = my_llm_agent(item.input)
3. Evaluate the result

dataset = lunary.get_dataset("my-dataset")

for item in dataset:
    prompt = item.input
    result = my_llm_agent(item.input)

    passed, results = lunary.evaluate(
        checklist="some-slug",
        output=result,
        input=prompt,
        ideal_output=item.ideal_output,
        # model="gpt-4o", # optional, for model-specific evaluations such as cost
    )

    print(passed)
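As a variation on the loop above, you can aggregate the outcomes into a simple pass rate for the whole dataset. This is a minimal sketch using only the calls shown in this guide:

dataset = lunary.get_dataset("my-dataset")

total = 0
passed_count = 0

for item in dataset:
    total += 1
    result = my_llm_agent(item.input)

    passed, results = lunary.evaluate(
        checklist="some-slug",
        output=result,
        input=item.input,
        ideal_output=item.ideal_output,
    )

    if passed:
        passed_count += 1

print(f"{passed_count}/{total} dataset items passed the checklist")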

Example with testing framework

You can integrate the SDK with the testing framework of your choice. Here is an example using pytest:

import lunary

def test_my_agent():
    dataset = lunary.get_dataset("my-dataset")

    for item in dataset:
        prompt = item.input
        result = my_llm_agent(item.input)

        passed, results = lunary.evaluate(
            checklist="some-slug",
            output=result,
            input=prompt,
            ideal_output=item.ideal_output
        )

        assert passed
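If you prefer one test case per dataset item, you can parametrize the test instead. This is a hedged sketch, assuming the dataset can be fetched at collection time and that my_llm_agent is defined as shown earlier:

import lunary
import pytest

# Fetched at collection time so pytest can generate one test per item.
dataset = lunary.get_dataset("my-dataset")

@pytest.mark.parametrize("item", dataset)
def test_my_agent_item(item):
    result = my_llm_agent(item.input)

    passed, results = lunary.evaluate(
        checklist="some-slug",
        output=result,
        input=item.input,
        ideal_output=item.ideal_output
    )

    assert passed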
