Models

Score

Each model is assigned an Equistamp score, calculated by taking the average of its most recent score on each of our trusted evaluations. If a model hasn't yet taken a given evaluation, the median score for that evaluation (across all models) is used instead. Per-evaluation scores are scaled so that 0 is the minimum score (usually also 0) and 100 is the maximum score achieved on that evaluation.

Our reasoning behind this is:

  • Penalise models that haven't taken many tests, as they could take only the easy ones and so inflate their ratings
  • Reward models for taking hard tests - a model that gets 50% on a really hard test shouldn't have a worse overall score than one that gets 90% on a really easy one
  • Only take trusted evaluations into account, as otherwise it would be trivial to score really well by uploading a bunch of custom evaluations and only taking those
  • Have a relative scale, so it's obvious how models compare to each other
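The scoring scheme above can be sketched as follows. This is a minimal illustration under the assumptions stated in the text; the function and variable names are hypothetical, not Equistamp's actual code:

```python
from statistics import median

def equistamp_score(model_scores, all_scores_by_eval):
    """Sketch of the Equistamp score: average of per-evaluation
    scores, each min-max scaled to a 0-100 range.

    model_scores: {eval_name: latest raw score} for one model
    all_scores_by_eval: {eval_name: [raw scores of all models]}
    """
    scaled = []
    for name, scores in all_scores_by_eval.items():
        # Fall back to the median across all models if this model
        # hasn't taken the evaluation yet.
        raw = model_scores.get(name, median(scores))
        lo, hi = min(scores), max(scores)
        # Scale so the minimum maps to 0 and the maximum to 100.
        scaled.append(100 * (raw - lo) / (hi - lo) if hi > lo else 0)
    return sum(scaled) / len(scaled)
```

Note how the median fallback penalises models that skip evaluations: an untaken hard test contributes only a middling scaled score, not a perfect one.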

Model Pages

Models List

This is the main view for models. It shows a list of all available models, which can be sorted and filtered. On this page each model has a card with its most important information, the most obvious being name and score. Enterprise users have the option of making private models - these will only be displayed to the owner of a model and will not be visible to any other users.

New models can be added from this view by clicking on the "Add model" card, which is always the first item in the models list.

Clicking on a model's card will open a new page with detailed information about the model, along with the option of viewing any previous runs and graphs showing how well it did on various evaluations.

Model details page

Clicking on a model from the main list will open the details of that model. Here you can see various extra information like the publisher, model architecture or number of parameters. There is also a graph showing how well this model did on various evaluations over time. The graph can be filtered to drill down to specific evaluations, and the data can be displayed in various ways - the default view shows the top three evaluations for each point in time. Hovering over the graph will show the specific evaluations and scores for that date, and clicking on a point will display the evaluations from that data point in the left panel.

Clicking on "View Evaluations" will show all previous evaluations of this model.

If you are the owner of a model, this page will display an "Edit" button, which allows you to edit the various parameters of your model, and an "Edit endpoint" button, which opens the endpoint configuration wizard used to configure how evaluation tasks should be sent to this model.

Model creation page

The [model creation page](/models/create) allows adding new models to the system, which can subsequently be tested on various evaluations. Opening this page displays a form with various fields (e.g. name, description), some required and some optional, most of which should be self-explanatory. The following settings are particularly significant:

Name

Each model must have a unique name. This also applies to private models, so e.g. creating a model called "test" will most likely fail. If you have a well-known AI model which you want to add, but someone else has already created a model with that name, please contact us at admin@equistamp.com.

Visibility options

Visibility

If you are an enterprise user, you will have the option to make this model private. Private models are only ever visible to their creators and system administrators, for whom private models act just like any other models, i.e. are visible in the models list and have details pages that can be viewed. Private models are also hidden when displaying previous runs of evaluations. That being said, private models are always taken into account when calculating Equistamp scores, so if you have a private model that blows the competition out of the water, this will be detectable by the maximum score (visible to other users) being suspiciously low.

Usability

By default, models cannot be evaluated by other users. This is by design: when you add a model, it's up to you to cover any costs of running it. We will cover the running of evaluations, but it's on you to make sure your model can handle all the requests. This load can be quite large if lots of people run evaluations on your model, so to avoid surprises, models have to be explicitly set as publicly runnable. If you enable this, you will be shown a confirmation dialog to make sure you want it.

Limitations

This section contains various limitations that will be used to avoid overwhelming the model. Not all models use all settings, but most models use at least one or two. These limitation settings are optional - if not set, we will assume that there aren't any limits.

When evaluations are run, the rate of tasks sent will be limited so as to stay below the minimum of these values. The actual rate is variable, scaled up and down on the basis of failed/successful responses, but should not exceed the set limits. That said, this also depends on what the model returns, so it's possible for the limits to be breached. For example, assume a model with a limit of 1000 tokens per minute that has already processed 800 tokens over the last minute: sending an additional 100 tokens is still under the 1000 token limit, but the model's answer is up to the model and could be e.g. 200 tokens, resulting in a total of 1100 tokens for that minute. We strive to respect limits, but cannot guarantee that they will never be exceeded.

The available limit options are:

  • max tokens per minute - it's assumed that on average 1 token is around 4 characters. This includes both request and response tokens.
  • max context window tokens - the maximum number of tokens in the context window of a single call. This is effectively a limit on the size of tasks (including responses) that a model can process - there is no point sending a 100-token task to a model with a 100-token context window, as there won't be any tokens left for the response.
  • max requests per minute
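To make the token-budget example above concrete, here is a client-side sketch of this kind of throttling. The `RateLimiter` class and its structure are purely illustrative (the actual scaling logic is more involved), but it shows why only request tokens can be counted up front:

```python
import time
from collections import deque

class RateLimiter:
    """Illustrative sliding-window limiter over a 60-second window."""

    def __init__(self, max_tokens_per_minute=None, max_requests_per_minute=None):
        self.max_tokens = max_tokens_per_minute
        self.max_requests = max_requests_per_minute
        self.history = deque()  # (timestamp, tokens) per recorded call

    def _prune(self, now):
        # Drop entries older than one minute.
        while self.history and now - self.history[0][0] > 60:
            self.history.popleft()

    def can_send(self, request_tokens, now=None):
        now = time.monotonic() if now is None else now
        self._prune(now)
        if self.max_requests is not None and len(self.history) >= self.max_requests:
            return False
        used = sum(tokens for _, tokens in self.history)
        # Only request tokens are known up front; a long model
        # response can still push the true total past the limit.
        if self.max_tokens is not None and used + request_tokens > self.max_tokens:
            return False
        return True

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.history.append((now, tokens))
```

With a 1000-token limit and 800 tokens already recorded, a 100-token request is allowed, exactly the situation the text describes where the response can still push the total over the limit.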

Costs

Running AI models costs money, often quite a lot. These settings can be used to let users know how much. Currently this is purely informational, but for commercial models cost information is crucial when considering which model to use, so we encourage filling out these fields if you want people to use your model. The final cost per run is assumed to be the sum of all applicable fields. Some models charge only per token, some only for run time, and some charge a set sum per hour of run time with an additional cost per token, so only fill in those fields that apply. Empty fields are treated as 0 when calculating final costs.

The following options are available (all costs are assumed to be in USD):

  • cost per 1000 input tokens - it's assumed that on average a token is 4 characters; if you charge per character, rather than per token, please just multiply your per character cost by 4
  • cost per 1000 output tokens - this has the same assumptions as input token cost
  • cost per instance hour - some models charge by run time, rather than the amount of tokens processed
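The "sum of all applicable fields" rule above can be sketched as a small helper. The function name and signature are hypothetical; unset fields default to 0, matching the description:

```python
def run_cost(input_tokens, output_tokens, hours,
             cost_per_1k_input=0.0, cost_per_1k_output=0.0,
             cost_per_instance_hour=0.0):
    """Estimated cost of one run in USD: the sum of token costs
    (per 1000 tokens) and instance-hour costs, where empty (unset)
    fields are treated as 0."""
    return (input_tokens / 1000 * cost_per_1k_input
            + output_tokens / 1000 * cost_per_1k_output
            + hours * cost_per_instance_hour)
```

For example, 2000 input tokens at $0.01/1k, 1000 output tokens at $0.03/1k and half an instance hour at $2.00/hour comes to roughly $1.05.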

Endpoint configuration

The most important thing when adding your model is to define the endpoint to be called with tasks to be evaluated. This can be done with the endpoint configuration wizard, which will appear after clicking on the "Edit endpoint" button.

Endpoint configuration wizard

The endpoint configuration wizard is used to specify your model's endpoint to be called with tasks. We send tasks as POST requests with a JSON body, which you can configure with the wizard. The wizard is available after clicking on the "Edit endpoint" button while creating models or on the model details page of models that you own. If you need more control over how requests are sent and parsed, you can use our query DSL.

The wizard contains 5 steps:

  1. Endpoint url - the url to be called, e.g. 'https://this.is.my/model/endpoint'
  2. Headers - additional headers to be sent, or overrides for default ones. This is useful for e.g. API keys and such. These values are sent as plain text with each request, so assume they may not be secure and plan appropriately. The structure for this field is a JSON object, e.g.
    {
       "Connection": "keep-alive",
       "Api-Key": "DEADBEEF"
    }
    
  3. Body template - this is where you can configure the actual JSON object to be sent. It should contain a %PROMPT% literal somewhere, which will be replaced by the actual task text, so assuming that the task is "What time is it?", the following JSON object:
    {
       "type": "task",
       "versions": [1, 2, 3],
       "task": "%PROMPT%"
    }
    
    will be sent to your endpoint as:
    {
       "type": "task",
       "versions": [1, 2, 3],
       "task": "What time is it?"
    }
    
  4. Endpoint test - this step will send a real test request to your endpoint to see if everything works correctly. Once your endpoint returns something, it will be displayed here. Errors will also be displayed, so you can use this bit to debug the previous three steps. Changes to any of the previous fields will trigger a new request.
  5. Response parser - Once your endpoint is replying successfully, you can specify the field in the response to use as the model's answer. We expect to get a single string with the response. If your model already returns only the raw response as a single string, then leave this field as an empty list. Otherwise please provide the path to the appropriate field. When you modify the Response parser field, it will display what it would have extracted from the response in the previous step. Assuming that your endpoint returned the following:
    {
      "status": 200,
      "headers": {"Connection": "keep-alive"},
      "json": {
     "res": "it's 10 PX"
      },
      "raw": "it's 10 PX"
    }
    

the following Response parser values would result in the corresponding extracted results:

  • ["status"] - 200
  • ["headers", "Connection"] - "keep-alive"
  • ["json"] - {"res": "it's 10 PX"}
  • ["json", "res"] - "it's 10 PX"
  • ["some", "missing", "key"] - null
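The template substitution and response-parser lookup described in the steps above can be sketched in Python. This is an illustrative client-side reconstruction, not Equistamp's implementation; `fill_template`, `extract` and `send_task` are hypothetical names:

```python
import json
import urllib.request

def fill_template(template, prompt):
    """Splice the task text into the body template in place of the
    %PROMPT% literal. json.dumps escapes quotes etc.; its surrounding
    quotes are stripped since the template supplies its own."""
    return template.replace("%PROMPT%", json.dumps(prompt)[1:-1])

def extract(response, path):
    """Walk a Response parser path (a list of keys). An empty list
    returns the response unchanged; a missing key yields None,
    matching the null result in the last example above."""
    value = response
    for key in path:
        try:
            value = value[key]
        except (KeyError, TypeError, IndexError):
            return None
    return value

def send_task(url, headers, body_template, prompt, parser_path):
    """POST one task to the endpoint and pull out the answer field."""
    body = fill_template(body_template, prompt).encode("utf-8")
    req = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json", **headers},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return extract(json.loads(resp.read()), parser_path)
```

Against the example response above, `extract(response, ["json", "res"])` would yield "it's 10 PX", while a missing path yields None.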

Model Runs

Each model details page has a link to its previous evaluations. This is a page with all tests that this model has taken, grouped by evaluation. This can be further filtered by evaluation name on the group level, and by start/end dates, score and status (running, failed, completed) on the individual test level.

Clicking on an evaluation name will show a list of all previous runs for that model/evaluation pair, and clicking on a specific run will show a graph with more details. This graph is not always available, and it can take up to 30 minutes for it to update.