Evaluations
Create an evaluation
Evaluations can be created with the creation form. You need at least a Pro subscription to do this.
Basic evaluation settings
The first section of the evaluation creator contains basic information about the evaluation. This is mainly used
for display purposes - the only exception being Minimum questions to complete.
- Name - this is the name of the evaluation, which must be unique system-wide. In general it's first come, first served, but if you are the author of a very well known evaluation, we can help you work something out to avoid name squatting.
- Description - a description of what the evaluation checks, etc. This field will be rendered as markdown.
- Type - the types of tasks covered by this evaluation. This is purely informational - it doesn't limit you in any way.
- Modalities - the types of modalities used in your tasks. This is also purely informational.
- Visibility - this setting specifies whether your evaluation is visible to everyone. By default, evaluations are publicly visible and can be used to test all available models. Restricting visibility is only available to enterprise users.
- Minimum questions to complete - this controls the number of tasks per evaluation session. When a model is tested
with your evaluation, we will run at least this many tasks. The actual number of tasks seen by a model can be larger
because of various network or model errors. We retry tasks that didn't get a proper answer from a model (e.g. because
of rate limiting), but only a limited number of times. Because of this, we add 5-10% extra tasks to compensate for
tasks that didn't get proper responses. This doesn't mean that all of the extra tasks will be sent - as soon as we
manage to grade Minimum questions to complete tasks, we finish the evaluation session.
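The buffering rule above can be sketched as a small calculation. The function name and the exact ceiling arithmetic here are illustrative assumptions, not the platform's actual implementation:

```python
def session_task_budget(min_questions: int, buffer_pct: int = 10) -> int:
    """Hypothetical upper bound on tasks sent in one session: the
    "Minimum questions to complete" setting plus a 5-10% retry buffer
    (10% assumed here), rounded up via integer ceiling division."""
    extra = (min_questions * buffer_pct + 99) // 100
    return min_questions + extra

# With 200 minimum questions and a 10% buffer, up to 220 tasks may be sent,
# but the session ends as soon as 200 tasks have been graded.
print(session_task_budget(200))
```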
Tasks CSV URL
The next step is to provide a CSV file with the tasks for your evaluation. Check our sample file for an example of what the file should look like. You can either host it yourself and provide a URL to the file, or you can use Google Sheets and provide the link to a Sheet with the appropriate data. In the case of Google Sheets, make sure to set its sharing permissions to "Anyone with the link", as otherwise we won't be able to access it.
This step will try to read the file, so you'll be notified right away if we can't access it.
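For illustration, this is the general shape of a tasks file: a header row naming the columns, then one task per row. The column names below are hypothetical - the actual names are whatever you map in the next step:

```python
import csv
import io

# Hypothetical tasks file; real column names come from your own header row.
sample = """question,correct answer,incorrect answer
What is 2 + 2?,4,5
Which planet is closest to the sun?,Mercury,Venus
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))            # 2 tasks
print(rows[0]["question"])  # What is 2 + 2?
```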
Configure CSV columns
In this step you specify how we should treat the various columns in your data. The first row should be a header containing the names of your columns. If you don't add this, then the values from the first row will be used as the header, which can be confusing. Each row will be made into a task on the basis of the mapping you specify. We do some basic guessing of the types of columns on the basis of the column names, but you'll most likely need to configure things anyway. Each column can be one of the following types:
Generic task columns
- Redacted - rows where this column is not empty (so values like `false` or `0` will be interpreted as true) will be imported as redacted tasks. Redacted tasks are not used when evaluating models or humans, so this is basically a way to disable tasks. There can only be one such column.
- Type - this specifies the type of the resulting task. Rows for which no value is defined will use the default task type. There can only be one such column. The following task types are supported: `mcq` - Multiple Choice Questions
- Question - this column type defines questions to be sent to models to be solved, e.g. `What time is it?`. You can have multiple question columns - a random one will be chosen to be sent to the model during testing. This can be used for questions like `What do you call a bending of the body in respect?` and `What is used to shoot arrows?`, both of which can be answered with `bow`.
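The random selection among question columns could look roughly like this sketch (skipping empty cells is an assumption of the sketch, not something the docs state):

```python
import random

def pick_question(row: dict, question_columns: list[str]) -> str:
    """Pick one of the configured question columns at random for this row,
    ignoring columns that are empty (an assumption of this sketch)."""
    candidates = [row[c] for c in question_columns if row.get(c, "").strip()]
    return random.choice(candidates)

row = {
    "q1": "What do you call a bending of the body in respect?",
    "q2": "What is used to shoot arrows?",
}
print(pick_question(row, ["q1", "q2"]))  # either phrasing; answer is "bow"
```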
Boolean question columns
- Correct - any rows which are `1` or a case-insensitive `true` or `yes` (so e.g. `TrUe`, `TRue` or `true`) will be deemed to be true statements. Anything else is false.
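The truthiness rule translates directly into code. This is a sketch of the documented rule, not the platform's actual parser; stripping surrounding whitespace is an assumption:

```python
def is_true_statement(value: str) -> bool:
    """'1' or case-insensitive 'true'/'yes' count as true statements;
    anything else is false. Whitespace stripping is an assumption."""
    v = value.strip()
    return v == "1" or v.lower() in {"true", "yes"}

print(is_true_statement("TrUe"))   # True
print(is_true_statement("0"))      # False
```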
Multiple Choice Question columns
- Correct answer - these columns contain correct answers. There can be multiple correct answer columns, but Multiple Choice Question tasks must have at least one. If you define more than 10 correct answer columns, we will ignore the additional ones.
- Incorrect answer - these columns contain incorrect answers. There can be multiple incorrect answer columns, but Multiple Choice Question tasks must have at least one. If you define more than 20 incorrect answer columns, we will ignore the additional ones.
Free response question columns
- Correct answer - these columns contain correct answers. There can be multiple correct answer columns, but Free Response Question tasks must have at least one. If you define more than 10 correct answer columns, we will ignore the additional ones.
- Incorrect answer - these columns contain incorrect answers. There can be multiple incorrect answer columns, but Free Response Question tasks must have at least one. If you define more than 20 incorrect answer columns, we will ignore the additional ones.
JSON question columns
- Schema - a JSON schema specifying the structure of the expected JSON. If this is provided, all responses must conform to this schema. If not provided, then the schema will be assumed to be any valid JSON. The schema can be provided via a reference (see below).
- Expected - an expected JSON object. The JSON returned by the model must have the same values as the expected object.
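Under the reading that "same values" means the parsed response must equal the expected object, the check could be sketched as follows (strict equality is an assumption; the service may compare more loosely):

```python
import json

def matches_expected(model_response: str, expected: object) -> bool:
    """Parse the model's response and compare it to the expected object.
    Invalid JSON never matches."""
    try:
        return json.loads(model_response) == expected
    except json.JSONDecodeError:
        return False

print(matches_expected('{"answer": 42}', {"answer": 42}))  # True
print(matches_expected('not json', {"answer": 42}))        # False
```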
Paraphrases
One problem with common evaluations is that they are often part of the training set. This can also happen with custom
tasks, as sending them to models can result in that text getting added to future training runs. To avoid this, you can
add paraphrases to your tasks - when a task has paraphrases defined, they will always be used, rather than the actual
text. You can define paraphrases for any text column types. There can be multiple paraphrases for each column - the more
the better. Paraphrases are declared in two steps:
- first, select the Paraphrase type for the column you want to use as a paraphrase
- next, select the column you are paraphrasing
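The substitution rule above - paraphrases, when defined, are always used instead of the original text - can be sketched as:

```python
import random

def text_to_send(original: str, paraphrases: list[str]) -> str:
    """If any paraphrases exist for a column, one is always used in place
    of the original text; otherwise the original is sent."""
    return random.choice(paraphrases) if paraphrases else original

print(text_to_send("What time is it?", []))                 # original text
print(text_to_send("What time is it?", ["Got the time?"]))  # paraphrase
```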
References
In the case of tasks with schemas, or other large objects, the uploaded files would quickly become very large, containing lots of duplicate values. To avoid this, we support references for some columns, where you can just provide a string identifier, rather than the whole object. Any columns that support references will check if a given row's value is in the set of known references, and if so, will use the schema that the reference is pointing to. Reference keys can contain English letters (upper and lowercase), digits and "-", "_", and ".".
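The allowed character set for reference keys maps to a simple pattern. The regex below is an illustration of the documented rule, not the platform's actual validator:

```python
import re

# English letters (upper and lowercase), digits, "-", "_" and "." per the rule above.
REFERENCE_KEY = re.compile(r"^[A-Za-z0-9._-]+$")

def is_valid_reference_key(key: str) -> bool:
    return REFERENCE_KEY.fullmatch(key) is not None

print(is_valid_reference_key("address-schema_v1.2"))  # True
print(is_valid_reference_key("bad key!"))             # False
```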
Check CSV file
Once you've defined the column mapping, the next step checks whether everything is correct. This performs a mock attempt at creating tasks from your file, checking only for errors. All rows with errors or warnings will be displayed so you can correct them. This step is optional - you can submit a file with errors, and those rows will simply be skipped.
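The skip-on-error behavior amounts to partitioning rows by a validation check. Here `validate` is a stand-in for whatever per-row checks the platform runs:

```python
def partition_rows(rows, validate):
    """Split rows into those that would import cleanly and those that
    would be skipped because of errors (a sketch of the dry-run step)."""
    imported, skipped = [], []
    for row in rows:
        (imported if validate(row) else skipped).append(row)
    return imported, skipped

rows = [{"question": "What is 2 + 2?"}, {"question": ""}]
good, bad = partition_rows(rows, lambda r: bool(r["question"].strip()))
print(len(good), len(bad))  # 1 1
```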
If your file has columns that support references, and you have rows that seem to have references, this step will display
a list of all detected references. You will have to fill them out here. Once you've specified the references, you can
click Recheck to check the CSV file again with your updated references. Any errors will be displayed below the appropriate
references.
Once your file has been checked, you can Save your evaluation. This will redirect you to the page of the newly created
evaluation. Your tasks probably won't be imported yet - it can take up to 15 minutes for all tasks to be added.