Evaluations
Create an evaluation
Evaluations can be created with the creation form. You need at least a Pro subscription to do this.
Basic evaluation settings
The first section of the evaluation creator contains basic information about the evaluation. This is mainly used
for display purposes - the only exception being Minimum questions to complete.
- Name - this is the name of the evaluation, which must be unique system-wide. In general it's first come, first served, but if you are the author of a very well known evaluation, we can help you work something out to avoid name squatting.
- Description - a description of what the evaluation checks, etc. This field will be rendered as markdown.
- Type - the types of tasks covered by this evaluation. This is purely informational - it doesn't limit you in any way.
- Modalities - the types of modalities used in your tasks. This is also purely informational.
- Visibility - this setting specifies whether your evaluation is visible to everyone. By default, evaluations are publicly visible and can be used to test all available models. Restricting visibility is only available to enterprise users.
- Minimum questions to complete - this controls the number of tasks per evaluation session. When a model is tested
with your evaluation, we will run at least this many tasks. The actual number of tasks seen by a model can be larger
because of various network or model errors. We retry tasks that didn't get a proper answer from a model (e.g. because
of rate limiting), but only a limited number of times. Because of this, we add 5-10% extra tasks to compensate for
tasks that didn't get proper responses. This doesn't mean that all of the extra tasks will be sent - as soon as we
manage to grade Minimum questions to complete tasks, we finish the evaluation session.
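The buffering rule above can be sketched as a small calculation. The function name and the exact ceiling arithmetic here are illustrative assumptions, not the platform's actual implementation:

```python
def session_task_budget(min_questions: int, buffer_pct: int = 10) -> int:
    """Hypothetical upper bound on tasks sent in one session: the
    "Minimum questions to complete" setting plus a 5-10% retry buffer
    (10% assumed here), rounded up via integer ceiling division."""
    extra = (min_questions * buffer_pct + 99) // 100
    return min_questions + extra

# With 200 minimum questions and a 10% buffer, up to 220 tasks may be sent,
# but the session ends as soon as 200 tasks have been graded.
print(session_task_budget(200))
```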
Tasks CSV URL
The next step is to provide a CSV file with the tasks for your evaluation. Check our sample file for an example of what the file should look like. You can either host it yourself and provide a URL to the file, or you can use Google Sheets and provide the link to a Sheet with the appropriate data. In the case of Google Sheets, make sure to set its sharing permissions to "Anyone with the link", as otherwise we won't be able to access it.
This step will try to read the file, so you'll be notified right away if we can't access it.
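For illustration, this is the general shape of a tasks file: a header row naming the columns, then one task per row. The column names below are hypothetical - the actual names are whatever you map in the next step:

```python
import csv
import io

# Hypothetical tasks file; real column names come from your own header row.
sample = """question,correct answer,incorrect answer
What is 2 + 2?,4,5
Which planet is closest to the sun?,Mercury,Venus
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))            # 2 tasks
print(rows[0]["question"])  # What is 2 + 2?
```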
Configure CSV columns
In this step you specify how we should treat the various columns in your data. The first row should be a header containing the names of your columns. If you don't add this, then the values from the first row will be used as the header, which can be confusing. Each row will be made into a task on the basis of the mapping you specify. We do some basic guessing of the types of columns on the basis of the column names, but you'll most likely need to configure things anyway. Each column can be one of the following types:
Generic task columns
- Redacted - rows where this column is not empty (so values like `false` or `0` will be interpreted as true) will be imported as redacted tasks. Redacted tasks are not used when evaluating models or humans, so this is basically a way to disable tasks. There can only be one such column.
- Type - this specifies the type of the resulting task. Rows for which no value is defined will use the default task type. There can only be one such column. The following task types are supported: `mcq` - Multiple Choice Questions
- Question - this column type defines questions to be sent to models to be solved, e.g. `What time is it?`. You can have multiple question columns - a random one will be chosen to be sent to the model during testing. This can be used for questions like `What do you call a bending of the body in respect?` and `What is used to shoot arrows?`, both of which can be answered with `bow`.
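The random selection among question columns could look roughly like this sketch (skipping empty cells is an assumption of the sketch, not something the docs state):

```python
import random

def pick_question(row: dict, question_columns: list[str]) -> str:
    """Pick one of the configured question columns at random for this row,
    ignoring columns that are empty (an assumption of this sketch)."""
    candidates = [row[c] for c in question_columns if row.get(c, "").strip()]
    return random.choice(candidates)

row = {
    "q1": "What do you call a bending of the body in respect?",
    "q2": "What is used to shoot arrows?",
}
print(pick_question(row, ["q1", "q2"]))  # either phrasing; answer is "bow"
```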
Boolean question columns
- Correct - any rows which are `1` or a case-insensitive `true` or `yes` (so e.g. `TrUe`, `TRue` or `true`) will be deemed to be true statements. Anything else is false.
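The truthiness rule translates directly into code. This is a sketch of the documented rule, not the platform's actual parser; stripping surrounding whitespace is an assumption:

```python
def is_true_statement(value: str) -> bool:
    """'1' or case-insensitive 'true'/'yes' count as true statements;
    anything else is false. Whitespace stripping is an assumption."""
    v = value.strip()
    return v == "1" or v.lower() in {"true", "yes"}

print(is_true_statement("TrUe"))   # True
print(is_true_statement("0"))      # False
```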
Multiple Choice Question columns
- Correct answer - these columns contain correct answers. There can be multiple correct answer columns, but Multiple Choice Question tasks must have at least one. If you define more than 10 correct answer columns, we will ignore the additional ones.
- Incorrect answer - these columns contain incorrect answers. There can be multiple incorrect answer columns, but Multiple Choice Question tasks must have at least one. If you define more than 20 incorrect answer columns, we will ignore the additional ones.
Free response question columns
- Correct answer - these columns contain correct answers. There can be multiple correct answer columns, but Free Response Question tasks must have at least one. If you define more than 10 correct answer columns, we will ignore the additional ones.
- Incorrect answer - these columns contain incorrect answers. There can be multiple incorrect answer columns, but Free Response Question tasks must have at least one. If you define more than 20 incorrect answer columns, we will ignore the additional ones.
JSON question columns
- Schema - a JSON schema specifying the structure of the expected JSON. If this is provided, all responses must conform to this schema. If not provided, then the schema will be assumed to be any valid JSON. The schema can be provided via a reference (see below).
- Expected - an expected JSON object. The JSON returned by the model must have the same values as the expected object.
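Under the reading that "same values" means the parsed response must equal the expected object, the check could be sketched as follows (strict equality is an assumption; the service may compare more loosely):

```python
import json

def matches_expected(model_response: str, expected: object) -> bool:
    """Parse the model's response and compare it to the expected object.
    Invalid JSON never matches."""
    try:
        return json.loads(model_response) == expected
    except json.JSONDecodeError:
        return False

print(matches_expected('{"answer": 42}', {"answer": 42}))  # True
print(matches_expected('not json', {"answer": 42}))        # False
```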
Paraphrases
One problem with common evaluations is that they are often part of the training set. This can also happen with custom
tasks, as sending them to models can result in that text getting added to future training runs. To avoid this, you can
add paraphrases to your tasks - when a task has paraphrases defined, they will always be used, rather than the actual
text. You can define paraphrases for any text column types. There can be multiple paraphrases for each column - the more
the better. Paraphrases are declared in two steps:
- first, select the Paraphrase type for the column you want to use as a paraphrase
- next, select the column you are paraphrasing
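The substitution rule above - paraphrases, when defined, are always used instead of the original text - can be sketched as:

```python
import random

def text_to_send(original: str, paraphrases: list[str]) -> str:
    """If any paraphrases exist for a column, one is always used in place
    of the original text; otherwise the original is sent."""
    return random.choice(paraphrases) if paraphrases else original

print(text_to_send("What time is it?", []))                 # original text
print(text_to_send("What time is it?", ["Got the time?"]))  # paraphrase
```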
References
In the case of tasks with schemas, or other large objects, the uploaded files would quickly become very large, containing lots of duplicate values. To avoid this, we support references for some columns, where you can just provide a string identifier, rather than the whole object. Any columns that support references will check if a given row's value is in the set of known references, and if so, will use the schema that the reference is pointing to. Reference keys can contain English letters (upper and lowercase), digits and "-", "_", and ".".
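The allowed character set for reference keys maps to a simple pattern. The regex below is an illustration of the documented rule, not the platform's actual validator:

```python
import re

# English letters (upper and lowercase), digits, "-", "_" and "." per the rule above.
REFERENCE_KEY = re.compile(r"^[A-Za-z0-9._-]+$")

def is_valid_reference_key(key: str) -> bool:
    return REFERENCE_KEY.fullmatch(key) is not None

print(is_valid_reference_key("address-schema_v1.2"))  # True
print(is_valid_reference_key("bad key!"))             # False
```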
Check CSV file
Once you've defined the column mapping, the next step checks whether everything is correct. This performs a mock attempt at creating tasks from your file, checking only for errors. All rows with errors or warnings will be displayed so you can correct them. This step is optional - you can submit a file with errors, and those rows will simply be skipped.
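The skip-on-error behavior amounts to partitioning rows by a validation check. Here `validate` is a stand-in for whatever per-row checks the platform runs:

```python
def partition_rows(rows, validate):
    """Split rows into those that would import cleanly and those that
    would be skipped because of errors (a sketch of the dry-run step)."""
    imported, skipped = [], []
    for row in rows:
        (imported if validate(row) else skipped).append(row)
    return imported, skipped

rows = [{"question": "What is 2 + 2?"}, {"question": ""}]
good, bad = partition_rows(rows, lambda r: bool(r["question"].strip()))
print(len(good), len(bad))  # 1 1
```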
If your file has columns that support references, and you have rows that seem to have references, this step will display
a list of all detected references. You will have to fill them out here. Once you've specified the references, you can
click Recheck to check the CSV file again with your updated references. Any errors will be displayed below the appropriate
references.
Once your file has been checked, you can Save your evaluation. This will redirect you to the page of the newly created
evaluation. Your tasks probably won't be imported yet - it can take up to 15 minutes for all tasks to be added.