Evaluate GritLM-7B on MTEB datasets #57

@ThisisXXZ

I am trying to evaluate GritLM-7B on MTEB datasets using the provided script.

#!/bin/bash

python /home/e/e1347696/unified_encoder_decoder/src/eval/MTEB/eval_mteb.py \
    --model_name_or_path /home/e/e1347696/unified_encoder_decoder/model/GritLM-7B \
    --output_folder /home/e/e1347696/unified_encoder_decoder/src/results/GritLM-7B-mteb \
    --task_types Classification,Clustering,PairClassification,Reranking,Retrieval,STS,Summarization \
    --batch_size 32

However, only the following datasets appear to have been evaluated:

  • AmazonCounterfactualClassification
  • AmazonReviewsClassification
  • MassiveIntentClassification
  • MassiveScenarioClassification
  • MTOPDomainClassification
  • MTOPIntentClassification
  • STS17
  • STS22

The other datasets seem to have been skipped. The output log is shown below:

Created GritLM: torch.bfloat16 dtype, mean pool, embedding mode, bbcc attn
GritLM-7B instruction for AmazonCounterfactualClassification:  <|user|>
Classify a given Amazon customer review text as either counterfactual or not-counterfactual
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 1 / 4 Subsets


GritLM-7B instruction for AmazonReviewsClassification:  <|user|>
Classify the given Amazon review into its appropriate rating category
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MasakhaNEWSClassification
GritLM-7B instruction for MassiveIntentClassification:  <|user|>
Given a user utterance as query, find the user intents
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MassiveScenarioClassification:  <|user|>
Given a user utterance as query, find the user scenarios
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MTOPDomainClassification:  <|user|>
Classify the intent domain of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets


GritLM-7B instruction for MTOPIntentClassification:  <|user|>
Classify the intent of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MultiHateClassification
Skipping task: MultilingualSentimentClassification
Skipping task: NusaX-senti
Skipping task: SIB200Classification
Skipping task: SouthAfricanLangClassification
Skipping task: MasakhaNEWSClusteringP2P
Skipping task: MasakhaNEWSClusteringS2S
Skipping task: SIB200ClusteringS2S
Skipping task: BelebeleRetrieval
Skipping task: MIRACLRetrieval
Skipping task: MIRACLRetrievalHardNegatives
Skipping task: MLQARetrieval
Skipping task: MultiLongDocRetrieval
Skipping task: WikipediaRetrievalMultilingual
Skipping task: XMarket
Skipping task: XQuADRetrieval
Skipping task: OpusparcusPC
Skipping task: PawsXPairClassification
Skipping task: RTE3
Skipping task: XNLI
Skipping task: MIRACLReranking
Skipping task: WikipediaRerankingMultilingual
Skipping task: SemRel24STS
GritLM-7B instruction for STS17:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS17, s2s, multilingual 1 / 11 Subsets


Skipping task: STS22.v2
GritLM-7B instruction for STS22:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS22, p2p, multilingual 1 / 18 Subsets


Skipping task: STSBenchmarkMultilingualSTS
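
To make it easier to see which tasks ran and which were skipped, here is a minimal sketch that summarizes a log like the one above. It assumes the two line formats seen in this log ("Skipping task: <name>" and "... instruction for <name>:"). The function name `summarize_log` and the regexes are my own and not part of the GritLM or MTEB code.

```python
import re

# Patterns are assumptions based on the log lines shown in this issue.
RUN_RE = re.compile(r"instruction for (\S+?):")
SKIP_RE = re.compile(r"^Skipping task: (\S+)")


def summarize_log(log_text):
    """Return (evaluated, skipped) task names in the order they appear."""
    evaluated, skipped = [], []
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        skip_match = SKIP_RE.match(line)
        if skip_match:
            skipped.append(skip_match.group(1))
            continue
        run_match = RUN_RE.search(line)
        if run_match:
            evaluated.append(run_match.group(1))
    return evaluated, skipped


if __name__ == "__main__":
    sample = (
        "Skipping task: MasakhaNEWSClassification\n"
        "GritLM-7B instruction for STS17:  <|user|>\n"
        "Retrieve semantically similar text.\n"
        "Skipping task: STS22.v2\n"
    )
    evaluated, skipped = summarize_log(sample)
    print("evaluated:", evaluated)
    print("skipped:", skipped)
```

Tasks that appear in neither list were never reached by the script at all, which is a different case from being explicitly skipped.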

The error log also contains warnings such as:

The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
The `task_langs` argument is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(... languages = [...])` to filter tasks instead. Note that this uses 3 letter language codes (ISO 639-3).
Passing task names as strings is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(tasks=[...])` method to get tasks instead.
The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
Dataset 'STS22' is superseeded by 'STS22.v2', you might consider using the newer version of the dataset.
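
For reference, this is roughly what the deprecation warnings seem to be asking for, written as a hedged sketch rather than a fix to `eval_mteb.py`. It assumes a recent `mteb` release where `mteb.get_tasks`, `mteb.MTEB(tasks=...)` and `run(..., encode_kwargs=...)` are available, and that English (`"eng"`, an ISO 639-3 code) is the intended language. The helper names `parse_task_types` and `run_english_tasks` are hypothetical.

```python
def parse_task_types(arg):
    """Split a comma-separated --task_types value into a clean list."""
    return [part.strip() for part in arg.split(",") if part.strip()]


def run_english_tasks(model, task_types_arg, output_folder, batch_size=32):
    """Select tasks via mteb.get_tasks and run them (assumes a recent mteb)."""
    # Imported lazily so parse_task_types can be used without mteb installed.
    import mteb

    tasks = mteb.get_tasks(
        task_types=parse_task_types(task_types_arg),
        languages=["eng"],  # assumption: only English is wanted
    )
    evaluation = mteb.MTEB(tasks=tasks)
    # The warnings ask for batch size to be passed through encode_kwargs.
    return evaluation.run(
        model,
        output_folder=output_folder,
        encode_kwargs={"batch_size": batch_size},
    )


if __name__ == "__main__":
    print(parse_task_types("Classification,Clustering,STS"))
```

`model` here would be the already-loaded GritLM object; how it is constructed is left to the existing script.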

I would really appreciate any help with this. Thank you so much!
