
Posted by User Bot on 25 Mar, 2025 (updated 18 May, 2025)

Use of Training, Validation and Test set in HuggingFace Seq2SeqTrainer

I have the following Dataset, which has 3 splits (train, validation and test). The data are parallel corpus of 2 languages.

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 109942
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6545
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 13743
    })
})

For my Seq2SeqTrainer, I supply the dataset as follows:

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)

Am I correct to put the validation split in eval_dataset? The documentation says:

The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. If it is a dictionary, it will evaluate on each dataset prepending the dictionary key to the metric name.

Or should I put the test split in eval_dataset instead? Either way, is it true that one of the splits is never used?
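For context, my current understanding (this is the standard train/validation/test convention, not something I found stated specifically for Seq2SeqTrainer) is that eval_dataset is scored repeatedly during training to monitor progress and pick checkpoints, while the test split is scored exactly once after training, e.g. via trainer.evaluate(tokenized_dataset['test']) or trainer.predict(...). A toy sketch in plain Python (all names here are made up for illustration, no transformers involved):

```python
# Hypothetical sketch of the usual three-split workflow:
# the validation split is evaluated once per epoch during training,
# while the test split is evaluated exactly once at the end.

def run_training(num_epochs, validation, test, evaluate):
    """Simulate a training loop; `evaluate` stands in for metric computation."""
    for _epoch in range(num_epochs):
        # ... one pass over the train split would happen here ...
        evaluate(validation)  # periodic check, like eval_dataset
    # One-time final check on data that never influenced training choices:
    evaluate(test)

if __name__ == "__main__":
    counts = {"validation": 0, "test": 0}

    def evaluate(split_name):
        counts[split_name] += 1  # count how often each split is scored
        return 0.0  # placeholder metric value

    run_training(3, "validation", "test", evaluate)
    print(counts)  # validation scored once per epoch; test scored once
```

If that understanding is right, then during training itself the test split is indeed unused, but it is not wasted overall.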