I have the following dataset, which has 3 splits (`train`, `validation` and `test`). The data are a parallel corpus of 2 languages.
```
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 109942
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6545
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 13743
    })
})
```
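For reference, each row of a parallel corpus like this typically stores one sentence per language under the `translation` key. A minimal sketch of that structure and of how a preprocessing function might pull out source/target pairs before tokenization (the language codes and texts here are hypothetical, not from the actual dataset):

```python
# Hypothetical example row from a parallel-corpus dataset:
example = {"translation": {"en": "Hello, world.", "fr": "Bonjour, le monde."}}

def extract_pairs(batch):
    """Flatten a batch of translation dicts into source/target lists."""
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["fr"] for ex in batch["translation"]]
    return sources, targets

# A batched `map` call would receive the column as a list of dicts:
batch = {"translation": [example["translation"]]}
src, tgt = extract_pairs(batch)
# src holds the English sentences, tgt the French ones
```

In practice these lists would then be passed to the tokenizer inside a `dataset.map(..., batched=True)` call to produce `tokenized_dataset`.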
For my `Seq2SeqTrainer`, I supply the dataset as follows:
```python
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
```
Am I correct to put the `validation` split in `eval_dataset`? The documentation says:

> The dataset to use for evaluation. If it is a `Dataset`, columns not accepted by the `model.forward()` method are automatically removed. If it is a dictionary, it will evaluate on each dataset, prepending the dictionary key to the metric name.
Or should I put the `test` split in `eval_dataset`? Either way, is it true that one of the splits is not used?
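For context, a common convention (a sketch of one possible setup, not necessarily what your task requires) is to let the `validation` split drive evaluation during training via `eval_dataset`, and then score the held-out `test` split exactly once after training with `Seq2SeqTrainer.predict`:

```python
# Assumes `trainer` and `tokenized_dataset` are defined as in the question.

# During training, `eval_dataset` (the validation split) is evaluated
# according to the evaluation strategy in `training_args`:
trainer.train()

# After training, run the test split once; `predict` returns a named tuple
# with predictions, label_ids and metrics (metric keys prefixed "test_"):
test_results = trainer.predict(tokenized_dataset['test'])
print(test_results.metrics)
```

Under that convention, no split goes unused: `train` fits the model, `validation` guides checkpoint/hyperparameter choices during training, and `test` gives a final unbiased score.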