Tokenize the dataset using the Dataset.map() function
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# no padding at this stage; padding will be applied per batch later
def f(x):
    return tokenizer(x["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(f, batched=True).with_format("pytorch")
Loading cached processed dataset at /home/limin/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-989431ea55d09aff.arrow
Loading cached processed dataset at /home/limin/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-dcfc5e44e548784f.arrow
Loading cached processed dataset at /home/limin/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-974d6c4aa35125e2.arrow
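The next cell uses train_dataset and data_collator, which are set up earlier in the notebook. As a minimal sketch of what that setup typically looks like (the split and the removed column names are assumptions here, not shown in this excerpt), a DataCollatorWithPadding pads each batch dynamically to the length of its longest sequence, which is why no padding was applied during map():

from transformers import DataCollatorWithPadding

# assumed setup: pad dynamically per batch instead of padding the whole dataset
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# assumed column cleanup: drop the raw text and index columns, which the
# collator cannot convert to tensors, keeping only the model inputs and label
train_dataset = tokenized_datasets["train"].remove_columns(["sentence", "idx"])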
# try it with some samples
samples = train_dataset[:10]
print([len(x) for x in samples["input_ids"]])

# collate into a single padded batch and inspect the resulting tensor shapes
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
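This warning is emitted because the data collator calls tokenizer.pad() on already-encoded samples; with a fast (Rust-backed) tokenizer it is harmless. It is only pointing out that, when you control the raw text yourself, a single __call__ can tokenize, truncate, pad, and return tensors in one pass. A small sketch (the sentences are made up for illustration):

# one call handles tokenization, truncation, padding, and tensor conversion
encoded = tokenizer(
    ["a charming film", "a boring, overlong mess"],  # hypothetical examples
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)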