easydel.trainers.packer
- easydel.trainers.packer.pack_sequences(dataset: Any, max_length: int = 512, pad_token_id: int = 0, reset_position_ids: bool = False, num_proc: Optional[int] = None)
Pack sequences together, along with their attention masks and position IDs.
# With continuous position IDs
packed_dataset = pack_sequences(
    dataset, max_length=512, pad_token_id=0, reset_position_ids=False
)

# With reset position IDs for each sequence
packed_dataset = pack_sequences(
    dataset, max_length=512, pad_token_id=0, reset_position_ids=True
)

# Example output format for a packed sequence with two sequences:
# reset_position_ids=False:
{
    'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    'position_ids': [0,1,2,3,4,5,6,7,0,0]
}
# reset_position_ids=True:
{
    'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    'position_ids': [0,1,2,0,0,1,2,0,0,0]
}
- Parameters
dataset – Dataset containing ‘input_ids’ and ‘attention_mask’
max_length – Maximum length of packed sequence
pad_token_id – Token ID used for padding
reset_position_ids – If True, reset position IDs for each sequence in the pack
- Returns
Dataset with packed sequences, attention masks, and position IDs
- Return type
packed_dataset
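The output layout documented above can be reproduced with a small pure-Python sketch. This is not the library's implementation: `pack_row` is a hypothetical helper written only to illustrate how the two `reset_position_ids` modes differ, and it assumes a simple greedy strategy that fills one row and skips anything that does not fit.

```python
from typing import Dict, List


def pack_row(
    sequences: List[List[int]],
    max_length: int = 10,
    pad_token_id: int = 0,
    reset_position_ids: bool = False,
) -> Dict[str, List[int]]:
    """Hypothetical helper: pack sequences into a single fixed-length row."""
    input_ids: List[int] = []
    attention_mask: List[int] = []
    position_ids: List[int] = []

    for seq in sequences:
        # Each sequence needs room for its tokens plus one separator PAD.
        # Assumption: sequences that do not fit are simply left out here.
        if len(input_ids) + len(seq) + 1 > max_length:
            break
        # Positions restart at 0 when resetting, otherwise continue the row index.
        start = 0 if reset_position_ids else len(input_ids)
        input_ids.extend(seq)
        attention_mask.extend([1] * len(seq))
        position_ids.extend(range(start, start + len(seq)))
        # Separator PAD is masked out; its position is 0 when resetting,
        # otherwise it takes the next index in the row.
        position_ids.append(0 if reset_position_ids else len(position_ids))
        input_ids.append(pad_token_id)
        attention_mask.append(0)

    # Fill the rest of the row with padding (mask 0, position 0).
    remaining = max_length - len(input_ids)
    input_ids.extend([pad_token_id] * remaining)
    attention_mask.extend([0] * remaining)
    position_ids.extend([0] * remaining)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


# Two three-token sequences packed into a row of length 10.
continuous = pack_row([[11, 12, 13], [21, 22, 23]], reset_position_ids=False)
reset = pack_row([[11, 12, 13], [21, 22, 23]], reset_position_ids=True)
print(continuous["position_ids"])  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 0]
print(reset["position_ids"])       # [0, 1, 2, 0, 0, 1, 2, 0, 0, 0]
```

Both modes produce the same `input_ids` and `attention_mask`; only `position_ids` changes, which is why the flag matters mainly for models that rely on per-sequence positional information.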