easydel.trainers.packer
Sequence packing utilities for efficient training.
This module provides functionality to pack multiple sequences into fixed-length batches, maximizing GPU/TPU utilization by reducing padding waste. It handles attention masks and position IDs correctly for packed sequences.
Packing is especially beneficial when training on datasets with varying sequence lengths, as it reduces the amount of wasted computation on padding tokens.
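To make the padding savings concrete, the sketch below compares two ways of batching sequences of different lengths. It is illustrative only and does not call the library: `padded_slots` and `packed_slots` are hypothetical helpers, the baseline assumes every sequence is padded to `max_length` on its own, and the packed case assumes greedy in-order packing with one separator slot after each sequence.

```python
def padded_slots(lengths, max_length):
    """Token slots used when each sequence is padded to max_length on its own."""
    return len(lengths) * max_length


def packed_slots(lengths, max_length):
    """Token slots used by greedy in-order packing (assumed strategy).

    Each sequence is assumed to take len + 1 slots (tokens plus one separator)
    and to fit in a pack by itself, i.e. length < max_length.
    """
    packs, used = 0, 0
    for n in lengths:
        need = n + 1
        if used == 0 or used + need > max_length:
            packs += 1  # open a new pack
            used = 0
        used += need
    return packs * max_length


lengths = [3, 5, 2, 7]
real_tokens = sum(lengths)  # 17 real tokens in total
print(real_tokens / padded_slots(lengths, max_length=16))  # 17 / 64
print(real_tokens / packed_slots(lengths, max_length=16))  # 17 / 32
```

Under these assumptions the share of slots holding real tokens doubles, because four padded rows collapse into two packs.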
- easydel.trainers.packer.pack_sequences(dataset: Dataset, max_length: int = 512, pad_token_id: int = 0, reset_position_ids: bool = False, num_proc: int | None = None)
Pack multiple sequences into fixed-length batches for efficient training.
Combines multiple variable-length sequences into fixed-size packed sequences, reducing padding waste and improving training efficiency. An attention mask and position IDs are produced for every packed sequence.
# With continuous position IDs
packed_dataset = pack_sequences(
    dataset, max_length=512, pad_token_id=0, reset_position_ids=False
)

# With reset position IDs for each sequence
packed_dataset = pack_sequences(
    dataset, max_length=512, pad_token_id=0, reset_position_ids=True
)

# Example output format for a packed sequence with two sequences:
# reset_position_ids=False:
{
    'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    'position_ids': [0,1,2,3,4,5,6,7,0,0]
}
# reset_position_ids=True:
{
    'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    'position_ids': [0,1,2,0,0,1,2,0,0,0]
}
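The output format above can be reproduced with a small pure-Python sketch. This is not the library's implementation: `pack_greedy` and `_finish` are hypothetical names, and the greedy in-order strategy and the truncation of over-long sequences are assumptions made for illustration. Only the layout of `input_ids`, `attention_mask`, and `position_ids` follows the documented example.

```python
def _finish(ids, mask, pos, max_length, pad_token_id):
    """Pad a partially filled pack up to max_length."""
    fill = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * fill,
        "attention_mask": mask + [0] * fill,
        "position_ids": pos + [0] * fill,
    }


def pack_greedy(sequences, max_length=10, pad_token_id=0, reset_position_ids=False):
    """Pack sequences in order, one separator after each (illustrative sketch)."""
    packs = []
    ids, mask, pos = [], [], []
    for seq in sequences:
        # Assumption: over-long sequences are truncated so that the sequence
        # plus its separator fits in a single pack.
        seq = list(seq)[: max_length - 1]
        if not seq:
            continue
        # Start a new pack when this sequence and its separator would not fit.
        if ids and len(ids) + len(seq) + 1 > max_length:
            packs.append(_finish(ids, mask, pos, max_length, pad_token_id))
            ids, mask, pos = [], [], []
        start = 0 if reset_position_ids else len(ids)
        sep_pos = 0 if reset_position_ids else start + len(seq)
        ids = ids + seq + [pad_token_id]
        mask = mask + [1] * len(seq) + [0]
        pos = pos + list(range(start, start + len(seq))) + [sep_pos]
    if ids:
        packs.append(_finish(ids, mask, pos, max_length, pad_token_id))
    return packs


pack = pack_greedy([[11, 12, 13], [21, 22, 23]])[0]
print(pack["attention_mask"])  # [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
print(pack["position_ids"])    # [0, 1, 2, 3, 4, 5, 6, 7, 0, 0]
```

Passing `reset_position_ids=True` to the same call yields `[0, 1, 2, 0, 0, 1, 2, 0, 0, 0]` for `position_ids`, matching the second documented example.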
- Parameters
dataset – HuggingFace Dataset containing ‘input_ids’ and ‘attention_mask’ columns. Each example should have variable-length sequences to pack.
max_length – Maximum length of each packed sequence (default 512). Sequences are packed until this limit is reached.
pad_token_id – Token ID used for padding and as separator between packed sequences (default 0).
reset_position_ids – If True, position IDs reset to 0 for each sequence within a pack. If False, position IDs are continuous across packed sequences (default False).
num_proc – Number of processes to use for parallel processing. None uses a single process (default None).
- Returns
- New dataset with packed sequences containing:
‘input_ids’: Packed token sequences
‘attention_mask’: Attention masks (0 for padding/separators)
‘position_ids’: Position IDs for each token
- Return type
Dataset
- Raises
KeyError – If dataset doesn’t contain required columns.
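Callers who prefer a clearer error before packing starts can check the columns up front. The helper below is a hypothetical pre-flight check, not part of the library; it only assumes the two required column names listed under Parameters.

```python
REQUIRED_COLUMNS = ("input_ids", "attention_mask")


def check_columns(column_names):
    """Raise KeyError if a required column is missing (hypothetical helper)."""
    missing = [name for name in REQUIRED_COLUMNS if name not in column_names]
    if missing:
        raise KeyError(f"dataset is missing required columns: {missing}")


# With a HuggingFace Dataset this would be check_columns(dataset.column_names).
check_columns(["input_ids", "attention_mask", "labels"])  # passes silently
```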
Note
Sequences are separated by pad_token_id with attention_mask=0.
Remaining space in the last pack is filled with padding.
Position IDs are either continuous across a pack or reset to 0 at the start of each sequence, depending on reset_position_ids.
Packing is most useful when sequence lengths vary widely.
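Because separators and padding both carry attention_mask=0, the original sequences can be recovered from a pack by collecting the runs of tokens whose mask is 1. The sketch below illustrates this; `split_pack` is a hypothetical helper and relies only on the documented output layout.

```python
def split_pack(input_ids, attention_mask):
    """Return the runs of tokens whose attention_mask is 1."""
    segments, current = [], []
    for token, keep in zip(input_ids, attention_mask):
        if keep:
            current.append(token)
        elif current:
            # A masked position closes the sequence collected so far.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


ids = [11, 12, 13, 0, 21, 22, 23, 0, 0, 0]
mask = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
print(split_pack(ids, mask))  # [[11, 12, 13], [21, 22, 23]]
```

The split depends on the mask rather than on the token values, so it still works when pad_token_id also occurs inside a sequence.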