easydel.trainers.packer

Sequence packing utilities for efficient training.

This module provides functionality to pack multiple sequences into fixed-length batches, maximizing GPU/TPU utilization by reducing padding waste. It handles attention masks and position IDs correctly for packed sequences.

Packing is especially beneficial when training on datasets with varying sequence lengths, as it reduces the amount of wasted computation on padding tokens.
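The saving can be estimated with simple arithmetic. The sketch below is illustrative only and is not the module's implementation: `count_packs` and `padding_fraction` are hypothetical helpers, and the sketch assumes a greedy, in-order packer that never splits a sequence across packs and spends one separator slot per sequence.

```python
def count_packs(lengths, max_length):
    """Count packs for a greedy in-order packer (assumed strategy, not EasyDeL's code)."""
    packs, used = 0, 0
    for n in lengths:
        need = n + 1  # the sequence's tokens plus one separator slot
        if need > max_length:
            raise ValueError(f"sequence of length {n} does not fit in {max_length}")
        # Open a new pack for the first sequence or when the current one is full.
        if packs == 0 or used + need > max_length:
            packs += 1
            used = 0
        used += need
    return packs


def padding_fraction(lengths, max_length, num_rows):
    """Fraction of token slots that do not hold a real token."""
    return 1 - sum(lengths) / (num_rows * max_length)


lengths = [3, 5, 2, 4, 1, 6]
max_length = 12

# One row per sequence, each padded to max_length on its own.
unpacked = padding_fraction(lengths, max_length, len(lengths))
# Several sequences per row, separators and trailing padding counted as waste.
num_packs = count_packs(lengths, max_length)
packed = padding_fraction(lengths, max_length, num_packs)

print(f"rows without packing: {len(lengths)}, wasted slots: {unpacked:.1%}")
print(f"rows with packing:    {num_packs}, wasted slots: {packed:.1%}")
```

The more the sequence lengths vary below `max_length`, the larger the gap between the two figures tends to be.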

easydel.trainers.packer.pack_sequences(dataset: Dataset, max_length: int = 512, pad_token_id: int = 0, reset_position_ids: bool = False, num_proc: int | None = None)

Pack multiple sequences into fixed-length batches for efficient training.

Combines multiple variable-length sequences into fixed-size packed sequences, reducing padding waste and improving training efficiency. Correctly handles attention masks and position IDs for packed sequences.

    from easydel.trainers.packer import pack_sequences

    # With continuous position IDs
    packed_dataset = pack_sequences(
        dataset, max_length=512, pad_token_id=0, reset_position_ids=False
    )

    # With reset position IDs for each sequence
    packed_dataset = pack_sequences(
        dataset, max_length=512, pad_token_id=0, reset_position_ids=True
    )

    # Example output format for a packed sequence with two sequences:
    # reset_position_ids=False:
    # {
    #     'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    #     'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    #     'position_ids': [0,1,2,3,4,5,6,7,0,0]
    # }

    # reset_position_ids=True:
    # {
    #     'input_ids': [seq1_tokens + [PAD] + seq2_tokens + [PAD] + padding],
    #     'attention_mask': [1,1,1,0,1,1,1,0,0,0],
    #     'position_ids': [0,1,2,0,0,1,2,0,0,0]
    # }
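To make the two position-ID modes concrete, here is a minimal sketch that reproduces the example output above for a single pack. `build_pack` is a hypothetical helper written for illustration; it is not part of EasyDeL and may differ from how `pack_sequences` is actually implemented.

```python
def build_pack(sequences, max_length, pad_token_id=0, reset_position_ids=False):
    """Concatenate token lists into one fixed-length pack (illustrative only)."""
    input_ids, attention_mask, position_ids = [], [], []
    for seq in sequences:
        # Continuous mode counts from the start of the pack; reset mode
        # restarts at 0 for every sequence.
        offset = 0 if reset_position_ids else len(input_ids)
        # The separator is a pad token with attention_mask=0.
        input_ids += list(seq) + [pad_token_id]
        attention_mask += [1] * len(seq) + [0]
        sep_position = 0 if reset_position_ids else offset + len(seq)
        position_ids += [offset + i for i in range(len(seq))] + [sep_position]

    if len(input_ids) > max_length:
        raise ValueError("sequences plus separators exceed max_length")

    # Fill the remaining space with padding (mask 0, position 0).
    fill = max_length - len(input_ids)
    input_ids += [pad_token_id] * fill
    attention_mask += [0] * fill
    position_ids += [0] * fill
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


seq1, seq2 = [11, 12, 13], [21, 22, 23]
continuous = build_pack([seq1, seq2], max_length=10)
reset = build_pack([seq1, seq2], max_length=10, reset_position_ids=True)
print(continuous["position_ids"])
print(reset["position_ids"])
```

In both modes the attention mask is identical; only the position IDs differ.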

Parameters
  • dataset – HuggingFace Dataset containing 'input_ids' and 'attention_mask' columns. Examples may have different sequence lengths.

  • max_length – Maximum length of each packed sequence (default 512). Sequences are packed until this limit is reached.

  • pad_token_id – Token ID used for padding and as separator between packed sequences (default 0).

  • reset_position_ids – If True, position IDs reset to 0 for each sequence within a pack. If False, position IDs are continuous across packed sequences (default False).

  • num_proc – Number of processes to use for parallel processing. None uses a single process (default None).

Returns

New dataset with packed sequences containing:
  • 'input_ids': Packed token sequences

  • 'attention_mask': Attention masks (0 for padding/separators)

  • 'position_ids': Position indices for each token

Return type

Dataset

Raises

KeyError – If dataset doesn’t contain required columns.
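One way to get a clearer failure than a bare KeyError is to check the column names before packing. The sketch below uses a hypothetical helper, `missing_columns`, and a plain list of names so that it runs on its own; with a HuggingFace Dataset the names would come from `dataset.column_names`.

```python
REQUIRED_COLUMNS = ("input_ids", "attention_mask")


def missing_columns(column_names):
    """Return the required columns that are absent, in a stable order."""
    present = set(column_names)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Stand-in for dataset.column_names on a dataset that was never tokenized
# with an attention mask.
columns = ["text", "input_ids"]
missing = missing_columns(columns)
if missing:
    print(f"cannot pack: dataset lacks {missing}")
```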

Note

  • Sequences are separated by pad_token_id with attention_mask=0

  • Remaining space in the last pack is filled with padding

  • Position IDs follow reset_position_ids: continuous across the pack when False, restarting at 0 for each sequence when True

  • Efficient for training when sequences have varying lengths