easydel.trainers.prompt_utils


Prompt formatting and chat template utilities.

This module provides utilities for converting between different conversation formats, applying chat templates, and handling various prompt structures. Originally from HuggingFace TRL, adapted for EasyDeL.

Key functionality:

- Convert between OpenAI format and simpler dictionary formats
- Apply chat templates to conversational datasets
- Detect conversational vs instruction formats
- Handle multi-turn conversations and function calling

easydel.trainers.prompt_utils.apply_chat_template(example: dict[str, list[dict[str, str]]], tokenizer: Any, tools: list[Union[dict, Callable]] | None = None, **template_kwargs) → dict[str, str]

Apply chat template to conversational examples.

Formats conversation data using the tokenizer’s chat template, handling various input formats and optionally including tool schemas.

Parameters

example – Dictionary containing conversation data. Supported keys: ‘prompt’, ‘chosen’, ‘rejected’, ‘completion’, ‘messages’, ‘label’.
tokenizer – Tokenizer with chat template support.
tools – Optional list of tool/function schemas for function calling.

Returns

Formatted example with chat template applied to text fields.

Return type

dict

Raises

ValueError – If example format is not supported.

Note

Handles both single and multi-turn conversations. Preserves original structure while applying templates.

easydel.trainers.prompt_utils.convert_to_openai_format(input_data: Union[list[list[dict[str, str]]], list[dict[str, str]], dict[str, str]]) → list[dict[str, str | list[dict[str, str]]]]

Converts various input formats (list[list[dict]], list[dict], dict) into the OpenAI Chat Completions message list format.

If the input_data already conforms to the target OpenAIMessageList format (specifically with content as list of parts), it is returned directly.

Target format example for one message:

{
    "role": "user",
    "content": [{"type": "text", "text": "message content here"}]
}

Parameters: input_data – Data in one of the supported formats or already in the target OpenAIMessageList format. Keys like ‘role’, ‘content’, ‘text’, ‘message’ are searched case-insensitively within dictionaries during conversion.
Returns: A list of messages in the target OpenAI format. Returns an empty list if the input is invalid, cannot be parsed, results in no valid messages, or is an unsupported type. Returns the input directly if it already matches the target format.
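As a rough illustration of the target shape, the sketch below wraps plain string content in a single text part. It is not EasyDeL's implementation: the helper name `to_openai_parts` is hypothetical, and it only handles a flat list of `role`/`content` dictionaries, without the case-insensitive key search or the other input shapes described above.

```python
def to_openai_parts(messages):
    """Wrap string content in the OpenAI 'list of parts' form (illustrative only)."""
    converted = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain text becomes a single text part.
            content = [{"type": "text", "text": content}]
        # Content that is already a list of parts is passed through unchanged.
        converted.append({"role": message["role"], "content": content})
    return converted


simple = [{"role": "user", "content": "message content here"}]
print(to_openai_parts(simple))
# [{'role': 'user', 'content': [{'type': 'text', 'text': 'message content here'}]}]
```

Applying the helper twice gives the same result as applying it once, which mirrors the documented behavior of returning already-converted input as is.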

easydel.trainers.prompt_utils.extract_prompt(example: dict[str, collections.abc.Sequence]) → dict[str, collections.abc.Sequence]

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

For more details, see [maybe_extract_prompt].

easydel.trainers.prompt_utils.is_conversational(example: dict[str, Any]) → bool

Check if an example is in conversational format.

Detects whether the example contains conversation-style data with role and content fields.

Parameters

example – Dictionary to check. Looks for keys like ‘prompt’, ‘chosen’, ‘rejected’, ‘completion’, or ‘messages’.

Returns

True if the example contains conversational data with a role/content structure, False otherwise.

Return type

bool

Note

Used to determine whether to apply chat templates during processing.
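The check can be pictured with the following minimal sketch. It is an assumption about the general approach, not the library's code: the name `looks_conversational` is hypothetical, and it only inspects the first message of each candidate field.

```python
CONVERSATIONAL_KEYS = ("prompt", "chosen", "rejected", "completion", "messages")


def looks_conversational(example):
    """Return True if any known field holds a list of role/content messages."""
    for key in CONVERSATIONAL_KEYS:
        value = example.get(key)
        if isinstance(value, list) and value:
            first = value[0]
            if isinstance(first, dict) and "role" in first and "content" in first:
                return True
    return False


chat_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
text_example = {"prompt": "The sky is"}
print(looks_conversational(chat_example))  # True
print(looks_conversational(text_example))  # False
```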

easydel.trainers.prompt_utils.is_conversational_from_value(example: dict[str, Any]) → bool

Check if the example is in a conversational format (from/value). Note that this format isn’t recommended; prefer the ChatML format (role/content).

Parameters: example (dict[str, Any]) – A single data entry of a dataset. The example can have different keys depending on the dataset type.
Returns: True if the data is in the conversational from/value format, False otherwise.
Return type: bool

Examples:

```python
>>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
True

>>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
False

>>> example = {"conversations": "The sky is"}
>>> is_conversational_from_value(example)
False
```

easydel.trainers.prompt_utils.keep_array_and_primitives(example: TListOrMapping) → TListOrMapping

Recursively keeps only numpy/jax arrays, ints, floats, and bools from a nested structure.

Parameters: example (list or Mapping) – Input nested structure (list or dictionary) to filter.
Returns: Filtered structure containing only arrays and primitive types.

Example:

```python
>>> import numpy as np
>>> example = {
...     "array": np.array([1, 2, 3]),
...     "int_val": 42,
...     "float_val": 3.14,
...     "bool_val": True,
...     "string": "remove_me",
...     "nested": {"keep": 1, "remove": "text"}
... }
>>> keep_array_and_primitives(example)
{'array': array([1, 2, 3]), 'int_val': 42, 'float_val': 3.14, 'bool_val': True, 'nested': {'keep': 1}}
```

easydel.trainers.prompt_utils.keep_arrays_map(example: dict[str, Any], array_fields: list[str] | None = None, drop_fields: list[str] | None = None) → dict[str, Any]

Keep only array fields and convert them to numpy arrays for HF datasets compatibility.

easydel.trainers.prompt_utils.maybe_apply_chat_template(example: dict[str, list[dict[str, str]]], tokenizer: Any, tools: list[Union[dict, Callable]] | None = None) → dict[str, str]

Conditionally apply chat template to conversational examples.

Checks if the example is in conversational format and applies the chat template if needed, otherwise returns the example unchanged.

Parameters

example – Dictionary that may contain conversation data.
tokenizer – Tokenizer with chat template support.
tools – Optional list of tool/function schemas.

Returns

Example with chat template applied if conversational, otherwise unchanged.

Return type

dict

Note

Useful for datasets that may contain mixed formats.
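The conditional behavior can be sketched as below, under several stated assumptions. `ToyTokenizer` is a hypothetical stand-in for a real tokenizer, its rendering format is invented for illustration, only the `messages` field is handled, and storing the result under a `text` key is an assumption rather than something this page confirms.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with chat template support."""

    def apply_chat_template(self, messages, tokenize=False):
        # Render each turn as "<role>: content" on its own line (made-up template).
        return "".join(f"<{m['role']}>: {m['content']}\n" for m in messages)


def maybe_apply_sketch(example, tokenizer):
    """Apply the template to a 'messages' example, else return it unchanged."""
    messages = example.get("messages")
    is_chat = (
        isinstance(messages, list)
        and bool(messages)
        and isinstance(messages[0], dict)
        and "role" in messages[0]
        and "content" in messages[0]
    )
    if not is_chat:
        return example
    # Assumption: the rendered conversation is stored under a "text" key.
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}


chat = {"messages": [{"role": "user", "content": "Hi"}]}
plain = {"text": "The sky is"}
print(maybe_apply_sketch(chat, ToyTokenizer()))   # {'text': '<user>: Hi\n'}
print(maybe_apply_sketch(plain, ToyTokenizer()))  # {'text': 'The sky is'}
```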

easydel.trainers.prompt_utils.maybe_convert_to_chatml(example: dict[str, list]) → dict[str, list]

Convert a conversational dataset with fields from and value to ChatML format.

This function modifies conversational data to align with OpenAI’s ChatML format:

- Replaces the key “from” with “role” in message dictionaries.
- Replaces the key “value” with “content” in message dictionaries.
- Renames “conversations” to “messages” for consistency with ChatML.

Parameters: example (dict[str, list]) – A single data entry containing a list of messages.
Returns: Example reformatted to ChatML style.
Return type: dict[str, list]

Example:

```python
>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."},
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
              {'role': 'assistant', 'content': 'It is blue.'}]}
```

easydel.trainers.prompt_utils.maybe_extract_prompt(example: dict[str, list]) → dict[str, list]

Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

If the example already contains a “prompt” key, the function returns the example as is. Else, the function identifies the longest common sequence (prefix) of conversation turns between the “chosen” and “rejected” completions and extracts this as the prompt. It then removes this prompt from the respective “chosen” and “rejected” completions.

Parameters

example (dict[str, list]) – A dictionary representing a single data entry in the preference dataset. It must contain the keys “chosen” and “rejected”, where each value is either conversational or standard (str).

Returns

A dictionary containing:

- “prompt”: The longest common prefix between the “chosen” and “rejected” completions.
- “chosen”: The remainder of the “chosen” completion, with the prompt removed.
- “rejected”: The remainder of the “rejected” completion, with the prompt removed.

Return type

dict[str, list]

Examples:

```python
>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."},
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."},
...     ],
... }
>>> extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```

Or, with the map method of datasets.Dataset:

```python
>>> from easydel.trainers.prompt_utils import extract_prompt
>>> from datasets import Dataset

>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
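The longest-common-prefix step described above can be sketched in plain Python for the conversational case. This is an illustrative reimplementation, not the library's code; `split_shared_prompt` is a hypothetical name, and string (non-conversational) completions are not covered.

```python
def split_shared_prompt(chosen, rejected):
    """Split two conversations into their shared leading turns and the remainders."""
    shared = 0
    for chosen_turn, rejected_turn in zip(chosen, rejected):
        if chosen_turn != rejected_turn:
            break  # first turn where the two conversations diverge
        shared += 1
    return {
        "prompt": chosen[:shared],
        "chosen": chosen[shared:],
        "rejected": rejected[shared:],
    }


chosen = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."},
]
rejected = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is green."},
]
result = split_shared_prompt(chosen, rejected)
print(result["prompt"])  # [{'role': 'user', 'content': 'What color is the sky?'}]
```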

easydel.trainers.prompt_utils.maybe_unpair_preference_dataset(dataset: DatasetType, num_proc: int | None = None, desc: str | None = None) → DatasetType

Unpair a preference dataset if it is paired.

Parameters

dataset (Dataset or DatasetDict) – Preference dataset to unpair. The dataset must have columns “chosen”, “rejected” and optionally “prompt”.
num_proc (int, optional) – Number of processes to use for processing the dataset.
desc (str, optional) – Meaningful description to be displayed alongside the progress bar while mapping examples.

Returns

The unpaired preference dataset if it was paired, otherwise the original dataset.

Return type

Dataset or DatasetDict

Example:

```python
>>> from datasets import Dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = maybe_unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```

easydel.trainers.prompt_utils.pack_dataset(dataset: DatasetType, seq_length: int, strategy: str = 'bfd', map_kwargs: dict[str, Any] | None = None) → DatasetType

Pack sequences in a dataset into chunks of size seq_length.

Parameters

dataset (Dataset or DatasetDict) – Dataset to pack.
seq_length (int) – Target sequence length to pack to.
strategy (str, optional, defaults to “bfd”) – Packing strategy to use. Can be either:
- “bfd” (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the middle.
- “wrapped”: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle to completely fill each packed sequence with data.
map_kwargs (dict, optional) – Additional keyword arguments to pass to the dataset’s map method when packing examples.

Returns

The dataset with packed sequences. The number of examples may decrease as sequences are combined.

Return type

Dataset or DatasetDict

Example:

```python
>>> from datasets import Dataset
>>> from easydel.trainers.prompt_utils import pack_dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]],
...     "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]],
'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]],
'seq_lengths': [[3, 1], [3], [2]]}
```
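To make the two strategies concrete, here is a small pure-Python sketch of each, operating on plain lists instead of a dataset. It is a simplified illustration rather than the library's implementation: the names `pack_bfd` and `pack_wrapped` are hypothetical, and `pack_bfd` assumes no sequence is longer than `seq_length`.

```python
def pack_bfd(sequences, seq_length):
    """Best Fit Decreasing: put each sequence in the fullest bin that still fits it."""
    bins = []  # each bin is a list of whole sequences
    # sorted() is stable, so equal-length sequences keep their original order.
    for seq in sorted(sequences, key=len, reverse=True):
        best = None  # (free space, bin) of the tightest bin found so far
        for candidate in bins:
            free = seq_length - sum(len(s) for s in candidate)
            if len(seq) <= free and (best is None or free < best[0]):
                best = (free, candidate)
        if best is None:
            bins.append([seq])  # nothing fits: open a new bin
        else:
            best[1].append(seq)
    # Concatenate the sequences inside each bin.
    return [[tok for s in b for tok in s] for b in bins]


def pack_wrapped(sequences, seq_length):
    """Concatenate everything, then cut into fixed-size chunks."""
    flat = [tok for seq in sequences for tok in seq]
    return [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]


input_ids = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
print(pack_bfd(input_ids, seq_length=4))      # [[1, 2, 3, 9], [6, 7, 8], [4, 5]]
print(pack_wrapped(input_ids, seq_length=4))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

On this input the `pack_bfd` sketch reproduces the `input_ids` shown in the example above, while `pack_wrapped` splits the second and third sequences across chunk boundaries.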

easydel.trainers.prompt_utils.pad_and_truncate_dataset(dataset: datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict, max_length: int, padding_token_id: int | None = None, padding_values: dict[str, Any] | None = None, truncate: bool = True, padding: bool = True, side: Literal['left', 'right'] = 'left', map_kwargs: dict[str, Any] | None = None, make_it_1d: bool = True) → datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict

Pad and/or truncate sequences in a dataset to a specified max_length.

Preserves array backends:

- If a column’s sequences are numpy arrays, outputs numpy arrays.
- If a column’s sequences are JAX arrays, outputs JAX arrays.
- If a column’s sequences are Python lists, outputs lists.

Special handling:

- Columns ending with ‘_ids’ or named ‘labels’ use padding_token_id (required if padding such columns).
- ‘attention_mask’ columns use 0 for padding.
- ‘position_ids’ columns are continued sequentially when padding.
- Custom padding values can be specified via padding_values, which overrides defaults.

Notes

If an entire batch column is None, backend cannot be inferred; it falls back to Python lists for that batch.
Hugging Face Datasets stores data in Arrow; on retrieval, types may depend on dataset.set_format(). This function preserves types within the map, but downstream representation may vary unless you set a format.
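The per-sequence effect of padding and truncation can be sketched for plain Python lists as follows. This is a hedged illustration only: `pad_or_truncate` is a hypothetical helper, it assumes truncation keeps the first `max_length` items (this page does not say which end is kept), and it does not model array backends or the `position_ids` rule.

```python
def pad_or_truncate(seq, max_length, pad_value, side="left"):
    """Return a list of exactly max_length items (assumes truncation keeps the first items)."""
    seq = list(seq)[:max_length]
    padding = [pad_value] * (max_length - len(seq))
    return padding + seq if side == "left" else seq + padding


padding_token_id = 0  # hypothetical pad token id, chosen for illustration
input_ids = pad_or_truncate([5, 6, 7], max_length=5, pad_value=padding_token_id, side="left")
attention_mask = pad_or_truncate([1, 1, 1], max_length=5, pad_value=0, side="left")
print(input_ids)       # [0, 0, 5, 6, 7]
print(attention_mask)  # [0, 0, 1, 1, 1]
```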

easydel.trainers.prompt_utils.remove_none_values(example: TListOrMapping) → TListOrMapping

Recursively removes entries with None values from a nested structure (list or dictionary).

Parameters: example (list or Mapping) – Input nested structure (list or dictionary) from which to remove None.

Example:

```python
>>> example = [
...     {
...         "a": {"aa": None, "ab": 1},
...         "b": "my_string",
...     }
... ]
>>> remove_none_values(example)
[{'a': {'ab': 1}, 'b': 'my_string'}]
```

easydel.trainers.prompt_utils.reverse_openai_format(openai_messages: list[dict[str, str | list[dict[str, str]]]], content_key_name: str = 'content') → Optional[Union[dict[str, str], list[dict[str, str]]]]

Converts a list of OpenAI Chat Completion messages back into simpler formats.

Input format example:

[
    {"role": "user", "content": [{"type": "text", "text": "Hello AI."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Hello User!"}]}
]

Output format examples:

- If input has 1 message: {"role": "user", "content": "Hello AI."}
- If input has >1 message: [{"role": "user", "content": "Hello AI."}, {"role": "assistant", "content": "Hello User!"}]
- If input is empty: []

Parameters

openai_messages – A list of messages in the OpenAI format.
content_key_name – The key name to use for the message text in the output dictionaries (e.g., “content”, “text”). Defaults to “content”.

Returns

A single dictionary if only one message was processed, a list of dictionaries if multiple messages were processed, an empty list if the input was empty, or None if the input list structure is invalid.
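The return rules above can be sketched as follows. This is an illustrative approximation, not the library's code: `flatten_openai_messages` is a hypothetical name, joining multiple text parts by plain concatenation is an assumption, and the invalid-input case that returns None is not modeled.

```python
def flatten_openai_messages(openai_messages, content_key_name="content"):
    """Collapse each message's text parts into one string (illustrative only)."""
    simple = []
    for message in openai_messages:
        parts = message["content"]
        if isinstance(parts, str):
            text = parts
        else:
            # Assumption: multiple text parts are concatenated in order.
            text = "".join(p["text"] for p in parts if p.get("type") == "text")
        simple.append({"role": message["role"], content_key_name: text})
    # One message comes back as a bare dict; zero or several come back as a list.
    if len(simple) == 1:
        return simple[0]
    return simple


one = [{"role": "user", "content": [{"type": "text", "text": "Hello AI."}]}]
two = one + [{"role": "assistant", "content": [{"type": "text", "text": "Hello User!"}]}]
print(flatten_openai_messages(one))  # {'role': 'user', 'content': 'Hello AI.'}
print(flatten_openai_messages(two, content_key_name="text"))
# [{'role': 'user', 'text': 'Hello AI.'}, {'role': 'assistant', 'text': 'Hello User!'}]
```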

easydel.trainers.prompt_utils.truncate_dataset(dataset: DatasetType, max_length: int, map_kwargs: dict[str, Any] | None = None) → DatasetType

Truncate sequences in a dataset to a specified max_length.

Parameters

dataset (Dataset or DatasetDict) – Dataset to truncate.
max_length (int) – Maximum sequence length to truncate to.
map_kwargs (dict, optional) – Additional keyword arguments to pass to the dataset’s map method when truncating examples.

Returns

The dataset with truncated sequences.

Return type

Dataset or DatasetDict

Example:

```python
>>> from datasets import Dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}
```

easydel.trainers.prompt_utils.unpair_preference_dataset(dataset: DatasetType, num_proc: int | None = None, desc: str | None = None) → DatasetType

Unpair a preference dataset.

Parameters

dataset (Dataset or DatasetDict) – Preference dataset to unpair. The dataset must have columns “chosen”, “rejected” and optionally “prompt”.
num_proc (int, optional) – Number of processes to use for processing the dataset. (Unused in the current implementation.)
desc (str, optional) – Meaningful description to be displayed alongside the progress bar while mapping examples.

Returns

The unpaired preference dataset.

Return type

Dataset

Example:

```python
>>> from datasets import Dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
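The transformation in the example above can be sketched on a plain dictionary of columns. This is a minimal illustration under assumptions, not the library's code: `unpair_columns` is a hypothetical name, and placing all chosen rows before all rejected rows is an assumption that happens to agree with the `dataset[0]` output shown.

```python
def unpair_columns(batch):
    """Turn paired prompt/chosen/rejected columns into prompt/completion/label columns."""
    n = len(batch["chosen"])
    unpaired = {
        # Chosen completions are labeled True, rejected ones False.
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * n + [False] * n,
    }
    if "prompt" in batch:
        # Each prompt is repeated once for its chosen and once for its rejected completion.
        unpaired = {"prompt": batch["prompt"] + batch["prompt"], **unpaired}
    return unpaired


paired = {
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", "in the sky."],
    "rejected": [" green.", " in the sea."],
}
columns = unpair_columns(paired)
print(columns["label"])  # [True, True, False, False]
```

Two paired rows become four unpaired rows, matching the `num_rows: 4` reported in the example.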
