easydel.trainers.prompt_utils
Prompt formatting and chat template utilities.
This module provides utilities for converting between different conversation formats, applying chat templates, and handling various prompt structures. Originally from HuggingFace TRL, adapted for EasyDeL.
Key functionality:
- Convert between OpenAI format and simpler dictionary formats
- Apply chat templates to conversational datasets
- Detect conversational vs. instruction formats
- Handle multi-turn conversations and function calling
- easydel.trainers.prompt_utils.apply_chat_template(example: dict[str, list[dict[str, str]]], tokenizer: Any, tools: list[Union[dict, Callable]] | None = None, **template_kwargs) -> dict[str, str]
Apply chat template to conversational examples.
Formats conversation data using the tokenizer’s chat template, handling various input formats and optionally including tool schemas.
- Parameters
example – Dictionary containing conversation data. Supported keys: ‘prompt’, ‘chosen’, ‘rejected’, ‘completion’, ‘messages’, ‘label’.
tokenizer – Tokenizer with chat template support.
tools – Optional list of tool/function schemas for function calling.
- Returns
Formatted example with chat template applied to text fields.
- Return type
dict
- Raises
ValueError – If example format is not supported.
Note
Handles both single and multi-turn conversations. Preserves original structure while applying templates.
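As an illustration of what applying a chat template to a prompt/completion example can look like, here is a minimal sketch. It does not call EasyDeL: `StubTokenizer`, its made-up `<|role|>` template, and `render_prompt_completion` are hypothetical names introduced here. It also assumes the completion is taken as the text that the full rendering adds after the rendered prompt, which this page does not confirm.

```python
class StubTokenizer:
    """Hypothetical stand-in for a tokenizer with chat template support."""

    def apply_chat_template(self, messages, tools=None, tokenize=False,
                            add_generation_prompt=False, **kwargs):
        # Render each turn as "<|role|>content\n" (a made-up template).
        text = "".join(f"<|{m['role']}|>{m['content']}\n" for m in messages)
        if add_generation_prompt:
            text += "<|assistant|>"
        return text


def render_prompt_completion(example, tokenizer):
    """Render a prompt/completion example to plain strings (illustrative only)."""
    prompt = tokenizer.apply_chat_template(
        example["prompt"], tokenize=False, add_generation_prompt=True
    )
    full = tokenizer.apply_chat_template(
        example["prompt"] + example["completion"], tokenize=False
    )
    # Assumption: the completion is whatever the full rendering adds after the prompt.
    return {"prompt": prompt, "completion": full[len(prompt):]}


example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
rendered = render_prompt_completion(example, StubTokenizer())
print(rendered)
```

With a real tokenizer the rendered strings depend entirely on that model's chat template.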
- easydel.trainers.prompt_utils.convert_to_openai_format(input_data: Union[list[list[dict[str, str]]], list[dict[str, str]], dict[str, str]]) -> list[dict[str, str | list[dict[str, str]]]]
Converts various input formats (list[list[dict]], list[dict], dict) into the OpenAI Chat Completions message list format.
If the input_data already conforms to the target OpenAIMessageList format (specifically with content as list of parts), it is returned directly.
Target format example for one message:
    {"role": "user", "content": [{"type": "text", "text": "message content here"}]}
- Parameters
input_data – Data in one of the supported formats or already in the target OpenAIMessageList format. Keys like ‘role’, ‘content’, ‘text’, ‘message’ are searched case-insensitively within dictionaries during conversion.
- Returns
A list of messages in the target OpenAI format. Returns an empty list if the input is invalid, cannot be parsed, results in no valid messages, or is an unsupported type. Returns the input directly if it already matches the target format.
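The conversion described above can be sketched in plain Python. This is not EasyDeL's implementation: `to_openai_message` and `to_openai_messages` are hypothetical helpers, and dropping messages that lack a role or a text field is an assumption made for the sketch.

```python
def to_openai_message(message):
    """Convert one simple message dict to the parts-based OpenAI form."""
    # Look keys up case-insensitively, as the parameter description mentions.
    lowered = {str(key).lower(): value for key, value in message.items()}
    role = lowered.get("role")
    text = lowered.get("content") or lowered.get("text") or lowered.get("message")
    # Assumption: messages without a role or a plain-text body are dropped.
    if not isinstance(role, str) or not isinstance(text, str):
        return None
    return {"role": role, "content": [{"type": "text", "text": text}]}


def to_openai_messages(messages):
    """Convert a list of simple message dicts, skipping anything unparseable."""
    converted = [to_openai_message(m) for m in messages if isinstance(m, dict)]
    return [m for m in converted if m is not None]


simple = [
    {"Role": "user", "Content": "Hello AI."},
    {"role": "assistant", "text": "Hello User!"},
]
print(to_openai_messages(simple))
```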
- easydel.trainers.prompt_utils.extract_prompt(example: dict[str, collections.abc.Sequence]) -> dict[str, collections.abc.Sequence]
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.
For more details, see [maybe_extract_prompt].
- easydel.trainers.prompt_utils.is_conversational(example: dict[str, Any]) -> bool
Check if an example is in conversational format.
Detects whether the example contains conversation-style data with role and content fields.
- Parameters
example – Dictionary to check. Looks for keys like ‘prompt’, ‘chosen’, ‘rejected’, ‘completion’, or ‘messages’.
- Returns
True if example contains conversational data with role/content structure, False otherwise.
- Return type
bool
Note
Used to determine whether to apply chat templates during processing.
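A rough sketch of this kind of check, written independently of EasyDeL (`looks_conversational` is a hypothetical name, and inspecting only the first message of each field is an assumption):

```python
SUPPORTED_KEYS = ("prompt", "chosen", "rejected", "completion", "messages")


def looks_conversational(example):
    """Return True if any supported field holds a list of role/content messages."""
    for key in SUPPORTED_KEYS:
        value = example.get(key)
        if isinstance(value, list) and value:
            first = value[0]
            # Assumption: checking the first message is enough to classify the field.
            if isinstance(first, dict) and "role" in first and "content" in first:
                return True
    return False


print(looks_conversational({"messages": [{"role": "user", "content": "Hi"}]}))
print(looks_conversational({"prompt": "The sky is", "completion": " blue."}))
```

The first call prints True and the second prints False, since plain strings carry no role/content structure.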
- easydel.trainers.prompt_utils.is_conversational_from_value(example: dict[str, Any]) -> bool
Check if the example is in a conversational format (from/value). Note that this format isn’t recommended. Prefer the ChatML format (role/content).
- Parameters
example (dict[str, Any]) – A single data entry of a dataset. The example can have different keys depending on the dataset type.
- Returns
True if the data is in the conversational from/value format, False otherwise.
- Return type
bool
Examples:
```python
>>> example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
True

>>> example = {"conversations": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational_from_value(example)
False

>>> example = {"conversations": "The sky is"}
>>> is_conversational_from_value(example)
False
```
- easydel.trainers.prompt_utils.keep_array_and_primitives(example: TListOrMapping) -> TListOrMapping
Recursively keeps only numpy/jax arrays, ints, floats, and bools from a nested structure.
- Parameters
example (list or Mapping) – Input nested structure (list or dictionary) to filter.
- Returns
Filtered structure containing only arrays and primitive types.
Example:
```python
>>> import numpy as np
>>> example = {
...     "array": np.array([1, 2, 3]),
...     "int_val": 42,
...     "float_val": 3.14,
...     "bool_val": True,
...     "string": "remove_me",
...     "nested": {"keep": 1, "remove": "text"}
... }
>>> keep_array_and_primitives(example)
{'array': array([1, 2, 3]), 'int_val': 42, 'float_val': 3.14, 'bool_val': True, 'nested': {'keep': 1}}
```
- easydel.trainers.prompt_utils.keep_arrays_map(example: dict[str, Any], array_fields: list[str] | None = None, drop_fields: list[str] | None = None) -> dict[str, Any]
Keep only array fields and convert them to numpy arrays for HF datasets compatibility.
- easydel.trainers.prompt_utils.maybe_apply_chat_template(example: dict[str, list[dict[str, str]]], tokenizer: Any, tools: list[Union[dict, Callable]] | None = None) -> dict[str, str]
Conditionally apply chat template to conversational examples.
Checks if the example is in conversational format and applies the chat template if needed, otherwise returns the example unchanged.
- Parameters
example – Dictionary that may contain conversation data.
tokenizer – Tokenizer with chat template support.
tools – Optional list of tool/function schemas.
- Returns
Example with chat template applied if conversational, otherwise unchanged.
- Return type
dict
Note
Useful for datasets that may contain mixed formats.
- easydel.trainers.prompt_utils.maybe_convert_to_chatml(example: dict[str, list]) -> dict[str, list]
Convert a conversational dataset with fields from and value to ChatML format.
This function modifies conversational data to align with OpenAI’s ChatML format:
- Replaces the key “from” with “role” in message dictionaries.
- Replaces the key “value” with “content” in message dictionaries.
- Renames “conversations” to “messages” for consistency with ChatML.
- Parameters
example (dict[str, list]) – A single data entry containing a list of messages.
- Returns
Example reformatted to ChatML style.
- Return type
dict[str, list]
Example:
```python
>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."},
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
              {'role': 'assistant', 'content': 'It is blue.'}]}
```
- easydel.trainers.prompt_utils.maybe_extract_prompt(example: dict[str, list]) -> dict[str, list]
Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.
If the example already contains a “prompt” key, the function returns the example as is. Else, the function identifies the longest common sequence (prefix) of conversation turns between the “chosen” and “rejected” completions and extracts this as the prompt. It then removes this prompt from the respective “chosen” and “rejected” completions.
- Parameters
example (dict[str, list]) – A dictionary representing a single data entry in the preference dataset. It must contain the keys “chosen” and “rejected”, where each value is either conversational or standard (str).
- Returns
A dictionary containing:
- “prompt”: The longest common prefix between the “chosen” and “rejected” completions.
- “chosen”: The remainder of the “chosen” completion, with the prompt removed.
- “rejected”: The remainder of the “rejected” completion, with the prompt removed.
- Return type
dict[str, list]
Examples:
```python
>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."},
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."},
...     ],
... }
>>> extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
Or, with the map method of datasets.Dataset:
```python
>>> from easydel.trainers.prompt_utils import extract_prompt
>>> from datasets import Dataset

>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```
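The longest-common-prefix step described for this function can be sketched without any library code. `split_shared_prompt` is a hypothetical helper, not the EasyDeL function, and it covers only the conversational (list of turns) case.

```python
def split_shared_prompt(chosen, rejected):
    """Split off the longest run of leading turns shared by both completions."""
    shared = 0
    for chosen_turn, rejected_turn in zip(chosen, rejected):
        if chosen_turn != rejected_turn:
            break  # first turn where the two conversations diverge
        shared += 1
    return {
        "prompt": chosen[:shared],
        "chosen": chosen[shared:],
        "rejected": rejected[shared:],
    }


chosen = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."},
]
rejected = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is green."},
]
split = split_shared_prompt(chosen, rejected)
print(split["prompt"])
```

Here only the user turn is shared, so it becomes the prompt and each completion keeps its own assistant turn.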
- easydel.trainers.prompt_utils.maybe_unpair_preference_dataset(dataset: DatasetType, num_proc: int | None = None, desc: str | None = None) -> DatasetType
Unpair a preference dataset if it is paired.
- Parameters
dataset (Dataset or DatasetDict) – Preference dataset to unpair. The dataset must have columns “chosen”, “rejected” and optionally “prompt”.
num_proc (int, optional) – Number of processes to use for processing the dataset.
desc (str, optional) – Meaningful description to be displayed alongside the progress bar while mapping examples.
- Returns
The unpaired preference dataset if it was paired, otherwise the original dataset.
- Return type
Dataset or DatasetDict
Example:
```python
>>> from datasets import Dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
- easydel.trainers.prompt_utils.pack_dataset(dataset: DatasetType, seq_length: int, strategy: str = 'bfd', map_kwargs: dict[str, Any] | None = None) -> DatasetType
Pack sequences in a dataset into chunks of size seq_length.
- Parameters
dataset (Dataset or DatasetDict) – Dataset to pack.
seq_length (int) – Target sequence length to pack to.
strategy (str, optional, defaults to “bfd”) – Packing strategy to use. Can be either:
- “bfd” (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the middle.
- “wrapped”: Faster but more aggressive. Ignores sequence boundaries and will cut sequences in the middle to completely fill each packed sequence with data.
map_kwargs (dict, optional) – Additional keyword arguments to pass to the dataset’s map method when packing examples.
- Returns
The dataset with packed sequences. The number of examples may decrease as sequences are combined.
- Return type
Dataset or DatasetDict
Example:
```python
>>> from datasets import Dataset
>>> from easydel.trainers.prompt_utils import pack_dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]],
...     "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]],
 'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]],
 'seq_lengths': [[3, 1], [3], [2]]}
```
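To make the difference between the two strategies concrete, here is a small pure-Python sketch on plain token lists. `pack_bfd` and `pack_wrapped` are hypothetical helpers written for this illustration, not the library's code. The sketch assumes every sequence is no longer than `seq_length` and resolves ties between equally tight bins by taking the first one.

```python
def pack_bfd(sequences, seq_length):
    """Best Fit Decreasing: longest first, each into the tightest bin that still fits."""
    bins = []  # packed sequences under construction
    free = []  # remaining capacity of each bin
    for seq in sorted(sequences, key=len, reverse=True):
        # Candidate bins have enough room left for the whole sequence.
        candidates = [i for i, room in enumerate(free) if room >= len(seq)]
        if candidates:
            best = min(candidates, key=lambda i: free[i])  # tightest fit
            bins[best].extend(seq)
            free[best] -= len(seq)
        else:
            bins.append(list(seq))
            free.append(seq_length - len(seq))
    return bins


def pack_wrapped(sequences, seq_length):
    """Wrapped: concatenate everything, then cut into fixed-size chunks."""
    flat = [token for seq in sequences for token in seq]
    return [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]


input_ids = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
print(pack_bfd(input_ids, 4))      # [[1, 2, 3, 9], [6, 7, 8], [4, 5]]
print(pack_wrapped(input_ids, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

The "bfd" result keeps every original sequence intact, while the "wrapped" result splits `[4, 5]` and `[6, 7, 8]` across chunk boundaries in exchange for fuller chunks.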
- easydel.trainers.prompt_utils.pad_and_truncate_dataset(dataset: datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict, max_length: int, padding_token_id: int | None = None, padding_values: dict[str, Any] | None = None, truncate: bool = True, padding: bool = True, side: Literal['left', 'right'] = 'left', map_kwargs: dict[str, Any] | None = None, make_it_1d: bool = True) -> datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict
Pad and/or truncate sequences in a dataset to a specified max_length.
- Preserves array backends:
If a column’s sequences are numpy arrays, outputs numpy arrays.
If a column’s sequences are JAX arrays, outputs JAX arrays.
If a column’s sequences are Python lists, outputs lists.
- Special handling:
Columns ending with ‘_ids’ or named ‘labels’ use padding_token_id (required if padding such columns)
‘attention_mask’ columns use 0 for padding
‘position_ids’ columns are continued sequentially when padding
Custom padding values can be specified via padding_values, which overrides defaults.
Notes
If an entire batch column is None, backend cannot be inferred; it falls back to Python lists for that batch.
Hugging Face Datasets stores data in Arrow; on retrieval, types may depend on dataset.set_format(). This function preserves types within the map, but downstream representation may vary unless you set a format.
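The per-column padding rules listed above can be sketched for a single Python list as follows. This is an illustration, not the library's implementation: `pad_or_truncate_right` is a hypothetical helper that handles right-side padding only (the documented default is `side='left'`), and it assumes truncation keeps the first `max_length` items.

```python
def pad_or_truncate_right(column, values, max_length, padding_token_id=None):
    """Right-pad or truncate one sequence according to its column name."""
    # Assumption: truncation keeps the first max_length items.
    values = list(values)[:max_length]
    missing = max_length - len(values)
    if missing == 0:
        return values
    if column == "position_ids":
        # Continue counting from the last position.
        start = values[-1] + 1 if values else 0
        return values + list(range(start, start + missing))
    if column == "attention_mask":
        return values + [0] * missing
    if column.endswith("_ids") or column == "labels":
        if padding_token_id is None:
            raise ValueError(f"padding_token_id is required to pad {column!r}")
        return values + [padding_token_id] * missing
    raise ValueError(f"no padding rule for column {column!r}")


print(pad_or_truncate_right("input_ids", [5, 6, 7], 5, padding_token_id=0))
print(pad_or_truncate_right("attention_mask", [1, 1, 1], 5))
print(pad_or_truncate_right("position_ids", [0, 1, 2], 5))
```

Note that "position_ids" is checked before the generic "_ids" rule, since its name also ends with "_ids".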
- easydel.trainers.prompt_utils.remove_none_values(example: TListOrMapping) -> TListOrMapping
Recursively removes entries with None values from a nested structure (list or dictionary).
- Parameters
example (list or Mapping) – Input nested structure (list or dictionary) from which to remove None.
Example:
```python
>>> example = [
...     {
...         "a": {"aa": None, "ab": 1},
...         "b": "my_string",
...     }
... ]
>>> remove_none_values(example)
[{'a': {'ab': 1}, 'b': 'my_string'}]
```
- easydel.trainers.prompt_utils.reverse_openai_format(openai_messages: list[dict[str, str | list[dict[str, str]]]], content_key_name: str = 'content') -> Optional[Union[dict[str, str], list[dict[str, str]]]]
Converts a list of OpenAI Chat Completion messages back into simpler formats.
Input format example:
    [
        {"role": "user", "content": [{"type": "text", "text": "Hello AI."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Hello User!"}]}
    ]
Output format examples:
- If input has 1 message: {"role": "user", "content": "Hello AI."}
- If input has >1 message: [{"role": "user", "content": "Hello AI."}, {"role": "assistant", "content": "Hello User!"}]
- If input is empty: []
- Parameters
openai_messages – A list of messages in the OpenAI format.
content_key_name – The key name to use for the message text in the output dictionaries (e.g., “content”, “text”). Defaults to “content”.
- Returns
A single dictionary if only one message was processed, a list of dictionaries if multiple messages were processed, an empty list if the input was empty, or None if the input list structure is invalid.
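The return-shape rules above can be sketched in plain Python. `flatten_openai_messages` is a hypothetical helper, not the EasyDeL function. Joining several text parts with an empty string, and treating a message without a role as invalid, are assumptions made for this sketch.

```python
def flatten_openai_messages(openai_messages, content_key_name="content"):
    """Collapse parts-based OpenAI messages back into simple role/text dicts."""
    if not isinstance(openai_messages, list):
        return None  # invalid input structure
    simple = []
    for message in openai_messages:
        if not isinstance(message, dict) or "role" not in message:
            return None
        content = message.get("content")
        if isinstance(content, list):
            # Assumption: text parts are concatenated without a separator.
            text = "".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        elif isinstance(content, str):
            text = content
        else:
            return None
        simple.append({"role": message["role"], content_key_name: text})
    # One message comes back as a bare dict; zero or several as a list.
    return simple[0] if len(simple) == 1 else simple


messages = [
    {"role": "user", "content": [{"type": "text", "text": "Hello AI."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Hello User!"}]},
]
print(flatten_openai_messages(messages))
print(flatten_openai_messages(messages[:1], content_key_name="text"))
```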
- easydel.trainers.prompt_utils.truncate_dataset(dataset: DatasetType, max_length: int, map_kwargs: dict[str, Any] | None = None) -> DatasetType
Truncate sequences in a dataset to a specified max_length.
- Parameters
dataset (Dataset or DatasetDict) – Dataset to truncate.
max_length (int) – Maximum sequence length to truncate to.
map_kwargs (dict, optional) – Additional keyword arguments to pass to the dataset’s map method when truncating examples.
- Returns
The dataset with truncated sequences.
- Return type
Dataset or DatasetDict
Example:
```python
>>> from datasets import Dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}
```
- easydel.trainers.prompt_utils.unpair_preference_dataset(dataset: DatasetType, num_proc: int | None = None, desc: str | None = None) -> DatasetType
Unpair a preference dataset.
- Parameters
dataset (Dataset or DatasetDict) – Preference dataset to unpair. The dataset must have columns “chosen”, “rejected” and optionally “prompt”.
num_proc (int, optional) – Number of processes to use for processing the dataset. (Unused in the current implementation.)
desc (str, optional) – Meaningful description to be displayed alongside the progress bar while mapping examples.
- Returns
The unpaired preference dataset.
- Return type
Dataset
Example:
```python
>>> from datasets import Dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", "in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
```
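The reshaping that turns two paired rows into four unpaired rows can be sketched on a plain column dictionary. `unpair_batch` is a hypothetical helper, not the library function. Listing all chosen completions before all rejected ones is an assumption that happens to agree with `dataset[0]` in the example above.

```python
def unpair_batch(batch):
    """Turn paired chosen/rejected columns into completion/label columns."""
    n = len(batch["chosen"])
    unpaired = {
        # Assumption: chosen completions come first, then rejected ones.
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * n + [False] * n,
    }
    if "prompt" in batch:
        # Each prompt is repeated once for its chosen and once for its rejected row.
        unpaired["prompt"] = batch["prompt"] + batch["prompt"]
    return unpaired


batch = {
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", "in the sky."],
    "rejected": [" green.", " in the sea."],
}
result = unpair_batch(batch)
print(result["completion"])
print(result["label"])
```

Each preference pair contributes one row labeled True and one labeled False, which is why the row count doubles.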