easydel.infra.elarge_model.builders

Builder functions for creating models and inference engines from ELM configurations.

This module provides high-level functions that build EasyDeL models, eSurge inference engines, and datasets from ELM configuration dictionaries, along with helpers that convert a configuration into keyword arguments.

easydel.infra.elarge_model.builders.build_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any])

Build a dataset from ELM configuration with data mixture.

Creates a unified dataset from the mixture configuration using DatasetMixture.build(). Supports token packing, block-deterministic mixing, and streaming.

Parameters: cfg_like – ELM configuration dictionary or mapping
Returns: The loaded and processed dataset
Return type: Dataset or IterableDataset

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> dataset = build_dataset(cfg)
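
As a rough illustration of what "pack_tokens" with "pack_seq_length" refers to, the sketch below concatenates tokenized documents and cuts the stream into fixed-length blocks. It is not EasyDeL's implementation: pack_sequences, eos_id, and the choice to drop the incomplete tail are assumptions made for this example only.

```python
from collections.abc import Iterable


def pack_sequences(
    sequences: Iterable[list[int]],
    seq_length: int,
    eos_id: int = 0,
) -> list[list[int]]:
    """Greedy packing sketch (hypothetical helper, not an EasyDeL function)."""
    buffer: list[int] = []
    packed: list[list[int]] = []
    for tokens in sequences:
        # Separate documents with an EOS token (assumed convention).
        buffer.extend(tokens)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_length:
            packed.append(buffer[:seq_length])
            buffer = buffer[seq_length:]
    # The incomplete tail is dropped in this sketch.
    return packed


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(docs, seq_length=4)
```

Every block has exactly seq_length tokens, so no padding is needed; the cost is that a block may span a document boundary.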

easydel.infra.elarge_model.builders.build_esurge(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], model: easydel.infra.base_module.EasyDeLBaseModule | None = None)

Build an eSurge inference engine from ELM configuration.

Creates an eSurge instance with the model, tokenizer, and inference configuration specified in the ELM config.

Parameters

cfg_like – ELM configuration dictionary or mapping
model – Optional EasyDeLBaseModule instance (default: None)

Returns

Configured eSurge inference engine

Return type

eSurge

Raises

NotImplementedError – If the task type is not supported by eSurge

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> engine = build_esurge(cfg)

easydel.infra.elarge_model.builders.build_model(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → EasyDeLBaseModule

Build an EasyDeL model from ELM configuration.

Automatically selects the appropriate model class based on the task type specified in the configuration.

Parameters: cfg_like – ELM configuration dictionary or mapping
Returns: The loaded model instance
Return type: EasyDeLBaseModule

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b", "task": "causal_lm"},
...     "loader": {"dtype": "bf16"}
... }
>>> model = build_model(cfg)
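
Selecting a class from the task type can be pictured as a table lookup. The sketch below is only an assumption about the general shape of such a dispatch: the table contents, the "sequence_classification" task name, the default task, and the ValueError are all hypothetical, and class names are kept as strings so the example runs without EasyDeL installed.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical task-to-class-name table; the real task names and classes may differ.
TASK_TO_CLASS_NAME = {
    "causal_lm": "AutoEasyDeLModelForCausalLM",
    "sequence_classification": "AutoEasyDeLModelForSequenceClassification",
}


def select_model_class_name(
    cfg: Mapping[str, Any],
    default_task: str = "causal_lm",  # assumed fallback when "task" is absent
) -> str:
    """Return the class name registered for the configured task."""
    task = cfg.get("model", {}).get("task", default_task)
    try:
        return TASK_TO_CLASS_NAME[task]
    except KeyError:
        raise ValueError(f"Unsupported task: {task!r}") from None


cfg = {"model": {"name_or_path": "meta-llama/Llama-2-7b", "task": "causal_lm"}}
class_name = select_model_class_name(cfg)
```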

easydel.infra.elarge_model.builders.build_sharded_source(cfg_like: ELMConfig | Mapping[str, Any]) → ShardedDataSource | None

Build a ShardedDataSource from ELM configuration.

Uses the ShardedDataSource architecture for efficient streaming and lazy transforms. Supports mixing, packing, and field transforms.

This function creates a unified ShardedDataSource from the mixture configuration, optionally applying:

- Field renaming via transforms
- Dataset mixing via MixedShardedSource
- Sequence packing via PackedShardedSource

Parameters: cfg_like – ELM configuration dictionary or mapping containing a ‘mixture’ section with dataset configurations.
Returns: ShardedDataSource if mixture is configured, None otherwise.

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "use_sharded_source": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
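>>> source = build_sharded_source(cfg)
>>> if source is not None:  # None is returned when no mixture is configured
...     for batch in source.open_shard(source.shard_names[0]):
...         process(batch)  # process() stands for user-defined handling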
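
To make the iteration pattern in the example concrete without a real dataset, the sketch below uses a toy in-memory stand-in that exposes only the two members used above, shard_names and open_shard(). InMemoryShardedSource is hypothetical and is not part of EasyDeL; the real ShardedDataSource may offer more than this.

```python
from collections.abc import Iterator
from typing import Any


class InMemoryShardedSource:
    """Toy stand-in exposing shard_names and open_shard(); not an EasyDeL class."""

    def __init__(self, shards: dict[str, list[dict[str, Any]]]) -> None:
        self._shards = shards

    @property
    def shard_names(self) -> list[str]:
        return list(self._shards)

    def open_shard(self, shard_name: str) -> Iterator[dict[str, Any]]:
        # Yield rows lazily, one at a time.
        yield from self._shards[shard_name]


source = InMemoryShardedSource(
    {
        "shard-0": [{"text": "a"}, {"text": "b"}],
        "shard-1": [{"text": "c"}],
    }
)

# Visit every shard in order, reading each one lazily.
texts = [
    row["text"]
    for name in source.shard_names
    for row in source.open_shard(name)
]
```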

easydel.infra.elarge_model.builders.build_tokenized_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], save: bool = True)

Build, tokenize, and optionally save a dataset from ELM configuration.

This is the main entry point for the tokenization pipeline. It:

1. Loads the dataset from the mixture configuration
2. Tokenizes using the specified tokenizer
3. Optionally saves to disk or HuggingFace Hub

Parameters

cfg_like – ELM configuration dictionary or mapping
save – Whether to save the tokenized dataset (default: True)

Returns

Tuple of (tokenized_dataset, save_path) if save=True, else tokenized_dataset

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "streaming": False,  # Must be False for saving
...         "tokenization": {
...             "max_length": 2048,
...             "text_field": "text",
...             "output_field": "tokens",
...             "num_proc": 4
...         },
...         "save": {
...             "output_path": "tokenized_data",
...             "format": "parquet"
...         }
...     }
... }
>>> dataset, path = build_tokenized_dataset(cfg)
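
Because the return value is a tuple when save=True and a bare dataset otherwise, calling code that toggles save may want to normalize the result. The helper below is a hypothetical convenience, not part of this module, and the list used as a dataset is only a placeholder.

```python
from typing import Any, Optional


def split_result(result: Any) -> tuple[Any, Optional[str]]:
    """Normalize a build_tokenized_dataset result to (dataset, save_path).

    Hypothetical helper: assumes the saved form is a 2-tuple, as documented.
    """
    if isinstance(result, tuple) and len(result) == 2:
        dataset, save_path = result
        return dataset, save_path
    return result, None


# Placeholder standing in for a tokenized dataset.
fake_dataset = [{"tokens": [1, 2, 3]}]

saved = split_result((fake_dataset, "tokenized_data"))  # shape when save=True
unsaved = split_result(fake_dataset)                    # shape when save=False
```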

easydel.infra.elarge_model.builders.save_dataset(dataset, output_path: str, format: str = 'parquet', num_shards: int | None = None, compression: str | None = 'snappy', max_shard_size: str | int = '500MB', overwrite: bool = False, push_to_hub: bool = False, hub_repo_id: str | None = None, hub_private: bool = False, hub_token: str | None = None)

Save a dataset to disk or HuggingFace Hub.

Parameters

dataset – HuggingFace Dataset to save
output_path – Path to save the dataset
format – Output format - “parquet”, “arrow”, “json”, “jsonl” (default: “parquet”)
num_shards – Number of shards (default: None, auto-detect)
compression – Compression algorithm (default: “snappy”)
max_shard_size – Maximum shard size (default: “500MB”)
overwrite – Whether to overwrite existing files (default: False)
push_to_hub – Push to HuggingFace Hub (default: False)
hub_repo_id – Hub repository ID (required if push_to_hub=True)
hub_private – Make Hub repo private (default: False)
hub_token – HuggingFace token (default: None)

Returns

Path to saved dataset or Hub URL if pushed

Example

>>> save_dataset(tokenized_dataset, "output/tokenized", format="parquet")
>>> # Or push to hub
>>> save_dataset(tokenized_dataset, "output/tokenized",
...              push_to_hub=True, hub_repo_id="username/my-dataset")
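
How num_shards is auto-detected is not documented here. One plausible reading, shown purely as an assumption, is that the shard count follows from the dataset size and max_shard_size. In the sketch below, parse_size and estimate_num_shards are hypothetical helpers, and decimal units (1 MB = 10**6 bytes) are assumed.

```python
import math
from typing import Union

# Assumed decimal units; the library may interpret suffixes differently.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: Union[int, str]) -> int:
    """Convert a size such as "500MB" (or a plain integer) to bytes."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def estimate_num_shards(total_bytes: int, max_shard_size: Union[int, str]) -> int:
    """Smallest shard count that keeps each shard at or below the limit."""
    return max(1, math.ceil(total_bytes / parse_size(max_shard_size)))


shard_bytes = parse_size("500MB")
num_shards = estimate_num_shards(1_200_000_000, "500MB")
```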

easydel.infra.elarge_model.builders.to_data_mixture_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for DatasetMixture creation.

Transforms the mixture configuration section into the format expected by the DatasetMixture and DataManager classes. Supports token packing and block-deterministic mixing.

Parameters: cfg_like – ELM configuration dictionary or mapping
Returns: Dictionary of keyword arguments for DatasetMixture initialization

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "train.json", "content_field": "text"}
...         ],
...         "batch_size": 32,
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> kwargs = to_data_mixture_kwargs(cfg)
>>> mixture = DatasetMixture(**kwargs)

easydel.infra.elarge_model.builders.to_esurge_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for eSurge initialization.

Extracts eSurge-specific configuration values and infers defaults from base configuration when needed.

Parameters: cfg_like – ELM configuration dictionary or mapping
Returns: Dictionary of keyword arguments for eSurge initialization

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> kwargs = to_esurge_kwargs(cfg)
>>> kwargs["max_model_len"]
4096
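
The "extract a section, then fill in defaults" pattern can be sketched generically. The fallback values and the section_with_defaults helper below are hypothetical; which defaults the real function infers, and from which configuration sections, is not specified here.

```python
from collections.abc import Mapping
from typing import Any


def section_with_defaults(
    cfg: Mapping[str, Any],
    section: str,
    defaults: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge one config section over fallback values (hypothetical helper)."""
    merged = dict(defaults)
    # Explicit, non-None values in the section win over the fallbacks.
    merged.update(
        {key: value for key, value in cfg.get(section, {}).items() if value is not None}
    )
    return merged


# Fallback values invented for this example only.
assumed_defaults = {"max_model_len": 2048, "max_num_seqs": 16}

cfg = {
    "model": {"name_or_path": "meta-llama/Llama-2-7b"},
    "esurge": {"max_model_len": 4096},
}
kwargs = section_with_defaults(cfg, "esurge", assumed_defaults)
```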

easydel.infra.elarge_model.builders.to_from_pretrained_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for model.from_pretrained() calls.

Extracts and transforms configuration values from various sections into the format expected by EasyDeL’s from_pretrained methods.

Parameters: cfg_like – ELM configuration dictionary or mapping
Returns: Dictionary of keyword arguments for from_pretrained() methods

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "loader": {"dtype": "bf16"},
...     "sharding": {"axis_dims": (1, 1, 1, -1, 1)}
... }
>>> from easydel import AutoEasyDeLModelForCausalLM
>>> kwargs = to_from_pretrained_kwargs(cfg)
>>> model = AutoEasyDeLModelForCausalLM.from_pretrained(**kwargs)
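
The -1 in "axis_dims" reads like the usual reshape convention, where one axis absorbs whatever devices remain. Assuming that interpretation, the sketch below resolves the placeholder for a given device count. resolve_axis_dims is a hypothetical helper written for this example, not a function from EasyDeL.

```python
import math


def resolve_axis_dims(axis_dims: tuple[int, ...], num_devices: int) -> tuple[int, ...]:
    """Replace a single -1 so the product of all axes equals num_devices."""
    if axis_dims.count(-1) > 1:
        raise ValueError("At most one axis may be -1.")
    known = math.prod(d for d in axis_dims if d != -1)
    if -1 not in axis_dims:
        if known != num_devices:
            raise ValueError("Axis sizes do not multiply to the device count.")
        return tuple(axis_dims)
    if num_devices % known != 0:
        raise ValueError("Device count is not divisible by the fixed axes.")
    inferred = num_devices // known
    return tuple(inferred if d == -1 else d for d in axis_dims)


resolved = resolve_axis_dims((1, 1, 1, -1, 1), num_devices=8)
```

With 8 devices, the fourth axis receives all of them; with a fixed axis of size 2 elsewhere, it would receive 4.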

easydel.infra.elarge_model.builders.tokenize_dataset(dataset, tokenizer, text_field: str = 'text', output_field: str = 'tokens', max_length: int = 2048, truncation: bool = True, padding: bool | str = False, add_special_tokens: bool = True, return_attention_mask: bool = True, num_proc: int | None = None, batched: bool = True, batch_size: int = 1000, remove_columns: list[str] | None = None, keep_in_memory: bool = False)

Tokenize a dataset using the provided tokenizer.

Parameters

dataset – HuggingFace Dataset or IterableDataset to tokenize
tokenizer – HuggingFace tokenizer instance
text_field – Field name containing text to tokenize (default: “text”)
output_field – Field name for tokenized output (default: “tokens”)
max_length – Maximum sequence length (default: 2048)
truncation – Whether to truncate sequences (default: True)
padding – Padding strategy (default: False)
add_special_tokens – Add special tokens like BOS/EOS (default: True)
return_attention_mask – Return attention masks (default: True)
num_proc – Number of processes for parallel tokenization (default: None)
batched – Process examples in batches (default: True)
batch_size – Batch size for batched processing (default: 1000)
remove_columns – Columns to remove after tokenization (default: None)
keep_in_memory – Keep processed dataset in memory (default: False)

Returns

Tokenized dataset with token IDs in the output_field

Example

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
>>> tokenized = tokenize_dataset(dataset, tokenizer, text_field="content")
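
The interplay of text_field, output_field, max_length, and truncation on a single batch can be sketched with a toy encoder. The code below is an assumption about the general shape of a batched tokenization step, not the library's code: tokenize_batch and toy_encode are hypothetical, and a real tokenizer would return vocabulary IDs rather than word lengths.

```python
from collections.abc import Callable


def tokenize_batch(
    batch: dict[str, list[str]],
    encode: Callable[[str], list[int]],
    text_field: str = "text",
    output_field: str = "tokens",
    max_length: int = 2048,
    truncation: bool = True,
) -> dict[str, list[list[int]]]:
    """Encode every text in the batch and store the IDs under output_field."""
    token_lists = []
    for text in batch[text_field]:
        ids = encode(text)
        if truncation:
            # Keep at most max_length tokens per example.
            ids = ids[:max_length]
        token_lists.append(ids)
    return {output_field: token_lists}


def toy_encode(text: str) -> list[int]:
    """Stand-in encoder: one 'token' per word, valued by word length."""
    return [len(word) for word in text.split()]


batch = {"content": ["hello world again", "hi"]}
out = tokenize_batch(batch, toy_encode, text_field="content", max_length=2)
```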
