easydel.infra.elarge_model.builders

Builder functions for creating models and inference engines from ELM configurations.

This module provides high-level functions to build EasyDeL models and eSurge inference engines from ELM configuration dictionaries.

easydel.infra.elarge_model.builders.build_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any])

Build a dataset from ELM configuration with data mixture.

Creates a unified dataset from the mixture configuration using the DatasetMixture.build() method. Supports token packing, block-deterministic mixing, and streaming.

Parameters

cfg_like – ELM configuration dictionary or mapping

Returns

The loaded and processed dataset

Return type

Dataset or IterableDataset

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> dataset = build_dataset(cfg)
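
The pack_tokens and pack_seq_length options refer to token packing. As a rough illustration of the idea only, and not EasyDeL's actual implementation, the sketch below concatenates tokenized documents and cuts them into fixed-length blocks. The helper name pack_token_sequences, the eos_token_id separator, and the choice to drop the trailing partial block are all assumptions made for this example.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


def pack_token_sequences(
    sequences: Iterable[list[int]],
    seq_length: int,
    eos_token_id: int,
) -> Iterator[list[int]]:
    """Yield fixed-length blocks built from concatenated token sequences."""
    buffer: list[int] = []
    for tokens in sequences:
        # Append the document plus a separator so boundaries stay visible.
        buffer.extend(tokens)
        buffer.append(eos_token_id)
        # Emit every full block the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Assumption: the trailing partial block is dropped in this sketch.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_token_sequences(docs, seq_length=4, eos_token_id=0))
print(packed)
```

Every emitted block has exactly seq_length tokens, so no padding is needed; a real packer may handle the leftover tokens and attention boundaries differently.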

easydel.infra.elarge_model.builders.build_esurge(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], model: easydel.infra.base_module.EasyDeLBaseModule | None = None)

Build an eSurge inference engine from ELM configuration.

Creates an eSurge instance with the model, tokenizer, and inference configuration specified in the ELM config.

Parameters
  • cfg_like – ELM configuration dictionary or mapping

  • model – Optional pre-built EasyDeLBaseModule instance (default: None)

Returns

Configured eSurge inference engine

Return type

eSurge

Raises

NotImplementedError – If the task type is not supported by eSurge

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> engine = build_esurge(cfg)

easydel.infra.elarge_model.builders.build_model(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → EasyDeLBaseModule

Build an EasyDeL model from ELM configuration.

Automatically selects the appropriate model class based on the task type specified in the configuration.

Parameters

cfg_like – ELM configuration dictionary or mapping

Returns

The loaded model instance

Return type

EasyDeLBaseModule

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b", "task": "causal_lm"},
...     "loader": {"dtype": "bf16"}
... }
>>> model = build_model(cfg)

easydel.infra.elarge_model.builders.build_sharded_source(cfg_like: ELMConfig | Mapping[str, Any]) → ShardedDataSource | None

Build a ShardedDataSource from ELM configuration.

Uses the ShardedDataSource architecture for efficient streaming and lazy transforms. Supports mixing, packing, and field transforms.

This function creates a unified ShardedDataSource from the mixture configuration, optionally applying:

  • Field renaming via transforms

  • Dataset mixing via MixedShardedSource

  • Sequence packing via PackedShardedSource

Parameters

cfg_like – ELM configuration dictionary or mapping containing a ‘mixture’ section with dataset configurations.

Returns

ShardedDataSource if mixture is configured, None otherwise.

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "use_sharded_source": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> source = build_sharded_source(cfg)
>>> for batch in source.open_shard(source.shard_names[0]):
...     process(batch)
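
The lazy field renaming and dataset mixing mentioned above can be pictured with plain Python generators. This is a conceptual sketch only: rename_field and mix_sources are hypothetical helpers written for this example, not part of EasyDeL, and the weighted sampling shown here is an assumed strategy rather than what MixedShardedSource does.

```python
from __future__ import annotations

import random
from collections.abc import Iterator


def rename_field(rows: Iterator[dict], old: str, new: str) -> Iterator[dict]:
    """Lazily rename one field in each row."""
    for row in rows:
        row = dict(row)  # copy so the caller's row is not mutated
        row[new] = row.pop(old)
        yield row


def mix_sources(
    sources: list[Iterator[dict]],
    weights: list[float],
    seed: int = 0,
) -> Iterator[dict]:
    """Lazily interleave sources, sampling by weight until all are exhausted."""
    rng = random.Random(seed)
    active = list(zip(sources, weights))
    while active:
        idx = rng.choices(range(len(active)), weights=[w for _, w in active])[0]
        try:
            yield next(active[idx][0])
        except StopIteration:
            # Drop a finished source and keep sampling from the rest.
            active.pop(idx)


web = rename_field(iter([{"content": "a"}, {"content": "b"}]), "content", "text")
code = iter([{"text": "x"}])
mixed = list(mix_sources([web, code], weights=[0.7, 0.3]))
print(mixed)
```

Nothing is read from either source until the mixed iterator is consumed, which is the property that makes this style suitable for streaming.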

easydel.infra.elarge_model.builders.build_tokenized_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], save: bool = True)

Build, tokenize, and optionally save a dataset from ELM configuration.

This is the main entry point for the tokenization pipeline. It:

  1. Loads the dataset from the mixture configuration

  2. Tokenizes using the specified tokenizer

  3. Optionally saves to disk or HuggingFace Hub

Parameters
  • cfg_like – ELM configuration dictionary or mapping

  • save – Whether to save the tokenized dataset (default: True)

Returns

Tuple of (tokenized_dataset, save_path) if save=True, else tokenized_dataset

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "streaming": False,  # Must be False for saving
...         "tokenization": {
...             "max_length": 2048,
...             "text_field": "text",
...             "output_field": "tokens",
...             "num_proc": 4
...         },
...         "save": {
...             "output_path": "tokenized_data",
...             "format": "parquet"
...         }
...     }
... }
>>> dataset, path = build_tokenized_dataset(cfg)
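
Because the return shape depends on save, calling code that toggles the flag may want to normalize the result. The helper below is a hypothetical convenience written for this example (it is not part of the module); it relies only on the return contract documented above and uses stand-in values in place of a real dataset.

```python
from __future__ import annotations

from typing import Any


def split_tokenized_result(result: Any, save: bool) -> tuple[Any, str | None]:
    """Normalize the documented return value to (dataset, save_path_or_None)."""
    if save:
        # Documented shape when save=True: (tokenized_dataset, save_path).
        dataset, save_path = result
        return dataset, save_path
    # Documented shape when save=False: the tokenized dataset alone.
    return result, None


# Stand-in values; a real call would pass the result of build_tokenized_dataset.
fake_dataset = ["row-0", "row-1"]
saved = split_tokenized_result((fake_dataset, "tokenized_data"), save=True)
unsaved = split_tokenized_result(fake_dataset, save=False)
print(saved, unsaved)
```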

easydel.infra.elarge_model.builders.save_dataset(dataset, output_path: str, format: str = 'parquet', num_shards: int | None = None, compression: str | None = 'snappy', max_shard_size: str | int = '500MB', overwrite: bool = False, push_to_hub: bool = False, hub_repo_id: str | None = None, hub_private: bool = False, hub_token: str | None = None)

Save a dataset to disk or HuggingFace Hub.

Parameters
  • dataset – HuggingFace Dataset to save

  • output_path – Path to save the dataset

  • format – Output format, one of “parquet”, “arrow”, “json”, or “jsonl” (default: “parquet”)

  • num_shards – Number of shards (default: None, auto-detect)

  • compression – Compression algorithm (default: “snappy”)

  • max_shard_size – Maximum shard size (default: “500MB”)

  • overwrite – Whether to overwrite existing files (default: False)

  • push_to_hub – Push to HuggingFace Hub (default: False)

  • hub_repo_id – Hub repository ID (required if push_to_hub=True)

  • hub_private – Make Hub repo private (default: False)

  • hub_token – HuggingFace token (default: None)

Returns

Path to saved dataset or Hub URL if pushed

Example

>>> save_dataset(tokenized_dataset, "output/tokenized", format="parquet")
>>> # Or push to hub
>>> save_dataset(tokenized_dataset, "output/tokenized",
...              push_to_hub=True, hub_repo_id="username/my-dataset")
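
The parameter list implies two constraints: format must be one of the listed values, and hub_repo_id must be set when push_to_hub=True. Whether save_dataset itself raises on violations is not stated here, so the sketch below shows a hypothetical pre-flight check (check_save_args is an invented name) that a caller could run before a long tokenization job.

```python
from __future__ import annotations

# Formats listed in the parameter description above.
SUPPORTED_FORMATS = {"parquet", "arrow", "json", "jsonl"}


def check_save_args(
    format: str = "parquet",
    push_to_hub: bool = False,
    hub_repo_id: str | None = None,
) -> None:
    """Raise ValueError for argument combinations the parameter list rules out."""
    if format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {format!r}; expected one of {sorted(SUPPORTED_FORMATS)}"
        )
    if push_to_hub and not hub_repo_id:
        raise ValueError("hub_repo_id is required when push_to_hub=True")


# Passes silently: the combination is consistent with the documentation.
check_save_args(format="parquet", push_to_hub=True, hub_repo_id="username/my-dataset")
```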

easydel.infra.elarge_model.builders.to_data_mixture_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for DatasetMixture creation.

Transforms the mixture configuration section into the format expected by the DatasetMixture and DataManager classes. Supports token packing and block-deterministic mixing.

Parameters

cfg_like – ELM configuration dictionary or mapping

Returns

Dictionary of keyword arguments for DatasetMixture initialization

Example

>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "train.json", "content_field": "text"}
...         ],
...         "batch_size": 32,
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> kwargs = to_data_mixture_kwargs(cfg)
>>> mixture = DatasetMixture(**kwargs)

easydel.infra.elarge_model.builders.to_esurge_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for eSurge initialization.

Extracts eSurge-specific configuration values and infers defaults from base configuration when needed.

Parameters

cfg_like – ELM configuration dictionary or mapping

Returns

Dictionary of keyword arguments for eSurge initialization

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> kwargs = to_esurge_kwargs(cfg)
>>> kwargs["max_model_len"]
4096

easydel.infra.elarge_model.builders.to_from_pretrained_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]

Convert ELM configuration to kwargs for model.from_pretrained() calls.

Extracts and transforms configuration values from various sections into the format expected by EasyDeL’s from_pretrained methods.

Parameters

cfg_like – ELM configuration dictionary or mapping

Returns

Dictionary of keyword arguments for from_pretrained() methods

Example

>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "loader": {"dtype": "bf16"},
...     "sharding": {"axis_dims": (1, 1, 1, -1, 1)}
... }
>>> kwargs = to_from_pretrained_kwargs(cfg)
>>> model = AutoEasyDeLModelForCausalLM.from_pretrained(**kwargs)
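
The axis_dims value in the example contains a -1. This page does not say how that entry is interpreted; the sketch below assumes it works like a reshape placeholder that absorbs whatever devices the other axes leave over. resolve_axis_dims is a hypothetical helper written for illustration, not an EasyDeL function.

```python
from __future__ import annotations

import math


def resolve_axis_dims(axis_dims: tuple[int, ...], device_count: int) -> tuple[int, ...]:
    """Replace a single -1 entry with the size that uses all devices."""
    if axis_dims.count(-1) > 1:
        raise ValueError("at most one axis may be -1")
    # Product of the explicitly sized axes.
    known = math.prod(d for d in axis_dims if d != -1)
    if -1 not in axis_dims:
        if known != device_count:
            raise ValueError(f"axis_dims use {known} devices, but {device_count} are available")
        return axis_dims
    if device_count % known != 0:
        raise ValueError(f"{device_count} devices cannot be split evenly across axes of size {known}")
    # Assumption: the -1 axis takes all remaining devices.
    return tuple(device_count // known if d == -1 else d for d in axis_dims)


resolved = resolve_axis_dims((1, 1, 1, -1, 1), device_count=8)
print(resolved)
```

Under this assumption, the example configuration places all eight devices of a hypothetical host on the fourth mesh axis.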

easydel.infra.elarge_model.builders.tokenize_dataset(dataset, tokenizer, text_field: str = 'text', output_field: str = 'tokens', max_length: int = 2048, truncation: bool = True, padding: bool | str = False, add_special_tokens: bool = True, return_attention_mask: bool = True, num_proc: int | None = None, batched: bool = True, batch_size: int = 1000, remove_columns: list[str] | None = None, keep_in_memory: bool = False)

Tokenize a dataset using the provided tokenizer.

Parameters
  • dataset – HuggingFace Dataset or IterableDataset to tokenize

  • tokenizer – HuggingFace tokenizer instance

  • text_field – Field name containing text to tokenize (default: “text”)

  • output_field – Field name for tokenized output (default: “tokens”)

  • max_length – Maximum sequence length (default: 2048)

  • truncation – Whether to truncate sequences (default: True)

  • padding – Padding strategy (default: False)

  • add_special_tokens – Add special tokens like BOS/EOS (default: True)

  • return_attention_mask – Return attention masks (default: True)

  • num_proc – Number of processes for parallel tokenization (default: None)

  • batched – Process examples in batches (default: True)

  • batch_size – Batch size for batched processing (default: 1000)

  • remove_columns – Columns to remove after tokenization (default: None)

  • keep_in_memory – Keep processed dataset in memory (default: False)

Returns

Tokenized dataset with token IDs in the output_field

Example

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
>>> tokenized = tokenize_dataset(dataset, tokenizer, text_field="content")
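
To make the roles of text_field, output_field, max_length, and truncation concrete without downloading a tokenizer, the sketch below applies a toy whitespace tokenizer to one batch of rows. tokenize_batch and the tiny vocab are stand-ins invented for this example; the real function uses the HuggingFace tokenizer passed to it and may produce additional fields such as attention masks.

```python
from __future__ import annotations


def tokenize_batch(
    batch: dict[str, list[str]],
    vocab: dict[str, int],
    text_field: str = "text",
    output_field: str = "tokens",
    max_length: int = 2048,
    truncation: bool = True,
) -> dict[str, list[list[int]]]:
    """Map a batch of texts to token-ID lists, in the shape of a batched map function."""
    unk_id = vocab.get("<unk>", 0)
    encoded = []
    for text in batch[text_field]:
        # Toy tokenizer: split on whitespace and look each word up in the vocab.
        ids = [vocab.get(word, unk_id) for word in text.split()]
        if truncation:
            ids = ids[:max_length]
        encoded.append(ids)
    return {output_field: encoded}


vocab = {"<unk>": 0, "hello": 1, "world": 2}
batch = {"content": ["hello world", "hello there world again"]}
out = tokenize_batch(batch, vocab, text_field="content", max_length=3)
print(out)
```

The second row is cut to three IDs because truncation is enabled, and the unknown word maps to the placeholder ID.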