easydel.infra.elarge_model.builders
Builder functions for creating models and inference engines from ELM configurations.
This module provides high-level functions to build EasyDeL models and eSurge inference engines from ELM configuration dictionaries.
- easydel.infra.elarge_model.builders.build_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any])
Build a dataset from ELM configuration with data mixture.
Creates a unified dataset from the mixture configuration via DatasetMixture.build(). Supports token packing, block-deterministic mixing, and streaming.
- Parameters
cfg_like – ELM configuration dictionary or mapping
- Returns
The loaded and processed dataset
- Return type
Dataset or IterableDataset
Example
>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> dataset = build_dataset(cfg)
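To make the `pack_tokens` and `pack_seq_length` options more concrete, the following is a minimal pure-Python sketch of token packing in general: tokenized documents are concatenated, separated by an end-of-sequence token, and cut into fixed-length blocks. The function name `pack_sequences` and the `eos_id` parameter are hypothetical and not part of EasyDeL; the library's own packing may differ, for example in how it treats the trailing remainder.

```python
from collections.abc import Iterable, Iterator


def pack_sequences(
    sequences: Iterable[list[int]],
    seq_length: int,
    eos_id: int,
) -> Iterator[list[int]]:
    """Concatenate token sequences and yield blocks of exactly seq_length tokens."""
    buffer: list[int] = []
    for tokens in sequences:
        # Append the document followed by a separator token.
        buffer.extend(tokens)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any remainder shorter than seq_length is dropped in this sketch.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, seq_length=4, eos_id=0))
print(blocks)
```

Packing avoids padding short documents up to the block length, at the cost of blocks that may span document boundaries.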
- easydel.infra.elarge_model.builders.build_esurge(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], model: easydel.infra.base_module.EasyDeLBaseModule | None = None)
Build an eSurge inference engine from ELM configuration.
Creates an eSurge instance with the model, tokenizer, and inference configuration specified in the ELM config.
- Parameters
cfg_like – ELM configuration dictionary or mapping
model – Optional EasyDeLBaseModule instance (default: None)
- Returns
Configured eSurge inference engine
- Return type
eSurge
- Raises
NotImplementedError – If the task type is not supported by eSurge
Example
>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> engine = build_esurge(cfg)
- easydel.infra.elarge_model.builders.build_model(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → EasyDeLBaseModule
Build an EasyDeL model from ELM configuration.
Automatically selects the appropriate model class based on the task type specified in the configuration.
- Parameters
cfg_like – ELM configuration dictionary or mapping
- Returns
The loaded model instance
- Return type
EasyDeLBaseModule
Example
>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b", "task": "causal_lm"},
...     "loader": {"dtype": "bf16"}
... }
>>> model = build_model(cfg)
- easydel.infra.elarge_model.builders.build_sharded_source(cfg_like: ELMConfig | Mapping[str, Any]) → ShardedDataSource | None
Build a ShardedDataSource from ELM configuration.
Uses the new ShardedDataSource architecture for efficient streaming and lazy transforms. Supports mixing, packing, and field transforms.
This function creates a unified ShardedDataSource from the mixture configuration, optionally applying:
- Field renaming via transforms
- Dataset mixing via MixedShardedSource
- Sequence packing via PackedShardedSource
- Parameters
cfg_like – ELM configuration dictionary or mapping containing a ‘mixture’ section with dataset configurations.
- Returns
ShardedDataSource if mixture is configured, None otherwise.
Example
>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "use_sharded_source": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> source = build_sharded_source(cfg)
>>> for batch in source.open_shard(source.shard_names[0]):
...     process(batch)
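Because build_sharded_source may return None, and the example above reads only the first shard, a caller may want a helper that walks every shard and tolerates a missing source. The sketch below relies only on the two members used in the example (`shard_names` and `open_shard`). The `ShardedSourceLike` protocol, the `InMemorySource` stand-in, and `iter_all_shards` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence
from typing import Any, Protocol


class ShardedSourceLike(Protocol):
    """The two members used in the documented example."""

    @property
    def shard_names(self) -> Sequence[str]: ...

    def open_shard(self, shard_name: str) -> Iterator[Any]: ...


class InMemorySource:
    """Toy stand-in backed by a dict, used here in place of a real source."""

    def __init__(self, shards: dict[str, list[Any]]) -> None:
        self._shards = shards

    @property
    def shard_names(self) -> list[str]:
        return list(self._shards)

    def open_shard(self, shard_name: str) -> Iterator[Any]:
        return iter(self._shards[shard_name])


def iter_all_shards(source: ShardedSourceLike | None) -> Iterator[tuple[str, Any]]:
    """Yield (shard_name, example) pairs; yield nothing if source is None."""
    if source is None:
        return
    for name in source.shard_names:
        for example in source.open_shard(name):
            yield name, example


source = InMemorySource({"shard-0": [{"text": "a"}, {"text": "b"}], "shard-1": [{"text": "c"}]})
rows = list(iter_all_shards(source))
print(rows)
```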
- easydel.infra.elarge_model.builders.build_tokenized_dataset(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any], save: bool = True)
Build, tokenize, and optionally save a dataset from ELM configuration.
This is the main entry point for the tokenization pipeline. It:
1. Loads the dataset from the mixture configuration
2. Tokenizes using the specified tokenizer
3. Optionally saves to disk or HuggingFace Hub
- Parameters
cfg_like – ELM configuration dictionary or mapping
save – Whether to save the tokenized dataset (default: True)
- Returns
Tuple of (tokenized_dataset, save_path) if save=True, else tokenized_dataset
Example
>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "data.json", "content_field": "text"}
...         ],
...         "streaming": False,  # Must be False for saving
...         "tokenization": {
...             "max_length": 2048,
...             "text_field": "text",
...             "output_field": "tokens",
...             "num_proc": 4
...         },
...         "save": {
...             "output_path": "tokenized_data",
...             "format": "parquet"
...         }
...     }
... }
>>> dataset, path = build_tokenized_dataset(cfg)
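Since the return shape depends on `save` (a tuple when save=True, a bare dataset otherwise), calling code may prefer to normalize it. The helper below is a small sketch of that idea; `split_result` is a hypothetical name and not part of the module.

```python
from __future__ import annotations

from typing import Any


def split_result(result: Any, save: bool) -> tuple[Any, str | None]:
    """Normalize the documented return value to (dataset, save_path_or_None)."""
    if save:
        # With save=True the documented return is (tokenized_dataset, save_path).
        dataset, path = result
        return dataset, path
    # With save=False only the dataset is returned.
    return result, None


# Stand-in values in place of a real call to build_tokenized_dataset.
saved = split_result((["row0", "row1"], "tokenized_data"), save=True)
unsaved = split_result(["row0", "row1"], save=False)
print(saved, unsaved)
```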
- easydel.infra.elarge_model.builders.save_dataset(dataset, output_path: str, format: str = 'parquet', num_shards: int | None = None, compression: str | None = 'snappy', max_shard_size: str | int = '500MB', overwrite: bool = False, push_to_hub: bool = False, hub_repo_id: str | None = None, hub_private: bool = False, hub_token: str | None = None)
Save a dataset to disk or HuggingFace Hub.
- Parameters
dataset – HuggingFace Dataset to save
output_path – Path to save the dataset
format – Output format - “parquet”, “arrow”, “json”, “jsonl” (default: “parquet”)
num_shards – Number of shards (default: None, auto-detect)
compression – Compression algorithm (default: “snappy”)
max_shard_size – Maximum shard size (default: “500MB”)
overwrite – Whether to overwrite existing files (default: False)
push_to_hub – Push to HuggingFace Hub (default: False)
hub_repo_id – Hub repository ID (required if push_to_hub=True)
hub_private – Make Hub repo private (default: False)
hub_token – HuggingFace token (default: None)
- Returns
Path to saved dataset or Hub URL if pushed
Example
>>> save_dataset(tokenized_dataset, "output/tokenized", format="parquet")
>>> # Or push to hub
>>> save_dataset(tokenized_dataset, "output/tokenized",
...              push_to_hub=True, hub_repo_id="username/my-dataset")
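The parameter list above implies two constraints: `format` must be one of four values, and `hub_repo_id` is required when `push_to_hub=True`. The sketch below checks both before a long tokenization job is started. The function name `validate_save_args` and the use of ValueError are assumptions; the source does not say how save_dataset itself reports invalid arguments.

```python
from __future__ import annotations

# Formats listed in the parameter documentation.
SUPPORTED_FORMATS = ("parquet", "arrow", "json", "jsonl")


def validate_save_args(
    output_format: str = "parquet",
    push_to_hub: bool = False,
    hub_repo_id: str | None = None,
) -> None:
    """Raise ValueError if the arguments violate the documented constraints."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"format must be one of {SUPPORTED_FORMATS}, got {output_format!r}"
        )
    if push_to_hub and not hub_repo_id:
        raise ValueError("hub_repo_id is required when push_to_hub=True")


# Passes silently: valid format, and a repo ID is given for the Hub push.
validate_save_args("parquet", push_to_hub=True, hub_repo_id="username/my-dataset")
```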
- easydel.infra.elarge_model.builders.to_data_mixture_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]
Convert ELM configuration to kwargs for DatasetMixture creation.
Transforms the mixture configuration section into the format expected by the DatasetMixture and DataManager classes, including the token packing and block-deterministic mixing options.
- Parameters
cfg_like – ELM configuration dictionary or mapping
- Returns
Dictionary of keyword arguments for DatasetMixture initialization
Example
>>> cfg = {
...     "mixture": {
...         "informs": [
...             {"type": "json", "data_files": "train.json", "content_field": "text"}
...         ],
...         "batch_size": 32,
...         "block_mixture": True,
...         "pack_tokens": True,
...         "pack_seq_length": 2048
...     }
... }
>>> kwargs = to_data_mixture_kwargs(cfg)
>>> mixture = DatasetMixture(**kwargs)
- easydel.infra.elarge_model.builders.to_esurge_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]
Convert ELM configuration to kwargs for eSurge initialization.
Extracts eSurge-specific configuration values and infers defaults from base configuration when needed.
- Parameters
cfg_like – ELM configuration dictionary or mapping
- Returns
Dictionary of keyword arguments for eSurge initialization
Example
>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "esurge": {"max_model_len": 4096, "max_num_seqs": 32}
... }
>>> kwargs = to_esurge_kwargs(cfg)
>>> kwargs["max_model_len"]
4096
- easydel.infra.elarge_model.builders.to_from_pretrained_kwargs(cfg_like: easydel.infra.elarge_model.types.ELMConfig | collections.abc.Mapping[str, Any]) → dict[str, Any]
Convert ELM configuration to kwargs for model.from_pretrained() calls.
Extracts and transforms configuration values from various sections into the format expected by EasyDeL’s from_pretrained methods.
- Parameters
cfg_like – ELM configuration dictionary or mapping
- Returns
Dictionary of keyword arguments for from_pretrained() methods
Example
>>> cfg = {
...     "model": {"name_or_path": "meta-llama/Llama-2-7b"},
...     "loader": {"dtype": "bf16"},
...     "sharding": {"axis_dims": (1, 1, 1, -1, 1)}
... }
>>> kwargs = to_from_pretrained_kwargs(cfg)
>>> model = AutoEasyDeLModelForCausalLM.from_pretrained(**kwargs)
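The `axis_dims` value in the example contains a -1 entry. The source does not define it; the sketch below assumes it works like -1 in an array reshape, meaning "whatever is left after the fixed axes are accounted for". `resolve_axis_dims` is a hypothetical helper that shows this arithmetic for a given device count and is not an EasyDeL function.

```python
import math
from collections.abc import Sequence


def resolve_axis_dims(axis_dims: Sequence[int], num_devices: int) -> tuple[int, ...]:
    """Replace a single -1 entry so the product of all axes equals num_devices."""
    unknown = [i for i, d in enumerate(axis_dims) if d == -1]
    if len(unknown) > 1:
        raise ValueError("at most one axis may be -1")
    # Product of the explicitly sized axes (1 if there are none).
    known = math.prod(d for d in axis_dims if d != -1)
    if not unknown:
        if known != num_devices:
            raise ValueError(f"axis_dims multiply to {known}, expected {num_devices}")
        return tuple(axis_dims)
    if known <= 0 or num_devices % known != 0:
        raise ValueError(f"{num_devices} devices cannot be split over fixed axes {known}")
    resolved = list(axis_dims)
    resolved[unknown[0]] = num_devices // known
    return tuple(resolved)


print(resolve_axis_dims((1, 1, 1, -1, 1), num_devices=8))
```

Under this assumption, the tuple in the example would place all available devices on the fourth axis.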
- easydel.infra.elarge_model.builders.tokenize_dataset(dataset, tokenizer, text_field: str = 'text', output_field: str = 'tokens', max_length: int = 2048, truncation: bool = True, padding: bool | str = False, add_special_tokens: bool = True, return_attention_mask: bool = True, num_proc: int | None = None, batched: bool = True, batch_size: int = 1000, remove_columns: list[str] | None = None, keep_in_memory: bool = False)
Tokenize a dataset using the provided tokenizer.
- Parameters
dataset – HuggingFace Dataset or IterableDataset to tokenize
tokenizer – HuggingFace tokenizer instance
text_field – Field name containing text to tokenize (default: “text”)
output_field – Field name for tokenized output (default: “tokens”)
max_length – Maximum sequence length (default: 2048)
truncation – Whether to truncate sequences (default: True)
padding – Padding strategy (default: False)
add_special_tokens – Add special tokens like BOS/EOS (default: True)
return_attention_mask – Return attention masks (default: True)
num_proc – Number of processes for parallel tokenization (default: None)
batched – Process examples in batches (default: True)
batch_size – Batch size for batched processing (default: 1000)
remove_columns – Columns to remove after tokenization (default: None)
keep_in_memory – Keep processed dataset in memory (default: False)
- Returns
Tokenized dataset with token IDs in the output_field
Example
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")
>>> tokenized = tokenize_dataset(dataset, tokenizer, text_field="content")
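With `batched=True`, tokenization of this kind is usually expressed as a function that takes a dict of column lists and returns a dict of new column lists. The sketch below shows that shape and how `text_field`, `output_field`, `max_length`, and `truncation` might plug in. `ToyTokenizer` is a whitespace stand-in so the example runs without downloading a model, and `make_tokenize_fn` is a hypothetical helper; neither reflects how tokenize_dataset is actually implemented.

```python
from collections.abc import Callable
from typing import Any


class ToyTokenizer:
    """Whitespace tokenizer that assigns IDs in order of first appearance."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def __call__(
        self, texts: list[str], max_length: int, truncation: bool = True
    ) -> dict[str, list[list[int]]]:
        input_ids = []
        for text in texts:
            ids = [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]
            if truncation:
                ids = ids[:max_length]
            input_ids.append(ids)
        return {"input_ids": input_ids}


def make_tokenize_fn(
    tokenizer: Any,
    text_field: str = "text",
    output_field: str = "tokens",
    max_length: int = 2048,
    truncation: bool = True,
) -> Callable[[dict[str, list[Any]]], dict[str, list[list[int]]]]:
    """Build a batch function: dict of column lists in, dict of new columns out."""
    def tokenize_batch(batch: dict[str, list[Any]]) -> dict[str, list[list[int]]]:
        encoded = tokenizer(batch[text_field], max_length=max_length, truncation=truncation)
        return {output_field: encoded["input_ids"]}
    return tokenize_batch


batch = {"content": ["hello world", "hello there general"]}
tokenize_batch = make_tokenize_fn(ToyTokenizer(), text_field="content", max_length=2)
out = tokenize_batch(batch)
print(out)
```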