from __future__ import annotations

import contextlib
import functools
import json
import math
import os
import re
import shutil
import sys
import warnings
from collections import OrderedDict
from contextlib import contextmanager
from functools import partial
from types import MethodType
from typing import Any, Callable, Union

import torch
import torch.utils.hooks as hooks
from huggingface_hub import split_torch_state_dict_into_shards

from .utils import is_torchao_available
from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
from .data_loader import DataLoaderDispatcher, prepare_data_loader, skip_first_batches
from .logging import get_logger
from .optimizer import AcceleratedOptimizer
from .scheduler import AcceleratedScheduler
from .state import AcceleratorState, GradientState, PartialState
from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
from .utils import (
    MODEL_NAME, SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME, SAFE_WEIGHTS_PATTERN_NAME, WEIGHTS_INDEX_NAME,
    WEIGHTS_NAME, WEIGHTS_PATTERN_NAME, AORecipeKwargs, AutocastKwargs, DataLoaderConfiguration, DeepSpeedPlugin,
    DistributedDataParallelKwargs, DistributedType, DynamoBackend, FP8RecipeKwargs, FullyShardedDataParallelPlugin,
    GradientAccumulationPlugin, GradScalerKwargs, InitProcessGroupKwargs, KwargsHandler, LoggerType,
    MegatronLMPlugin, MSAMPRecipeKwargs, PrecisionType, ProfileKwargs, ProjectConfiguration, RNGType,
    TERecipeKwargs, TorchDynamoPlugin, TorchTensorParallelPlugin, apply_fp8_autowrap, check_os_kernel,
    clean_state_dict_for_safetensors, compare_versions, convert_model, convert_model_to_fp8_ao,
    convert_outputs_to_fp32, ensure_weights_retied, extract_model_from_parallel, fsdp2_prepare_model,
    fsdp2_switch_optimizer_parameters, gather, gather_object, get_fsdp2_grad_scaler, get_grad_scaler,
    get_mixed_precision_context_manager, get_pretty_name, has_offloaded_params, is_bf16_available,
    is_bitsandbytes_multi_backend_available, is_deepspeed_available, is_ipex_available, is_lomo_available,
    is_megatron_lm_available, is_mlu_available, is_msamp_available, is_musa_available, is_npu_available,
    is_torch_version, is_torch_xla_available, is_transformer_engine_available, is_xpu_available, load_fsdp_model,
    load_fsdp_optimizer, pad_across_processes, parse_choice_from_env, recursively_apply, reduce, release_memory,
    save, save_fsdp_model, save_fsdp_optimizer, wait_for_everyone,
)
from .utils.constants import (
    BETA_TP_AVAILABLE_PYTORCH_VERSION, BETA_TP_AVAILABLE_TRANSFORMERS_VERSION, FSDP2_PYTORCH_VERSION,
    FSDP_PYTORCH_VERSION, PROFILE_PATTERN_NAME,
)
from .utils.modeling import get_state_dict_offloaded_model
from .utils.other import is_compiled_module


if is_deepspeed_available():
    from .utils import (
        DeepSpeedEngineWrapper, DeepSpeedOptimizerWrapper, DeepSpeedSchedulerWrapper, DummyOptim, DummyScheduler,
        map_pytorch_optim_to_deepspeed,
    )

if is_megatron_lm_available():
    from .utils import (
        MegatronEngine, MegatronLMDummyDataLoader, MegatronLMDummyScheduler, MegatronLMOptimizerWrapper,
        MegatronLMSchedulerWrapper, megatron_lm_initialize, megatron_lm_prepare_data_loader,
        megatron_lm_prepare_model_optimizer_scheduler,
    )

from torch.distributed.algorithms.join import Join

if is_torch_xla_available():
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

if is_npu_available(check_device=False):
    import torch_npu  # noqa: F401

try:
    from torch.optim.lr_scheduler import LRScheduler
except ImportError:
    from torch.optim.lr_scheduler import _LRScheduler as LRScheduler


logger = get_logger(__name__)

# Sentinels used to detect whether a dataloader-related argument was explicitly passed
_split_batches = object()
_dispatch_batches = object()
_even_batches = object()
_use_seedable_sampler = object()


class Accelerator:
    """
Creates an instance of an accelerator for distributed training or mixed precision training.
Args:
device_placement (`bool`, *optional*, defaults to `True`):
Whether or not the accelerator should put objects on device (tensors yielded by the dataloader, model,
etc...).
mixed_precision (`str`, *optional*):
Whether or not to use mixed precision training. Choose from 'no', 'fp16', 'bf16' or 'fp8'. Will default to
the value in the environment variable `ACCELERATE_MIXED_PRECISION`, which will use the default value in the
accelerate config of the current system or the flag passed with the `accelerate.launch` command. 'fp8'
requires the installation of transformers-engine.
gradient_accumulation_steps (`int`, *optional*, defaults to 1):
The number of steps that should pass before gradients are accumulated. A number > 1 should be combined with
`Accelerator.accumulate`. If not passed, will default to the value in the environment variable
`ACCELERATE_GRADIENT_ACCUMULATION_STEPS`. Can also be configured through a `GradientAccumulationPlugin`.
cpu (`bool`, *optional*):
Whether or not to force the script to execute on CPU. Will ignore any available GPU if set to `True` and force
the execution on one process only.
dataloader_config (`DataLoaderConfiguration`, *optional*):
A configuration for how the dataloaders should be handled in distributed scenarios.
deepspeed_plugin ([`~utils.DeepSpeedPlugin`] or dict of `str`: [`~utils.DeepSpeedPlugin`], *optional*):
Tweak your DeepSpeed related args using this argument. This argument is optional and can be configured
directly using *accelerate config*. If using multiple plugins, use the configured `key` property of each
plugin to access them from `accelerator.state.get_deepspeed_plugin(key)`. Alias for `deepspeed_plugins`.
fsdp_plugin ([`~utils.FullyShardedDataParallelPlugin`], *optional*):
Tweak your FSDP related args using this argument. This argument is optional and can be configured directly
using *accelerate config*
torch_tp_plugin ([`~utils.TorchTensorParallelPlugin`], *optional*):
Tweak your torch tensor parallel related args using this argument. This argument is optional and can be
configured directly using
*accelerate config*
megatron_lm_plugin ([`~utils.MegatronLMPlugin`], *optional*):
Tweak your MegatronLM related args using this argument. This argument is optional and can be configured
directly using *accelerate config*
rng_types (list of `str` or [`~utils.RNGType`]):
The list of random number generators to synchronize at the beginning of each iteration in your prepared
dataloaders. Should be one or several of:
- `"torch"`: the base torch random number generator
- `"cuda"`: the CUDA random number generator (GPU only)
- `"xla"`: the XLA random number generator (TPU only)
- `"generator"`: the `torch.Generator` of the sampler (or batch sampler if there is no sampler in your
dataloader) or of the iterable dataset (if it exists) if the underlying dataset is of that type.
Will default to `["torch"]` for PyTorch versions <=1.5.1 and `["generator"]` for PyTorch versions >= 1.6.
log_with (list of `str`, [`~utils.LoggerType`] or [`~tracking.GeneralTracker`], *optional*):
A list of loggers to be setup for experiment tracking. Should be one or several of:
- `"all"`
- `"tensorboard"`
- `"wandb"`
- `"comet_ml"`
If `"all"` is selected, will pick up all available trackers in the environment and initialize them. Can
also accept implementations of `GeneralTracker` for custom trackers, and can be combined with `"all"`.
project_config ([`~utils.ProjectConfiguration`], *optional*):
A configuration for how saving the state can be handled.
project_dir (`str`, `os.PathLike`, *optional*):
A path to a directory for storing data such as logs of locally-compatible loggers and potentially saved
checkpoints.
step_scheduler_with_optimizer (`bool`, *optional*, defaults to `True`):
Set `True` if the learning rate scheduler is stepped at the same time as the optimizer, `False` if only
done under certain circumstances (at the end of each epoch, for instance).
kwargs_handlers (list of [`~utils.KwargsHandler`], *optional*):
A list of [`~utils.KwargsHandler`] to customize how the objects related to distributed training, profiling
or mixed precision are created. See [kwargs](kwargs) for more information.
dynamo_backend (`str` or [`~utils.DynamoBackend`], *optional*, defaults to `"no"`):
Set to one of the possible dynamo backends to optimize your training with torch dynamo.
dynamo_plugin ([`~utils.TorchDynamoPlugin`], *optional*):
A configuration for how torch dynamo should be handled, if more tweaking than just the `backend` or `mode`
is needed.
gradient_accumulation_plugin ([`~utils.GradientAccumulationPlugin`], *optional*):
A configuration for how gradient accumulation should be handled, if more tweaking than just the
`gradient_accumulation_steps` is needed.
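As an illustration of how the arguments above combine, here is a minimal sketch (illustrative only; `model`,
`optimizer`, and `train_dataloader` are assumed to be defined elsewhere):

```python
from accelerate import Accelerator

# bf16 mixed precision, with gradients accumulated over 4 steps
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for batch in train_dataloader:
    # inside `accumulate`, gradients are only synced and applied every 4th batch
    with accelerator.accumulate(model):
        loss = model(**batch).loss  # assumes a transformers-style model that returns `.loss`
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```
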
**Available attributes:**
- **device** (`torch.device`) -- The device to use.
- **distributed_type** ([`~utils.DistributedType`]) -- The distributed training configuration.
- **local_process_index** (`int`) -- The process index on the current machine.
- **mixed_precision** (`str`) -- The configured mixed precision mode.
- **num_processes** (`int`) -- The total number of processes used for training.
- **optimizer_step_was_skipped** (`bool`) -- Whether or not the optimizer update was skipped (because of
gradient overflow in mixed precision), in which
case the learning rate should not be changed.
- **process_index** (`int`) -- The overall index of the current process among all processes.
- **state** ([`~state.AcceleratorState`]) -- The distributed setup state.
- **sync_gradients** (`bool`) -- Whether the gradients are currently being synced across all processes.
- **use_distributed** (`bool`) -- Whether the current configuration is for distributed training.
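
A short sketch of how these attributes are typically consulted in a training loop (assuming `accelerator` and
`model` were created as in the example above and `scheduler` is a learning-rate scheduler; the clipping
threshold is arbitrary):

```python
print(accelerator.device, accelerator.process_index, accelerator.num_processes)

# Only clip when gradients are actually being synced on this step
if accelerator.sync_gradients:
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Don't advance the LR schedule if the optimizer update was skipped
# (e.g. because of a gradient overflow under fp16)
if not accelerator.optimizer_step_was_skipped:
    scheduler.step()
```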
    """

    def __init__(
        self,
        device_placement: bool = True,
        split_batches: bool = _split_batches,
        mixed_precision: PrecisionType | str | None = None,
        gradient_accumulation_steps: int = 1,
        cpu: bool = False,
        dataloader_config: DataLoaderConfiguration | None = None,
        deepspeed_plugin: DeepSpeedPlugin | dict[str, DeepSpeedPlugin] | None = None,
        fsdp_plugin: FullyShardedDataParallelPlugin | None = None,
        torch_tp_plugin: TorchTensorParallelPlugin | None = None,
        megatron_lm_plugin: MegatronLMPlugin | None = None,
        rng_types: list[str | RNGType] | None = None,
        log_with: str | LoggerType | GeneralTracker | list[str | LoggerType | GeneralTracker] | None = None,
        project_dir: str | os.PathLike | None = None,
        project_config: ProjectConfiguration | None = None,
        gradient_accumulation_plugin: GradientAccumulationPlugin | None = None,
        step_scheduler_with_optimizer: bool = True,
        kwargs_handlers: list[KwargsHandler] | None = None,
        dynamo_backend: DynamoBackend | str | None = None,
        dynamo_plugin: TorchDynamoPlugin | None = None,
        deepspeed_plugins: DeepSpeedPlugin | dict[str, DeepSpeedPlugin] | None = None,
    ):
        self.trackers = []
        if project_config is not None:
            self.project_configuration = project_config
        else:
            self.project_configuration = ProjectConfiguration(project_dir=project_dir)
        if project_dir is not None and self.project_dir is None:
            self.project_configuration.set_directories(project_dir)

        if mixed_precision is not None:
            mixed_precision = str(mixed_precision)
            if mixed_precision not in PrecisionType:
                raise ValueError(
                    f"Unknown mixed_precision mode: {mixed_precision}. Choose between {PrecisionType.list()}"
                )

        if dynamo_plugin is not None and dynamo_backend is not None:
            raise ValueError("You cannot pass in both `dynamo_plugin` and `dynamo_backend`, please only pass in one.")

        if dynamo_backend is not None:
            dynamo_plugin = TorchDynamoPlugin(backend=dynamo_backend)
        elif dynamo_plugin is None:
            dynamo_plugin = TorchDynamoPlugin()

        if deepspeed_plugins is not None and deepspeed_plugin is not None:
            raise ValueError("You cannot pass in both `deepspeed_plugins` and `deepspeed_plugin`.")
        elif deepspeed_plugin is not None:
            deepspeed_plugins = deepspeed_plugin

        if deepspeed_plugins is None:
            # Reuse the plugin(s) from an already-initialized state if another `Accelerator` set up DeepSpeed
            if PartialState._shared_state != {} and AcceleratorState().distributed_type == DistributedType.DEEPSPEED:
                deepspeed_plugins = AcceleratorState().deepspeed_plugins