import logging
import os
from copy import deepcopy
from typing import Dict, List, Optional, Union

import torch
import torch.nn as nn

from accelerate.utils.imports import (
    is_4bit_bnb_available,
    is_8bit_bnb_available,
)

from ..big_modeling import dispatch_model, init_empty_weights
from .dataclasses import BnbQuantizationConfig
from .modeling import (
    find_tied_parameters,
    get_balanced_memory,
    infer_auto_device_map,
    load_checkpoint_in_model,
    offload_weight,
    set_module_tensor_to_device,
)


logger = logging.getLogger(__name__)


def load_and_quantize_model(
    model: torch.nn.Module,
    bnb_quantization_config: BnbQuantizationConfig,
    weights_location: Union[str, os.PathLike] = None,
    device_map: Optional[Dict[str, Union[int, str, torch.device]]] = None,
    no_split_module_classes: Optional[List[str]] = None,
    max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None,
    offload_folder: Optional[Union[str, os.PathLike]] = None,
    offload_state_dict: bool = False,
):
    """
    This function quantizes the input model with the config passed in `bnb_quantization_config`. If the model is on
    the meta device, the weights are loaded and dispatched according to the `device_map` passed. If the model is
    already loaded, it is quantized and moved to the GPU.

    Args:
        model (`torch.nn.Module`):
            Input model. The model can be already loaded or on the meta device.
        bnb_quantization_config (`BnbQuantizationConfig`):
            The bitsandbytes quantization parameters.
        weights_location (`str` or `os.PathLike`):
            The location of the weights to load. It can be:
            - a path to a file containing a whole model state dict
            - a path to a `.json` file containing the index to a sharded checkpoint
            - a path to a folder containing a unique `.index.json` file and the shards of a checkpoint
            - a path to a folder containing a unique pytorch_model.bin file
        device_map (`Dict[str, Union[int, str, torch.device]]`, *optional*):
            A map that specifies where each submodule should go. It doesn't need to be refined to each
            parameter/buffer name; once a given module name is inside, every submodule of it will be sent to the
            same device.
        no_split_module_classes (`List[str]`, *optional*):
            A list of layer class names that should never be split across devices (for instance any layer that has a
            residual connection).
        max_memory (`Dict`, *optional*):
            A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available
            if unset.
        offload_folder (`str` or `os.PathLike`, *optional*):
            If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
        offload_state_dict (`bool`, *optional*, defaults to `False`):
            If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM
            if the weight of the CPU state dict + the biggest shard does not fit.

    Returns:
        `torch.nn.Module`: The quantized model
    """
    load_in_4bit = bnb_quantization_config.load_in_4bit
    load_in_8bit = bnb_quantization_config.load_in_8bit

    if load_in_8bit and not is_8bit_bnb_available():
        raise ImportError(
            "You have a version of `bitsandbytes` that is not compatible with 8bit quantization,"
            " make sure you have the latest version of `bitsandbytes` installed."
        )
    if load_in_4bit and not is_4bit_bnb_available():
        raise ValueError(
            "You have a version of `bitsandbytes` that is not compatible with 4bit quantization,"
            " make sure you have the latest version of `bitsandbytes` installed."
        )

    # Modules placed on the CPU or the disk by a custom device map
    modules_on_cpu = []
    if isinstance(device_map, dict) and len(device_map.keys()) > 1:
        modules_on_cpu = [key for key, value in device_map.items() if value in ["disk", "cpu"]]

    # We keep some modules such as the lm_head in their original dtype for numerical stability reasons
    if bnb_quantization_config.skip_modules is None:
        bnb_quantization_config.skip_modules = get_keys_to_not_convert(model)

    # Add cpu modules to skip modules only for 4-bit modules
    if load_in_4bit:
        bnb_quantization_config.skip_modules.extend(modules_on_cpu)
    modules_to_not_convert = bnb_quantization_config.skip_modules

    # We add the modules we want to keep in full precision
    if bnb_quantization_config.keep_in_fp32_modules is None:
        bnb_quantization_config.keep_in_fp32_modules = []
    keep_in_fp32_modules = bnb_quantization_config.keep_in_fp32_modules
    modules_to_not_convert.extend(keep_in_fp32_modules)

    # Compatibility with peft
    model.is_loaded_in_4bit = load_in_4bit
    model.is_loaded_in_8bit = load_in_8bit

    model_device = get_parameter_device(model)
    if model_device.type != "meta":
        # Quantization of an already loaded model
        logger.warning(
            "It is not recommended to quantize a loaded model. "
            "The model should be instantiated under the `init_empty_weights` context manager."
        )
        model = replace_with_bnb_layers(model, bnb_quantization_config, modules_to_not_convert=modules_to_not_convert)
        # Convert the parameters to the right dtype
        dtype = bnb_quantization_config.torch_dtype
        for name, param in model.state_dict().items():
            if any(module_to_keep_in_fp32 in name for module_to_keep_in_fp32 in keep_in_fp32_modules):
                param.to(torch.float32)
                if param.dtype != torch.float32:
                    name = name.replace(".weight", "").replace(".bias", "")
                    param = getattr(model, name, None)
                    if param is not None:
                        param.to(torch.float32)
            elif torch.is_floating_point(param):
                param.to(dtype)
        if model_device.type == "cuda":
            # Move everything to cpu first because we can't quantize if the weights are already on cuda
            model.cuda(torch.cuda.current_device())
            torch.cuda.empty_cache()
        elif torch.cuda.is_available():
            model.to(torch.cuda.current_device())
        elif torch.xpu.is_available():
            model.to(torch.xpu.current_device())
        else:
            raise RuntimeError("No GPU found. A GPU is needed for quantization.")
        logger.info(
            f"The model device type is {model_device.type}. However, gpu is needed for quantization. "
            "We move the model to gpu."
        )
        return model

    elif weights_location is None:
        raise RuntimeError(
            f"`weights_location` needs to be the folder path containing the weights of the model, but we found {weights_location} "
        )

    else:
        with init_empty_weights():
            model = replace_with_bnb_layers(
                model, bnb_quantization_config, modules_to_not_convert=modules_to_not_convert
            )
        device_map = get_quantized_model_device_map(
            model,
            bnb_quantization_config,
            device_map,
            max_memory=max_memory,
            no_split_module_classes=no_split_module_classes,
        )
        if offload_state_dict is None and device_map is not None and "disk" in device_map.values():
            offload_state_dict = True

        offload = any(x in list(device_map.values()) for x in ["cpu", "disk"])

        load_checkpoint_in_model(
            model,
            weights_location,
            device_map,
            dtype=bnb_quantization_config.torch_dtype,
            offload_folder=offload_folder,
            offload_state_dict=offload_state_dict,
            keep_in_fp32_modules=bnb_quantization_config.keep_in_fp32_modules,
            offload_8bit_bnb=load_in_8bit and offload,
        )
        return dispatch_model(model, device_map=device_map, offload_dir=offload_folder)
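
# Hedged usage sketch (not part of the library): quantizing a checkpoint into 8-bit
# starting from an empty-weight model, which is the recommended path above. `MyModel`
# and the checkpoint path are hypothetical placeholders; `BnbQuantizationConfig` and
# `load_and_quantize_model` are assumed to be re-exported from `accelerate.utils`.
#
#     from accelerate import init_empty_weights
#     from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
#
#     with init_empty_weights():
#         empty_model = MyModel(my_model_config)  # hypothetical model class/config
#
#     bnb_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6.0)
#     quantized_model = load_and_quantize_model(
#         empty_model,
#         bnb_quantization_config=bnb_config,
#         weights_location="path/to/checkpoint",  # hypothetical weights folder
#         device_map="auto",
#     )
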
def get_quantized_model_device_map(
    model, bnb_quantization_config, device_map=None, max_memory=None, no_split_module_classes=None
):
    if device_map is None:
        if torch.cuda.is_available():
            device_map = {"": torch.cuda.current_device()}
        elif torch.xpu.is_available():
            device_map = {"": torch.xpu.current_device()}
        else:
            raise RuntimeError("No GPU found. A GPU is needed for quantization.")
        logger.info("The device_map was not initialized. Setting device_map to `{'':torch.cuda.current_device()}`.")

    if isinstance(device_map, str):
        if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]:
            raise ValueError(
                "If passing a string for `device_map`, please choose 'auto', 'balanced', 'balanced_low_0' or "
                "'sequential'."
            )

        special_dtypes = {}
        special_dtypes.update(
            {
                name: bnb_quantization_config.torch_dtype
                for name, _ in model.named_parameters()
                if any(m in name for m in bnb_quantization_config.skip_modules)
            }
        )
        special_dtypes.update(
            {
                name: torch.float32
                for name, _ in model.named_parameters()
                if any(m in name for m in bnb_quantization_config.keep_in_fp32_modules)
            }
        )

        kwargs = {}
        kwargs["special_dtypes"] = special_dtypes
        kwargs["no_split_module_classes"] = no_split_module_classes
        kwargs["dtype"] = bnb_quantization_config.target_dtype

        # Get max_memory for each device
        if device_map != "sequential":
            max_memory = get_balanced_memory(
                model,
                low_zero=(device_map == "balanced_low_0"),
                max_memory=max_memory,
                **kwargs,
            )

        kwargs["max_memory"] = max_memory
        device_map = infer_auto_device_map(model, **kwargs)

    if isinstance(device_map, dict):
        # Check that no quantized module ends up on the cpu or the disk
        modules_not_to_convert = bnb_quantization_config.skip_modules + bnb_quantization_config.keep_in_fp32_modules

        device_map_without_some_modules = {
            key: device_map[key] for key in device_map.keys() if key not in modules_not_to_convert
        }
        for device in ["cpu", "disk"]:
            if device in device_map_without_some_modules.values():
                if bnb_quantization_config.load_in_4bit:
                    raise ValueError(
                        """
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in `torch_dtype`, you need to pass a custom `device_map` to
                        `load_and_quantize_model`. Check
                        https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization#offload-modules-to-cpu-and-disk
                        for more details.
                        """
                    )
                else:
                    logger.info(
                        "Some modules are offloaded to the CPU or the disk. Note that these modules will be "
                        "converted to 8-bit."
                    )
        del device_map_without_some_modules
    return device_map
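
# Hedged sketch of the two `device_map` forms handled above. With a string strategy,
# `max_memory` caps per-device usage (keys are GPU indices and "cpu", values are
# sizes in the usual Accelerate format); the budget and layout below are illustrative.
#
#     device_map = get_quantized_model_device_map(
#         model,
#         bnb_quantization_config,
#         device_map="auto",
#         max_memory={0: "10GiB", "cpu": "30GiB"},  # illustrative budget
#     )
#
#     # Or pass an explicit dict, pinning the skipped modules to the CPU yourself:
#     device_map = {"": 0, "lm_head": "cpu"}  # illustrative layout
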
def replace_with_bnb_layers(model, bnb_quantization_config, modules_to_not_convert=None, current_key_name=None):
    """
    A helper function to replace all `torch.nn.Linear` modules by `bnb.nn.Linear8bitLt` modules or by
    `bnb.nn.Linear4bit` modules from the `bitsandbytes` library. The function is run recursively and replaces
    `torch.nn.Linear` modules.

    Parameters:
        model (`torch.nn.Module`):
            Input model or `torch.nn.Module` as the function is run recursively.
        modules_to_not_convert (`List[str]`):
            Names of the modules to not convert. In practice we keep the `lm_head` in full precision for numerical
            stability reasons.
        current_key_name (`List[str]`, *optional*):
            An array to track the current key of the recursion. This is used to check whether the current key (part
            of it) is not in the list of modules to not convert.
    """
    if modules_to_not_convert is None:
        modules_to_not_convert = []

    model, has_been_replaced = _replace_with_bnb_layers(
        model, bnb_quantization_config, modules_to_not_convert, current_key_name
    )
    if not has_been_replaced:
        logger.warning(
            "You are loading your model in 8bit or 4bit but no linear modules were found in your model."
            " This can happen for some architectures, such as gpt2, that use Conv1D instead of Linear layers."
            " Please double check your model architecture, or submit an issue on github if you think this is a bug."
        )
    return model


def _replace_with_bnb_layers(
    model,
    bnb_quantization_config,
    modules_to_not_convert=None,
    current_key_name=None,
):
    """
    Private method that wraps the recursion for module replacement.

    Returns the converted model and a boolean that indicates if the conversion has been successful or not.
    """
    # bitsandbytes is imported lazily because it initializes CUDA on import
    import bitsandbytes as bnb

    has_been_replaced = False
    for name, module in model.named_children():
        if current_key_name is None:
            current_key_name = []
        current_key_name.append(name)
        if isinstance(module, nn.Linear) and name not in modules_to_not_convert:
            # Check that the current key is not in `modules_to_not_convert`
            current_key_name_str = ".".join(current_key_name)
            proceed = True
            for key in modules_to_not_convert:
                if (
                    (key in current_key_name_str) and (key + "." in current_key_name_str)
                ) or key == current_key_name_str:
                    proceed = False
                    break
            if proceed:
                # Load the bnb module with empty weights and replace the `nn.Linear` module
                if bnb_quantization_config.load_in_8bit:
                    bnb_module = bnb.nn.Linear8bitLt(
                        module.in_features,
                        module.out_features,
                        module.bias is not None,
                        has_fp16_weights=False,
                        threshold=bnb_quantization_config.llm_int8_threshold,
                    )
                elif bnb_quantization_config.load_in_4bit:
                    bnb_module = bnb.nn.Linear4bit(
                        module.in_features,
                        module.out_features,
                        module.bias is not None,
                        bnb_quantization_config.bnb_4bit_compute_dtype,
                        compress_statistics=bnb_quantization_config.bnb_4bit_use_double_quant,
                        quant_type=bnb_quantization_config.bnb_4bit_quant_type,
                    )
                else:
                    raise ValueError("load_in_8bit and load_in_4bit can't be both False")
                bnb_module.weight.data = module.weight.data
                if module.bias is not None:
                    bnb_module.bias.data = module.bias.data
                bnb_module.requires_grad_(False)
                setattr(model, name, bnb_module)
                has_been_replaced = True
        if len(list(module.children())) > 0:
            _, _has_been_replaced = _replace_with_bnb_layers(
                module, bnb_quantization_config, modules_to_not_convert, current_key_name
            )
            has_been_replaced = has_been_replaced | _has_been_replaced
        # Remove the last key for the recursion
        current_key_name.pop(-1)
    return model, has_been_replaced
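
# Hedged sketch of the replacement on a toy module (requires a CUDA-enabled
# `bitsandbytes` install; the layer sizes are illustrative). After the call, each
# `nn.Linear` child has become a `bnb.nn.Linear8bitLt` that still holds the original
# fp weights; bitsandbytes quantizes them when the module is moved to the GPU.
#
#     toy = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))
#     cfg = BnbQuantizationConfig(load_in_8bit=True)
#     toy = replace_with_bnb_layers(toy, cfg)
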
def get_keys_to_not_convert(model):
    r"""
    A utility function to get the keys of the modules to keep in full precision, if any. For example, for CausalLM
    modules we may want to keep the lm_head in full precision for numerical stability reasons. For other
    architectures, we want to keep the tied weights of the model. The function returns a list of the keys of the
    modules to not convert in int8.

    Parameters:
        model (`torch.nn.Module`):
            Input model
    """
    # Create a copy of the model; this has no memory cost inside the `init_empty_weights` context manager
    with init_empty_weights():
        tied_model = deepcopy(model)

    tied_params = find_tied_parameters(tied_model)
    # For compatibility with Accelerate < 0.18
    if isinstance(tied_params, dict):
        tied_keys = sum(list(tied_params.values()), []) + list(tied_params.keys())
    else:
        tied_keys = sum(tied_params, [])
    has_tied_params = len(tied_keys) > 0

    # Check if it is a base model
    is_base_model = False
    if hasattr(model, "base_model_prefix"):
        is_base_model = not hasattr(model, model.base_model_prefix)

    # Ignore this for base models (BertModel, GPT2Model, etc.)
    if (not has_tied_params) and is_base_model:
        return []

    # Otherwise the model has an attached head
    list_modules = list(model.named_parameters())
    list_last_module = [list_modules[-1][0]]

    # Add the last module together with the tied weights
    intersection = set(list_last_module) - set(tied_keys)
    list_untouched = list(set(tied_keys)) + list(intersection)

    # Remove ".weight" and ".bias" from the keys
    names_to_remove = [".weight", ".bias"]
    filtered_module_names = []
    for name in list_untouched:
        for name_to_remove in names_to_remove:
            if name_to_remove in name:
                name = name.replace(name_to_remove, "")
        filtered_module_names.append(name)

    return filtered_module_names


def has_4bit_bnb_layers(model):
    """Check if we have `bnb.nn.Linear4bit` or `bnb.nn.Linear8bitLt` layers inside our model"""
    # bitsandbytes is imported lazily because it initializes CUDA on import
    import bitsandbytes as bnb

    for m in model.modules():
        if isinstance(m, bnb.nn.Linear4bit):
            return True
    return False


def get_parameter_device(parameter: nn.Module):
    return next(parameter.parameters()).device
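
# Hedged sketch: `has_4bit_bnb_layers` can guard code paths that do not support
# serialized 4-bit weights, e.g. a disk-offload or save path:
#
#     if has_4bit_bnb_layers(model):
#         raise ValueError("4-bit quantized modules are not supported on this path.")
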
def quantize_and_offload_8bit(model, param, param_name, new_dtype, offload_folder, offload_index, fp16_statistics):
    # If the weight is not quantized yet, quantize it and offload the quantization statistics
    if fp16_statistics is None:
        set_module_tensor_to_device(model, param_name, 0, dtype=new_dtype, value=param)
        tensor_name = param_name
        module = model
        if "." in tensor_name:
            splits = tensor_name.split(".")
            for split in splits[:-1]:
                new_module = getattr(module, split)
                if new_module is None:
                    raise ValueError(f"{module} has no attribute {split}.")
                module = new_module
            tensor_name = splits[-1]
        # Offload the weights
        module._parameters[tensor_name].requires_grad = False
        offload_weight(module._parameters[tensor_name], param_name, offload_folder, index=offload_index)
        if hasattr(module._parameters[tensor_name], "SCB"):
            offload_weight(
                module._parameters[tensor_name].SCB,
                param_name.replace("weight", "SCB"),
                offload_folder,
                index=offload_index,
            )
    else:
        offload_weight(param, param_name, offload_folder, index=offload_index)
        offload_weight(fp16_statistics, param_name.replace("weight", "SCB"), offload_folder, index=offload_index)

    set_module_tensor_to_device(model, param_name, "meta", dtype=new_dtype, value=torch.empty(*param.size()))
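
# Hedged illustration of the two branches above: with `fp16_statistics=None`, the
# weight is first materialized on GPU 0 (which lets bitsandbytes quantize it inside
# `Linear8bitLt`) before the int8 tensor and its `SCB` scale statistics are written
# to the offload folder; otherwise the already-quantized tensor and the provided
# statistics are offloaded directly. The parameter name and dtype are illustrative.
#
#     offload_index = {}
#     quantize_and_offload_8bit(
#         model,
#         param,
#         param_name="transformer.h.0.attn.c_attn.weight",  # illustrative
#         new_dtype=torch.int8,  # illustrative; set by the checkpoint-loading code
#         offload_folder="offload/",
#         offload_index=offload_index,
#         fp16_statistics=None,
#     )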