� ��g�G����ddlZddlZddlZddlZddlmZddlmZmZm Z ddl m Z d�Z dd �Zdd �Zdd�ZGd�d��ZGd�d��Zd�Zdd�Zdd�Zd�Zed d���Zdd�ZdS)!�N)�contextmanager)�Any�Dict�List�)�languagec���d�|��}dddd|zdg}tj|��}|�tjj���d��}d�|D��}|S)N�,� nvidia-smi�-i�0z --query-gpu=z--format=csv,noheader,nounitsc�,�g|]}t|����S�)�int��.0�xs �^/home/asafur/pinokio/api/open-webui.git/app/env/lib/python3.11/site-packages/triton/testing.py� <listcomp>znvsmi.<locals>.<listcomp>s�� � � �a�3�q�6�6� � � �)�join� subprocess� check_output�decode�sys�stdout�encoding�split)�attrs�cmd�out�rets r�nvsmir# ss�� �H�H�U�O�O�E� ��s�N�U�$:�<[� \�C� � !�#� &� &�C� �*�*�S�Z�(� )� )� /� /�� 4� 4�C� � �3� � � �C� �Jr��meanc��ddl}|dvsJ�|j���|j���krt d���|��|�5|D]2}|���|�d��d|_�3|j���}|j� |��5|��ddd��n #1swxYwY|j� ��|j� d���}|j� d���}|� ��|� ��|� ��|j� ��|�|��} tdt!|| z ����} |j���}|j� |��5t#| ��D]} |� |D] }d|_� |��� ddd��n #1swxYwY|j� ��g} d} t#| ��D]�} |j� d���}|j� d���}|� ��|� ��|� ��|j� ��| |�|��| z gz } ��|�| ��}t'||��|�����S) a+ Benchmark the runtime of the provided function. :param fn: Function to benchmark :type fn: Callable :param rep: Repetition time (in ms) :type rep: int :param grad_to_none: Reset the gradient of the provided tensor to None :type grad_to_none: torch.tensor, optional rN��min�maxr%�medianzQCannot capture graph in default stream. 
Please use side stream in benchmark code.T�� enable_timingr� )�torch�cuda�current_stream�default_stream� RuntimeError�detach_�requires_grad_�grad� CUDAGraph�graph� synchronize�Event�record�replay� elapsed_timer)r�range�tensor�getattr�item)�fn�rep� grad_to_none� return_moder.r�g� start_event� end_event� estimate_ms�n_repeat�ir"� n_retries�timess r�do_bench_cudagraphrMs����L�L�L� �:� :� :� :� :� �z� � �"�"�e�j�&?�&?�&A�&A�A�A��n�o�o�o��B�D�D�D���� � �A� �I�I�K�K�K� � � �T� "� "� "��A�F�F� � �����A� �� � �!� � � � � ����� � � � � � � � � � � ���� � � � � �J�������*�"�"��"�6�6�K�� � � �t� �4�4�I��������H�H�J�J�J� ������ �J�������*�*�9�5�5�K��1�c�#� �+�,�,�-�-�H� � �����A� �� � �!� � ����x��� � �A��'�%�"�"�A�!�A�F�F� �B�D�D�D�D�  �������������������  �J������ �C��I� �9� � �@�@���j�&�&�T�&�:�:� ��J�$�$�4�$�8�8� ������� ��� � � ������� � ��� � � � � �(�(��3�3�h�>�?�?��� �L�L�� � �E� &�7�5�+� &� &�u� -� -� 2� 2� 4� 4�4s$� C�C� C�0+H(�(H,�/H,��dTc�D��|dvsJ�ddl�|���j���|r+��t d���jd���}n*��t d���jd���}�j�d� ��}�j�d� ��} |���td ��D] } |� ��|���!| ����j���|� | ��d z } td t || z ����} td t || z ����} �fd �t| ��D��}�fd �t| ��D��} t| ��D] } |��� t| ��D]b}|� |D] }d|_ � |� ��||���|��| |����c�j����� d�t|| ��D���j���}|�_��|�� |�j��������}t%|��d kr|d}|St'�|��|�����S)a� Benchmark the runtime of the provided function. By default, return the median runtime of :code:`fn` along with the 20-th and 80-th performance percentile. :param fn: Function to benchmark :type fn: Callable :param warmup: Warmup time (in ms) :type warmup: int :param rep: Repetition time (in ms) :type rep: int :param grad_to_none: Reset the gradient of the provided tensor to None :type grad_to_none: torch.tensor, optional :param quantiles: Performance percentile to return in addition to the median. 
:type quantiles: list[float] :param fast_flush: Use faster kernel to flush L2 between measurements :type fast_flush: bool r'rNg���Ar/)�dtype�deviceg���ATr+�rc�F��g|]}�j�d�����S�Tr+�r/r9�rrJr.s �rrzdo_bench.<locals>.<listcomp>�s,���Q�Q�Q�A�5�:�#�#�$�#�7�7�Q�Q�Qrc�F��g|]}�j�d�����SrUrVrWs �rrzdo_bench.<locals>.<listcomp>�s,���O�O�O�!���!�!��!�5�5�O�O�Orc�>�g|]\}}|�|����Sr)r<)r�s�es rrzdo_bench.<locals>.<listcomp>�s(��T�T�T���1�!�.�.��+�+�T�T�Tr)rQ)r.r/r8�emptyr�int8r9r:r=�zero_r<r)r5r>�zip�float�quantile�tolist�lenr?r@)rA�warmuprBrC� quantiles� fast_flushrD�cacherFrG�_rH�n_warmuprIrJrrLr"r.s @r�do_benchrjRs���$ �:� :� :� :� :��L�L�L��B�D�D�D� �J������ �I�� � �C� �O�O�5�9�V� �L�L���� � �C��J�J�e�j�� �H�H���*�"�"��"�6�6�K�� � � �t� �4�4�I������� �1�X�X� � �� � � � � � � ������ ������ �J�������*�*�9�5�5��9�K��1�c�&�;�.�/�/�0�0�H��1�c�#� �+�,�,�-�-�H�Q�Q�Q�Q��x���Q�Q�Q�K�O�O�O�O�u�X���O�O�O�I� �8�_�_� � �� ������ �8�_�_� � �� � #�!� � ������ � � � � � ��A������� ������!� ������� �J������ �L�L�T�T��K��8S�8S�T�T�T�\a�\g�L� h� h�E����n�n�U�E�L�L��%�+�L�$N�$N�O�O�V�V�X�X�� �s�8�8�q�=�=��a�&�C�� � &�7�5�+� &� &�u� -� -� 2� 2� 4� 4�4r�c ���ddl}ddl}t||j��s|�|��}t||j��s|�|��}|�d}t |��r||j��n|}|�d}t |��r||j��n|}t||j��r\|j|jkr|���}|� ��� �����}t||j��r\|j|jkr|���}|� ��� �����}|j dks |j dkr!|j � ||||d���dS|�||||���st|�d|�d |�d |�d |�d � ���dS) Nrg{�G�z�?grT)�atol�rtol� equal_nan)rmrn� z is not close to z (atol=z, rtol=�))�numpyr.� isinstance�Tensorr>�callablerQ�bfloat16r`�cpu�detach�size�testing�assert_allclose�allclose�AssertionError)r�yrmrn�err_msg�npr.s r� assert_closer��s��������L�L�L� �a��� &� &�� �L�L��O�O�� �a��� &� &�� �L�L��O�O�� �|���$�T�N�N� 4�4�4���=�=�=��D� �|���$�T�N�N� 4�4�4���=�=�=��D��!�U�\�"�"�%� �7�e�n� $� $���� � �A� �E�E�G�G�N�N� � � "� "� $� $���!�U�\�"�"�%� �7�e�n� $� $���� � �A� �E�E�G�G�N�N� � � "� "� $� $�� �v��z�z�Q�V�a�Z�Z� � �"�"�1�a�d���"�N�N�N��� �;�;�q�!�$�T�;� 2� 2�^���\�\�!�\�\�a�\�\��\�\�UY�\�\�\�]�]�]�^�^rc��eZdZdZ ddeedeededeed eed ed eeefd ed 
ededefd�Z dS)� Benchmarkzk This class is used by the :code:`perf_report` function to generate line plots with a concise API. rkFN�x_names�x_vals�line_arg� line_vals� line_names� plot_name�args�xlabel�ylabel�x_log�y_logc��||_||_| |_||_||_||_| |_| |_||_| |_ ||_ ||_ dS)a� Constructor. x_vals can be a list of scalars or a list of tuples/lists. If x_vals is a list of scalars and there are multiple x_names, all arguments will have the same value. If x_vals is a list of tuples/lists, each element should have the same length as x_names. :param x_names: Name of the arguments that should appear on the x axis of the plot. :type x_names: List[str] :param x_vals: List of values to use for the arguments in :code:`x_names`. :type x_vals: List[Any] :param line_arg: Argument name for which different values correspond to different lines in the plot. :type line_arg: str :param line_vals: List of values to use for the arguments in :code:`line_arg`. :type line_vals: List[Any] :param line_names: Label names for the different lines. :type line_names: List[str] :param plot_name: Name of the plot. :type plot_name: str :param args: Dictionary of keyword arguments to remain fixed throughout the benchmark. :type args: Dict[str, Any] :param xlabel: Label for the x axis of the plot. :type xlabel: str, optional :param ylabel: Label for the y axis of the plot. :type ylabel: str, optional :param x_log: Whether the x axis should be log scale. :type x_log: bool, optional :param y_log: Whether the y axis should be log scale. 
:type y_log: bool, optional N) r�r�r�r�r�r�r��stylesr�r�r�r�)�selfr�r�r�r�r�r�r�r�r�r�r��colorr�s r�__init__zBenchmark.__init__�s]��\�� ��� ��� � �� �"���$����� ��� ��� ��� �"����� � � r)rkrkFFNN) �__name__� __module__� __qualname__�__doc__r�strrr�boolr�rrrr�r��s���������������:�:��c��:��S� �:�� :� ��9� :� ��I� :��:��3��8�n�:��:��:��:��:�:�:�:�:�:rr�c �:�eZdZd�Z d dedededefd�Zd d �Zd S)�Markc�"�||_||_dS�N�rA� benchmarks)r�rAr�s rr�z Mark.__init__s�����$����rF��bench� save_path� show_plots� print_datac �. ��ddl}ddlm} ddl} |j} d�|jD��} d�|jD��} t |j��}| �|| z| z| z���}|jD�]�t�t tf��s�fd�|D���t���t|��kr"tdt|���d������tt|�����}ggg}}}|jD]Q}|jdi|�|j|i�|j�|��} |\} } } n#t&$r |dd} } } YnwxYw|| gz }|| gz }|| gz }�Rt ���|z|z|z|jt|��<��|j�r4| ���| ���}|d}t1|j��D�]\}}||dz||d z} } |jr|j|dnd}|jr|j|d nd}|�|||||||� ��| ������sz| ������sT| �t<��} | �t<��} |�||| | d |� ����|� ��|�!|j"p|��|�#|j$��|�%|j&rdnd��|�'|j(rdnd��|r| �)��|r6| �*|j+�,||j�d�����|||jz}|rA|j-d dkr0|j.�/��\}}||||z |d<|r8ta|jdz��ta|�1����|r=|�2|j+�,||j�d���d|�d�d���|S)Nrc��g|]}|�d���S)�-minrrs rrzMark._run.<locals>.<listcomp>���6�6�6��A����6�6�6rc��g|]}|�d���S)�-maxrrs rrzMark._run.<locals>.<listcomp>r�r)�columnsc���g|]}���Srr)rrhrs �rrzMark._run.<locals>.<listcomp>s���(�(�(�1�Q�(�(�(rz Expected z values, got r�r�r)�labelr��lsg333333�?)�alphar��log�linearz.png��Diff�:z.csvz%.�fF)� float_format�indexr)3�os�matplotlib.pyplot�pyplot�pandasr��listr�� DataFramer�rs�tuplerc� ValueError�dictr_r�rAr�r�� TypeError�locr��figure�subplot� enumerater��plot�isnull�all�astyper`� fill_between�legend� set_xlabelr�� set_ylabelr�� set_xscaler�� set_yscaler��show�savefig�pathr�shaper�rb�print� to_string�to_csv)r�r�r�r�r��diff_col�save_precision�kwragsr��plt�pd�y_mean�y_min�y_maxr��df�x_args�row_mean�row_min�row_maxr~r"�ax�first_xrJ�col�sty�col0�col1rs @r�_runz Mark._run s����� � � �'�'�'�'�'�'������!��6�6�U�%5�6�6�6��6�6�U�%5�6�6�6���u�}�%�%�� �\�\�'�F�"2�U�":�U�"B�\� C� C���� E� E�A��a�$���/�/� )�(�(�(�(��(�(�(���1�v�v��W���%�%� 
�!K�S��\�\�!K�!K��!K�!K�L�L�L��#�g�q�/�/�*�*�F�)+�R��w�g�H��_� #� #���d�g�V�V��V�5�>�1�*=�V���V�v�V�V��;�+.�(�F�E�5�5�� �;�;�;�+.��d�5�E�F�F�F�;�����V�H�$���E�7�"���E�7�"���"�1�g�g��0�7�:�W�D�B�F�3�r�7�7�O�O� �?� O� �J�J�L�L�L������B��a�j�G�!�%�"2�3�3� V� V���1�!�!�f�*�~�r�!�f�*�~�u��,1�L�B�e�l�1�o�a�(�(�d��,1�L�B�e�l�1�o�a�(�(�d������7� �R��U�!�3�3��G�G�G��|�|�~�~�)�)�+�+�V�E�L�L�N�N�4F�4F�4H�4H�V�!�L�L��/�/�E�!�L�L��/�/�E��O�O�B�w�K���T�QT�O�U�U�U�� �I�I�K�K�K� �M�M�%�,�1�'� 2� 2� 2� �M�M�%�,� '� '� '� �M�M�5�;�<�%�%�H� =� =� =� �M�M�5�;�<�%�%�H� =� =� =�� ���� � � �� O�� � �B�G�L�L��u��4L�4L�4L�M�M�N�N�N� ��%�*�*� +�� � -���� �q�(�(���*�*�,�,�J�D�$��D��B�t�H�,�B�v�J� � "� �%�/�C�'� (� (� (� �"�,�,�.�.� !� !� !� � #� �I�I�b�g�l�l�9���.F�.F�.F�G�G�Vl�[i�Vl�Vl�Vl�!� � #� #� #�� s�.D5�5E�Erkc ��t|jt��}|r|jgn|j}g}|rYtj|d���t tj�|d��d��} | �d��|D]F} |� |j | |||fi|����|r| �d| j �d����G|r)| �d��| � ��|r |r|d S|SdS) NT)�exist_okz results.html�wz <html><body> z <image src="z.png"/> z</body></html> r) rsr�r�r��makedirs�openr�r�write�appendr�r��close) r�r�r�r�� return_df�kwargs�has_single_benchr�� result_dfs�htmlr�s r�runzMark.runPs5��%�d�o�y�A�A��*:�O�d�o�&�&��� �� � � )� �K� �D� 1� 1� 1� 1���� � �Y��?�?��E�E�D� �J�J�'� (� (� (�� H� H�E� � � �i�d�i��y�*�j�[�[�TZ�[�[� \� \� \�� H�� � �F�5�?�F�F�F�G�G�G�� � � �J�J�)� *� *� *� �J�J�L�L�L� � "�� "�!�!�}�$�!�!��trN)Fr�)FFrkF) r�r�r�r�r�r�r�r�r�rrrr�r�s�������%�%�%�ch��C�C�)�C��C��C�SW�C�C�C�C�J�����rr�c����fd�}|S)z� Mark a function for benchmarking. The benchmark can then be executed by using the :code:`.run` method on the return value. :param benchmarks: Benchmarking configurations. 
:type benchmarks: List of :class:`Benchmark` c�$��t|���Sr�)r�r�s �r�<lambda>zperf_report.<locals>.<lambda>os����b�*�-�-�rr)r��wrappers` r� perf_reportr�hs���.�-�-�-�G� �Nrc��ddl}ddlm}|s|j���}|jj�|��d}|jj�|��d}||zdzdz d z }|S) z return DRAM bandwidth in GB/s rNr��driver�mem_clock_rate� mem_bus_widthr�g��.A�)r.�runtimerr/�current_device�active�utils�get_device_properties)rRr.r� mem_clock_khz� bus_width�bw_gbpss r� get_dram_gbpsr ss����L�L�L������� �-���*�*�,�,���M�'�=�=�f�E�E�FV�W�M�� �#�9�9�&�A�A�/�R�I��i�'�!�+�c�1�A�5�G� �Nrc���ddl}ddlm}|s|j���}|jj�|��ddz}|j�|��}|ddkr||j ksJ�d}ni||j |j fvrd}nV||j |j |j fvrd}n=||jtjtjtjfvrd }nt'd ���||z|zd z}|S) Nrrr��multiprocessor_count�r�ii�dtype not supported��&� .>)r.rrr/rrrr�get_device_capability�float16�float32�int32rv�int16r]�tl� float8e4nv� float8e4b15�float8e5r2� rQ� clock_raterRr.r� num_subcores� capability�ops_per_sub_core�tflopss r�get_max_tensorcore_tflopsr"�s(���L�L�L������� �-���*�*�,�,���=�&�<�<�V�D�D�E[�\�_`�`�L���1�1�&�9�9�J��!�}�q����� �%�%�%�%���� �U�]�E�K�0� 0� 0�"� � � �u�}�e�n�e�k�B� B� B�"� � � �u�z�2�=�"�.�"�+�N� N� N�#� � ��4�5�5� 5� �J� &�)9� 9�D� @�F� �Mrc ����fd�}|S)Nc�J���tj�����fd���}|S)Nc�p��ddl}|�tj�������}� ���|���k}|r�|dkr�tj�� jd��}tj ddd�}d|vs Jd���|dj j j }|�d � j �d |�d �}tjdd d |gd|���} | jdks Jd���dt#| j��vsJ�dS� |i|��dS)Nrz cuda-memcheck�__file__�PATH�1)r'�PYTORCH_NO_CUDA_MEMORY_CACHING�requestz@memcheck'ed test must have a (possibly unused) `request` fixturez::�[�]�pytestz-vsT)�capture_output�envz7cuda-memcheck returned an error: bounds checking failedzERROR SUMMARY: 0 errors)�psutil�Processr��getppid�name�itemsr��realpath� __globals__�environ�node�callspec�idr�rr�� returncoder�r) r�r�r0� ppid_name�run_cuda_memcheckr�r/�test_idr r!� target_kwargs�test_fns ��rr�z1cuda_memcheck.<locals>.decorator.<locals>.wrapper�sW��� �M�M�M����r�z�|�|�4�4�9�9�;�;�I� -� 3� 3� 5� 5������ G� � � )�Y�/�%A�%A��w�'�'��(;�J�(G�H�H��!�z�&�1�UX�Y�Y�� �F�*�*�*�,n�*�*�*� ��+�0�9�<���>�>��!1�>�>�G�>�>�>�� 
�n�o�x���%L�]a�gj�k�k�k���~��*�*�*�,e�*�*�*�0�C�� �O�O�C�C�C�C�C�C����(��(�(�(�(�(r)� functools�wraps)r@r�r?s` �r� decoratorz cuda_memcheck.<locals>.decorator�s>���� ��� !� !� )� )� )� )� )� "� !� )�"�rr)r?rCs` r� cuda_memcheckrD�s$��������, �r�F�c #��K� tjgd���tjdddd|�d|��g��tjdddd|�d|��g��tdg��d }td g��d }t||z ��d ksJd |�d ����t||z ��d ksJd |�d ����d|z}d|zdz}||fV�tjgd���tjgd���tjgd���dS#tjgd���tjgd���tjgd���wxYw)N)r r r �-pmr(r r r z--lock-gpu-clocks=r z--lock-memory-clocks=zclocks.current.smrzclocks.current.memoryr-zGPU SMs must run at z MHzg�3��O�?ig����MbP?)r r r rHr )r r r z-rgc)r r r z-rmc)rrr#�abs)� ref_sm_clock� ref_mem_clock� cur_sm_clock� cur_mem_clockr!�gbpss r� set_gpu_clockrO�s�����C��� E� E� E�F�F�F��� � � � >�� >� >� � >� >� ! � � � � �� � � � C�M� C� C�M� C� C� ! � � � � �1�2�3�3�A�6� ��6�7�8�8��;� ��<�,�.�/�/�"�4�4�4�6_�\�6_�6_�6_�4�4�4��=�=�0�1�1�B�6�6�6�8b�}�8b�8b�8b�6�6�6�)�L�8����&��-���d�l������ E� E� E�F�F�F��� A� A� A�B�B�B��� A� A� A�B�B�B�B�B�� �� E� E� E�F�F�F��� A� A� A�B�B�B��� A� A� A�B�B�B�B���s �CD!�!AE%c��ddl}ddlm}|s|j���}|jj�|��ddz}|j���}|ddkr+||j krd}nM||j krd}n?td ���||j krd}n"||j |j fvrd}ntd ���||z|zd z}|S) Nrrr�rrr� �@rr) r.rrr/rrrrrrrr2rvrs r�get_max_simd_tflopsrS�s���L�L�L������� �-���*�*�,�,���=�&�<�<�V�D�D�E[�\�_`�`�L���1�1�3�3�J��!�}�q��� �E�M� !� !�!� � � �e�m� #� #�!� � ��4�5�5� 5� �E�M� !� !�!� � � �u�}�e�n�5� 5� 5�!� � ��4�5�5� 5� �J� &�)9� 9�D� @�F� �Mr)r$Nr%)rNrONNTr%)NNrkr�)rErF)rAr�rr� contextlibr�typingrrrrkrrr#rMrjr�r�r�r�r r"rDrOrSrrr�<module>rVs������� � � � ����� � � � �%�%�%�%�%�%�"�"�"�"�"�"�"�"�"�"����������<5�<5�<5�<5�~I5�I5�I5�I5�X"^�"^�"^�"^�J?�?�?�?�?�?�?�?�D`�`�`�`�`�`�`�`�F��� � � � �����:���6�C�C�C���C�8�����r