import string

from functools import cached_property
from typing import List, Optional, Tuple

import tokenizers


class Tokenizer:
    """Simple wrapper around a tokenizers.Tokenizer."""

    def __init__(
        self,
        tokenizer: tokenizers.Tokenizer,
        multilingual: bool,
        task: Optional[str] = None,
        language: Optional[str] = None,
    ):
        self.tokenizer = tokenizer

        if multilingual:
            if task not in _TASKS:
                raise ValueError(
                    "'%s' is not a valid task (accepted tasks: %s)"
                    % (task, ", ".join(_TASKS))
                )

            if language not in _LANGUAGE_CODES:
                raise ValueError(
                    "'%s' is not a valid language code (accepted language codes: %s)"
                    % (language, ", ".join(_LANGUAGE_CODES))
                )

            self.task = self.tokenizer.token_to_id("<|%s|>" % task)
            self.language = self.tokenizer.token_to_id("<|%s|>" % language)
            self.language_code = language
        else:
            self.task = None
            self.language = None
            self.language_code = "en"

    @cached_property
    def transcribe(self) -> int:
        return self.tokenizer.token_to_id("<|transcribe|>")

    @cached_property
    def translate(self) -> int:
        return self.tokenizer.token_to_id("<|translate|>")

    @cached_property
    def sot(self) -> int:
        return self.tokenizer.token_to_id("<|startoftranscript|>")

    @cached_property
    def sot_lm(self) -> int:
        return self.tokenizer.token_to_id("<|startoflm|>")

    @cached_property
    def sot_prev(self) -> int:
        return self.tokenizer.token_to_id("<|startofprev|>")

    @cached_property
    def eot(self) -> int:
        return self.tokenizer.token_to_id("<|endoftext|>")

    @cached_property
    def no_timestamps(self) -> int:
        return self.tokenizer.token_to_id("<|notimestamps|>")

    @property
    def timestamp_begin(self) -> int:
        return self.no_timestamps + 1

    @property
    def sot_sequence(self) -> List[int]:
        sequence = [self.sot]

        if self.language is not None:
            sequence.append(self.language)

        if self.task is not None:
            sequence.append(self.task)

        return sequence

    def encode(self, text: str) -> List[int]:
        return self.tokenizer.encode(text, add_special_tokens=False).ids

    def decode(self, tokens: List[int]) -> str:
        text_tokens = [token for token in tokens if token < self.eot]
        return self.tokenizer.decode(text_tokens)

    def decode_with_timestamps(self, tokens: List[int]) -> str:
        outputs = [[]]

        for token in tokens:
            if token >= self.timestamp_begin:
                timestamp = f"<|{(token - self.timestamp_begin) * 0.02:.2f}|>"
                outputs.append(timestamp)
                outputs.append([])
            else:
                outputs[-1].append(token)

        return "".join(
            [s if isinstance(s, str) else self.tokenizer.decode(s) for s in outputs]
        )

    @cached_property
    def non_speech_tokens(self) -> Tuple[int]:
        """
        Returns the list of tokens to suppress in order to avoid any speaker tags or
        non-speech annotations, to prevent sampling texts that are not actually spoken
        in the audio, e.g.

        - ♪♪♪
        - ( SPEAKING FOREIGN LANGUAGE )
        - [DAVID] Hey there,

        keeping basic punctuations like commas, periods, question marks, exclamation points, etc.
        """
        symbols = list('"#()*+/:;<=>@[\\]^_`{|}~「」『』')
        symbols += (
            "<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split()
        )

        # symbols that may be a single token or multiple tokens depending on the tokenizer.
        # In case they're multiple tokens, suppress the first token, which is safe because:
        # these are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
        # in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
        miscellaneous = set("♩♪♫♬♭♮♯")
        assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)

        # allow hyphens "-" and single quotes "'" between words, but not at the beginning of a word
        result = {self.encode(" -")[0], self.encode(" '")[0]}

        for symbol in symbols + list(miscellaneous):
            for tokens in [
                self.encode(symbol),
                self.encode(" " + symbol),
            ]:
                if len(tokens) == 1 or symbol in miscellaneous:
                    result.add(tokens[0])

        return tuple(sorted(result))

    def split_to_word_tokens(
        self, tokens: List[int]
    ) -> Tuple[List[str], List[List[int]]]:
        if self.language_code in {"zh", "ja", "th", "lo", "my", "yue"}:
            # These languages don't typically use spaces, so it is difficult to split
            # words without morpheme analysis. Here, we instead split words at any
            # position where the tokens are decoded as valid unicode points.
            return self.split_tokens_on_unicode(tokens)

        return self.split_tokens_on_spaces(tokens)

    def split_tokens_on_unicode(
        self, tokens: List[int]
    ) -> Tuple[List[str], List[List[int]]]:
        decoded_full = self.decode_with_timestamps(tokens)
        replacement_char = "\ufffd"

        words = []
        word_tokens = []
        current_tokens = []
        unicode_offset = 0

        for token in tokens:
            current_tokens.append(token)
            decoded = self.decode_with_timestamps(current_tokens)

            try:
                replacement_char_index = decoded.index(replacement_char)
                replacement_char_index += unicode_offset
            except ValueError:
                replacement_char_index = None

            if replacement_char_index is None or (
                replacement_char_index < len(decoded_full)
                and decoded_full[replacement_char_index] == replacement_char
            ):
                words.append(decoded)
                word_tokens.append(current_tokens)
                current_tokens = []
                unicode_offset += len(decoded)

        return words, word_tokens

    def split_tokens_on_spaces(
        self, tokens: List[int]
    ) -> Tuple[List[str], List[List[int]]]:
        subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
        words = []
        word_tokens = []

        for subword, subword_tokens in zip(subwords, subword_tokens_list):
            special = subword_tokens[0] >= self.eot
            with_space = subword.startswith(" ")
            punctuation = subword.strip() in string.punctuation

            if special or with_space or punctuation or len(words) == 0:
                words.append(subword)
                word_tokens.append(subword_tokens)
            else:
                words[-1] = words[-1] + subword
                word_tokens[-1].extend(subword_tokens)

        return words, word_tokens


_TASKS = (
    "transcribe",
    "translate",
)

_LANGUAGE_CODES = (
    "af", "am", "ar", "as", "az", "ba", "be", "bg", "bn", "bo",
    "br", "bs", "ca", "cs", "cy", "da", "de", "el", "en", "es",
    "et", "eu", "fa", "fi", "fo", "fr", "gl", "gu", "ha", "haw",
    "he", "hi", "hr", "ht", "hu", "hy", "id", "is", "it", "ja",
    "jw", "ka", "kk", "km", "kn", "ko", "la", "lb", "ln", "lo",
    "lt", "lv", "mg", "mi", "mk", "ml", "mn", "mr", "ms", "mt",
    "my", "ne", "nl", "nn", "no", "oc", "pa", "pl", "ps", "pt",
    "ro", "ru", "sa", "sd", "si", "sk", "sl", "sn", "so", "sq",
    "sr", "su", "sv", "sw", "ta", "te", "tg", "th", "tk", "tl",
    "tr", "tt", "uk", "ur", "uz", "vi", "yi", "yo", "zh", "yue",
)
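# Illustrative usage sketch (not part of the original module): it assumes a
# multilingual Whisper vocabulary saved as a HuggingFace tokenizers file; the
# "tokenizer.json" path below is a placeholder, not a path shipped with this package.
if __name__ == "__main__":
    hf_tokenizer = tokenizers.Tokenizer.from_file("tokenizer.json")  # placeholder path

    tokenizer = Tokenizer(
        hf_tokenizer,
        multilingual=True,
        task="transcribe",
        language="en",
    )

    # Round-trip some text and inspect the decoder prompt sequence.
    ids = tokenizer.encode("Hello world.")
    print(ids)
    print(tokenizer.decode(ids))
    # Token ids for <|startoftranscript|>, <|en|>, <|transcribe|>.
    print(tokenizer.sot_sequence)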