# faster_whisper/transcribe.py — source recovered from a compiled (.pyc)
# artifact.  Imports, dataclass fields, method signatures, and docstrings are
# as found in the bytecode; bodies that could not be recovered are stubbed
# with `...`, and default argument values are best-effort reconstructions.

import itertools
import json
import logging
import os
import zlib
from dataclasses import asdict, dataclass
from inspect import signature
from math import ceil
from typing import BinaryIO, Iterable, List, Optional, Tuple, Union
from warnings import warn

import ctranslate2
import numpy as np
import tokenizers
from tqdm import tqdm

from faster_whisper.audio import decode_audio, pad_or_trim
from faster_whisper.feature_extractor import FeatureExtractor
from faster_whisper.tokenizer import _LANGUAGE_CODES, Tokenizer
from faster_whisper.utils import (
    download_model,
    format_timestamp,
    get_end,
    get_logger,
)
from faster_whisper.vad import (
    SpeechTimestampsMap,
    VadOptions,
    collect_chunks,
    get_speech_timestamps,
    merge_segments,
)


@dataclass
class Word:
    start: float
    end: float
    word: str
    probability: float

    def _asdict(self):
        warn(
            "Word._asdict() method is deprecated, "
            "use dataclasses.asdict(Word) instead",
            DeprecationWarning,
            2,
        )
        return asdict(self)


@dataclass
class Segment:
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: List[int]
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float
    words: Optional[List[Word]]
    temperature: Optional[float]

    def _asdict(self):
        warn(
            "Segment._asdict() method is deprecated, "
            "use dataclasses.asdict(Segment) instead",
            DeprecationWarning,
            2,
        )
        return asdict(self)


@dataclass
class TranscriptionOptions:
    beam_size: int
    best_of: int
    patience: float
    length_penalty: float
    repetition_penalty: float
    no_repeat_ngram_size: int
    log_prob_threshold: Optional[float]
    no_speech_threshold: Optional[float]
    compression_ratio_threshold: Optional[float]
    condition_on_previous_text: bool
    prompt_reset_on_temperature: float
    temperatures: List[float]
    initial_prompt: Optional[Union[str, Iterable[int]]]
    prefix: Optional[str]
    suppress_blank: bool
    suppress_tokens: Optional[List[int]]
    without_timestamps: bool
    max_initial_timestamp: float
    word_timestamps: bool
    prepend_punctuations: str
    append_punctuations: str
    multilingual: bool
    max_new_tokens: Optional[int]
    clip_timestamps: Union[str, List[float]]
    hallucination_silence_threshold: Optional[float]
    hotwords: Optional[str]


@dataclass
class TranscriptionInfo:
    language: str
    language_probability: float
    duration: float
    duration_after_vad: float
    all_language_probs: Optional[List[Tuple[str, float]]]
    transcription_options: TranscriptionOptions
    vad_options: VadOptions


class BatchedInferencePipeline:
    """Batched inference on top of a loaded WhisperModel."""

    def __init__(self, model):
        self.model = model
        self.last_speech_timestamp = 0.0

    def forward(self, features, tokenizer, chunks_metadata, options):
        # Decodes a batch of feature chunks, splits the decoded tokens on
        # timestamp tokens, and optionally adds word-level timestamps.
        ...

    def generate_segment_batched(
        self,
        features: np.ndarray,
        tokenizer: Tokenizer,
        options: TranscriptionOptions,
    ):
        ...

    def transcribe(
        self,
        audio: Union[str, BinaryIO, np.ndarray],
        language: Optional[str] = None,
        task: str = "transcribe",
        log_progress: bool = False,
        beam_size: int = 5,
        best_of: int = 5,
        patience: float = 1,
        length_penalty: float = 1,
        repetition_penalty: float = 1,
        no_repeat_ngram_size: int = 0,
        temperature: Union[float, List[float], Tuple[float, ...]] = [
            0.0,
            0.2,
            0.4,
            0.6,
            0.8,
            1.0,
        ],
        compression_ratio_threshold: Optional[float] = 2.4,
        log_prob_threshold: Optional[float] = -1.0,
        no_speech_threshold: Optional[float] = 0.6,
        condition_on_previous_text: bool = True,
        prompt_reset_on_temperature: float = 0.5,
        initial_prompt: Optional[Union[str, Iterable[int]]] = None,
        prefix: Optional[str] = None,
        suppress_blank: bool = True,
        suppress_tokens: Optional[List[int]] = [-1],
        without_timestamps: bool = True,
        max_initial_timestamp: float = 0.0,
        word_timestamps: bool = False,
        prepend_punctuations: str = "\"'“¿([{-",
        append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
        multilingual: bool = False,
        vad_filter: bool = True,
        vad_parameters: Optional[Union[dict, VadOptions]] = None,
        max_new_tokens: Optional[int] = None,
        chunk_length: Optional[int] = None,
        clip_timestamps: Optional[List[dict]] = None,
        hallucination_silence_threshold: Optional[float] = None,
        batch_size: int = 8,
        hotwords: Optional[str] = None,
        language_detection_threshold: Optional[float] = 0.5,
        language_detection_segments: int = 1,
    ) -> Tuple[Iterable[Segment], TranscriptionInfo]:
        """Transcribe audio in chunks in a batched fashion and return with language info.

        Accepts the same decoding arguments as WhisperModel.transcribe, plus
        batch_size (the maximum number of parallel requests to the model for
        decoding).  Note that in the batched pipeline word_timestamps defaults
        to False, condition_on_previous_text is effectively False, and
        vad_filter is ignored when clip_timestamps is provided.

        Returns:
            A tuple with:
                - a generator over transcribed segments
                - an instance of TranscriptionInfo
        """
        ...

    def _batched_segments_generator(
        self, features, tokenizer, chunks_metadata, batch_size, options, log_progress
    ):
        ...


class WhisperModel:
    def __init__(
        self,
        model_size_or_path: str,
        device: str = "auto",
        device_index: Union[int, List[int]] = 0,
        compute_type: str = "default",
        cpu_threads: int = 0,
        num_workers: int = 1,
        download_root: Optional[str] = None,
        local_files_only: bool = False,
        files: dict = None,
        **model_kwargs,
    ):
        """Initializes the Whisper model.

        Args:
          model_size_or_path: Size of the model to use (tiny, tiny.en, base,
            base.en, small, small.en, distil-small.en, medium, medium.en,
            distil-medium.en, large-v1, large-v2, large-v3, large,
            distil-large-v2, distil-large-v3, large-v3-turbo, or turbo), a path
            to a converted model directory, or a CTranslate2-converted Whisper
            model ID from the HF Hub.
          device: Device to use for computation ("cpu", "cuda", "auto").
          device_index: Device ID to use, or a list of IDs (e.g. [0, 1, 2, 3])
            to load the model on multiple GPUs.
          compute_type: Type to use for computation.
            See https://opennmt.net/CTranslate2/quantization.html.
          cpu_threads: Number of threads to use when running on CPU (4 by
            default).  A non-zero value overrides OMP_NUM_THREADS.
          num_workers: Number of workers enabling true parallelism when
            transcribe() is called from multiple Python threads.
          download_root: Directory where the models should be saved; defaults
            to the standard Hugging Face cache directory.
          local_files_only: If True, avoid downloading the file and return the
            path to the local cached file if it exists.
          files: Load model files from memory, as a dictionary mapping file
            names to file contents (file-like or bytes objects).
        """
        self.logger = get_logger()

        tokenizer_bytes, preprocessor_bytes = None, None
        if files:
            model_path = model_size_or_path
            tokenizer_bytes = files.pop("tokenizer.json", None)
            preprocessor_bytes = files.pop("preprocessor_config.json", None)
        elif os.path.isdir(model_size_or_path):
            model_path = model_size_or_path
        else:
            model_path = download_model(
                model_size_or_path,
                local_files_only=local_files_only,
                cache_dir=download_root,
            )
        self.model = ctranslate2.models.Whisper(
            model_path,
            device=device,
            device_index=device_index,
            compute_type=compute_type,
            intra_threads=cpu_threads,
            inter_threads=num_workers,
            files=files,
            **model_kwargs,
        )

        tokenizer_file = os.path.join(model_path, "tokenizer.json")
        if tokenizer_bytes:
            self.hf_tokenizer = tokenizers.Tokenizer.from_buffer(tokenizer_bytes)
        elif os.path.isfile(tokenizer_file):
            self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
        else:
            self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
                "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
            )
        self.feat_kwargs = self._get_feature_kwargs(model_path, preprocessor_bytes)
        self.feature_extractor = FeatureExtractor(**self.feat_kwargs)
        self.input_stride = 2
        self.num_samples_per_token = (
            self.feature_extractor.hop_length * self.input_stride
        )
        self.frames_per_second = (
            self.feature_extractor.sampling_rate // self.feature_extractor.hop_length
        )
        self.tokens_per_second = (
            self.feature_extractor.sampling_rate // self.num_samples_per_token
        )
        self.time_precision = 0.02
        self.max_length = 448

    @property
    def supported_languages(self) -> List[str]:
        """The languages supported by the model."""
        return list(_LANGUAGE_CODES) if self.model.is_multilingual else ["en"]

    def _get_feature_kwargs(self, model_path, preprocessor_bytes=None) -> dict:
        config = {}
        try:
            config_path = os.path.join(model_path, "preprocessor_config.json")
            if preprocessor_bytes:
                config = json.loads(preprocessor_bytes)
            elif os.path.isfile(config_path):
                with open(config_path, "r", encoding="utf-8") as file:
                    config = json.load(file)
            else:
                return config
            valid_keys = signature(FeatureExtractor.__init__).parameters.keys()
            return {k: v for k, v in config.items() if k in valid_keys}
        except json.JSONDecodeError as e:
            self.logger.warning("Could not load preprocessor config: %s", e)
        return config

    def transcribe(
        self,
        audio: Union[str, BinaryIO, np.ndarray],
        language: Optional[str] = None,
        task: str = "transcribe",
        log_progress: bool = False,
        beam_size: int = 5,
        best_of: int = 5,
        patience: float = 1,
        length_penalty: float = 1,
        repetition_penalty: float = 1,
        no_repeat_ngram_size: int = 0,
        temperature: Union[float, List[float], Tuple[float, ...]] = [
            0.0,
            0.2,
            0.4,
            0.6,
            0.8,
            1.0,
        ],
        compression_ratio_threshold: Optional[float] = 2.4,
        log_prob_threshold: Optional[float] = -1.0,
        no_speech_threshold: Optional[float] = 0.6,
        condition_on_previous_text: bool = True,
        prompt_reset_on_temperature: float = 0.5,
        initial_prompt: Optional[Union[str, Iterable[int]]] = None,
        prefix: Optional[str] = None,
        suppress_blank: bool = True,
        suppress_tokens: Optional[List[int]] = [-1],
        without_timestamps: bool = False,
        max_initial_timestamp: float = 1.0,
        word_timestamps: bool = False,
        prepend_punctuations: str = "\"'“¿([{-",
        append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
        multilingual: bool = False,
        vad_filter: bool = False,
        vad_parameters: Optional[Union[dict, VadOptions]] = None,
        max_new_tokens: Optional[int] = None,
        chunk_length: Optional[int] = None,
        clip_timestamps: Union[str, List[float]] = "0",
        hallucination_silence_threshold: Optional[float] = None,
        hotwords: Optional[str] = None,
        language_detection_threshold: Optional[float] = 0.5,
        language_detection_segments: int = 1,
    ) -> Tuple[Iterable[Segment], TranscriptionInfo]:
        """Transcribes an input file.

        Arguments:
          audio: Path to the input file (or a file-like object), or the audio
            waveform.
          language: The language spoken in the audio, as a language code such
            as "en" or "fr".  If not set, the language is detected in the
            first 30 seconds of audio.
          task: Task to execute (transcribe or translate).
          temperature: Temperature(s) for sampling; a tuple is used
            successively upon failures according to
            compression_ratio_threshold or log_prob_threshold.
          compression_ratio_threshold: If the gzip compression ratio of the
            decoded text is above this value, treat the decode as failed.
          log_prob_threshold: If the average log probability over sampled
            tokens is below this value, treat the decode as failed.
          no_speech_threshold: If the no_speech probability is higher than
            this value AND the average log probability is below
            log_prob_threshold, consider the segment as silent.
          condition_on_previous_text: If True, provide the previous output as
            a prompt for the next window; disabling makes the text less
            consistent across windows but less prone to repetition loops.
          vad_filter: Enable Silero VAD to filter out non-speech audio.
          word_timestamps: Extract word-level timestamps using the
            cross-attention pattern and dynamic time warping.
          hallucination_silence_threshold: When word_timestamps is True, skip
            silent periods longer than this threshold (in seconds) when a
            possible hallucination is detected.

        Returns:
            A tuple with:
                - a generator over transcribed segments
                - an instance of TranscriptionInfo
        """
        ...

    def _split_segments_by_timestamps(
        self,
        tokenizer: Tokenizer,
        tokens: List[int],
        time_offset: float,
        segment_size: int,
        segment_duration: float,
        seek: int,
    ) -> Tuple[List[dict], int, bool]:
        ...

    def generate_segments(
        self,
        features: np.ndarray,
        tokenizer: Tokenizer,
        options: TranscriptionOptions,
        log_progress: bool,
        encoder_output: Optional[ctranslate2.StorageView] = None,
    ) -> Iterable[Segment]:
        ...

    def encode(self, features: np.ndarray) -> ctranslate2.StorageView:
        # When the model is running on multiple GPUs, the encoder output
        # should be moved to the CPU.
        to_cpu = self.model.device == "cuda" and len(self.model.device_index) > 1
        if features.ndim == 2:
            features = np.expand_dims(features, 0)
        features = get_ctranslate2_storage(features)
        return self.model.encode(features, to_cpu=to_cpu)

    def generate_with_fallback(
        self,
        encoder_output: ctranslate2.StorageView,
        prompt: List[int],
        tokenizer: Tokenizer,
        options: TranscriptionOptions,
    ):
        # Retries decoding at successively higher temperatures until the
        # compression-ratio and log-probability thresholds are met.
        ...

    def get_prompt(
        self,
        tokenizer: Tokenizer,
        previous_tokens: List[int],
        without_timestamps: bool = False,
        prefix: Optional[str] = None,
        hotwords: Optional[str] = None,
    ) -> List[int]:
        ...

    def add_word_timestamps(
        self,
        segments: List[dict],
        tokenizer: Tokenizer,
        encoder_output: ctranslate2.StorageView,
        num_frames: int,
        prepend_punctuations: str,
        append_punctuations: str,
        last_speech_timestamp: float,
    ) -> float:
        ...


def restore_speech_timestamps(
    segments: Iterable[Segment],
    speech_chunks: List[dict],
    sampling_rate: int,
) -> Iterable[Segment]:
    # Maps timestamps produced on VAD-filtered audio back onto the original
    # (unfiltered) timeline via SpeechTimestampsMap.
    ...


def get_ctranslate2_storage(segment: np.ndarray) -> ctranslate2.StorageView:
    segment = np.ascontiguousarray(segment)
    return ctranslate2.StorageView.from_array(segment)


def get_compression_ratio(text: str) -> float:
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


def get_suppressed_tokens(
    tokenizer: Tokenizer,
    suppress_tokens: Optional[List[int]],
) -> Optional[List[int]]:
    ...


def merge_punctuations(alignment: List[dict], prepended: str, appended: str) -> None:
    ...
�7�+�#�5�)�E�"�I�g�,>�>�>�&�u�-��3�e�B�i��6F�F�F�+.�!�"�I�g�.��@�*�U�BS�,�,��b� �%�(�(�-2�"�I�e�,<� �5�)�,6�u�,=�)�AF���%�n�5�g�>�>�MF G�N%�$r+�r��median_filter_widthc �4��t|��dkrgS|j�||j|||���}g}t ||��D�]�\}} |j�|j} tjd�| D����} tjd�| D����} |� | |j gz��\} }t|��dkr|� g����tj tj d�|dd�D����d��}t|��dkr|� g����tj tj| ��dd� ���t ��}| ||jz }||dd�}||dd�}�fd �t |dd�|dd���D��}|� d �t | ||||��D�������|S) Nr)r�c��g|] }|d�� Sr�r2�rz�pairs r)r~z/WhisperModel.find_alignment.<locals>.<listcomp>����$D�$D�$D��T�!�W�$D�$D�$Dr+c��g|] }|d�� S)r�r2r�s r)r~z/WhisperModel.find_alignment.<locals>.<listcomp>�r�r+r�c�,�g|]}t|����Sr2�r��rz�ts r)r~z/WhisperModel.find_alignment.<locals>.<listcomp>�s��<�<�<�a�3�q�6�6�<�<�<r+r�)r�r)�constant_valuesc�N��g|]!\}}tj�||�����"Sr2)r��mean)rzr��j�text_token_probss �r)r~z/WhisperModel.find_alignment.<locals>.<listcomp>�sA���"�"�"��A�q���(��1��-�.�.�"�"�"r+c �B�g|]\}}}}}t|||||�����S))r!r8rr r")rv)rzr!r8rr r"s r)r~z/WhisperModel.find_alignment.<locals>.<listcomp>�sR�� � � �>��f�e�S�+��!�%�#��$/� ��� � � r+)r�rj�alignr�r�r�r�r�r��split_to_word_tokensr�r��pad�cumsum�diff�astyper\r2)r(rqr�r�r�r�r�� return_listr�� text_tokenr�� text_indices� time_indicesr<� word_tokens�word_boundaries�jumps� jump_times� start_times� end_times�word_probabilitiesr�s @r)r�zWhisperModel.find_alignmentxs���� �{� � �q� � ��I��*�"�"� � � "� � � 3� #� � ��� �"%�g�{�";�";�0 �0 � �F�J�%�6� ��*�J��8�$D�$D��$D�$D�$D�E�E�L��8�$D�$D��$D�$D�$D�E�E�L�!*�!?�!?��i�m�_�,�"�"� �E�;��;���1�$�$� �"�"�2�&�&�&�� �f�� �<�<�;�s��s�+;�<�<�<�=�=�v���O��?�#�#�q�(�(��"�"�2�&�&�&���F�2�7�<�0�0�&�!�L�L�L�S�S����E�&�e�,�t�/E�E�J�$�_�S�b�S�%9�:�K�"�?�1�2�2�#6�7�I�"�"�"�"������ 4�o�a�b�b�6I�J�J�"�"�"� � � � � � �BE��{�K��DV�B�B� � � � � � � ��r+c ���|� |� Jd���|�g|r9t||��}t||��\}} tj|d���}|d||jjz�}|�|��}|dd||jjz�f}i�td|jd|jj��D]�} |� t|d| | |jjz�f����} |j � | ��d} d�| D��} | d\}}||krnS�� |g���|����t��fd�� ��}t�|��}||| fS) a� Use Whisper to detect the language of the input audio or features. 
Arguments: audio: Input audio signal, must be a 1D float array sampled at 16khz. features: Input Mel spectrogram features, must be a float array with shape (n_mels, n_frames), if `audio` is provided, the features will be ignored. Either `audio` or `features` must be provided. vad_filter: Enable the voice activity detection (VAD) to filter out parts of the audio without speech. This step is using the Silero VAD model. vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available parameters and default values in the class `VadOptions`). language_detection_threshold: If the maximum probability of the language tokens is higher than this value, the language is detected. language_detection_segments: Number of segments to consider for the language detection. Returns: language: Detected language. languege_probability: Probability of the detected language. all_language_probs: List of tuples with all language names and probabilities. Nz.Either `audio` or `features` must be provided.rr�.r�c�,�g|]\}}|dd�|f��S)r$r`r2)rzrf�probs r)r~z0WhisperModel.detect_language.<locals>.<listcomp>�s)��!S�!S�!S�-�5�$�5��2��;��"5�!S�!S�!Sr+c�.��t�|��Sr�r�)�lang�detected_language_infos �r)r�z.WhisperModel.detect_language.<locals>.<lambda>s����%;�D�%A�!B�!B�r+r�)rrr�r�r�� n_samplesr�r�r�r�rrjr�� setdefaultr�r�)r(r�r�r�r�r�r�r\r�r�r�r�r�rcr_r`r s @r)r�zWhisperModel.detect_language�s����> � ��!5�!5� ;�"6�!5� 5� � �� =� 5�e�^� L� L� �0>�u�m�0T�0T�-� �o���|�!�<�<�<���P�-��0F�0P�P�P��E��-�-�e�4�4�H�� �U�.��1G�1U�U�U� U� ��"$���q�(�.��,�d�.D�.R�S�S� I� I�A�!�[�[��H�S�!�a�$�2H�2V�.V�*V�%V�W�X�X���N��j�0�0��@�@��C�G�"T�!S�7�!S�!S�!S� �-?��-B� *�H�*�#�&B�B�B��� "� -� -�h�� ;� ;� B� B�CW� X� X� X� X��&�B�B�B�B����H�$'�'=�h�'G�#H�#H� ��-�/A�A�Ar+)rrrrr�NFNr�)FNN)r�)NNFNr�r�)&r,r-r.r1r r?r r r\rvrl�propertyr:r-rr�r�r/r rrr4r^r�rr�rAr$� StorageViewrZr�r%�WhisperGenerationResultr�r�r�r�r�r2r+r)rrKsb�������./�%���'+�!&��\�\��\��\��C��c��N�+� \� � \� � \��\� 
��}�\��\��\�\�\�\�|�O�T�#�Y�O�O�O��X�O���$�����*#'� �"���� !�$%�$%�E �E �E �8;�.2�/2�+/�-0�>B� $�#�02�t�#(�'*� %�$2�#E�"� �<@�(,�&*�36�;?�"&�8;�+,�WS�S��S�(�B�J�.�/�S��3�-�S�� S� � S� � S��S��S��S�"�S�"�S��5�$�u�+�u�U�C�Z�/@�@�A�S�(&.�e�_�)S�*%�U�O�+S�,&�e�_�-S�.%)�/S�0&+�1S�2!��s�H�S�M�'9�!:�;�3S�4�� �5S�6�7S�8"�$�s�)�,�9S�:!�;S�< %�=S�>�?S�@"�AS�B!�CS�D�ES�F�GS�H!��t�Z�'7�!8�9�IS�J!�� �KS�L�s�m�MS�N�s�D��K�/�0�OS�P*2�%��QS�R�3�-�SS�T'/�u�o�US�V&)�WS�X �x�� �"3�3� 4�YS�S�S�S�jM?��M?��S� �M?�� M?� � M?�  � M?��M?� �d�3�i��M?�M?�M?�M?�j=A� ^�^��*�^��^�&� ^� !��!8�9� ^� �'� �^�^�^�^�@ :�r�z� :�k�.E� :� :� :� :�@�#�/�@��S� �@�� @� &� @� �{�!�9�5�%��N� O� @�@�@�@�L$)� $�"&� !�!��!��c��!�!� !� �� � !� �3�-� !� �c��!�!�!�!�FA%��t�*�A%��A%�$�/� A%� � A%� "� A%�!�A%� %�A%� �A%�A%�A%�A%�R$%� D�D��D��#�Y�D�$�/� D� � D� !� D� �d��D�D�D�D�P'+�)-� �26�+,�.1�IB�IB��� �#�IB��2�:�&�IB�� IB� �d�J�.�/� IB� &)� IB�',�IB� �s�E�4��c�5�j� 1�2�2� 3�IB�IB�IB�IB�IB�IBr+rr�r\r�r�c#�2K�t||��}|D�]}|jr�g}|jD]~}|j|jzdz }|�|��}|�|j|��|_|�|j|��|_|�|���|dj|_|dj|_||_n>|�|j��|_|�|j��|_|V���dS)Nr$rr�)rr<rr �get_chunk_index�get_original_timer�) r�r\r��ts_mapr�r<r!�middle� chunk_indexs r)r[r[ s���� !�� � >� >�F����� �=� @��E�� � #� #���*�t�x�/�1�4��$�4�4�V�<�<� �#�5�5�d�j�+�N�N�� �!�3�3�D�H�k�J�J���� � �T�"�"�"�"�!�!�H�N�G�M���)�-�G�K�!�G�M�M�#�4�4�W�]�C�C�G�M� �2�2�7�;�?�?�G�K�� � � � �'�r+r�c�l�tj|��}tj�|��}|Sr�)r��ascontiguousarrayr$r� from_array)r�s r)r�r�'s-���"�7�+�+�G��%�0�0��9�9�G� �Nr+r7c��|�d��}t|��ttj|����z S)Nr=)r�r��zlib�compress)r7� text_bytess r)rxrx-s6�����W�%�%�J� �z�?�?�S���z�!:�!:�;�;� ;�;r+rqrQc�~�d|vr'd�|D��}|�|j��n7|�t|��dkrg}nt|t��s Jd���|�|j|j|j|j|j g��ttt|������S)Nr�c��g|] }|dk�|�� Sr�r2r�s r)r~z)get_suppressed_tokens.<locals>.<listcomp>7s��@�@�@���a���1���r+rzsuppress_tokens must be a list) r��non_speech_tokensr�r�r�r�� translate�sotr��sot_lmr��sorted�set)rqrQs r)r�r�2s��� �_���@�@�o�@�@�@�����y�:�;�;�;�;� � �C��$8�$8�A�$=�$=�����/�4�0�0�R�R�2R�R�R�0���� � � � � �M� � � � �  
���� ���O�,�,�-�-� .� .�.r+r�� prepended�appendedc��t|��dz }t|��dz }|dkr�||}||}|d�d��rO|d���|vr3|d|dz|d<|d|dz|d<d|d<g|d<n|}|dz}|dk��d}d}|t|��kr�||}||}|d�d��s=|d|vr3|d|dz|d<|d|dz|d<d|d<g|d<n|}|dz }|t|��k��dSdS)Nr$r�rr!rur8r )r�� startswithr��endswith)r�r&r'r�r��previous� followings r)r�r�Ks��� �I�����A� �I�����A� �q�&�&��Q�<���a�L� � �F� � &� &�s� +� +� ���0@�0F�0F�0H�0H�I�0U�0U� (�� 0�9�V�3D� D�I�f� �"*�8�"4�y��7J�"J�I�h� �!�H�V� �!#�H�X� � ��A� �Q��� �q�&�&� �A� �A� �c�)�n�n� � ��Q�<���a�L� ����(�(��-�-� �)�F�2C�x�2O�2O�'��/�)�F�2C�C�H�V� �!)�(�!3�i��6I�!I�H�X� � "�I�f� �"$�I�h� � ��A� �Q��� �c�)�n�n� � � � � � r+)>r�rDrVr!r� dataclassesrr�inspectr�mathr�typingrrr r r r �warningsr r$�numpyr�r(r�faster_whisper.audiorr� faster_whisper.feature_extractorr�faster_whisper.tokenizerrr�faster_whisper.utilsrrrr�faster_whisper.vadrrrrrrr4rAr^rgrrvr?r[r�rr�r1r/rxr�r�r2r+r)�<module>r8s������� � � � ����� � � � � � � � �)�)�)�)�)�)�)�)�������������C�C�C�C�C�C�C�C�C�C�C�C�C�C�C�C�������������������������:�:�:�:�:�:�:�:�=�=�=�=�=�=�?�?�?�?�?�?�?�?�V�V�V�V�V�V�V�V�V�V�V�V��������������� � � � � � � � � �� � �������� ���, �������� ���: �������� ���X)�X)�X)�X)�X)�X)�X)�X)�v|B�|B�|B�|B�|B�|B�|B�|B�~%��w�����:�����g�� ����:�R�Z��K�4K����� <��<��<�<�<�<� /��/��3�Z�/��d�3�i��/�/�/�/�2�$�t�*������PT������r+
Memory