� ���g ���V�ddlZddlZddlZddlmZddlmZmZm Z m Z ddl Z ddl Z ddlmZddlmZddlmZer*ddlmZ dd lmZn #e$rYnwxYw ddlZn #e$rYnwxYwej�d ��duZej�d ��duZeje��Z Gd �d e!��Z"Gd�de��Z#Gd�de��Z$Gd�de��Z%Gd�de��Z&Gd�d��Z'Gd�de'��Z(Gd�de'��Z)Gd�d��Z*dS)�N)�PurePath)� TYPE_CHECKING� NamedTuple�Optional�Union�)�Sequence)�logging)�tqdm)�Dataset�� Elasticsearch� elasticsearch�faissc��eZdZdS)� MissingIndexN)�__name__� __module__� __qualname__���_/home/asafur/pinokio/api/open-webui.git/app/env/lib/python3.11/site-packages/datasets/search.pyrr$s�������Drrc�<�eZdZUeeed<eeed<dS)� SearchResults�scores�indicesN�rrr�list�float�__annotations__�intrrrrr(s2������� ��K���� �#�Y�����rrc�T�eZdZUeeeed<eeeed<dS)�BatchedSearchResults� total_scores� total_indicesNrrrrr#r#-s;��������t�E�{�#�#�#�#���S� �?�"�"�"�"�"rr#c�0�eZdZUeeed<eed<dS)�NearestExamplesResultsr�examplesN�rrrrrr �dictrrrr'r'2s)������� ��K�����N�N�N�N�Nrr'c�H�eZdZUeeeed<eeed<dS)�BatchedNearestExamplesResultsr$�total_examplesNr)rrrr,r,7s7��������t�E�{�#�#�#�#���J�����rr,c��eZdZdZd dedefd�Zd dedefd�Zde e e ffd�Z e de e e fddfd ���Zd S) � BaseIndexzBase class for indexing� �k�returnc ��t�)z� To implement. This method has to return the scores and the indices of the retrieved examples given a certain query. ��NotImplementedError)�self�queryr1�kwargss r�searchzBaseIndex.search?s �� "�!rc ��gg}}|D]E}|�||��\}}|�|��|�|���Ft||��S)a Find the nearest examples indices to the query. Args: queries (`Union[List[str], np.ndarray]`): The queries as a list of strings if `column` is a text index or as a numpy array if `column` is a vector index. k (`int`): The number of examples to retrieve per query. Ouput: total_scores (`List[List[float]`): The retrieval scores of the retrieved examples per query. total_indices (`List[List[int]]`): The indices of the retrieved examples per query. 
)r9�appendr#) r6�queriesr1r8r$r%r7rrs r� search_batchzBaseIndex.search_batchFsn��')�"�m� �� *� *�E�"�k�k�%��3�3�O�F�G� � � �� '� '� '� � � �� )� )� )� )�#�L�-�@�@�@r�filec��t�)zSerialize the index on diskr4)r6r>s r�savezBaseIndex.saveXs��!�!rc��t�)zDeserialize the index from diskr4)�clsr>s r�loadzBaseIndex.load\s ��"�!rN�r0)rrr�__doc__r!rr9r#r=r�strrr@� classmethodrCrrrr/r/<s�������!�!�"�"�s�"�m�"�"�"�"�A�A�s�A�>R�A�A�A�A�$"��s�H�}�-�"�"�"�"��"��c�8�m�,�"��"�"�"��[�"�"�"rr/c ���eZdZdZ ddeedeededdeedeef d �Zdd e e ed fd eefd �Z ddede fd�Z ddedefd�ZdS)�ElasticSearchIndexa/ Sparse index using Elasticsearch. It is used to index text and run queries based on BM25 similarity. An Elasticsearch server needs to be accessible, and a python client is declared with ``` es_client = Elasticsearch([{'host': 'localhost', 'port': '9200'}]) ``` for example. N�host�port� es_clientr� es_index_name�es_index_configc��tstd���|�|�|�td���|pd}|pd}ddl}ddlm}|�|n||t |��d�g��|_|�|n7dtj � tj ��j ��z|_|�|nd d d d d d�iid�dddd dd�iid�|_dS)Nz}You must install ElasticSearch to use ElasticSearchIndex. To do so you can run `pip install elasticsearch==7.7.1 for example`zBPlease specify either `es_client` or `(host, port)`, but not both.� localhosti�#rr )rJrK�huggingface_datasets_r�analyzer� stop_standard�standard� _english_)�typez stopwords)�number_of_shards�analysis� properties�text�BM25)rVrR� similarity)�settings�mappings)�_has_elasticsearch� ImportError� ValueError�elasticsearch.helpersrrrFrL�os�path�basename�tempfile�NamedTemporaryFile�namerMrN)r6rJrKrLrMrNrrs r�__init__zElasticSearchIndex.__init__lsH��"� ��P��� � � �d�&6�$�:J��a�b�b� b��"�{���|�t��$�$�$�$�/�/�/�/�/�/�&/�&;�����Y]�gj�ko�gp�gp�Pq�Pq�Or�As�As����(� �M�(�2�7�+;�+;�H�<W�<Y�<Y�<^�+_�+_�_� ���*� �O�)*�!+�o� �bm�?n�?n�-o� p���*�F�V�Q[�kq�4r�4r�+s�t� �� ���r� documentsr �columnc�2���|j}|j}|jj�||���t ���}t d|���}d}��fd�}ddl} | j� |j||�����D]\} } |� d��|| z }� |t ���kr=t� d t ���|z �d t �������t� d |d �d ���dS)z� Add documents to the index. 
If the documents are inside a certain column, you can specify it using the `column` argument. ��index�body�docs)�unit�totalrc3��K���$t���D]\}}|�|d�V��dSt���D] \}}||d�V�� dS)N)rZ�_id)� enumerate)�i�examplerkrjs ��r�passage_generatorz;ElasticSearchIndex.add_documents.<locals>.passage_generator�s�������!�"+�I�"6�"6�>�>�J�A�w�#*�6�?�1�=�=�=�=�=�=�>�>�#,�I�"6�"6�6�6�J�A�w�#*�1�5�5�5�5�5�5�6�6rN)�clientrn�actionsrz>Some documents failed to be added to ElasticSearch. Failures: �/zIndexed �dz documents)rMrNrLr�create�len�hf_tqdmr�helpers�streaming_bulk�update�logger�warning�info) r6rjrk� index_name� index_config�number_of_docs�progress� successesrx�es�ok�actions `` r� add_documentsz ElasticSearchIndex.add_documents�s\���� �'� ��+� � ���%�%�J�\�%�J�J�J��Y������n�=�=�=��� � 6� 6� 6� 6� 6� 6� #�"�"�"��*�3�3��>��%�%�'�'�4� � � � �J�B�� �O�O�A� � � � ��O�I�I� ��I��� &� &� �N�N�~�QT�U^�Q_�Q_�bk�Qk�~�~�nq�r{�n|�n|�~�~� � � � � � �6�y�6�6�6�6�7�7�7�7�7rr0r7r2c ��|jjd |jd|dgdd�i|d�d�|��}|dd}td�|D��d �|D����S) amFind the nearest examples indices to the query. Args: query (`str`): The query as a string. k (`int`): The number of examples to retrieve. Ouput: scores (`List[List[float]`): The retrieval scores of the retrieved examples. indices (`List[List[int]]`): The indices of the retrieved examples. 
� multi_matchrZ� cross_fields)r7�fieldsrV)r7�sizerm�hitsc��g|] }|d�� S)�_scorer��.0�hits r� <listcomp>z-ElasticSearchIndex.search.<locals>.<listcomp>�s��<�<�<��c�(�m�<�<�<rc�8�g|]}t|d����S)rt)r!r�s rr�z-ElasticSearchIndex.search.<locals>.<listcomp>�s#��>_�>_�>_�SV�s�3�u�:���>_�>_�>_rr)rLr9rMr)r6r7r1r8�responser�s rr9zElasticSearchIndex.search�s���)�4�>�(� ��$�)�U�v�h�Xf�+g�+g�h�rs�t�t� � �� � �� ����'���<�<�t�<�<�<�>_�>_�Z^�>_�>_�>_�`�`�`rr1c ����� �ddl}dgt|��zdgt|��z}}|j�|���5� � ���fd�t |��D��}|j�|��D]2} || } | ���} | j|| <| j|| <�3 ddd��n #1swxYwYt||���S)Nr)� max_workersc�B��i|]\}}�j�j|�fi���|��Sr)�submitr9)r�rvr7�executorr1r8r6s ����r� <dictcomp>z3ElasticSearchIndex.search_batch.<locals>.<dictcomp>�s=���v�v�v�W_�WX�Z_��x��t�{�E�1�O�O��O�O�QR�v�v�vr)r%r$) �concurrent.futuresr~�futures�ThreadPoolExecutorru� as_completed�resultrrr#) r6r<r1r�r8� concurrentr$r%�future_to_index�futurern�resultsr�s ` ` ` @rr=zElasticSearchIndex.search_batch�s9������!�!�!�!�'+�f�s�7�|�|�&;�d�V�c�'�l�l�=R�m� � � � 2� 2�{� 2� K� K� 7�x�v�v�v�v�v�v�v�cl�mt�cu�cu�v�v�v�O�$�,�9�9�/�J�J� 7� 7��'��/��)/������&-�n� �U�#�'.�� �e�$�$�  7� 7� 7� 7� 7� 7� 7� 7� 7� 7� 7� 7���� 7� 7� 7� 7�$�-�l�[�[�[�[s� A,C�C� C)NNNNN�NrD)r0r0)rrrrErrFr!r*rirrr�rr9r#r=rrrrIrIbs(��������#�"�/3�'+�*.� $ �$ ��s�m�$ ��s�m�$ ��O�,� $ �  ��}� $ � "�$�� $ �$ �$ �$ �L"8�"8�u�T�#�Y� �-A�'B�"8�H�UX�M�"8�"8�"8�"8�Ha�a�C�a�M�a�a�a�a�& \� \�s� \�Nb� \� \� \� \� \� \rrIc � �eZdZdZ ddeeeeefdeedeededfd�Z dd ee j d fd eed edeedee f d�Z eddddeeeeefddfd���Zd de j defd�Zd de j defd�Zddeeefdeefd�Ze d!deeefdeeeeefdeeddfd���ZdS)"� FaissIndexa Dense index using Faiss. It is used to index vectors. Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. 
You can find more information about Faiss here: - For index types and the string factory: https://github.com/facebookresearch/faiss/wiki/The-index-factory - For GPU settings: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU N�device�string_factory� metric_type� custom_index� faiss.Indexc��|�|�td���|�|�td���||_||_||_||_t st d���dS)a$ Create a Dense index using Faiss. You can specify `device` if you want to run it on GPU (`device` must be the GPU index). You can find more information about Faiss here: - For `string factory`: https://github.com/facebookresearch/faiss/wiki/The-index-factory NzFPlease specify either `string_factory` or `custom_index` but not both.zsCannot pass both 'custom_index' and 'device'. Pass 'custom_index' already transferred to the target device instead.a{You must install Faiss to use FaissIndex. To do so you can run `conda install -c pytorch faiss-cpu` or `conda install -c pytorch faiss-gpu`. A community supported package is also available on pypi: `pip install faiss-cpu` or `pip install faiss-gpu`. Note that pip may not have the latest version of FAISS, and thus, some of the latest features and bug fixes may not be available.)rar�r�r�� faiss_index� _has_faissr`)r6r�r�r�r�s rrizFaissIndex.__init__�s��� � %�,�*B��e�f�f� f� � �,�":��X��� ��� �,���&���'���� ��T��� � � r���vectorsr rk� batch_size� train_size� faiss_verbosec���ddl}|r@t|j|t��s t d|�d|j|�����|j��|�t |d��nt |d|��}|j�;|j�|j ||j��}nK|j ||j|j��}n.|j�|j |��}n|j ||j��}|� ||j ��|_t�dt|j������|��||j_t#|jd��r|jj�||jj_t#|jd��r|jj�||jj_t#|jd��r|jj�||jj_|�b|� |d|�n|d|�|} t�d t | ���d ���|j�| ��nt�d ��t�d t |���d ���t-t/dt |��|����D]>} |� || | |z�n|| | |z�|} |j�| ���?dS)z� Add vectors to the index. If the arrays are inside a certain column, you can specify it using the `column` argument. rNzWrong feature type for column 'z'. 
Expected 1d array, got zCreated faiss index of type rn� quantizer�clustering_indexz"Training the index with the first z vectorszEIgnored the training step of the faiss index as `train_size` is None.zAdding z vectors to the faiss index)r� isinstance�featuresr rar�r~r�r�� index_factory� IndexFlat�_faiss_index_to_devicer�r�r�rV�verbose�hasattrrnr�r��trainr�range�add) r6r�rkr�r�r�rr�rn� train_vecsrv�vecss r� add_vectorszFaissIndex.add_vectors�s�� � � � � � �*�W�%5�f�%=�x�H�H� ��n�&�n�n�T[�Td�ek�Tl�n�n��� � � � #�&,�n�3�w�q�z�?�?�?�#�g�a�j��>P�:Q�:Q�D��"�.��#�+�/�E�/��d�6I�J�J�E�E�/�E�/��d�6I�4�K[�\�\�E�E��#�+�+�E�O�D�1�1�E�E�+�E�O�D�$�2B�C�C�E�#�:�:�5�$�+�N�N�D� � �K�K�O�t�D�<L�7M�7M�O�O� P� P� P� � $�'4�D� � $��t�'��1�1� ?�d�6F�6L�6X�1>�� �&�.��t�'��5�5� C�$�:J�:T�:`�5B�� �*�2��t�'�);�<�<� J��AQ�Ab�An�<I�� �1�9� � !�17����*��-�-�W�[�j�[�EY�Z`�Ea�J� �K�K�V�S��_�_�V�V�V� W� W� W� � � "� "�:� .� .� .� .� �K�K�_� `� `� `� � � �G�c�'�l�l�G�G�G�H�H�H���q�#�g�,�,� �;�;�<�<� '� '�A�28�.�7�1�q�:�~�-�.�.�g�a�RS�V`�R`�N`�Fa�bh�Fi�D� � � � �� &� &� &� &� '� 'rrnr2c�p�|�|Sddl}t|t��r9|dkr"|j��}|j|||��}np|j|��}n_t|t tf��r |j|t |�����}n#tdt|���d�dz���|S)z� Sends a faiss index to a device. A device can either be a positive integer (GPU id), a negative integer (all GPUs), or a list of positive integers (select GPUs to use), or `None` for CPU. Nr�����)�gpuszThe argument type: z is not expected. zZPlease pass in either nothing, a positive int, a negative int, or a list of positive ints.) rr�r!�StandardGpuResources�index_cpu_to_gpu�index_cpu_to_all_gpusr�tuple�index_cpu_to_gpus_list� TypeErrorrV)rnr�r� faiss_ress rr�z!FaissIndex._faiss_index_to_device;s��� �>��L�� � � � �f�c� "� "� ���{�{�6�E�6�8�8� �.��.�y�&�%�H�H���4��3�E�:�:��� ���u� � .� .� �0�E�0��T�&�\�\�J�J�J�E�E��F�d�6�l�l�F�F�F�n�o��� � � rr0r7c ��t|j��dkr8t|j��dks|jddkrtd���|�dd��}|jjst j|d���}|jj ||fi|��\}}t|d|d� t����S)awFind the nearest examples indices to the query. Args: query (`np.array`): The query as a numpy array. 
k (`int`): The number of examples to retrieve. Ouput: scores (`List[List[float]`): The retrieval scores of the retrieved examples. indices (`List[List[int]]`): The indices of the retrieved examples. r�rzHShape of query is incorrect, it has to be either a 1D array or 2D (1, N)r��C��order) r~�shapera�reshape�flags� c_contiguous�np�asarrayr�r9r�astyper!)r6r7r1r8r<rrs rr9zFaissIndex.search]s��� �u�{� � �q� � �c�%�+�&6�&6�!�&;�&;�u�{�1�~�QR�?R�?R��g�h�h� h��-�-��2�&�&���}�)� 5��j���4�4�4�G�1�$�*�1�'�1�G�G��G�G�����V�A�Y��� �(9�(9�#�(>�(>�?�?�?rr<c ��t|j��dkrtd���|jjst j|d���}|jj||fi|��\}}t||� t����S)a�Find the nearest examples indices to the queries. Args: queries (`np.array`): The queries as a numpy array. k (`int`): The number of examples to retrieve. Ouput: total_scores (`List[List[float]`): The retrieval scores of the retrieved examples per query. total_indices (`List[List[int]]`): The indices of the retrieved examples per query. r�zShape of query must be 2Dr�r�) r~r�rar�r�r�r�r�r9r#r�r!)r6r<r1r8rrs rr=zFaissIndex.search_batchqs��� �w�}� � �� "� "��8�9�9� 9��}�)� 5��j���4�4�4�G�1�$�*�1�'�1�G�G��G�G����#�F�G�N�N�3�,?�,?�@�@�@rr>�storage_optionsc �~�ddl}|j�=t|jttt f��r|j|j��}n|j}tj t|��dfi|pi��5}|j ||j |j |j������ddd��dS#1swxYwYdS)z Serialize the FaissIndex on diskrN�wb)rr�r�r!rr��index_gpu_to_cpur��fsspec�openrF� write_index�BufferedIOWriter�PyCallbackIOWriter�write)r6r>r�rrn�fs rr@zFaissIndex.save�s��� � � � �;� "�z�$�+��T�5�?Q�'R�'R� "�*�E�*�4�+;�<�<�E�E��$�E� �[��T���D� D� D�_�-B�� D� D� `�� �E� �e�%;�U�%;�<T�E�<T�UV�U\�<]�<]�%^�%^� _� _� _� `� `� `� `� `� `� `� `� `� `� `� `���� `� `� `� `� `� `s�23B2�2B6�9B6c�<�ddl}||���}tjt|��dfi|pi��5}|j|j|j|j������}ddd��n #1swxYwY|�||j ��|_ |S)z$Deserialize the FaissIndex from diskrN)r��rb) rr�r�rF� read_index�BufferedIOReader�PyCallbackIOReader�readr�r�r�)rBr>r�r�rr�r�rns rrCzFaissIndex.load�s��� � � � ��c��(�(�(� � �[��T���D� D� D�_�-B�� D� D� _��$�E�$�%;�U�%;�<T�E�<T�UV�U[�<\�<\�%]�%]�^�^�E� _� _� _� _� _� _� _� _� _� _� 
_���� _� _� _� _�"-�"D�"D�U�K�L^�"_�"_� ���s�2A1�1A5�8A5�NNNN)Nr�NNr�rD�NN)rrrrErrr!rrFrir��array�boolr�� staticmethodr�rr9r#r=rr*r@rGrCrrrr�r��sZ��������37�(,�%)�04� ����s�D��I�~�.�/��!�� ���c�]� � �}�-� ����B!%��$(�(,� :'�:'��r�x��*�+�:'��� �:'�� :'� �S�M� :'�  ��~� :'�:'�:'�:'�x���m��X�e�C�QU�VY�QZ�N�F[�=\��hu�����\��B@�@�B�H�@��@�@�@�@�(A�A�B�H�A�AU�A�A�A�A�$ `� `��s�H�}�-� `���� `� `� `� `��37�*.� ���C��M�"����s�D��I�~�.�/��"�$�� � � ����[���rr�c� �eZdZdZd�Zd�Zd�Zdedefd�Z defd�Z de efd �Z dede fd �Z d/dedeedeeee efdeedeededdedeedefd�Z d0dejdedeeee efdeedeededdedeedefd�Zd1dedeeefdeefd�Z d2dedeeefdeeee efdeefd�Z d3dedeedeedeed ed!d"eed#eefd$�Z d4ded"edeedeed ed!d#eef d%�Zdefd&�Zd5ded(eeejfd)edefd*�Z d5ded+ee eejfd)edefd,�Z d5ded(eeejfd)ede!fd-�Z" d5ded+ee eejfd)ede#fd.�Z$d S)6�IndexableMixinz+Add indexing features to `datasets.Dataset`c��i|_dSr���_indexes�r6s rrizIndexableMixin.__init__�s ��.0�� � � rc��t�r�r4r�s r�__len__zIndexableMixin.__len__����!�!rc��t�r�r4)r6�keys r� __getitem__zIndexableMixin.__getitem__�rrr�r2c��||jvSr�r��r6r�s r�is_index_initializedz#IndexableMixin.is_index_initialized�s���T�]�*�*rc�V�|�|��std|�d����dS)NzIndex with index_name 'zk' not initialized yet. Please make sure that you call `add_faiss_index` or `add_elasticsearch_index` first.)rrrs r�_check_index_is_initializedz*IndexableMixin._check_index_is_initialized�sN���(�(��4�4� ��b�*�b�b�b��� � � rc�*�t|j��S)zEList the `colindex_nameumns`/identifiers of all the attached indexes.)rr�r�s r� list_indexeszIndexableMixin.list_indexes�s���D�M�"�"�"rc�F�|�|��|j|S)z�List the `index_name`/identifiers of all the attached indexes. Args: index_name (`str`): Index name. Returns: [`BaseIndex`] )r r�rs r� get_indexzIndexableMixin.get_index�s%�� �(�(��4�4�4��}�Z�(�(rNr�Frkr�r�r�r�r�r�r�r�c ��|�|n|}t||||���} | �||||| ���| |j|<dS)a�Add a dense index using Faiss for fast retrieval. The index is created using the vectors of the specified column. 
You can specify `device` if you want to run it on GPU (`device` must be the GPU index, see more below). You can find more information about Faiss here: - For `string factory`: https://github.com/facebookresearch/faiss/wiki/The-index-factory Args: column (`str`): The column of the vectors to add to the index. index_name (Optional `str`): The index_name/identifier of the index. This is the index_name that is used to call `.get_nearest` or `.search`. By default it corresponds to `column`. device (Optional `Union[int, List[int]]`): If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. string_factory (Optional `str`): This is passed to the index factory of Faiss to create the index. Default index class is IndexFlatIP. metric_type (Optional `int`): Type of metric. Ex: `faiss.METRIC_INNER_PRODUCT` or `faiss.METRIC_L2`. custom_index (Optional `faiss.Index`): Custom Faiss index that you already have instantiated and configured for your needs. batch_size (Optional `int`): Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000. <Added version="2.4.0"/> train_size (Optional `int`): If the index needs a training step, specifies how many vectors will be used to train the index. faiss_verbose (`bool`, defaults to False): Enable the verbosity of the Faiss index. N�r�r�r�r��rkr�r�r��r�r�r�) r6rkr�r�r�r�r�r�r�r�r�s r�add_faiss_indexzIndexableMixin.add_faiss_index�sp��@$.�#9�Z�Z�v� � ��.�k�`l� � � � � ��� ��J�:�]j� � � � �%0�� �j�!�!�!r�external_arraysc �t�t||||���} | �|d||| ���| |j|<dS)a8Add a dense index using Faiss for fast retrieval. The index is created using the vectors of `external_arrays`. You can specify `device` if you want to run it on GPU (`device` must be the GPU index). 
You can find more information about Faiss here: - For `string factory`: https://github.com/facebookresearch/faiss/wiki/The-index-factory Args: external_arrays (`np.array`): If you want to use arrays from outside the lib for the index, you can set `external_arrays`. It will use `external_arrays` to create the Faiss index instead of the arrays in the given `column`. index_name (`str`): The index_name/identifier of the index. This is the index_name that is used to call `.get_nearest` or `.search`. device (Optional `Union[int, List[int]]`): If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. string_factory (Optional `str`): This is passed to the index factory of Faiss to create the index. Default index class is IndexFlatIP. metric_type (Optional `int`): Type of metric. Ex: `faiss.METRIC_INNER_PRODUCT` or `faiss.METRIC_L2`. custom_index (Optional `faiss.Index`): Custom Faiss index that you already have instantiated and configured for your needs. batch_size (Optional `int`): Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000. <Added version="2.4.0"/> train_size (Optional `int`): If the index needs a training step, specifies how many vectors will be used to train the index. faiss_verbose (`bool`, defaults to False): Enable the verbosity of the Faiss index. rNrr) r6rr�r�r�r�r�r�r�r�r�s r�$add_faiss_index_from_external_arraysz3IndexableMixin.add_faiss_index_from_external_arrays�sb��@!��.�k�`l� � � � � ��� �D�Z�J�fs� � � � �%0�� �j�!�!�!rr>r�c��|�|��}t|t��s#td|�dt |���d����|�||���t �d|�d|����dS)a�Save a FaissIndex on disk. Args: index_name (`str`): The index_name/identifier of the index. This is the index_name that is used to call `.get_nearest` or `.search`. file (`str`): The path to the serialized faiss index on disk or remote URI (e.g. `"s3://my-bucket/index.faiss"`). 
storage_options (`dict`, *optional*): Key/value pairs to be passed on to the file-system backend, if any. <Added version="2.11.0"/> zIndex 'z' is not a FaissIndex but a '�')r�zSaved FaissIndex z at N)r r�r�rarVr@r�r�)r6r�r>r�rns r�save_faiss_indexzIndexableMixin.save_faiss_indexs������z�*�*���%��,�,� `��^�z�^�^�PT�UZ�P[�P[�^�^�^�_�_� _� � � �4�� �9�9�9�� � �>� �>�>��>�>�?�?�?�?�?rc �4�t�|||���}|jjt |��kr3t d|�d|�d|jj�dt |���d� ���||j|<t�d|�d|����d S) a�Load a FaissIndex from disk. If you want to do additional configurations, you can have access to the faiss index object by doing `.get_index(index_name).faiss_index` to make it fit your needs. Args: index_name (`str`): The index_name/identifier of the index. This is the index_name that is used to call `.get_nearest` or `.search`. file (`str`): The path to the serialized faiss index on disk or remote URI (e.g. `"s3://my-bucket/index.faiss"`). device (Optional `Union[int, List[int]]`): If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. storage_options (`dict`, *optional*): Key/value pairs to be passed on to the file-system backend, if any. <Added version="2.11.0"/> )r�r�z1Index size should match Dataset size, but Index 'z' at z has z elements while the dataset has z examples.zLoaded FaissIndex z from N) r�rCr��ntotalr~rar�r�r�)r6r�r>r�r�rns r�load_faiss_indexzIndexableMixin.load_faiss_index)s���0����V�_��U�U�� � � #�s�4�y�y� 0� 0��p�J�p�p�UY�p�p�`e�`q�`x�p�p�[^�_c�[d�[d�p�p�p��� �%*�� �j�!�� � �A��A�A�4�A�A�B�B�B�B�BrrJrKrLrrMrNc�|�|�|n|}t|||||���}|�||���||j|<dS)aAdd a text index using ElasticSearch for fast retrieval. Args: column (`str`): The column of the documents to add to the index. index_name (Optional `str`): The index_name/identifier of the index. This is the index name that is used to call `.get_nearest` or `.search`. By default it corresponds to `column`. 
host (Optional `str`, defaults to localhost): host of where ElasticSearch is running port (Optional `str`, defaults to 9200): port of where ElasticSearch is running es_client (Optional `elasticsearch.Elasticsearch`): The elasticsearch client used to create the index if host and port are None. es_index_name (Optional `str`): The elasticsearch index name used to create the index. es_index_config (Optional `dict`): The configuration of the elasticsearch index. Default config is: Config:: { "settings": { "number_of_shards": 1, "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}}, }, "mappings": { "properties": { "text": { "type": "text", "analyzer": "standard", "similarity": "BM25" }, } }, } N�rJrKrLrMrN)rk)rIr�r�) r6rkr�rJrKrLrMrN�es_indexs r�add_elasticsearch_indexz&IndexableMixin.add_elasticsearch_indexIs`��Z$.�#9�Z�Z�v� �%��D�I�]�ds� � � �� ���t�F��3�3�3�$,�� �j�!�!�!rc�>�t|||||���|j|<dS)aALoad an existing text index using ElasticSearch for fast retrieval. Args: index_name (`str`): The `index_name`/identifier of the index. This is the index name that is used to call `get_nearest` or `search`. es_index_name (`str`): The name of elasticsearch index to load. host (`str`, *optional*, defaults to `localhost`): Host of where ElasticSearch is running. port (`str`, *optional*, defaults to `9200`): Port of where ElasticSearch is running. es_client (`elasticsearch.Elasticsearch`, *optional*): The elasticsearch client used to create the index if host and port are `None`. es_index_config (`dict`, *optional*): The configuration of the elasticsearch index. 
Default config is: ``` { "settings": { "number_of_shards": 1, "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}}, }, "mappings": { "properties": { "text": { "type": "text", "analyzer": "standard", "similarity": "BM25" }, } }, } ``` rN)rIr�)r6r�rMrJrKrLrNs r�load_elasticsearch_indexz'IndexableMixin.load_elasticsearch_index}s2��V%7��D�I�]�ds�% �% �% �� �j�!�!�!rc��|j|=dS)z�Drop the index with the specified column. Args: index_name (`str`): The `index_name`/identifier of the index. Nr�rs r� drop_indexzIndexableMixin.drop_index�s�� �M�*� %� %� %rr0r7r1c �`�|�|��|j|j||fi|��S)aFind the nearest examples indices in the dataset to the query. Args: index_name (`str`): The name/identifier of the index. query (`Union[str, np.ndarray]`): The query as a string if `index_name` is a text index or as a numpy array if `index_name` is a vector index. k (`int`): The number of examples to retrieve. Returns: `(scores, indices)`: A tuple of `(scores, indices)` where: - **scores** (`List[List[float]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples - **indices** (`List[List[int]]`): the indices of the retrieved examples )r r�r9)r6r�r7r1r8s rr9zIndexableMixin.search�s<��" �(�(��4�4�4�/�t�}�Z�(�/��q�C�C�F�C�C�Crr<c �`�|�|��|j|j||fi|��S)a]Find the nearest examples indices in the dataset to the query. Args: index_name (`str`): The `index_name`/identifier of the index. queries (`Union[List[str], np.ndarray]`): The queries as a list of strings if `index_name` is a text index or as a numpy array if `index_name` is a vector index. k (`int`): The number of examples to retrieve per query. 
Returns: `(total_scores, total_indices)`: A tuple of `(total_scores, total_indices)` where: - **total_scores** (`List[List[float]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples per query - **total_indices** (`List[List[int]]`): the indices of the retrieved examples per query )r r�r=)r6r�r<r1r8s rr=zIndexableMixin.search_batch�s<��& �(�(��4�4�4�5�t�}�Z�(�5�g�q�K�K�F�K�K�Krc ���|�|��|j|||fi|��\}}d�|D��}t|dt|���||��S)a�Find the nearest examples in the dataset to the query. Args: index_name (`str`): The index_name/identifier of the index. query (`Union[str, np.ndarray]`): The query as a string if `index_name` is a text index or as a numpy array if `index_name` is a vector index. k (`int`): The number of examples to retrieve. Returns: `(scores, examples)`: A tuple of `(scores, examples)` where: - **scores** (`List[float]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples - **examples** (`dict`): the retrieved examples c��g|] }|dk�|�� S�rr�r�rvs rr�z7IndexableMixin.get_nearest_examples.<locals>.<listcomp>�s��4�4�4�Q�Q�!�V�V�q�V�V�VrN)r r9r'r~)r6r�r7r1r8rr� top_indicess r�get_nearest_examplesz#IndexableMixin.get_nearest_examples�st��& �(�(��4�4�4�%�$�+�j�%��E�E�f�E�E����4�4�'�4�4�4� �%�f�-?�s�;�/?�/?�-?�&@�$�{�BS�T�T�Trc ������|���j|||fi|��\}}d�t||��D��}�fd�|D��}t||��S)aDFind the nearest examples in the dataset to the query. Args: index_name (`str`): The `index_name`/identifier of the index. queries (`Union[List[str], np.ndarray]`): The queries as a list of strings if `index_name` is a text index or as a numpy array if `index_name` is a vector index. k (`int`): The number of examples to retrieve per query. 
Returns: `(total_scores, total_examples)`: A tuple of `(total_scores, total_examples)` where: - **total_scores** (`List[List[float]`): the retrieval scores from either FAISS (`IndexFlatL2` by default) or ElasticSearch of the retrieved examples per query - **total_examples** (`List[dict]`): the retrieved examples per query c�V�g|]&\}}|dtd�|D�������'S)Nc��g|] }|dk�|�� Sr(rr)s rr�zHIndexableMixin.get_nearest_examples_batch.<locals>.<listcomp>.<listcomp> s��;�;�;�!�A��F�F�A�F�F�Fr)r~)r��scores_i� indices_is rr�z=IndexableMixin.get_nearest_examples_batch.<locals>.<listcomp> sN�� � � �#��)� �<�s�;�;�y�;�;�;�<�<�<� =� � � rc�4��g|]}�d�|D����S)c��g|] }|dk�|�� Sr(rr)s rr�zHIndexableMixin.get_nearest_examples_batch.<locals>.<listcomp>.<listcomp>s��<�<�<�Q�Q�!�V�V�q�V�V�Vrr)r�rr6s �rr�z=IndexableMixin.get_nearest_examples_batch.<locals>.<listcomp>s-���[�[�[�'��<�<�'�<�<�<�=�[�[�[r)r r=�zipr,)r6r�r<r1r8r$r%� total_sampless` r�get_nearest_examples_batchz)IndexableMixin.get_nearest_examples_batch�s����& �(�(��4�4�4�&7�d�&7� �G�Q�&Y�&Y�RX�&Y�&Y�#� �m� � �'*�<��'G�'G� � � � �\�[�[�[�]�[�[�[� �,�\�=�I�I�Ir)NNNNNr�NF)NNNNr�NFr�r�)NNNNNNr�rD)%rrrrErirrrFr�rr rr r/r rrr!rr�r�rrr*rrrr!r#rr9r#r=r'r+r,r5rrrr�r��sG������5�5�1�1�1�"�"�"�"�"�"�+�s�+�t�+�+�+�+��c����� #�d�3�i�#�#�#�#� )�C� )�I� )� )� )� )�%)�26�(,�%)�04��$(�#�'0�'0��'0��S�M�'0���s�D��I�~�.�/� '0� !�� � '0� �c�]� '0��}�-�'0��'0��S�M�'0��'0�'0�'0�'0�Z37�(,�%)�04��$(�#�&0�&0���&0��&0���s�D��I�~�.�/� &0� !�� � &0� �c�]� &0��}�-�&0��&0��S�M�&0��&0�&0�&0�&0�P@�@�3�@�e�C��M�6J�@�]e�fj�]k�@�@�@�@�,37�*.� C�C��C��C��M�"�C���s�D��I�~�.�/� C� "�$�� C�C�C�C�F%)�"�"�/3�'+�*.�2-�2-��2-��S�M�2-��s�m� 2-� �s�m� 2-� �O�,� 2-� ��}�2-�"�$��2-�2-�2-�2-�p#�"�/3�*.�- �- ��- ��- ��s�m� - � �s�m� - � �O�,� - �"�$��- �- �- �- �^&�S�&�&�&�&�D�D��D�U�3���=�-A�D�c�D�]j�D�D�D�D�*NP�L�L��L�(-�d�3�i���.A�(B�L�GJ�L� �L�L�L�L�.FH�U�U��U�&+�C���M�&:�U�?B�U� �U�U�U�U�2NP�J�J��J�(-�d�3�i���.A�(B�J�GJ�J� &�J�J�J�J�J�Jrr�)+�importlib.util� 
importlibrcrf�pathlibr�typingrrrrr��numpyr�r�r �utilsr r r� arrow_datasetr rrr`r�util� find_specr_r�� get_loggerrr�� Exceptionrrr#r'r,r/rIr�r�rrr�<module>rAs$������ � � � �����������=�=�=�=�=�=�=�=�=�=�=�=� � � � �����������������"�"�"�"�"�"��  �&�&�&�&�&�&� �/�/�/�/�/�/�/�� � � � � �� ���� �� � � � �� � � � � �� �����^�-�-�o�>�>�d�J�� �^� %� %�g� .� .�d� :� � �� �H� %� %�� � � � � �9� � � ������J���� #�#�#�#�#�:�#�#�#� �����Z���� �����J���� #"�#"�#"�#"�#"�#"�#"�#"�Lr\�r\�r\�r\�r\��r\�r\�r\�jG�G�G�G�G��G�G�G�TpJ�pJ�pJ�pJ�pJ�pJ�pJ�pJ�pJ�pJs$�A � A�A�A�A"�!A"