from typing import Optional, TypeVar

from .arrow_dataset import Dataset, _concatenate_map_style_datasets, _interleave_map_style_datasets
from .dataset_dict import DatasetDict, IterableDatasetDict
from .info import DatasetInfo
from .iterable_dataset import IterableDataset, _concatenate_iterable_datasets, _interleave_iterable_datasets
from .splits import NamedSplit
from .utils import logging
from .utils.py_utils import Literal


logger = logging.get_logger(__name__)


DatasetType = TypeVar("DatasetType", Dataset, IterableDataset)


def interleave_datasets(
    datasets: list[DatasetType],
    probabilities: Optional[list[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: Literal["first_exhausted", "all_exhausted"] = "first_exhausted",
) -> DatasetType:
    """
    Interleave several datasets (sources) into a single dataset.
    The new dataset is constructed by alternating between the sources to get the examples.

    You can use this function on a list of [`Dataset`] objects, or on a list of [`IterableDataset`] objects.

        - If `probabilities` is `None` (default) the new dataset is constructed by cycling between each source to get the examples.
        - If `probabilities` is not `None`, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.

    The resulting dataset ends when one of the source datasets runs out of examples, except when
    `stopping_strategy` is `"all_exhausted"`, in which case the resulting dataset ends when all datasets
    have run out of examples at least once.

    Note for iterable datasets:

    In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process.
    Therefore the "first_exhausted" strategy on a sharded iterable dataset can generate fewer samples in total
    (up to 1 missing sample per subdataset per worker).

    Args:
        datasets (`List[Dataset]` or `List[IterableDataset]`):
            List of datasets to interleave.
        probabilities (`List[float]`, *optional*, defaults to `None`):
            If specified, the new dataset is constructed by sampling
            examples from one source at a time according to these probabilities.
        seed (`int`, *optional*, defaults to `None`):
            The random seed used to choose a source for each example.
        info ([`DatasetInfo`], *optional*):
            Dataset information, like description, citation, etc.
            <Added version="2.4.0"/>
        split ([`NamedSplit`], *optional*):
            Name of the dataset split.
            <Added version="2.4.0"/>
        stopping_strategy (`str`, defaults to `first_exhausted`):
            Two strategies are supported: `first_exhausted` and `all_exhausted`.
            By default, `first_exhausted` is an undersampling strategy, i.e. the dataset construction is
            stopped as soon as one dataset has run out of samples.
            If the strategy is `all_exhausted`, we use an oversampling strategy, i.e. the dataset construction
            is stopped once every sample of every dataset has been added at least once.
            Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous:
            - with no probabilities, the resulting dataset will have `max_length_datasets*nb_dataset` samples.
            - with given probabilities, the resulting dataset will have more samples if some datasets have a
              very low sampling probability.

    Returns:
        [`Dataset`] or [`IterableDataset`]: Return type depends on the input `datasets`
        parameter. `Dataset` if the input is a list of `Dataset`, `IterableDataset` if the input is a list of
        `IterableDataset`.
    Example:

        For regular datasets (map-style):

        ```python
        >>> from datasets import Dataset, interleave_datasets
        >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
        >>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
        >>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
        >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
        >>> dataset["a"]
        [10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
        >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
        >>> dataset["a"]
        [10, 0, 11, 1, 2]
        >>> dataset = interleave_datasets([d1, d2, d3])
        >>> dataset["a"]
        [0, 10, 20, 1, 11, 21, 2, 12, 22]
        >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
        >>> dataset["a"]
        [0, 10, 20, 1, 11, 21, 2, 12, 22]
        >>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
        >>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
        >>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
        >>> dataset = interleave_datasets([d1, d2, d3])
        >>> dataset["a"]
        [0, 10, 20, 1, 11, 21, 2, 12, 22]
        >>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
        >>> dataset["a"]
        [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
        >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
        >>> dataset["a"]
        [10, 0, 11, 1, 2]
        >>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
        >>> dataset["a"]
        [10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]
        ```

        For datasets in streaming mode (iterable):

        ```python
        >>> from datasets import load_dataset, interleave_datasets
        >>> d1 = load_dataset("allenai/c4", "es", split="train", streaming=True)
        >>> d2 = load_dataset("allenai/c4", "fr", split="train", streaming=True)
        >>> dataset = interleave_datasets([d1, d2])
        >>> iterator = iter(dataset)
        >>> next(iterator)
        {'text': 'Comprar Zapatillas para niña en chancla con goma por...'}
        >>> next(iterator)
        {'text': 'Le sacre de philippe ier, 23 mai 1059 - Compte Rendu...'}
        ```
    """
    if not datasets:
        raise ValueError("Unable to interleave an empty list of datasets.")
    for i, dataset in enumerate(datasets):
        if not isinstance(dataset, (Dataset, IterableDataset)):
            if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
                if not dataset:
                    raise ValueError(
                        f"Expected a list of Dataset objects or a list of IterableDataset objects, "
                        f"but element at position {i} is an empty dataset dictionary."
                    )
                raise ValueError(
                    f"Dataset at position {i} has at least one split: {list(dataset)}\n"
                    f"Please pick one to interleave with the other datasets, for example: dataset['{next(iter(dataset))}']"
                )
            raise ValueError(
                f"Expected a list of Dataset objects or a list of IterableDataset objects, "
                f"but element at position {i} is a {type(dataset).__name__}."
            )
        if i == 0:
            dataset_type, other_type = (
                (Dataset, IterableDataset) if isinstance(dataset, Dataset) else (IterableDataset, Dataset)
            )
        elif not isinstance(dataset, dataset_type):
            raise ValueError(
                f"Unable to interleave a {dataset_type.__name__} (at position 0) with a {other_type.__name__} "
                f"(at position {i}). Expected a list of Dataset objects or a list of IterableDataset objects."
            )
    if stopping_strategy not in ["first_exhausted", "all_exhausted"]:
        raise ValueError(f"{stopping_strategy} is not supported. Please enter a valid stopping_strategy.")
    if dataset_type is Dataset:
        return _interleave_map_style_datasets(
            datasets, probabilities, seed, info=info, split=split, stopping_strategy=stopping_strategy
        )
    else:
        return _interleave_iterable_datasets(
            datasets, probabilities, seed, info=info, split=split, stopping_strategy=stopping_strategy
        )


def concatenate_datasets(
    dsets: list[Dataset],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
):
    """
    Converts a list of [`Dataset`] with the same schema into a single [`Dataset`].

    Args:
        dsets (`List[datasets.Dataset]`):
            List of Datasets to concatenate.
        info (`DatasetInfo`, *optional*):
            Dataset information, like description, citation, etc.
        split (`NamedSplit`, *optional*):
            Name of the dataset split.
        axis (`{0, 1}`, defaults to `0`):
            Axis to concatenate over, where `0` means over rows (vertically) and `1` means over columns
            (horizontally).

            <Added version="1.6.0"/>

    Example:

        ```py
        >>> ds3 = concatenate_datasets([ds1, ds2])
        ```
    """
    if not dsets:
        raise ValueError("Unable to concatenate an empty list of datasets.")
    for i, dataset in enumerate(dsets):
        if not isinstance(dataset, (Dataset, IterableDataset)):
            if isinstance(dataset, (DatasetDict, IterableDatasetDict)):
                if not dataset:
                    raise ValueError(
                        f"Expected a list of Dataset objects or a list of IterableDataset objects, "
                        f"but element at position {i} is an empty dataset dictionary."
                    )
                raise ValueError(
                    f"Dataset at position {i} has at least one split: {list(dataset)}\n"
                    f"Please pick one to concatenate with the other datasets, for example: dataset['{next(iter(dataset))}']"
                )
            raise ValueError(
                f"Expected a list of Dataset objects or a list of IterableDataset objects, "
                f"but element at position {i} is a {type(dataset).__name__}."
            )
        if i == 0:
            dataset_type, other_type = (
                (Dataset, IterableDataset) if isinstance(dataset, Dataset) else (IterableDataset, Dataset)
            )
        elif not isinstance(dataset, dataset_type):
            raise ValueError(
                f"Unable to concatenate a {dataset_type.__name__} (at position 0) with a {other_type.__name__} "
                f"(at position {i}). Expected a list of Dataset objects or a list of IterableDataset objects."
            )
    if dataset_type is Dataset:
        return _concatenate_map_style_datasets(dsets, info=info, split=split, axis=axis)
    else:
        return _concatenate_iterable_datasets(dsets, info=info, split=split, axis=axis)
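

# A minimal usage sketch (not part of the upstream module) showing how the two
# public helpers compose on small in-memory datasets. The values and column
# names below are illustrative assumptions. Because this module uses relative
# imports, run it as `python -m datasets.combine` rather than directly.
if __name__ == "__main__":
    d1 = Dataset.from_dict({"a": [0, 1, 2]})
    d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})

    # Round-robin interleaving (no probabilities) stops once the shortest
    # source is exhausted, per the default "first_exhausted" strategy.
    print(interleave_datasets([d1, d2])["a"])  # [0, 10, 1, 11, 2, 12]

    # axis=0 stacks rows; the datasets must share a schema.
    print(concatenate_datasets([d1, d2], axis=0)["a"])  # [0, 1, 2, 10, 11, 12, 13]

    # axis=1 joins columns; the datasets must have the same number of rows
    # and non-overlapping column names.
    d3 = Dataset.from_dict({"b": [100, 101, 102]})
    print(concatenate_datasets([d1, d3], axis=1).column_names)  # ['a', 'b']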