� ���g�����ddlZddlZddlZddlZddlZddlZddlZddlmZddl m Z ddl m Z ddl mZddlmZmZmZddlZddlZddlmZddlmZmZmZmZmZmZdd lm Z d d l!m"Z"d d l#m$Z$m%Z%d d l&m'Z'd dl(m)Z)d dl*m+Z+m,Z,d dl-m.Z.d dl/m0Z0m1Z1m2Z2m3Z3d dl4m5Z5d dl6m7Z7d dl8m9Z9d dl:m;Z;d dl<m=Z=m>Z>m?Z?d dl@mAZAe7jBeC��ZDGd�de ��ZEGd�deF��ZGGd�deF��ZHdS)�N)�Sequence)�partial)�BytesIO)�Path)�Callable�Optional�Union)� url_to_fs)� CommitInfo�CommitOperationAdd�CommitOperationDelete� DatasetCard�DatasetCardData�HfApi)�RepoFile�)�config)�:PUSH_TO_HUB_WITHOUT_METADATA_CONFIGS_SPLIT_PATTERN_SHARDED�Dataset)�Features)� FeatureType)� DatasetInfo�DatasetInfosDict)� _split_re)� NamedSplit�Split� SplitDict� SplitInfo)�Table)�logging)�is_documented_by)�MetadataConfigs)�asdict�glob_pattern_to_regex�string_to_dict)�PathLikec��eZdZd�ZdS)�bindc�0�|jg|�|j�Ri|��S�N)�func�args)�self�fn_args� fn_kwargss �e/home/asafur/pinokio/api/open-webui.git/app/env/lib/python3.11/site-packages/datasets/dataset_dict.py�__call__z bind.__call__1s)���t�y�;�'�;�D�I�;�;�;��;�;�;�N)�__name__� __module__� __qualname__r1�r2r0r(r(0s#������<�<�<�<�<r2r(c'�Z ��eZdZdZd�Zd�Zd�Zd�Zdef�fd� Z e de e e ffd���Ze de e e ffd ���Ze de e effd ���Ze de e effd ���Ze de e ee ffd ���Ze de e eeffd ���Zdhdid�Zde de e effd�Zde e effd�Zd�Zdeddfd�Zde ddfd�Zdee ee fddfd�Zde de ddfd�Z de e e fddfd�Z!dee ee fddfd�Z"djde d e#ddfd!�Z$e%j& dkd#e'e d$e'ed%e#fd&���Z( dkd#e'e d$e'ed%e#fd'�Z)d(�Z* dld)e'e+d$e'ed%e#fd*�Z, dkd#e'e d$e'ed%e#ddfd+�Z- dld)e'e+d$e'ed%e#ddfd,�Z. dmd.e'e+d/e#d0e#d1e#d2e'ee ee fd3e#d4e'ed5e#d6e'ee ee fd7e#d8e'e#d9e'e e e'e fd:e'ede'ed;e#d<e'e d=e'ed>e'e ddf&d?�Z/ dnd.e'e+d/e#d0e#d2e'ee ee fd3e#d4e'ed7e#d8e'e#d9e'e e e'e fd:e'ed<e'e d=e'ed>e'e ddfd@�Z0 dod7e#d9e'e e e'e fd:e'ede'ed;e#d=e'edAe'e ddfdB�Z1 dpdee e2e fdDee#e2e#fdEe d7e#d8e'e#dFe'e e e'e fd:e'eddfdG�Z3 dqdHe'eee e e'effdIe'edJe'e e e4j5j6fd7e#d8e'e#dFe'e e e'e fd:e'eddfdK�Z7 drdLe8dMe'ee efdNe'e e efd=e'edOe'e f dP�Z9e: dsdLe8d7e'e#dOe'e ddfdQ���Z;e: dkdRe e e8fde'edSe d7e#ddf dT���Z<e: dkdRe e e8fde'edSe d7e#ddf dU���Z=e: dtdRe e e8fde'edSe d7e#d$e'ee ddf dV���Z>e: dkdRe e e8fde'edSe d7e#ddf dW���Z?e@ejA��dXe dYe ddfdZ���ZA dud]e d^e'e#d_e'e d`e'e dae'e dbe'e#dce'e dde'e dee'e#dMe'eee fdNe'e e efdfe#deBfdg�ZC�xZDS)v� DatasetDictz`A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)c��|���D]7}t|t��s tdt |���d�����8dS)NzBValues in `DatasetDict` should be of type `Dataset` but got type '�')�values� isinstancer� TypeError�type�r-�datasets r0�_check_values_typezDatasetDict._check_values_type8sd���{�{�}�}� w� w�G��g�w�/�/� w�� u�ei�jq�er�er� u� u� u�v�v�v� w� w� wr2c �F�t|�����}t|dd�|dd���D]^\}}|dj|djkr=t d|d�d|d�d|dj�d|dj������_dS)N�����rzNAll datasets in `DatasetDict` should have the same features but features for 'rz' and 'z' don't match: z != )�list�items�zip�features� ValueError)r-rE�item_a�item_bs r0�_check_values_featuresz"DatasetDict._check_values_features=s���T�Z�Z�\�\�"�"��!�%����*�e�A�B�B�i�8�8� � �N�F�F��a�y�!�V�A�Y�%7�7�7� �~�ek�lm�en�~�~�w}�~�xA�~�~�RX�YZ�R[�Rd�~�~�jp�qr�js�j|�~�~����8� � r2c��|Sr*r6�r-s r0� __enter__zDatasetDict.__enter__Es��� r2c�|�|���D]&}t|d��r|`t|d��r|`�'dS)N�_data�_indices)r;�hasattrrPrQ)r-�exc_type�exc_val�exc_tbr@s r0�__exit__zDatasetDict.__exit__HsS���{�{�}�}� %� %�G��w��(�(� "��M��w� �+�+� %��$��  %� %r2�returnc ����t|ttf��st���dkr!t ���|��S�fd�t jt jt j fD��}|r|dnt���d}td|�d|�d|�dt��������)Nrc���g|]}|�v�|�� Sr6r6)�.0�splitr-s �r0� <listcomp>z+DatasetDict.__getitem__.<locals>.<listcomp>Ts,���*�*�*��RW�[_�R_�R_��R_�R_�R_r2z Invalid key: zD. 
Please first select a split. For example: `my_dataset_dictionary['z'][z]`. Available splits: ) r<�strr�len�super� __getitem__r�TRAIN�TEST� VALIDATIONrD�KeyError�sorted)r-�k�available_suggested_splits�suggested_split� __class__s` �r0r`zDatasetDict.__getitem__Ps����� �a�#�z�*� +� +� �s�4�y�y�A�~�~��7�7�&�&�q�)�)� )�*�*�*�*�$)�K���U�=M�#N�*�*�*� &�@Z�l�8��;�;�_c�dh�_i�_i�jk�_l�O��4��4�4�+:�4�4�?@�4�4�%+�D�\�\�4�4��� r2c�f�|���d�|���D��S)z�The Apache Arrow tables backing each split. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.data ``` c�$�i|] \}}||j��Sr6)�data�rZrfr@s r0� <dictcomp>z$DatasetDict.data.<locals>.<dictcomp>ks ��?�?�?�J�A�w��7�<�?�?�?r2�rArErMs r0rlzDatasetDict.data^s2�� ���!�!�!�?�?�$�*�*�,�,�?�?�?�?r2c�f�|���d�|���D��S)a�The cache files containing the Apache Arrow table backing each split. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.cache_files {'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}], 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}], 'validation': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]} ``` c�$�i|] \}}||j��Sr6)� cache_filesrms r0rnz+DatasetDict.cache_files.<locals>.<dictcomp>}�!��F�F�F�:�1�g��7�&�F�F�Fr2rorMs r0rrzDatasetDict.cache_filesms2�� ���!�!�!�F�F������F�F�F�Fr2c�f�|���d�|���D��S)a*Number of columns in each split of the dataset. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.num_columns {'test': 2, 'train': 2, 'validation': 2} ``` c�$�i|] \}}||j��Sr6)� num_columnsrms r0rnz+DatasetDict.num_columns.<locals>.<dictcomp>�rsr2rorMs r0rvzDatasetDict.num_columnss2�� ���!�!�!�F�F������F�F�F�Fr2c�f�|���d�|���D��S)a-Number of rows in each split of the dataset. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.num_rows {'test': 1066, 'train': 8530, 'validation': 1066} ``` c�$�i|] \}}||j��Sr6)�num_rowsrms r0rnz(DatasetDict.num_rows.<locals>.<dictcomp>�s!��C�C�C� ��7��7�#�C�C�Cr2rorMs r0ryzDatasetDict.num_rows�s2�� ���!�!�!�C�C�d�j�j�l�l�C�C�C�Cr2c�f�|���d�|���D��S)apNames of the columns in each split of the dataset. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.column_names {'test': ['text', 'label'], 'train': ['text', 'label'], 'validation': ['text', 'label']} ``` c�$�i|] \}}||j��Sr6�� column_namesrms r0rnz,DatasetDict.column_names.<locals>.<dictcomp>�s!��G�G�G�J�A�w��7�'�G�G�Gr2rorMs r0r}zDatasetDict.column_names�s2�� ���!�!�!�G�G�$�*�*�,�,�G�G�G�Gr2c�f�|���d�|���D��S)aTShape of each split of the dataset (number of rows, number of columns). 
Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.shape {'test': (1066, 2), 'train': (8530, 2), 'validation': (1066, 2)} ``` c�$�i|] \}}||j��Sr6)�shaperms r0rnz%DatasetDict.shape.<locals>.<dictcomp>�s ��@�@�@�Z�Q���7�=�@�@�@r2rorMs r0r�zDatasetDict.shape�s2�� ���!�!�!�@�@�4�:�:�<�<�@�@�@�@r2�c���|���t�fd�|���D����S)a�Flatten the Apache Arrow Table of each split (nested features are flatten). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("rajpurkar/squad") >>> ds["train"].features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None)} >>> ds.flatten() DatasetDict({ train: Dataset({ features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], num_rows: 87599 }) validation: Dataset({ features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'], num_rows: 10570 }) }) ``` c�D��i|]\}}||�������S))� max_depth)�flatten)rZrfr@r�s �r0rnz'DatasetDict.flatten.<locals>.<dictcomp>�s-���c�c�c� ��7�A�w�����C�C�c�c�cr2�rAr8rE)r-r�s `r0r�zDatasetDict.flatten�sG���: ���!�!�!��c�c�c�c�VZ�V`�V`�Vb�Vb�c�c�c�d�d�dr2�columnc�l��|����fd�|���D��S)a�Return a list of the unique elements in a column for each split. This is implemented in the low-level backend and as such, very fast. Args: column (`str`): column name (list all the column names with [`~datasets.DatasetDict.column_names`]) Returns: Dict[`str`, `list`]: Dictionary of unique elements in the given column. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.unique("label") {'test': [1, 0], 'train': [1, 0], 'validation': [1, 0]} ``` c�B��i|]\}}||������Sr6)�unique)rZrfr@r�s �r0rnz&DatasetDict.unique.<locals>.<dictcomp>�s+���I�I�I�j�a���7�>�>�&�)�)�I�I�Ir2ro)r-r�s `r0r�zDatasetDict.unique�s9���* ���!�!�!�I�I�I�I�D�J�J�L�L�I�I�I�Ir2c�f�|���d�|���D��S)a2Clean up all cache files in the dataset cache directory, excepted the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files. Return: `Dict` with the number of removed files for each split Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.cleanup_cache_files() {'test': 0, 'train': 0, 'validation': 0} ``` c�>�i|]\}}||�����Sr6)�cleanup_cache_filesrms r0rnz3DatasetDict.cleanup_cache_files.<locals>.<dictcomp> s*��P�P�P�Z�Q���7�.�.�0�0�P�P�Pr2rorMs r0r�zDatasetDict.cleanup_cache_files�s2�� ���!�!�!�P�P�4�:�:�<�<�P�P�P�Pr2c��d�d�|���D����}tjdd|dtj��}d|�d�S)N� c�"�g|] \}}|�d|���� S�z: r6�rZrf�vs r0r\z(DatasetDict.__repr__.<locals>.<listcomp> �&��?�?�?�$�!�Q�Q�+�+�!�+�+�?�?�?r2�^� rzDatasetDict({ � })��joinrE�re�sub�M�r-�reprs r0�__repr__zDatasetDict.__repr__ sT���y�y�?�?�$�*�*�,�,�?�?�?�@�@���v�d�G�T�1�b�d�3�3��-�$�-�-�-�-r2rGc���|���t�fd�|���D����S)a� Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary. Args: features ([`Features`]): New features to cast the dataset to. 
The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. `string` <-> `ClassLabel` you should use [`~DatasetDict.map`] to update the dataset. Example: ```py >>> from datasets import load_dataset, ClassLabel, Value >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} >>> new_features = ds["train"].features.copy() >>> new_features['label'] = ClassLabel(names=['bad', 'good']) >>> new_features['text'] = Value('large_string') >>> ds = ds.cast(new_features) >>> ds["train"].features {'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='large_string', id=None)} ``` c�D��i|]\}}||�������S�)rG��cast�rZrfr@rGs �r0rnz$DatasetDict.cast.<locals>.<dictcomp>/s-���^�^�^�:�1�g�A�w�|�|�X�|�>�>�^�^�^r2r��r-rGs `r0r�zDatasetDict.castsG���: ���!�!�!��^�^�^�^�QU�Q[�Q[�Q]�Q]�^�^�^�_�_�_r2c����|���t��fd�|���D����S)aCast column to feature for decoding. Args: column (`str`): Column name. feature ([`Feature`]): Target feature. Returns: [`DatasetDict`] Example: ```py >>> from datasets import load_dataset, ClassLabel >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good'])) >>> ds["train"].features {'label': ClassLabel(names=['bad', 'good'], id=None), 'text': Value(dtype='string', id=None)} ``` c�F��i|]\}}||��������S�)r��feature�� cast_column�rZrfr@r�r�s ��r0rnz+DatasetDict.cast_column.<locals>.<dictcomp>Ls5���r�r�r�Wa�WX�Za�A�w�2�2�&�'�2�R�R�r�r�rr2r��r-r�r�s ``r0r�zDatasetDict.cast_column1sK����4 ���!�!�!��r�r�r�r�r�ei�eo�eo�eq�eq�r�r�r�s�s�sr2r}c���|���t�fd�|���D����S)a� Remove one or several column(s) from each split in the dataset and the features associated to the column(s). The transformation is applied to all the splits of the dataset dictionary. You can also remove a column using [`~DatasetDict.map`] with `remove_columns` but the present method doesn't copy the data of the remaining columns and is thus faster. Args: column_names (`Union[str, list[str]]`): Name of the column(s) to remove. Returns: [`DatasetDict`]: A copy of the dataset object without the columns to remove. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds = ds.remove_columns("label") DatasetDict({ train: Dataset({ features: ['text'], num_rows: 8530 }) validation: Dataset({ features: ['text'], num_rows: 1066 }) test: Dataset({ features: ['text'], num_rows: 1066 }) }) ``` c�D��i|]\}}||�������S�r|��remove_columns�rZrfr@r}s �r0rnz.DatasetDict.remove_columns.<locals>.<dictcomp>v�3���p�p�p�U_�UV�X_�A�w�5�5�<�5�P�P�p�p�pr2r��r-r}s `r0r�zDatasetDict.remove_columnsNsH���N ���!�!�!��p�p�p�p�cg�cm�cm�co�co�p�p�p�q�q�qr2�original_column_name�new_column_namec����|���t��fd�|���D����S)a Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary. You can also rename a column using [`~DatasetDict.map`] with `remove_columns` but the present method: - takes care of moving the original features under the new column name. - doesn't copy the data to a new dataset and is thus much faster. 
Args: original_column_name (`str`): Name of the column to rename. new_column_name (`str`): New name for the column. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds = ds.rename_column("label", "label_new") DatasetDict({ train: Dataset({ features: ['text', 'label_new'], num_rows: 8530 }) validation: Dataset({ features: ['text', 'label_new'], num_rows: 1066 }) test: Dataset({ features: ['text', 'label_new'], num_rows: 1066 }) }) ``` c�F��i|]\}}||��������S�)r�r��� rename_column�rZrfr@r�r�s ��r0rnz-DatasetDict.rename_column.<locals>.<dictcomp>��L��� � � � �A�w� �7�(�(�)=�$3�)��� � � r2r��r-r�r�s ``r0r�zDatasetDict.rename_columnxs`����J ���!�!�!�� � � � � � #'�*�*�,�,�  � � � � � r2�column_mappingc���|���t�fd�|���D����S)a[ Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary. Args: column_mapping (`Dict[str, str]`): A mapping of columns to rename to their new names. Returns: [`DatasetDict`]: A copy of the dataset with renamed columns. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) DatasetDict({ train: Dataset({ features: ['text_new', 'label_new'], num_rows: 8530 }) validation: Dataset({ features: ['text_new', 'label_new'], num_rows: 1066 }) test: Dataset({ features: ['text_new', 'label_new'], num_rows: 1066 }) }) ``` c�D��i|]\}}||�������S�)r���rename_columns�rZrfr@r�s �r0rnz.DatasetDict.rename_columns.<locals>.<dictcomp>�s3���t�t�t�Yc�YZ�\c�A�w�5�5�^�5�T�T�t�t�tr2r��r-r�s `r0r�zDatasetDict.rename_columns�sH���F ���!�!�!��t�t�t�t�gk�gq�gq�gs�gs�t�t�t�u�u�ur2c���|���t�fd�|���D����S)a�Select one or several column(s) from each split in the dataset and the features associated to the column(s). The transformation is applied to all the splits of the dataset dictionary. Args: column_names (`Union[str, list[str]]`): Name of the column(s) to keep. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.select_columns("text") DatasetDict({ train: Dataset({ features: ['text'], num_rows: 8530 }) validation: Dataset({ features: ['text'], num_rows: 1066 }) test: Dataset({ features: ['text'], num_rows: 1066 }) }) ``` c�D��i|]\}}||�������Sr���select_columnsr�s �r0rnz.DatasetDict.select_columns.<locals>.<dictcomp>�r�r2r�r�s `r0r�zDatasetDict.select_columns�sH���B ���!�!�!��p�p�p�p�cg�cm�cm�co�co�p�p�p�q�q�qr2F� include_nullsc����|���t��fd�|���D����S)a�Casts the given column as [`~datasets.features.ClassLabel`] and updates the tables. Args: column (`str`): The name of the column to cast. include_nulls (`bool`, defaults to `False`): Whether to include null values in the class labels. If `True`, the null values will be encoded as the `"None"` class label. 
<Added version="1.14.2"/> Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("boolq") >>> ds["train"].features {'answer': Value(dtype='bool', id=None), 'passage': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)} >>> ds = ds.class_encode_column("answer") >>> ds["train"].features {'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None), 'passage': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)} ``` c�F��i|]\}}||��������S))r�r�)�class_encode_column)rZrfr@r�r�s ��r0rnz3DatasetDict.class_encode_column.<locals>.<dictcomp>s5��� w� w� w�\f�\]�_f�Q��+�+�6��+�W�W� w� w� wr2r�)r-r�r�s ``r0r�zDatasetDict.class_encode_column�sQ����6 ���!�!�!�� w� w� w� w� w�jn�jt�jt�jv�jv� w� w� w� � � r2Nr>�columns�output_all_columnsc +�^K�|���d�|���D��}d�|���D��}d�|���D��}d�|���D��} |j|||fi|��dV�|���D]-\} } | j|| || || fi|| ���.dS#|���D]-\} } | j|| || || fi|| ���.wxYw)a�To be used in a `with` statement. Set `__getitem__` return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary. Args: type (`str`, *optional*): Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`list[str]`, *optional*): Columns to format in the output. `None` means `__getitem__` returns all columns (default). output_all_columns (`bool`, defaults to False): Keep un-formatted columns as well in the output (as python objects). **format_kwargs (additional keyword arguments): Keywords arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`. c�$�i|] \}}||j��Sr6)� _format_typerms r0rnz,DatasetDict.formatted_as.<locals>.<dictcomp>*s!��R�R�R�z�q�'�1�g�2�R�R�Rr2c�$�i|] \}}||j��Sr6)�_format_kwargsrms r0rnz,DatasetDict.formatted_as.<locals>.<dictcomp>+s!��V�V�V�:�1�g�Q�� 6�V�V�Vr2c�$�i|] \}}||j��Sr6)�_format_columnsrms r0rnz,DatasetDict.formatted_as.<locals>.<dictcomp>,s!��X�X�X�Z�Q��a��!8�X�X�Xr2c�$�i|] \}}||j��Sr6)�_output_all_columnsrms r0rnz,DatasetDict.formatted_as.<locals>.<dictcomp>-s!��!`�!`�!`�Z�Q��!�W�%@�!`�!`�!`r2N)rArE� set_format) r-r>r�r�� format_kwargs�old_format_type�old_format_kwargs�old_format_columns�old_output_all_columnsrfr@s r0� formatted_aszDatasetDict.formatted_ass�����. ���!�!�!�R�R�T�Z�Z�\�\�R�R�R��V�V������V�V�V��X�X�4�:�:�<�<�X�X�X��!`�!`�SW�S]�S]�S_�S_�!`�!`�!`�� � �D�O�D�'�+=� O� O�� O� O� O� �E�E�E�"�j�j�l�l� � � ��7�"��"�#�A�&�&�q�)�*�1�-���(��*� ���� � ��d�j�j�l�l� � � ��7�"��"�#�A�&�&�q�)�*�1�-���(��*� ���� ���s �C(�(AD,c �~�|���|���D]}|jd|||d�|���dS)aJSet `__getitem__` return format (type and columns). The format is set for every dataset in the dataset dictionary. Args: type (`str`, *optional*): Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`list[str]`, *optional*): Columns to format in the output. `None` means `__getitem__` returns all columns (default). output_all_columns (`bool`, defaults to False): Keep un-formatted columns as well in the output (as python objects), **format_kwargs (additional keyword arguments): Keywords arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`. It is possible to call `map` after calling `set_format`. Since `map` may add new columns, then the list of formatted columns gets updated. 
In this case, if you apply `map` on a dataset to add a new column, then this column will be formatted: `new formatted columns = (all columns - previously unformatted columns)` Example: ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True) >>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) >>> ds["train"].format {'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'], 'format_kwargs': {}, 'output_all_columns': False, 'type': 'numpy'} ``` �r>r�r�Nr6�rAr;r��r-r>r�r�r�r@s r0r�zDatasetDict.set_format:sq��T ���!�!�!��{�{�}�}� � �G� �G� � ���#5� � � �  � � � � � r2c��|���|���D]}|����dS)awReset `__getitem__` return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary. Same as `self.set_format()` Example: ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True) >>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) >>> ds["train"].format {'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'], 'format_kwargs': {}, 'output_all_columns': False, 'type': 'numpy'} >>> ds.reset_format() >>> ds["train"].format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': None} ``` Nr�r?s r0� reset_formatzDatasetDict.reset_formatmsL��8 ���!�!�!��{�{�}�}� !� !�G� � � � � � � � !� !r2� transformc��|���|���D]}|�d|||����dS)aWSet ``__getitem__`` return format using this transform. The transform is applied on-the-fly on batches when ``__getitem__`` is called. The transform is set for every dataset in the dataset dictionary As :func:`datasets.Dataset.set_format`, this can be reset using :func:`datasets.Dataset.reset_format` Args: transform (`Callable`, optional): user-defined formatting transform, replaces the format defined by :func:`datasets.Dataset.set_format` A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in ``__getitem__``. columns (`list[str]`, optional): columns to format in the output If specified, then the input batch of the transform only contains those columns. output_all_columns (`bool`, default to False): keep un-formatted columns as well in the output (as python objects) If set to True, then the other un-formatted columns are kept with the output of the transform. �custom)r�r�r�Nr��r-r�r�r�r@s r0� set_transformzDatasetDict.set_transform�se��( ���!�!�!��{�{�}�}� � �G� � � ���#5�#� � � � � � � r2c �P�tj|��}|jd|||d�|��|S)a�Set `__getitem__` return format (type and columns). The data formatting is applied on-the-fly. The format `type` (for example "numpy") is used to format batches when using `__getitem__`. The format is set for every dataset in the dataset dictionary. It's also possible to use custom transforms for formatting using [`~datasets.Dataset.with_transform`]. Contrary to [`~datasets.DatasetDict.set_format`], `with_format` returns a new [`DatasetDict`] object with new [`Dataset`] objects. 
Args: type (`str`, *optional*): Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`list[str]`, *optional*): Columns to format in the output. `None` means `__getitem__` returns all columns (default). output_all_columns (`bool`, defaults to `False`): Keep un-formatted columns as well in the output (as python objects). **format_kwargs (additional keyword arguments): Keywords arguments passed to the convert function like `np.array`, `torch.tensor` or `tensorflow.ragged.constant`. Example: ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds["train"].format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': None} >>> ds = ds.with_format("torch") >>> ds["train"].format {'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'], 'format_kwargs': {}, 'output_all_columns': False, 'type': 'torch'} >>> ds["train"][0] {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .', 'label': tensor(1), 'input_ids': tensor([ 101, 18027, 16310, 16001, 1103, 9321, 178, 11604, 7235, 6617, 1742, 2165, 2820, 1206, 6588, 22572, 12937, 1811, 2153, 1105, 1147, 12890, 19587, 6463, 1105, 15026, 1482, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])} ``` r�r6)�copy�deepcopyr�r�s r0� with_formatzDatasetDict.with_format�sO��H�-��%�%����� ���1� � ��  � � � �r2c�^�tj|��}|�|||���|S)a� Set `__getitem__` return format using this transform. The transform is applied on-the-fly on batches when `__getitem__` is called. The transform is set for every dataset in the dataset dictionary As [`~datasets.Dataset.set_format`], this can be reset using [`~datasets.Dataset.reset_format`]. Contrary to [`~datasets.DatasetDict.set_transform`], `with_transform` returns a new [`DatasetDict`] object with new [`Dataset`] objects. Args: transform (`Callable`, *optional*): User-defined formatting transform, replaces the format defined by [`~datasets.Dataset.set_format`]. A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in `__getitem__`. columns (`list[str]`, *optional*): Columns to format in the output. If specified, then the input batch of the transform only contains those columns. 
output_all_columns (`bool`, defaults to False): Keep un-formatted columns as well in the output (as python objects). If set to `True`, then the other un-formatted columns are kept with the output of the transform. Example: ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> def encode(example): ... return tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt") >>> ds = ds.with_transform(encode) >>> ds["train"][0] {'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'input_ids': tensor([ 101, 1103, 2067, 1110, 17348, 1106, 1129, 1103, 6880, 1432, 112, 188, 1207, 107, 14255, 1389, 107, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 170, 11791, 5253, 188, 1732, 7200, 10947, 12606, 2895, 117, 179, 7766, 118, 172, 15554, 1181, 3498, 6961, 3263, 1137, 188, 1566, 7912, 14516, 6997, 119, 102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])} ``` )r�r�r�)r�r�r�r�s r0�with_transformzDatasetDict.with_transform�s5��d�-��%�%����� �7�Wi��j�j�j��r2���function� with_indices� with_rank� with_split� input_columns�batched� batch_size�drop_last_batchr��keep_in_memory�load_from_cache_file�cache_file_names�writer_batch_size�disable_nullabler/�num_proc�descc�L�|���| �t�|��} i}|���D]O\}}|rt ||��}|�|||||||| | | | || |||||���||<|r|j}�Pt|��S)a� Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it. The transformation is applied to all the datasets of the dataset dictionary. You can specify whether the function should be batched or not with the `batched` parameter: - If batched is `False`, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. `{"text": "Hello there !"}`. - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. - If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. Note that the last batch may have less than `n` examples. A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls. It is recommended to use a `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. 
Args: function (`callable`): with one of the following signature: - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` - `function(example: Dict[str, Any], indices: int) -> Dict[str, Any]` if `batched=False` and `with_indices=True` - `function(batch: Dict[str, list]) -> Dict[str, list]` if `batched=True` and `with_indices=False` - `function(batch: Dict[str, list], indices: list[int]) -> Dict[str, list]` if `batched=True` and `with_indices=True` For advanced usage, the function can also return a `pyarrow.Table`. If the function is asynchronous, then `map` will run your function in parallel. Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. If no function is provided, default to identity function: `lambda x: x`. with_indices (`bool`, defaults to `False`): Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. with_rank (`bool`, defaults to `False`): Provide process rank to `function`. Note that in this case the signature of `function` should be `def function(example[, idx], rank): ...`. with_split (`bool`, defaults to `False`): Provide process split to `function`. Note that in this case the signature of `function` should be `def function(example[, idx], split): ...`. input_columns (`[Union[str, list[str]]]`, *optional*, defaults to `None`): The columns to be passed into `function` as positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. batched (`bool`, defaults to `False`): Provide batch of examples to `function`. batch_size (`int`, *optional*, defaults to `1000`): Number of examples per batch provided to `function` if `batched=True`, `batch_size <= 0` or `batch_size == None` then provide the full dataset as a single batch to `function`. drop_last_batch (`bool`, defaults to `False`): Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function. remove_columns (`[Union[str, list[str]]]`, *optional*, defaults to `None`): Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of `function`, i.e. if `function` is adding columns with names in `remove_columns`, these columns will be kept. keep_in_memory (`bool`, defaults to `False`): Keep the dataset in memory instead of writing it to a cache file. load_from_cache_file (`Optional[bool]`, defaults to `True` if caching is enabled): If a cache file storing the current computation from `function` can be identified, use it instead of recomputing. cache_file_names (`[Dict[str, str]]`, *optional*, defaults to `None`): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one `cache_file_name` per dataset in the dataset dictionary. writer_batch_size (`int`, default `1000`): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. Higher value makes the processing do fewer lookups, lower value consume less temporary memory while running `map`. features (`[datasets.Features]`, *optional*, defaults to `None`): Use a specific [`Features`] to store the cache file instead of the automatically generated one. disable_nullable (`bool`, defaults to `False`): Disallow null values in the table. 
fn_kwargs (`Dict`, *optional*, defaults to `None`): Keyword arguments to be passed to `function` num_proc (`int`, *optional*, defaults to `None`): Number of processes for multiprocessing. By default it doesn't use multiprocessing. desc (`str`, *optional*, defaults to `None`): Meaningful description to be displayed alongside with the progress bar while mapping examples. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example >>> ds = ds.map(add_prefix) >>> ds["train"][0:3]["text"] ['Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .', 'Review: effective but too-tepid biopic'] # process a batch of examples >>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True) # set number of processors >>> ds = ds.map(add_prefix, num_proc=4) ``` N)r�r�r�r�r�r�r�r�r�r��cache_file_namerrGrr/rr)rA�dict�fromkeysrEr(�mapr+r8)r-r�r�r�r�r�r�r�r�r�r�r�r�rrGrr/rr� dataset_dictr[r@s r0rzDatasetDict.map-s���n ���!�!�!� � #�#�}�}�T�2�2� �� �"�j�j�l�l� )� )�N�E�7�� 1���%�0�0��")�+�+�!�)�#�+��%� /�-�-�%9� 0�� 7�"3�!�!1�#�!��##.�#�#�L�� �(� )�#�=����<�(�(�(r2c����������� � � � � �|���� �t�|��� t��� � � ����� ��� f d�|���D����S)aApply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary. Args: function (`Callable`): Callable with one of the following signatures: - `function(example: Dict[str, Any]) -> bool` if `batched=False` and `with_indices=False` and `with_rank=False` - `function(example: Dict[str, Any], *extra_args) -> bool` if `batched=False` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) - `function(batch: Dict[str, list]) -> list[bool]` if `batched=True` and `with_indices=False` and `with_rank=False` - `function(batch: Dict[str, list], *extra_args) -> list[bool]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) If no function is provided, defaults to an always `True` function: `lambda x: True`. with_indices (`bool`, defaults to `False`): Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx[, rank]): ...`. with_rank (`bool`, defaults to `False`): Provide process rank to `function`. Note that in this case the signature of `function` should be `def function(example[, idx], rank): ...`. input_columns (`[Union[str, list[str]]]`, *optional*, defaults to `None`): The columns to be passed into `function` as positional arguments. If `None`, a dict mapping to all formatted columns is passed as one argument. batched (`bool`, defaults to `False`): Provide batch of examples to `function`. batch_size (`int`, *optional*, defaults to `1000`): Number of examples per batch provided to `function` if `batched=True` `batch_size <= 0` or `batch_size == None` then provide the full dataset as a single batch to `function`. 
keep_in_memory (`bool`, defaults to `False`): Keep the dataset in memory instead of writing it to a cache file. load_from_cache_file (`Optional[bool]`, defaults to `True` if caching is enabled): If a cache file storing the current computation from `function` can be identified, use it instead of recomputing. cache_file_names (`[Dict[str, str]]`, *optional*, defaults to `None`): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one `cache_file_name` per dataset in the dataset dictionary. writer_batch_size (`int`, defaults to `1000`): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. Higher value makes the processing do fewer lookups, lower value consume less temporary memory while running `map`. fn_kwargs (`Dict`, *optional*, defaults to `None`): Keyword arguments to be passed to `function` num_proc (`int`, *optional*, defaults to `None`): Number of processes for multiprocessing. By default it doesn't use multiprocessing. desc (`str`, *optional*, defaults to `None`): Meaningful description to be displayed alongside with the progress bar while filtering examples. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.filter(lambda x: x["label"] == 1) DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 4265 }) validation: Dataset({ features: ['text', 'label'], num_rows: 533 }) test: Dataset({ features: ['text', 'label'], num_rows: 533 }) }) ``` Nc�h� �i|].\}}||��� �� ��� � �|��� ��� � ��/S)) r�r�r�r�r�r�r�r�rrr/rr��filter)rZrfr@r�r�r�rr/r�r�r�r�rr�r�rs �������������r0rnz&DatasetDict.filter.<locals>.<dictcomp> sp��� � � � �A�w��7�>�>�%�!-�'�"/�#�)�#1�)=�$4�Q�$7�&7�'�%��"��� � � r2�rArrr8rE)r-r�r�r�r�r�r�r�r�r�rr/rrs `````````````r0r zDatasetDict.filter�s����������������l ���!�!�!� � #�#�}�}�T�2�2� �� � � � � � � � � � � � � � � � � #'�*�*�,�,�! � � � � � r2�new_fingerprintc ����������|�����t�|���t�������fd�|���D����S)a�Create and cache a new Dataset by flattening the indices mapping. Args: keep_in_memory (`bool`, defaults to `False`): Keep the dataset in memory instead of writing it to a cache file. cache_file_names (`Dict[str, str]`, *optional*, default `None`): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one `cache_file_name` per dataset in the dataset dictionary. writer_batch_size (`int`, defaults to `1000`): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. Higher value makes the processing do fewer lookups, lower value consume less temporary memory while running `map`. features (`Optional[datasets.Features]`, defaults to `None`): Use a specific [`Features`] to store the cache file instead of the automatically generated one. disable_nullable (`bool`, defaults to `False`): Allow null values in the table. num_proc (`int`, optional, default `None`): Max number of processes when generating cache. Already cached shards are loaded sequentially new_fingerprint (`str`, *optional*, defaults to `None`): The new fingerprint of the dataset after transform. 
If `None`, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments Nc �\��i|](\}}||���|� ���������)S))r�rrrGrrr)�flatten_indices) rZrfr@r�rrGr�rrrs �������r0rnz/DatasetDict.flatten_indices.<locals>.<dictcomp>Zs`��� � � ��A�w��7�*�*�#1�$4�Q�$7�&7�%�%5�%�$3�+��� � � r2r)r-r�r�rrGrrrs ```````r0rzDatasetDict.flatten_indices4s����������D ���!�!�!� � #�#�}�}�T�2�2� �� � � � � � � � � � �#'�*�*�,�,� � � �  �  � r2�at_end�reverse�null_placement�indices_cache_file_namesc ����������|�����t�|���t�������fd�|���D����S)a�Create a new dataset sorted according to a single or multiple columns. Args: column_names (`Union[str, Sequence[str]]`): Column name(s) to sort by. reverse (`Union[bool, Sequence[bool]]`, defaults to `False`): If `True`, sort by descending order rather than ascending. If a single bool is provided, the value is applied to the sorting of all column names. Otherwise a list of bools with the same length and order as column_names must be provided. null_placement (`str`, defaults to `at_end`): Put `None` values at the beginning if `at_start` or `first` or at the end if `at_end` or `last` keep_in_memory (`bool`, defaults to `False`): Keep the sorted indices in memory instead of writing it to a cache file. load_from_cache_file (`Optional[bool]`, defaults to `True` if caching is enabled): If a cache file storing the sorted indices can be identified, use it instead of recomputing. indices_cache_file_names (`[Dict[str, str]]`, *optional*, defaults to `None`): Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one `cache_file_name` per dataset in the dataset dictionary. writer_batch_size (`int`, defaults to `1000`): Number of rows per write operation for the cache file writer. Higher value gives smaller cache files, lower value consume less temporary memory. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes') >>> ds['train']['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> sorted_ds = ds.sort('label') >>> sorted_ds['train']['label'][:10] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] >>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False]) >>> another_sorted_ds['train']['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ``` Nc �\��i|](\}}||�������|� �����)S))r}rrr�r��indices_cache_file_namer)�sort) rZrfr@r}rr�r�rrrs �������r0rnz$DatasetDict.sort.<locals>.<dictcomp>�s^��� � � ��A�w��7�<�<�!-�#�#1�#1�)=�,D�Q�,G�&7� ��� � � r2r)r-r}rrr�r�rrs ```````r0rzDatasetDict.sorths����������b ���!�!�!� #� +�'+�}�}�T�':�':� $�� � � � � � � � � � �#'�*�*�,�,� � � �  �  � r2�seeds�seed� generatorsc���������|���|���td���|�|n����t�|���n0t �t��st�|������t�|�����t�|���t ������fd�|���D����S)a, Create a new Dataset where the rows are shuffled. The transformation is applied to all the datasets of the dataset dictionary. Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy's default random generator (PCG64). Args: seeds (`Dict[str, int]` or `int`, *optional*): A seed to initialize the default BitGenerator if `generator=None`. If `None`, then fresh, unpredictable entropy will be pulled from the OS. If an `int` or `array_like[ints]` is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one `seed` per dataset in the dataset dictionary. 
seed (`int`, *optional*): A seed to initialize the default BitGenerator if `generator=None`. Alias for seeds (a `ValueError` is raised if both are provided). generators (`Dict[str, *optional*, np.random.Generator]`): Numpy random Generator to use to compute the permutation of the dataset rows. If `generator=None` (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy). You have to provide one `generator` per dataset in the dataset dictionary. keep_in_memory (`bool`, defaults to `False`): Keep the dataset in memory instead of writing it to a cache file. load_from_cache_file (`Optional[bool]`, defaults to `True` if caching is enabled): If a cache file storing the current computation from `function` can be identified, use it instead of recomputing. indices_cache_file_names (`Dict[str, str]`, *optional*): Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one `cache_file_name` per dataset in the dataset dictionary. writer_batch_size (`int`, defaults to `1000`): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. Higher value makes the processing do fewer lookups, lower value consume less temporary memory while running `map`. Example: ```py >>> from datasets import load_dataset >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"]["label"][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] # set a seed >>> shuffled_ds = ds.shuffle(seed=42) >>> shuffled_ds["train"]["label"][:10] [0, 1, 0, 1, 0, 0, 0, 0, 0, 0] ``` Nz*Please specify seed or seeds, but not bothc �r��i|]3\}}||��|�|���|������4S))r� generatorr�r�rr��shuffle) rZrfr@rrr�r�rrs ������r0rnz'DatasetDict.shuffle.<locals>.<dictcomp>�sc��� � � ��A�w��7�?�?��q��(��m�#1�)=�,D�Q�,G�&7� #��� � � r2)rArHrrr<r8rE)r-rrrr�r�rrs ` `````r0r"zDatasetDict.shuffle�s ��������r ���!�!�!� � �� 1��I�J�J� J��(���e�� �=��M�M�$�'�'�E�E��E�4�(�(� /��M�M�$��.�.�E� � ����t�,�,�J� #� +�'+�}�}�T�':�':� $�� � � � � � � � � �#'�*�*�,�,� � � �  �  � r2�dataset_dict_path�max_shard_size� num_shards�storage_optionsc�p�t|fi|pi��\}}|�t�|��}n$t|t��st d���|�|d���|�tj|tj ��dd���5}tj dt|��i|��ddd��n #1swxYwY|���D]E\} } | �tj|| ��|�| ��|||� ���FdS) a� Saves a dataset dict to a filesystem using `fsspec.spec.AbstractFileSystem`. For [`Image`], [`Audio`] and [`Video`] data: All the Image(), Audio() and Video() data are stored in the arrow files. If you want to store paths or urls, please use the Value("string") type. Args: dataset_dict_path (`path-like`): Path (e.g. `dataset/train`) or remote URI (e.g. `s3://my-bucket/dataset/train`) of the dataset dict directory where the dataset dict will be saved to. max_shard_size (`int` or `str`, *optional*, defaults to `"500MB"`): The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like `"50MB"`). num_shards (`Dict[str, int]`, *optional*): Number of shards to write. By default the number of shards depends on `max_shard_size` and `num_proc`. You need to provide the number of shards for each dataset in the dataset dictionary. Use a dictionary to define a different num_shards for each split. <Added version="2.8.0"/> num_proc (`int`, *optional*, default `None`): Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. 
<Added version="2.8.0"/> storage_options (`dict`, *optional*): Key/value pairs to be passed on to the file-system backend, if any. <Added version="2.8.0"/> Example: ```python >>> dataset_dict.save_to_disk("path/to/dataset/directory") >>> dataset_dict.save_to_disk("path/to/dataset/directory", max_shard_size="1GB") >>> dataset_dict.save_to_disk("path/to/dataset/directory", num_shards={"train": 1024, "test": 8}) ``` N�gPlease provide one `num_shards` per dataset in the dataset dictionary, e.g. {{'train': 128, 'test': 4}}T)�exist_ok�w�utf-8��encoding�splits)r%r$rr&)r rrr<rH�makedirs�open� posixpathr�r�DATASETDICT_JSON_FILENAME�json�dumprDrE� save_to_disk�get) r-r#r$r%rr&�fs�_�frfr@s r0r5zDatasetDict.save_to_disk�s���`�+�G�G��0E�2�G�G���A� � ����t�,�,�J�J��J��-�-� ��y��� � � � �%�� �5�5�5� �W�W� �N�,�f�.N� O� O� ��� � � 1�� �I�x��d���,�a� 0� 0� 0�  1� 1� 1� 1� 1� 1� 1� 1� 1� 1� 1���� 1� 1� 1� 1� �*�*�,�,� � �J�A�w� � � ���0�!�4�4�%�>�>�!�,�,�-�!� /� !� � � � � � s� %C�C�Cc���t|fi|pi��\}}tj|tj��}tj|tj��}tj|tj��}|�|��sP|�|��r(|�|��rtd|�d����td|�d����|� |dd���5}tj |��d}ddd��n #1swxYwYt��} |D]D} tj|� |��| ��} tj| ||� ��| | <�E| S) aK Load a dataset that was previously saved using [`save_to_disk`] from a filesystem using `fsspec.spec.AbstractFileSystem`. Args: dataset_dict_path (`path-like`): Path (e.g. `"dataset/train"`) or remote URI (e.g. `"s3//my-bucket/dataset/train"`) of the dataset dict directory where the dataset dict will be loaded from. keep_in_memory (`bool`, defaults to `None`): Whether to copy the dataset in-memory. If `None`, the dataset will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to nonzero. See more details in the [improve performance](../cache#improve-performance) section. storage_options (`dict`, *optional*): Key/value pairs to be passed on to the file-system backend, if any. <Added version="2.8.0"/> Returns: [`DatasetDict`] Example: ```py >>> ds = load_from_disk('path/to/dataset/directory') ``` zNo such file: 'z�'. Expected to load a `DatasetDict` object, but got a `Dataset`. Please use either `datasets.load_from_disk` or `Dataset.load_from_disk` instead.zU'. Expected to load a `DatasetDict` object, but provided path is not a `DatasetDict`.�rr+r,r.N)r�r&)r r1r�rr2�DATASET_STATE_JSON_FILENAME�DATASET_INFO_FILENAME�isfile�FileNotFoundErrorr0r3�loadr8�unstrip_protocolr�load_from_disk) r#r�r&r7�dataset_dict_json_path�dataset_state_json_path�dataset_info_pathr9r.r rf�dataset_dict_split_paths r0rBzDatasetDict.load_from_diskHs���B!*�*;� W� W��@U�SU� W� W��� �!*��0A�6�Cc�!d�!d��"+�.�1B�F�Df�"g�"g��%�N�+<�f�>Z�[�[���y�y�/�0�0� ��y�y�*�+�+� �� � �:Q�0R�0R� �'�P�&<�P�P�P����$�P�"8�P�P�P��� ��W�W�+�S�7�W� C� C� ,�q��Y�q�\�\�(�+�F� ,� ,� ,� ,� ,� ,� ,� ,� ,� ,� ,���� ,� ,� ,� ,�#�}�}� �� � �A�&/�n�R�5H�5H�IZ�5[�5[�]^�&_�&_� #�%�4�'�-� /����L��O�O� �s�-D�D�D� path_or_paths� cache_dirc �L�ddlm}||f|||d�|�����S)a+Create [`DatasetDict`] from CSV file(s). Args: path_or_paths (`dict` of path-like): Path(s) of the CSV file(s). features ([`Features`], *optional*): Dataset features. cache_dir (str, *optional*, defaults to `"~/.cache/huggingface/datasets"`): Directory to cache data. keep_in_memory (`bool`, defaults to `False`): Whether to copy the data in-memory. **kwargs (additional keyword arguments): Keyword arguments to be passed to [`pandas.read_csv`]. 
Returns: [`DatasetDict`] Example: ```py >>> from datasets import DatasetDict >>> ds = DatasetDict.from_csv({'train': 'path/to/dataset.csv'}) ``` r)�CsvDatasetReader�rGrHr�)�io.csvrJ�read)rGrGrHr��kwargsrJs r0�from_csvzDatasetDict.from_csv�sW��B -�,�,�,�,�,��� � ���)�  � � �  � � �$�&�&�  r2c �L�ddlm}||f|||d�|�����S)aECreate [`DatasetDict`] from JSON Lines file(s). Args: path_or_paths (`path-like` or list of `path-like`): Path(s) of the JSON Lines file(s). features ([`Features`], *optional*): Dataset features. cache_dir (str, *optional*, defaults to `"~/.cache/huggingface/datasets"`): Directory to cache data. keep_in_memory (`bool`, defaults to `False`): Whether to copy the data in-memory. **kwargs (additional keyword arguments): Keyword arguments to be passed to [`JsonConfig`]. Returns: [`DatasetDict`] Example: ```py >>> from datasets import DatasetDict >>> ds = DatasetDict.from_json({'train': 'path/to/dataset.json'}) ``` r)�JsonDatasetReaderrK)�io.jsonrQrM)rGrGrHr�rNrQs r0� from_jsonzDatasetDict.from_json��W��B /�.�.�.�.�.� � � � ���)�  � � �  � � �$�&�&�  r2c �N�ddlm}||f||||d�|�����S)a5Create [`DatasetDict`] from Parquet file(s). Args: path_or_paths (`dict` of path-like): Path(s) of the CSV file(s). features ([`Features`], *optional*): Dataset features. cache_dir (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`): Directory to cache data. keep_in_memory (`bool`, defaults to `False`): Whether to copy the data in-memory. columns (`list[str]`, *optional*): If not `None`, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'. **kwargs (additional keyword arguments): Keyword arguments to be passed to [`ParquetConfig`]. Returns: [`DatasetDict`] Example: ```py >>> from datasets import DatasetDict >>> ds = DatasetDict.from_parquet({'train': 'path/to/dataset/parquet'}) ``` r)�ParquetDatasetReader)rGrHr�r�)� io.parquetrVrM)rGrGrHr�r�rNrVs r0� from_parquetzDatasetDict.from_parquet�sZ��L 5�4�4�4�4�4�#�#� � ���)��  � � �  � � �$�&�&� r2c �L�ddlm}||f|||d�|�����S)a+Create [`DatasetDict`] from text file(s). Args: path_or_paths (`dict` of path-like): Path(s) of the text file(s). features ([`Features`], *optional*): Dataset features. cache_dir (`str`, *optional*, defaults to `"~/.cache/huggingface/datasets"`): Directory to cache data. keep_in_memory (`bool`, defaults to `False`): Whether to copy the data in-memory. **kwargs (additional keyword arguments): Keyword arguments to be passed to [`TextConfig`]. 
Returns: [`DatasetDict`] Example: ```py >>> from datasets import DatasetDict >>> ds = DatasetDict.from_text({'train': 'path/to/dataset.txt'}) ``` r)�TextDatasetReaderrK)�io.textrZrM)rGrGrHr�rNrZs r0� from_textzDatasetDict.from_text rTr2�label2id� label_columnc����|���t��fd�|���D����S)Nc�F��i|]\}}||��������S))r]r^)�align_labels_with_mapping)rZrfr@r]r^s ��r0rnz9DatasetDict.align_labels_with_mapping.<locals>.<dictcomp>:sC��� � � ��A�w��7�4�4�h�Ua�4�b�b� � � r2r�)r-r]r^s ``r0raz%DatasetDict.align_labels_with_mapping6s[���� ���!�!�!�� � � � � �"&�*�*�,�,� � � � � � r2�defaultT� config_name� set_default�data_dir�commit_message�commit_description�private�token�revision� create_pr�embed_external_filesc ���| �t�|��} n$t| t��std���|���|���d}d}t t|�������j � ��}||_ t��|_ |���D]7}tjt"|��stdt"�d|�d�����8t%t&j|���}|�||d|d � ��}|j}| �/| �d ��s|�|| |dd � ���s |d kr|nd�g}|���D]�}t2� d|�d���||�|�||| | | | �|��| �� � \}}}||z }||z }||z }t9t;|��|t=||�����|j |<��d|_||_ ||_!||z|_"d\}}g}g}d�|D��}|�#|| d|d ���D�];}t|tH��s�|j%t&j&krd }�1|j%t&j'krd }�I|j%�tQ�fd�|���D������r2|j%|vr)|�)tU|j%�������tWj+|j%tYj-dd����rNt]tX��}t_|j%|��}|�J�|d} | |vr|�)| ����=|r`|�0|t&j&d| ���}!tcj2tg|!����}"|"j4}#tkj6|#��}$n?|rd}"to��}#tk��}$nd}"to��}#tk��}$|$s4|r2dd�|D��i}%tkd |%i���8|#��d�fd�|���D��i}&|rQ|d krK|$rD|$�9��}'|'d krtd���|$|'�:d ��}(d |&d <|r�|�0|t&j'd| ���})tw|)d �!��5}*tyj2|*��}+ddd��n #1swxYwYt{|��|+|<t}��},|,�?tyj@|+d"�#���Ad ����|�)t�t&j'|,�$����t�||i���8|#��tk||&i���8|#��|"�tcd%|#�d&���n|"}"|�)t�t&j&t;|"���A���$����|�|nd'}t=|��t&jDkr!|�E|||z|||d| | �(��}-n�t2� d)t&jD�d*���t�jGt=|��t&jDz ��}.t�d|.��D]�}/||/t&jDz|/d+zt&jDz�|/dkr|ngz}0|�E||0|d,|/d-�d.|.d-�d/�z||d| | �(��}-t2� d0|/d+z�d1�|.|/z d+z r d2|.|/z d+z �d3�nd4zd5z����|-S)6a�Pushes the [`DatasetDict`] to the hub as a Parquet dataset. The [`DatasetDict`] is pushed using HTTP requests and does not need to have neither git or git-lfs installed. Each dataset split will be pushed independently. The pushed dataset will keep the original split names. The resulting Parquet files are self-contained by default: if your dataset contains [`Image`] or [`Audio`] data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting `embed_external_files` to False. Args: repo_id (`str`): The ID of the repository to push to in the following format: `<user>/<dataset_name>` or `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace of the logged-in user. config_name (`str`): Configuration name of a dataset. Defaults to "default". set_default (`bool`, *optional*): Whether to set this configuration as the default one. Otherwise, the default configuration is the one named "default". data_dir (`str`, *optional*): Directory name that will contain the uploaded data files. Defaults to the `config_name` if different from "default", else "data". <Added version="2.17.0"/> commit_message (`str`, *optional*): Message to commit while pushing. Will default to `"Upload dataset"`. commit_description (`str`, *optional*): Description of the commit that will be created. Additionally, description of the PR if a PR is created (`create_pr` is True). <Added version="2.16.0"/> private (`bool`, *optional*): Whether to make the repo private. If `None` (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists. token (`str`, *optional*): An optional authentication token for the Hugging Face Hub. 
    def push_to_hub(
        self,
        repo_id: str,
        config_name: str = "default",
        set_default: Optional[bool] = None,
        data_dir: Optional[str] = None,
        commit_message: Optional[str] = None,
        commit_description: Optional[str] = None,
        private: Optional[bool] = None,
        token: Optional[str] = None,
        revision: Optional[str] = None,
        create_pr: Optional[bool] = False,
        max_shard_size: Optional[Union[int, str]] = None,
        num_shards: Optional[dict[str, int]] = None,
        embed_external_files: bool = True,
    ) -> CommitInfo:
        """Pushes the [`DatasetDict`] to the hub as a Parquet dataset.

        The [`DatasetDict`] is pushed using HTTP requests and does not require git or git-lfs to be installed.
        Each dataset split is pushed independently, and the pushed dataset keeps the original split names.

        The resulting Parquet files are self-contained by default: if your dataset contains [`Image`] or
        [`Audio`] data, the Parquet files will store the bytes of your images or audio files.
        You can disable this by setting `embed_external_files` to `False`.

        Args:
            repo_id (`str`): The ID of the repository to push to in the following format: `<user>/<dataset_name>` or
                `<org>/<dataset_name>`. Also accepts `<dataset_name>`, which will default to the namespace of the
                logged-in user.
            config_name (`str`): Configuration name of a dataset. Defaults to `"default"`.
            set_default (`bool`, *optional*): Whether to set this configuration as the default one. Otherwise, the
                default configuration is the one named `"default"`.
            data_dir (`str`, *optional*): Directory name that will contain the uploaded data files. Defaults to the
                `config_name` if different from `"default"`, else `"data"`.

                <Added version="2.17.0"/>
            commit_message (`str`, *optional*): Message to commit while pushing. Will default to `"Upload dataset"`.
            commit_description (`str`, *optional*): Description of the commit that will be created. Additionally,
                description of the PR if a PR is created (`create_pr` is `True`).

                <Added version="2.16.0"/>
            private (`bool`, *optional*): Whether to make the repo private. If `None` (default), the repo will be
                public unless the organization's default is private. This value is ignored if the repo already exists.
            token (`str`, *optional*): An optional authentication token for the Hugging Face Hub. If no token is
                passed, will default to the token saved locally when logging in with `huggingface-cli login`. Will
                raise an error if no token is passed and the user is not logged in.
            revision (`str`, *optional*): Branch to push the uploaded files to. Defaults to the `"main"` branch.

                <Added version="2.15.0"/>
            create_pr (`bool`, *optional*, defaults to `False`): Whether to create a PR with the uploaded files or
                directly commit.

                <Added version="2.15.0"/>
            max_shard_size (`int` or `str`, *optional*, defaults to `"500MB"`): The maximum size of the dataset shards
                to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like
                `"500MB"` or `"1GB"`).
            num_shards (`Dict[str, int]`, *optional*): Number of shards to write. By default, the number of shards
                depends on `max_shard_size`. Use a dictionary to define a different num_shards for each split.

                <Added version="2.8.0"/>
            embed_external_files (`bool`, defaults to `True`): Whether to embed file bytes in the shards. In
                particular, for [`Audio`] and [`Image`] fields this removes local path information and embeds the
                file content in the Parquet files before the push.

        Return:
            huggingface_hub.CommitInfo

        Example:

        ```python
        >>> dataset_dict.push_to_hub("<organization>/<dataset_id>")
        >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True)
        >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", max_shard_size="1GB")
        >>> dataset_dict.push_to_hub("<organization>/<dataset_id>", num_shards={"train": 1024, "test": 8})
        ```

        If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple
        tasks/versions/languages):

        ```python
        >>> english_dataset.push_to_hub("<organization>/<dataset_id>", "en")
        >>> french_dataset.push_to_hub("<organization>/<dataset_id>", "fr")
        >>> # later
        >>> english_dataset = load_dataset("<organization>/<dataset_id>", "en")
        >>> french_dataset = load_dataset("<organization>/<dataset_id>", "fr")
        ```
        """
        # NOTE: the compiled body could only be partially recovered; the code below keeps the
        # recoverable steps, messages, and calls, and marks unreadable portions explicitly.
        self._check_values_type()
        self._check_values_features()

        total_uploaded_size = 0
        total_dataset_nbytes = 0
        info_to_dump: DatasetInfo = next(iter(self.values())).info.copy()
        info_to_dump.config_name = config_name
        info_to_dump.splits = SplitDict()

        for split in self.keys():
            if not re.match(_split_re, split):
                raise ValueError(f"Split name should match '{_split_re}' but got '{split}'.")

        api = HfApi(endpoint=config.HF_ENDPOINT, token=token)
        repo_url = api.create_repo(repo_id, token=token, repo_type="dataset", private=private, exist_ok=True)
        repo_id = repo_url.repo_id

        if revision is not None and not revision.startswith("refs/pr/"):
            # A PR reference is created by the commit itself, so only plain branches are created here.
            api.create_branch(repo_id, branch=revision, token=token, repo_type="dataset", exist_ok=True)

        if not data_dir:
            data_dir = config_name if config_name != "default" else "data"  # for backward compatibility

        additions = []
        for split in self.keys():
            logger.info(f"Pushing split {split} to the Hub.")
            # Each split is written as sharded Parquet files; the helper returns the commit
            # operations for the shards plus the uploaded and uncompressed sizes.
            split_additions, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
                repo_id,
                data_dir=data_dir,
                split=split,
                token=token,
                revision=revision,
                create_pr=create_pr,
                max_shard_size=max_shard_size,
                num_shards=num_shards.get(split) if num_shards else None,
                embed_external_files=embed_external_files,
            )
            additions += split_additions
            total_uploaded_size += uploaded_size
            total_dataset_nbytes += dataset_nbytes
            info_to_dump.splits[split] = SplitInfo(str(split), num_bytes=dataset_nbytes, num_examples=len(self[split]))
        info_to_dump.download_checksums = None
        info_to_dump.download_size = total_uploaded_size
        info_to_dump.dataset_size = total_dataset_nbytes
        info_to_dump.size_in_bytes = total_uploaded_size + total_dataset_nbytes
        # (elided: the remainder of the recoverable bytecode lists the existing repo tree with
        # `api.list_repo_tree`, collects `CommitOperationDelete` operations for stale data files of
        # this configuration, downloads and updates the dataset card (README.md) and the legacy
        # dataset_infos.json through `DatasetInfosDict` and `MetadataConfigs`, and raises
        # "There exists a configuration named 'default'. To set a different configuration as default,
        # rename the 'default' one first." when `set_default` conflicts with an existing 'default'
        # configuration.)
        deletions: list[CommitOperationDelete] = []  # populated by the elided logic above

        commit_message = commit_message if commit_message is not None else "Upload dataset"
        if len(additions) <= config.UPLOADS_MAX_NUMBER_PER_COMMIT:
            commit_info = api.create_commit(
                repo_id,
                operations=additions + deletions,
                commit_message=commit_message,
                commit_description=commit_description,
                token=token,
                repo_type="dataset",
                revision=revision,
                create_pr=create_pr,
            )
        else:
            logger.info(
                f"Number of files to upload is larger than {config.UPLOADS_MAX_NUMBER_PER_COMMIT}. "
                "Splitting the push into multiple commits."
            )
            num_commits = math.ceil(len(additions) / config.UPLOADS_MAX_NUMBER_PER_COMMIT)
            for i in range(0, num_commits):
                operations = additions[
                    i * config.UPLOADS_MAX_NUMBER_PER_COMMIT : (i + 1) * config.UPLOADS_MAX_NUMBER_PER_COMMIT
                ] + (deletions if i == 0 else [])
                commit_info = api.create_commit(
                    repo_id,
                    operations=operations,
                    commit_message=commit_message + f" (part {i:05d}-of-{num_commits:05d})",
                    commit_description=commit_description,
                    token=token,
                    repo_type="dataset",
                    revision=revision,
                    create_pr=create_pr,
                )
                logger.info(
                    f"Commit #{i + 1} completed"
                    + (f" (still {num_commits - i - 1} to go)" if num_commits - i - 1 else "")
                    + "."
                )
        return commit_info
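# Illustrative sketch (not part of the original module): how the multi-commit chunking above
# behaves. With `num_shards={"train": 1024, "test": 8}` there are 1032 Parquet additions plus
# the card/metadata files; assuming the commit limit were 50 operations, the push would be
# split into math.ceil(1032 / 50) == 21 commits, the first one named
# "Upload dataset (part 00000-of-00021)". The actual limit lives in
# `datasets.config.UPLOADS_MAX_NUMBER_PER_COMMIT` and may differ.
#
# >>> import math
# >>> math.ceil(1032 / 50)
# 21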
class IterableDatasetDict(dict):
    def __repr__(self):
        repr = "\n".join([f"{k}: {v}" for k, v in self.items()])
        repr = re.sub(r"^", " " * 4, repr, 0, re.M)
        return f"IterableDatasetDict({{\n{repr}\n}})"

    def with_format(
        self,
        type: Optional[str] = None,
    ) -> "IterableDatasetDict":
        """
        Return a dataset with the specified format.

        Args:
            type (`str`, *optional*): Either output type selected in
                `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`.
                `None` means it returns python objects (default).

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> from transformers import AutoTokenizer
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True)
        >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
        >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
        >>> ds = ds.with_format("torch")
        >>> next(iter(ds))
        {'text': 'compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
         'label': tensor(1),
         'input_ids': tensor([  101, 18027, 16310, 16001,  1103,  9321,   178, 11604,  7235,  6617, ...]),
         'token_type_ids': tensor([0, 0, 0, ...]),
         'attention_mask': tensor([1, 1, 1, ...])}
        ```
        """
        return IterableDatasetDict({k: dataset.with_format(type=type) for k, dataset in self.items()})
    def map(
        self,
        function: Optional[Callable] = None,
        with_indices: bool = False,
        input_columns: Optional[Union[str, list[str]]] = None,
        batched: bool = False,
        batch_size: Optional[int] = 1000,
        drop_last_batch: bool = False,
        remove_columns: Optional[Union[str, list[str]]] = None,
        fn_kwargs: Optional[dict] = None,
    ) -> "IterableDatasetDict":
        """
        Apply a function to all the examples in the iterable dataset (individually or in batches) and update them.
        If your function returns a column that already exists, then it overwrites it.
        The function is applied on-the-fly on the examples when iterating over the dataset.
        The transformation is applied to all the datasets of the dataset dictionary.

        You can specify whether the function should be batched or not with the `batched` parameter:

        - If batched is `False`, then the function takes 1 example in and should return 1 example.
          An example is a dictionary, e.g. `{"text": "Hello there !"}`.
        - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input
          and can return a batch with 1 or more examples.
          A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`.
        - If batched is `True` and `batch_size` is `n` > 1, then the function takes a batch of `n` examples as input
          and can return a batch with `n` examples, or with an arbitrary number of examples.
          Note that the last batch may have less than `n` examples.
          A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.

        If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand
        simultaneous calls. It is recommended to use an `asyncio.Semaphore` in your function if you want to set a
        maximum number of operations that can run at the same time.

        Args:
            function (`Callable`, *optional*, defaults to `None`):
                Function applied on-the-fly on the examples when you iterate on the dataset.
                It must have one of the following signatures:

                - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False`
                - `function(example: Dict[str, Any], idx: int) -> Dict[str, Any]` if `batched=False` and `with_indices=True`
                - `function(batch: Dict[str, list]) -> Dict[str, list]` if `batched=True` and `with_indices=False`
                - `function(batch: Dict[str, list], indices: list[int]) -> Dict[str, list]` if `batched=True` and `with_indices=True`

                For advanced usage, the function can also return a `pyarrow.Table`.
                If the function is asynchronous, then `map` will run your function in parallel.
                Moreover if your function returns nothing (`None`), then `map` will run your function and return the
                dataset unchanged.
                If no function is provided, defaults to the identity function: `lambda x: x`.
            with_indices (`bool`, defaults to `False`):
                Provide example indices to `function`. Note that in this case the signature of `function` should be
                `def function(example, idx[, rank]): ...`.
            input_columns (`Union[str, list[str]]`, *optional*, defaults to `None`):
                The columns to be passed into `function` as positional arguments. If `None`, a dict mapping to all
                formatted columns is passed as one argument.
            batched (`bool`, defaults to `False`):
                Provide batch of examples to `function`.
            batch_size (`int`, *optional*, defaults to `1000`):
                Number of examples per batch provided to `function` if `batched=True`.
            drop_last_batch (`bool`, defaults to `False`):
                Whether a last batch smaller than the `batch_size` should be dropped instead of being processed by
                the function.
            remove_columns (`list[str]`, *optional*, defaults to `None`):
                Remove a selection of columns while doing the mapping. Columns will be removed before updating the
                examples with the output of `function`, i.e. if `function` is adding columns with names in
                `remove_columns`, these columns will be kept.
            fn_kwargs (`Dict`, *optional*, defaults to `None`):
                Keyword arguments to be passed to `function`.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> def add_prefix(example):
        ...     example["text"] = "Review: " + example["text"]
        ...     return example
        >>> ds = ds.map(add_prefix)
        >>> next(iter(ds["train"]))
        {'label': 1,
         'text': 'Review: the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
        ```
        """
        # Apply the `Dataset`-style map arguments to every split; the recoverable bytecode also
        # wraps `function` with the module-level `bind` helper in some code paths.
        dataset_dict = {}
        for split, dataset in self.items():
            dataset_dict[split] = dataset.map(
                function=function,
                with_indices=with_indices,
                input_columns=input_columns,
                batched=batched,
                batch_size=batch_size,
                drop_last_batch=drop_last_batch,
                remove_columns=remove_columns,
                fn_kwargs=fn_kwargs,
            )
        return IterableDatasetDict(dataset_dict)
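    # Illustrative sketch (not part of the original module): a batched `map` whose output batch
    # has a different number of examples than its input, e.g. splitting each text into
    # sentences. The column names are placeholders.
    #
    # >>> def split_into_sentences(batch):
    # ...     sentences = [s for text in batch["text"] for s in text.split(". ")]
    # ...     return {"sentence": sentences}
    # >>> ds = ds.map(split_into_sentences, batched=True, remove_columns=["text", "label"])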
    def filter(
        self,
        function: Optional[Callable] = None,
        with_indices: bool = False,
        input_columns: Optional[Union[str, list[str]]] = None,
        batched: bool = False,
        batch_size: Optional[int] = 1000,
        fn_kwargs: Optional[dict] = None,
    ) -> "IterableDatasetDict":
        """
        Apply a filter function to all the elements so that the dataset only includes examples according to the
        filter function. The filtering is done on-the-fly when iterating over the dataset.
        The filtering is applied to all the datasets of the dataset dictionary.

        Args:
            function (`Callable`):
                Callable with one of the following signatures:

                - `function(example: Dict[str, Any]) -> bool` if `with_indices=False, batched=False`
                - `function(example: Dict[str, Any], indices: int) -> bool` if `with_indices=True, batched=False`
                - `function(example: Dict[str, list]) -> list[bool]` if `with_indices=False, batched=True`
                - `function(example: Dict[str, list], indices: list[int]) -> list[bool]` if `with_indices=True, batched=True`

                If no function is provided, defaults to an always-True function: `lambda x: True`.
            with_indices (`bool`, defaults to `False`):
                Provide example indices to `function`. Note that in this case the signature of `function` should be
                `def function(example, idx): ...`.
            input_columns (`str` or `list[str]`, *optional*):
                The columns to be passed into `function` as positional arguments. If `None`, a dict mapping to all
                formatted columns is passed as one argument.
            batched (`bool`, defaults to `False`):
                Provide batch of examples to `function`.
            batch_size (`int`, *optional*, defaults to `1000`):
                Number of examples per batch provided to `function` if `batched=True`.
            fn_kwargs (`Dict`, *optional*, defaults to `None`):
                Keyword arguments to be passed to `function`.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds = ds.filter(lambda x: x["label"] == 0)
        >>> list(ds["train"].take(3))
        [{'label': 0, 'text': 'simplistic , silly and tedious .'},
         {'label': 0, 'text': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."},
         {'label': 0, 'text': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}]
        ```
        """
        return IterableDatasetDict(
            {
                k: dataset.filter(
                    function=function,
                    with_indices=with_indices,
                    input_columns=input_columns,
                    batched=batched,
                    batch_size=batch_size,
                    fn_kwargs=fn_kwargs,
                )
                for k, dataset in self.items()
            }
        )

    def shuffle(
        self, seed=None, generator: Optional[np.random.Generator] = None, buffer_size: int = 1000
    ) -> "IterableDatasetDict":
        """
        Randomly shuffles the elements of this dataset.
        The shuffling is applied to all the datasets of the dataset dictionary.

        This dataset fills a buffer with `buffer_size` elements, then randomly samples elements from this buffer,
        replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or
        equal to the full size of the dataset is required.

        For instance, if your dataset contains 10,000 elements but `buffer_size` is set to 1000, then `shuffle` will
        initially select a random element from only the first 1000 elements in the buffer. Once an element is
        selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the
        1000-element buffer.

        If the dataset is made of several shards, it also shuffles the order of the shards. However, if the order
        has been fixed by using [`~datasets.IterableDataset.skip`] or [`~datasets.IterableDataset.take`], then the
        order of the shards is kept unchanged.

        Args:
            seed (`int`, *optional*, defaults to `None`):
                Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer
                and also to shuffle the data shards.
            generator (`numpy.random.Generator`, *optional*):
                Numpy random Generator to use to compute the permutation of the dataset rows. If `generator=None`
                (default), uses `np.random.default_rng` (the default BitGenerator (PCG64) of NumPy).
            buffer_size (`int`, defaults to `1000`):
                Size of the buffer.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> list(ds["train"].take(3))
        [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
         {'label': 1, 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy ...'},
         {'label': 1, 'text': 'effective but too-tepid biopic'}]
        >>> ds = ds.shuffle(seed=42)
        >>> list(ds["train"].take(3))
        [{'label': 1, 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."},
         {'label': 1, 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'},
         {'label': 1, 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}]
        ```
        """
        return IterableDatasetDict(
            {
                k: dataset.shuffle(seed=seed, generator=generator, buffer_size=buffer_size)
                for k, dataset in self.items()
            }
        )
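    # How the shuffle buffer described above behaves, as a standalone sketch (illustrative
    # only, not the library's implementation): fill a buffer, then repeatedly yield a random
    # buffered element and replace it with the next incoming one.
    #
    #     import random
    #
    #     def buffered_shuffle(iterable, buffer_size, rng=None):
    #         rng = rng or random.Random()
    #         buffer = []
    #         for example in iterable:
    #             if len(buffer) < buffer_size:
    #                 buffer.append(example)
    #             else:
    #                 i = rng.randrange(buffer_size)
    #                 yield buffer[i]
    #                 buffer[i] = example
    #         rng.shuffle(buffer)
    #         yield from buffer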
    def rename_column(self, original_column_name: str, new_column_name: str) -> "IterableDatasetDict":
        """
        Rename a column in the dataset, and move the features associated to the original column under the new
        column name. The renaming is applied to all the datasets of the dataset dictionary.

        Args:
            original_column_name (`str`):
                Name of the column to rename.
            new_column_name (`str`):
                New name for the column.

        Returns:
            [`IterableDatasetDict`]: A copy of the dataset with a renamed column.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds = ds.rename_column("text", "movie_review")
        >>> next(iter(ds["train"]))
        {'label': 1,
         'movie_review': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
        ```
        """
        return IterableDatasetDict(
            {
                k: dataset.rename_column(original_column_name=original_column_name, new_column_name=new_column_name)
                for k, dataset in self.items()
            }
        )

    def rename_columns(self, column_mapping: dict[str, str]) -> "IterableDatasetDict":
        """
        Rename several columns in the dataset, and move the features associated to the original columns under the
        new column names. The renaming is applied to all the datasets of the dataset dictionary.

        Args:
            column_mapping (`Dict[str, str]`):
                A mapping of columns to rename to their new names.

        Returns:
            [`IterableDatasetDict`]: A copy of the dataset with renamed columns.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds = ds.rename_columns({"text": "movie_review", "label": "rating"})
        >>> next(iter(ds["train"]))
        {'movie_review': 'the rock is destined to be the 21st century\'s new " conan " and ...',
         'rating': 1}
        ```
        """
        return IterableDatasetDict(
            {k: dataset.rename_columns(column_mapping=column_mapping) for k, dataset in self.items()}
        )
    def remove_columns(self, column_names: Union[str, list[str]]) -> "IterableDatasetDict":
        """
        Remove one or several column(s) in the dataset and the features associated to them.
        The removal is done on-the-fly on the examples when iterating over the dataset.
        The removal is applied to all the datasets of the dataset dictionary.

        Args:
            column_names (`Union[str, list[str]]`):
                Name of the column(s) to remove.

        Returns:
            [`IterableDatasetDict`]: A copy of the dataset object without the columns to remove.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds = ds.remove_columns("label")
        >>> next(iter(ds["train"]))
        {'text': 'the rock is destined to be the 21st century\'s new " conan " and ...'}
        ```
        """
        return IterableDatasetDict({k: dataset.remove_columns(column_names) for k, dataset in self.items()})

    def select_columns(self, column_names: Union[str, list[str]]) -> "IterableDatasetDict":
        """
        Select one or several column(s) in the dataset and the features associated to them.
        The selection is done on-the-fly on the examples when iterating over the dataset.
        The selection is applied to all the datasets of the dataset dictionary.

        Args:
            column_names (`Union[str, list[str]]`):
                Name of the column(s) to keep.

        Returns:
            [`IterableDatasetDict`]: A copy of the dataset object with only the selected columns.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds = ds.select_columns("text")
        >>> next(iter(ds["train"]))
        {'text': 'the rock is destined to be the 21st century\'s new " conan " and ...'}
        ```
        """
        return IterableDatasetDict({k: dataset.select_columns(column_names) for k, dataset in self.items()})
    def cast_column(self, column: str, feature: FeatureType) -> "IterableDatasetDict":
        """Cast column to feature for decoding.
        The type casting is applied to all the datasets of the dataset dictionary.

        Args:
            column (`str`):
                Column name.
            feature ([`Feature`]):
                Target feature.

        Returns:
            [`IterableDatasetDict`]

        Example:

        ```py
        >>> from datasets import load_dataset, ClassLabel
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds["train"].features
        {'label': ClassLabel(names=['neg', 'pos'], id=None),
         'text': Value(dtype='string', id=None)}
        >>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
        >>> ds["train"].features
        {'label': ClassLabel(names=['bad', 'good'], id=None),
         'text': Value(dtype='string', id=None)}
        ```
        """
        return IterableDatasetDict(
            {k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()}
        )

    def cast(self, features: Features) -> "IterableDatasetDict":
        """
        Cast the dataset to a new set of features.
        The type casting is applied to all the datasets of the dataset dictionary.

        Args:
            features ([`Features`]):
                New features to cast the dataset to. The name of the fields in the features must match the current
                column names. The type of the data must also be convertible from one type to the other.
                For non-trivial conversion, e.g. `string` <-> `ClassLabel` you should use [`map`] to update the
                dataset.

        Returns:
            [`IterableDatasetDict`]: A copy of the dataset with casted features.

        Example:

        ```py
        >>> from datasets import load_dataset
        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
        >>> ds["train"].features
        {'label': ClassLabel(names=['neg', 'pos'], id=None),
         'text': Value(dtype='string', id=None)}
        >>> new_features = ds["train"].features.copy()
        >>> new_features['label'] = ClassLabel(names=['bad', 'good'])
        >>> new_features['text'] = Value('large_string')
        >>> ds = ds.cast(new_features)
        >>> ds["train"].features
        {'label': ClassLabel(names=['bad', 'good'], id=None),
         'text': Value(dtype='large_string', id=None)}
        ```
        """
        return IterableDatasetDict({k: dataset.cast(features=features) for k, dataset in self.items()})
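# Illustrative end-to-end sketch (not part of the original module): chaining the streaming
# transforms above. The dataset, column names and label names are taken from the docstring
# examples; the chained calls each return a new IterableDatasetDict, so they compose freely.
#
# >>> from datasets import load_dataset, ClassLabel
# >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True)
# >>> ds = (
# ...     ds.rename_column("text", "movie_review")
# ...       .cast_column("label", ClassLabel(names=["bad", "good"]))
# ...       .filter(lambda x: x["label"] == 1)
# ...       .shuffle(seed=42, buffer_size=1000)
# ... )
# >>> next(iter(ds["train"]))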