"""To write records into Parquet files."""

import json
import sys
from collections.abc import Iterable
from typing import Any, Optional, Union

import fsspec
import numpy as np
import pyarrow as pa
from fsspec.core import url_to_fs

from . import config
from .features import Audio, Features, Image, Value, Video
from .features.features import (
    FeatureType,
    _ArrayXDExtensionType,
    _visit,
    cast_to_python_objects,
    generate_from_arrow_type,
    get_nested_type,
    list_of_np_array_to_pyarrow_listarray,
    numpy_to_pyarrow_listarray,
    to_pyarrow_listarray,
)
from .filesystems import is_remote_filesystem
from .info import DatasetInfo
from .keyhash import DuplicatedKeysError, KeyHasher
from .table import array_cast, cast_array_to_feature, embed_table_storage, table_cast
from .utils import logging
from .utils.py_utils import asdict, first_non_null_value


logger = logging.get_logger(__name__)

type_ = type  # keep python's type function


def get_writer_batch_size(features: Optional[Features]) -> Optional[int]:
    """
    Get the writer_batch_size that defines the maximum row group size in the parquet files.
    The default in `datasets` is 1,000 but we lower it to 100 for image/audio datasets and 10 for videos.
    This allows optimizing random access to a parquet file, since accessing one row requires
    reading its entire row group.

    This can be improved to pick an optimized size for querying/iterating,
    but at least it matches the dataset viewer expectations on HF.

    Args:
        features (`datasets.Features` or `None`):
            Dataset Features from `datasets`.
    Returns:
        writer_batch_size (`Optional[int]`):
            Writer batch size to pass to a dataset builder.
            If `None`, then it will use the `datasets` default.
    """
    if not features:
        return None

    batch_size = np.inf

    def set_batch_size(feature: FeatureType) -> None:
        nonlocal batch_size
        if isinstance(feature, Image):
            batch_size = min(batch_size, config.PARQUET_ROW_GROUP_SIZE_FOR_IMAGE_DATASETS)
        elif isinstance(feature, Audio):
            batch_size = min(batch_size, config.PARQUET_ROW_GROUP_SIZE_FOR_AUDIO_DATASETS)
        elif isinstance(feature, Video):
            batch_size = min(batch_size, config.PARQUET_ROW_GROUP_SIZE_FOR_VIDEO_DATASETS)
        elif isinstance(feature, Value) and feature.dtype == "binary":
            batch_size = min(batch_size, config.PARQUET_ROW_GROUP_SIZE_FOR_BINARY_DATASETS)

    _visit(features, set_batch_size)

    return None if batch_size is np.inf else batch_size
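

# Example usage (illustrative sketch, not part of the upstream module; it assumes the
# public `datasets` API exposing `Features`, `Image`, and `Value`). As soon as one column
# holds large blobs (image/audio/video/binary), the function returns the smaller row group
# size configured for that media type; for plain tabular features it returns `None`, which
# means "use the `datasets` default of 1,000 rows per row group":
#
#     from datasets import Features, Image, Value
#
#     image_features = Features({"image": Image(), "label": Value("int64")})
#     assert get_writer_batch_size(image_features) == config.PARQUET_ROW_GROUP_SIZE_FOR_IMAGE_DATASETS
#
#     text_features = Features({"text": Value("string")})
#     assert get_writer_batch_size(text_features) is None  # fall back to the `datasets` default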