� ��gw$��ddlZddlmZmZddlmZddlmZmZm Z m Z mZmZddl ZddlmZddlmZddlmZdd lmZmZdd lmZer ddlZddlmZd ddefd�ZeGd�d��ZdS)�N)� dataclass�field)�BytesIO)� TYPE_CHECKING�Any�ClassVar�Dict�Optional�Union�)�config)�DownloadConfig)� array_cast)� is_local_path�xopen)�string_to_dict�)�FeatureType�pdf�pdfplumber.pdf.PDF�returnc��t��5}|jD]!}|�|jj��"|��cddd��S#1swxYwYdS)z-Convert a pdfplumber.pdf.PDF object to bytes.N)r�pages�writer�stream�getvalue)r�buffer�pages �e/home/asafur/pinokio/api/open-webui.git/app/env/lib/python3.11/site-packages/datasets/features/pdf.py�pdf_to_bytesr s�� !�f��I� *� *�D��L�L��)�)�)�)�� !�!�!�!�!�!�!�!�!�!�!�!��!�!�!�!�!�!s�=A�A� Ac��eZdZUdZdZeed<dZee ed<dZ ee ed<ej ej��ej��d��Zeeed <edd d ��Ze ed<d �Zdee eedfdefd�Zdddefd�Zddeddfd�Zdedee dfffd�Zdeejejejfdejfd�Z dS)�Pdfa� **Experimental.** Pdf [`Feature`] to read pdf documents from a pdf file. Input: The Pdf feature accepts as input: - A `str`: Absolute path to the pdf file (i.e. random access is allowed). - A `dict` with the keys: - `path`: String with relative path of the pdf file in a dataset repository. - `bytes`: Bytes of the pdf file. This is useful for archived files with sequential access. - A `pdfplumber.pdf.PDF`: pdfplumber pdf object. Args: mode (`str`, *optional*): The mode to convert the pdf to. If `None`, the native mode of the pdf is used. decode (`bool`, defaults to `True`): Whether to decode the pdf data. If `False`, returns the underlying dictionary in the format `{"path": pdf_path, "bytes": pdf_bytes}`. Examples: ```py >>> from datasets import Dataset, Pdf >>> ds = Dataset.from_dict({"pdf": ["path/to/pdf/file.pdf"]}).cast_column("pdf", Pdf()) >>> ds.features["pdf"] Pdf(decode=True, id=None) >>> ds[0]["pdf"] <pdfplumber.pdf.PDF object at 0x7f8a1c2d8f40> >>> ds = ds.cast_column("pdf", Pdf(decode=False)) >>> ds[0]["pdf"] {'bytes': None, 'path': 'path/to/pdf/file.pdf'} ``` T�decodeN�idr�dtype��bytes�path�pa_typeF)�default�init�repr�_typec��|jS�N)r))�selfs r�__call__zPdf.__call__Ks ��|��valuerc�f�tjrddl}nd}t|t��r|dd�St|t ��rd|d�S|�/t||jj��r|�|��S|� d��=tj�|d��rd|� d��d�S|� d��|� d��+|� d��|� d��d�Std|�d��) z�Encode example into a format for Arrow. Args: value (`str`, `bytes`, `pdfplumber.pdf.PDF` or `dict`): Data passed as input to Pdf feature. Returns: `dict` with "path" and "bytes" fields rN�r(r'r(r&r'zRA pdf sample should have one of 'path' or 'bytes' but they are missing or None in �.)r �PDFPLUMBER_AVAILABLE� pdfplumber� isinstance�strr'r�PDF�encode_pdfplumber_pdf�get�osr(�isfile� ValueError)r0r3r8s r�encode_examplezPdf.encode_exampleNs>��&� ��J��e�S�!�!� �!�D�1�1�1� ��u� %� %� � �5�1�1�1� � #� �5�*�.�:L�(M�(M� #��-�-�e�4�4�4� �Y�Y�v� � � *�r�w�~�~�e�F�m�/L�/L� *�!�5�9�9�V�+<�+<�=�=�=� �Y�Y�w� � � +�u�y�y��/@�/@�/L�"�Y�Y�w�/�/��6�9J�9J�K�K�K��m�ej�m�m�m�� r2rc��t|d��r0t|jd��r|jjr|jjdd�Sdt|��d�S)aa Encode a pdfplumber.pdf.PDF object into a dictionary. If the PDF has an associated file path, returns the path. Otherwise, serializes the PDF content into bytes. Args: pdf (pdfplumber.pdf.PDF): A pdfplumber PDF object. Returns: dict: A dictionary with "path" or "bytes" field. r�nameNr5)�hasattrrrCr )rs rr<zPdf.encode_pdfplumber_pdfos^��3��!�!� >�g�c�j�&�&A�&A� >�c�j�o� >��J�O�d�;�;�;�!�<��+<�+<�=�=�=r2c��|jstd��tjrddl}ntd��|�i}|d|d}}|��|�t d|�d��t|��r|j|��}n�|� d ��d }|� tj��rtjntj } t||��d} |�| ��} n#t$rd} YnwxYwt!| ��}t#|d |��}|j|��S|jt%|��5} | }ddd��n#1swxYwY|S)aiDecode example pdf file into pdf data. Args: value (`str` or `dict`): A string with the absolute pdf file path, a dictionary with keys: - `path`: String with absolute or relative pdf file path. - `bytes`: The bytes of the pdf file. token_per_repo_id (`dict`, *optional*): To access and decode pdf files from private repositories on the Hub, you can pass a dictionary repo_id (`str`) -> token (`bool` or `str`). Returns: `pdfplumber.pdf.PDF` zKDecoding is disabled for this feature. Please use Pdf(decode=True) instead.rNz6To support decoding pdfs, please install 'pdfplumber'.r(r'z@A pdf should have one of 'path' or 'bytes' but both are None in r6z::��repo_id)�token�rb)�download_config)r#�RuntimeErrorr r7r8�ImportErrorr@r�open�split� startswith�HF_ENDPOINT�HUB_DATASETS_URL�HUB_DATASETS_HFFS_URLrr=rrr)r0r3�token_per_repo_idr8r(�bytes_r� source_url�patternrGrHrJ�f�ps r�decode_examplezPdf.decode_example�s��&�{� n��l�m�m�m��&� X��V�W�W�W��$� "��V�}�e�G�n�f��>��|� �!l�di�!l�!l�!l�m�m�m� ��&�&�.�)�*�/�$�/�/�C�C�!%��D�!1�!1�"�!5�J�&�0�0��1C�D�D�:��/�/�#�9�� %�"0��W�"E�"E�i�"P�� 1� 5� 5�g� >� >��%�%�%�%� $��%��&4�5�&A�&A�&A�O��d�D�/�J�J�J�A�*�:�?�1�-�-�-� ��1�1� �Q�� s$�+D�D�D�"E1�1E5�8E5rc�N�ddlm}|jr|n|d��|d��d�S)zfIf in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.r)�Value�binary�stringr&)�featuresr[r#)r0r[s r�flattenzPdf.flatten�sK��#�#�#�#�#�#��{� �D�D��x��h�� r2�storagec��tj�|j��rrtjdgt|��ztj��}tj�||gddg|� ��}�n�tj� |j��rrtjdgt|��ztj��}tj�||gddg|� ��}�n5tj�|j��r|j� d��dkr|�d��}n8tjdgt|��ztj��}|j� d��dkr|�d��}n8tjdgt|��ztj��}tj�||gddg|� ��}t||j��S)aCast an Arrow array to the Pdf arrow storage type. The Arrow types that can be converted to the Pdf pyarrow storage type are: - `pa.string()` - it must contain the "path" data - `pa.binary()` - it must contain the image bytes - `pa.struct({"bytes": pa.binary()})` - `pa.struct({"path": pa.string()})` - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter - `pa.list(*)` - it must contain the pdf array data Args: storage (`Union[pa.StringArray, pa.StructArray, pa.ListArray]`): PyArrow array to cast. Returns: `pa.StructArray`: Array in the Pdf arrow storage type, that is `pa.struct({"bytes": pa.binary(), "path": pa.string()})`. N)�typer'r()�maskr)�pa�types� is_stringrb�array�lenr\�StructArray�from_arrays�is_null� is_binaryr]� is_struct�get_field_indexrrr))r0r`�bytes_array� path_arrays r�cast_storagezPdf.cast_storage�s��&�8��g�l�+�+� w��(�D�6�C��L�L�#8�r�y�{�{�K�K�K�K��n�0�0�+�w�1G�'�SY�IZ�ah�ap�ap�ar�ar�0�s�s�G�G� �X� � �� -� -� w��4�&�3�w�<�<�"7�b�i�k�k�J�J�J�J��n�0�0�'�:�1F��RX�HY�`g�`o�`o�`q�`q�0�r�r�G�G� �X� � �� -� -� w��|�+�+�G�4�4��9�9�%�m�m�G�4�4�� h��v��G��'<�2�9�;�;�O�O�O��|�+�+�F�3�3�q�8�8�$�]�]�6�2�2� � ��X�t�f�s�7�|�|�&;�"�)�+�+�N�N�N� ��n�0�0�+�z�1J�W�V\�L]�dk�ds�ds�du�du�0�v�v�G��'�4�<�0�0�0r2r/)!�__name__� __module__�__qualname__�__doc__r#�bool�__annotations__r$r r:r%rrd�structr\r]r)rrr-r1rr'�dictrAr<rYr r_�StringArrayri� ListArrayrq�r2rr"r"s��"�"�H�F�D��B�� 0�E�8�C�=�/�/�/�&�R�Y��i�b�i�k�k�'R�'R�S�S�G�X�c�]�S�S�S��u�5�u�=�=�=�E�3�=�=�=��E�#�u�d�<P�*P�$Q��VZ��B>�#7�>�D�>�>�>�>�(8�8�D�8�EY�8�8�8�8�t ��}�d�3� �3E�.F�F�G� � � � �#1�E�"�.�"�.�"�,�*V�$W�#1�\^�\j�#1�#1�#1�#1�#1�#1r2r") r>�dataclassesrr�ior�typingrrrr r r�pyarrowrd�r �download.download_configr�tabler�utils.file_utilsrr�utils.py_utilsrr8r^rr'r r"r|r2r�<module>r�sg�� (�(�(�(�(�(�(�(��F�F�F�F�F�F�F�F�F�F�F�F�F�F�F�F��5�5�5�5�5�5��3�3�3�3�3�3�3�3�+�+�+�+�+�+��&��%�%�%�%�%�%�!�*�!�u�!�!�!�!��O1�O1�O1�O1�O1�O1�O1��O1�O1�O1r2