� ���gw$����ddlZddlmZmZddlmZddlmZmZm Z m Z m Z m Z ddl ZddlmZddlmZddlmZdd lmZmZdd lmZer ddlZd d lmZd ddefd�ZeGd�d����ZdS)�N)� dataclass�field)�BytesIO)� TYPE_CHECKING�Any�ClassVar�Dict�Optional�Union�)�config)�DownloadConfig)� array_cast)� is_local_path�xopen)�string_to_dict�)� FeatureType�pdf�pdfplumber.pdf.PDF�returnc���t��5}|jD]!}|�|jj���"|���cddd��S#1swxYwYdS)z-Convert a pdfplumber.pdf.PDF object to bytes.N)r�pages�writer�stream�getvalue)r�buffer�pages �e/home/asafur/pinokio/api/open-webui.git/app/env/lib/python3.11/site-packages/datasets/features/pdf.py� pdf_to_bytesr s��� ���!�f��I� *� *�D� �L�L���� )� )� )� )���� � �!�!�!�!�!�!�!�!�!�!�!�!����!�!�!�!�!�!s�=A�A� Ac��eZdZUdZdZeed<dZee ed<dZ e e ed<e j e j��e j��d���Ze eed <edd d � ��Ze ed <d �Zdee eedfdefd�Zdddefd�Zddeddfd�Zdedee dfffd�Zdee je je jfde jfd�Z dS)�Pdfa� **Experimental.** Pdf [`Feature`] to read pdf documents from a pdf file. Input: The Pdf feature accepts as input: - A `str`: Absolute path to the pdf file (i.e. random access is allowed). - A `dict` with the keys: - `path`: String with relative path of the pdf file in a dataset repository. - `bytes`: Bytes of the pdf file. This is useful for archived files with sequential access. - A `pdfplumber.pdf.PDF`: pdfplumber pdf object. Args: mode (`str`, *optional*): The mode to convert the pdf to. If `None`, the native mode of the pdf is used. decode (`bool`, defaults to `True`): Whether to decode the pdf data. If `False`, returns the underlying dictionary in the format `{"path": pdf_path, "bytes": pdf_bytes}`. Examples: ```py >>> from datasets import Dataset, Pdf >>> ds = Dataset.from_dict({"pdf": ["path/to/pdf/file.pdf"]}).cast_column("pdf", Pdf()) >>> ds.features["pdf"] Pdf(decode=True, id=None) >>> ds[0]["pdf"] <pdfplumber.pdf.PDF object at 0x7f8a1c2d8f40> >>> ds = ds.cast_column("pdf", Pdf(decode=False)) >>> ds[0]["pdf"] {'bytes': None, 'path': 'path/to/pdf/file.pdf'} ``` T�decodeN�idr�dtype��bytes�path�pa_typeF)�default�init�repr�_typec��|jS�N)r))�selfs r�__call__z Pdf.__call__Ks ���|���valuerc�f�tjrddl}nd}t|t��r|dd�St|t ��rd|d�S|�/t||jj��r|�|��S|� d���=tj � |d��rd|� d��d�S|� d���|� d���+|� d��|� d��d�Std|�d����) z�Encode example into a format for Arrow. Args: value (`str`, `bytes`, `pdfplumber.pdf.PDF` or `dict`): Data passed as input to Pdf feature. Returns: `dict` with "path" and "bytes" fields rN�r(r'r(r&r'zRA pdf sample should have one of 'path' or 'bytes' but they are missing or None in �.)r �PDFPLUMBER_AVAILABLE� pdfplumber� isinstance�strr'r�PDF�encode_pdfplumber_pdf�get�osr(�isfile� ValueError)r0r3r8s r�encode_examplezPdf.encode_exampleNs>�� � &� � � � � � ��J� �e�S� !� !� �!�D�1�1� 1� ��u� %� %� � �5�1�1� 1� � #� �5�*�.�:L�(M�(M� #��-�-�e�4�4� 4� �Y�Y�v� � � *�r�w�~�~�e�F�m�/L�/L� *�!�5�9�9�V�+<�+<�=�=� =� �Y�Y�w� � � +�u�y�y��/@�/@�/L�"�Y�Y�w�/�/����6�9J�9J�K�K� K��m�ej�m�m�m��� r2rc��t|d��r0t|jd��r|jjr|jjdd�Sdt|��d�S)aa Encode a pdfplumber.pdf.PDF object into a dictionary. If the PDF has an associated file path, returns the path. Otherwise, serializes the PDF content into bytes. Args: pdf (pdfplumber.pdf.PDF): A pdfplumber PDF object. Returns: dict: A dictionary with "path" or "bytes" field. r�nameNr5)�hasattrrrCr )rs rr<zPdf.encode_pdfplumber_pdfos^�� �3�� !� !� >�g�c�j�&�&A�&A� >�c�j�o� >��J�O�d�;�;� ;�!�<��+<�+<�=�=� =r2c���|jstd���tjrddl}nt d���|�i}|d|d}}|��|�t d|�d����t|��r|j|��}n�|� d ��d }|� tj ��r tj n tj } t||��d } |�| ��} n#t $rd} YnwxYwt!| � ��} t#|d | ���} |j| ��S|jt%|����5} | }ddd��n #1swxYwY|S)aiDecode example pdf file into pdf data. Args: value (`str` or `dict`): A string with the absolute pdf file path, a dictionary with keys: - `path`: String with absolute or relative pdf file path. - `bytes`: The bytes of the pdf file. token_per_repo_id (`dict`, *optional*): To access and decode pdf files from private repositories on the Hub, you can pass a dictionary repo_id (`str`) -> token (`bool` or `str`). Returns: `pdfplumber.pdf.PDF` zKDecoding is disabled for this feature. Please use Pdf(decode=True) instead.rNz6To support decoding pdfs, please install 'pdfplumber'.r(r'z@A pdf should have one of 'path' or 'bytes' but both are None in r6z::������repo_id)�token�rb)�download_config)r#� RuntimeErrorr r7r8� ImportErrorr@r�open�split� startswith� HF_ENDPOINT�HUB_DATASETS_URL�HUB_DATASETS_HFFS_URLrr=rrr)r0r3�token_per_repo_idr8r(�bytes_r� source_url�patternrGrHrJ�f�ps r�decode_examplezPdf.decode_example�s���&�{� n��l�m�m� m� � &� X� � � � � ��V�W�W� W� � $� "� ��V�}�e�G�n�f�� �>��|� �!l�di�!l�!l�!l�m�m�m� ��&�&�.�)�*�/�$�/�/�C�C�!%���D�!1�!1�"�!5�J�&�0�0��1C�D�D�:��/�/�#�9�� %�"0��W�"E�"E�i�"P�� 1� 5� 5�g� >� >����%�%�%�%� $����%����&4�5�&A�&A�&A�O��d�D�/�J�J�J�A�*�:�?�1�-�-�-� �������1�1� �Q��� � � � � � � � � � � ���� � � � �� s$�+D� D�D�"E1�1E5�8E5rc�N�ddlm}|jr|n|d��|d��d�S)zfIf in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.r)�Value�binary�stringr&)�featuresr[r#)r0r[s r�flattenz Pdf.flatten�sK��#�#�#�#�#�#��{� �D�D���x�����h����� r2�storagec���tj�|j��rrtjdgt |��ztj�����}tj�||gddg|� �����}�n�tj� |j��rrtjdgt |��ztj �����}tj�||gddg|� �����}�n5tj� |j���r|j� d��dkr|�d��}n8tjdgt |��ztj�����}|j� d��dkr|�d��}n8tjdgt |��ztj �����}tj�||gddg|� �����}t||j��S)aCast an Arrow array to the Pdf arrow storage type. The Arrow types that can be converted to the Pdf pyarrow storage type are: - `pa.string()` - it must contain the "path" data - `pa.binary()` - it must contain the image bytes - `pa.struct({"bytes": pa.binary()})` - `pa.struct({"path": pa.string()})` - `pa.struct({"bytes": pa.binary(), "path": pa.string()})` - order doesn't matter - `pa.list(*)` - it must contain the pdf array data Args: storage (`Union[pa.StringArray, pa.StructArray, pa.ListArray]`): PyArrow array to cast. Returns: `pa.StructArray`: Array in the Pdf arrow storage type, that is `pa.struct({"bytes": pa.binary(), "path": pa.string()})`. N)�typer'r()�maskr)�pa�types� is_stringrb�array�lenr\� StructArray� from_arrays�is_null� is_binaryr]� is_struct�get_field_indexrrr))r0r`� bytes_array� path_arrays r� cast_storagezPdf.cast_storage�s���& �8� � �g�l� +� +� w��(�D�6�C��L�L�#8�r�y�{�{�K�K�K�K��n�0�0�+�w�1G�'�SY�IZ�ah�ap�ap�ar�ar�0�s�s�G�G� �X� � �� � -� -� w���4�&�3�w�<�<�"7�b�i�k�k�J�J�J�J��n�0�0�'�:�1F��RX�HY�`g�`o�`o�`q�`q�0�r�r�G�G� �X� � �� � -� -� w��|�+�+�G�4�4��9�9�%�m�m�G�4�4� � � �h��v��G� � �'<�2�9�;�;�O�O�O� ��|�+�+�F�3�3�q�8�8�$�]�]�6�2�2� � ��X�t�f�s�7�|�|�&;�"�)�+�+�N�N�N� ��n�0�0�+�z�1J�W�V\�L]�dk�ds�ds�du�du�0�v�v�G��'�4�<�0�0�0r2r/)!�__name__� __module__� __qualname__�__doc__r#�bool�__annotations__r$r r:r%rrd�structr\r]r)rrr-r1r r'�dictrAr<rYr r_� StringArrayri� ListArrayrq�r2rr"r"s��������"�"�H�F�D�����B��� ����0�E�8�C�=�/�/�/�&�R�Y������i�b�i�k�k�'R�'R�S�S�G�X�c�]�S�S�S���u�5�u�=�=�=�E�3�=�=�=�����E�#�u�d�<P�*P�$Q��VZ�����B>�#7�>�D�>�>�>�>�(8�8�D�8�EY�8�8�8�8�t  ��}�d�3� �3E�.F�F�G�  �  �  �  �#1�E�"�.�"�.�"�,�*V�$W�#1�\^�\j�#1�#1�#1�#1�#1�#1r2r") r>� dataclassesrr�ior�typingrrrr r r �pyarrowrd�r �download.download_configr�tabler�utils.file_utilsrr�utils.py_utilsrr8r^rr'r r"r|r2r�<module>r�sg�� � � � �(�(�(�(�(�(�(�(�������F�F�F�F�F�F�F�F�F�F�F�F�F�F�F�F�����������5�5�5�5�5�5�������3�3�3�3�3�3�3�3�+�+�+�+�+�+��&�����%�%�%�%�%�%�!�*�!�u�!�!�!�!� �O1�O1�O1�O1�O1�O1�O1� ��O1�O1�O1r2
Memory