Re-implement select_object_content implementation #793

Merged
merged 1 commit into from
Sep 10, 2019

harshavardhana
Member

@harshavardhana harshavardhana commented Sep 4, 2019

This change fixes multiple issues:

  • handles Unicode boundaries properly for special delimiters
  • handles zero-payload 'Cont' event messages
  • handles error messages properly
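
For context on the zero-payload 'Cont' case, here is a minimal sketch of the message framing. It assumes the AWS event-stream layout that S3 Select responses use (a 12-byte prelude with its own CRC32, then headers, payload, and a trailing message CRC32). The helper names `encode_header`, `build_message`, and `parse_message` are hypothetical and are not the code in minio/select/reader.py.

```python
import struct
import zlib


def encode_header(name, value):
    # Assumed header layout: 1-byte name length, name, 1-byte value type
    # (7 = string), 2-byte value length, value.
    n = name.encode('utf-8')
    v = value.encode('utf-8')
    return struct.pack('>B', len(n)) + n + struct.pack('>BH', 7, len(v)) + v


def build_message(headers, payload=b''):
    # Hypothetical helper used only to produce test input for parse_message.
    header_bytes = b''.join(encode_header(k, v) for k, v in headers)
    total_length = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack('>II', total_length, len(header_bytes))
    prelude_crc = struct.pack('>I', zlib.crc32(prelude) & 0xffffffff)
    body = prelude + prelude_crc + header_bytes + payload
    return body + struct.pack('>I', zlib.crc32(body) & 0xffffffff)


def parse_message(message):
    # Prelude: total length (4 bytes), headers length (4 bytes), prelude CRC.
    total_length, header_length = struct.unpack('>II', message[:8])
    (prelude_crc,) = struct.unpack('>I', message[8:12])
    if prelude_crc != zlib.crc32(message[:8]) & 0xffffffff:
        raise ValueError('prelude CRC mismatch')
    # The last 4 bytes hold a CRC32 over everything before them.
    (message_crc,) = struct.unpack('>I', message[total_length - 4:total_length])
    if message_crc != zlib.crc32(message[:total_length - 4]) & 0xffffffff:
        raise ValueError('message CRC mismatch')
    # 16 = 12-byte prelude + 4-byte message CRC. The result may be 0,
    # which is the 'Cont' case: there is simply nothing to read.
    payload_length = total_length - header_length - 16
    start = 12 + header_length
    return header_length, message[start:start + payload_length]


cont = build_message([(':message-type', 'event'), (':event-type', 'Cont')])
header_length, payload = parse_message(cont)
```

A reader that treats an empty payload as end-of-stream or as an error would stop early on such a message, so the zero-length case needs to be handled as a normal, skippable event.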

@harshavardhana
Member Author

PR is updated with further changes @Praveenrajmani @sinhaashish PTAL

@sinhaashish
Contributor

sinhaashish commented Sep 8, 2019

This breaks when DELIMITER_CH = '╦' is used as the field delimiter:

data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)

  output_serialization=OutputSerialization(
        csv=CSVOutput(QuoteFields="ASNEEDED",
                      RecordDelimiter="\n",
                      FieldDelimiter=DELIMITER_CH,
                      QuoteCharacter='"',
                      QuoteEscapeCharacter='"',)
Traceback (most recent call last):
  File "examples/select_object_content.py", line 70, in <module>
    data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)
  File "build/bdist.linux-x86_64/egg/minio/api.py", line 255, in select_object_content
  File "build/bdist.linux-x86_64/egg/minio/xml_marshal.py", line 114, in xml_marshal_select
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

@harshavardhana
Member Author

harshavardhana commented Sep 8, 2019

In Python 2, Unicode characters must be passed as unicode literals (u'character'); a plain '╦' is a byte string there, which is why the ascii codec fails on byte 0xe2. Python 3 supports them natively.
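
To illustrate, here is a small Python 3 sketch using only the standard library. It shows that the UTF-8 encoding of '╦' (U+2566) starts with the byte 0xe2 reported in the traceback, and that ElementTree serializes the delimiter without error when it is a text string. The element names are placeholders for illustration, not necessarily the XML that xml_marshal_select produces.

```python
import io
import xml.etree.ElementTree as ET

# u'\u2566' is the same character as '╦'; the u prefix is what Python 2
# needs and is accepted (as a no-op) by Python 3.
DELIMITER_CH = u'\u2566'

# UTF-8 encoding of the delimiter: three bytes, the first being 0xe2.
encoded = DELIMITER_CH.encode('utf-8')

# Placeholder element names, chosen only for this illustration.
root = ET.Element('CSV')
ET.SubElement(root, 'FieldDelimiter').text = DELIMITER_CH

buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding='utf-8')
xml_bytes = buf.getvalue()
```

In Python 2 the unprefixed literal would already be the three encoded bytes, and calling encode on it triggers an implicit ascii decode first, which is the UnicodeDecodeError shown above.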

@harshavardhana
Member Author

from minio import Minio
from minio.error import ResponseError

from minio.select.options import (SelectObjectOptions, CSVInput,
                                  JSONInput, RequestProgress,
                                  ParquetInput, InputSerialization,
                                  OutputSerialization, CSVOutput,
                                  JsonOutput)
from minio.select.errors import (SelectCRCValidationError, SelectMessageError)

client = Minio('s3.amazonaws.com',
               access_key='ACCESSKEY',
               secret_key='SECRETKEY')

options = SelectObjectOptions(
    expression="select * from s3object",
    input_serialization=InputSerialization(
        compression_type="GZIP",
        csv=CSVInput(FileHeaderInfo="USE",
                     RecordDelimiter="\n",
                     FieldDelimiter=u'╦',
                     QuoteCharacter='"',
                     QuoteEscapeCharacter='"',
                     Comments="#",
                     AllowQuotedRecordDelimiter="FALSE",
                     ),
        # If input is JSON
        # json=JSONInput(Type="DOCUMENT",)
        ),

    output_serialization=OutputSerialization(
        csv=CSVOutput(QuoteFields="ASNEEDED",
                      RecordDelimiter="\n",
                      FieldDelimiter=u'╦',
                      QuoteCharacter='"',
                      QuoteEscapeCharacter='"',)

        # json = JsonOutput(
        #     RecordDelimiter="\n",
        #     )
        ),
    request_progress=RequestProgress(
        enabled="False"
        )
    )

try:
    data = client.select_object_content('wlk-data-wbrp', '20190612-00690-1/wlk-wbrp-part-0000.csv.gz', options)

    # Get the records
    with open('my-record-file', 'w') as record_data:
        for d in data.stream(10*1024):
            record_data.write(d)

    # Get the stats
    print(data.stats())

except SelectMessageError as err:
    print(err)

except SelectCRCValidationError as err:
    print(err)

except ResponseError as err:
    print(err)

sinhaashish
sinhaashish previously approved these changes Sep 9, 2019
Contributor

@sinhaashish sinhaashish left a comment

Tested with different inputs and LGTM.
Just one fix: SelectSelectCRCValidationError -> SelectCRCValidationError in examples/select_object_content.py

Praveenrajmani
Praveenrajmani previously approved these changes Sep 9, 2019
Collaborator

@Praveenrajmani Praveenrajmani left a comment

LGTM

This change fixes multiple issues

- handles unicode boundaries properly for special delimiters
- handle zero payload 'Cont' event messages
- handle error messages properly
@nitisht nitisht merged commit 0625257 into minio:master Sep 10, 2019
@harshavardhana harshavardhana deleted the fix-obj branch September 10, 2019 15:28