
bigquery: please support non-legacy-sql nested data type, and preserve order in json #2342

Closed
c0b opened this issue Sep 17, 2016 · 14 comments

Labels: api: bigquery (Issues related to the BigQuery API)

c0b commented Sep 17, 2016

Reported in googleapis/google-cloud-node#1593 (comment) from a Node.js app, but since my app can also be written in Python, I tested the Python BigQuery API as well; it seems to have the same problem:

$ bq.py --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d' |& cat -An
...
     2  [$
     3    {$
     4      "d": [$
     5        {$
     6          "a": "1", $
     7          "b": "2", $
     8          "id": "zx83"$
     9        }, $
    10        {$
    11          "a": "4", $
    12          "b": "7", $
    13          "id": "f8f7"$
    14        }$
    15      ]$
    16    }$
    17  ]$

The problems:

  1. Nested value types are not interpreted: the value types of a and b can be inferred as integers (the BigQuery API response includes a schema giving each nested field's data type), so they should be converted to Python integers. Fixing this requires interpreting the data types recursively.
  2. Key order is not preserved. Someone might argue the following, but there are workarounds:
    1. A JSON object's key order doesn't matter to a machine; but prettyjson output is for humans, and to humans key order does matter.
    2. Python's default json.loads converts a JSON object to a dict, which orders keys by their internal hashes in a somewhat unpredictable way; the fix is to load with object_pairs_hook=OrderedDict and to drop the sort_keys argument when calling json.dumps.
  3. Trailing spaces on object key/value lines. This is trivial and is a quirk of Python's own json module defaults; many careless Python projects do this, including this one. See the expected output below; it can be fixed with json.dumps(..., separators=(',', ': ')).

Many tools written in Python suffer from the same problem, like the popular httpie tool, as I commented in httpie/cli#427 (comment).
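For problem 1, a schema-driven recursive converter would be enough. A minimal sketch, assuming plain-dict schema and cell shapes modeled on the BigQuery REST response (not this library's internal types); the `convert_cell` helper and its field dicts are hypothetical:

```python
from collections import OrderedDict

def convert_cell(value, field):
    """Recursively convert one cell value using its schema field.

    Assumed shapes (modeled on the BigQuery REST response):
    - a REPEATED cell value is a list of {"v": item} wrappers
    - a RECORD cell value is {"f": [cell, ...]} matching field["fields"]
    - INTEGER values arrive as strings and must be cast
    """
    if value is None:
        return None
    if field["mode"] == "REPEATED":
        # Convert each wrapped item as a single (non-repeated) value.
        item_field = dict(field, mode="NULLABLE")
        return [convert_cell(item["v"], item_field) for item in value]
    if field["type"] == "RECORD":
        # Pair subfields with sub-cells, preserving the schema's key order.
        return OrderedDict(
            (sub["name"], convert_cell(cell["v"], sub))
            for sub, cell in zip(field["fields"], value["f"])
        )
    if field["type"] == "INTEGER":
        return int(value)
    return value
```

Applied to the d column of the query above, this would yield native integers with key order intact, e.g. [OrderedDict([('id', 'zx83'), ('a', 1), ('b', 2)]), ...].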

How to fix problems 2 & 3

In [1]: import json

In [2]: from collections import OrderedDict

In [3]: data = json.loads(
             '{"d": [ { "id": "zx83", "a": 1, "b": 2}, { "id": "f8f7", "a": 4, "b": 7 }  ]}',
               object_pairs_hook=OrderedDict)

In [4]: data
Out[4]: 
OrderedDict([(u'd',
              [OrderedDict([(u'id', u'zx83'), (u'a', 1), (u'b', 2)]),
               OrderedDict([(u'id', u'f8f7'), (u'a', 4), (u'b', 7)])])])

In [5]: json.dumps(data, indent=2, separators=(',', ': '))
Out[5]: '{\n  "d": [\n    {\n      "id": "zx83",\n      "a": 1,\n      "b": 2\n    },\n    {\n      "id": "f8f7",\n      "a": 4,\n      "b": 7\n    }\n  ]\n}'

In [6]: print json.dumps(data, indent=2, separators=(',', ': '))
{
  "d": [
    {
      "id": "zx83",
      "a": 1,
      "b": 2
    },
    {
      "id": "f8f7",
      "a": 4,
      "b": 7
    }
  ]
}

Expected output:

     2  [$
     3    {$
     4      "d": [$
     5        {$
     6          "id": "zx83",$
     7          "a": 1,$
     8          "b": 2$
     9        },$
    10        {$
    11          "id": "f8f7",$
    12          "a": 4,$
    13          "b": 7$
    14        }$
    15      ]$
    16    }$
    17  ]$
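(Aside: on Python 3 both workarounds become unnecessary. dict preserves insertion order (guaranteed since 3.7), and since 3.4 json.dumps with indent already defaults to separators=(',', ': '), so there are no trailing spaces:)

```python
import json

raw = '{"d": [{"id": "zx83", "a": 1, "b": 2}, {"id": "f8f7", "a": 4, "b": 7}]}'
data = json.loads(raw)             # plain dict keeps insertion order on 3.7+
text = json.dumps(data, indent=2)  # no sort_keys; no trailing spaces on 3.4+
print(text)
```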
@dhermes dhermes added the api: bigquery Issues related to the BigQuery API. label Sep 17, 2016
dhermes (Contributor) commented Sep 17, 2016

@c0b Could you share a snippet using our library indicating what is broken?

c0b (Author) commented Sep 18, 2016

what is broken?

Like googleapis/google-cloud-node#1593, this issue is not about what is broken but about how you could do better. Calling Python's json.dumps without separators ends up wasting line-trailing spaces; I can hardly say anything is broken, but why not do better?

dhermes (Contributor) commented Sep 19, 2016

Can you point out places we are calling json.dumps / json.loads that could be better?

c0b (Author) commented Sep 19, 2016

Can you point out places we are calling json.dumps / json.loads that could be better?

I'm not sure; I haven't looked at the code in this repo to see where it calls json.dumps. But from the symptoms (trailing spaces, and all keys sorted alphabetically) I assume the problem is json.dumps called with sort_keys=True and without separators, very similar to httpie/cli#427. If you know the details of this repo better, you may correct me.

dhermes (Contributor) commented Sep 19, 2016

Can you give an example of output from this repo with the problem? We never use sort_keys, and we use json.dumps sparingly (though we never pass separators).

c0b (Author) commented Sep 19, 2016

The full path of the command above is /opt/google-cloud-sdk/bin/bootstrapping/bq.py, with a lot of Python code in the same directory. I assumed it calls this Python library, but maybe not; in any case it should be somewhere under the @GoogleCloudPlatform org.

c0b (Author) commented Sep 19, 2016

/opt/google-cloud-sdk/bin/bq --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d' |& cat -An

tseaver (Contributor) commented Sep 19, 2016

@c0b the gcloud SDK doesn't use this library at all.

c0b (Author) commented Sep 19, 2016

the gcloud SDK doesn't use this library

So would you tell me where the code repo for the gcloud SDK is? It should be maintained by Google, so this ticket can be re-routed.

I had a little test of this library; it might be even worse: it raises an exception on the new standard SQL with nested fields.

In this case I have the correct credentials / project set up in environment variables; the query runs with no problem with the bq tool or with the Node.js API. Ticket googleapis/google-cloud-node#1593 has the detailed API responses.

In [1]: from gcloud import bigquery

In [2]: client = bigquery.Client()

In [37]: query = client.run_sync_query('SELECT 1'); query.use_legacy_sql = False; query.run(); query.rows
Out[37]: [(1,)]

In [38]: query = client.run_sync_query('SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d'); query.use_legacy_sql = False; query.run(); query.rows
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-fda2dc8184d1> in <module>()
----> 1 query = client.run_sync_query('SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d'); query.use_legacy_sql = False; query.run(); query.rows

/home/ubuntu/.local/lib/python2.7/site-packages/gcloud/bigquery/query.pyc in rows(self)
    202         :returns: fields describing the schema (None until set by the server).
    203         """
--> 204         return _rows_from_json(self._properties.get('rows', ()), self.schema)
    205 
    206     @property

/home/ubuntu/.local/lib/python2.7/site-packages/gcloud/bigquery/_helpers.pyc in _rows_from_json(rows, schema)
     86             if field.mode == 'REPEATED':
     87                 row_data.append([converter(item, field)
---> 88                                  for item in cell['v']])
     89             else:
     90                 row_data.append(converter(cell['v'], field))

/home/ubuntu/.local/lib/python2.7/site-packages/gcloud/bigquery/_helpers.pyc in _record_from_json(value, field)
     52     if _not_null(value, field):
     53         record = {}
---> 54         for subfield, cell in zip(field.fields, value['f']):
     55             converter = _CELLDATA_FROM_JSON[subfield.field_type]
     56             if field.mode == 'REPEATED':

KeyError: 'f'

I wonder, is that the correct way to call it with the new standard SQL?

  1. https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html
  2. https://cloud.google.com/bigquery/sql-reference/
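For what it's worth, the traceback points at a likely cause: in the REST response, each element of a repeated RECORD cell is itself wrapped in a {"v": ...} envelope, so _record_from_json receives {"v": {"f": [...]}} and fails looking up 'f'. A small sketch of the shape (the payload is modeled on the API response in the linked Node.js ticket; this is my reading, not a confirmed fix):

```python
# One repeated-RECORD cell as the BigQuery REST API returns it
# (shape modeled on the response shown in googleapis/google-cloud-node#1593):
cell = {"v": [
    {"v": {"f": [{"v": "zx83"}, {"v": "1"}, {"v": "2"}]}},
    {"v": {"f": [{"v": "f8f7"}, {"v": "4"}, {"v": "7"}]}},
]}

# Iterating cell["v"] yields the {"v": ...} wrappers, not the records
# themselves, so a record converter must unwrap item["v"] before
# reading the "f" key:
records = [item["v"]["f"] for item in cell["v"]]
first_ids = [r[0]["v"] for r in records]
```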

tseaver (Contributor) commented Sep 19, 2016

So would you tell me where is the code repo for gcloud SDK code? should be maintained by Google and this ticket can be re-routed?

I don't know myself. Perhaps @jgeewax or @jonparrott can chime in?

have a little test of this library, might be even worse, it raised exception with the new standard SQL with nested fields:

Support for the "Standard SQL" data types is slowly being added: #2265, #2347. The BYTES type should be easy (if we can figure out how the back-end encodes the bytes -- the docs don't say). The STRUCT<t> and ARRAY<t> types are harder, and haven't got any support at all at this point. See #2229.

theacodes (Contributor) commented
The Cloud SDK is not currently open-source.

dhermes (Contributor) commented Sep 19, 2016

Filed #2353 and #2354 to make this uber-bug a little more focused. Let's chat in each of those about the specific problems at hand.

@dhermes dhermes closed this as completed Sep 19, 2016
c0b (Author) commented Sep 19, 2016

The Cloud SDK is not currently open-source.

Then please file an internal ticket on my behalf; I assume most of you are Google employees and know where to re-route it.

Closed-source code has no transparency; I don't know where I can complain.

theacodes (Contributor) commented
@c0b the cloud SDK has a public issue tracker here: https://code.google.com/p/google-cloud-sdk/issues/list

There's also a built-in command gcloud feedback that you can use to report issues.
