
Arrow table from query result much larger than equivalent inserted table #1741

Open
keller-mark opened this issue May 22, 2024 · 5 comments

@keller-mark (Contributor)

What happens?

Arrow tables returned by conn.query are much larger than expected because dictionary encoding is not used. In the web browser, large (>1 million row) tables that can be inserted into the database successfully cannot then be queried back in their entirety because of the memory footprint of the returned Arrow table.

To Reproduce

This Observable notebook contains a minimal reproduction: https://observablehq.com/d/56ceaa780133858a
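
For reference, here is a rough inline sketch of the same idea (hypothetical code, not the notebook itself; it assumes an already-open duckdb-wasm AsyncDuckDBConnection named conn, and relies on apache-arrow's tableFromArrays inferring a dictionary-encoded column for JS string arrays):

import * as arrow from 'apache-arrow';

// Build a tall table whose string column has very low cardinality.
const n = 1_000_000;
const categories = ['a', 'b', 'c', 'd'];
const inserted = arrow.tableFromArrays({
  category: Array.from({ length: n }, (_, i) => categories[i % categories.length]),
});

// Insert the table, then query the same rows back.
await conn.insertArrowTable(inserted, { name: 'tbl' });
const queried = await conn.query('SELECT * FROM tbl');

// The queried table comes back without dictionary encoding, so its
// serialized size (and in-memory footprint) is much larger.
console.log('inserted IPC bytes:', arrow.tableToIPC(inserted).byteLength);
console.log('queried IPC bytes:', arrow.tableToIPC(queried).byteLength);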

Browser/Environment:

Firefox Developer Edition 127.0b4

Device:

MacBook

DuckDB-Wasm Version:

1.24.0

DuckDB-Wasm Deployment:

Observable Notebook

Full Name:

Mark Keller

Affiliation:

Harvard Medical School

@keller-mark (Contributor, Author)

I think this comes down to these lines:

const buffer = await this._bindings.runQuery(this._conn, text);
const reader = arrow.RecordBatchReader.from<T>(buffer);

On the JS side, the arrow.RecordBatchReader could potentially be modified to return dictionary-encoded columns. However, if the buffer passed in is not already dictionary-encoded, the memory issue may persist, since the buffer itself is very large.
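
For illustration, re-encoding a column after the fact might look roughly like the sketch below (hypothetical code using apache-arrow's vectorFromArray with a Dictionary type; dictionaryEncodeColumn is a made-up helper). It also shows why this wouldn't reduce peak memory: the fully materialized result table has to exist before the re-encode can run.

function dictionaryEncodeColumn(table: arrow.Table, name: string): arrow.Vector {
  // The plain Utf8 column (and the whole result buffer) is already in memory here.
  const values = [...table.getChild(name)!];
  // Build a second, dictionary-encoded copy of the same values.
  const type = new arrow.Dictionary(new arrow.Utf8(), new arrow.Int32());
  return arrow.vectorFromArray(values, type);
}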

@domoritz (Collaborator)

Hmm, but if it's a buffer, doesn't that mean that JS here is just instantiating an Arrow record batch from IPC, and that the IPC dictates the schema, so we can't just dictionary-encode after the fact?

@keller-mark (Contributor, Author)

Yes, that is what I was trying to convey with:

the memory issue may persist due to the buffer being very large

i.e., that the fix might need to be on the C++ side.

@keller-mark (Contributor, Author) commented May 27, 2024

That aside, on the JS side, couldn't the dictionary encoding be performed in an incremental fashion as the buffer is parsed? For example, you could keep a set of values that have already been "seen" and, when an unseen value is encountered, add a mapping to the dictionary (rough sketch below).
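
Something along these lines (hypothetical sketch; decodeStringsFromBuffer and numRows are stand-ins for whatever a parsing step would expose, not real APIs):

// Incrementally dictionary-encode string values as they are decoded.
const dictionary: string[] = [];
const seen = new Map<string, number>(); // value -> index into the dictionary
const indices = new Int32Array(numRows); // assumes the row count is known up front

let row = 0;
for (const value of decodeStringsFromBuffer(buffer)) { // stand-in for the parsing step
  let index = seen.get(value);
  if (index === undefined) {
    index = dictionary.length;
    dictionary.push(value);
    seen.set(value, index);
  }
  indices[row++] = index;
}
// indices + dictionary together form a dictionary-encoded column.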

EDIT

I suppose the above does not make sense if there is no "parsing" step (https://github.com/apache/arrow/blob/ff9921ffa89585be69ae85674bb365d03cb22ba4/js/src/ipc/reader.ts#L357); in that case the only option would be on the C++ side.

@domoritz (Collaborator)

Yep, that's what I'm thinking as well but I may be wrong.
