Add RFC for Presto -Native TPC-DS Connector #28

Draft: wants to merge 1 commit into base: main

pdabre12
No description provided.

pdabre12 force-pushed the native-tpcds-connector branch 2 times, most recently from cf621f2 to 353cbed (September 17, 2024 00:21)
pdabre12 changed the title from "Add RFC for Prestissimo TPC-DS Connector" to "Add RFC for Presto -Native TPC-DS Connector" (Sep 17, 2024)

aditi-pandit left a comment

Thanks @pdabre12

### [Optional] Goals

1. Add a TPC-DS connector to generate TPC-DS data in Presto native.
2. Write end-to-end tests in Presto native with TPC-DS tables and conduct microbenchmarks in Velox.

We can skip micro-benchmarks in Velox here.


## Background

Currently, Presto does not have a native implementation of the TPC-DS connector. This RFC proposes adding one, so that it can be used as a Presto - Native catalog.

Give more background of TPC-DS benchmark, schema and dsdgen program here.


The Presto - Native TPC-DS connector will wrap dsdgen, the data generator written in C and distributed by the TPC organization. Our implementation must therefore behave exactly like the C implementation. DuckDB already has a TPC-DS connector of its own, for which the C files have been wrapped into C++ files; we are going to use these C++ files in our implementation.

In the C++ implementation, there are two types of tables used for generation: source tables and target tables. Source table files are prefixed with "s_", while target table files are prefixed with "w_", for example "s_call_center.c" and "w_call_center.c". Source tables appear to be used only when running dsdgen with an update flag; the exact function of this flag and the purpose of the source tables have not yet been explored. For now, our focus is solely on implementing functionality for the target tables (w_ tables).

Doesn't Ying want to generate the update data with the TPC-DS connector as well? It might be good to add more information about it here.


In the target table files (prefixed with "w_"), row generation relies on helper functions that we need to implement ourselves, namely `append_row_start` and `append_row_end`. Between those two calls, further `append_`-prefixed functions are invoked, one per column, chosen according to the data type of each column in the table's schema.
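
The row-building pattern described above can be sketched as follows. This is a minimal illustration, not the dsdgen or DuckDB declarations: the `TableSink` and `Cell` types, and the exact signatures of `append_row_start`, `append_row_end`, `append_integer` and `append_varchar`, are assumptions made for the example. Only the names `append_row_start` and `append_row_end` and the `append_` prefix come from the RFC text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical cell type: one value of a generated row.
using Cell = std::variant<std::int64_t, std::string>;

// Hypothetical sink that collects the generated rows of one table.
struct TableSink {
  std::vector<std::vector<Cell>> rows;  // completed rows
  std::vector<Cell> current;            // row currently being built
  bool inRow = false;                   // true between row_start and row_end
};

// Begins a new row; any partially built row is discarded.
void append_row_start(TableSink& sink) {
  sink.current.clear();
  sink.inRow = true;
}

// One append_ function per data type (assumed signatures).
void append_integer(TableSink& sink, std::int64_t value) {
  assert(sink.inRow);
  sink.current.emplace_back(value);
}

void append_varchar(TableSink& sink, const std::string& value) {
  assert(sink.inRow);
  sink.current.emplace_back(value);
}

// Finishes the current row and moves it into the completed rows.
void append_row_end(TableSink& sink) {
  assert(sink.inRow);
  sink.rows.push_back(std::move(sink.current));
  sink.current.clear();
  sink.inRow = false;
}
```

A generator for a two-column table would call `append_row_start(sink)`, then `append_integer` and `append_varchar` in schema order, then `append_row_end(sink)`. In the actual connector the sink would presumably write into Velox vectors rather than a `std::vector` of variants; that part is left out here because the RFC does not specify it.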

Will be great to give more information about TpcdsSplits and how data generation happens for a table from one split to the next.


A new TPC-DS config, `tpcds.toggle-char-to-varchar`, will be added to convert char columns to varchar, working around the lack of support for the char data type in Presto - Native. Enabling it when required keeps Presto - Java and Presto - Native consistent.
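
At the schema level, the effect of such a toggle might look like the sketch below. The helper `exposedColumnType`, its string-based type names, and the way the flag is passed in are all hypothetical; the RFC names only the config key `tpcds.toggle-char-to-varchar` and does not say how it is read or how column types are represented. The sketch also says nothing about data-level effects such as padding of values.

```cpp
#include <string>

// Hypothetical helper: returns the type name the connector would expose
// for a column declared as `declaredType` in the TPC-DS schema.
// `toggleCharToVarchar` stands in for the value of the
// `tpcds.toggle-char-to-varchar` config (how it is read is not shown).
std::string exposedColumnType(const std::string& declaredType,
                              bool toggleCharToVarchar) {
  const std::string prefix = "char(";
  // Only types that start with "char(" are rewritten; "varchar(n)",
  // "integer", etc. pass through unchanged.
  if (toggleCharToVarchar &&
      declaredType.compare(0, prefix.size(), prefix) == 0) {
    return "var" + declaredType;  // "char(16)" becomes "varchar(16)"
  }
  return declaredType;
}
```

With the toggle off, a `char(16)` column stays `char(16)`; with it on, the same column is reported as `varchar(16)` with the declared length preserved.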

What is the impact of this at the schema level and at the data level? Could you please elaborate?

2 participants