Add Iceberg samples #5

ZacBlanco · 2024-02-20T20:36:06Z

No description provided.

ZacBlanco · 2024-02-20T20:40:35Z

From the original thread from @aaneja:

Some questions I had -
What would be the schema for the $samples table? Would there be any extra columns besides the ones from the underlying table?

The schema should exactly map to the real table. Any updates to the real table's schema would also happen on the sample table.

Would we store some metadata for the $samples table that could be used to identify -
What table sampling method was used to seed the sample
When the sample was last updated w.r.t. the changelog

This was not mentioned in the RFC, but our prototype implementation included this info. We stored it in the Iceberg table properties. I will add this to the RFC.

tdcmeehan · 2024-06-28T15:27:10Z

RFC-0001.md

+
+#### Sample Maintenance
+
+Sample maintenance can be performed through a set of SQL statements which replace records in the


Can you please add an example for this?

tdcmeehan · 2024-06-28T15:27:55Z

RFC-0001.md

+Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been
+updated without needing to re-sample the entire table.
+
+Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure


I think we can do better than this comment. We can say that, with the help of an orchestration tool, sample maintenance using Presto or Spark is as simple as running the following queries on a fixed cadence, chosen according to how up to date users want their samples to be. Right?

Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples? CC: @hantangwangd

hantangwangd · 2024-06-28

Or, once #12 is merged, perhaps we can add a new distributed procedure which updates the table samples?

Sure, in terms of feasibility, I think we can do this through a customized distributed procedure.

hantangwangd · 2024-07-02T09:50:36Z

RFC-0001.md

+		   INSERT INTO "{schema}.{table}$samples" 
+		   SELECT reservoir_sample(...) 
+		   FROM {schema}.{table};


Can reservoir_sample(...) be used directly here? As described in PR #21296, reservoir_sample(...) outputs a single row-type value with two columns.
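To illustrate the shape mismatch raised here, the following is a minimal Python sketch of reservoir sampling (Algorithm R). It is not Presto's implementation. It assumes, per the description above, that the aggregate yields one row-like value with two fields; the field names `processed_count` and `sample`, the function signature, and the `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(values: Iterable[T], k: int, seed: int = 0) -> Tuple[int, List[T]]:
    """Algorithm R: keep a uniform sample of at most k items from a stream.

    Returns ONE value with two fields, (processed_count, sample), mirroring
    the two-column row described in the comment above (names are assumed).
    """
    rng = random.Random(seed)
    sample: List[T] = []
    processed = 0
    for value in values:
        processed += 1
        if len(sample) < k:
            # Fill the reservoir with the first k items.
            sample.append(value)
        else:
            # Replace a random slot with probability k / processed.
            j = rng.randrange(processed)
            if j < k:
                sample[j] = value
    return processed, sample


# Stand-in for the rows of the source table.
rows = [(i, f"name_{i}") for i in range(1000)]

# The aggregate result is a single value, not a set of table rows.
processed_count, sample = reservoir_sample(rows, k=10)

# It has to be expanded ("unnested") into individual rows before an insert.
rows_to_insert = list(sample)
```

Under these assumptions, the INSERT in the diff would need an equivalent expansion step between the aggregate and the target table, rather than selecting the aggregate result directly.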

hantangwangd · 2024-07-02T12:16:29Z

RFC-0001.md

+for sample maintenance is available already once the samples are generated. The biggest hurdle for
+users will be that sample maintenance will need to be done manually. We can document the maintenance
+process for samples, but there currently isn't a way to schedule or automatically run the correct
+set of maintenance queries for the sample tables. 


Besides incremental sample maintenance, should we also describe how the sample table is refreshed? That is, how to refresh the sample table and what granularity of refresh is supported.

In addition, should we maintain the snapshot version correspondence between the sample table and the source table? We analyze the sample table while querying the source table, so we need a way to find the nearest statistics on the sample table for a specified snapshot version of the source table.
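One way to picture the correspondence asked about here is a small lookup, sketched below in Python. Everything in it is an assumption for illustration: the refresh log, its layout, and the choice to interpret "nearest" as the most recent sample refresh at or before the source snapshot's commit time. Commit timestamps are used for ordering because snapshot IDs themselves are not ordered.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical bookkeeping: one entry per sample refresh, sorted by the commit
# time (in ms) of the source snapshot the sample was built from.
RefreshLog = List[Tuple[int, int]]  # (source_commit_ts_ms, sample_snapshot_id)


def nearest_sample_snapshot(log: RefreshLog, source_ts_ms: int) -> Optional[int]:
    """Return the sample snapshot built at or before the given source commit time."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, source_ts_ms)
    if idx == 0:
        # No sample existed yet for a source snapshot this old.
        return None
    return log[idx - 1][1]


refresh_log: RefreshLog = [(1_000, 11), (2_000, 22), (3_000, 33)]
at_2500 = nearest_sample_snapshot(refresh_log, 2_500)
before_first = nearest_sample_snapshot(refresh_log, 500)
```

With a mapping like this, a query against an older source snapshot could pick statistics from the matching sample snapshot instead of the latest one.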

hantangwangd · 2024-07-02T12:50:32Z

RFC-0001.md

+sample with updates, deletes, or inserts that occur on the real table. With the introduction of
+Iceberg's changelog table in #20937 we can efficiently update the sample after the table has been
+updated without needing to re-sample the entire table.


The Iceberg changelog table currently does not support V2 row-level deletes. That seems to be a big problem for incremental maintenance.
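A toy Python sketch of why missing delete records matter for incremental maintenance. The record layout, the operation labels, and the helper name are illustrative assumptions, not the changelog table's actual schema. Inserts are ignored here on purpose; in practice they would be fed into the sampler.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change record: (operation, primary_key, row_value).
ChangeRecord = Tuple[str, int, str]


def drop_deleted_rows(sample: Dict[int, str], changelog: Iterable[ChangeRecord]) -> Dict[int, str]:
    """Remove sampled rows that the changelog reports as deleted."""
    updated = dict(sample)
    for operation, key, _ in changelog:
        if operation == "DELETE":
            updated.pop(key, None)
    return updated


sample = {1: "a", 2: "b", 3: "c"}
full_changelog = [("INSERT", 4, "d"), ("DELETE", 2, "b")]

# Simulate a changelog that does not surface row-level deletes.
changelog_missing_deletes = [rec for rec in full_changelog if rec[0] != "DELETE"]

consistent = drop_deleted_rows(sample, full_changelog)
stale = drop_deleted_rows(sample, changelog_missing_deletes)
```

In the second case the sample still contains key 2, a row that no longer exists in the source table, so statistics computed from the sample would drift from the real table.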

aditi-pandit

Thanks @ZacBlanco for this doc. Gives a good picture of the work being done.

aditi-pandit · 2024-08-13T09:38:52Z

RFC-0001.md

+logic adjusted accordingly to utilize the iceberg table stored at the sample path, rather than the
+true table.
+
+One benefit of creating the sample tables as an iceberg table is that we can re-use the


Do you also use the file format of the original table, or have you decided on something else? It would be good to clarify that as well.

aditi-pandit · 2024-08-13T09:58:55Z

RFC-0001.md

+
+Currently, iceberg does not fully support support partition-level statistics[^3]. Once partitions
+statistics are officially released, the iceberg connector should be updated to support collecting
+and reading the partition-level stats. As long as the sample table creation code ensures the sample


Can you give a simple explanation of what partition-level statistics are and why they are not affected by sample tables?

aditi-pandit · 2024-08-13T10:53:43Z

RFC-0001.md

+Incremental sample maintenance will be out of scope for the main PR, but all of the infrastructure
+for sample maintenance is available already once the samples are generated. The biggest hurdle for
+users will be that sample maintenance will need to be done manually. We can document the maintenance
+process for samples, but there currently isn't a way to schedule or automatically run the correct


Typically, maintenance happens using an ETL tool or Kafka pipelines. We could investigate how to integrate with these. WxD has these options: https://www.ibm.com/topics/data-pipeline

aditi-pandit · 2024-08-13T11:00:19Z

RFC-0001.md

+
+### Table-Level Storage in Puffin files
+
+One alternative approach is to generate the samples and store them in Puffin files[^puffin_files].


It's better to write a technical doc in the third person.

Add Iceberg samples RFC

c2261ca

ZacBlanco mentioned this pull request Feb 20, 2024

Table Sampling in the Iceberg Connector prestodb/presto#21963

Closed

ZacBlanco added 2 commits February 20, 2024 12:47

Add information on schema updates and table metadata

43c28b6

Add notes on sample storage in puffin files

ec4b183

tdcmeehan reviewed Jun 28, 2024

ZacBlanco marked this pull request as draft June 28, 2024 19:34

hantangwangd reviewed Jul 2, 2024

aditi-pandit reviewed Aug 13, 2024
