
[RFC] - OpenSearch Table #14524

Open
penghuo opened this issue Jun 24, 2024 · 6 comments
Labels
  • enhancement (Enhancement or improvement to existing feature or request)
  • RFC (Issues requesting major changes)
  • Roadmap:Search (Project-wide roadmap label)
  • Search:Query Capabilities
  • Storage (Issues and PRs relating to data and metadata storage)

Comments

penghuo (Contributor) commented Jun 24, 2024

Is your feature request related to a problem? Please describe

1. Current status

Currently, users can use the Spark Dataset API to directly read and write OpenSearch indices, and the OpenSearch Spark extension internally leverages the Dataset API to access OpenSearch indices (a sketch of this access pattern follows the list below). However, we observe several problems and requirements:

  • Limited Query Capabilities:
    • Users cannot use SQL to directly query OpenSearch indices.
  • Inconsistent Data Type Mapping:
    • There is no standardized mapping between OpenSearch field types and SQL data types, leading to inconsistencies and confusion. For example:
      • How do multi-fields map to SQL types?
      • How does the String SQL data type map to OpenSearch data types?
      • How should OpenSearch-specific data types, such as IP addresses, be exposed as SQL data types?
  • Lack of Partition Support:
    • There is no partition support when accessing OpenSearch indices, resulting in inefficiency.
  • Performance Issues:
    • Reading OpenSearch data in JSON format is inefficient.
      • There is no high-performance data transfer protocol exposed by OpenSearch, and the overhead associated with REST response serialization and deserialization is notably high.
      • There is no way to prune nested columns when retrieving data from OpenSearch.
  • Accessing Cold Indices:
    • Customers have expressed a need to use Spark to access cold indices without attaching these cold indices as warm indices.
  • Integration with AWS Glue and Lake Formation:
    • There are emerging requirements to integrate OpenSearch indices with AWS Glue and Lake Formation, enabling them to be governed as Lake Formation tables.
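
To make the status quo concrete, here is a minimal sketch of reading an index through the Dataset API. The "opensearch" format name and "opensearch.nodes" option key are assumptions based on the opensearch-hadoop connector, and https://my-domain is a placeholder endpoint; verify both against your connector version.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("opensearch-dataset-read")
  .getOrCreate()

// Read a whole index through the Dataset API (assumed connector options).
val df = spark.read
  .format("opensearch")
  .option("opensearch.nodes", "https://my-domain") // placeholder endpoint
  .load("index00001")

// Documents come back as JSON-derived rows: no SQL catalog entry,
// no partition metadata, and no nested-column pruning.
df.show()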

Describe the solution you'd like

2. Vision of the future

Our goal is to enable users to utilize OpenSearch indices within popular query engines such as Spark. Spark users should be able to directly use OpenSearch clusters as catalogs and access OpenSearch indices as tables. We aim to enable Spark users to leverage OpenSearch's rich query and aggregation capabilities to efficiently query OpenSearch. Given OpenSearch's rich data type support, we plan to extend Spark's data type system and functions to incorporate more features from OpenSearch.
We also intend to formally define the OpenSearch Table specification, covering schema and data types, partitioning, and table metadata. Users should be able to define OpenSearch tables in the AWS Glue catalog and use Lake Formation to define ACLs on OpenSearch tables.
To improve performance, we will invest in more efficient data storage formats and data transmission protocols for OpenSearch. We are considering Apache Parquet as a storage format instead of _source (see the similar proposal #13668), and Apache Arrow for zero-copy data transmission.
To achieve cost savings, we aim to enable users to query OpenSearch cold indices and snapshots. This will allow them to eagerly move data from hot to cold storage without losing OpenSearch's key features.
In summary, the end-to-end user experience is as follows:

2.1. Directly access

2.1.1. Configure Spark

By default, the OpenSearch domain is the catalog.

spark.sql.catalog.dev.warehouse=https://my-domain/
spark.sql.catalog.dev=org.apache.opensearch.spark.SparkCatalog
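
The same properties can be set programmatically when building the session. A sketch only: org.apache.opensearch.spark.SparkCatalog is the catalog class proposed by this RFC, not a published artifact.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("opensearch-catalog")
  // Proposed catalog class from this RFC (not yet published).
  .config("spark.sql.catalog.dev", "org.apache.opensearch.spark.SparkCatalog")
  .config("spark.sql.catalog.dev.warehouse", "https://my-domain/")
  .getOrCreate()

// The domain is now addressable as catalog "dev".
spark.sql("SHOW TABLES IN dev.default").show()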

2.1.2. Query index as Table

Users can directly query an OpenSearch index without creating a table.

  • opensearch is the default catalog name.
  • default is the database name.
  • index00001 is the index name.
SELECT * FROM opensearch.default.index00001

2.2. Create Table

2.2.1. Configure Spark and Create Table (Spark)

  • step-1: using Spark as an example, the user configures the catalog, with Glue as the metastore and https://my-domain/ as the location for metadata.
spark.sql.catalog.dev.warehouse=https://my-domain/
spark.sql.catalog.dev=org.apache.opensearch.spark.SparkCatalog 
spark.sql.catalog.dev.catalog-impl=org.apache.opensearch.aws.glue.GlueCatalog 
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
  • step-2: the user creates a table with the provided schema. On the backend, this:
    • Creates OpenSearch index dev.default.tbl00001.metadata to store metadata.
    • Creates OpenSearch index tbl00001 to store data.
CREATE TABLE my_table (
  DocId INT,
  Links STRUCT<
    Forward: ARRAY<INT>,
    Backward: ARRAY<INT>
  >,
  Name ARRAY<STRUCT<
    Language: ARRAY<STRUCT<
      Code: STRING,
      Country: STRING
    >>,
    Url: STRING
  >>
)
USING OPENSEARCH
LOCATION "http://my-domain"

2.2.2. Writes (Spark)

  • Writing with SQL - INSERT INTO
INSERT INTO dev.default.tbl00001 VALUES (1), (2)
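
The same append could also go through Spark's DataFrameWriterV2 API, assuming the proposed catalog implements the V2 write path. A sketch (only DocId is shown; a real append must match the full table schema):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("opensearch-write").getOrCreate()
import spark.implicits._

// Equivalent of INSERT INTO via DataFrameWriterV2 (schema abbreviated).
Seq(1, 2).toDF("DocId").writeTo("dev.default.tbl00001").append()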

2.2.3. Query (Spark)

  • Querying with SQL
SELECT * FROM dev.default.tbl00001
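
On the DataFrame side, the same table could be read through the catalog; a goal of the proposed connector would be to push filters and column pruning down into OpenSearch queries rather than scanning whole documents. A sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("opensearch-read").getOrCreate()

// Filter and projection are candidates for pushdown into OpenSearch.
spark.table("dev.default.tbl00001")
  .filter(col("DocId") > 1)
  .select("DocId")
  .show()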

Related component

Search:Query Capabilities

Describe alternatives you've considered

n/a

Additional context

3. Next Steps

We will incorporate the feedback from this RFC into a more detailed proposal and high-level design that integrates the storage-related efforts in OpenSearch. We will create meta-issues to delve deeper into the components involved and continue with the detailed design.

4. How Can You Help?

Any general comments about the overall direction are welcome. Some specific questions:

  • What are your use cases where this architecture would be useful?
  • What specific remote storage systems would you like to use with OpenSearch?
  • Have you implemented any workarounds outside of OpenSearch that could better be solved by this architecture?
penghuo added the enhancement and untriaged labels Jun 24, 2024
penghuo changed the title from "RFC - OpenSearch Table" to "[RFC] - OpenSearch Table" Jun 24, 2024
mch2 added the RFC, Storage, and Roadmap:Search labels and removed the untriaged label Jun 26, 2024
mch2 (Member) commented Jun 26, 2024

Thanks for raising this @penghuo, looking forward to discussion on this. Removed untriaged label.

andrross (Member) commented

Some high level comments after a discussion with @penghuo:

  • Some "explain like I'm 5" use cases would be helpful for those of us not really familiar with tools like Spark, AWS Glue, Lake Formation, etc.
  • The overall goal of "OpenSearch Table" is a standard specification for integrating OpenSearch data with the broader SQL/analytics world. This is as much a topic for the OpenSearch Project at large as it is for the core/this repo specifically.
  • There are some specifics in here around performance problems (e.g. using the API to scrape _source fields for large amounts of data). Prototypes or proof-of-concept examples with benchmarking data would really help evaluate these concerns.

msfroh (Collaborator) commented Jul 1, 2024

We are considering Apache Parquet as a storage format instead of _source, (similar proposal #13668) and Apache Arrow for zero-copy data transmission.

Last year, @noCharger and I built a little prototype that avoided storing _source in OpenSearch, instead keeping document source in DynamoDB. At query time, you could still run queries against indexed fields (and doc values), but a search pipeline with a search response processor would fetch source from DynamoDB.

I wonder if we could do something similar here, where a query against an OpenSearch index retrieves matching doc IDs, sorted or scored as appropriate, then you use those doc IDs to fetch content from Parquet (or DynamoDB or Cassandra or whatever).
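
A rough sketch of that split, with fetchMatchingIds and fetchSource as hypothetical stand-ins (neither exists in any published client) for the OpenSearch query and the external-store lookup:

// Hypothetical helpers for illustration only.
// fetchMatchingIds: run the query against OpenSearch, return sorted/scored doc IDs.
// fetchSource: look those IDs up in Parquet, DynamoDB, Cassandra, etc.
def fetchMatchingIds(index: String, queryDsl: String): Seq[String] = ???
def fetchSource(store: String, ids: Seq[String]): Seq[String] = ???

val ids = fetchMatchingIds("index00001", """{"match": {"Name": "foo"}}""")
val docs = fetchSource("parquet://my-bucket/tbl00001", ids) // hypothetical URI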

linuxpi (Collaborator) commented Jul 25, 2024

[Storage Triage]

@penghuo Thanks for filing this issue, love to see how this pans out!

msfroh (Collaborator) commented Aug 5, 2024

There is no partition support when accessing OpenSearch indices, resulting in inefficiency.

Not sure how Spark Dataset's API thinks about partitions, but a possible option may be to use a point-in-time (PIT) with slices. You could query multiple slices concurrently.

Not sure how much that piece helps, though.

Edit: Oh -- I see this is mentioned in opensearch-project/opensearch-spark#430 (assuming the partitioning there is the same slicing).
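
For reference, a sketch of sliced reads over a point-in-time (PIT), which is how per-partition readers could each scan a disjoint slice of an index; the endpoint and index names are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val client = HttpClient.newHttpClient()

// 1. Create a PIT so every slice sees the same snapshot of the index.
val createPit = HttpRequest.newBuilder()
  .uri(URI.create("https://my-domain/index00001/_search/point_in_time?keep_alive=5m"))
  .POST(HttpRequest.BodyPublishers.noBody())
  .build()
val pitResponse = client.send(createPit, HttpResponse.BodyHandlers.ofString()).body()
val pitId = "PIT_ID_FROM_RESPONSE" // real code would parse "pit_id" from pitResponse

// 2. Each reader issues one sliced search, varying "id" from 0 to max-1.
val slicedSearch = HttpRequest.newBuilder()
  .uri(URI.create("https://my-domain/_search"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(
    s"""{"slice": {"id": 0, "max": 4},
       |"pit": {"id": "$pitId", "keep_alive": "5m"},
       |"query": {"match_all": {}}}""".stripMargin))
  .build()
println(client.send(slicedSearch, HttpResponse.BodyHandlers.ofString()).body())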

msfroh (Collaborator) commented Aug 5, 2024

To improve performance, we will invest in more efficient data storage formats and data transmission protocols for OpenSearch.

@amberzsy has done some really neat experiments using Protobuf serialization between client and coordinator. I would imagine that would help here too. (See opensearch-project/opensearch-clients#69.)

Given that Spark mostly relies on streaming, would Spark Dataset be able to benefit from a client/server API that supports streaming, like gRPC? Would it help if OpenSearch could stream larger result sets from each request?
