Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support FlintTable batch write #1653

Merged

Conversation

penghuo
Copy link
Collaborator

@penghuo penghuo commented May 24, 2023

Description

  1. Add FlintWriter in Flint core.
  2. Support Batch Write in Spark.

Implementation Detail

FlintTable is capable of batch write operations in overwrite mode. This interacts with the FlintOpenSearchClient within the FlintCore package. During this process, we utilize the CREATE action within the OpenSearch bulk request. Users have the capability to provide an ID field within their options. If no ID is provided, OpenSearch will generate one automatically. When writing to FlintCore, the following conditions are checked:

If a document with the same ID already exists, the system will skip this entry and do nothing.
If no document with the same ID is found, the system will index the new document.

Why not use INDEX action

The INDEX action will delete doc with same id, and index new doc. In case Luncene does not really delete the doc, the storage size is doubled.

Usage Example

val schema = StructType(Seq(StructField("aInt", IntegerType)))
val openSearchOptions = Map("host" -> "localhost", "port" -> "9200", "spark.flint.write.id.name" -> "aInt")
val df = spark.range(15).toDF("aInt")
df.coalesce(1).write.format("flint").options(openSearchOptions).mode("overwrite").save("t002")

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Peng Huo <penghuo@gmail.com>
Signed-off-by: Peng Huo <penghuo@gmail.com>
@penghuo penghuo added the Flint label May 24, 2023
@penghuo penghuo self-assigned this May 24, 2023
@penghuo penghuo changed the title Flint batch write pr Flint - Support Batch Write May 24, 2023
@codecov
Copy link

codecov bot commented May 24, 2023

Codecov Report

Merging #1653 (8c9db09) into feature/flint (7268b5e) will not change coverage.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             feature/flint    #1653   +/-   ##
================================================
  Coverage            97.19%   97.19%           
  Complexity            4107     4107           
================================================
  Files                  371      371           
  Lines                10464    10464           
  Branches               706      706           
================================================
  Hits                 10170    10170           
  Misses                 287      287           
  Partials                 7        7           
Flag Coverage Δ
sql-engine 97.19% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

/**
* copy from spark {@link JacksonGenerator}.
*/
case class FlintJacksonGenerator(dataType: DataType, writer: Writer, options: JSONOptions) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reviewer:
This class is copy from SPARK JacksonGenerator. I did not find easy way to directly use it. You can only review the function i defined
def writeAction(action: String, idOrdinal: Option[Int], row: InternalRow): Unit = {}

@penghuo penghuo marked this pull request as ready for review May 24, 2023 15:22
@penghuo penghuo added the enhancement New feature or request label May 24, 2023
@penghuo penghuo changed the title Flint - Support Batch Write Support FlintTable batch write May 24, 2023
Signed-off-by: Peng Huo <penghuo@gmail.com>
Copy link
Collaborator

@dai-chen dai-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

@penghuo penghuo merged commit b720b84 into opensearch-project:feature/flint May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Flint
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants