
[FEATURE] Support load CSV in PPL (inputlookup or search) #638

Open
LantaoJin opened this issue Sep 10, 2024 · 4 comments · May be fixed by #677 or #678
Labels
enhancement New feature or request

Comments

LantaoJin commented Sep 10, 2024

Support the functionality of loading data from CSV file.

file location

There are two options for where the CSV file can be stored:

  1. Upload CSV files to the Spark scratch dir set by the SPARK_LOCAL_DIRS environment variable or the spark.local.dir config, for example $SPARK_LOCAL_DIRS/<some_identities>/lookups/test.csv. But uploading to a local dir could introduce potential security issues, especially if the Spark application runs on a cloud service.
  2. (Preferred) Upload CSV files to an external URL. The user should make sure the application has access permission to the external URL. For example, s3://<bucket>/foo/bar/test.csv or file:///foo/bar/test.csv.

PPL syntax

There are also two options to support this feature:

A. Introduce a new command inputlookup or input:

input <fileUrl> [predicate]

Usage:

input "s3://bucket_name/folder1/folder2/flights.csv" FlightDelay > 500

The predicate FlightDelay > 500 only works when flights.csv contains a CSV header.
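To make the header requirement concrete, here is a minimal plain-Python sketch (the CSV content is hypothetical, standing in for flights.csv): reading the file with a header lets the predicate refer to the FlightDelay column by name, which is exactly what the `input` command would need to do.

```python
import csv
import io

# Hypothetical CSV content standing in for flights.csv; the header row
# is what lets the predicate refer to the FlightDelay column by name.
data = """FlightNum,Origin,FlightDelay
AA1,SEA,120
AA2,SFO,750
AA3,LAX,900
"""

# DictReader treats the first row as the header; without it, rows would be
# positional and "FlightDelay > 500" could not be resolved by column name.
rows = list(csv.DictReader(io.StringIO(data)))
delayed = [r for r in rows if int(r["FlightDelay"]) > 500]
print([r["FlightNum"] for r in delayed])  # → ['AA2', 'AA3']
```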

B. Modify the current search command to support file:

search file=<fileUrl> [predicate]

Usage:

search file="s3://bucket_name/folder1/folder2/flights.csv" FlightDelay > 500

PS: the current search command syntax is

search index=<indexName> [predicate]
search source=<indexName> [predicate]

Both options A and B could be used in a sub-search:

search source=os dept=ps
| eval host=lower(host)
| stats count BY host
| append
  [
    input "s3://key/lookup.csv" | eval host=lower(host) | fields host count
  ]
| stats sum(count) AS total BY host
| where total=0
search source=os dept=ps
| eval host=lower(host)
| stats count BY host
| append
  [
    search file="s3://key/lookup.csv" | eval host=lower(host) | fields host count
  ]
| stats sum(count) AS total BY host
| where total=0
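To clarify the intent of the sub-search above, here is a plain-Python sketch (hypothetical data, no Spark) of the append + stats + where pipeline: the lookup CSV lists every expected host with count 0, so after summing, any host whose total is 0 appears only in the lookup, meaning it sent no events.

```python
from collections import defaultdict

# Hypothetical "stats count BY host" result from the index search.
index_counts = [{"host": "web-1", "count": 3}, {"host": "web-2", "count": 5}]

# Hypothetical lookup.csv rows: every expected host, each with count 0.
lookup_rows = [{"host": "web-1", "count": 0},
               {"host": "web-2", "count": 0},
               {"host": "web-3", "count": 0}]

# append: concatenate both result sets, then stats sum(count) AS total BY host.
totals = defaultdict(int)
for row in index_counts + lookup_rows:
    totals[row["host"].lower()] += int(row["count"])

# where total=0: hosts present only in the lookup, i.e. hosts with no events.
missing = sorted(h for h, t in totals.items() if t == 0)
print(missing)  # → ['web-3']
```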
@LantaoJin LantaoJin added enhancement New feature or request untriaged labels Sep 10, 2024
@LantaoJin LantaoJin changed the title [FEATURE] Support inputlookup Command in PPL. [FEATURE] Support load CSV in PPL (inputlookup or search) Sep 10, 2024

penghuo commented Sep 11, 2024

+1 on (Preferred) Upload CSV files to external URL.
One concern is how to prevent users from accessing any data in the local filesystem, as this poses a security risk.


YANG-DB commented Sep 11, 2024

I agree with @penghuo that this is a possible security concern. I would propose a different approach:
use the dashboard to load the CSV file into an index, and use that index for the lookup.

@YANG-DB YANG-DB removed the untriaged label Sep 11, 2024

brijos commented Sep 11, 2024

I hate to be that guy, but I know of those in the community who would want to load the CSV into their index, as well as those who want to load the CSV into cloud storage. From a priority perspective, index should come first, as it is the easiest (assuming the analyst has write access to the cluster). Dealing with cloud storage introduces permissions friction.

LantaoJin commented

> +1 on (Preferred) Upload CSV files to external URL. One concern is how to prevent users from accessing any data in the local filesystem, as this poses a security risk.

A straightforward solution is to allow only s3:// scheme URLs in production.
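That restriction could be sketched as follows (an assumed helper, not part of the PPL codebase): validate the lookup URL's scheme against an allowlist before handing it to the CSV reader, so file:// paths never reach the local filesystem.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; in production only s3:// would be permitted,
# blocking file:// access to the local filesystem.
ALLOWED_SCHEMES = {"s3"}

def validate_lookup_url(url: str) -> str:
    """Reject any lookup URL whose scheme is not explicitly allowed."""
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme '{scheme}' is not allowed for lookup files")
    return url

validate_lookup_url("s3://bucket/lookup.csv")   # accepted
# validate_lookup_url("file:///etc/passwd")     # would raise ValueError
```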

> From a priority perspective, index should be the first as it is the easiest

Yes, understood on the priorities. We already have the lookup issue #620 open. This issue covers the requirement of loading data from a CSV file (similar to the inputlookup command in Splunk).
