[FEA] Support read_fwf functionality in cudf #15924

Open
a-hirota opened this issue Jun 5, 2024 · 4 comments
Labels
0 - Backlog In queue waiting for assignment cudf.pandas Issues specific to cudf.pandas feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@a-hirota
Contributor

a-hirota commented Jun 5, 2024

Missing Pandas Feature Request

Support for pandas.read_fwf.

Profiler Output

N/A

Additional context

Background:
COBOL remains in continuous use in legacy enterprise systems, and a complete overhaul of those systems is not realistic at this time. If legacy workloads could be made to run on GPUs, it could bring significant change to this area. Because COBOL works with fixed-width flat files, support for fixed-width files would be a first step toward that goal.

Code Example:
For instance, consider the following example data:

data = '''\
abcdef123456790.1234567abc           1234
ABCDEF123456790.1234567abc           5678
'''
with open('data.txt', 'w') as f:
    f.write(data)

import pandas as pd

# Example usage of pandas read_fwf
df = pd.read_fwf('data.txt', colspecs=[(0, 6), (6, 23), (23, 37), (37, 41)], header=None)

# Ensure that the output is not in scientific notation
pd.set_option('display.float_format', lambda x: '%.7f' % x)

print(df)

Expected output:

        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678
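
To make the colspecs semantics concrete: each pair is a half-open (start, end) character range, like a Python slice, and the padding around each field is trimmed. The following is a minimal pure-Python sketch of that idea, not how pandas or cudf implement the reader. `parse_fixed_width` is a hypothetical helper introduced only for illustration.

```python
def parse_fixed_width(text, colspecs):
    """Split each line of `text` into fields using half-open (start, end) offsets."""
    rows = []
    for line in text.splitlines():
        # Each colspec behaves like a Python slice: start inclusive, end exclusive.
        rows.append([line[start:end].strip() for start, end in colspecs])
    return rows


# Rebuild the sample records from their field widths (6, 17, 14, 4).
lines = [
    "abcdef" + "123456790.1234567" + "abc".ljust(14) + "1234",
    "ABCDEF" + "123456790.1234567" + "abc".ljust(14) + "5678",
]
data = "\n".join(lines) + "\n"

colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]
rows = parse_fixed_width(data, colspecs)
for row in rows:
    print(row)
# ['abcdef', '123456790.1234567', 'abc', '1234']
# ['ABCDEF', '123456790.1234567', 'abc', '5678']
```

For contiguous fields like these, pandas.read_fwf also accepts widths=[6, 17, 14, 4] in place of colspecs.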

Supplement:

  • Since COBOL systems handle fixed-width numeric values, it would be very beneficial if a type such as cudf::detail::fixed_width_scalar could be specified using the dtype parameter.
  • The to_fwf functionality is also necessary, but since pandas does not have to_fwf, it has been excluded from this issue.
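
One reason the first point matters: a field such as 123456790.1234567 cannot be stored exactly as a float64, so a reader that only offers floating-point types changes the value slightly. The sketch below uses Python's standard decimal module purely to illustrate the difference. It is an assumption about the motivation, not a statement about which cudf type a reader would use.

```python
from decimal import Decimal

field = "123456790.1234567"  # numeric field taken from the sample record

as_float = float(field)      # nearest binary float64 value
as_decimal = Decimal(field)  # exact decimal digits preserved

print(as_decimal)                       # 123456790.1234567
print(Decimal(as_float) == as_decimal)  # False: the float64 is only an approximation
```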
@a-hirota a-hirota added cudf.pandas Issues specific to cudf.pandas feature request New feature or request Needs Triage Need team to review and classify labels Jun 5, 2024
@a-hirota a-hirota changed the title [FEA] [FEA] Support read_fwf functionality in cudf Jun 5, 2024
@brandon-b-miller
Contributor

Hi @a-hirota ,
Thanks for raising this issue. While I'm not aware of any current efforts to implement this feature, I'd like to leave this issue open for further discussion and updates in the future. Enough people expressing interest here might be enough to generate some ideas and eventually move forward.

@brandon-b-miller brandon-b-miller added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 5, 2024
@GregoryKimball
Contributor

Hello @a-hirota, thank you for your request. I believe this reader is something that we can support by combining cudf APIs today. Would you please let me know if this works for you?

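import cudf

# Read the file as one string per line
series = cudf.read_text('data.txt', delimiter='\n')
colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]

df = cudf.DataFrame()
for n, (start, stop) in enumerate(colspecs):
    # Slice out the field and trim its padding
    df[n] = series.str.slice(start, stop).str.strip()

    # Infer numeric types; the patterns are anchored so every value must match in full
    if df[n].str.contains(r'^\d+\.\d+$').all():
        df[n] = df[n].astype('float64')
    elif df[n].str.contains(r'^\d+$').all():
        df[n] = df[n].astype('int64')
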
print(df)
        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678

@GregoryKimball GregoryKimball added 0 - Waiting on Author Waiting for author to respond to review and removed 0 - Backlog In queue waiting for assignment labels Jun 6, 2024
@a-hirota
Contributor Author

Hello @GregoryKimball , thank you for your prompt response! I appreciate the swift assistance.

I've conducted experiments and confirmed that it's functioning as expected.

However, because each column requires its own string slice, the slicing step is slower on the GPU than on the CPU for a dataset of about 1 million records with up to 2000 columns (roughly 1/50th of our usual daily processing volume). With read time included, the GPU is faster than the CPU overall, but the speedup is not significant:

< String Slicing Time >
CPU: 0.1727 seconds
GPU: 0.5404 seconds
Experiment Results:
https://github.com/a-hirota/rapids_qa/blob/main/fwf_read_nvidia15924.ipynb

I believe that providing the colspecs at read time, as read_fwf does, would eliminate the need to slice the series object column by column after reading. This optimization could lead to a significant speedup compared to the CPU. (Although not included in my original example, specifying dtypes might also be beneficial.)
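
The idea of supplying colspecs and dtypes up front can be sketched as a thin wrapper around the slicing workaround. This is only an illustration of the interface, not a proposed implementation: `slice_fixed_width` is a hypothetical helper, and pandas is used here as a CPU stand-in on the assumption that the same Series.str.slice, Series.str.strip, and astype calls behave equivalently in cudf.

```python
import pandas as pd  # CPU stand-in; the same Series.str calls exist in cudf


def slice_fixed_width(lines, colspecs, dtypes=None):
    """Hypothetical helper: slice a string Series into typed columns.

    lines    -- Series with one fixed-width record per row
    colspecs -- list of half-open (start, end) character offsets
    dtypes   -- optional {column_index: dtype}; unlisted columns stay strings
    """
    dtypes = dtypes or {}
    columns = {}
    for i, (start, end) in enumerate(colspecs):
        col = lines.str.slice(start, end).str.strip()
        if i in dtypes:
            # An explicit dtype replaces the regex-based type inference.
            col = col.astype(dtypes[i])
        columns[i] = col
    return columns


# Sample records rebuilt from their field widths (6, 17, 14, 4).
lines = pd.Series([
    "abcdef" + "123456790.1234567" + "abc".ljust(14) + "1234",
    "ABCDEF" + "123456790.1234567" + "abc".ljust(14) + "5678",
])
colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]

columns = slice_fixed_width(lines, colspecs, dtypes={1: "float64", 3: "int64"})
df = pd.DataFrame(columns)
print(df.dtypes)
```

With cudf, the input would presumably be the Series returned by cudf.read_text('data.txt', delimiter='\n') and the result would be passed to cudf.DataFrame. That variant is untested here, and it still performs one slice per column, so it does not address the performance concern by itself.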

Additionally, legacy systems tend to have lightweight computational tasks, mainly rule-based logic, resulting in a majority (80-90%) of processing time being allocated to I/O operations.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed 0 - Waiting on Author Waiting for author to respond to review labels Jun 13, 2024
rapids-bot bot pushed a commit that referenced this issue Sep 5, 2024
Improves performance of wide strings (avg > 64 bytes) when using `cudf::strings::slice_strings`.
Addresses some concerns from issue #15924

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #16574
rjzamora pushed a commit to rjzamora/cudf that referenced this issue Sep 6, 2024
res-life pushed a commit to res-life/cudf that referenced this issue Sep 11, 2024
@davidwendt
Contributor

The change in #16574 reduces the slice time significantly (roughly 18x in this measurement), from:

1m_data.txt: last position slice time = 0.6162 seconds

to

1m_data.txt: last position slice time = 0.0345 seconds
