[FEA] Support read_fwf functionality in cudf #15924

a-hirota · 2024-06-05T04:48:06Z

Missing Pandas Feature Request

Support for pandas.read_fwf.

Profiler Output

N/A

Additional context

Background:
In the legacy enterprise space, COBOL is in continuous use, and the reality is that a complete overhaul of legacy systems is difficult to achieve at this time. If the processing of legacy systems can be made to run on GPUs, this could bring significant change to this area. Since COBOL deals with fixed-width flat files, support for fixed-width files could be a first step in addressing this need.

Code Example:
For instance, consider the following example data:

data = '''\
abcdef123456790.1234567abc           1234
ABCDEF123456790.1234567abc           5678
'''
with open('data.txt', 'w') as f:
    f.write(data)

import pandas as pd

# Example usage of pandas read_fwf
df = pd.read_fwf('data.txt', colspecs=[(0, 6), (6, 23), (23, 37), (37, 41)], header=None)

# Ensure that the output is not in scientific notation
pd.set_option('display.float_format', lambda x: '%.7f' % x)

print(df)

Expected output:

        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678
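
For reference, the `colspecs` above are half-open `(start, end)` character offsets, and because the fields are contiguous the same layout can be written as a list of widths (pandas also accepts a `widths` argument for this case). The following is a minimal plain-Python sketch of that relationship; `widths_to_colspecs` is a hypothetical helper written for illustration, not part of pandas or cudf.

```python
from itertools import accumulate

# Rebuild the first sample record from its fields so the padding is explicit:
# 6 + 17 + 14 + 4 = 41 characters.
line = "abcdef" + "123456790.1234567" + "abc".ljust(14) + "1234"


def widths_to_colspecs(widths):
    """Turn contiguous field widths into half-open (start, end) offsets."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


colspecs = widths_to_colspecs([6, 17, 14, 4])

# Slicing with the offsets and trimming padding recovers the four fields.
fields = [line[start:end].strip() for start, end in colspecs]

print(colspecs)
print(fields)
```

The offsets produced here match the `colspecs` passed to `pd.read_fwf` in the example.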

Supplement:

Since COBOL systems handle fixed-width numeric values, it would be very beneficial if a type such as cudf::detail::fixed_width_scalar could be specified using the dtype parameter.
The to_fwf functionality is also necessary, but since pandas does not have to_fwf, it has been excluded from this issue.
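
To make the `to_fwf` idea concrete, here is a rough plain-Python sketch of what writing fixed-width records could look like. The function name `to_fwf`, its signature, and the left-justified padding are all assumptions made for illustration; nothing like this exists in pandas or cudf today.

```python
def to_fwf(rows, widths):
    """Format rows as fixed-width text, one record per line (hypothetical helper)."""
    lines = []
    for row in rows:
        cells = []
        for value, width in zip(row, widths):
            text = str(value)
            # Refuse to truncate silently: a value wider than its field is an error.
            if len(text) > width:
                raise ValueError(f"{text!r} does not fit in {width} characters")
            # Left-justify and pad with spaces up to the field width.
            cells.append(text.ljust(width))
        lines.append("".join(cells))
    return "\n".join(lines) + "\n"


rows = [
    ("abcdef", "123456790.1234567", "abc", 1234),
    ("ABCDEF", "123456790.1234567", "abc", 5678),
]
text = to_fwf(rows, widths=[6, 17, 14, 4])
print(text, end="")
```

The numeric field is passed as a string here so that its digits are written exactly as given; a real implementation would need a rule for formatting numeric dtypes.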


brandon-b-miller · 2024-06-05T12:44:08Z

Hi @a-hirota ,
Thanks for raising this issue. While I'm not aware of any current efforts to implement this feature, I'd like to leave this issue open for further discussion and updates in the future. Enough people expressing interest here might be enough to generate some ideas and eventually move forward.

GregoryKimball · 2024-06-05T20:18:16Z

Hello @a-hirota, thank you for your request. I believe this reader is something that we can support by combining cudf APIs today. Would you please let me know if this works for you?

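import cudf

# Read each line of the file as one string row.
series = cudf.read_text('data.txt', delimiter='\n')
colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]

df = cudf.DataFrame()
for n, (start, stop) in enumerate(colspecs):
    # Slice out the field and trim the padding.
    df[n] = series.str.slice(start, stop).str.strip()

    # Cast only when every value is entirely numeric; the anchors keep
    # values such as 'abc1' from being treated as numbers.
    if df[n].str.contains(r'^\d+\.\d+$').all():
        df[n] = df[n].astype('float64')
    elif df[n].str.contains(r'^\d+$').all():
        df[n] = df[n].astype('int64')

print(df)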

        0                 1    2     3
0  abcdef 123456790.1234567  abc  1234
1  ABCDEF 123456790.1234567  abc  5678
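
One note on the type-inference step: the result depends on whether the pattern has to match the whole value or only part of it. Below is a small plain-Python sketch of the whole-value variant using `re.fullmatch`. The helper `infer_dtype` is hypothetical, it ignores signs, exponents, and empty fields, and it only illustrates the rule; it is not how cudf or pandas infer types.

```python
import re

FLOAT_RE = re.compile(r"\d+\.\d+")
INT_RE = re.compile(r"\d+")


def infer_dtype(values):
    """Pick a dtype name only if every value matches the pattern in full."""
    if all(FLOAT_RE.fullmatch(v) for v in values):
        return "float64"
    if all(INT_RE.fullmatch(v) for v in values):
        return "int64"
    return "object"


print(infer_dtype(["123456790.1234567", "123456790.1234567"]))
print(infer_dtype(["1234", "5678"]))
print(infer_dtype(["abc1", "abc2"]))

# A substring search would accept 'abc1' as containing digits,
# which is why whole-value matching is used above.
print(bool(INT_RE.search("abc1")), bool(INT_RE.fullmatch("abc1")))
```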

a-hirota · 2024-06-10T13:56:57Z

Hello @GregoryKimball , thank you for your prompt response! I appreciate the swift assistance.

I've conducted experiments and confirmed that it's functioning as expected.

However, because each column has to be sliced out of the string separately, the slicing step is slower on the GPU than on the CPU. I tested with a dataset of around 1 million records and up to 2000 columns, which is roughly 1/50th of our usual daily processing volume. Although the total GPU time, including read time, comes out ahead of the CPU, the overall speedup is not significant:

< String Slicing Time >
CPU: 0.1727 seconds
GPU: 0.5404 seconds
Experiment Results:
https://github.com/a-hirota/rapids_qa/blob/main/fwf_read_nvidia15924.ipynb

I believe that providing the colspecs at the time of reading, similar to read_fwf, would eliminate the need for redefining the positions of the series object after reading. This optimization could lead to a significant speedup compared to the CPU. (Although not initially included in my Example usage, specifying dtypes might also be beneficial.)

Additionally, legacy systems tend to have lightweight computational tasks, mainly rule-based logic, resulting in a majority (80-90%) of processing time being allocated to I/O operations.
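
To illustrate the interface shape I have in mind, here is a CPU-only plain-Python sketch in which the column positions and types are supplied up front and each record is sliced and converted in a single pass. The name `parse_fwf` and its parameters are hypothetical, and this says nothing about how a GPU reader would be implemented or how fast it would be.

```python
from decimal import Decimal


def parse_fwf(lines, colspecs, dtypes):
    """Slice each record once and convert each field with a caller-supplied type."""
    columns = [[] for _ in colspecs]
    for line in lines:
        for column, (start, end), convert in zip(columns, colspecs, dtypes):
            # Trim padding, then apply the converter chosen for this column.
            column.append(convert(line[start:end].strip()))
    return columns


records = [
    "abcdef" + "123456790.1234567" + "abc".ljust(14) + "1234",
    "ABCDEF" + "123456790.1234567" + "abc".ljust(14) + "5678",
]
colspecs = [(0, 6), (6, 23), (23, 37), (37, 41)]
# Decimal keeps the fixed-width numeric field exact instead of rounding to float.
dtypes = [str, Decimal, str, int]

columns = parse_fwf(records, colspecs, dtypes)
print(columns[1])
print(columns[3])
```

With the types given explicitly, no separate inference pass over the sliced strings is needed.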

Improves performance of wide strings (avg > 64 bytes) when using `cudf::strings::slice_strings`. Addresses some concerns from issue #15924

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #16574


davidwendt · 2024-09-16T09:28:36Z

The improvement from #16574 reduces the slice time significantly, from:

1m_data.txt: last position slice time = 0.6162 seconds

to

1m_data.txt: last position slice time = 0.0345 seconds
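
As a quick check on the size of that change, the ratio of the two timings can be computed directly. The variable names below are just for illustration; the two numbers are the ones reported above.

```python
# Slice times reported above for 1m_data.txt, in seconds.
before = 0.6162
after = 0.0345

speedup = before / after
print(f"{speedup:.1f}x faster")
```

That works out to a little under 18x for this particular measurement.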

a-hirota added cudf.pandas Issues specific to cudf.pandas feature request New feature or request Needs Triage Need team to review and classify labels Jun 5, 2024

a-hirota changed the title ~~[FEA]~~ [FEA] Support read_fwf functionality in cudf Jun 5, 2024

brandon-b-miller added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 5, 2024

GregoryKimball added 0 - Waiting on Author Waiting for author to respond to review and removed 0 - Backlog In queue waiting for assignment labels Jun 6, 2024

GregoryKimball added 0 - Backlog In queue waiting for assignment and removed 0 - Waiting on Author Waiting for author to respond to review labels Jun 13, 2024

davidwendt mentioned this issue Aug 15, 2024

Performance improvement for strings::slice for wide strings #16574

Merged

3 tasks
