
AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith' #104

Closed
gawbul opened this issue Feb 8, 2021 · 7 comments
gawbul commented Feb 8, 2021

  • openOmics version: 0.8.4
  • Python version: 3.8.7
  • Operating System: macOS Big Sur 11.2

Description

Trying to run the vignettes from the README:

# Import GENCODE database (from URL)
from openomics.database import GENCODE

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                  file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                  "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                  "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                  "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                  remove_version_num=True,
                  npartitions=5)

# Annotate lncRNAs with GENCODE annotations by gene_id
# (luad_data is the MultiOmics object built in the earlier README steps)
luad_data.LncRNA.annotate_genomics(gencode, index="gene_id",
                                   columns=['feature', 'start', 'end', 'strand', 'tag', 'havana_gene'])

luad_data.LncRNA.annotations.info()

What I Did

I added the text from the vignette above to a file called openomics_test.py and ran the following command:

python openomics_test.py

I received the following output:

Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz
|========================================================================================================================================================================================================| 4.4M/4.4M (100.00%)         0s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.basic.annotation.gtf.gz
|========================================================================================================================================================================================================|  26M/ 26M (100.00%)         7s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.lncRNA_transcripts.fa.gz
|========================================================================================================================================================================================================|  14M/ 14M (100.00%)         3s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.transcripts.fa.gz
|========================================================================================================================================================================================================|  72M/ 72M (100.00%)        15s
INFO:root:<_io.TextIOWrapper name='/Users/stephenmoss/.astropy/cache/download/url/141581d04d4001254d07601dfa7d983b/contents' encoding='UTF-8'>
Traceback (most recent call last):
  File "openomics_test.py", line 34, in <module>
    gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 67, in __init__
    super(GENCODE, self).__init__(path=path, file_resources=file_resources, col_rename=col_rename,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 17, in __init__
    super(SequenceDataset, self).__init__(**kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/base.py", line 39, in __init__
    self.data = self.load_dataframe(file_resources, npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 74, in load_dataframe
    df = read_gtf(file_resources[gtf_file], npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 349, in read_gtf
    result_df = parse_gtf_and_expand_attributes(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 290, in parse_gtf_and_expand_attributes
    result = parse_gtf(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 195, in parse_gtf
    chunk_iterator = dd.read_table(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 659, in read
    return read_pandas(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 464, in read_pandas
    paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/core.py", line 619, in get_fs_token_paths
    path = cls._strip_protocol(urlpath)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/implementations/local.py", line 147, in _strip_protocol
    if path.startswith("file://"):
AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'
gawbul commented Feb 8, 2021

I initially opened an issue with fsspec (fsspec/filesystem_spec#529) but realised it was due to gzip.open here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L89-L90) returning an io.TextIOWrapper object.

fsspec is unable to derive a path from the object (see https://github.com/intake/filesystem_spec/blob/fb406453b6418052f98b64d405bd4e6a4be1def1/fsspec/utils.py#L304-L308), so it returns it in its original state, which, of course, doesn't have a startswith method.
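The failure mode can be reproduced with the standard library alone. A minimal sketch (the file name is made up): gzip.open in text mode returns an io.TextIOWrapper, which has no startswith method for fsspec to call:

```python
import gzip
import io
import os
import tempfile

# Create a small gzipped file to stand in for the downloaded GTF.
path = os.path.join(tempfile.mkdtemp(), "example.gtf.gz")
with gzip.open(path, "wt") as f:
    f.write('chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "X";\n')

# Opening in text mode ("rt") yields an io.TextIOWrapper, not a str path.
handle = gzip.open(path, "rt")
print(isinstance(handle, io.TextIOWrapper))  # True
print(hasattr(handle, "startswith"))         # False -> AttributeError in fsspec
handle.close()
```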

gawbul commented Feb 8, 2021

I also found this, from a few days ago: https://stackoverflow.com/questions/65998183/python-dask-module-error-attributeerror-io-textiowrapper-object-has-no-at. I added a comment asking if they managed to figure it out.

Update: I fixed their issue https://stackoverflow.com/a/66110233/393634.

gawbul commented Feb 8, 2021

So, I've found the issue.

The code here (https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/utils/read_gtf.py#L178-L179) is the problem:

        logging.info(filepath_or_buffer)
        chunk_iterator = dd.read_table(
            filepath_or_buffer,
            sep="\t",
            comment="#",
            names=REQUIRED_COLUMNS,
            skipinitialspace=True,
            skip_blank_lines=True,
            error_bad_lines=True,
            warn_bad_lines=True,
            # chunksize=chunksize,
            engine="c",
            dtype={
                "start": np.int64,
                "end": np.int64,
                "score": np.float32,
                "seqname": str,
            },
            na_values=".",
            converters={"frame": parse_frame})

Specifically, it passes filepath_or_buffer to dd.read_table.

dd is an alias for dask.dataframe, as per the import dask.dataframe as dd statement.

However, the read_table function doesn't take a stream object; it takes a file path or a list of file paths, as per the documentation here: https://docs.dask.org/en/latest/dataframe-api.html?highlight=read_table#dask.dataframe.read_table.

The parameter list defines the following:

urlpath : string or list
    Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

gawbul commented Feb 8, 2021

This might help? https://stackoverflow.com/q/39924518/393634 🤔

Specifically this answer: https://stackoverflow.com/a/46428853/393634.

I don't think we need to do the decompression here: https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/openomics/database/base.py#L80-L106. Perhaps we can just pass the file path through as-is and let dask.dataframe's compression='infer' parameter handle it?

JonnyTran commented Feb 16, 2021

@gawbul Thank you so much for getting to the bottom of this issue. Amazing to see the detective work in action!

Originally, my intention was to have Dask handle the read_table(), since reading a large GTF file with pandas' read_table() can have a huge memory footprint. Just as you've pointed out, it raises AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith' because Dask can't read_table from a text stream produced by uncompressing the gzip at https://github.com/BioMeCIS-Lab/OpenOmics/blob/2e891028d9df0af6ab38b65b05dbdcd7b906cfdd/openomics/database/base.py#L77-L79.

I've just tried dask.dataframe.read_table() with compression="gzip" on the compressed gzip file and it worked beautifully. Thanks for the suggestion!

I'll apply the fix and refactor the code, and close this issue when it's done.

JonnyTran commented Feb 17, 2021

I added this line, openomics/database/base.py#L74, which now lets Dask handle decompression only when dealing with GTF files. Running GENCODE(npartitions=5) now works.

I also added some functionality to parse attributes from GTF files into Dask dataframes in openomics/utils/read_gtf.py. Creating Dask dataframes from GTF files now works, albeit with no speed improvement over creating a pandas dataframe. In the future, I will look into optimizing GTF attribute parsing using ddf.map_partitions(func) or ddf["attributes"].apply(func).

gawbul commented Feb 17, 2021

Awesome 🥳 Glad you managed to get this sorted 🙂
