[python] to_dataframe does not produce sparse data frames #808

Closed
cdiener opened this issue Mar 6, 2019 · 3 comments

@cdiener
Contributor

cdiener commented Mar 6, 2019

Hi,

I noticed that the pandas.SparseDataFrame returned by Table.to_dataframe is not actually sparse. For instance, with the American Gut data:

In [15]: bm = load_table("deblur_125nt_no_blooms.biom")

In [16]: bm
Out[16]: 32954 x 9511 <class 'biom.table.Table'> with 1829490 nonzero entries (0% dense)

In [17]: tab = bm.to_dataframe()

In [19]: type(tab)
Out[19]: pandas.core.sparse.frame.SparseDataFrame

In [20]: tab.density
Out[20]: 1.0

In [21]: tab.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 32954 entries, AACGTAGGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGAAGGCTAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGGTCATCTAGAGTG to TACGGGGGATGCGAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGCAGGCCGAGGTTCAAGTCAGCGGTGAAACCCCCGCGCTCAACGCGGGGCATGCCGTTGATACTGTATCTCTGGAGTA
Columns: 9511 entries, 10317.000012326 to 10317.000038478
dtypes: Sparse[float64, nan](9511)
memory usage: 2.3+ GB

This is basically the memory use of the full dense table, zeros included. The densities of the original table and the SparseDataFrame are also very different (about 0.6%, which biom rounds to 0%, vs. 100%).
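For reference, the numbers above can be checked with a bit of arithmetic. This is only a rough sketch: it assumes 8 bytes per float64 value and ignores index and column overhead, and the variable names are just for illustration.

```python
# Shape and nonzero count as reported by the biom table repr above.
n_obs, n_samples = 32954, 9511
n_nonzero = 1829490

n_cells = n_obs * n_samples

# Memory needed if every cell (zeros included) is stored as a float64.
dense_bytes = n_cells * 8
dense_gib = dense_bytes / 2**30

# Fraction of cells that are actually nonzero.
density = n_nonzero / n_cells

print(f"dense storage: {dense_gib:.2f} GiB")
print(f"density: {density:.2%}")
```

The dense estimate comes out at roughly the 2.3 GB reported by `tab.info()`, which is consistent with every zero being stored explicitly.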

@wasade
Member

wasade commented Mar 6, 2019

Interesting. So, unlike scipy.sparse, pandas expects empty values to be NaN and not zero by default. That's super annoying.

Do you have a fix by chance?

In [1]: import biom

In [2]: print(biom.example_table)
# Constructed from biom file
#OTU ID	S1	S2	S3
O1	0.0	1.0	2.0
O2	3.0	4.0	5.0

In [3]: biom.example_table.to_dataframe()
Out[3]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [4]: biom.example_table.to_dataframe().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [5]: biom.example_table.to_dataframe(dense=True)
Out[5]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [6]: biom.example_table.to_dataframe(dense=True).to_sparse()
Out[6]:
     S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0

In [7]: biom.example_table.to_dataframe(dense=True).to_sparse().info
Out[7]:
<bound method DataFrame.info of      S1   S2   S3
O1  0.0  1.0  2.0
O2  3.0  4.0  5.0>

In [8]: biom.example_table.to_dataframe(dense=True).to_sparse().info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 2 entries, O1 to O2
Data columns (total 3 columns):
S1    2 non-null float64
S2    2 non-null float64
S3    2 non-null float64
dtypes: float64(3)
memory usage: 64.0+ bytes

In [9]: biom.example_table.to_dataframe(dense=True).to_sparse().density
Out[9]: 1.0

@cdiener
Contributor Author

cdiener commented Mar 7, 2019

I think you would just have to set fill_value=0.0 in the SparseSeries constructor. I can try to put together a PR if you'd like.
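A minimal sketch of the effect of the fill value, using the values of `biom.example_table` printed earlier in the thread. This is not the biom implementation: SparseSeries and SparseDataFrame were removed in later pandas releases, so the sketch assumes a pandas version that provides `pd.arrays.SparseArray` and the `DataFrame.sparse` accessor, and `to_sparse_frame` is a hypothetical helper written only for this illustration.

```python
import numpy as np
import pandas as pd

# Values of biom.example_table as printed earlier in this thread.
data = np.array([[0.0, 1.0, 2.0],
                 [3.0, 4.0, 5.0]])
obs_ids = ["O1", "O2"]
samp_ids = ["S1", "S2", "S3"]


def to_sparse_frame(fill_value):
    """Hypothetical helper: build one sparse column per sample."""
    columns = {
        samp: pd.arrays.SparseArray(data[:, j], fill_value=fill_value)
        for j, samp in enumerate(samp_ids)
    }
    return pd.DataFrame(columns, index=obs_ids)


# With NaN as the fill value, zeros count as real values and are stored.
nan_filled = to_sparse_frame(np.nan)

# With 0.0 as the fill value, only the nonzero entries are stored.
zero_filled = to_sparse_frame(0.0)

print("density with NaN fill:", nan_filled.sparse.density)
print("density with 0.0 fill:", zero_filled.sparse.density)
```

With NaN as the fill value, every entry is stored and the density is 1.0, matching the session above. With 0.0 as the fill value, the single zero in the example table is dropped, so the density falls below 1.0.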

@wasade
Member

wasade commented Mar 7, 2019

That would be wonderful, thank you!
