Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread ignores colClasses factor assignment #721

Closed
AmyMikhail opened this issue Jul 7, 2014 · 9 comments
Closed

fread ignores colClasses factor assignment #721

AmyMikhail opened this issue Jul 7, 2014 · 9 comments
Assignees
Labels
Milestone

Comments

@AmyMikhail
Copy link

I'm using the fread function of data.table to read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables look numeric but they are actually factors (various id numbers). In order to avoid incorrect interpretation of these variables when reading into R and also because I have read this further improves the speed of the import, I individually specified colClasses() for each variable in the call to fread. However my assignments are ignored, see below:

# Function to convert dates of a specific format to date:
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )
setClass('myDate')

# Read in the data with column classes and NAs defined:
amrdt = fread("amrdata.csv", na.strings=c("NA", "", " "),
              colClasses=c(
                "isolate.key"="factor",
                "laboratory.id"="factor",
                "patient.id"= "factor",
                "patient.nhs.no"= "factor",
                "patient.dob"= "myDate",
                "patient.age"= "integer",
                "patient.age.in.months"= "integer",
                "patient.sex"= "factor",
                "patient.postcode"= "factor",
                "sender.code"= "factor",
                "ams.sender.code"= "factor",
                "care.trust"= "factor",
                "speciality"= "factor",
                "practice.code"= "factor",
                "lan"= "factor",
                "specimen.date"= "myDate",
                "specimen.site.id"= "factor",
                "organism.id"= "factor",
                "antibiotic.id"= "factor",
                "result.code"= "factor",
                "zone.size"= "integer",
                "mic"= "integer",
                "local.authority"= "factor",
                "repeat.exclusion"= "integer",
                "nhs.region.name"= "factor",
                "yyyy"= "integer",
                "mm"= "integer",
                "dd"= "integer"))

# Shorten the variable names:
> setnames(amrdt, c("iid", "labid", "pid", "nhsno",
                  "dob", "age", "agemon", "sex", "ppostcode",
                  "sender", "amsender", "trust", "speciality", "practice", "lan",
                  "specdate", "spectype", "organism", "antib", "result", "zone", "mic",
                  "localauthority", "repid", "nhsregion", "year", "month", "day"))

Here are the results of fread. Seemingly fread has defaulted to automatic interpretation of colClasses:

str(amrdt)
# Classes ‘data.table’ and 'data.frame':    5000000 obs. of  28 variables:
#  $ iid           : chr  "1" "2" "3" "4" ...
#  $ labid         : chr  "32520" "32520" "32520" "32520" ...
#  $ pid           : chr  "A425581             " "A425581             " "A425581             " "A425581             " ...
#  $ nhsno         : chr  "4250559920" "4250559920" "4250559920" "4250559920" ...
#  $ dob           : chr  "1942-10-09" "1942-10-09" "1942-10-09" "1942-10-09" ...
#  $ age           : int  68 68 68 68 68 68 68 68 68 3 ...
#  $ agemon        : int  NA NA NA NA NA NA NA NA NA NA ...
#  $ sex           : chr  "F" "F" "F" "F" ...
#  $ ppostcode     : chr  "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" ...
#  $ sender        : chr  "ARLNDU    " "ARLNDU    " "ARLNDU    " "ARLNDU    " ...
#  $ amsender      : chr  "RLNDU     " "RLNDU     " "RLNDU     " "RLNDU     " ...
#  $ trust         : chr  "5ND       " "5ND       " "5ND       " "5ND       " ...
#  $ speciality    : chr  "180" "180" "180" "180" ...
#  $ practice      : chr  NA NA NA NA ...
#  $ lan           : chr  "11M028966" "11M028966" "11M028966" "11M028966" ...
#  $ specdate      : chr  "2011-03-10" "2011-03-10" "2011-03-10" "2011-03-10" ...
#  $ spectype      : chr  "T0X000" "T0X000" "T0X000" "T0X000" ...
#  $ organism      : chr  "2542.0000" "2542.0000" "2542.0000" "2542.0000" ...
#  $ antib         : chr  "AUG" "CTX" "CIP" "ERY" ...
#  $ result        : chr  "S" "S" "R" "S" ...
#  $ zone          : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ mic           : chr  NA NA NA NA ...
#  $ localauthority: chr  "00EJ" "00EJ" "00EJ" "00EJ" ...
#  $ repid         : int  0 0 0 0 0 0 0 0 0 0 ...
#  $ nhsregion     : chr  "North East" "North East" "North East" "North East" ...
#  $ year          : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
#  $ month         : int  3 3 3 3 3 3 3 3 3 3 ...
#  $ day           : int  10 10 10 10 10 10 10 10 10 7 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# They are definitely not factors:
is.factor(amrdt$result)
# [1] FALSE

# Here is example data from the head and tail:
head(amrdt)
#    iid labid                  pid      nhsno        dob age agemon sex ppostcode
#1:   1 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#2:   2 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#3:   3 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#4:   4 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#5:   5 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#6:   6 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#        sender   amsender      trust speciality practice       lan   specdate
#1: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#2: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#3: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#4: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#5: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#6: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#    spectype  organism antib result zone mic localauthority repid  nhsregion year
#1:   T0X000 2542.0000   AUG      S    0  NA           00EJ     0 North East 2011
#2:   T0X000 2542.0000   CTX      S    0  NA           00EJ     0 North East 2011
#3:   T0X000 2542.0000   CIP      R    0  NA           00EJ     0 North East 2011
#4:   T0X000 2542.0000   ERY      S    0  NA           00EJ     0 North East 2011
#5:   T0X000 2542.0000   OXA      S    0  NA           00EJ     0 North East 2011
#6:   T0X000 2542.0000   PEN      S    0  NA           00EJ     0 North East 2011
#    month day
#1:     3  10
#2:     3  10
#3:     3  10
#4:     3  10
#5:     3  10
#6:     3  10

tail(amrdt)
#        iid  labid                  pid nhsno        dob age agemon sex ppostcode
#1: 5369819 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#2: 5369820 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#3: 5369821 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#4: 5369822 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#5: 5369823 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#6: 5369824 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#        sender   amsender      trust speciality practice            lan
#1: GM85030    M85030     5M1               600       NA M.11.2021147.S
#2: GM85030    M85030     5M1               600       NA M.11.2021147.S
#3: GM85030    M85030     5M1               600       NA M.11.2021147.S
#4: GM85030    M85030     5M1               600       NA M.11.2021147.S
#5: GM85030    M85030     5M1               600       NA M.11.2021147.S
#6: GM85030    M85030     5M1               600       NA M.11.2021147.S
#      specdate spectype  organism antib result zone mic localauthority repid
#1: 2011-03-31   T7X100 1571.0010   MER      S    0  NA           00CN     0
#2: 2011-03-31   T7X100 1571.0010   NAL      R    0  NA           00CN     0
#3: 2011-03-31   T7X100 1571.0010 NITRO      S    0  NA           00CN     0
#4: 2011-03-31   T7X100 1571.0010   PIP      R    0  NA           00CN     0
#5: 2011-03-31   T7X100 1571.0010  TEMO      S    0  NA           00CN     0
#6: 2011-03-31   T7X100 1571.0010   TMP      R    0  NA           00CN     0
#        nhsregion year month day
#1: West Midlands 2011     3  31
#2: West Midlands 2011     3  31
#3: West Midlands 2011     3  31
#4: West Midlands 2011     3  31
#5: West Midlands 2011     3  31
#6: West Midlands 2011     3  31

I can convert everything to a factor with the following code:

dtnew <- amrdt[, lapply(.SD, as.factor), by=iid]

... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the colClasses() for those few columns that are not supposed to be factors.

From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).

As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a data.table. The reason I have read in a .csv file with the above code is because I couldn't figure out a way to do this (so wrote to .csv with write.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the .csv with fread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a .csv first, that would be a significant time saving.

Many thanks for your help.

@arunsrinivasan
Copy link
Member

Hi @AmyMikhail, thanks for the report.

Just a couple of things on the code you use to convert to factor after reading in. 1) There's no need to group by iid here, IIUC, as you're wanting to convert the entire column to factor. 2) The idiomatic way to convert columns to factor would be to use := or set and modify in-place. The way you're doing it would result in copying the entire data, which of course will be a waste of time.

So, the idiomatic way to do it would be:

cols <- a character vector of all columns that should be factor type
amrdt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

(Although it'd be more appropriate for Matt to comment on your actual post reg. your FR) I think it'd be nice to have factor in colClasses as well. It seems like read.table supports it.

@AmyMikhail
Copy link
Author

Thanks @arunsrinivasan;

Your suggestion has worked for the columns that I want to convert to factors, however I could not get this to work with as.Date, as there seems to be no way of specifying the input date format with this syntax?

I look forward to hearing if colClasses will be expanded.

@arunsrinivasan
Copy link
Member

@AmyMikhail I'm not sure I follow, but couldn't you just do:

amrdt[, (cols) := lapply(.SD, function(x) as.Date(x, <all_other_arguments>)), .SDcols=cols]

where cols is a character vector of columns you'd like converted to objects of type Date.

@mattdowle any thoughts on colClasses expansion?

@AmyMikhail
Copy link
Author

Many thanks @arunsrinivasan; I'm still getting used to the data.table syntax....

Both these changes still take a little while to run (since I have thousands of factor levels); if these variable type assignments can be incorporated directly into fread would it be faster?

@arunsrinivasan
Copy link
Member

I'm quite positive that this FR would be a great addition, and Matt (and therefore data.table) is quite receptive to user requests.

But you should know that, unless you're plotting (where factor might be essential in preserving the order of points plotted), there's generally no need to have factors when working with data.table.

Out of curiosity, could you briefly explain what you do with these factor columns?

@AmyMikhail
Copy link
Author

I'll be using the data.table to create summary tables of counts and proportions by date for different factor groupings, which I'll then work on further with the package surveillance. The surveillance package is quite fussy about the input format, so I will always need dates in the first column and each subsequent column to contain counts of records that are in various subgroups, by date, where each variable name is the subgroup (factor level). This differs a little from the standard output of summary tables with data.table (where subgroups are defined in a second variable rather than becoming the variable names) and I have worked out how to do this with dcast but have no idea how dependent my solution is on the grouping variables being factors.

Admittedly I didn't try my code with the variables in character form as I didn't know this would work - however I also need to keep tabs on the number of factor levels in each variable (as this in itself is a descriptive summary of the data that I need, such as how many hospitals are represented in the dataset, or how many patients don't have NHS numbers etc.) I also need to keep track of the number of NAs and check that I haven't missed any items that need to be re-coded as NAs (the data is not entirely clean).

str, summary etc will not produce summary information for character vectors but I'd be curious to know if there is another way of getting this information, for character vectors?

@fabiogm
Copy link

fabiogm commented Nov 1, 2014

Hi there,

I've stumbled on this issue too. I'm trying to specify colClasses beforehand because I will use the data to train a GBM.

It's possible to convert it manually, for sure, but is specifying colClasses on the roadmap?

@markcoletti
Copy link

Put me down as yet another person that would like to see full support for factors in the fread() colClasses argument.

(And thanks to @arunsrinivasan for the tip for mass converting columns to factors!)

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Nov 30, 2015
@arunsrinivasan arunsrinivasan self-assigned this Dec 22, 2015
@nachti
Copy link
Contributor

nachti commented Jan 28, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants