fread ignores colClasses factor assignment #721

AmyMikhail · 2014-07-07T15:32:11Z

I'm using the fread function of data.table to read in a large CSV file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables look numeric but are actually factors (various id numbers). To avoid these variables being misinterpreted when read into R, and because I have read that it further improves the speed of the import, I specified a class for each variable individually through the colClasses argument in the call to fread. However, my assignments are ignored, see below:

# Define a class and a coercion method to convert dates of a specific format to Date
# (the class has to exist before setAs can register a method for it):
setClass('myDate')
setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%Y") )

# Read in the data with column classes and NAs defined:
amrdt = fread("amrdata.csv", na.strings=c("NA", "", " "),
              colClasses=c(
                "isolate.key"="factor",
                "laboratory.id"="factor",
                "patient.id"= "factor",
                "patient.nhs.no"= "factor",
                "patient.dob"= "myDate",
                "patient.age"= "integer",
                "patient.age.in.months"= "integer",
                "patient.sex"= "factor",
                "patient.postcode"= "factor",
                "sender.code"= "factor",
                "ams.sender.code"= "factor",
                "care.trust"= "factor",
                "speciality"= "factor",
                "practice.code"= "factor",
                "lan"= "factor",
                "specimen.date"= "myDate",
                "specimen.site.id"= "factor",
                "organism.id"= "factor",
                "antibiotic.id"= "factor",
                "result.code"= "factor",
                "zone.size"= "integer",
                "mic"= "integer",
                "local.authority"= "factor",
                "repeat.exclusion"= "integer",
                "nhs.region.name"= "factor",
                "yyyy"= "integer",
                "mm"= "integer",
                "dd"= "integer"))

# Shorten the variable names:
setnames(amrdt, c("iid", "labid", "pid", "nhsno",
                  "dob", "age", "agemon", "sex", "ppostcode",
                  "sender", "amsender", "trust", "speciality", "practice", "lan",
                  "specdate", "spectype", "organism", "antib", "result", "zone", "mic",
                  "localauthority", "repid", "nhsregion", "year", "month", "day"))

Here are the results of fread. Seemingly fread has defaulted to automatic interpretation of colClasses:

str(amrdt)
# Classes ‘data.table’ and 'data.frame':    5000000 obs. of  28 variables:
#  $ iid           : chr  "1" "2" "3" "4" ...
#  $ labid         : chr  "32520" "32520" "32520" "32520" ...
#  $ pid           : chr  "A425581             " "A425581             " "A425581             " "A425581             " ...
#  $ nhsno         : chr  "4250559920" "4250559920" "4250559920" "4250559920" ...
#  $ dob           : chr  "1942-10-09" "1942-10-09" "1942-10-09" "1942-10-09" ...
#  $ age           : int  68 68 68 68 68 68 68 68 68 3 ...
#  $ agemon        : int  NA NA NA NA NA NA NA NA NA NA ...
#  $ sex           : chr  "F" "F" "F" "F" ...
#  $ ppostcode     : chr  "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" "DH1 3QQ" ...
#  $ sender        : chr  "ARLNDU    " "ARLNDU    " "ARLNDU    " "ARLNDU    " ...
#  $ amsender      : chr  "RLNDU     " "RLNDU     " "RLNDU     " "RLNDU     " ...
#  $ trust         : chr  "5ND       " "5ND       " "5ND       " "5ND       " ...
#  $ speciality    : chr  "180" "180" "180" "180" ...
#  $ practice      : chr  NA NA NA NA ...
#  $ lan           : chr  "11M028966" "11M028966" "11M028966" "11M028966" ...
#  $ specdate      : chr  "2011-03-10" "2011-03-10" "2011-03-10" "2011-03-10" ...
#  $ spectype      : chr  "T0X000" "T0X000" "T0X000" "T0X000" ...
#  $ organism      : chr  "2542.0000" "2542.0000" "2542.0000" "2542.0000" ...
#  $ antib         : chr  "AUG" "CTX" "CIP" "ERY" ...
#  $ result        : chr  "S" "S" "R" "S" ...
#  $ zone          : num  0 0 0 0 0 0 0 0 0 0 ...
#  $ mic           : chr  NA NA NA NA ...
#  $ localauthority: chr  "00EJ" "00EJ" "00EJ" "00EJ" ...
#  $ repid         : int  0 0 0 0 0 0 0 0 0 0 ...
#  $ nhsregion     : chr  "North East" "North East" "North East" "North East" ...
#  $ year          : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
#  $ month         : int  3 3 3 3 3 3 3 3 3 3 ...
#  $ day           : int  10 10 10 10 10 10 10 10 10 7 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# They are definitely not factors:
is.factor(amrdt$result)
# [1] FALSE

# Here is example data from the head and tail:
head(amrdt)
#    iid labid                  pid      nhsno        dob age agemon sex ppostcode
#1:   1 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#2:   2 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#3:   3 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#4:   4 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#5:   5 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#6:   6 32520 A425581              1234567890 1900-01-01  110     NA   F   XX1 1YY
#        sender   amsender      trust speciality practice       lan   specdate
#1: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#2: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#3: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#4: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#5: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#6: ARLNDU     RLNDU      5ND               180       NA 11M028966 2011-03-10
#    spectype  organism antib result zone mic localauthority repid  nhsregion year
#1:   T0X000 2542.0000   AUG      S    0  NA           00EJ     0 North East 2011
#2:   T0X000 2542.0000   CTX      S    0  NA           00EJ     0 North East 2011
#3:   T0X000 2542.0000   CIP      R    0  NA           00EJ     0 North East 2011
#4:   T0X000 2542.0000   ERY      S    0  NA           00EJ     0 North East 2011
#5:   T0X000 2542.0000   OXA      S    0  NA           00EJ     0 North East 2011
#6:   T0X000 2542.0000   PEN      S    0  NA           00EJ     0 North East 2011
#    month day
#1:     3  10
#2:     3  10
#3:     3  10
#4:     3  10
#5:     3  10
#6:     3  10

tail(amrdt)
#        iid  labid                  pid nhsno        dob age agemon sex ppostcode
#1: 5369819 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#2: 5369820 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#3: 5369821 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#4: 5369822 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#5: 5369823 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#6: 5369824 610740 ZM08395925               1 1950-01-01  61     NA   F        NA
#        sender   amsender      trust speciality practice            lan
#1: GM85030    M85030     5M1               600       NA M.11.2021147.S
#2: GM85030    M85030     5M1               600       NA M.11.2021147.S
#3: GM85030    M85030     5M1               600       NA M.11.2021147.S
#4: GM85030    M85030     5M1               600       NA M.11.2021147.S
#5: GM85030    M85030     5M1               600       NA M.11.2021147.S
#6: GM85030    M85030     5M1               600       NA M.11.2021147.S
#      specdate spectype  organism antib result zone mic localauthority repid
#1: 2011-03-31   T7X100 1571.0010   MER      S    0  NA           00CN     0
#2: 2011-03-31   T7X100 1571.0010   NAL      R    0  NA           00CN     0
#3: 2011-03-31   T7X100 1571.0010 NITRO      S    0  NA           00CN     0
#4: 2011-03-31   T7X100 1571.0010   PIP      R    0  NA           00CN     0
#5: 2011-03-31   T7X100 1571.0010  TEMO      S    0  NA           00CN     0
#6: 2011-03-31   T7X100 1571.0010   TMP      R    0  NA           00CN     0
#        nhsregion year month day
#1: West Midlands 2011     3  31
#2: West Midlands 2011     3  31
#3: West Midlands 2011     3  31
#4: West Midlands 2011     3  31
#5: West Midlands 2011     3  31
#6: West Midlands 2011     3  31

I can convert everything to a factor with the following code:

dtnew <- amrdt[, lapply(.SD, as.factor), by=iid]

... but this is very slow, copies the data (which I want to avoid), and I would still need to fix the classes of those few columns that are not supposed to be factors.

From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).

As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a data.table. The reason I have read in a .csv file with the above code is because I couldn't figure out a way to do this (so wrote to .csv with write.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the .csv with fread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a .csv first, that would be a significant time saving.

Many thanks for your help.

arunsrinivasan · 2014-07-14T22:16:34Z

Hi @AmyMikhail, thanks for the report.

Just a couple of things on the code you use to convert to factor after reading in. 1) There's no need to group by iid here, IIUC, as you're wanting to convert the entire column to factor. 2) The idiomatic way to convert columns to factor would be to use := or set and modify in-place. The way you're doing it would result in copying the entire data, which of course will be a waste of time.

So, the idiomatic way to do it would be:

cols <- c("labid", "pid", "sex", "result")  # e.g.; a character vector of all columns that should be factor type
amrdt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
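
For anyone wanting to try the in-place conversion on something small first, here is a minimal sketch. The toy table `dt` and its values are made up for illustration (only the column names come from the report above), and the `set()` loop is an alternative way of doing the same thing, not something required by the approach.

```r
library(data.table)

# Toy stand-in for amrdt; the values are invented for illustration.
dt <- data.table(labid  = c("32520", "32520", "610740"),
                 result = c("S", "R", "S"),
                 age    = c(68L, 68L, 61L))

cols <- c("labid", "result")

# Option 1: := with .SDcols replaces the listed columns by reference,
# so the table is not copied and 'age' is left untouched.
dt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

# Option 2: set() in a loop does the same thing one column at a time
# (running it after option 1 simply re-creates the same factors).
for (j in cols) set(dt, j = j, value = as.factor(dt[[j]]))

str(dt)
```

Either form avoids building a second copy of the whole table, which is the main cost of the `lapply(.SD, as.factor)` call without `:=`.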

(Although it'd be more appropriate for Matt to comment on your actual post reg. your FR) I think it'd be nice to have factor in colClasses as well. It seems like read.table supports it.

AmyMikhail · 2014-07-15T19:16:47Z

Thanks @arunsrinivasan;

Your suggestion has worked for the columns that I want to convert to factors, however I could not get this to work with as.Date, as there seems to be no way of specifying the input date format with this syntax?

I look forward to hearing if colClasses will be expanded.

arunsrinivasan · 2014-07-15T20:35:22Z

@AmyMikhail I'm not sure I follow, but couldn't you just do:

amrdt[, (cols) := lapply(.SD, function(x) as.Date(x, <all_other_arguments>)), .SDcols=cols]

where cols is a character vector of columns you'd like converted to objects of type Date.
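
As a concrete sketch of the call above, assuming the dates are stored as day/month/year text (the `"%d/%m/%Y"` format from the original report) and using made-up values in a toy table `dt`:

```r
library(data.table)

# Toy data; the date strings are invented and assumed to be dd/mm/yyyy.
dt <- data.table(dob      = c("09/10/1942", "01/01/1950"),
                 specdate = c("10/03/2011", "31/03/2011"))

cols <- c("dob", "specdate")

# Extra arguments after the function name are passed on to as.Date,
# so the input format can be given once for all listed columns.
dt[, (cols) := lapply(.SD, as.Date, format = "%d/%m/%Y"), .SDcols = cols]

str(dt)
```

If the columns are already in ISO form (yyyy-mm-dd), the `format` argument can simply be dropped.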

@mattdowle any thoughts on colClasses expansion?

AmyMikhail · 2014-07-16T10:34:37Z

Many thanks @arunsrinivasan; I'm still getting used to the data.table syntax....

Both these changes still take a little while to run (since I have thousands of factor levels); if these variable type assignments can be incorporated directly into fread would it be faster?

arunsrinivasan · 2014-07-16T15:12:35Z

I'm quite positive that this FR would be a great addition, and Matt (and therefore data.table) is quite receptive to user requests.

But you should know that, unless you're plotting (where factor might be essential in preserving the order of points plotted), there's generally no need to have factors when working with data.table.

Out of curiosity, could you briefly explain what you do with these factor columns?

AmyMikhail · 2014-07-16T19:48:14Z

I'll be using the data.table to create summary tables of counts and proportions by date for different factor groupings, which I'll then work on further with the package surveillance. The surveillance package is quite fussy about the input format, so I will always need dates in the first column and each subsequent column to contain counts of records that are in various subgroups, by date, where each variable name is the subgroup (factor level). This differs a little from the standard output of summary tables with data.table (where subgroups are defined in a second variable rather than becoming the variable names) and I have worked out how to do this with dcast but have no idea how dependent my solution is on the grouping variables being factors.
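
A small sketch of that wide layout, to check whether it depends on factors: it uses data.table's own `dcast` on character grouping columns, with invented values in a toy table `dt`. This is only an illustration of the reshaping step, not the actual surveillance workflow.

```r
library(data.table)

# Toy data; 'result' is deliberately left as character, not factor.
dt <- data.table(
  specdate = as.Date(c("2011-03-10", "2011-03-10", "2011-03-10", "2011-03-31")),
  result   = c("S", "S", "R", "R")
)

# One row per date, one column per subgroup, cells hold record counts.
wide <- dcast(dt, specdate ~ result, value.var = "result",
              fun.aggregate = length)

print(wide)
```

In this sketch the subgroup columns are created from the distinct character values, so the grouping variable does not need to be a factor for the counts to come out.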

Admittedly I didn't try my code with the variables in character form as I didn't know this would work - however I also need to keep tabs on the number of factor levels in each variable (as this in itself is a descriptive summary of the data that I need, such as how many hospitals are represented in the dataset, or how many patients don't have NHS numbers etc.) I also need to keep track of the number of NAs and check that I haven't missed any items that need to be re-coded as NAs (the data is not entirely clean).

str, summary etc will not produce summary information for character vectors but I'd be curious to know if there is another way of getting this information, for character vectors?
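
One possible way to get these summaries without converting to factor, sketched on a toy table with invented values (the names `chr_cols`, `n_levels` and `n_missing` are just for illustration):

```r
library(data.table)

# Toy data with a missing NHS number; values are made up.
dt <- data.table(labid = c("32520", "32520", "610740"),
                 nhsno = c("4250559920", NA, "1"),
                 age   = c(68L, 68L, 61L))

# Pick out the character columns.
chr_cols <- names(dt)[sapply(dt, is.character)]

# Number of distinct non-missing values per column
# (the equivalent of the number of factor levels).
n_levels <- dt[, lapply(.SD, uniqueN, na.rm = TRUE), .SDcols = chr_cols]

# Number of NAs per column.
n_missing <- dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = chr_cols]

# Frequency table for a single column, useful for spotting values
# that still need re-coding as NA.
dt[, .N, by = labid]
```

`uniqueN` is data.table's counterpart to `length(unique(x))`; the per-column frequency table is often the quickest way to spot stray codes such as blank strings.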

fabiogm · 2014-11-01T15:08:15Z

Hi there,

I've stumbled on this issue too. I'm trying to specify colClasses beforehand because I will use the data to train a GBM.

It's possible to convert it manually, for sure, but is specifying colClasses on the roadmap?

markcoletti · 2014-11-18T20:23:22Z

Put me down as yet another person that would like to see full support for factors in the fread() colClasses argument.

(And thanks to @arunsrinivasan for the tip for mass converting columns to factors!)

nachti · 2016-01-28T15:24:45Z

I asked for it already in 2013: http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-October/002178.html

arunsrinivasan added the fread label Sep 4, 2015

arunsrinivasan added this to the v1.9.8 milestone Nov 30, 2015

arunsrinivasan closed this as completed in 86c5b83 Dec 22, 2015

arunsrinivasan self-assigned this Dec 22, 2015
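
For readers arriving after the issue was closed, here is a minimal sketch of the originally requested usage. It assumes a recent data.table release in which `fread` accepts `"factor"` in `colClasses` and has the `text` argument; the two-row CSV is invented, with column names borrowed from the report above.

```r
library(data.table)

# Made-up CSV content; only the column names come from the original report.
csv <- "laboratory.id,result.code,patient.age\n32520,S,68\n610740,R,61\n"

# Named colClasses: names are column names, values are the target classes.
dt <- fread(text = csv,
            colClasses = c(laboratory.id = "factor",
                           result.code   = "factor",
                           patient.age   = "integer"))

str(dt)
```

Columns not named in `colClasses` are still type-detected automatically, so only the id-like columns need to be listed.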
