R: Converting Column Values into Their Own Binary Encoded Columns ~ Discussion of Coding

R: Converting Column Values into Their Own Binary Encoded Columns

I have a number of CSV files with columns such as gender, age, diagnosis, etc.

Currently, they are coded as such:

ID, gender, age, diagnosis  1,  male,   42,  asthma  1,  male,   42,  anxiety  2,  male,   19,  asthma  3,  female, 23,  diabetes  4,  female, 61,  diabetes  4,  female, 61,  copd

The goal is to transform this data into this target format:

Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female.'

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd  1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0  2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0  3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0  4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1

I've attempted using reshape2's dcast() function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)    ID, male19, male23, male42, male61, female19, female23, female42, female61  1,  0,      0,      1,      0,      0,        0,        0,        0  2,  1,      0,      0,      0,      0,        0,        0,        0  3,  0,      0,      0,      0,      0,        1,        0,        0  4,  0,      0,      0,      0,      0,        0,        0,        1

Seeing as this is a fairly common task in machine learning data preparation; I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.

Answer by David Arenburg for R: Converting Column Values into Their Own Binary Encoded Columns

You need a melt/dcast combination here (which called recast) in order to convert all columns into one column and avoid combinations

library(reshape2)  recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)  #   ID 19 23 42 61 anxiety asthma copd diabetes female male  # 1  1  0  0  1  0       1      1    0        0      0    1  # 2  2  1  0  0  0       0      1    0        0      0    1  # 3  3  0  1  0  0       0      0    0        1      1    0  # 4  4  0  0  0  1       0      0    1        1      1    0

As per your Sidenote, you can add variable here in order to get the names added too

recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)  #   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd  # 1  1             0           1      0      0      1      0                 1                1              0  # 2  2             0           1      1      0      0      0                 0                1              0  # 3  3             1           0      0      1      0      0                 0                0              0  # 4  4             1           0      0      0      0      1                 0                0              1  #   diagnosis_diabetes  # 1                  0  # 2                  0  # 3                  1  # 4                  1

Answer by Jaehyeon Kim for R: Converting Column Values into Their Own Binary Encoded Columns

Below is a slightly longer way with dcast() and merge(). As gender and age are not unique by ID, a function is created to turn its length into a dummy variable (dum()). On the other hand, diagnosis is set to be counted uniquely by adjusting the formula.

library(reshape2)  data.raw <- read.table(header = T, sep = ",", text = "  id, gender, age, diagnosis  1,  male,   42,  asthma  1,  male,   42,  anxiety  2,  male,   19,  asthma  3,  female, 23,  diabetes  4,  female, 61,  diabetes  4,  female, 61,  copd")    # function to create a dummy variable  dum <- function(x) { if(length(x) > 0) 1 else 0 }    # length of dignosis by id, gender and age  diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]    # length of gender by id  gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)    # length of age by id  age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)    merge(merge(gen, age, by = "id"), diag, by = "id")  #  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes  #1  1        0      1  0  0  1  0         1        1      0          0  #2  2        0      1  1  0  0  0         0        1      0          0  #3  3        1      0  0  1  0  0         0        0      0          1  #4  4        1      0  0  0  0  1         0        0      1          1

Actually I'm not well aware of your model but your setting may be too much as R handles factors by the formula object. For example, if gender is the response, the following matrix will be generated within R. Therefore, as long as you are not going to fit on your own, it'd suffice to set data types and formula appropriately.

data.raw$age <- as.factor(data.raw$age)  model.matrix(gender ~ ., data = data.raw[,-1])  #(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes  #1           1     0     1     0                 1               0                   0  #2           1     0     1     0                 0               0                   0  #3           1     0     0     0                 1               0                   0  #4           1     1     0     0                 0               0                   1  #5           1     0     0     1                 0               0                   1  #6           1     0     0     1                 0               1                   0

If you need all the levels of each variable you can do this by suppressing the intercept in model.matrix and using a wee trick from all-levels-of-a-factor-in-a-model-matrix-in-r

#  Using Akrun's df1, first change all variables, except ID, to factor  df1[-1] <- lapply(df1[-1], factor)    # Use model.matrix to derive dummy coding  m <- data.frame(model.matrix( ~ 0 + . , data=df1,                contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))    # Collapse to give final solution  aggregate(. ~ ID, data=m, max)

Answer by Jota for R: Converting Column Values into Their Own Binary Encoded Columns

Using reshape from base R:

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")  a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")  g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")      new.dat <- cbind(ID=d["ID"],      g[,grepl("_", names(g))],      a[,grepl("_", names(a))],      d[,grepl("_", names(d))])    # convert factors columns to character (if necessary)  # taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231  new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)    new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0  new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1    #  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma  #1  1           1             0      1      0      0      0                1  #3  2           1             0      0      1      0      0                1  #4  3           0             1      0      0      1      0                0  #5  4           0             1      0      0      0      1                0  #  diagnosis_anxiety diagnosis_diabetes diagnosis_copd  #1                 1                  0              0  #3                 0                  0              0  #4                 0                  1              0  #5                 0                  1              1

Answer by Steven Beaupré for R: Converting Column Values into Their Own Binary Encoded Columns

There is a function in the caret package to "dummify" data.

library(caret)  library(dplyr)  predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)

Answer by akrun for R: Converting Column Values into Their Own Binary Encoded Columns

A base R option would be

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L   #     values   #ID  19 23 42 61 anxiety asthma copd diabetes female male   # 1  0  0  1  0       1      1    0        0      0    1   # 2  1  0  0  0       0      1    0        0      0    1   # 3  0  1  0  0       0      0    0        1      1    0   # 4  0  0  0  1       0      0    1        1      1    0

If you need the original name as well

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L   #   Val   #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma   #1      0      0      1      0                 1                1   #2      1      0      0      0                 0                1   #3      0      1      0      0                 0                0   #4      0      0      0      1                 0                0   #  Val   #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male   #1              0                  0             0           1   #2              0                  0             0           1   #3              0                  1             1           0   #4              1                  1             1           0

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male",   "male", "male", "female", "female", "female"), age = c(42L, 42L,   19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma",   "diabetes", "diabetes", "copd")), .Names = c("ID", "gender",   "age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")

Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

Discussion of Coding

Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Friday, October 28, 2016