R: Converting Column Values into Their Own Binary Encoded Columns
R: Converting Column Values into Their Own Binary Encoded Columns
I have a number of CSV files with columns such as gender, age, diagnosis, etc.
Currently, they are coded as such:
ID, gender, age, diagnosis 1, male, 42, asthma 1, male, 42, anxiety 2, male, 19, asthma 3, female, 23, diabetes 4, female, 61, diabetes 4, female, 61, copd
The goal is to transform this data into this target format:
Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female.'
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0 2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0 3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0 4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
I've attempted using reshape2's dcast()
function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:
data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length) ID, male19, male23, male42, male61, female19, female23, female42, female61 1, 0, 0, 1, 0, 0, 0, 0, 0 2, 1, 0, 0, 0, 0, 0, 0, 0 3, 0, 0, 0, 0, 0, 1, 0, 0 4, 0, 0, 0, 0, 0, 0, 0, 1
Seeing as this is a fairly common task in machine learning data preparation; I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.
Answer by David Arenburg for R: Converting Column Values into Their Own Binary Encoded Columns
You need a melt
/dcast
combination here (which called recast
) in order to convert all columns into one column and avoid combinations
library(reshape2) recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L) # ID 19 23 42 61 anxiety asthma copd diabetes female male # 1 1 0 0 1 0 1 1 0 0 0 1 # 2 2 1 0 0 0 0 1 0 0 0 1 # 3 3 0 1 0 0 0 0 0 1 1 0 # 4 4 0 0 0 1 0 0 1 1 1 0
As per your Sidenote, you can add variable
here in order to get the names added too
recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L) # ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd # 1 1 0 1 0 0 1 0 1 1 0 # 2 2 0 1 1 0 0 0 0 1 0 # 3 3 1 0 0 1 0 0 0 0 0 # 4 4 1 0 0 0 0 1 0 0 1 # diagnosis_diabetes # 1 0 # 2 0 # 3 1 # 4 1
Answer by Jaehyeon Kim for R: Converting Column Values into Their Own Binary Encoded Columns
Below is a slightly longer way with dcast()
and merge()
. As gender and age are not unique by ID, a function is created to turn its length into a dummy variable (dum()
). On the other hand, diagnosis is set to be counted uniquely by adjusting the formula.
library(reshape2) data.raw <- read.table(header = T, sep = ",", text = " id, gender, age, diagnosis 1, male, 42, asthma 1, male, 42, anxiety 2, male, 19, asthma 3, female, 23, diabetes 4, female, 61, diabetes 4, female, 61, copd") # function to create a dummy variable dum <- function(x) { if(length(x) > 0) 1 else 0 } # length of dignosis by id, gender and age diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)] # length of gender by id gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum) # length of age by id age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum) merge(merge(gen, age, by = "id"), diag, by = "id") # id female male 19 23 42 61 anxiety asthma copd diabetes #1 1 0 1 0 0 1 0 1 1 0 0 #2 2 0 1 1 0 0 0 0 1 0 0 #3 3 1 0 0 1 0 0 0 0 0 1 #4 4 1 0 0 0 0 1 0 0 1 1
Actually I'm not well aware of your model but your setting may be too much as R handles factors by the formula object. For example, if gender is the response, the following matrix will be generated within R. Therefore, as long as you are not going to fit on your own, it'd suffice to set data types and formula appropriately.
data.raw$age <- as.factor(data.raw$age) model.matrix(gender ~ ., data = data.raw[,-1]) #(Intercept) age23 age42 age61 diagnosis asthma diagnosis copd diagnosis diabetes #1 1 0 1 0 1 0 0 #2 1 0 1 0 0 0 0 #3 1 0 0 0 1 0 0 #4 1 1 0 0 0 0 1 #5 1 0 0 1 0 0 1 #6 1 0 0 1 0 1 0
If you need all the levels of each variable you can do this by suppressing the intercept in model.matrix
and using a wee trick from all-levels-of-a-factor-in-a-model-matrix-in-r
# Using Akrun's df1, first change all variables, except ID, to factor df1[-1] <- lapply(df1[-1], factor) # Use model.matrix to derive dummy coding m <- data.frame(model.matrix( ~ 0 + . , data=df1, contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE))) # Collapse to give final solution aggregate(. ~ ID, data=m, max)
Answer by Jota for R: Converting Column Values into Their Own Binary Encoded Columns
Using reshape
from base R:
d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_") a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_") g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_") new.dat <- cbind(ID=d["ID"], g[,grepl("_", names(g))], a[,grepl("_", names(a))], d[,grepl("_", names(d))]) # convert factors columns to character (if necessary) # taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231 new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character) new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0 new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1 # ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma #1 1 1 0 1 0 0 0 1 #3 2 1 0 0 1 0 0 1 #4 3 0 1 0 0 1 0 0 #5 4 0 1 0 0 0 1 0 # diagnosis_anxiety diagnosis_diabetes diagnosis_copd #1 1 0 0 #3 0 0 0 #4 0 1 0 #5 0 1 1
Answer by Steven Beaupré for R: Converting Column Values into Their Own Binary Encoded Columns
There is a function in the caret
package to "dummify" data.
library(caret) library(dplyr) predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)
Answer by akrun for R: Converting Column Values into Their Own Binary Encoded Columns
A base R
option would be
(!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L # values #ID 19 23 42 61 anxiety asthma copd diabetes female male # 1 0 0 1 0 1 1 0 0 0 1 # 2 1 0 0 0 0 1 0 0 0 1 # 3 0 1 0 0 0 0 0 1 1 0 # 4 0 0 0 1 0 0 1 1 1 0
If you need the original name as well
(!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L # Val #ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma #1 0 0 1 0 1 1 #2 1 0 0 0 0 1 #3 0 1 0 0 0 0 #4 0 0 0 1 0 0 # Val #ID diagnosis_copd diagnosis_diabetes gender_female gender_male #1 0 0 0 1 #2 0 0 0 1 #3 0 1 1 0 #4 1 1 1 0
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", "male", "male", "female", "female", "female"), age = c(42L, 42L, 19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd")), .Names = c("ID", "gender", "age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72
0 comments:
Post a Comment