Organize my results in a table like a correlation matrix
I need to arrange my correlation results in a matrix. For example, I have a file like this:

    trait1 trait2 GeCor PCor
    a      b      -1.00 0.28
    b      c       0.40 0.45
    d      e      -0.39 0.35
    a      c      -0.39 0.50
    b      d       0.36 0.30
    a      d      -0.35 0.30
    b      e      -0.29 0.36
    a      e       0.26 0.33
    c      e       0.18 0.38
    c      d       0.04 0.31

And I need the file to end up like this:

           a      b      c      d      e
    a      -   0.28   0.50   0.30   0.33
    b  -1.00      -   0.45   0.30   0.36
    c  -0.39   0.40      -   0.31   0.38
    d  -0.35   0.36   0.04      -   0.35
    e   0.26  -0.29   0.18  -0.39      -

The symbol on the diagonal (-) only indicates that the cell should be left blank. GeCor is the genetic correlation between trait1 and trait2; these values should go below the blank diagonal. PCor is the phenotypic correlation between trait1 and trait2; these values should go above the blank diagonal. Note that this data set is just a sample and I have many more traits. I accept programming suggestions for Unix or R. Thank you very much.
Answer by nullverb for Organize my results in a table like a correlation matrix
Unfortunately, none of the GNU core utilities solves this problem easily, because the manipulation of your data structure is not a simple rearrangement. Let me offer a generic solution, and then give you an implementation in Python, which is probably the easiest language to implement it in (fewest lines of code).
Essentially, if you treat the trait1 and trait2 columns as your matrix indexes, you need to create a matrix of size NxN, with N being the number of unique traits (i.e. a, b, c, d, e gives N = 5). Then, using your original data structure, cell [a,b] = PCor while cell [b,a] = GeCor. You never specified any ordering of the input, so I'll assume none. Therefore, we cannot rely on any relationship between the order of one row and another. This means we need to use hash indexing to store the correct values under the correct keys, i.e. ab = 0.28 and ba = -1.00. Once we have built the entire dictionary structure, we simply unroll it (in alphabetical order) and output the final data structure.
A simple functional (non-object-oriented) decomposition breaks this into three distinct blocks: read in the original table, fill the diagonal cells, and output the new data structure.
Here is a python2 implementation of this:

    #!/usr/bin/python2
    import fileinput

    finalStruct = {}
    diagCh = '-'
    delimiter = '\t'

    # Build initial structure
    for line in fileinput.input():
        line = line.rstrip()
        columns = line.split(delimiter)
        # Create first layer of dictionary inside dictionary
        if columns[0] not in finalStruct:
            finalStruct[columns[0]] = {}
        finalStruct[columns[0]][columns[1]] = columns[3]
        if columns[1] not in finalStruct:
            finalStruct[columns[1]] = {}
        finalStruct[columns[1]][columns[0]] = columns[2]

    # Add '-' for diagonals
    for currKey in finalStruct.keys():
        finalStruct[currKey][currKey] = diagCh

    # Output final structure
    for rowIndex in sorted(finalStruct.keys()):
        for colIndex in sorted(finalStruct[rowIndex].keys()):
            print finalStruct[rowIndex][colIndex], delimiter,
        print

If this code is saved as solution.py, and our input file input.txt (tab-delimited, with the header row removed, since the script splits on tabs and does not skip a header) contains:

    a b -1.00 0.28
    b c 0.40 0.45
    d e -0.39 0.35
    a c -0.39 0.50
    b d 0.36 0.30
    a d -0.35 0.30
    b e -0.29 0.36
    a e 0.26 0.33
    c e 0.18 0.38
    c d 0.04 0.31

then we can run this example as:

    cat input.txt | ./solution.py
    -      0.28   0.50   0.30   0.33
    -1.00  -      0.45   0.30   0.36
    -0.39  0.40   -      0.31   0.38
    -0.35  0.36   0.04   -      0.35
    0.26   -0.29  0.18   -0.39  -

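For readers on Python 3, the same three steps (read the table, leave the diagonal blank, print the result) could be sketched roughly as below. This is only an illustration, not the answer's own code: the function names `build_matrix` and `format_matrix` are hypothetical, and it assumes whitespace-separated input with a header row. Unlike the script above, it also prints the row and column labels.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the nested-dictionary approach (function names are illustrative)."""
import io


def build_matrix(lines, has_header=True):
    """Return {row: {col: value}} with PCor above and GeCor below the diagonal."""
    matrix = {}
    for i, line in enumerate(lines):
        fields = line.split()  # assumption: any whitespace separates columns
        if not fields or (has_header and i == 0):
            continue  # skip blank lines and the header row
        t1, t2, gecor, pcor = fields[:4]
        matrix.setdefault(t1, {})[t2] = pcor   # cell [trait1, trait2] = PCor
        matrix.setdefault(t2, {})[t1] = gecor  # cell [trait2, trait1] = GeCor
    return matrix


def format_matrix(matrix, blank="-", sep="\t"):
    """Render the matrix with labels; the diagonal and any missing pair show `blank`."""
    names = sorted(matrix)
    out = [sep.join([""] + names)]
    for row in names:
        cells = [matrix[row].get(col, blank) for col in names]
        out.append(sep.join([row] + cells))
    return "\n".join(out)


# Sample data taken from the question.
sample = io.StringIO("""trait1 trait2 GeCor PCor
a b -1.00 0.28
b c 0.40 0.45
d e -0.39 0.35
a c -0.39 0.50
b d 0.36 0.30
a d -0.35 0.30
b e -0.29 0.36
a e 0.26 0.33
c e 0.18 0.38
c d 0.04 0.31
""")
print(format_matrix(build_matrix(sample)))
```

To read a real file instead of the embedded sample, pass an open file object to `build_matrix`, for example `build_matrix(open("input.txt"))`.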
Answer by Zheyuan Li for Organize my results in a table like a correlation matrix
I think it is time to post my solution with R.
First, with read.table, read your original data into a data frame, say x. Then

    n <- ceiling(sqrt(2 * nrow(x)))
    NAME <- with(x, sort(union(unique(trait1), unique(trait2))))
    z <- matrix(NA_real_, n, n, dimnames = list(NAME, NAME))
    z[lower.tri(z)] <- with(x, GeCor[order(trait1, trait2)])
    z[upper.tri(z)] <- with(x, PCor[order(trait2, trait1)])

    #      a     b    c     d    e
    #a    NA  0.28 0.50  0.30 0.33
    #b -1.00    NA 0.45  0.30 0.36
    #c -0.39  0.40   NA  0.31 0.38
    #d -0.35  0.36 0.04    NA 0.35
    #e  0.26 -0.29 0.18 -0.39   NA

    ## write to file "z.txt"
    write.table(z, file = "z.txt", na = "-", sep = "\t", quote = FALSE)

           a      b      c      d      e
    a      -   0.28    0.5    0.3   0.33
    b     -1      -   0.45    0.3   0.36
    c  -0.39    0.4      -   0.31   0.38
    d  -0.35   0.36   0.04      -   0.35
    e   0.26  -0.29   0.18  -0.39      -

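The expression `ceiling(sqrt(2 * nrow(x)))` recovers the number of traits from the number of rows: with n traits and every pair listed once there are m = n(n-1)/2 rows, and sqrt(2m) lies strictly between n-1 and n for n >= 2, so rounding up gives n. Because the triangles are filled by position, the code also appears to rely on every pair being present exactly once, with trait1 sorting before trait2. The Python sketch below checks both points; the helper names `traits_from_pairs` and `is_complete` are made up for this illustration.

```python
import math
from itertools import combinations


def traits_from_pairs(m):
    """Number of traits n with n * (n - 1) / 2 == m, mirroring ceiling(sqrt(2 * m))."""
    return math.ceil(math.sqrt(2 * m))


def is_complete(pairs):
    """True if every unordered pair of traits appears exactly once, smaller name first."""
    names = sorted({t for pair in pairs for t in pair})
    return sorted(pairs) == list(combinations(names, 2))


# The ceiling formula agrees with the exact count for 2..50 traits.
for n in range(2, 51):
    assert traits_from_pairs(n * (n - 1) // 2) == n

# (trait1, trait2) pairs from the question's sample, in file order.
pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("a", "c"), ("b", "d"),
         ("a", "d"), ("b", "e"), ("a", "e"), ("c", "e"), ("c", "d")]
print(traits_from_pairs(len(pairs)), is_complete(pairs))
```

If `is_complete` returns False for your data, a key-based approach (storing each value under its pair of trait names) is likely the safer choice than a positional fill.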
Answer by HubertL for Organize my results in a table like a correlation matrix
Another approach in R, using the packages data.table (for reading the data) and dplyr:

    library(data.table)
    df <- fread("trait1 trait2 GeCor PCor
    a b -1.00 0.28
    b c 0.40 0.45
    d e -0.39 0.35
    a c -0.39 0.50
    b d 0.36 0.30
    a d -0.35 0.30
    b e -0.29 0.36
    a e 0.26 0.33
    c e 0.18 0.38
    c d 0.04 0.31")
    #or df <- fread("myfile.txt")

    library(dplyr)
    data.frame(trait1=unique(c(df$trait1,df$trait2)),
               trait2=unique(c(df$trait1,df$trait2)),cor=NA, stringsAsFactors = FALSE) %>%
      bind_rows(df %>% mutate(trait1=trait1, trait2=trait2, cor=GeCor, GeCor=NULL, PCor=NULL )) %>%
      bind_rows(df %>% mutate(temp=trait2, trait2=trait1, trait1=temp, temp=NULL, cor=PCor, GeCor=NULL, PCor=NULL )) %>%
      arrange(trait1, trait2) -> df_long

    print.table(matrix(df_long$cor, nrow=5, dimnames=list(unique(df_long$trait1),unique(df_long$trait1))), na.print='-')

          a     b     c     d     e
    a     -  0.28  0.50  0.30  0.33
    b -1.00     -  0.45  0.30  0.36
    c -0.39  0.40     -  0.31  0.38
    d -0.35  0.36  0.04     -  0.35
    e  0.26 -0.29  0.18 -0.39     -

Answer by Sundeep for Organize my results in a table like a correlation matrix
With perl

    $ perl -ae '
    if($. > 1) {
        $h{"$F[0]$F[1]"} = $F[3];
        $h{"$F[1]$F[0]"} = $F[2];
        push(@hh,$F[0]) if !$done{$F[0]}++;
        push(@hh,$F[1]) if !$done{$F[1]}++;
    }
    END {
        print "\t".join("\t",sort @hh);
        foreach (sort keys %h) {
            ($k1,$k2) = /./g;
            print "\n$k1" if !$seen{$k1}++;
            print "\t-" if $k2 eq ++$k1;
            print "\t$h{$_}";
        }
        print "\t-\n";
    }' ip.txt
            a       b       c       d       e
    a       -       0.28    0.50    0.30    0.33
    b       -1.00   -       0.45    0.30    0.36
    c       -0.39   0.40    -       0.31    0.38
    d       -0.35   0.36    0.04    -       0.35
    e       0.26    -0.29   0.18    -0.39   -

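- Each input line is split on whitespace and saved to the @F array
- If the input line number is greater than one (i.e. past the header), the 3rd and 4th columns are saved in a hash, using the two orderings of the 1st and 2nd columns as keys
- All unique 1st and 2nd column values are also saved in an array
- At the end, everything is printed in the required format
- Note that the keys are split back apart one character at a time (/./g), so this assumes single-character trait names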
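Since the one-liner rebuilds the row and column names by splitting each concatenated key character by character, longer trait names would presumably need a different key. As a rough sketch of one way around that (in Python rather than Perl), the pair of names can be kept together as a tuple key. The function name `pivot` and the trait names `height`, `weight` and `yield` are hypothetical and used only for illustration.

```python
def pivot(rows, blank="-"):
    """rows: iterable of (trait1, trait2, gecor, pcor); returns a list of output rows."""
    cells = {}
    names = set()
    for t1, t2, gecor, pcor in rows:
        cells[(t1, t2)] = pcor   # above the diagonal
        cells[(t2, t1)] = gecor  # below the diagonal
        names.update((t1, t2))
    order = sorted(names)
    table = [[""] + order]       # header row with an empty corner cell
    for r in order:
        # Diagonal cells and pairs absent from the input fall back to `blank`.
        table.append([r] + [cells.get((r, c), blank) for c in order])
    return table


# Hypothetical multi-character trait names with illustrative values.
rows = [("height", "weight", "-1.00", "0.28"),
        ("weight", "yield", "0.40", "0.45")]
for line in pivot(rows):
    print("\t".join(line))
```

Here the `height`/`yield` pair is missing from the input on purpose, so both of its cells are printed as `-` along with the diagonal.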
Answer by Ed Morton for Organize my results in a table like a correlation matrix
With GNU awk for sorted_in:

    $ cat tst.awk
    NR>1 {
        cell[$2,$1] = $3
        cell[$1,$2] = $4
        keys[$1]
        keys[$2]
    }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        printf "%5s", ""
        for (col in keys) {
            printf " %-5s", col
        }
        print ""
        for (row in keys) {
            printf "%s", row
            for (col in keys) {
                printf " %5s", ((row,col) in cell ? cell[row,col] : "- ")
            }
            print ""
        }
    }

    $ awk -f tst.awk file
          a     b     c     d     e
    a     -  0.28  0.50  0.30  0.33
    b -1.00     -  0.45  0.30  0.36
    c -0.39  0.40     -  0.31  0.38
    d -0.35  0.36  0.04     -  0.35
    e  0.26 -0.29  0.18 -0.39     -
