Organize my results in a table like a correlation matrix ~ Discussion of Coding


I need to arrange my correlation results in a matrix. For example, I have a file like this:

trait1  trait2  GeCor   PCor
     a       b  -1.00   0.28
     b       c   0.40   0.45
     d       e  -0.39   0.35
     a       c  -0.39   0.50
     b       d   0.36   0.30
     a       d  -0.35   0.30
     b       e  -0.29   0.36
     a       e   0.26   0.33
     c       e   0.18   0.38
     c       d   0.04   0.31

And I need the file to end up like this:

        a      b       c       d       e
a      -    0.28    0.50    0.30    0.33
b   -1.00     -     0.45    0.30    0.36
c   -0.39   0.40     -      0.31    0.38
d   -0.35   0.36    0.04      -     0.35
e    0.26  -0.29    0.18   -0.39     -

The dash (-) on the diagonal only means that the cell should be left blank. GeCor is the genetic correlation between trait1 and trait2 and should go below the blank diagonal; PCor is the phenotypic correlation between trait1 and trait2 and should go above it. Note that this data set is just a sample; I have many more traits. Suggestions using Unix tools or R are welcome. Thank you very much.

Answer by nullverb for Organize my results in a table like a correlation matrix

Unfortunately, none of the GNU core utilities would solve this problem easily, because the manipulation of your data structure is not a simple rearrangement. Let me describe a generic solution and then give an implementation in Python, which would be the easiest language to write it in (fewest lines of code).

Essentially, if you treat the trait1 and trait2 columns as your matrix indexes, you need to create a matrix of size NxN, where N is the number of unique traits (in the sample, a, b, c, d, e gives N = 5). Then, using your original data structure, cell [a,b] = PCor while cell [b,a] = GeCor. You did not specify any ordering of the rows, so I'll assume none: the position of one row tells us nothing about the next. This means we need hash indexing to store each value under the correct key, i.e. ab = 0.28 and ba = -1.00. Once the entire dictionary structure is built, we simply unroll it (in alphabetical order) and output the final data structure.

A simple functional decomposition (not object oriented) breaks this into three distinct blocks: read in the original table, fill the diagonal cells, and output the new data structure.

Here is a python2 implementation of this:

#!/usr/bin/python2

import fileinput

finalStruct = {}
diagCh = '-'
delimiter = '\t'


# Build initial structure
for line in fileinput.input():
    line = line.rstrip()
    columns = line.split(delimiter)

    # Create first layer of dictionary inside dictionary
    if columns[0] not in finalStruct:
        finalStruct[columns[0]] = {}

    finalStruct[columns[0]][columns[1]] = columns[3]

    if columns[1] not in finalStruct:
        finalStruct[columns[1]] = {}

    finalStruct[columns[1]][columns[0]] = columns[2]

# Add '-' for diagonals
for currKey in finalStruct.keys():
    finalStruct[currKey][currKey] = diagCh


# Output final structure
for rowIndex in sorted(finalStruct.keys()):
    for colIndex in sorted(finalStruct[rowIndex].keys()):
        print finalStruct[rowIndex][colIndex], delimiter,

    print

Suppose this script is saved as solution.py and the input file, input.txt, contains the data without the header line:

a   b   -1.00   0.28
b   c   0.40    0.45
d   e   -0.39   0.35
a   c   -0.39   0.50
b   d   0.36    0.30
a   d   -0.35   0.30
b   e   -0.29   0.36
a   e   0.26    0.33
c   e   0.18    0.38
c   d   0.04    0.31

We can then run the example as:

cat input.txt | ./solution.py
-   0.28    0.50    0.30    0.33
-1.00   -   0.45    0.30    0.36
-0.39   0.40    -   0.31    0.38
-0.35   0.36    0.04    -   0.35
0.26    -0.29   0.18    -0.39   -
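The script above targets Python 2 and prints the matrix without row or column labels. Below is a minimal Python 3 sketch of the same nested-dictionary idea with labels added. It is not the answer's original code: the function names `build_matrix` and `format_matrix` are made up for illustration, and it assumes whitespace-separated input with no header line and a file name passed on the command line.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the nested-dictionary approach, with row/column labels."""
import sys
from collections import defaultdict


def build_matrix(lines, diag="-"):
    """Return (traits, table), where table[row][col] holds the cell text."""
    table = defaultdict(dict)
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated, no header
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        trait1, trait2, gecor, pcor = fields
        table[trait1][trait2] = pcor   # PCor goes above the diagonal
        table[trait2][trait1] = gecor  # GeCor goes below the diagonal
    traits = sorted(table)
    for trait in traits:
        table[trait][trait] = diag
    return traits, table


def format_matrix(traits, table, missing="-", sep="\t"):
    """Render the matrix as delimited text with a header row and row labels."""
    rows = [sep.join([""] + traits)]
    for row in traits:
        cells = [table[row].get(col, missing) for col in traits]
        rows.append(sep.join([row] + cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical usage: python3 matrix3.py input.txt
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(format_matrix(*build_matrix(handle)))
```

Pairs that never occur in the input are printed as the `missing` placeholder instead of shifting the remaining cells to the left, which is what would happen in a layout that only iterates over the keys present in each row.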

Answer by Zheyuan Li for Organize my results in a table like a correlation matrix

I think it is time to post my solution with R.

First, use read.table to read your original data into a data frame, say x. Then:

n <- ceiling(sqrt(2 * nrow(x)))
NAME <- with(x, sort(union(unique(trait1), unique(trait2))))

z <- matrix(NA_real_, n, n, dimnames = list(NAME, NAME))
z[lower.tri(z)] <- with(x, GeCor[order(trait1, trait2)])
z[upper.tri(z)] <- with(x, PCor[order(trait2, trait1)])

#      a     b    c     d    e
#a    NA  0.28 0.50  0.30 0.33
#b -1.00    NA 0.45  0.30 0.36
#c -0.39  0.40   NA  0.31 0.38
#d -0.35  0.36 0.04    NA 0.35
#e  0.26 -0.29 0.18 -0.39   NA

## write to file "z.txt"
write.table(z, file = "z.txt", na = "-", sep = "\t", quote = FALSE)

a   b       c       d       e
a   -       0.28    0.5     0.3     0.33
b   -1      -       0.45    0.3     0.36
c   -0.39   0.4     -       0.31    0.38
d   -0.35   0.36    0.04    -       0.35
e   0.26    -0.29   0.18    -0.39   -
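The first line of the R code recovers the number of traits n from the number of rows in x. As a short check of why that works, write m for nrow(x) and assume, as in the sample, that every unordered pair of distinct traits appears exactly once:

```latex
\begin{align*}
m &= \binom{n}{2} = \frac{n(n-1)}{2}
  \quad\Longrightarrow\quad 2m = n(n-1), \\
(n-1)^2 &< n(n-1) < n^2 \qquad (n \ge 2), \\
n-1 &< \sqrt{2m} < n
  \quad\Longrightarrow\quad \left\lceil \sqrt{2m} \right\rceil = n .
\end{align*}
```

For the sample, m = 10, so sqrt(20) is about 4.47 and n = 5. If some pairs are missing or duplicated, this identity no longer holds. The two fill lines also appear to rely on trait1 sorting before trait2 in every row, as it does in the sample.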

Answer by HubertL for Organize my results in a table like a correlation matrix

Another approach in R, using the data.table package to read the data and dplyr to reshape it:

library(data.table)
df <- fread("trait1  trait2  GeCor   PCor
a       b  -1.00   0.28
b       c   0.40   0.45
d       e  -0.39   0.35
a       c  -0.39   0.50
b       d   0.36   0.30
a       d  -0.35   0.30
b       e  -0.29   0.36
a       e   0.26   0.33
c       e   0.18   0.38
c       d   0.04   0.31")
#or df <- fread("myfile.txt")

library(dplyr)
data.frame(trait1=unique(c(df$trait1,df$trait2)), trait2=unique(c(df$trait1,df$trait2)),cor=NA, stringsAsFactors = FALSE) %>%
   bind_rows(df %>% mutate(trait1=trait1, trait2=trait2, cor=GeCor, GeCor=NULL, PCor=NULL )) %>%
   bind_rows(df %>% mutate(temp=trait2, trait2=trait1, trait1=temp, temp=NULL, cor=PCor, GeCor=NULL, PCor=NULL )) %>%
   arrange(trait1, trait2) -> df_long

print.table(matrix(df_long$cor, nrow=5, dimnames=list(unique(df_long$trait1),unique(df_long$trait1))), na.print='-')

      a     b     c     d     e
a     -  0.28  0.50  0.30  0.33
b -1.00     -  0.45  0.30  0.36
c -0.39  0.40     -  0.31  0.38
d -0.35  0.36  0.04     -  0.35
e  0.26 -0.29  0.18 -0.39     -

Answer by Sundeep for Organize my results in a table like a correlation matrix

With perl

$ perl -ae '
if($. > 1)
{
    $h{"$F[0]$F[1]"} = $F[3]; $h{"$F[1]$F[0]"} = $F[2];
    push(@hh,$F[0]) if !$done{$F[0]}++;
    push(@hh,$F[1]) if !$done{$F[1]}++;
}
END
{
    print "\t".join("\t",sort @hh);
    foreach (sort keys %h)
    {
        ($k1,$k2) = /./g;
        print "\n$k1" if !$seen{$k1}++;
        print "\t-" if $k2 eq ++$k1;
        print "\t$h{$_}";
    }
    print "\t-\n";
}' ip.txt
    a   b   c   d   e
a   -   0.28    0.50    0.30    0.33
b   -1.00   -   0.45    0.30    0.36
c   -0.39   0.40    -   0.31    0.38
d   -0.35   0.36    0.04    -   0.35
e   0.26    -0.29   0.18    -0.39   -

Each input line is split on whitespace and saved to the @F array
If the input line number is greater than one (i.e. past the header), the 3rd and 4th columns are saved in a hash, keyed by the two orderings of the 1st and 2nd columns
All unique 1st and 2nd column values are also saved in an array
At the end, everything is printed in the required format
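One caveat worth noting: the one-liner concatenates the two trait names into a single key and later splits that key back into single characters, so it appears to rely on trait names being one letter long and consecutive. Since the question mentions many more traits, here is a hedged Python 3 sketch of the same steps that keys the hash on (row, column) tuples instead. The function names and the trait names in the usage example are hypothetical.

```python
def read_pairs(lines):
    """Map (row, col) tuples to cell text, skipping the header line."""
    cells = {}
    names = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if number == 1 or len(fields) < 4:
            continue  # header line, blank line, or short line
        trait1, trait2, gecor, pcor = fields[:4]
        cells[(trait1, trait2)] = pcor   # 4th column, above the diagonal
        cells[(trait2, trait1)] = gecor  # 3rd column, below the diagonal
        names.update((trait1, trait2))
    return sorted(names), cells


def render(names, cells, blank="-"):
    """Return the matrix as a list of tab-separated lines."""
    lines = ["\t".join([""] + names)]
    for row in names:
        values = [cells.get((row, col), blank) for col in names]
        lines.append("\t".join([row] + values))
    return lines
```

For example, `print("\n".join(render(*read_pairs(open("ip.txt")))))` would print the matrix for the same input file. Because the diagonal is simply any cell with no stored value, names such as `height` or `weight` work without further changes.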

Answer by Ed Morton for Organize my results in a table like a correlation matrix

With GNU awk for sorted_in:

$ cat tst.awk
NR>1 {
    cell[$2,$1] = $3
    cell[$1,$2] = $4
    keys[$1]
    keys[$2]
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"

    printf "%5s", ""
    for (col in keys) {
        printf "  %-5s", col
    }
    print ""

    for (row in keys) {
        printf "%s", row
        for (col in keys) {
            printf "  %5s", ((row,col) in cell ? cell[row,col] : "- ")
        }
        print ""
    }
}

$ awk -f tst.awk file
       a      b      c      d      e
a     -    0.28   0.50   0.30   0.33
b  -1.00     -    0.45   0.30   0.36
c  -0.39   0.40     -    0.31   0.38
d  -0.35   0.36   0.04     -    0.35
e   0.26  -0.29   0.18  -0.39     -



Monday, October 3, 2016