Which is not working as expected
I have a matrix that contains 3 columns and 10,000 elements in total. The first and second columns are indexes and the third column is the score. I want to normalize the score column based on this formula:
Normalized_score_i_j = score_i_j / (sqrt(score_i_i) * sqrt(score_j_j))
score_i_j = the current score itself
score_i_i = take the current score's index in the first column, and in the dataset look for the score whose first and second columns both have that index
score_j_j = take the current score's index in the second column, and in the dataset look for the score whose first and second columns both have that index
For instance, if df is as follows:
df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")
If we are normalizing this row:
First.Protein Second.Protein Score
            4              3    99
The normalized score will be:
The score itself, divided by the square root of the score whose First.Protein and Second.Protein indexes are both 4, multiplied by the square root of the score whose First.Protein and Second.Protein indexes are both 3.
Therefore:
Normalized = 99 / (sqrt(50) * sqrt(91)) = 1.467674
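In R this is simply (an added arithmetic check):

99 / (sqrt(50) * sqrt(91))
# [1] 1.467674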
I have the code below, but it is behaving strangely and giving me values that are not normalized at all:
for (i in 1:nrow(Smith_Waterman_Scores)) {
  Smith_Waterman_Scores$Score[i] <-
    Smith_Waterman_Scores$Score[i] /
    (sqrt(Smith_Waterman_Scores$Score[which(
      Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$First.Protein[i] &
      Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$First.Protein[i])])) *
    (sqrt(Smith_Waterman_Scores$Score[which(
      Smith_Waterman_Scores$First.Protein == Smith_Waterman_Scores$Second.Protein[i] &
      Smith_Waterman_Scores$Second.Protein == Smith_Waterman_Scores$Second.Protein[i])]))
}
Answer by asb for Which is not working as expected
You may be doing this in a very roundabout manner. Can you see if this works for you:
R> xx
   First Second Score
1      1      1    25
2      1      2    90
3      1      3    82
4      1      4    19
5      2      1    90
6      2      2    99
7      2      3    76
8      2      4    79
9      3      1    82
10     3      2    76
11     3      3    91
12     3      4    33
13     4      1    28
14     4      2    11
15     4      3    99
16     4      4    50
R> contingency = xtabs(Score ~ ., data=xx)
R> contingency
     Second
First  1  2  3  4
    1 25 90 82 19
    2 90 99 76 79
    3 82 76 91 33
    4 28 11 99 50
R> diagonals <- unname(diag(contingency))
R> diagonals
[1] 25 99 91 50
R> normalize <- function (i, j, contingencies, diagonals) {
+     contingencies[i, j] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
+ }
R> normalize(4, 3, contingency, diagonals)
[1] 1.467674
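To normalize every row at once, one could wrap this helper in mapply() (an added sketch; the ScoreNorm column name is only an illustration):

xx$ScoreNorm <- mapply(normalize, xx$First, xx$Second,
                       MoreArgs = list(contingencies = contingency,
                                       diagonals = diagonals))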
Answer by zx8754 for Which is not working as expected
Loop through rows using apply:
# compute
df$ScoreNorm <- apply(df, 1, function(i){
  i[3] / (
    sqrt(df[ df$First.Protein == i[1] &
             df$Second.Protein == i[1], "Score"]) *
    sqrt(df[ df$First.Protein == i[2] &
             df$Second.Protein == i[2], "Score"])
  )
})

# test output
df[15, ]
#    First.Protein Second.Protein Score ScoreNorm
# 15             4              3    99  1.467674
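As an added sanity check (not part of the original answer), rows where the two proteins coincide should normalize to 1, since each score is divided by itself:

# rows with First.Protein == Second.Protein should have ScoreNorm of (essentially) 1
subset(df, First.Protein == Second.Protein)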
Answer by Bulat for Which is not working as expected
You can implement this with joins; here is an example using data.table:
library(data.table)
dt <- data.table(df)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt, "First.Protein")
setkey(dt.lookup, "First.Protein")
colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score1")
dt <- dt[dt.lookup]
setkey(dt, "Second.Protein")
setkey(dt.lookup, "Second.Protein")
colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score2")
dt <- dt[dt.lookup][, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
  , .(First.Protein, Second.Protein, Normalized)]
Just make sure you don't use for loops.
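As an added spot check (assuming dt holds the example data from the question), the final table should reproduce the value worked out by hand:

dt[First.Protein == 4 & Second.Protein == 3]
# Normalized should be ~1.467674, matching the worked example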
Answer by Martin Morgan for Which is not working as expected
Here's a re-write of your original attempt (which() is not necessary; just use the logical vector for sub-setting; with() allows you to refer to variables in the data frame without having to re-type the name of the data.frame -- easier to read but also easier to make a mistake):
orig0 <- function(df) {
  for (i in 1:nrow(df)) {
    df$Score[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] & Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] & Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  df$Score
}
The problem is that Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated. Here's a revision where the original columns are interpreted as 'read-only':
orig1 <- function(df) {
  normalized <- numeric(nrow(df))   # pre-allocate
  for (i in 1:nrow(df)) {
    normalized[i] <- with(df, {
      ii <- First.Protein == First.Protein[i] & Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] & Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }
  normalized
}
I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value
orig2 <- function(df) {
  sapply(seq_len(nrow(df)), function(i) {
    with(df, {
      ii <- First.Protein == First.Protein[i] & Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] & Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  })
}
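Since vapply() is mentioned above, here is the same function with the return type asserted explicitly (an added sketch; orig2v is just an illustrative name):

orig2v <- function(df) {
  vapply(seq_len(nrow(df)), function(i) {
    with(df, {
      ii <- First.Protein == First.Protein[i] & Second.Protein == First.Protein[i]
      jj <- First.Protein == Second.Protein[i] & Second.Protein == Second.Protein[i]
      Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
    })
  }, numeric(1))  # each iteration must return a single numeric value
}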
Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.
Several of the answers attempt to avoid that polynomial scaling. My answer does this using match() on a vector of values; this probably scales as N (each look-up occurs in constant time, and there are N look-ups), which is much better than polynomial.
Create a subset of data with identical first and second proteins
ii = df[df$First.Protein == df$Second.Protein,]
Here's the ijth score from the original data frame
s_ij = df$Score
Look up First.Protein of df in ii and record the score; likewise for Second.Protein
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
The normalized scores are then
> s_ij / (sqrt(s_ii) * sqrt(s_jj))
 [1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
 [8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
This will be fast, using a single call to match() instead of many calls to which() inside a for loop or tests for identity inside an apply() -- both of the latter make N^2 comparisons and so scale very poorly.
I summarized some of the proposed solutions as
f0 <- function(df) {
  contingency = xtabs(Score ~ ., df)
  diagonals <- unname(diag(contingency))
  i <- df$First.Protein
  j <- df$Second.Protein
  idx <- matrix(c(i, j), ncol=2)
  contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}

f1 <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein,]
  s_ij = df$Score
  s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
  s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
  s_ij / (sqrt(s_ii) * sqrt(s_jj))
}

f2 <- function(dt) {
  dt.lookup <- dt[First.Protein == Second.Protein]
  setkey(dt, "First.Protein")
  setkey(dt.lookup, "First.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score1")
  dt <- dt[dt.lookup]
  setkey(dt, "Second.Protein")
  setkey(dt.lookup, "Second.Protein")
  colnames(dt.lookup) <- c("First.Protein", "Second.Protein", "Score2")
  dt[dt.lookup][, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
    , .(First.Protein, Second.Protein, Normalized)]
}

f3 <- function(dt) {
  eq = dt[First.Protein == Second.Protein]
  dt[eq, Score_ii := i.Score, on = "First.Protein"]
  dt[eq, Score_jj := i.Score, on = "Second.Protein"]
  dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
  dt[, c("Score_ii", "Score_jj") := NULL]
}
I know how to programmatically check that the first two generate consistent results; I don't know data.table well enough to get the normalized result out in the same order as the input columns for f2(), so I can't compare it with the others (though they look correct 'by eye'). f3() produces numerically similar but not identical results:
> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalised"]] })  # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalised"]] })
[1] TRUE
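As an added sketch (assuming f2() returns the three columns shown in its definition above), one way to compare it despite the reordering is to sort both results by the protein columns before comparing:

res2 <- as.data.frame(f2(as.data.table(df)))
o0 <- order(df$First.Protein, df$Second.Protein)
o2 <- order(res2$First.Protein, res2$Second.Protein)
all.equal(f0(df)[o0], res2$Normalized[o2])   # expected to report TRUE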
There are performance differences
library(data.table)
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)
library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
with
> microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
Unit: microseconds
    expr      min        lq      mean    median        uq      max neval
  f0(df)  967.117  992.8365 1059.7076 1030.9710 1094.247 2384.360   100
  f1(df)  176.238  192.8610  210.4059  207.8865  219.687  333.260   100
 f2(dt2) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975   100
 f3(dt3) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430   100
The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).
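To see the scaling in practice, one added sketch is to rebuild the data at roughly the size mentioned in the question (100 proteins, 10,000 rows; the sizes and seed are arbitrary) and time a quadratic and a linear implementation side by side:

set.seed(1)
n <- 100                                              # 100 proteins -> 10,000 rows
big <- expand.grid(First.Protein = 1:n, Second.Protein = 1:n)
big$Score <- sample(1:100, nrow(big), replace = TRUE)
system.time(orig2(big))   # the N^2 approach; noticeably slower
system.time(f1(big))      # the match()-based approach; roughly linear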
On reflection, a more straightforward implementation of f1() just looks up the 'diagonal' elements:
f1a <- function(df) {
  ii = df[df$First.Protein == df$Second.Protein, ]
  d = sqrt(ii$Score[order(ii$First.Protein)])
  df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
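A quick added check that this simplification agrees with f1():

identical(f1(df), f1a(df))   # should be TRUE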
Answer by Arun for Which is not working as expected
Here's how I'd approach this using data.table. Hopefully @MartinMorgan finds this easier to understand :-).
require(data.table) # v1.9.6+
dt = as.data.table(df) # or use setDT(df) to convert by reference
eq = dt[First.Protein == Second.Protein]
So far, I've just created a new data.table eq which contains all rows where both columns are equal.
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
Here we add columns Score_ii and Score_jj while joining on columns First.Protein and Second.Protein. That it is a join operation should be clear because of the on= argument. The i. refers to the Score column in the data.table provided in the i-argument (here, eq's Score).
Note that we could use match() here as well. But that wouldn't work if you have to look up directly (and as efficiently) based on more than one column. Using on=, we can extend this quite easily, and it is also much easier to read/understand.
Once we have all the required columns, the task is just to get the final Normalised column (and delete the intermediates if they're not necessary):
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL] # delete if you don't want them
I'll leave out the micro- and milli- second benchmarks as I'm not interested in them.
PS: The columns Score_ii and Score_jj are added above on purpose, under the assumption that you might need them. If you don't want them at all, you can also do:
Score_ii = eq[dt, Score, on = "First.Protein"]  ## -- (1)
Score_jj = eq[dt, Score, on = "Second.Protein"]
(1) reads: for each row in dt, get the matching row in eq while matching on column First.Protein, and extract eq$Score corresponding to that matching row.
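Tying this back to the match() remark above, for a single join column the lookup in (1) is equivalent to (an added illustration; Score_ii_match is just a throwaway name):

Score_ii_match <- eq$Score[match(dt$First.Protein, eq$First.Protein)]
identical(Score_ii, Score_ii_match)   # should be TRUE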
Then, we can directly add the Normalised column as:
dt[, Normalised := Score / sqrt(Score_ii * Score_jj)]
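And as a final added spot check against the worked example from the question:

dt[First.Protein == 4 & Second.Protein == 3,
   .(First.Protein, Second.Protein, Normalised)]
# Normalised should be ~1.467674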