Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Tuesday, February 9, 2016

Fill NA values with the trailing row value times a growth rate?

Fill NA values with the trailing row value times a growth rate?


What would be a good way to populate NA values with the previous value times (1+growth)?

df <- data.frame(year=0:6,                  price1=c(1.1, 2.1, 3.2, 4.8, NA, NA, NA),                   price2=c(1.1, 2.1, 3.2, NA, NA, NA, NA))  growth <- .02  

In this case, I would want the missing values in price1 to be filled with 4.8*1.02, 4.8*1.02^2, and 4.8*1.02^3. Similarly, I would want the missing values in price2 to be filled with 3.2*1.02, 3.2*1.02^2, 3.2*1.02^3, and 3.2*1.02^4.

I've tried this, but I think it needs to be set to repeat somehow (apply?):

library(dplyr)   df %>% mutate(price1=ifelse(is.na(price1),               lag(price1)*(1+growth), price1))  

I'm not using dplyr for anything else (yet), so something from base R or plyr or similar would be appreciated.

Answer by Batanichek for Fill NA values with the trailing row value times a growth rate?


You can try such function

    test <- function(x,n) {        if (!is.na(df[x,n]))    return (df[x,n])        else           return (test(x-1,n)*(1+growth))      }      a=1:nrow(df)      lapply(a, FUN=function(i) test(i,2))    unlist(lapply(a, FUN=function(i) test(i,2)))  

[1] 1.100000 2.100000 3.200000 4.800000 4.896000 4.993920 5.093798

Answer by Ben Bolker for Fill NA values with the trailing row value times a growth rate?


Assuming only trailing NAs:

NAgrow <- function(x,growth=0.02) {      isna <- is.na(x)      lastval <- tail(x[!isna],1)      x[isna] <- lastval*(1+growth)^seq(sum(isna))      return(x)  }  

If there are interior NA values as well this would get a little trickier.

Apply to all columns except the first:

df[-1] <- lapply(df[-1],NAgrow)    ##   year   price1   price2  ## 1    0 1.100000 1.100000  ## 2    1 2.100000 2.100000  ## 3    2 3.200000 3.200000  ## 4    3 4.800000 3.264000  ## 5    4 4.896000 3.329280  ## 6    5 4.993920 3.395866  ## 7    6 5.093798 3.463783  

Answer by cr1msonB1ade for Fill NA values with the trailing row value times a growth rate?


It looks like dplyr can't handle access newly assigned lag values. Here is a solution that should work even if the NA's are in the middle of a column.

df <- apply(    df, 2, function(x){      if(sum(is.na(x)) == 0){return(x)}      ## updated with optimized portion from @josilber      r <- rle(is.na(x))      na.loc <- which(r$values)      b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])      lastValIs <- 1:length(x)      lastValI[is.na(x)] <- b      x[is.na(x)] <-        sapply(which(is.na(x)), function(i){          return(x[lastValIs[i]]*(1 + growth)^(i - lastValIs[i]))        })      return(x)    })  

Answer by josliber for Fill NA values with the trailing row value times a growth rate?


The following solution based on rle works with NA in any position and does not rely on looping to fill in the missing values:

NAgrow.rle <- function(x) {    if (is.na(x[1]))  stop("Can't have NA at beginning")    r <- rle(is.na(x))    na.loc <- which(r$values)    b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])    x[is.na(x)] <- ave(x[b], b, FUN=function(y) y[1]*(1+growth)^seq_along(y))    x  }  df[,-1] <- lapply(df[,-1], NAgrow.rle)  #   year   price1   price2  # 1    0 1.100000 1.100000  # 2    1 2.100000 2.100000  # 3    2 3.200000 3.200000  # 4    3 4.800000 3.264000  # 5    4 4.896000 3.329280  # 6    5 4.993920 3.395866  # 7    6 5.093798 3.463783  

I'll drop in two additional solutions using for loops, one in base R and one in Rcpp:

NAgrow.for <- function(x) {    for (i in which(is.na(x))) {      x[i] <- x[i-1] * (1+growth)    }    x  }    library(Rcpp)  cppFunction(  "NumericVector NAgrowRcpp(NumericVector x, double growth) {    const int n = x.size();    NumericVector y(x);    for (int i=1; i < n; ++i) {      if (R_IsNA(x[i])) {        y[i] = (1.0 + growth) * y[i-1];      }    }    return y;  }")  

The solutions based on rle (crimson and josilber.rle) take about twice as long as the simple solution based on a for loop (josilber.for), and as expected the Rcpp solution is the fastest, running in about 0.002 seconds.

set.seed(144)  big.df <- data.frame(ID=1:100000,                       price1=sample(c(1:10, NA), 100000, replace=TRUE),                       price2=sample(c(1:10, NA), 100000, replace=TRUE))  crimson <- function(df) apply(df[,-1], 2, function(x){    if(sum(is.na(x)) == 0){return(x)}    ## updated with optimized portion from @josilber    r <- rle(is.na(x))    na.loc <- which(r$values)    b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc])    lastValIs <- 1:length(x)    lastValIs[is.na(x)] <- b    x[is.na(x)] <-      sapply(which(is.na(x)), function(i){        return(x[lastValIs[i]]*(1 + growth)^(i - lastValIs[i]))      })    return(x)  })  ggrothendieck <- function(df) {    growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y    lapply(df[,-1], Reduce, f = growthfun, acc = TRUE)  }  josilber.rle <- function(df) lapply(df[,-1], NAgrow.rle)  josilber.for <- function(df) lapply(df[,-1], NAgrow.for)  josilber.rcpp <- function(df) lapply(df[,-1], NAgrowRcpp, growth=growth)  library(microbenchmark)  microbenchmark(crimson(big.df), ggrothendieck(big.df), josilber.rle(big.df), josilber.for(big.df), josilber.rcpp(big.df))  # Unit: milliseconds  #                   expr        min         lq       mean     median         uq         max neval  #        crimson(big.df)  98.447546 131.063713 161.494366 152.477661 183.175840  379.643222   100  #  ggrothendieck(big.df) 437.015693 667.760401 822.530745 817.864707 925.974019 1607.352929   100  #   josilber.rle(big.df)  59.678527 115.220519 132.874030 127.476340 151.665657  262.003756   100  #   josilber.for(big.df)  21.076516  57.479169  73.860913  72.959536  84.846912  178.412591   100  #  josilber.rcpp(big.df)   1.248793   1.894723   2.373469   2.190545   2.697246    5.646878   100  

Answer by G. Grothendieck for Fill NA values with the trailing row value times a growth rate?


A compact base R solution can be obtained using Reduce:

growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y  replace(df, TRUE, lapply(df, Reduce, f = growthfun, acc = TRUE))  

giving:

  year   price1   price2  1    0 1.100000 1.100000  2    1 2.100000 2.100000  3    2 3.200000 3.200000  4    3 4.800000 3.264000  5    4 4.896000 3.329280  6    5 4.993920 3.395866  7    6 5.093798 3.463783  

Note: The data in the question has no non-trailing NA values but if there were some then we could use na.fill from zoo to first replace the trailing NAs with a special value, such as NaN, and look for it instead of NA:

library(zoo)    DF <- as.data.frame(na.fill(df, c(NA, NA, NaN)))  growthfun <- function(x, y) if (is.nan(y)) (1+growth)*x else y  replace(DF, TRUE, lapply(DF, Reduce, f = growthfun, acc = TRUE))  


Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

0 comments:

Post a Comment

Popular Posts

Powered by Blogger.