Replace all zeros in vector by previous non-zero value ~ Discussion of Coding

Replace all zeros in vector by previous non-zero value

Matlab/Octave algorithm example:

 input vector: [ 1 0 2 0 7 7 7 0 5 0 0 0 9 ]  output vector: [ 1 1 2 2 7 7 7 7 5 5 5 5 9 ]

The algorithm is very simple: it goes through the vector and replaces all zeros with the last non-zero value. It seems trivial, and is so when done with a slow for (i=1:length) loop and being able to refer to the previous element (i-1), but looks impossible to be formulated in the fast vectorized form. I tried the merge() and shift() but it only works for the first occurrence of zero, not an arbitrary number of them.

Can it be done in a vectorized form in Octave/Matlab or must C be used for this to have sufficient performance on big amount of data?

Thanks, Pawel

PS: I have another similar slow for-loop algorithm to speed up and it seems generally impossible to refer to previous values in a vectorized form, like an SQL lag() or group by or loop (i-1) would easily do. But Octave/Matlab loops are terribly slow.

Has anyone found a solution to this general problem or is this futile for fundamental Octave/Matlab design reasons?

========== EDIT ===============

Performance benchmark:

==== SOLUTION 1 (slow loop)

in = out = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] ,1 ,100000);  tic; for i=2:length(out) if (out(i)==0) out(i)=out(i-1); endif; endfor; toc;  [in(1:20); out(1:20)] # test to show side by side if ok    Elapsed time is 15.047 seconds.

==== SOLUTION 2 by Dan (~80 times faster)

in = V = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] ,1 ,100000);  tic;  d = double(diff([0,V])>0);  d(find(d(2:end))+1) = find(diff([0,~V])==-1) - find(diff([0,~V])==1);  out = V(cumsum(~~V+d)-1);  toc;  [in(1:20); out(1:20)] # shows it works ok  Elapsed time is 0.188167 seconds.  # 15.047 / 0.188167 = 79.97 times improvement

==== SOLUTION 3 by GameOfThrows (~115 times faster)

in = a = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] ,1 ,100000);  tic;  pada = [a,888];  b = pada(find(pada >0));  bb = b(:,1:end-1);  c = find (pada==0);  d = find(pada>0);  length = d(2:end) - (d(1:end-1));  t = accumarray(cumsum([1,length])',1);  out = R = bb(cumsum(t(1:end-1)));  toc;  Elapsed time is 0.130558 seconds.  # 15.047 / 0.130558 = 115.25 times improvement

==== Magical SOLUTION 4 by Luis Mendo (~250 times faster)

Updated to a neat one-liner

in = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] , 1, 100000);  tic;  out = nonzeros(in).'(cumsum(in~=0));  toc;  Elapsed time is 0.0597501 seconds.  # 15.047 / 0.0597501 = 251.83 times improvement

Dan, GameOfThrows and Luis - I very much appreciate your quick, sharp and effective help with this case. These are great solutions with excellent speed-up. I am amazed such an improvement is possible and I will be posting a second challenge now. I firstly decided to skip it, because I considered it more difficult and beyond reach, but what this evidence shows - I hope I'm wrong again.

Answer by Scott M. for Replace all zeros in vector by previous non-zero value

Vector operations generally assume independence of the individual items. If you have a dependence on an earlier item, then looping is the best way to do it.

Some extra background on matlab: In matlab the operations are typically faster not because of vector operations specifically, but because a vector operation simply does the loop in native C++ code instead of through the interpreter

Answer by GameOfThrows for Replace all zeros in vector by previous non-zero value

I think it is possible, let's start with the basics, you want to capture where number is greater than 0:

 a = [ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] %//Load in Vector   pada = [a,888];  %//Pad A with a random number at the end to help in case the vector ends with a 0   b = pada(find(pada >0)); %//Find where number if bigger than 0   bb = b(:,1:end-1);     %//numbers that are bigger than 0   c = find (pada==0);   %//Index where numbers are 0   d = find(pada>0);     %//Index where numbers are greater than 0   length = d(2:end) - (d(1:end-1));  %//calculate number of repeats needed for each 0 trailing gap.   %//R = [cell2mat(arrayfun(@(x,nx) repmat(x,1,nx), bb, length,'uniformoutput',0))]; %//Repeat the value     ----------EDIT---------    %// Accumarray and cumsum method, although not as nice as Dan's 1 liner   t = accumarray(cumsum([1,length])',1);   R = bb(cumsum(t(1:end-1)));

NOTE: I used arrayfun, but you can use accumarray as well.I think this demonstrates that it is possible to do this in parallel?

R =

Columns 1 through 10

 1     1     2     2     7     7     7     7     5     5

Columns 11 through 13

 5     5     9

TESTs:

a = [ 1 0 2 0 7 7 7 0 5 0 0 0 9 0 0 0 ]    R =

Columns 1 through 10

 1     1     2     2     7     7     7     7     5     5

Columns 11 through 16

 5     5     9     9     9     9

PERFORMANCE:

a = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] ,1,10000); %//Double of 130,000  Arrayfun Method : Elapsed time is 6.840973 seconds.  AccumArray Method : Elapsed time is 2.097432 seconds.

Answer by Dan for Replace all zeros in vector by previous non-zero value

I think is a vectorized solution. Works on your example:

V = [1 0 2 0 7 7 7 0 5 0 0 0 9]  %// This is where the numbers you will repeat lie. You have to cast to a double otherwise later when you try assign numbers to it it caps them at logical 1s  d = double(diff([0,V])>0)  %// find(diff([0,~V])==-1) - find(diff([0,~V])==1) is the length of each zero cluster  d(find(d(2:end))+1) = find(diff([0,~V])==-1) - find(diff([0,~V])==1)  %// ~~V is the same as V ~= 0  V(cumsum(~~V+d)-1)

Answer by Luis Mendo for Replace all zeros in vector by previous non-zero value

The following simple approach does what you want, and is probably very fast:

in = [1 0 2 0 7 7 7 0 5 0 0 0 9];  t = cumsum(in~=0);  u = nonzeros(in);  out = u(t).';

Answer by thewaywewalk for Replace all zeros in vector by previous non-zero value

Here is another solution, using linear interpolation with previous neighbor lookup.

I assume it to be quite fast as well, as there are just look-ups and indexing and no calculations:

in = [1 0 2 0 7 7 7 0 5 0 0 0 9]  mask = logical(in);  idx = 1:numel(in);  in(~mask) = interp1(idx(mask),in(mask),idx(~mask),'previous');  %// out = in

Explanation

You need to create an index vector:

idx = 1:numel(in)  $// = 1 2 3 4 5 ...

And a logical mask, masking all your non-zero values:

mask = logical(in);

This way you get the grid points idx(mask) and grid data in(mask) for the interpolation. The query points idx(~mask) are indices of the zero data. The query data in(~mask) is then "calculated" by next previous neighbor interpolation, so it basically looks in the grid what is the value for the previous grid point. Exactly what you want. Unfortunately the involved functions have a huge overhead for all thinkable cases, thats why it is still slower than Luis Mendo's Answer, though there are no arithmetic calculations involved.

Furthermore one could reduce the overhead of interp1 a little:

F = griddedInterpolant(idx(mask),in(mask),'previous');  in(~mask) = F(idx(~mask));

But there is not too much effect.

in =   %// = out         1     1     2     2     7     7     7     7     5     5     5     5     9

Benchmark

0.699347403200000 %// thewaywewalk  1.329058123200000 %// GameOfThrows  0.408333643200000 %// LuisMendo  1.585014923200000 %// Dan

Code

function [t] = bench()      in = repmat([ 1 0 2 0 7 7 7 0 5 0 0 0 9 ] ,1 ,100000);        % functions to compare      fcns = {          @() thewaywewalk(in);          @() GameOfThrows(in);          @() LuisMendo(in);          @() Dan(in);      };         % timeit      t = zeros(4,1);      for ii = 1:10;          t = t + cellfun(@timeit, fcns);      end      format long  end    function in = thewaywewalk(in)       mask = logical(in);      idx = 1:numel(in);      in(~mask) = interp1(idx(mask),in(mask),idx(~mask),'previous');  end  function out = GameOfThrows(a)       pada = [a,888];      b = pada(find(pada >0));      bb = b(:,1:end-1);      c = find (pada==0);      d = find(pada>0);      length = d(2:end) - (d(1:end-1));      t = accumarray(cumsum([1,length])',1);      out = bb(cumsum(t(1:end-1)));  end  function out = LuisMendo(in)       t = cumsum(in~=0);      u = nonzeros(in);      out = u(t).';  end  function out = Dan(V)       d = double(diff([0,V])>0);      d(find(d(2:end))+1) = find(diff([0,~V])==-1) - find(diff([0,~V])==1);      out = V(cumsum(~~V+d)-1);  end

Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 71

Discussion of Coding

Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Sunday, December 20, 2015