
Tuesday, August 2, 2016

Pandas: rolling mean by time interval


I'm new to Pandas. I've got a bunch of polling data, and I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand from this question, the rolling_* functions compute the window based on a specified number of values, not a specific datetime range.

Is there a different function that implements this functionality? Or am I stuck writing my own?


Sample input data:

polls_subset.tail(20)
Out[185]:
            favorable  unfavorable  other
enddate
2012-10-25       0.48         0.49   0.03
2012-10-25       0.51         0.48   0.02
2012-10-27       0.51         0.47   0.02
2012-10-26       0.56         0.40   0.04
2012-10-28       0.48         0.49   0.04
2012-10-28       0.46         0.46   0.09
2012-10-28       0.48         0.49   0.03
2012-10-28       0.49         0.48   0.03
2012-10-30       0.53         0.45   0.02
2012-11-01       0.49         0.49   0.03
2012-11-01       0.47         0.47   0.05
2012-11-01       0.51         0.45   0.04
2012-11-03       0.49         0.45   0.06
2012-11-04       0.53         0.39   0.00
2012-11-04       0.47         0.44   0.08
2012-11-04       0.49         0.48   0.03
2012-11-04       0.52         0.46   0.01
2012-11-04       0.50         0.47   0.03
2012-11-05       0.51         0.46   0.02
2012-11-07       0.51         0.41   0.00

Output would have only one row for each date.

EDIT x2: fixed typo

Answer by Zelazny7 for Pandas: rolling mean by time interval

What about something like this:

First resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1 :

pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.527500     0.442500  0.032500
2012-10-27   0.521667     0.451667  0.028333
2012-10-28   0.515833     0.450000  0.035833
2012-10-29   0.488333     0.476667  0.038333
2012-10-30   0.495000     0.470000  0.038333
2012-10-31   0.512500     0.460000  0.029167
2012-11-01   0.516667     0.456667  0.026667
2012-11-02   0.503333     0.463333  0.033333
2012-11-03   0.490000     0.463333  0.046667
2012-11-04   0.494000     0.456000  0.043333
2012-11-05   0.500667     0.452667  0.036667
2012-11-06   0.507333     0.456000  0.023333
2012-11-07   0.510000     0.443333  0.013333

UPDATE: As Ben points out in the comments, with pandas 0.18.0 the syntax has changed. With the new syntax this would be:

df.resample("1D").mean().ffill().rolling(window=3, min_periods=1).mean()
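The resample, forward-fill, and rolling steps described in this answer can be sketched end to end on a small sample. This is only an illustration, assuming a recent pandas version; the `favorable` Series below is a hypothetical subset built from the first rows of the question's data, and the variable names are not from the original answer.

```python
import pandas as pd

# Hypothetical sample: the "favorable" values for 2012-10-25 to 2012-10-30,
# copied from the question's data (no poll ended on 2012-10-29).
dates = pd.to_datetime([
    "2012-10-25", "2012-10-25", "2012-10-26", "2012-10-27",
    "2012-10-28", "2012-10-28", "2012-10-28", "2012-10-28",
    "2012-10-30",
])
favorable = pd.Series(
    [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53],
    index=dates,
    name="favorable",
)

# Step 1: one row per calendar day, averaging polls that share a day.
daily = favorable.resample("D").mean()

# Step 2: days without any poll are NaN after step 1; carry the last value forward.
filled = daily.ffill()

# Step 3: three-row window, which now corresponds to three calendar days.
rolled = filled.rolling(window=3, min_periods=1).mean()
print(rolled.round(6))
```

On this subset the values for 2012-10-25 through 2012-10-30 agree with the `favorable` column of the output shown above.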

Answer by user2689410 for Pandas: rolling mean by time interval

I just had the same question but with irregularly spaced datapoints. Resample is not really an option here. So I created my own function. Maybe it will be useful for others too:

# Note: pd.rolling_mean, pd.datetools, basestring, iterkv and .ix are legacy
# (Python 2 / old pandas) APIs.
from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
             If int is passed, window is the number of observations used for calculating
             the statistic, as defined by the function pd.rolling_mean()
             If a string is passed, it must be a frequency string, e.g. '90S'. This is
             internally converted into a DateOffset object, representing the window size.
    min_periods : int
                  Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
            dslice = col[x-pd.datetools.to_offset(window).delta+timedelta(0,0,1):x]
        else:
            dslice = col[x-pd.datetools.to_offset(window).delta/2+timedelta(0,0,1):
                         x+pd.datetools.to_offset(window).delta/2]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iterkv():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:,0]
    return dfout

# Example
idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 7, 0, 1, 30),
       datetime(2011, 2, 7, 0, 2),
       datetime(2011, 2, 7, 0, 4),
       datetime(2011, 2, 7, 0, 5),
       datetime(2011, 2, 7, 0, 5, 10),
       datetime(2011, 2, 7, 0, 6),
       datetime(2011, 2, 7, 0, 8),
       datetime(2011, 2, 7, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
rm = rolling_mean(s, window='2min')
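For readers on a newer pandas, the same trailing time window can probably be expressed without a custom function. The sketch below assumes pandas 0.19.0 or later, where `rolling()` accepts an offset string on a sorted datetime index; it is not part of the original answer. By default such a window is closed on the right only, so it covers the interval (t - 2min, t], which mirrors the microsecond adjustment in the function above.

```python
import numpy as np
import pandas as pd

# Same irregularly spaced timestamps as in the example above.
idx = pd.to_datetime([
    "2011-02-07 00:00:00", "2011-02-07 00:01:00", "2011-02-07 00:01:30",
    "2011-02-07 00:02:00", "2011-02-07 00:04:00", "2011-02-07 00:05:00",
    "2011-02-07 00:05:10", "2011-02-07 00:06:00", "2011-02-07 00:08:00",
    "2011-02-07 00:09:00",
])
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# A string window is interpreted as a time span rather than a row count.
rm = s.rolling("2min").mean()
print(rm)
```

The index has to be monotonic for an offset window, which is why the timestamps are listed in order.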

Answer by Mark Horvath for Pandas: rolling mean by time interval

user2689410's code was exactly what I needed. Here is my version (credit to user2689410), which is faster because it computes the mean for all columns of the DataFrame at once.

I hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where a name has multiple suffixes, the value can be of either type.

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410.

    Parameters
    ----------
    data_df_ser : DataFrame or Series
         If a DataFrame is passed, the time_offset_rolling_mean_df_ser is computed for all columns.
    window_i_s : int or string
         If int is passed, window_i_s is the number of observations used for calculating
         the statistic, as defined by the function pd.rolling_mean()
         If a string is passed, it must be a frequency string, e.g. '90S'. This is
         internally converted into a DateOffset object, representing the window_i_s size.
    min_periods_i : int
         Minimum number of observations in window_i_s required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """

    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta+timedelta(0,0,1):
                ts
            ]
        else:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta/2+timedelta(0,0,1):
                ts+pd.datetools.to_offset(window_i_s).delta/2
            ]
        if  (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
            (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            return dslice_df_ser.mean()*np.nan   # keeps number format and whether Series or DataFrame
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s, min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)

    return mean_df_ser
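The idea of averaging every column in one pass, combined with the minimum-observations rule, might look like this with the built-in offset windows. This is a sketch assuming pandas 0.19.0 or later, not the author's code; `min_periods=2` is chosen here only to make the effect of the threshold visible.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2011-02-07 00:00:00", "2011-02-07 00:01:00", "2011-02-07 00:01:30",
    "2011-02-07 00:02:00", "2011-02-07 00:04:00", "2011-02-07 00:05:00",
    "2011-02-07 00:05:10", "2011-02-07 00:06:00", "2011-02-07 00:08:00",
    "2011-02-07 00:09:00",
])
ser = pd.Series(np.arange(len(idx), dtype=float), index=idx)
df = pd.DataFrame({"s1": ser, "s2": ser + 1})

# All columns are processed together; a row becomes NaN when its
# 2-minute window holds fewer than two observations.
result = df.rolling("2min", min_periods=2).mean()
print(result)
```

With the default `min_periods=1` for offset windows, the three rows that come out as NaN here (00:00, 00:04 and 00:08) would instead hold the single value in their window, as in the docstring output above.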

Answer by InterwebIsGreat for Pandas: rolling mean by time interval

I found that user2689410's code broke when I tried it with window='1M', because the month-end offset has no delta attribute and threw this error:

AttributeError: 'MonthEnd' object has no attribute 'delta'  

I added the option to pass a relative time delta directly, so you can do similar things for user-defined periods.

Thanks for the pointers, here's my attempt - hope it's of use.

from pandas import Series, DataFrame
import pandas as pd
from datetime import timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

def rolling_mean(data, window, min_periods=1, center=False):
    """ Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int, string, Timedelta or Relativedelta
             int - number of observations used for calculating the statistic,
                   as defined by the function pd.rolling_mean()
             string - must be a frequency string, e.g. '90S'. This is
                      internally converted into a DateOffset object, and then
                      Timedelta representing the window size.
             Timedelta / Relativedelta - a time delta can be passed directly.
    min_periods : int
                  Minimum number of observations in window required to have a value.
    center : bool
             Point around which to 'center' the slicing.

    Returns
    -------
    Series or DataFrame, if more than one column
    """
    def f(x, time_increment):
        """Function to apply that actually computes the rolling mean
        :param x:
        :return:
        """
        if not center:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            start_date = x - time_increment + timedelta(0, 0, 1)
            end_date = x
        else:
            start_date = x - time_increment/2 + timedelta(0, 0, 1)
            end_date = x + time_increment/2
        # Select the rows of the column that fall inside the window
        dslice = col[start_date:end_date]

        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)

    elif isinstance(window, basestring):
        time_delta = pd.datetools.to_offset(window).delta
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')

    elif isinstance(window, (timedelta, relativedelta)):
        time_delta = window
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')

    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout

And the example with a 3 day time window to calculate the mean:

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 8, 0, 1, 30),
       datetime(2011, 2, 9, 0, 2),
       datetime(2011, 2, 10, 0, 4),
       datetime(2011, 2, 11, 0, 5),
       datetime(2011, 2, 12, 0, 5, 10),
       datetime(2011, 2, 12, 0, 6),
       datetime(2011, 2, 13, 0, 8),
       datetime(2011, 2, 14, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
# Now try by passing the 3 days as a relative time delta directly.
rm = rolling_mean(s, window=relativedelta(days=3))

>>> rm
Out[2]:
2011-02-07 00:00:00    0.0
2011-02-07 00:01:00    0.5
2011-02-08 00:01:30    1.0
2011-02-09 00:02:00    1.5
2011-02-10 00:04:00    3.0
2011-02-11 00:05:00    4.0
2011-02-12 00:05:10    5.0
2011-02-12 00:06:00    5.5
2011-02-13 00:08:00    6.5
2011-02-14 00:09:00    7.5
Name: 0, dtype: float64
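The AttributeError quoted in this answer comes down to the difference between fixed-length offsets (such as minutes), which correspond to a single timedelta, and calendar-dependent offsets (such as month end), which do not. A rough way to tell them apart in a recent pandas is sketched below; `fixed_window_or_none` is a hypothetical helper name, not something from the answers or from pandas itself.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def fixed_window_or_none(window):
    """Return the window as a Timedelta if it has a fixed length, else None.

    Hypothetical helper for illustration only.
    """
    offset = to_offset(window)
    # Tick is the base class of the fixed-length offsets (seconds, minutes, ...).
    if isinstance(offset, pd.offsets.Tick):
        return pd.Timedelta(offset)
    return None


print(fixed_window_or_none("2min"))                 # fixed length: a Timedelta
print(fixed_window_or_none(pd.offsets.MonthEnd()))  # calendar-dependent: None
```

A caller could use the `None` result to fall back to an explicit `timedelta` or `relativedelta`, as the function above does.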

Answer by JohnE for Pandas: rolling mean by time interval

This example seems to call for a weighted mean as suggested in @andyhayden's comment. For example, there are two polls on 10/25 and one each on 10/26 and 10/27. If you just resample and then take the mean, this effectively gives twice as much weighting to the polls on 10/26 and 10/27 compared to the ones on 10/25.

To give equal weight to each poll rather than equal weight to each day, you could do something like the following.

>>> wt = df.resample('D',limit=5).count()

            favorable  unfavorable  other
enddate
2012-10-25          2            2      2
2012-10-26          1            1      1
2012-10-27          1            1      1

>>> df2 = df.resample('D').mean()

            favorable  unfavorable  other
enddate
2012-10-25      0.495        0.485  0.025
2012-10-26      0.560        0.400  0.040
2012-10-27      0.510        0.470  0.020

That gives you the raw ingredients for doing a poll-based mean instead of a day-based mean. As before, the polls are averaged on 10/25, but the weight for 10/25 is also stored and is double the weight on 10/26 or 10/27 to reflect that two polls were taken on 10/25.

>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()

>>> df3 = df3 / wt3

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.516667     0.456667  0.030000
2012-10-27   0.515000     0.460000  0.027500
2012-10-28   0.496667     0.465000  0.041667
2012-10-29   0.484000     0.478000  0.042000
2012-10-30   0.488000     0.474000  0.042000
2012-10-31   0.530000     0.450000  0.020000
2012-11-01   0.500000     0.465000  0.035000
2012-11-02   0.490000     0.470000  0.040000
2012-11-03   0.490000     0.465000  0.045000
2012-11-04   0.500000     0.448333  0.035000
2012-11-05   0.501429     0.450000  0.032857
2012-11-06   0.503333     0.450000  0.028333
2012-11-07   0.510000     0.435000  0.010000

Note that the rolling mean for 10/27 is now 0.515000 (poll-weighted) rather than 0.521667 (day-weighted).
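The difference between the two weightings can be checked on the first three days alone. The sketch below is a variation on the approach above rather than the same code: it divides rolling sums of daily totals by rolling sums of daily counts, which should be equivalent to multiplying the daily means by the counts. The `polls` frame is a hypothetical subset of the question's data.

```python
import pandas as pd

# Hypothetical subset: the four polls ending 2012-10-25 to 2012-10-27.
dates = pd.to_datetime(["2012-10-25", "2012-10-25", "2012-10-26", "2012-10-27"])
polls = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.56, 0.51],
     "unfavorable": [0.49, 0.48, 0.40, 0.47]},
    index=dates,
)

daily = polls.resample("D")
sums = daily.sum()      # total of all poll values per day
counts = daily.count()  # number of polls per day (the weights)

# Poll-weighted: every poll in the three-day window counts equally.
poll_weighted = (sums.rolling(3, min_periods=1).sum()
                 / counts.rolling(3, min_periods=1).sum())

# Day-weighted: every day in the window counts equally.
day_weighted = daily.mean().rolling(3, min_periods=1).mean()

print(poll_weighted.round(6))
print(day_weighted.round(6))
```

For 10/27 this reproduces the two `favorable` figures compared in the note above.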

Also note that there have been changes to the APIs for resample and rolling as of version 0.18.0.

rolling (what's new in pandas 0.18.0)

resample (what's new in pandas 0.18.0)
