{"id":145,"date":"2013-03-31T23:47:58","date_gmt":"2013-04-01T04:47:58","guid":{"rendered":"http:\/\/homepages.uc.edu\/~yaozo\/wordpress\/?p=145"},"modified":"2013-03-31T23:47:58","modified_gmt":"2013-04-01T04:47:58","slug":"computational-tools","status":"publish","type":"post","link":"https:\/\/zhuoyao.net\/index.php\/2013\/03\/31\/computational-tools\/","title":{"rendered":"Computational tools"},"content":{"rendered":"<h2>Statistical functions<\/h2>\n<div id=\"percent-change\">\n<h3>Percent Change<\/h3>\n<p>Both\u00a0<tt>Series<\/tt>\u00a0and\u00a0<tt>DataFrame<\/tt>\u00a0has a method\u00a0<tt>pct_change<\/tt>\u00a0to compute the percent change over a given number of periods (using\u00a0<tt>fill_method<\/tt>\u00a0to fill NA\/null values).<\/p>\n<div>\n<div>\n<pre>In [376]: ser = Series(randn(8))\n\nIn [377]: ser.pct_change()\nOut[377]: \n0         NaN\n1   -1.602976\n2    4.334938\n3   -0.247456\n4   -2.067345\n5   -1.142903\n6   -1.688214\n7   -9.759729\ndtype: float64<\/pre>\n<\/div>\n<\/div>\n<div>\n<div>\n<pre>In [378]: df = DataFrame(randn(10, 4))\n\nIn [379]: df.pct_change(periods=3)\nOut[379]: \n          0         1         2         3\n0       NaN       NaN       NaN       NaN\n1       NaN       NaN       NaN       NaN\n2       NaN       NaN       NaN       NaN\n3 -0.218320 -1.054001  1.987147 -0.510183\n4 -0.439121 -1.816454  0.649715 -4.822809\n5 -0.127833 -3.042065 -5.866604 -1.776977\n6 -2.596833 -1.959538 -2.111697 -3.798900\n7 -0.117826 -2.169058  0.036094 -0.067696\n8  2.492606 -1.357320 -1.205802 -1.558697\n9 -1.012977  2.324558 -1.003744 -0.371806<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"covariance\">\n<h3>Covariance<\/h3>\n<p>The\u00a0<tt>Series<\/tt>\u00a0object has a method\u00a0<tt>cov<\/tt>\u00a0to compute covariance between series (excluding NA\/null values).<\/p>\n<div>\n<div>\n<pre>In [380]: s1 = Series(randn(1000))\n\nIn [381]: s2 = Series(randn(1000))\n\nIn [382]: s1.cov(s2)\nOut[382]: 
0.00068010881743109321<\/pre>\n<\/div>\n<\/div>\n<p>Analogously,\u00a0<tt>DataFrame<\/tt>\u00a0has a method\u00a0<tt>cov<\/tt>\u00a0to compute pairwise covariances among the series in the DataFrame, also excluding NA\/null values.<\/p>\n<div>\n<div>\n<pre>In [383]: frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])\n\nIn [384]: frame.cov()\nOut[384]: \n          a         b         c         d         e\na  1.000882 -0.003177 -0.002698 -0.006889  0.031912\nb -0.003177  1.024721  0.000191  0.009212  0.000857\nc -0.002698  0.000191  0.950735 -0.031743 -0.005087\nd -0.006889  0.009212 -0.031743  1.002983 -0.047952\ne  0.031912  0.000857 -0.005087 -0.047952  1.042487<\/pre>\n<\/div>\n<\/div>\n<p><tt>DataFrame.cov<\/tt>\u00a0also supports an optional\u00a0<tt>min_periods<\/tt>\u00a0keyword that specifies the required minimum number of observations for each column pair in order to have a valid result.<\/p>\n<div>\n<div>\n<pre>In [385]: frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])\n\nIn [386]: frame.ix[:5, 'a'] = np.nan\n\nIn [387]: frame.ix[5:10, 'b'] = np.nan\n\nIn [388]: frame.cov()\nOut[388]: \n          a         b         c\na  1.210090 -0.430629  0.018002\nb -0.430629  1.240960  0.347188\nc  0.018002  0.347188  1.301149\n\nIn [389]: frame.cov(min_periods=12)\nOut[389]: \n          a         b         c\na  1.210090       NaN  0.018002\nb       NaN  1.240960  0.347188\nc  0.018002  0.347188  1.301149<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"correlation\">\n<h3>Correlation<\/h3>\n<p>
Several kinds of correlation methods are provided:<\/p>\n<table border=\"1\">\n<colgroup>\n<col width=\"20%\" \/>\n<col width=\"80%\" \/><\/colgroup>\n<thead valign=\"bottom\">\n<tr>\n<th>Method name<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody valign=\"top\">\n<tr>\n<td><tt>pearson\u00a0(default)<\/tt><\/td>\n<td>Standard correlation coefficient<\/td>\n<\/tr>\n<tr>\n<td><tt>kendall<\/tt><\/td>\n<td>Kendall Tau correlation coefficient<\/td>\n<\/tr>\n<tr>\n<td><tt>spearman<\/tt><\/td>\n<td>Spearman rank correlation coefficient<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>All of these are currently computed using pairwise complete observations.<\/p>\n<div>\n<div>\n<pre>In [390]: frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])\n\nIn [391]: frame.ix[::2] = np.nan\n\n# Series with Series\nIn [392]: frame['a'].corr(frame['b'])\nOut[392]: 0.013479040400098763\n\nIn [393]: frame['a'].corr(frame['b'], method='spearman')\nOut[393]: -0.0072898851595406388\n\n# Pairwise correlation of DataFrame columns\nIn [394]: frame.corr()\nOut[394]: \n          a         b         c         d         e\na  1.000000  0.013479 -0.049269 -0.042239 -0.028525\nb  0.013479  1.000000 -0.020433 -0.011139  0.005654\nc -0.049269 -0.020433  1.000000  0.018587 -0.054269\nd -0.042239 -0.011139  0.018587  1.000000 -0.017060\ne -0.028525  0.005654 -0.054269 -0.017060  1.000000<\/pre>\n<\/div>\n<\/div>\n<p>Note that non-numeric columns will be automatically excluded from the correlation calculation.<\/p>\n<p>Like\u00a0<tt>cov<\/tt>,\u00a0<tt>corr<\/tt>\u00a0also supports the optional\u00a0<tt>min_periods<\/tt>\u00a0keyword:<\/p>\n<div>\n<div>\n<pre>In [395]: frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])\n\nIn [396]: frame.ix[:5, 'a'] = np.nan\n\nIn [397]: frame.ix[5:10, 'b'] = np.nan\n\nIn [398]: frame.corr()\nOut[398]: \n          a         b         c\na  1.000000 -0.076520  0.160092\nb -0.076520  1.000000  0.135967\nc  0.160092  0.135967  1.000000\n\nIn [399]: 
frame.corr(min_periods=12)\nOut[399]: \n          a         b         c\na  1.000000       NaN  0.160092\nb       NaN  1.000000  0.135967\nc  0.160092  0.135967  1.000000<\/pre>\n<\/div>\n<\/div>\n<p>A related method\u00a0<tt>corrwith<\/tt>\u00a0is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.<\/p>\n<div>\n<div>\n<pre>In [400]: index = ['a', 'b', 'c', 'd', 'e']\n\nIn [401]: columns = ['one', 'two', 'three', 'four']\n\nIn [402]: df1 = DataFrame(randn(5, 4), index=index, columns=columns)\n\nIn [403]: df2 = DataFrame(randn(4, 4), index=index[:4], columns=columns)\n\nIn [404]: df1.corrwith(df2)\nOut[404]: \none     -0.125501\ntwo     -0.493244\nthree    0.344056\nfour     0.004183\ndtype: float64\n\nIn [405]: df2.corrwith(df1, axis=1)\nOut[405]: \na   -0.675817\nb    0.458296\nc    0.190809\nd   -0.186275\ne         NaN\ndtype: float64<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"data-ranking\">\n<h3>Data ranking<\/h3>\n<p>The\u00a0<tt>rank<\/tt>\u00a0method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:<\/p>\n<div>\n<div>\n<pre>In [406]: s = Series(np.random.randn(5), index=list('abcde'))\n\nIn [407]: s['d'] = s['b'] # so there's a tie\n\nIn [408]: s.rank()\nOut[408]: \na    5.0\nb    2.5\nc    1.0\nd    2.5\ne    4.0\ndtype: float64<\/pre>\n<\/div>\n<\/div>\n<p><tt>rank<\/tt>\u00a0is also a DataFrame method and can rank either the rows (<tt>axis=0<\/tt>) or the columns (<tt>axis=1<\/tt>).\u00a0<tt>NaN<\/tt>\u00a0values are excluded from the ranking.<\/p>\n<div>\n<div>\n<pre>In [409]: df = DataFrame(np.random.randn(10, 6))\n\nIn [410]: df[4] = df[2][:5] # some ties\n\nIn [411]: df\nOut[411]: \n          0         1         2         3         4         5\n0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650\n1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809\n2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004\n3 
 0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296\n4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196\n5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347\n6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180\n7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422\n8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544\n9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064\n\nIn [412]: df.rank(1)\nOut[412]: \n   0  1    2  3    4  5\n0  4  3  1.5  5  1.5  6\n1  2  6  4.5  1  4.5  3\n2  1  6  3.5  5  3.5  2\n3  4  5  1.5  3  1.5  6\n4  5  3  1.5  4  1.5  6\n5  1  2  5.0  3  NaN  4\n6  4  5  3.0  1  NaN  2\n7  2  5  3.0  4  NaN  1\n8  2  5  3.0  4  NaN  1\n9  2  3  1.0  4  NaN  5<\/pre>\n<\/div>\n<\/div>\n<p><tt>rank<\/tt>\u00a0optionally takes a parameter\u00a0<tt>ascending<\/tt>\u00a0which by default is true; when false, data is reverse-ranked, with larger values assigned a smaller rank.<\/p>\n<p><tt>rank<\/tt>\u00a0supports different tie-breaking methods, specified with the\u00a0<tt>method<\/tt>\u00a0parameter:<\/p>\n<blockquote>\n<ul>\n<li><tt>average<\/tt>\u00a0: average rank of tied group<\/li>\n<li><tt>min<\/tt>\u00a0: lowest rank in the group<\/li>\n<li><tt>max<\/tt>\u00a0: highest rank in the group<\/li>\n<li><tt>first<\/tt>\u00a0: ranks assigned in the order they appear in the array<\/li>\n<\/ul>\n<\/blockquote>\n<\/div>\n<div id=\"moving-rolling-statistics-moments\">\n<h2>Moving (rolling) statistics \/ moments<\/h2>\n<p>For working with time series data, a number of functions are provided for computing common\u00a0<em>moving<\/em>\u00a0or\u00a0<em>rolling<\/em>\u00a0statistics. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis. 
All of these methods are in the\u00a0<a title=\"pandas\" href=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/index.html#module-pandas\"><tt>pandas<\/tt><\/a>\u00a0namespace, but they can also be found in\u00a0<tt>pandas.stats.moments<\/tt>.<\/p>\n<table border=\"1\">\n<colgroup>\n<col width=\"20%\" \/>\n<col width=\"80%\" \/><\/colgroup>\n<thead valign=\"bottom\">\n<tr>\n<th>Function<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody valign=\"top\">\n<tr>\n<td><tt>rolling_count<\/tt><\/td>\n<td>Number of non-null observations<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_sum<\/tt><\/td>\n<td>Sum of values<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_mean<\/tt><\/td>\n<td>Mean of values<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_median<\/tt><\/td>\n<td>Arithmetic median of values<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_min<\/tt><\/td>\n<td>Minimum<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_max<\/tt><\/td>\n<td>Maximum<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_std<\/tt><\/td>\n<td>Unbiased standard deviation<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_var<\/tt><\/td>\n<td>Unbiased variance<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_skew<\/tt><\/td>\n<td>Unbiased skewness (3rd moment)<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_kurt<\/tt><\/td>\n<td>Unbiased kurtosis (4th moment)<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_quantile<\/tt><\/td>\n<td>Sample quantile (value at %)<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_apply<\/tt><\/td>\n<td>Generic apply<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_cov<\/tt><\/td>\n<td>Unbiased covariance (binary)<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_corr<\/tt><\/td>\n<td>Correlation (binary)<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_corr_pairwise<\/tt><\/td>\n<td>Pairwise correlation of DataFrame columns<\/td>\n<\/tr>\n<tr>\n<td><tt>rolling_window<\/tt><\/td>\n<td>Moving window function<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Generally these methods all have the same interface. The binary operators (e.g.\u00a0<tt>rolling_corr<\/tt>) take two Series or DataFrames. 
Otherwise, they all accept the following arguments:<\/p>\n<blockquote>\n<ul>\n<li><tt>window<\/tt>: size of moving window<\/li>\n<li><tt>min_periods<\/tt>: threshold of non-null data points to require (otherwise result is NA)<\/li>\n<li><tt>freq<\/tt>: optionally specify a\u00a0<a href=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/timeseries.html#timeseries-alias\"><em>frequency string<\/em><\/a>\u00a0or\u00a0<a href=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/timeseries.html#timeseries-offsets\"><em>DateOffset<\/em><\/a>\u00a0to pre-conform the data to. Note that prior to pandas v0.8.0, a keyword argument\u00a0<tt>time_rule<\/tt>, which referred to the legacy time rule constants, was used instead of\u00a0<tt>freq<\/tt>.<\/li>\n<\/ul>\n<\/blockquote>\n<p>These functions can be applied to ndarrays or Series objects:<\/p>\n<div>\n<div>\n<pre>In [413]: ts = Series(randn(1000), index=date_range('1\/1\/2000', periods=1000))\n\nIn [414]: ts = ts.cumsum()\n\nIn [415]: ts.plot(style='k--')\nOut[415]: &lt;matplotlib.axes.AxesSubplot at 0x69342d0&gt;\n\nIn [416]: rolling_mean(ts, 60).plot(style='k')\nOut[416]: &lt;matplotlib.axes.AxesSubplot at 0x69342d0&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/rolling_mean_ex.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/rolling_mean_ex.png\" \/>They can also be applied to DataFrame objects. 
This is really just syntactic sugar for applying the moving window operator to all of the DataFrame\u2019s columns:<\/p>\n<div>\n<div>\n<pre>In [417]: df = DataFrame(randn(1000, 4), index=ts.index,\n   .....:                columns=['A', 'B', 'C', 'D'])\n   .....:\n\nIn [418]: df = df.cumsum()\n\nIn [419]: rolling_sum(df, 60).plot(subplots=True)\nOut[419]: \narray([Axes(0.125,0.747826;0.775x0.152174),\n       Axes(0.125,0.565217;0.775x0.152174),\n       Axes(0.125,0.382609;0.775x0.152174), Axes(0.125,0.2;0.775x0.152174)], dtype=object)<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/rolling_mean_frame.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/rolling_mean_frame.png\" \/>The\u00a0<tt>rolling_apply<\/tt>\u00a0function takes an extra\u00a0<tt>func<\/tt>\u00a0argument and performs generic rolling computations. The\u00a0<tt>func<\/tt>\u00a0argument should be a single function that produces a single value from an ndarray input. Suppose we wanted to compute the mean absolute deviation on a rolling basis:<\/p>\n<div>\n<div>\n<pre>In [420]: mad = lambda x: np.fabs(x - x.mean()).mean()\n\nIn [421]: rolling_apply(ts, 60, mad).plot(style='k')\nOut[421]: &lt;matplotlib.axes.AxesSubplot at 0x74e1f90&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/rolling_apply_ex.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/rolling_apply_ex.png\" \/>The\u00a0<tt>rolling_window<\/tt>\u00a0function performs a generic rolling window computation on the input data. The weights used in the window are specified by the\u00a0<tt>win_type<\/tt>\u00a0keyword. 
The recognized window types are:<\/p>\n<blockquote>\n<ul>\n<li><tt>boxcar<\/tt><\/li>\n<li><tt>triang<\/tt><\/li>\n<li><tt>blackman<\/tt><\/li>\n<li><tt>hamming<\/tt><\/li>\n<li><tt>bartlett<\/tt><\/li>\n<li><tt>parzen<\/tt><\/li>\n<li><tt>bohman<\/tt><\/li>\n<li><tt>blackmanharris<\/tt><\/li>\n<li><tt>nuttall<\/tt><\/li>\n<li><tt>barthann<\/tt><\/li>\n<li><tt>kaiser<\/tt>\u00a0(needs beta)<\/li>\n<li><tt>gaussian<\/tt>\u00a0(needs std)<\/li>\n<li><tt>general_gaussian<\/tt>\u00a0(needs power, width)<\/li>\n<li><tt>slepian<\/tt>\u00a0(needs width)<\/li>\n<\/ul>\n<\/blockquote>\n<div>\n<div>\n<pre>In [422]: ser = Series(randn(10), index=date_range('1\/1\/2000', periods=10))\n\nIn [423]: rolling_window(ser, 5, 'triang')\nOut[423]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03         NaN\n2000-01-04         NaN\n2000-01-05   -0.622722\n2000-01-06   -0.460623\n2000-01-07   -0.229918\n2000-01-08   -0.237308\n2000-01-09   -0.335064\n2000-01-10   -0.403449\nFreq: D, dtype: float64<\/pre>\n<\/div>\n<\/div>\n<p>Note that the\u00a0<tt>boxcar<\/tt>\u00a0window is equivalent to\u00a0<tt>rolling_mean<\/tt>:<\/p>\n<div>\n<div>\n<pre>In [424]: rolling_window(ser, 5, 'boxcar')\nOut[424]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03         NaN\n2000-01-04         NaN\n2000-01-05   -0.841164\n2000-01-06   -0.779948\n2000-01-07   -0.565487\n2000-01-08   -0.502815\n2000-01-09   -0.553755\n2000-01-10   -0.472211\nFreq: D, dtype: float64\n\nIn [425]: rolling_mean(ser, 5)\nOut[425]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03         NaN\n2000-01-04         NaN\n2000-01-05   -0.841164\n2000-01-06   -0.779948\n2000-01-07   -0.565487\n2000-01-08   -0.502815\n2000-01-09   -0.553755\n2000-01-10   -0.472211\nFreq: D, dtype: float64<\/pre>\n<\/div>\n<\/div>\n<p>For some windowing functions, additional parameters must be specified:<\/p>\n<div>\n<div>\n<pre>In [426]: rolling_window(ser, 5, 'gaussian', std=0.1)\nOut[426]: \n2000-01-01         
NaN\n2000-01-02         NaN\n2000-01-03         NaN\n2000-01-04         NaN\n2000-01-05   -0.261998\n2000-01-06   -0.230600\n2000-01-07    0.121276\n2000-01-08   -0.136220\n2000-01-09   -0.057945\n2000-01-10   -0.199326\nFreq: D, dtype: float64<\/pre>\n<\/div>\n<\/div>\n<p>By default the labels are set to the right edge of the window, but a\u00a0<tt>center<\/tt>\u00a0keyword is available so the labels can be set at the center. This keyword is available in other rolling functions as well.<\/p>\n<div>\n<div>\n<pre>In [427]: rolling_window(ser, 5, 'boxcar')\nOut[427]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03         NaN\n2000-01-04         NaN\n2000-01-05   -0.841164\n2000-01-06   -0.779948\n2000-01-07   -0.565487\n2000-01-08   -0.502815\n2000-01-09   -0.553755\n2000-01-10   -0.472211\nFreq: D, dtype: float64\n\nIn [428]: rolling_window(ser, 5, 'boxcar', center=True)\nOut[428]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03   -0.841164\n2000-01-04   -0.779948\n2000-01-05   -0.565487\n2000-01-06   -0.502815\n2000-01-07   -0.553755\n2000-01-08   -0.472211\n2000-01-09         NaN\n2000-01-10         NaN\nFreq: D, dtype: float64\n\nIn [429]: rolling_mean(ser, 5, center=True)\nOut[429]: \n2000-01-01         NaN\n2000-01-02         NaN\n2000-01-03   -0.841164\n2000-01-04   -0.779948\n2000-01-05   -0.565487\n2000-01-06   -0.502815\n2000-01-07   -0.553755\n2000-01-08   -0.472211\n2000-01-09         NaN\n2000-01-10         NaN\nFreq: D, dtype: float64<\/pre>\n<\/div>\n<\/div>\n<div id=\"binary-rolling-moments\">\n<h3>Binary rolling moments<\/h3>\n<p><tt>rolling_cov<\/tt>\u00a0and\u00a0<tt>rolling_corr<\/tt>\u00a0can compute moving window statistics about two\u00a0<tt>Series<\/tt>\u00a0or any combination of\u00a0<tt>DataFrame\/Series<\/tt>\u00a0or\u00a0<tt>DataFrame\/DataFrame<\/tt>. 
Here is the behavior in each case:<\/p>\n<ul>\n<li>two\u00a0<tt>Series<\/tt>: compute the statistic for the pairing<\/li>\n<li><tt>DataFrame\/Series<\/tt>: compute the statistics for each column of the DataFrame with the passed Series, thus returning a DataFrame<\/li>\n<li><tt>DataFrame\/DataFrame<\/tt>: compute statistic for matching column names, returning a DataFrame<\/li>\n<\/ul>\n<p>For example:<\/p>\n<div>\n<div>\n<pre>In [430]: df2 = df[:20]\n\nIn [431]: rolling_corr(df2, df2['B'], window=5)\nOut[431]: \n                   A   B         C         D\n2000-01-01       NaN NaN       NaN       NaN\n2000-01-02       NaN NaN       NaN       NaN\n2000-01-03       NaN NaN       NaN       NaN\n2000-01-04       NaN NaN       NaN       NaN\n2000-01-05 -0.262853   1  0.334449  0.193380\n2000-01-06 -0.083745   1 -0.521587 -0.556126\n2000-01-07 -0.292940   1 -0.658532 -0.458128\n2000-01-08  0.840416   1  0.796505 -0.498672\n2000-01-09 -0.135275   1  0.753895 -0.634445\n2000-01-10 -0.346229   1 -0.682232 -0.645681\n2000-01-11 -0.365524   1 -0.775831 -0.561991\n2000-01-12 -0.204761   1 -0.855874 -0.382232\n2000-01-13  0.575218   1 -0.747531  0.167892\n2000-01-14  0.519499   1 -0.687277  0.192822\n2000-01-15  0.048982   1  0.167669 -0.061463\n2000-01-16  0.217190   1  0.167564 -0.326034\n2000-01-17  0.641180   1 -0.164780 -0.111487\n2000-01-18  0.130422   1  0.322833  0.632383\n2000-01-19  0.317278   1  0.384528  0.813656\n2000-01-20  0.293598   1  0.159538  0.742381<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"computing-rolling-pairwise-correlations\">\n<h3>Computing rolling pairwise correlations<\/h3>\n<p>In financial data analysis and other fields it\u2019s common to compute correlation matrices for a collection of time series. More difficult is to compute a moving-window correlation matrix. 
This can be done using the\u00a0<tt>rolling_corr_pairwise<\/tt>\u00a0function, which yields a\u00a0<tt>Panel<\/tt>\u00a0whose\u00a0<tt>items<\/tt>\u00a0are the dates in question:<\/p>\n<div>\n<div>\n<pre>In [432]: correls = rolling_corr_pairwise(df, 50)\n\nIn [433]: correls[df.index[-50]]\nOut[433]: \n          A         B         C         D\nA  1.000000  0.604221  0.767429 -0.776170\nB  0.604221  1.000000  0.461484 -0.381148\nC  0.767429  0.461484  1.000000 -0.748863\nD -0.776170 -0.381148 -0.748863  1.000000<\/pre>\n<\/div>\n<\/div>\n<p>You can efficiently retrieve the time series of correlations between two columns using\u00a0<tt>ix<\/tt>\u00a0indexing:<\/p>\n<div>\n<div>\n<pre>In [434]: correls.ix[:, 'A', 'C'].plot()\nOut[434]: &lt;matplotlib.axes.AxesSubplot at 0x79e4e10&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/rolling_corr_pairwise_ex.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/rolling_corr_pairwise_ex.png\" \/><\/div>\n<\/div>\n<div id=\"expanding-window-moment-functions\">\n<h2>Expanding window moment functions<\/h2>\n<p>A common alternative to rolling statistics is to use an\u00a0<em>expanding<\/em>\u00a0window, which yields the value of the statistic with all the data available up to that point in time. 
As these calculations are a special case of rolling statistics, they are implemented in pandas such that the following two calls are equivalent:<\/p>\n<div>\n<div>\n<pre>In [435]: rolling_mean(df, window=len(df), min_periods=1)[:5]\nOut[435]: \n                   A         B         C         D\n2000-01-01 -1.388345  3.317290  0.344542 -0.036968\n2000-01-02 -1.123132  3.622300  1.675867  0.595300\n2000-01-03 -0.628502  3.626503  2.455240  1.060158\n2000-01-04 -0.768740  3.888917  2.451354  1.281874\n2000-01-05 -0.824034  4.108035  2.556112  1.140723\n\nIn [436]: expanding_mean(df)[:5]\nOut[436]: \n                   A         B         C         D\n2000-01-01 -1.388345  3.317290  0.344542 -0.036968\n2000-01-02 -1.123132  3.622300  1.675867  0.595300\n2000-01-03 -0.628502  3.626503  2.455240  1.060158\n2000-01-04 -0.768740  3.888917  2.451354  1.281874\n2000-01-05 -0.824034  4.108035  2.556112  1.140723<\/pre>\n<\/div>\n<\/div>\n<p>Like the\u00a0<tt>rolling_<\/tt>\u00a0functions, the following methods are included in the\u00a0<tt>pandas<\/tt>\u00a0namespace or can be located in\u00a0<tt>pandas.stats.moments<\/tt>.<\/p>\n<table border=\"1\">\n<colgroup>\n<col width=\"20%\" \/>\n<col width=\"80%\" \/><\/colgroup>\n<thead valign=\"bottom\">\n<tr>\n<th>Function<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody valign=\"top\">\n<tr>\n<td><tt>expanding_count<\/tt><\/td>\n<td>Number of non-null observations<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_sum<\/tt><\/td>\n<td>Sum of values<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_mean<\/tt><\/td>\n<td>Mean of values<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_median<\/tt><\/td>\n<td>Arithmetic median of values<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_min<\/tt><\/td>\n<td>Minimum<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_max<\/tt><\/td>\n<td>Maximum<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_std<\/tt><\/td>\n<td>Unbiased standard deviation<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_var<\/tt><\/td>\n<td>Unbiased 
variance<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_skew<\/tt><\/td>\n<td>Unbiased skewness (3rd moment)<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_kurt<\/tt><\/td>\n<td>Unbiased kurtosis (4th moment)<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_quantile<\/tt><\/td>\n<td>Sample quantile (value at %)<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_apply<\/tt><\/td>\n<td>Generic apply<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_cov<\/tt><\/td>\n<td>Unbiased covariance (binary)<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_corr<\/tt><\/td>\n<td>Correlation (binary)<\/td>\n<\/tr>\n<tr>\n<td><tt>expanding_corr_pairwise<\/tt><\/td>\n<td>Pairwise correlation of DataFrame columns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Aside from not having a\u00a0<tt>window<\/tt>\u00a0parameter, these functions have the same interfaces as their\u00a0<tt>rolling_<\/tt>\u00a0counterparts. Like above, the parameters they all accept are:<\/p>\n<blockquote>\n<ul>\n<li><tt>min_periods<\/tt>: threshold of non-null data points to require. Defaults to minimum needed to compute statistic. No\u00a0<tt>NaNs<\/tt>\u00a0will be output once\u00a0<tt>min_periods<\/tt>\u00a0non-null data points have been seen.<\/li>\n<li><tt>freq<\/tt>: optionally specify a\u00a0<a href=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/timeseries.html#timeseries-alias\"><em>frequency string<\/em><\/a>\u00a0or\u00a0<a href=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/timeseries.html#timeseries-offsets\"><em>DateOffset<\/em><\/a>\u00a0to pre-conform the data to. Note that prior to pandas v0.8.0, a keyword argument\u00a0<tt>time_rule<\/tt>, which referred to the legacy time rule constants, was used instead of\u00a0<tt>freq<\/tt>.<\/li>\n<\/ul>\n<\/blockquote>\n<div>\n<p>Note<\/p>\n<p>The\u00a0<tt>rolling_<\/tt>\u00a0and\u00a0<tt>expanding_<\/tt>\u00a0functions do not return\u00a0<tt>NaN<\/tt>\u00a0if there are at least\u00a0<tt>min_periods<\/tt>\u00a0non-null values in the current window. 
This differs from\u00a0<tt>cumsum<\/tt>,\u00a0<tt>cumprod<\/tt>,\u00a0<tt>cummax<\/tt>, and\u00a0<tt>cummin<\/tt>, which return\u00a0<tt>NaN<\/tt>\u00a0in the output wherever a\u00a0<tt>NaN<\/tt>\u00a0is encountered in the input.<\/p>\n<\/div>\n<p>An expanding window statistic will be more stable (and less responsive) than its rolling window counterpart as the increasing window size decreases the relative impact of an individual data point. As an example, here is the\u00a0<tt>expanding_mean<\/tt>\u00a0output for the previous time series dataset:<\/p>\n<div>\n<div>\n<pre>In [437]: ts.plot(style='k--')\nOut[437]: &lt;matplotlib.axes.AxesSubplot at 0x7e2b410&gt;\n\nIn [438]: expanding_mean(ts).plot(style='k')\nOut[438]: &lt;matplotlib.axes.AxesSubplot at 0x7e2b410&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/expanding_mean_frame.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/expanding_mean_frame.png\" \/><\/div>\n<div id=\"exponentially-weighted-moment-functions\">\n<h2>Exponentially weighted moment functions<\/h2>\n<p>A related set of functions consists of exponentially weighted versions of many of the above statistics. A number of EW (exponentially weighted) functions are provided using the blending method. 
For example, where\u00a0<img decoding=\"async\" alt=\"y_t\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/84b7f27da2c227a9434fb4e732197dc030e2a168.png\" \/>\u00a0is the result and\u00a0<img decoding=\"async\" alt=\"x_t\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/4485828f5a19c01ef573976d83d057fa840ed1e3.png\" \/>\u00a0the input, we compute an exponentially weighted moving average as<\/p>\n<div>\n<p><img decoding=\"async\" alt=\"y_t = \\alpha y_{t-1} + (1 - \\alpha) x_t\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/bf3b3a68742139efe07b615d8c8451046a7f4d71.png\" \/><\/p>\n<\/div>\n<p>One must have\u00a0<img decoding=\"async\" alt=\"0 &lt; \\alpha \\leq 1\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/6ebcaa0ff9767c981a5ec0a8289ab63df1e04b68.png\" \/>, but rather than pass\u00a0<img decoding=\"async\" alt=\"\\alpha\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/10f32377ac67d94f764f12a15ea987e88c85d3e1.png\" \/>\u00a0directly, it\u2019s easier to think about either the\u00a0<strong>span<\/strong>\u00a0or\u00a0<strong>center of mass (com)<\/strong>\u00a0of an EW moment:<\/p>\n<div>\n<p><img decoding=\"async\" alt=\"\\alpha =\n \\begin{cases}\n     \\frac{2}{s + 1}, s = \\text{span}\\\\\n     \\frac{1}{c + 1}, c = \\text{center of mass}\n \\end{cases}\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/math\/1fc5e9dda6173d4f595d2c8709d9773f87c75000.png\" \/><\/p>\n<\/div>\n<p>You can pass one or the other to these functions but not both.\u00a0<strong>Span<\/strong>\u00a0corresponds to what is commonly called a \u201c20-day EW moving average\u201d for example.\u00a0<strong>Center of mass<\/strong>\u00a0has a more physical interpretation. For example,\u00a0<strong>span<\/strong>\u00a0= 20 corresponds to\u00a0<strong>com<\/strong>\u00a0= 9.5. 
Here is the list of functions available:<\/p>\n<table border=\"1\">\n<colgroup>\n<col width=\"20%\" \/>\n<col width=\"80%\" \/><\/colgroup>\n<thead valign=\"bottom\">\n<tr>\n<th>Function<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody valign=\"top\">\n<tr>\n<td><tt>ewma<\/tt><\/td>\n<td>EW moving average<\/td>\n<\/tr>\n<tr>\n<td><tt>ewmvar<\/tt><\/td>\n<td>EW moving variance<\/td>\n<\/tr>\n<tr>\n<td><tt>ewmstd<\/tt><\/td>\n<td>EW moving standard deviation<\/td>\n<\/tr>\n<tr>\n<td><tt>ewmcorr<\/tt><\/td>\n<td>EW moving correlation<\/td>\n<\/tr>\n<tr>\n<td><tt>ewmcov<\/tt><\/td>\n<td>EW moving covariance<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here is an example for a univariate time series:<\/p>\n<div>\n<div>\n<pre>In [439]: plt.close('all')\n\nIn [440]: ts.plot(style='k--')\nOut[440]: &lt;matplotlib.axes.AxesSubplot at 0x87c9f90&gt;\n\nIn [441]: ewma(ts, span=20).plot(style='k')\nOut[441]: &lt;matplotlib.axes.AxesSubplot at 0x87c9f90&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/ewma_ex.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/ewma_ex.png\" \/><\/p>\n<div>\n<p>Note<\/p>\n<p>The EW functions perform a standard adjustment to the initial observations whereby if there are fewer observations than called for in the span, those observations are reweighted accordingly.<\/p>\n<\/div>\n<\/div>\n<div id=\"linear-and-panel-regression\">\n<h2>Linear and panel regression<\/h2>\n<div>\n<p>Note<\/p>\n<p>We plan to move this functionality to\u00a0<a href=\"http:\/\/statsmodels.sourceforge.net\/\">statsmodels<\/a>\u00a0for the next release. Some of the result attributes may change names in order to foster naming consistency with the rest of statsmodels. We will make every effort to provide compatibility with older versions of pandas, however.<\/p>\n<\/div>\n<p>We have implemented a very fast set of\u00a0<em>moving-window linear regression<\/em>\u00a0classes in pandas. 
Two different types of regressions are supported:<\/p>\n<blockquote>\n<ul>\n<li>Standard ordinary least squares (OLS) multiple regression<\/li>\n<li>Multiple regression (OLS-based) on\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Panel_data\">panel data<\/a>\u00a0including with fixed-effects (also known as entity or individual effects) or time-effects.<\/li>\n<\/ul>\n<\/blockquote>\n<p>Both kinds of linear models are accessed through the\u00a0<tt>ols<\/tt>\u00a0function in the pandas namespace. They all take the following arguments to specify either a static (full sample) or dynamic (moving window) regression:<\/p>\n<blockquote>\n<ul>\n<li><tt>window_type<\/tt>:\u00a0<tt>'full\u00a0sample'<\/tt>\u00a0(default),\u00a0<tt>'expanding'<\/tt>, or\u00a0<tt>'rolling'<\/tt><\/li>\n<li><tt>window<\/tt>: size of the moving window in the\u00a0<tt>window_type='rolling'<\/tt>\u00a0case. If\u00a0<tt>window<\/tt>\u00a0is specified,\u00a0<tt>window_type<\/tt>\u00a0will be automatically set to\u00a0<tt>'rolling'<\/tt><\/li>\n<li><tt>min_periods<\/tt>: minimum number of time periods to require to compute the regression coefficients<\/li>\n<\/ul>\n<\/blockquote>\n<p>Generally speaking, the\u00a0<tt>ols<\/tt>\u00a0function works by being given a\u00a0<tt>y<\/tt>\u00a0(response) object and an\u00a0<tt>x<\/tt>\u00a0(predictors) object. These can take many forms:<\/p>\n<blockquote>\n<ul>\n<li><tt>y<\/tt>: a Series, ndarray, or DataFrame (panel model)<\/li>\n<li><tt>x<\/tt>: Series, DataFrame, dict of Series, dict of DataFrame or Panel<\/li>\n<\/ul>\n<\/blockquote>\n<p>Based on the types of\u00a0<tt>y<\/tt>\u00a0and\u00a0<tt>x<\/tt>, the model will be inferred to be either a panel model or a regular linear model. If the\u00a0<tt>y<\/tt>\u00a0variable is a DataFrame, the result will be a panel model. 
In this case, the\u00a0<tt>x<\/tt>\u00a0variable must either be a Panel, or a dict of DataFrame (which will be coerced into a Panel).<\/p>\n<div id=\"standard-ols-regression\">\n<h3>Standard OLS regression<\/h3>\n<p>Let\u2019s pull in some sample data:<\/p>\n<div>\n<div>\n<pre>In [442]: from pandas.io.data import DataReader\n\nIn [443]: symbols = ['MSFT', 'GOOG', 'AAPL']\n\nIn [444]: data = dict((sym, DataReader(sym, \"yahoo\"))\n   .....:             for sym in symbols)\n   .....:\n\nIn [445]: panel = Panel(data).swapaxes('items', 'minor')\n\nIn [446]: close_px = panel['Close']\n\n# convert closing prices to returns\nIn [447]: rets = close_px \/ close_px.shift(1) - 1\n\nIn [448]: rets.info()\n&lt;class 'pandas.core.frame.DataFrame'&gt;\nDatetimeIndex: 810 entries, 2010-01-04 00:00:00 to 2013-03-22 00:00:00\nData columns (total 3 columns):\nAAPL    809  non-null values\nGOOG    809  non-null values\nMSFT    809  non-null values\ndtypes: float64(3)<\/pre>\n<\/div>\n<\/div>\n<p>Let\u2019s do a static regression of\u00a0<tt>AAPL<\/tt>\u00a0returns on\u00a0<tt>GOOG<\/tt>\u00a0returns:<\/p>\n<div>\n<div>\n<pre>In [449]: model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG']])\n\nIn [450]: model\nOut[450]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;GOOG&gt; + &lt;intercept&gt;\nNumber of Observations:         809\nNumber of Degrees of Freedom:   2\nR-squared:         0.2394\nAdj R-squared:     0.2385\nRmse:              0.0156\nF-stat (1, 807):   253.9945, p-value:     0.0000\nDegrees of Freedom: model 1, resid 807\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n          GOOG     0.5262     0.0330      15.94     0.0000     0.4615     0.5909\n     intercept     0.0009     0.0006       1.58     0.1134    -0.0002     
0.0020\n---------------------------------End of Summary---------------------------------\n\nIn [451]: model.beta\nOut[451]: \nGOOG         0.526216\nintercept    0.000872\ndtype: float64<\/pre>\n<\/div>\n<\/div>\n<p>If we had passed a Series instead of a DataFrame with the single\u00a0<tt>GOOG<\/tt>\u00a0column, the model would have assigned the generic name\u00a0<tt>x<\/tt>\u00a0to the sole right-hand side variable.<\/p>\n<p>We can do a moving window regression to see how the relationship changes over time:<\/p>\n<div>\n<div>\n<pre>In [452]: model = ols(y=rets['AAPL'], x=rets.ix[:, ['GOOG']],\n   .....:             window=250)\n   .....:\n\n# just plot the coefficient for GOOG\nIn [453]: model.beta['GOOG'].plot()\nOut[453]: &lt;matplotlib.axes.AxesSubplot at 0x8d9e650&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/moving_lm_ex.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/moving_lm_ex.png\" \/>It looks like there are some outliers rolling in and out of the window in the above regression, influencing the results. 
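<\/p>\n<p>A moving-window fit like the one above can be sketched directly by re-solving the least-squares problem on each trailing window (a slow reference implementation on synthetic data; the pandas classes use much faster updating algorithms):<\/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, window = 400, 250
x = rng.standard_normal(n)
y = 0.5 * x + 0.01 * rng.standard_normal(n)   # true slope 0.5, made up here

betas = np.full(n, np.nan)   # NaN until a full window is available
for t in range(window - 1, n):
    xs = x[t - window + 1:t + 1]
    ys = y[t - window + 1:t + 1]
    X = np.column_stack([xs, np.ones(window)])
    betas[t] = np.linalg.lstsq(X, ys, rcond=None)[0][0]
```

<p>Each entry of\u00a0<tt>betas<\/tt>\u00a0corresponds to one point on the coefficient plot: the slope estimated from the most recent 250 observations only, which is why observations entering and leaving the window can move the estimate.<\/p>
<p>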
We could perform a simple\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Winsorising\">winsorization<\/a>\u00a0at the 3 STD level to trim the impact of outliers:<\/p>\n<div>\n<div>\n<pre>In [454]: winz = rets.copy()\n\nIn [455]: std_1year = rolling_std(rets, 250, min_periods=20)\n\n# cap at 3 * 1 year standard deviation\nIn [456]: cap_level = 3 * np.sign(winz) * std_1year\n\nIn [457]: winz[np.abs(winz) &gt; 3 * std_1year] = cap_level\n\nIn [458]: winz_model = ols(y=winz['AAPL'], x=winz.ix[:, ['GOOG']],\n   .....:             window=250)\n   .....:\n\nIn [459]: model.beta['GOOG'].plot(label=\"With outliers\")\nOut[459]: &lt;matplotlib.axes.AxesSubplot at 0x8db0750&gt;\n\nIn [460]: winz_model.beta['GOOG'].plot(label=\"Winsorized\"); plt.legend(loc='best')\nOut[460]: &lt;matplotlib.legend.Legend at 0x9a4e710&gt;<\/pre>\n<\/div>\n<\/div>\n<p><img decoding=\"async\" alt=\"_images\/moving_lm_winz.png\" src=\"http:\/\/pandas.pydata.org\/pandas-docs\/dev\/_images\/moving_lm_winz.png\" \/>So in this simple example we see the impact of winsorization is actually quite significant. 
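<\/p>\n<p>The capping step can be sketched with a full-sample standard deviation in place of the 250-day rolling one used above (a simplification), on synthetic returns:<\/p>

```python
import numpy as np

rng = np.random.default_rng(2)
rets = 0.01 * rng.standard_normal(1000)
rets[::100] = 0.20                    # inject a handful of large outliers

sd = rets.std()
cap = 3.0 * np.sign(rets) * sd        # signed cap at the 3-STD level
winz = np.where(np.abs(rets) > 3.0 * sd, cap, rets)
```

<p>Only the injected outliers exceed the 3-STD threshold and get replaced by the signed cap; all other observations pass through unchanged, which is why the winsorized series stays highly correlated with the original.<\/p>
<p>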
Note the correlation after winsorization remains high:<\/p>\n<div>\n<div>\n<pre>In [461]: winz.corrwith(rets)\nOut[461]: \nAAPL    0.988561\nGOOG    0.973117\nMSFT    0.998421\ndtype: float64<\/pre>\n<\/div>\n<\/div>\n<p>Multiple regressions can be run by passing a DataFrame with multiple columns for the predictors\u00a0<tt>x<\/tt>:<\/p>\n<div>\n<div>\n<pre>In [462]: ols(y=winz['AAPL'], x=winz.drop(['AAPL'], axis=1))\nOut[462]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;GOOG&gt; + &lt;MSFT&gt; + &lt;intercept&gt;\nNumber of Observations:         809\nNumber of Degrees of Freedom:   3\nR-squared:         0.3283\nAdj R-squared:     0.3266\nRmse:              0.0139\nF-stat (2, 806):   196.9405, p-value:     0.0000\nDegrees of Freedom: model 2, resid 806\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n          GOOG     0.4579     0.0382      12.00     0.0000     0.3831     0.5327\n          MSFT     0.3090     0.0424       7.28     0.0000     0.2258     0.3921\n     intercept     0.0009     0.0005       1.84     0.0663    -0.0001     0.0019\n---------------------------------End of Summary---------------------------------<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"panel-regression\">\n<h3>Panel regression<\/h3>\n<p>We\u2019ve implemented moving window panel regression on potentially unbalanced panel data (see\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Panel_data\">this article<\/a>\u00a0if this means nothing to you). Suppose we wanted to model the relationship between the magnitude of the daily return and trading volume among a group of stocks, and we want to pool all the data together to run one big regression. 
This is actually quite easy:<\/p>\n<div>\n<div>\n<pre># make the units somewhat comparable\nIn [463]: volume = panel['Volume'] \/ 1e8\n\nIn [464]: model = ols(y=volume, x={'return' : np.abs(rets)})\n\nIn [465]: model\nOut[465]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;return&gt; + &lt;intercept&gt;\nNumber of Observations:         2427\nNumber of Degrees of Freedom:   2\nR-squared:         0.0193\nAdj R-squared:     0.0188\nRmse:              0.2648\nF-stat (1, 2425):    47.6083, p-value:     0.0000\nDegrees of Freedom: model 1, resid 2425\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n        return     3.2605     0.4725       6.90     0.0000     2.3343     4.1866\n     intercept     0.2248     0.0077      29.37     0.0000     0.2098     0.2398\n---------------------------------End of Summary---------------------------------<\/pre>\n<\/div>\n<\/div>\n<p>In a panel model, we can insert dummy (0-1) variables for the \u201centities\u201d involved (here, each of the stocks) to account for an entity-specific effect (intercept):<\/p>\n<div>\n<div>\n<pre>In [466]: fe_model = ols(y=volume, x={'return' : np.abs(rets)},\n   .....:                entity_effects=True)\n   .....:\n\nIn [467]: fe_model\nOut[467]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;return&gt; + &lt;FE_GOOG&gt; + &lt;FE_MSFT&gt; + &lt;intercept&gt;\nNumber of Observations:         2427\nNumber of Degrees of Freedom:   4\nR-squared:         0.7400\nAdj R-squared:     0.7397\nRmse:              0.1364\nF-stat (3, 2423):  2298.9069, p-value:     0.0000\nDegrees of Freedom: model 3, resid 2423\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       
Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n        return     4.4246     0.2447      18.08     0.0000     3.9449     4.9043\n       FE_GOOG    -0.1554     0.0068     -22.87     0.0000    -0.1687    -0.1421\n       FE_MSFT     0.3852     0.0068      56.52     0.0000     0.3719     0.3986\n     intercept     0.1348     0.0058      23.36     0.0000     0.1235     0.1461\n---------------------------------End of Summary---------------------------------<\/pre>\n<\/div>\n<\/div>\n<p>Because we ran the regression with an intercept, one of the dummy variables must be dropped or the design matrix will not be full rank. If we do not use an intercept, all of the dummy variables will be included:<\/p>\n<div>\n<div>\n<pre>In [468]: fe_model = ols(y=volume, x={'return' : np.abs(rets)},\n   .....:                entity_effects=True, intercept=False)\n   .....:\n\nIn [469]: fe_model\nOut[469]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;return&gt; + &lt;FE_AAPL&gt; + &lt;FE_GOOG&gt; + &lt;FE_MSFT&gt;\nNumber of Observations:         2427\nNumber of Degrees of Freedom:   4\nR-squared:         0.7400\nAdj R-squared:     0.7397\nRmse:              0.1364\nF-stat (4, 2423):  2298.9069, p-value:     0.0000\nDegrees of Freedom: model 3, resid 2423\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n        return     4.4246     0.2447      18.08     0.0000     3.9449     4.9043\n       FE_AAPL     0.1348     0.0058      23.36     0.0000     0.1235     0.1461\n       FE_GOOG    -0.0206     0.0055      -3.73     0.0002    -0.0315    -0.0098\n       FE_MSFT     0.5200     0.0054      96.10     0.0000     0.5094     
0.5306\n---------------------------------End of Summary---------------------------------<\/pre>\n<\/div>\n<\/div>\n<p>We can also include\u00a0<em>time effects<\/em>, which demeans the data cross-sectionally at each point in time (equivalent to including dummy variables for each date). More mathematical care must be taken to properly compute the standard errors in this case:<\/p>\n<div>\n<div>\n<pre>In [470]: te_model = ols(y=volume, x={'return' : np.abs(rets)},\n   .....:                time_effects=True, entity_effects=True)\n   .....:\n\nIn [471]: te_model\nOut[471]: \n-------------------------Summary of Regression Analysis-------------------------\nFormula: Y ~ &lt;return&gt; + &lt;FE_GOOG&gt; + &lt;FE_MSFT&gt;\nNumber of Observations:         2427\nNumber of Degrees of Freedom:   812\nR-squared:         0.8166\nAdj R-squared:     0.7244\nRmse:              0.1313\nF-stat (3, 1615):     8.8641, p-value:     0.0000\nDegrees of Freedom: model 811, resid 1615\n-----------------------Summary of Estimated Coefficients------------------------\n      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%\n--------------------------------------------------------------------------------\n        return     3.5003     0.3480      10.06     0.0000     2.8182     4.1824\n       FE_GOOG    -0.1571     0.0066     -23.95     0.0000    -0.1700    -0.1442\n       FE_MSFT     0.3826     0.0066      57.94     0.0000     0.3697     0.3956\n---------------------------------End of Summary---------------------------------<\/pre>\n<\/div>\n<\/div>\n<p>Here the intercept (the mean term) is dropped by default because it will be 0 according to the model assumptions, having subtracted off the group means.<\/p>\n<\/div>\n<div id=\"result-fields-and-tests\">\n<h3>Result fields and tests<\/h3>\n<p>We\u2019ll leave it to the user to explore the docstrings and source, especially as we\u2019ll be moving this code into statsmodels in the near 
future.<\/p>\n<p>http:\/\/pandas.pydata.org\/pandas-docs\/dev\/computation.html<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Statistical functions Percent Change Both\u00a0Series\u00a0and\u00a0DataFrame\u00a0has a method\u00a0pct_change\u00a0to compute the percent change over a given number of periods (using\u00a0fill_method\u00a0to fill NA\/null values). In [376]: ser =&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19],"tags":[],"class_list":["post-145","post","type-post","status-publish","format-standard","hentry","category-python"],"_links":{"self":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/145","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/comments?post=145"}],"version-history":[{"count":0,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/posts\/145\/revisions"}],"wp:attachment":[{"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/media?parent=145"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/categories?post=145"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zhuoyao.net\/index.php\/wp-json\/wp\/v2\/tags?post=145"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}