Statsample is a suite of basic and advanced statistics for Ruby. It supports Ruby 1.8.7, 1.9.1, 1.9.2, 1.9.3, and 2.0.0.
Statsample is a perfect library for anyone interested in exploring statistics who is even a little familiar with (or interested in) Ruby.
It has a very rich API, except for TimeSeries and generalized linear models, which were rather basic.
So, in this Google Summer of Code 2013 program with SciRuby, I released two extensions: Statsample TimeSeries and Statsample GLM. These gems aim to take Statsample further and incorporate various functionalities and estimation techniques for continuous data.
TimeSeries is equipped with a wide range of operations, directly available on the Series object. A few of those functionalities are:
To get your hands dirty:
* First, install `statsample` with `gem install statsample`
* Then, install TimeSeries with `gem install statsample-timeseries`

Now, let's make a simple TimeSeries object:
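The snippet here didn't survive extraction, so here is a rough plain-Ruby sketch of the idea - the gem wraps an ordered array of observations and exposes operations (like lagging) directly on the series object. `lag` below is an illustrative stand-in, not the gem's API:

```ruby
# A plain-Ruby stand-in for a time series: an ordered array of observations.
# (In the gem, operations like these live on the TimeSeries object itself.)
series = [11, 9, 17, 12, 15, 13, 16, 14, 16, 17]

# Shift the series forward by k steps, padding the front with nils.
def lag(series, k = 1)
  Array.new(k, nil) + series[0...-k]
end

mean = series.sum.to_f / series.size  # => 14.0
lagged = lag(series)                  # => [nil, 11, 9, 17, ...]
```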

You can go through the documentation and API here.
Statsample GLM includes many helpful regression techniques, which can be used for regression analysis on data.
Some of those techniques are:
The top-level module for regression techniques is Statsample::Regression.
Using it is as simple as ever:
* First, install `statsample` with `gem install statsample`
* Now, install GLM with `gem install statsample-glm`

Let's get quickly started:
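The quick-start snippet didn't survive extraction. As a hedged stand-in: a GLM with an identity link and Gaussian errors reduces to ordinary least squares, which the stdlib Matrix can sketch in a few lines (illustrative only, not Statsample GLM's API):

```ruby
require 'matrix'

# Design matrix with an intercept column, and the response vector.
x = Matrix[[1, 1], [1, 2], [1, 3], [1, 4]]
y = Vector[3, 5, 7, 9]

# Normal equations: beta = (X'X)^-1 X'y
xt = x.transpose
beta = (xt * x).inverse * xt * y
# beta == Vector[1, 2] for this exact linear data (intercept 1, slope 2)
```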

The documentation and API details are available here.
We have some more plans for the GLM module. First on the list is to make the algorithms work with SVD, because manual inversion of matrices doesn't scale well for larger values in Poisson regression.
I have blogged about most of the functionalities on my blog, www.ankurgoel.com.
Please explore and use the libraries; I will be waiting for your inputs, suggestions, and questions. Feel free to leave your questions as GitHub issues.
Had an amazing summer!
Stay tuned and Enjoy. :)
 Ankur Goel
This phase introduces KalmanFilter (Statsample::TimeSeries::Arima::KalmanFilter) and LogLikelihood (Statsample::TimeSeries::Arima::KF::LogLikelihood).
I am very grateful to Claudio for his awesome support and guidance. While implementing log-likelihood, I understood why he asked me to go through GSL minimization in the first place. Thanks! :)
KalmanFilter enables us to fit an ARIMA(p, d, q) model to a series.
The filter finds the autoregressive and moving average coefficients for the given series and orders.
In the previous phase, we were working on simulations of the ARIMA model and manually provided the phi and theta coefficients to the simulator.
KalmanFilter removes that dependency: by minimizing the log-likelihood of the series with a simplex algorithm, it finds the ARIMA coefficients of a given series.
Use Case:

The source code of KalmanFilter can be found here.
Now, as the name suggests, LogLikelihood generates the log-likelihood and a few other attributes of the series.
With the LogLikelihood class, we generate many internal matrices (we even added a few utility functions to make the computation easier). Given the coefficients, order, and series, we can calculate the sigma, log-likelihood, and AIC (Akaike Information Criterion) of the series.
LogLikelihood is an important class, since it is the function that is repeatedly minimized in KalmanFilter, which in turn produces the estimated ARIMA parameters.
Use Case:

The source code of LogLikelihood can be found here.
The tests for both pass all cases.
In between, we have covered detailed documentation of pretty much everything in statsample-timeseries.
With that, we have also bumped our version. Go, gem install statsample-timeseries!
Cheers,
Ankur Goel
In this phase, I added an add_constant method for matrices. add_constant prepends or appends a column of ones to a matrix if it doesn't already have one.
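The snippet was lost in extraction; here is a minimal plain-Ruby reimplementation of the behaviour just described, using the stdlib Matrix (the gem's own method may differ in detail):

```ruby
require 'matrix'

# Append (or prepend) a column of ones, unless one is already present.
def add_constant(matrix, prepend: false)
  has_ones = matrix.column_vectors.any? { |col| col.to_a.all? { |v| v == 1 } }
  return matrix if has_ones

  rows = matrix.to_a.map { |row| prepend ? [1] + row : row + [1] }
  Matrix.rows(rows)
end

m = Matrix[[2, 3], [4, 5]]
add_constant(m)                 # => Matrix[[2, 3, 1], [4, 5, 1]]
add_constant(m, prepend: true)  # => Matrix[[1, 2, 3], [1, 4, 5]]
```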

There are other such methods, like chain_dot, which carries out dot multiplication of matrices in a chain. It uses Ruby's reduce to fold the supplied matrices into their product.
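The original snippet didn't survive extraction; the reduce-based idea can be sketched like this (illustrative, not necessarily the gem's exact code):

```ruby
require 'matrix'

# Multiply a chain of matrices left to right by folding with reduce.
def chain_dot(*matrices)
  matrices.reduce(:*)
end

a = Matrix[[1, 0], [0, 1]]
b = Matrix[[2, 3], [4, 5]]
c = Matrix[[1, 1], [1, 1]]
chain_dot(a, b, c)  # => Matrix[[5, 5], [9, 9]]
```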

Apart from adding such functionalities, I have covered the entire documentation of statsample-timeseries. I made sure to explain the role of each function, every input parameter, and the return type of the function. In most cases, I also added usage examples.
Later, I will add usage examples wherever they are still missing, along with details about parameters.
After this, with great help from Claudio and Ra's pointers about modular (namespace) hierarchy conventions, we managed to make it more conventional. :)
Here are the final results:

We are now reading up on and coding the Kalman filter. Hope it doesn't stay tricky. :)
Till next time,
Cheers,
 Ankur Goel
First, this post is coming a bit later than usual; sorry for that. I was traveling to my hometown (Delhi) for an occasion and couldn't do much in the last 3 days. I am thankful to Claudio for his support.
So, in this phase, as discussed, we continue to compose estimation methods for ARMA/ARIMA. Good news: most of the methods seem to be in place, and even if we land just one or two of them, we are in a good position. Bad news: the estimation methods I am working with have a lot of prerequisites, both theoretical and technical, so I'm currently coding them as I go. This comes with a plus: these methods will be extremely valuable in many other analyses. ;)
So, we started with developing the Kalman filter. The Kalman filter is one of the crucial methods for ARIMA model fitting. It is primarily identified by the constitution of three matrices:
Currently, these methods are available as class methods of the new class KalmanFilter in ARIMA. It can be found here.
An example snippet of the T matrix code:

The complete coding of the R matrix is still pending.
While venturing into another estimation method, I encountered the Cholesky decomposition of a matrix, and it took me by surprise. The Cholesky decomposition factors a symmetric (Hermitian), positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose.
I implemented it as an extension of Matrix here. Since the matrix has to be symmetric before it can be decomposed, I also wrote an is_symmetric? method to check whether the matrix is symmetric. Though symmetric? is present in Ruby's Matrix on 1.9+, this was necessary for backward compatibility with Ruby 1.8.
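For the real-valued, symmetric case, the decomposition can be sketched in plain Ruby with the stdlib Matrix (a sketch of the standard algorithm, not the extension's exact code):

```ruby
require 'matrix'

# Cholesky decomposition of a symmetric positive-definite matrix:
# returns a lower-triangular L with A == L * L.transpose.
def cholesky(a)
  raise ArgumentError, 'matrix must be symmetric' unless a == a.transpose

  n = a.row_count
  l = Array.new(n) { Array.new(n, 0.0) }
  n.times do |i|
    (0..i).each do |j|
      sum = (0...j).sum { |k| l[i][k] * l[j][k] }
      l[i][j] = if i == j
                  Math.sqrt(a[i, i] - sum)   # diagonal entry
                else
                  (a[i, j] - sum) / l[j][j]  # below-diagonal entry
                end
    end
  end
  Matrix.rows(l)
end

a = Matrix[[4.0, 2.0], [2.0, 3.0]]
l = cholesky(a)
# l == Matrix[[2.0, 0.0], [1.0, Math.sqrt(2)]]; l * l.transpose reproduces a
```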
That’s pretty much for now.
Continuing the work.
Cheers,
 Ankur Goel
This summer, I'm working on the new statsample-timeseries gem. It will act as an extension to the existing Statsample by Claudio Bustos. I aim to add support for time series and related functionalities to it.
I am using MiniTest for unit testing and Cucumber for feature testing. In the initial period of the first phase, I focused on completing as much of the testing as possible for the existing Statsample.
Another goal is to make statsample work on current and previous Ruby versions. We are taking care of that at travis-ci.org/AnkurGel/statsample-timeseries by maintaining support for Ruby 2.0.0, 1.9.3, 1.9.2, jruby-19mode, and rbx-19mode.
Now, we have many basic and advanced functions in place for time series. We have enabled:
Apart from them, we are also working on the ARIMA module (another goal of the project) and have realized the simulation of:
For those simulations, the prerequisite was to pre-acquire the values of the parameters against which the simulation was generated. For the pure model, we aim in this phase to complete most of the functions supporting that. We have completed:
I will now start with other such modelling, like the Burg algorithm and IRLS. :)
All of this makes up the deliverables of the project. Since the estimation methods for these models pose a lot of theoretical and accuracy challenges, the pace to achieve them may not be too fast. :)
One thing I liked in R and StatsModels is the amount of documentation and API detail for developers and users. I wish to have a similar amount of documentation for Statsample, so as to attract more Ruby developers and scientists.
Considering the amount of code present in Statsample and statsample-timeseries combined, devoting considerable quality time to RDoc documentation will be a good idea!
I have already expressed to Claudio my wish to continue contributing to the project after the timeline. We will continue to work on it, and on the next module once this one is near completion. :)
This post is about estimating phi in AR modelling. We covered yule_walker earlier; I'll write a post about that. After its implementation, we move on to another estimation method: Levinson-Durbin.
Levinson-Durbin requires the time series to be demeaned (series = series - series.mean) and its autocovariance.
The autocovariance of a series at lag k is the summation of the products of the series with itself at lag k - that is, the summation of (x_i * x_{i+lag}). It is also directly related to the acf of the series, as acf(k) = acvf(k) / acvf(0). Its code can now be found in Statsample::TimeSeries's acvf method.
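In plain Ruby, the acvf/acf relationship above can be sketched like this (one common convention - demeaned series, denominator n - which is an assumption; the gem may normalize slightly differently):

```ruby
# Autocovariance at lag k: sum of products of the demeaned series with its
# k-lagged self, divided by n. acf(k) is then acvf(k) / acvf(0).
def acvf(series, k)
  n = series.size
  mean = series.sum.to_f / n
  d = series.map { |x| x - mean }
  (0...(n - k)).sum { |i| d[i] * d[i + k] } / n
end

def acf(series, k)
  acvf(series, k) / acvf(series, 0)
end

series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0]
acf(series, 0)  # => 1.0 (a series is perfectly correlated with itself)
```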
Now, with the help of the autocovariance series, our levinson_durbin function recursively computes the following parameters:
LD performs recursive matrix and vector multiplications to populate its Toeplitz matrix. Here is some code depicting those manipulations:

The implementation can be found here.
This week, I will integrate this into AR modelling and perform some tests to verify the estimation, and I will soon start on the next estimation method. :)
Cheers,
Ankur Goel
While analyzing and writing tests for these, I also took some time to visualize the data on line and bar charts to get a clearer picture.
An AR(1) process is an autoregressive simulation with order p = 1, i.e., with one value of phi.
An ideal AR(p) process is represented by:
To simulate this, install statsample-timeseries from here.

Here, the number of observations is n = 1500 (a larger value is preferable for a better fit), p = 1, and phi = [0.9].
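The simulation snippet above didn't survive extraction; as a stand-in, here is a minimal plain-Ruby AR(1) simulator (`ar1_sim` is an illustrative name, not the gem's API, and uniform noise stands in for Gaussian white noise):

```ruby
# Minimal AR(1) simulator: x_t = phi * x_{t-1} + e_t.
def ar1_sim(n, phi, rng: Random.new(42))
  x = [0.0]
  (1...n).each do |t|
    noise = rng.rand(-1.0..1.0)  # uniform stand-in for white noise
    x << phi * x[t - 1] + noise
  end
  x
end

series = ar1_sim(1500, 0.9)  # 1500 observations, phi = 0.9
```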
To generate its autocorrelation:

For an AR(1) process, the acf must decay exponentially if phi > 0, or alternate in sign if phi < 0 (ref). Go through the analysis above. It can be visualized as:
When phi > 0, the acf decreases exponentially:
When phi < 0, you get alternating acf lags:
To generate its partial autocorrelation:

For an AR(1) process, the pacf must have a spike at lag 1, then be 0. That spike must be positive if phi > 0, otherwise negative. Have a look at the pacf series generated above. On visualizing the data:
When phi > 0, a positive spike at lag 1 (lag 0 contains 1.0):
When phi < 0, a negative spike at lag 1:
Here is the representation of the ideal acf-vs-pacf for positive phi in AR(1):
Simulation of an AR(p) process is similar to AR(1).

For AR(p), the acf must give a damped sine wave. The pattern depends greatly on the value and sign of the phi parameters.
When the phi coefficients are predominantly positive, you get a sine wave starting from the positive side; otherwise, the sine wave starts from the negative side.
Notice the damped sine wave starting from the positive side here:
and from the negative side here:
The pacf gives a spike at lag 0 (value = 1.0, by default) and from lag 1 to lag k. The example above features an AR(2) process, for which we must get spikes at lags 1 and 2, as:
An MA(1) process is a moving average simulation with order q = 1, i.e., with one value of theta.
To simulate this, use the ma_sim method from Statsample::ARIMA::ARIMA:

For theta > 0 in MA(1), we must get a positive spike at lag 1, as:
For theta < 0, the spike at lag 1 must be in the negative direction, as:
When I put these two visualizations side by side, the fit looks quite good:
An MA(q) process has order q, i.e., q theta coefficients.
An ideal MA(q) process is represented by:
Similar to the MA(1) simulation, its acf will have spikes from lag 1 to lag q, as:
In the pacf of the MA(q) simulation, we observe an exponentially decaying, damped sine wave.
ARMA(p, q) is a combination of autoregressive and moving average simulations.
When q = 0, the process is called a pure autoregressive process; when p = 0, the process is purely moving average.
The simulator for ARMA can be found as arma_sim in Statsample::ARIMA::ARIMA.
For an ARMA(1, 1) process, here are the comparisons of the visualizations from R and from this code, which just made my day :)
Quite Fit!
Cheers,
 Ankur Goel
We wrote the simulations for AR(p) and MA(q). The idea behind creating them is to first simulate the process with known coefficients, and then move on to writing an ARMA process which can also estimate the coefficients.
This model is represented by AR(p), where p is the order of the model. For a pure autoregressive model, we consider the order of the moving average part to be 0.
AR(p) is represented by the following equation:
x(t) = phi_1 * x(t-1) + phi_2 * x(t-2) + ... + phi_p * x(t-p) + e(t)
Here, phi_1, ..., phi_p are the parameters of the model, and e(t) is the error noise.
To realize this model, we keep track of the previous x(t) values and compute each current value from their weighted sum plus error noise. The simulator:
* iterates over (1..n);
* at each step, takes the previous phi-many values (the backshifts) from the stack;
* computes backshifts * parameters and adds the result;
* the resulting x(t) series is then returned with added white noise.
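The steps above can be sketched as a small plain-Ruby simulator (`ar_sim` here is illustrative, not the gem's exact method, and uniform noise stands in for white noise):

```ruby
# AR(p) recursion: x_t = phi_1 * x_{t-1} + ... + phi_p * x_{t-p} + e_t
def ar_sim(n, phi, rng: Random.new(1))
  p = phi.size
  x = Array.new(p, 0.0)                    # warm-up values
  (1..n).each do
    backshifts = x.last(p).reverse         # [x_{t-1}, ..., x_{t-p}]
    ar_part = phi.each_with_index.sum { |f, i| f * backshifts[i] }
    x << ar_part + rng.rand(-1.0..1.0)     # add the white-noise term
  end
  x.drop(p)                                # drop the warm-up values
end

series = ar_sim(500, [0.5, 0.3])  # an AR(2) simulation
```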

There are certain tests which will now be performed in the context of acf and pacf, which I coded earlier. They form the basis of the correctness of our autoregressive estimation. :) Expect the next post to be about those tests.
This model is represented by MA(q), where q is the order of the model. Again, for a pure moving-average model, we consider p = 0.
It is represented by the following equation:
x(t) = e(t) + theta_1 * e(t-1) + theta_2 * e(t-2) + ... + theta_q * e(t-q)
Unlike the autoregressive model, this model was somewhat hard to obtain. It needs to observe previous error noise instead of previous x(t) values, and the series depends largely on the order of the model.
Its code can also be found on my GitHub branch.
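The error-noise bookkeeping just described can be sketched in plain Ruby (`ma_sim` here is illustrative, with uniform noise standing in for white noise):

```ruby
# MA(q): x_t = e_t + theta_1 * e_{t-1} + ... + theta_q * e_{t-q}.
# Unlike AR, it tracks the previous *error* terms, not previous x values.
def ma_sim(n, theta, rng: Random.new(7))
  q = theta.size
  errors = Array.new(q, 0.0)   # most recent error first
  series = []
  n.times do
    e = rng.rand(-1.0..1.0)
    series << e + theta.each_with_index.sum { |t, i| t * errors[i] }
    errors = [e] + errors[0...-1]  # shift the error window
  end
  series
end

series = ma_sim(300, [0.8])  # an MA(1) simulation
```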
I am now working on the test analyses I mentioned, for various combinations of AR(p) and MA(q). As they finish, I will move on to realizing algorithms like Yule-Walker and Burg's for estimation of the coefficients.
Cheers,
Ankur Goel
Consider this small snippet from my pacf feature:

Yes, these are tests! And they perform the operations as they say.
Feature denotes the feature this test will cover. It is followed by the description of the feature:
* As a statistician (use case)
* So that I can quickly evaluate the pacf of a series (purpose)
* I want to evaluate pacf (expected result)

Given is analogous to before in RSpec. In the context of Background, it denotes before all: the aforementioned time series will be available in all further scenarios. This time series is resolved by the Gherkin parser.
This is further resolved after parsing by the following definition:

Scenarios cover the test cases with a combination of the When, And, and Then keywords.
They are regular English sentences and combine to form a grammatically sound process.
These sentences are then captured by regular expressions written by the programmer. For example:



The above will capture the lags from strings like:
Result: Compliant for both acf and pacf. :)
You can check my features and step definitions here.
Cheers
Ankur Goel
For the pacf, f(0) = 1 - the correlation of the series with itself - and f(x) for x >= 1 is computed recursively. The first component of every pacf series is 1.
I implemented pacf with the Yule-Walker equations, with unbiased and mle outcomes. The Yule-Walker equations are the set of equations represented by:
Yule-Walker uses the inverse of the Toeplitz matrix (a matrix in which each descending diagonal is constant) with the outcomes to generate the intermediate vector results.
Here, we can generate the pacf using either the unbiased or the mle method with the Yule-Walker function. For unbiased, the denominator is (n-k), whereas for mle it is n (where n is the size of the time series). To achieve that, I made use of Ruby's fantastic lambdas to build a closure selecting the denominator for each lag k:
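The snippet didn't survive extraction; a guess at its shape, based on the description above (the variable names are assumptions):

```ruby
n = 20                  # size of the time series (illustrative value)
estimation = :unbiased  # or :mle

# Lambda closing over n: pick the per-lag denominator once, call it per lag k.
denominator =
  if estimation == :unbiased
    ->(k) { n - k }     # unbiased: denominator shrinks with the lag
  else
    ->(_k) { n }        # mle: constant denominator
  end

denominator.call(5)  # => 15 for :unbiased, 20 for :mle
```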

A one-liner might have been a viable shortcut, but I used the former to keep descriptive comments and simplicity in the code.


Here is a useful description and theoretical treatment of Yule-Walker from the University of Pennsylvania.
The overall yule_walker method looks like the following:

The toeplitz method generates the Toeplitz matrix, and solve_matrix solves the equation using the inverse and matrix multiplication.
pacf is available in Statsample::TimeSeries and can be called as:

The entire implementation can be seen at https://github.com/AnkurGel/statsample/blob/master/lib/statsample/timeseries.rb#L151, with its tests at https://github.com/AnkurGel/statsample/blob/master/test/test_pacf.rb.
Cheers,
Ankur Goel
The Wald test is used to test whether a series of n acf or pacf indices are equal to 0.
For the acf of a white-noise stationary process, the terms are approximately independent and identically distributed normal random variables with mean 0 and variance n^{-1}.
What that means is: if the terms of an acf of a time series with k lags are squared and summed, then that statistic is chi-square distributed, with degrees of freedom directly dependent on the number of lags k.
I will demonstrate this with an example:

So far, we have managed to find the sum of squares of an acf series with k = 10 lags.
Now, we will check whether or not it is less than the 0.95 quantile of a chi-square with k degrees of freedom.
For that, include Distribution
as:

This verifies the Wald test.
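The snippets above were lost in extraction; here is a self-contained plain-Ruby sketch of the whole check (`acf_at` is an illustrative helper, and 18.307 - the 0.95 chi-square quantile at 10 degrees of freedom - is a standard table value, hard-coded here in place of calling Distribution):

```ruby
# Box-Pierce-style Wald check: for white noise, n * sum of squared acf terms
# (lags 1..k) should typically fall below the chi-square 0.95 quantile, k df.
def acf_at(series, k)
  n = series.size
  mean = series.sum / n
  d = series.map { |x| x - mean }
  num = (0...(n - k)).sum { |i| d[i] * d[i + k] }
  num / d.sum { |x| x * x }
end

rng = Random.new(3)
series = Array.new(500) { rng.rand(-1.0..1.0) }  # white noise
k = 10
stat = series.size * (1..k).sum { |lag| acf_at(series, lag)**2 }
chi_square_95_quantile = 18.307  # standard table value for 10 df
wald_ok = stat < chi_square_95_quantile
```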
The tests can be found on Github repository at: https://github.com/AnkurGel/statsample/blob/master/test/test_wald2.rb
Cheers,
Ankur Goel
I have been facing Shoulda issues with MiniTest while running the tests. The error I encounter looks something like this:

While adding tests for regression, I worked around it by converting them into pure MiniTest references. I just committed tests for F following a similar approach. I am currently trying to make Shoulda work, which could save some of the work involved in the conversion.
To set up StatSample to work on multiple Ruby versions, I have configured rvm (a tool to manage multiple Ruby environments with their own gemsets) with Ruby 1.9.2-p320 and 1.9.3. I also committed a clean gemspec for StatSample. I hope to make StatSample compatible with both first, and then fix it for Ruby 1.8.x, as this is usually the workflow that just works.
As always, the project can be forked from http://github.com/AnkurGel/statsample.
You can now build it by:

Do let me know of any trouble you encounter at ankurgel at gmail dot com.
Cheers!
Ankur Goel
This summer, I am working with the Ruby Science Foundation on the StatSample project. As you must have read in previous blog posts, StatSample is a powerful statistical library in Ruby. Unfortunately, development of this great utility has been on hold for the last 2 years. My project aims to revamp StatSample, and primarily to enhance the functionality for TimeSeries and Generalized Linear Models.
You can read more about my proposal here.
During the community bonding period, I studied a few topics my project is concerned with - primarily, estimation methods like ARIMA. I looked at its implementation in alternative statistical applications like R and StatsModels. StatsModels uses a Kalman filter for maximum likelihood and provides other estimations such as log-likelihood, conditional-sum-of-squares, etc. The basic interface for ARIMA in StatsModels is as follows:

The returned ARIMA object can be called with:
* fit(...) - for maximum likelihood, with primarily three methods: maximum-likelihood, conditional-sum-of-squares, and css-then-mle.
* predict(...) - a recursive function which gives back a list of predicted values for the supplied series.
* loglike_css(...) - stands for conditional-sum-of-squares; returns the aggregated CSS value.

The R Project, too, has substantial work on ARIMA. I talked about it on the mailing list. Thanks to John's input, researching StatsModels further was a better idea than researching R. In StatSample, we should build the ARIMA module idiomatically, as they have done in StatsModels.
Besides this, I honestly didn't get much time to devote to the project during this period because of my then-ongoing semester examinations, which I had brought to my mentors' notice earlier.
Currently, I am working on repairing and bringing uniformity to the tests. StatSample's tests are written primarily in MiniTest, in places making use of the shoulda DSL. Tests using the latter are breaking on my system with:

To address this, I am correcting and testing the specs in this commit.

Hopefully, getting the codebase into good shape will pay off as I delve further into coding TimeSeries.
Github: http://github.com/AnkurGel/statsample
Cheers!
Ankur Goel
Statsample has a module for time series: Statsample::TimeSeries. This module has a class named TimeSeries which enables users to perform operations on a sequence of data points, indexed by time and ordered from earliest to latest - for example, stock data.
Suppose we have a time series as:

This returns a TimeSeries object, which is now capable of performing several interesting operations, such as:

This is a frequently used statistical operation. In digital signal processing, the autocorrelation of a series is the cross-correlation of the signal with itself, without normalization; in statistics, though, it is normalized.

diff computes the first difference of the series - that is, the difference between the series and its first lag.
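In plain Ruby, the first difference amounts to this (an illustration of the operation, not the gem's method body):

```ruby
# First difference: each element minus its predecessor (the first value drops).
def diff(series)
  series.each_cons(2).map { |a, b| b - a }
end

diff([10, 12, 11, 15, 14])  # => [2, -1, 4, -1]
```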

A moving average is a finite impulse response filter which creates a series of averages of subsets of the full data, to analyze the given set of data points.
The EMA is similar to a moving average, but more weight is given to the latest data.
In StatSample, the EMA can be accessed from TimeSeries by calling ema on a time series. Example:

ema takes optional parameters: n (default: 10), which sets how many observations to consider, and the Welles Wilder coefficient (default: false), which uses a smoothing value of 2/(n + 1) when false and 1/n when true.
The TimeSeries module, as can be seen, can become highly sophisticated with the inclusion of other methods, such as ARMA estimation.
SRS is an unbiased technique to choose a subset of individuals (a sample) from a larger set (the population). Selection of each individual in the sample is entirely random, with equal probability for every individual. Various techniques for SRS are given here.
SRS is a module in StatSample which comprises various methods for proportion estimation, confidence intervals, standard deviation, mean estimation, etc.
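As a tiny plain-Ruby illustration of the idea (Ruby's own Array#sample already draws a simple random sample without replacement; this is not the SRS module's API):

```ruby
# Simple random sampling: every individual in the population has an equal
# chance of being selected; sample draws without replacement.
population = (1..100).to_a
rng = Random.new(11)  # seeded for reproducibility
sample = population.sample(10, random: rng)
```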
I covered various tests of the SRS methods here, as I explored and understood them. I am currently writing a few more tests for this and other modules in StatSample.
I will update the post as soon as I write them. If anyone wishes me to write about the detailed functionality of this module too, please comment - I will be delighted to do that.
Cheers,
Ankur Goel
Statsample currently makes use of Ruby/GSL, which uses NArray for vector and matrix operations. This conflicts with SciRuby's NMatrix, which uses the same class names - NMatrix and NVector. The conflict makes Statsample unusable on a system which already has NMatrix. To address this, SciRuby developed a fork of rb-gsl which makes use of NMatrix instead of NArray. I went through its code structure and found it to be great; the devs did a fine job removing many references to NArray and using NMatrix in lieu of it.
Statsample is my proposed idea for the Google Summer of Code 2013 program, and I am excited about making Statsample more flexible by covering various aspects:
I have been playing around with the existing codebase for a few days, writing a few examples and test cases, and had a discussion about this with fellow folks at SciRuby (John Woods, Claudio Bustos, and Carlos Agarie). I'm very grateful for their responses and encouragement. The discussion with them helped me clarify many aspects which were a little obscure earlier. :)
Just before writing this blog entry, I was trying out the TimeSeries class and its methods. I simply loved it - the ease with which I could compute the operations I learnt back in digital signal processing, such as lagging of a series, autocorrelation, and the exponential moving average, is mind-blowing. It currently supports many basic operations, which can definitely be expanded after the successful execution of this project.
I will be delighted to work on Statsample this summer, if given the opportunity.
Cheers!
 Ankur Goel
PS: I will try to blog with example code in the next posts.