Ankur Goel

On the hunt for awesomeness!

Statistics With Ruby - TimeSeries and GLM

Introduction:

Statsample is a basic and advanced statistics suite for Ruby. It supports Ruby 1.8.7, 1.9.1, 1.9.2, 1.9.3 and 2.0.0.

Statsample is a perfect library for anyone interested in exploring statistics who is even a little familiar with (or interested in) Ruby.
It has a very rich API, except for time series and generalized linear models, which were rather basic.

So, as part of the Google Summer of Code 2013 program with SciRuby, I released two extensions - Statsample TimeSeries and Statsample GLM. These gems aim to take Statsample further by incorporating various functionalities and estimation techniques for continuous data.

Statsample TimeSeries

TimeSeries is equipped with a wide range of operations, which are available directly on the Series object. A few of those functionalities are:

  • Autocorrelation of a series
  • Partial autocorrelation with Yule-Walker
  • Levinson-Durbin estimation
  • AutoRegressive Integrated Moving Average (ARIMA)
  • Kalman filtering and log-likelihood estimation
  • Moving averages, differences and autocovariances

To get your hands dirty,

  • Install statsample with gem install statsample
  • Now install the TimeSeries gem with gem install statsample-timeseries

Now, let’s make a simple TimeSeries object:

require 'statsample-timeseries'
include Statsample::TimeSeries

# Create a random time series of 100 observations
ts = 100.times.map { rand(100) }.to_ts

# Autocorrelation of the series
ts.acf

# Partial autocorrelation of the series
ts.pacf
# Partial autocorrelation with 11 lags by maximum likelihood estimation
ts.pacf(11, 'mle')

# Autoregressive coefficients
ts.ar

# ARIMA(2, 1, 1)
k_obj = ARIMA.ks(ts, 2, 1, 1)
k_obj.ar # autoregressive coefficients
k_obj.ma # moving average coefficients
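The feature list above also mentions moving averages, differences and autocovariances. The calls below are only a rough sketch of those helpers; the exact method names (diff, ma, acvf) are my assumptions, so verify them against the API documentation linked below.

# Assumed Series helpers for the remaining items in the feature list
# (check the API documentation for the exact names and signatures):
ts.diff     # first differences of the series
ts.ma(10)   # 10-period moving average
ts.acvf     # autocovariance function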

You can go through the documentation and API here.

Statsample-GLM

Statsample GLM includes many helpful techniques for regression analysis on your data.
Some of those techniques are:

  • Poisson Regression
  • Logistic Regression
  • Exponential Regression
  • Iteratively Reweighted Least Squares

The top-level module for the regression techniques is Statsample::Regression. Using it is as simple as ever:

  • First, install statsample with `gem install statsample`
  • Now, install GLM with `gem install statsample-glm`

Let’s get quickly started:

require 'statsample-glm'
#Creating datasets:
x1 = Statsample::Vector.new([0.537322309644812,-0.717124209978434,-0.519166718891331,0.434970973986765,-0.761822002215759,1.51170030921189,0.883854199811195,-0.908689798854196,1.70331977539793,-0.246971150634099,-1.59077593922623,-0.721548040910253,0.467025703920194,-0.510132788447137,0.430106510266798,-0.144353683251536,-1.54943800728303,0.849307651309298,-0.640304240933579,1.31462478279425,-0.399783455165345,0.0453055645017902,-2.58212161987746,-1.16484414309359,-1.08829266466281,-0.243893919684792,-1.96655661929441,0.301335373291024,-0.665832694463588,-0.0120650855753837,1.5116066367604,0.557300353673344,1.12829931872045,0.234443748015922,-2.03486690662651,0.275544751380246,-0.231465849558696,-0.356880153225012,-0.57746647541923,1.35758352580655,1.23971669378224,-0.662466275100489,0.313263561921793,-1.08783223256362,1.41964722846899,1.29325100940785,0.72153880625103,0.440580131022748,0.0351917814720056, -0.142353224879252],:scale)
x2 = Statsample::Vector.new([-0.866655707911859,-0.367820249977585,0.361486610435,0.857332626245179,0.133438466268095,0.716104533073575,1.77206093023382,-0.10136697295802,-0.777086491435508,-0.204573554913706,0.963353531412233,-1.10103024900542,-0.404372761837392,-0.230226345183469,0.0363730246866971,-0.838265540390497,1.12543549657924,-0.57929175648001,-0.747060244805248,0.58946979365152,-0.531952663697324,1.53338594419818,0.521992029051441,1.41631763288724,0.611402316795129,-0.518355638373296,-0.515192557101107,-0.672697937866108,1.84347042325327,-0.21195540664804,-0.269869371631611,0.296155694010096,-2.18097898069634,-1.21314663927206,1.49193669881581,1.38969280369493,-0.400680808117106,-1.87282814976479,1.82394870451051,0.637864732838274,-0.141155946382493,0.0699950644281617,1.32568550595165,-0.412599258349398,0.14436832227506,-1.16507785388489,-2.16782049922428,0.24318371493798,0.258954871320764,-0.151966534521183],:scale)

y = Statsample::Vector.new([0,0,1,0,1,1,1,1,0,1,1,1,1,0,1,0,1,1,0,1,0,1,1,1,1,0,0,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1],:scale)

# Column of ones for the intercept term (50 entries, matching x1 and x2)
intercept = Statsample::Vector.new([1] * 50, :scale)

x = Statsample::Dataset.new({"i"=>intercept,"x1"=>x1,"x2"=>x2})

obj = Statsample::Regression.glm(x, y, :binomial)
#=> Returns logistic regression object
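The same entry point should cover the other families listed above as well. As a sketch, a Poisson regression on count data might look like this; the :poisson symbol is an assumption on my part based on the feature list, so check the API documentation for the exact name:

# Hypothetical sketch: Poisson regression on count data (toy values)
counts = Statsample::Vector.new([0, 1, 3, 2, 5, 4, 1, 0, 2, 6], :scale)
ones   = Statsample::Vector.new([1] * 10, :scale)
pred   = Statsample::Vector.new([0.1, 0.4, 1.2, 0.8, 2.1, 1.7, 0.3, 0.2, 0.9, 2.5], :scale)

data = Statsample::Dataset.new({"i" => ones, "x1" => pred})
poisson_obj = Statsample::Regression.glm(data, counts, :poisson)  # :poisson is assumed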

The documentation and API details are available here.

We have some more plans for the GLM module. First on the list is making the algorithms work with SVD, because manually inverting matrices does not scale well for larger values in Poisson regression.
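To make that concrete, here is a minimal pure-Ruby sketch of a single IRLS update for Poisson regression (not Statsample-GLM's actual internals): the explicit .inverse call on X'WX is the step an SVD-based solver would replace with a more stable decomposition.

require 'matrix'

# One IRLS update for Poisson regression (log link):
#   beta_new = (X' W X)^(-1) X' W z,  with W = diag(mu) and z = eta + W^(-1)(y - mu)
def irls_step(x, y, beta)
  eta = x * beta                                           # linear predictor
  mu  = eta.map { |e| Math.exp(e) }                        # inverse log link
  w   = Matrix.diagonal(*mu.to_a.flatten)                  # weight matrix W
  z   = eta + w.inverse * (Matrix.column_vector(y) - mu)   # working response
  (x.transpose * w * x).inverse * x.transpose * w * z      # <- explicit inversion
end

x    = Matrix[[1, 0.5], [1, 1.2], [1, 2.0], [1, 3.1]]      # intercept column + one predictor
y    = [1, 2, 4, 9]                                        # toy counts
beta = Matrix.column_vector([0.0, 0.0])
5.times { beta = irls_step(x, y, beta) }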

I have blogged about most of the functionalities on my blog - www.ankurgoel.com.
Please explore and use the libraries; I will be waiting for your input, suggestions and questions. Feel free to leave your questions as GitHub issues.

Had an amazing summer!
Stay tuned and Enjoy. :)

- Ankur Goel

Kalman Filter, Log Likelihood and Minimization

The last two weeks have been a rollercoaster ride.
We have written extensive modules for KalmanFilter (Statsample::TimeSeries::Arima::KalmanFilter) and LogLikelihood (Statsample::TimeSeries::Arima::KF::LogLikelihood).

I am really grateful to Claudio for his uber-awesome support and guidance. While implementing the log-likelihood, I understood why he asked me to go through GSL minimization in the first place. Thanks! :)

KalmanFilter enables us to find the ARIMA(p, d, q) fit of a series, where:

  • p = Order of the autoregressive part.
  • d = Degree of differencing (the integrated part).
  • q = Order of the moving average part.

The filter finds the autoregressive and moving-average coefficients for the given series and orders.
In the previous phase, we were working on simulations of the ARIMA model and manually provided the phi and theta coefficients to the simulator.
KalmanFilter removes that dependency: by minimizing the log-likelihood of the series with a simplex algorithm, it successfully estimates the ARIMA coefficients of a given series.
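To illustrate the idea, here is a rough sketch of pointing rb-gsl's Nelder-Mead (simplex) minimizer at the negative log-likelihood of a series. This is not the gem's internal code: the rb-gsl calls, the starting values and the sign convention are my own assumptions, purely for illustration.

require 'gsl'
require 'statsample-timeseries'
include Statsample::TimeSeries

ts = (1..200).map { rand }.to_ts
p, q = 2, 1
np = p + q    # number of coefficients to estimate (phi plus theta)

# Objective: negative log-likelihood of the candidate coefficients
# (maximizing the likelihood is the same as minimizing its negative)
objective = proc do |v, params|
  -Arima::KF::LogLikelihood.new(v.to_a, ts, params[0], params[1]).log_likelihood
end

func = GSL::MultiMin::Function.alloc(objective, np)
func.set_params([p, q])

start = GSL::Vector.alloc(np)
start.set_all(0.1)     # initial guess for the coefficients
step = GSL::Vector.alloc(np)
step.set_all(0.1)      # initial simplex step sizes

minimizer = GSL::MultiMin::FMinimizer.alloc("nmsimplex", np)
minimizer.set(func, start, step)

100.times do
  minimizer.iterate
  break if minimizer.test_size(1e-4) == GSL::SUCCESS
end
minimizer.x.to_a   # estimated [phi..., theta...] coefficients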

Use Case:

KalmanFilter example
require 'statsample-timeseries'
include Statsample::TimeSeries

# Make a time series. Below is a random series.
ts = (1..200).map { rand }.to_ts

# Generate the Kalman filter object for ARIMA(2, 0, 1)
kf = ARIMA.ks(ts, 2, 0, 1)

# AR coefficients:
kf.ar
#=> [0.26666666666666666, 0.3666666666666667]

# MA coefficients:
kf.ma
#=> [0.13333333333333336]

The source code of KalmanFilter can be found here.

Now, as the name suggests, LogLikelihood generates the log-likelihood and a few other attributes of the series.
Within the LogLikelihood class, we generate many internal matrices (we even added a few utility functions to make the computation easier here). With this class, given the coefficients, order and series, we can calculate sigma, the log-likelihood and the AIC (Akaike Information Criterion) of the series.

LogLikelihood is an important class, since it is the function that is repeatedly minimized in KalmanFilter, which in turn yields the estimated ARIMA parameters.

Use Case:

LogLikelihood example
# Continuing with the KalmanFilter example above

l = Arima::KF::LogLikelihood.new(kf.ar+kf.ma, ts, 2, 1)
#  => LogLikelihood(p = 2, q = 1) on params: [0.26666666666666666, 0.3666666666666667, 0.13333333333333336]

l.log_likelihood
#  => -81.50161474755537

l.sigma
#  => 0.13135393469540402

l.aic
#  => 171.00322949511073

The source code of LogLikelihood can be found here.
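Since the AIC penalizes extra parameters, one handy use of LogLikelihood is comparing candidate orders. Here is a small sketch that reuses only the ARIMA.ks and LogLikelihood calls shown above (the candidate orders are arbitrary):

# Pick the (p, q) pair with the lowest AIC among a few candidate orders
candidates = [[1, 1], [2, 1], [2, 2]]
best_p, best_q = candidates.min_by do |p, q|
  fit = ARIMA.ks(ts, p, 0, q)
  Arima::KF::LogLikelihood.new(fit.ar + fit.ma, ts, p, q).aic
end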

The tests for both pass all the cases.
In between, we have also written detailed documentation for pretty much everything in statsample-timeseries.

With that, we have also bumped the version. Go ahead and gem install statsample-timeseries!

Cheers,
Ankur Goel