Ankur Goel

On hunt of awesomeness!

StatSample - Correcting Tests and Configuration

In the previous post, I mentioned the Shoulda issues with MiniTest when running the tests. The error I encountered looks something like this:

test git:(master) ✗ ruby test_anovaoneway.rb
test_anovaoneway.rb:3:in `<class:StatsampleAnovaOneWayTestCase>': undefined method `context' for StatsampleAnovaOneWayTestCase:Class (NoMethodError)
  from test_anovaoneway.rb:2:in `<main>'
➜  test git:(master) ✗ ruby test_regression.rb 
test_regression.rb:4:in `<class:StatsampleRegressionTestCase>': undefined method `context' for StatsampleRegressionTestCase:Class (NoMethodError)
  from test_regression.rb:3:in `<main>'
test git:(master)

While adding tests for regression, I worked around it by converting them to pure MiniTest. I have just committed tests for F along the same lines. I am currently trying to make Shoulda work, which could save some of the work involved in the conversion.
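The conversion mentioned above can be sketched roughly as follows. This is only an illustration, not the actual StatSample test code: `group_mean` and `GroupMeanTest` are hypothetical names, and the sketch assumes a recent Minitest where the base class is `Minitest::Test` (on the Ruby 1.9 setups discussed here it would be `MiniTest::Unit::TestCase`).

```ruby
require "minitest/autorun"

# Hypothetical helper standing in for a library method under test.
def group_mean(values)
  values.inject(:+).to_f / values.size
end

# Shoulda style (only works when Shoulda's `context`/`should` are loaded):
#
#   context "group_mean" do
#     setup { @values = [2, 4, 6] }
#     should "return the arithmetic mean" do
#       assert_equal 4.0, group_mean(@values)
#     end
#   end
#
# Plain MiniTest equivalent: the `context` becomes a test class,
# its `setup` block becomes the `setup` method, and every `should`
# block becomes a `test_*` method.
class GroupMeanTest < Minitest::Test
  def setup
    @values = [2, 4, 6]
  end

  def test_group_mean_returns_the_arithmetic_mean
    assert_equal 4.0, group_mean(@values)
  end
end
```

The plain version needs nothing beyond MiniTest itself, which is why it sidesteps the `undefined method 'context'` error.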

To set up StatSample to work on multiple Ruby versions, I have configured rvm (a tool to manage multiple Ruby environments, each with its own gemsets) with Ruby 1.9.2-p320 and 1.9.3. I also committed a clean gemspec for StatSample. I hope to make StatSample compatible with both first, and then fix it for Ruby 1.8.x, as this is usually the workflow that just works.

As always, the project can be forked from http://github.com/AnkurGel/statsample.
You can now build it with:

gem build statsample.gemspec
gem install statsample-1.1.0.2013.gem

Do let me know of any trouble you encounter at ankurgel at gmail dot com.

Cheers!
-Ankur Goel

StatSample | Code Begins

Hi everyone!

This summer, I am working with the Ruby Science Foundation on the StatSample project. As you may have read in previous blog posts, StatSample is a powerful statistical library in Ruby. Unfortunately, development of this great utility has been on hold for the last 2 years. My project aims to revamp StatSample, primarily by enhancing its functionality for TimeSeries and Generalized Linear Models.

You can read more about my proposal here.

During the community bonding period, I initially studied a few topics that my project is concerned with, primarily estimation methods like ARIMA. I looked at its implementation in alternative statistical applications like R and StatsModels. StatsModels uses a Kalman filter for maximum likelihood and provides other estimations such as log-likelihood and conditional-sum-of-squares. The basic interface for ARIMA in StatsModels is as follows:

ARIMA class in StatsModels (source code of class):
ARIMA(series, order, dates=None)
#series => list of timeseries values
#order  => ARIMA order(p=autoregressive, d=differenced, q=moving average)
#dates  => [optional] timeline

The returned ARIMA object can be called with:

  • fit(...) to estimate the model, with three main methods: maximum-likelihood, conditional-sum-of-squares, and css-then-mle.
  • predict(...), a recursive function that returns a list of predicted values for the supplied series.
  • loglike_css(...), where CSS stands for conditional sum of squares; it returns the aggregated CSS value.
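To make the conditional-sum-of-squares idea concrete, here is a minimal Ruby sketch for a pure autoregressive model. It is not the StatsModels or StatSample implementation: `conditional_sum_of_squares` is a hypothetical helper, the series is assumed to be zero-mean, and the moving-average and differencing parts of ARIMA are left out.

```ruby
# Conditional sum of squares for an AR(p) model (hypothetical helper).
# The first p observations are taken as given; every later value is
# predicted from the p values before it, and the squared one-step
# residuals are added up.
def conditional_sum_of_squares(series, phis)
  order = phis.size
  (order...series.size).inject(0.0) do |sum, t|
    # One-step prediction: phi_1 * x[t-1] + ... + phi_p * x[t-p]
    prediction = phis.each_with_index.inject(0.0) do |acc, (phi, i)|
      acc + phi * series[t - 1 - i]
    end
    residual = series[t] - prediction
    sum + residual**2
  end
end

series = [1.0, 2.0, 4.0, 8.0]
puts conditional_sum_of_squares(series, [2.0]) # phi = 2 fits exactly: 0.0
puts conditional_sum_of_squares(series, [1.0]) # residuals 1, 2, 4: 21.0
```

A CSS-based fit would search for the coefficients that minimize this value.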

The R Project also has substantial work on ARIMA. I talked about it on the mailing list. Following John’s concerns, researching StatsModels further was a better idea than researching R. In StatSample, we should build the ARIMA module as idiomatically as they have in StatsModels.
Besides this, I honestly didn’t get much time to devote to the project during this period because of my then-ongoing semester examinations, which I had brought to my mentors’ notice early on.

Currently, I am working on repairing the tests and bringing uniformity to them. StatSample’s tests are written primarily in MiniTest, with some of them using the Shoulda DSL. Tests using the latter are breaking on my system with:

test git:(master) ✗ ruby test_anovaoneway.rb
test_anovaoneway.rb:3:in `<class:StatsampleAnovaOneWayTestCase>': undefined method `context' for StatsampleAnovaOneWayTestCase:Class (NoMethodError)
  from test_anovaoneway.rb:2:in `<main>'
➜  test git:(master) ✗ ruby test_regression.rb 
test_regression.rb:4:in `<class:StatsampleRegressionTestCase>': undefined method `context' for StatsampleRegressionTestCase:Class (NoMethodError)
  from test_regression.rb:3:in `<main>'
test git:(master)

To fix this, I am correcting and testing the specs, as in this commit.

➜  statsample git:(master) ✗ ruby test/test_regression2.rb
Run options: --seed 40873

# Running tests:

..S......

Finished tests in 0.176938s, 50.8652 tests/s, 740.3708 assertions/s.

9 tests, 131 assertions, 0 failures, 0 errors, 1 skips
➜  statsample git:(master)

Hopefully, getting the codebase into good shape will pay off as I delve further into coding for TimeSeries.

Github: http://github.com/AnkurGel/statsample

Cheers!
-Ankur Goel

Examples With Statsample

TimeSeries

Statsample has a module for time series, Statsample::TimeSeries. This module has a class named TimeSeries, which lets users perform operations on a sequence of data points indexed by time and ordered from earliest to latest, for example stock data. Suppose we have a time series such as:

timeseries = (1..10).map { rand 100 }.to_ts
#=> Time Series(type:scale, n:10)[62,91,92,71,86,99,80,64,15,94]

The returned TimeSeries object supports several interesting operations, such as:

Lag

timeseries.lag
#=> Vector(type:scale, n:10)[nil,62,91,92,71,86,99,80,64,15]
timeseries.lag(3)
#Lag of series by three units, will place nil in first three positions.
#=> Vector(type:scale, n:10)[nil,nil,nil,62,91,92,71,86,99,80]
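What lag does can be sketched in plain Ruby on an ordinary array. This is an illustration only, with a hypothetical `lagged` helper, and makes no claim about how Statsample implements it internally.

```ruby
# Shift a series forward by k positions, padding the front with nil
# and dropping the last k values so the length stays the same.
def lagged(values, k = 1)
  k = [k, values.size].min
  Array.new(k) + values.first(values.size - k)
end

series = [62, 91, 92, 71, 86, 99, 80, 64, 15, 94]
p lagged(series)    # [nil, 62, 91, 92, 71, 86, 99, 80, 64, 15]
p lagged(series, 3) # [nil, nil, nil, 62, 91, 92, 71, 86, 99, 80]
```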

Auto-Correlation

This is a frequently used statistical operation. In digital signal processing, the autocorrelation of a series is the cross-correlation of the signal with itself, without normalization. In statistics, however, the result is normalized.

timeseries.acf
#=> Returns the auto-correlation of series.
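For reference, the normalized (statistical) form can be sketched in plain Ruby as below. The `autocorrelation` helper and its `max_lag` parameter are hypothetical, and Statsample's own acf may use a different normalization or default number of lags.

```ruby
# Sample autocorrelation for lags 0..max_lag: the lag-k sum of products
# of deviations from the mean, divided by the lag-0 sum (so lag 0 is 1.0).
# A constant series has zero variance and yields NaN values.
def autocorrelation(values, max_lag = 10)
  n = values.size
  mean = values.inject(:+).to_f / n
  deviations = values.map { |v| v - mean }
  variance_sum = deviations.inject(0.0) { |sum, d| sum + d * d }

  (0..[max_lag, n - 1].min).map do |k|
    covariance_sum = (k...n).inject(0.0) do |sum, t|
      sum + deviations[t] * deviations[t - k]
    end
    covariance_sum / variance_sum
  end
end

p autocorrelation([1, 2, 3, 4, 5], 2)
```

Dropping the final division by `variance_sum` would give the unnormalized signal-processing variant.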

Diff

diff computes the first difference of the series, that is, the difference between the series and its first lag.

timeseries.diff
#=> Time Series(type:scale, n:10)[nil,29,1,-21,15,13,-19,-16,-49,79]
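The same result can be reproduced on a plain array with a short sketch. `first_difference` is a hypothetical helper, shown only to spell out the arithmetic.

```ruby
# First difference: each value minus the one before it. The first
# position has no predecessor, so it is nil.
def first_difference(values)
  [nil] + values.each_cons(2).map { |previous, current| current - previous }
end

series = [62, 91, 92, 71, 86, 99, 80, 64, 15, 94]
p first_difference(series) # [nil, 29, 1, -21, 15, 13, -19, -16, -49, 79]
```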

Exponential moving average

A moving average is a finite impulse response filter that creates a series of averages over subsets of the full data, in order to analyze the given set of data points.
An EMA is similar to a moving average, but gives more weight to the latest data.
In StatSample, EMA can be accessed from TimeSeries by calling ema on a timeseries. Example:

t_series = (1..100).map { rand }.to_ts
t_series.ema
t_series.ema(15, true)
#=> uses 15 observations and sets the Welles Wilder coefficient to true.

ema takes two optional parameters: n (default: 10), the number of observations to consider, and the Welles Wilder coefficient (default: false), which selects a smoothing value of 1/n when true and 2/(n + 1) when false.
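The effect of the two smoothing values can be seen in a small plain-Ruby sketch. The `exponential_moving_average` helper is hypothetical, and seeding the recursion with the simple average of the first n observations is an assumption made here, not necessarily what Statsample does.

```ruby
# Exponential moving average over a plain array.
# alpha is 1/n with the Welles Wilder option, 2/(n + 1) otherwise.
# Assumption: the first n - 1 positions are nil and position n is the
# simple average of the first n values, used as the seed.
def exponential_moving_average(values, n = 10, wilder = false)
  raise ArgumentError, "need at least n observations" if values.size < n

  alpha = wilder ? 1.0 / n : 2.0 / (n + 1)
  seed = values.first(n).inject(:+).to_f / n
  result = Array.new(n - 1) << seed

  values.drop(n).each do |value|
    # New average = alpha * latest value + (1 - alpha) * previous average
    result << alpha * value + (1 - alpha) * result.last
  end
  result
end

p exponential_moving_average([1, 2, 3, 4, 5], 3)       # [nil, nil, 2.0, 3.0, 4.0]
p exponential_moving_average([1, 2, 3, 4, 5], 3, true) # smaller alpha, reacts more slowly
```

With the Wilder option the smoothing value is smaller for any n greater than 1, so the average gives less weight to the newest observation.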

As can be seen, the TimeSeries module can become highly sophisticated once other methods, such as ARMA estimation, are included.

Simple Random Sampling

SRS is an unbiased technique for choosing a subset of individuals (a sample) from a larger set (the population). Each individual in the sample is selected entirely at random, with the same probability as every other individual. Various techniques for SRS are given here.
SRS is a module in StatSample that comprises sections for proportion estimation, confidence intervals, standard deviation, mean estimation, etc.
I covered various tests of SRS methods here, as I explored and understood them. I am still writing a few more tests for this and other modules in StatSample.
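As a rough illustration of the idea, not of the StatSample SRS module's actual API, the sketch below draws a simple random sample without replacement and estimates the standard error of a sample proportion using the usual finite population correction. Both helper names are hypothetical.

```ruby
# Draw n individuals without replacement; every individual has the
# same chance of being selected.
def simple_random_sample(population, n, rng = Random.new)
  population.sample(n, random: rng)
end

# Estimated standard error of a sample proportion under SRS without
# replacement: sqrt((1 - n/N) * p * (1 - p) / (n - 1)).
def proportion_standard_error(p_hat, n, population_size)
  fpc = 1.0 - n.to_f / population_size # finite population correction
  Math.sqrt(fpc * p_hat * (1.0 - p_hat) / (n - 1))
end

population = (1..100).to_a
sample = simple_random_sample(population, 26, Random.new(42))
p_hat = sample.count(&:even?).to_f / sample.size
puts "proportion of even numbers: #{p_hat}"
puts "standard error: #{proportion_standard_error(p_hat, sample.size, population.size)}"
```

Passing a seeded `Random` makes the draw reproducible, which is convenient when writing tests around sampling code.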

I will update the post as soon as I write them. If anyone would like me to write about the detailed functionality of this module too, please comment. I will be delighted to do that.

Cheers,
-Ankur Goel