In my previous post, I mentioned the Shoulda issues I hit with MiniTest while running the tests. The error I encounter looks something like this:
➜ test git:(master) ✗ ruby test_anovaoneway.rb
test_anovaoneway.rb:3:in `<class:StatsampleAnovaOneWayTestCase>': undefined method `context' for StatsampleAnovaOneWayTestCase:Class (NoMethodError)
    from test_anovaoneway.rb:2:in `<main>'
➜ test git:(master) ✗ ruby test_regression.rb
test_regression.rb:4:in `<class:StatsampleRegressionTestCase>': undefined method `context' for StatsampleRegressionTestCase:Class (NoMethodError)
    from test_regression.rb:3:in `<main>'
➜ test git:(master) ✗
While adding tests for regression, I worked around the problem by converting them to pure MiniTest assertions. I just committed tests for F following the same approach. I am currently trying to make Shoulda work, which would probably save some of the work involved in the conversion.
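To give an idea of what such a conversion looks like, here is a minimal sketch. The class name, the mean helper and the assertion are hypothetical and only serve as an illustration; they are not taken from StatSample's test suite, and the sketch uses the current Minitest::Test base class.

```ruby
require "minitest/autorun"

# Shoulda style (only works when Shoulda is loaded and defines `context`):
#
#   context "A sample mean" do
#     setup { @values = [1, 2, 3, 4] }
#     should "equal the arithmetic average" do
#       assert_in_delta 2.5, mean(@values), 1e-9
#     end
#   end

# The same check written as plain Minitest methods (hypothetical example class).
class ExampleMeanTestCase < Minitest::Test
  # `setup` replaces the Shoulda `setup { ... }` block.
  def setup
    @values = [1, 2, 3, 4]
  end

  # Illustrative helper, not a StatSample method.
  def mean(values)
    values.sum.to_f / values.size
  end

  # Each `should "..."` block becomes one `test_` method.
  def test_sample_mean_equals_arithmetic_average
    assert_in_delta 2.5, mean(@values), 1e-9
  end
end
```

The mapping is mechanical: a context becomes a class (or just a name prefix), setup becomes a setup method, and every should block becomes a test_ method.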
To make StatSample work on multiple Ruby versions, I have configured rvm (a tool to manage multiple Ruby environments, each with its own gemsets) with Ruby 1.9.2-p320 and 1.9.3. I also committed a clean gemspec for StatSample. I hope to make StatSample compatible with both of these first, and then fix it for Ruby 1.8.x, as this is usually the workflow that just works.
This summer, I am working with the Ruby Science Foundation on the StatSample project. As you may have read in previous blog posts, StatSample is a powerful statistical library in Ruby. Unfortunately, development of this great utility has been on hold for the last 2 years. My project aims to revamp StatSample, primarily by enhancing its functionality for TimeSeries and Generalized Linear Models.
During the community bonding period, I initially studied a few topics that my project is concerned with, primarily estimation methods like ARIMA. I looked at its implementation in other statistical packages such as R and StatsModels. StatsModels uses a Kalman filter for maximum likelihood estimation and also provides other estimates such as the log-likelihood and the conditional sum of squares. The basic interface for ARIMA in StatsModels is as follows:
ARIMA(series, order, dates=None)
# series => list of timeseries values
# order => ARIMA order (p=autoregressive, d=differenced, q=moving average)
# dates => [optional] timeline
The returned ARIMA object provides:
fit(...), which estimates the model using one of three main methods: maximum-likelihood, conditional-sum-of-squares, or css-then-mle.
predict(...), a recursive function which returns a list of predicted values for the supplied series.
loglike_css(...), where css stands for conditional-sum-of-squares; it returns the aggregated css value.
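To make the conditional-sum-of-squares idea more concrete, here is a rough plain-Ruby sketch for a pure autoregressive model. It is not the StatsModels code and not anything that exists in StatSample yet. The method name and arguments are hypothetical, and it assumes a zero-mean series with no differencing and no moving average terms.

```ruby
# Hypothetical helper: conditional sum of squares for an AR(p) model.
# `phi` holds the autoregressive coefficients [phi_1, ..., phi_p].
# Assumption: zero-mean series, no differencing, no moving average terms.
def conditional_sum_of_squares(series, phi)
  order = phi.size
  # Condition on the first `order` observations and sum the squared
  # one-step-ahead residuals of the remaining ones.
  (order...series.size).sum do |t|
    predicted = phi.each_with_index.sum { |coef, i| coef * series[t - 1 - i] }
    (series[t] - predicted)**2
  end
end

# A series that doubles at each step is fitted exactly by phi_1 = 2.
conditional_sum_of_squares([1, 2, 4, 8], [2.0]) # => 0.0
conditional_sum_of_squares([1, 2, 4, 8], [1.0]) # => 21.0
```

An estimator built on this would search for the coefficients that minimise the returned value.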
The R Project also has substantial work on ARIMA. I discussed this on the mailing list. Thanks to John's concerns, I concluded that researching StatsModels further was a better idea than researching R. In StatSample, we should build the ARIMA module as idiomatically as they have done in StatsModels.
Besides this, I honestly didn't get much time to devote to the project during this period because of my semester examinations, which were ongoing at the time and which I had brought to my mentors' notice at the outset.
Currently, I am working on repairing the tests and bringing uniformity to them. StatSample's tests are written primarily in MiniTest, with some making use of the Shoulda DSL. Tests using the latter are breaking on my system with:
➜ test git:(master) ✗ ruby test_anovaoneway.rb
test_anovaoneway.rb:3:in `<class:StatsampleAnovaOneWayTestCase>': undefined method `context' for StatsampleAnovaOneWayTestCase:Class (NoMethodError)
    from test_anovaoneway.rb:2:in `<main>'
➜ test git:(master) ✗ ruby test_regression.rb
test_regression.rb:4:in `<class:StatsampleRegressionTestCase>': undefined method `context' for StatsampleRegressionTestCase:Class (NoMethodError)
    from test_regression.rb:3:in `<main>'
➜ test git:(master) ✗
To fix this, I am correcting and testing the specs, as in this commit.
Statsample has a module for time series, Statsample::TimeSeries. This module has a class named TimeSeries which enables users to perform operations on a sequence of data points, indexed by time and ordered from earliest to latest. Example: stock data.
Suppose we have a time series:
timeseries = (1..10).map { rand 100 }.to_ts
#=> Time Series(type:scale, n:10)[62,91,92,71,86,99,80,64,15,94]
This returns a TimeSeries object, which can now perform several interesting operations, such as:
Lag
timeseries.lag
#=> Vector(type:scale, n:10)[nil,62,91,92,71,86,99,80,64,15]
timeseries.lag(3)
# Lag of series by three units, will place nil in first three positions.
#=> Vector(type:scale, n:10)[nil,nil,nil,62,91,92,71,86,99,80]
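For readers who want to see the idea without StatSample installed, here is a small plain-Ruby sketch of what lag does on an ordinary array. The helper name lag_series is hypothetical and this is not StatSample's own implementation.

```ruby
# Hypothetical helper, not part of StatSample: shift the values right by k
# positions and pad the front with nil, so the length stays the same.
def lag_series(values, k = 1)
  raise ArgumentError, "k must be non-negative" if k < 0

  k = [k, values.size].min
  Array.new(k) + values.first(values.size - k)
end

series = [62, 91, 92, 71, 86, 99, 80, 64, 15, 94]
lag_series(series)    # => [nil, 62, 91, 92, 71, 86, 99, 80, 64, 15]
lag_series(series, 3) # => [nil, nil, nil, 62, 91, 92, 71, 86, 99, 80]
```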
Auto-Correlation
This is a frequently used statistical operation. In digital signal processing, the autocorrelation of a series is the cross-correlation of the signal with itself, without normalization. In statistics, however, the result is normalized.
timeseries.acf
#=> Returns the auto-correlation of series.
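As a rough illustration of the normalized (statistical) form, here is a plain-Ruby sketch. The method name and the max_lag parameter are hypothetical and this is not how StatSample necessarily computes acf. It divides the lagged sum of products of the centred values by the sum of squares, so lag 0 is always 1.

```ruby
# Hypothetical helper, not StatSample's acf: sample autocorrelation for
# lags 0..max_lag, normalized by the lag-0 sum of squares.
# Uses Array#sum, so it needs Ruby 2.4 or later.
def autocorrelation(values, max_lag = 10)
  n = values.size
  mean = values.sum.to_f / n
  centered = values.map { |v| v - mean }
  denominator = centered.sum { |c| c * c }

  (0..[max_lag, n - 1].min).map do |k|
    numerator = (k...n).sum { |t| centered[t] * centered[t - k] }
    numerator / denominator
  end
end

autocorrelation([1, 2, 3, 4, 5], 2)
# lag 0 is 1.0, lag 1 is 4/10 and lag 2 is -1/10 (up to floating point rounding)
```

Dropping the division by the denominator gives the unnormalized form used in signal processing.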
Diff
diff computes the first difference of the series, that is, the difference between the series and its first lag.
timeseries.diff
#=> Time Series(type:scale, n:10)[nil,29,1,-21,15,13,-19,-16,-49,79]
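The same operation on a plain array can be sketched in a couple of lines. The helper name first_difference is hypothetical and this is not StatSample's code.

```ruby
# Hypothetical helper, not part of StatSample: subtract each value's
# predecessor from it. The first position has no predecessor, so it is nil.
def first_difference(values)
  return [] if values.empty?

  [nil] + values.each_cons(2).map { |previous, current| current - previous }
end

series = [62, 91, 92, 71, 86, 99, 80, 64, 15, 94]
first_difference(series)
# => [nil, 29, 1, -21, 15, 13, -19, -16, -49, 79]
```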
Exponential moving average
A moving average is a finite impulse response filter which creates a series of averages over subsets of the full data in order to analyze the given set of data points. EMA is similar to a moving average, but gives more weight to the latest data.
In StatSample, EMA can be accessed from TimeSeries by calling ema on a timeseries. Example:
t_series = (1..100).map { rand }.to_timeseries
t_series.ema
t_series.ema(15, true)
#=> uses 15 observations and sets Welles wilder coefficient to true.
ema takes two optional parameters: n (default: 10), the number of observations to consider, and the Welles Wilder coefficient (default: false), which selects a smoothing value of 2/(n + 1) when false and 1/n when true.
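To show how the two smoothing values enter the calculation, here is a plain-Ruby sketch. The method name is hypothetical, and seeding the first value with the simple average of the first n observations is my assumption for the sketch, not necessarily what StatSample does.

```ruby
# Hypothetical helper, not StatSample's ema.
# Assumption: the first EMA value is the simple mean of the first n observations.
def exponential_moving_average(values, n = 10, wilder = false)
  raise ArgumentError, "n must be a positive integer" unless n.is_a?(Integer) && n > 0
  raise ArgumentError, "need at least n observations" if values.size < n

  # Smoothing value: 1/n with the Welles Wilder option, 2/(n + 1) otherwise.
  smoother = wilder ? 1.0 / n : 2.0 / (n + 1)

  seed = values.first(n).sum.to_f / n
  result = Array.new(n - 1) + [seed] # nil until n observations are available

  values.drop(n).each do |value|
    previous = result.last
    result << previous + smoother * (value - previous)
  end
  result
end

exponential_moving_average([1, 2, 3, 4, 5], 3)
# => [nil, nil, 2.0, 3.0, 4.0]
```

With the Wilder option the smoothing value is smaller for any n greater than 1, so the average reacts more slowly to new observations.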
As can be seen, the TimeSeries module can become highly sophisticated with the inclusion of other methods such as ARMA estimation.
Simple Random Sampling
SRS is an unbiased technique for choosing a subset of individuals (a sample) from a larger set (the population). Each individual is selected entirely at random and has the same probability of being chosen as any other individual. Various techniques for SRS are given here. SRS is a module in StatSample which comprises various sections for proportion estimation, confidence intervals, standard deviation, mean estimation, etc.
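As a small illustration of the idea, here is a plain-Ruby sketch that draws a simple random sample without replacement and estimates a proportion with a confidence interval. The helper name and the returned hash are hypothetical and are not the API of StatSample's SRS module. The interval uses the normal approximation with a finite population correction, which is one common textbook choice.

```ruby
# Hypothetical helper, not StatSample's SRS API: estimate the proportion of
# sample items for which the block is true, with a normal-approximation
# confidence interval (z = 1.96 for roughly 95%).
def estimate_proportion(sample, population_size, z = 1.96)
  n = sample.size
  raise ArgumentError, "need at least two observations" if n < 2

  p_hat = sample.count { |item| yield(item) }.to_f / n
  # Finite population correction for sampling without replacement.
  fpc = 1.0 - n.to_f / population_size
  standard_error = Math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))

  { proportion: p_hat,
    standard_error: standard_error,
    interval: [p_hat - z * standard_error, p_hat + z * standard_error] }
end

population = (1..1000).to_a
# Array#sample draws without replacement; every individual is equally likely.
sample = population.sample(100, random: Random.new(42))
estimate = estimate_proportion(sample, population.size) { |x| x.even? }
puts estimate
```

Half of the population is even, so the estimated proportion should land near 0.5, with an interval that narrows as the sample grows.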
I covered various tests of the SRS methods here, as I explored and understood them. I am still writing a few more tests for this and other modules in StatSample.
I will update the post as soon as I write them. If anyone would like me to write about the detailed functionality of this module too, please comment. I will be delighted to do that.