Complete list of features

The code is classified into two maturity categories:

Definitive - This category includes all the code that has a stable API and is carefully covered by tests, usually with coverage over 90%.
Experimental - All the code found under the package rapaio.experimental. This includes drafts and untested code, which can be used, but production tools should not rely on it. When code migrates out of the experimental package, its API can change, and sometimes even the philosophy behind the implementation.

Definitive features

Data handling

Any data library needs a way to structure and handle data. The Rapaio library defines a high-level API for data handling in the form of Var and Frame. Variables and frames are an in-memory data model.

Variables and frames - the in-memory data model. There are three flavors of variables and frames: solid, mapped, and bound. Solid variables and frames are dense array implementations. Mapped and bound variables and frames rely on other data structures for storing data, similar to the concept of views in relational databases.
Data access using streams is available on variables and frames.
DensityVector - vector of frequencies
Unique - data structure to collect and manipulate unique values of a variable
Group - data structure to build and manipulate group by aggregations
Index - data structure for transforming value domains into dense indexes
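
The difference between a solid and a mapped variable can be illustrated with a small sketch. It does not use the Rapaio API: `DoubleVar`, `SolidVar` and `MappedVar` are hypothetical names chosen only for illustration. The point is that a mapped variable stores row indexes and delegates to its source, so no data is copied and writes go through to the source.

```java
import java.util.Arrays;

/** Hypothetical sketch, not the Rapaio API: a dense variable and a row-mapped view over it. */
final class MappedViewSketch {

    /** Minimal read/write access to double values by row. */
    interface DoubleVar {
        int size();
        double get(int row);
        void set(int row, double value);
    }

    /** "Solid" flavor: values live in a dense array owned by the variable. */
    static final class SolidVar implements DoubleVar {
        private final double[] values;
        SolidVar(double... values) { this.values = Arrays.copyOf(values, values.length); }
        public int size() { return values.length; }
        public double get(int row) { return values[row]; }
        public void set(int row, double value) { values[row] = value; }
    }

    /** "Mapped" flavor: stores only row indexes and delegates to a source variable. */
    static final class MappedVar implements DoubleVar {
        private final DoubleVar source;
        private final int[] rows;
        MappedVar(DoubleVar source, int... rows) { this.source = source; this.rows = rows.clone(); }
        public int size() { return rows.length; }
        public double get(int row) { return source.get(rows[row]); }
        public void set(int row, double value) { source.set(rows[row], value); }
    }

    public static void main(String[] args) {
        DoubleVar solid = new SolidVar(10, 20, 30, 40);
        DoubleVar view = new MappedVar(solid, 3, 1); // view over rows 3 and 1, in that order
        view.set(0, 99);                             // writes through to row 3 of the solid variable
        System.out.println(solid.get(3) + " " + view.get(1));
    }
}
```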

Core

The core package contains basic tools for computing statistics, random sampling, and other strategies for sampling rows.

Statistics: Maximum, Minimum, Sum, Mean, Variance, Quantiles, GeometricMean, Skewness, Kurtosis
Online Statistics: minimum, maximum, count, mean, variance, standard deviation, skewness, kurtosis
Op interface: provides a simple way to apply mathematical operations or functions on variables or between them
WeightedMean, WeightedOnlineStat: weighted variant of statistics
Pearson product-moment coefficient: linear correlation
Spearman's rank correlation coefficient: linear correlation on rankings
SamplingTools
- generates discrete integer samples with/without replacement, weighted/non-weighted
- offers utility methods for bootstraps, simple random, stratified sampling
RowSampler implementations used in machine learning algorithms: bootstrap, identity, subsampling
DensityVector - one-way discrete density vector tool
DensityTable - two-way discrete density table tool
Distance Matrix
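
As an illustration of what online statistics involve, here is a minimal single-pass sketch based on Welford's method. It is not the Rapaio implementation: the class and method names are hypothetical, and the choice of the unbiased (n - 1) variance is an assumption.

```java
/** Hypothetical sketch, not the Rapaio API: single-pass statistics using Welford's method. */
final class OnlineStatSketch {
    private long n;
    private double mean;
    private double m2; // running sum of squared deviations from the current mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Folds one new value into the running statistics. */
    void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    long count() { return n; }
    double mean() { return mean; }
    double min() { return min; }
    double max() { return max; }

    /** Unbiased sample variance (assumption); NaN when fewer than two values were seen. */
    double variance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }

    double sd() { return Math.sqrt(variance()); }
}
```

Because each value is folded in once and then discarded, the statistics can be maintained over a stream without holding the data in memory.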

Distributions

This package provides access to some very common statistical distributions in a uniform way:

Bernoulli
Binomial
ChiSquare
Discrete Uniform
Fisher
Gamma
Hypergeometric
Normal/Gaussian
Poisson
Student t
Continuous Uniform
Empirical KDE (gaussian, epanechnikov, cosine, tricube, biweight, triweight, triangular, uniform)

Hypothesis Testing

The hypothesis testing framework provides implementations of some common hypothesis tests:

z test
- one sample test for testing the sample mean
- two unpaired samples test for testing difference of the sample means
- two paired samples test for testing sample mean of the differences
t test
- one sample test for testing the sample mean
- two unpaired samples t test with same variance
- two unpaired samples Welch t test with different variances
- two paired samples test for testing sample mean of differences
Kolmogorov-Smirnov (KS) test
- one sample test for testing if a sample belongs to a distribution
- two samples test for testing if both samples come from the same distribution
Pearson Chi-Square tests
- goodness of fit
- independence test
- conditional independence test
Anderson-Darling goodness of fit
- normality test
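
To make the difference between the two unpaired t tests concrete, the following sketch computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom from two samples. It is plain Java with hypothetical names, not the Rapaio API, and it stops short of the p-value because the Java standard library has no Student t distribution function.

```java
/** Hypothetical sketch, not the Rapaio API: Welch t statistic for two unpaired samples. */
final class WelchSketch {

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Unbiased sample variance; requires at least two values. */
    static double variance(double[] x) {
        double m = mean(x);
        double ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    /** t = (mean(a) - mean(b)) / sqrt(var(a)/n_a + var(b)/n_b) */
    static double tStatistic(double[] a, double[] b) {
        double se2 = variance(a) / a.length + variance(b) / b.length;
        return (mean(a) - mean(b)) / Math.sqrt(se2);
    }

    /** Welch-Satterthwaite approximation of the degrees of freedom. */
    static double degreesOfFreedom(double[] a, double[] b) {
        double qa = variance(a) / a.length;
        double qb = variance(b) / b.length;
        return (qa + qb) * (qa + qb)
                / (qa * qa / (a.length - 1) + qb * qb / (b.length - 1));
    }
}
```

The same-variance t test differs in that it pools the two variances and uses n_a + n_b - 2 degrees of freedom.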

Filters

Data can be manipulated ad hoc or through the various APIs that variables and frames expose (direct methods, streams, the Op interface). All those methods are useful in different contexts. Some data transformation operations, however, need to be applied multiple times or are too complex to be implemented all over again. Those operations can be collected into a chain of transformations and applied to multiple sets of data many times; this is the concept of a filter. A filter has two fundamental operations: fit to the data and apply the filter to data. Some filters apply to a single variable, and others apply to multiple variables from a data frame.
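
The fit and apply split can be sketched with a standardization filter. This is an illustration in plain Java, not the Rapaio filter API: the class name, the method signatures and the use of plain arrays are all assumptions. The parameters are learned once in fit and then reused by apply on any other data set.

```java
/** Hypothetical sketch, not the Rapaio API: a filter with separate fit and apply steps. */
final class StandardizeFilterSketch {
    private double mean;
    private double sd;
    private boolean fitted;

    /** Learns the mean and sample standard deviation from the given values. */
    StandardizeFilterSketch fit(double[] values) {
        if (values.length < 2) throw new IllegalArgumentException("need at least two values");
        double sum = 0;
        for (double v : values) sum += v;
        mean = sum / values.length;
        double ss = 0;
        for (double v : values) ss += (v - mean) * (v - mean);
        sd = Math.sqrt(ss / (values.length - 1));
        fitted = true;
        return this;
    }

    /** Returns a standardized copy, using the parameters learned by fit. */
    double[] apply(double[] values) {
        if (!fitted) throw new IllegalStateException("fit must be called before apply");
        double[] out = new double[values.length];
        // Note: a constant training sample gives sd == 0, so results would be NaN or infinite.
        for (int i = 0; i < values.length; i++) out[i] = (values[i] - mean) / sd;
        return out;
    }
}
```

A typical use is to fit on training data and apply the same fitted filter to test data, so that both are transformed with identical parameters.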

Frame filters

FApplyDouble - apply a function on the double values of variables
FFillNaDouble - apply a given fill value over missing values
FIntercept - add an intercept variable to a given data frame
FJitter - add jitter to data according to a noise distribution
FMapVars - select some variables according to a VRange pattern
FOneHotEncoding - encodes nominal variables into multiple 0/1 variables
FQuantileDiscrete - splits numeric variables into nominal categories based on quantile intervals
FRandomProjection - project a data frame onto random projections
FRefSort - sort a data frame based on reference comparators
FRemoveVars - removes some variables according to a VRange pattern
FRetainTypes - retain only variables of given types
FShuffle - shuffle rows from a data frame
FStandardize - standardize variables from a given data frame
FToDouble - convert variables to double
FTransformBoxCox - apply box cox transformation
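
As an example of what a frame filter such as one-hot encoding does, the sketch below turns a nominal column into 0/1 indicator columns. It is plain Java with hypothetical names, not the Rapaio implementation; details such as the level order and the handling of unseen labels are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the Rapaio API: one-hot encoding of a nominal column. */
final class OneHotSketch {

    /** "Fit" step: collects the distinct labels in order of first appearance. */
    static List<String> levels(String[] labels) {
        List<String> levels = new ArrayList<>();
        for (String label : labels) {
            if (!levels.contains(label)) levels.add(label);
        }
        return levels;
    }

    /** "Apply" step: one row per label, one 0/1 column per level. */
    static int[][] encode(String[] labels, List<String> levels) {
        int[][] out = new int[labels.length][levels.size()];
        for (int i = 0; i < labels.length; i++) {
            int j = levels.indexOf(labels[i]);
            if (j >= 0) out[i][j] = 1; // labels not seen during fit stay all-zero (assumption)
        }
        return out;
    }
}
```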

Var filters

VApply - apply a function over the stream spots
VApplyDouble - apply a function over the double values
VApplyInt - updates a variable using a lambda on int value
VApplyLabel - updates a variable using a lambda on label value
VCumSum - builds a numeric vector with a cumulative sum
VJitter - adds noise to a given numeric vector according to a noise distribution
VQuantileDiscrete - converts a numerical variable into a nominal based on quantile intervals
VRefSort - sorts a variable according to a given set of row comparators
VShuffle - shuffles values from a variable
VSort - sorts a variable according to the default comparator
VStandardize - standardize values from a given numeric variable
VToDouble - transforms a variable into double using a lambda
VToInt - transforms a variable into an int type using a lambda
VTransformBoxCox - transform a variable with BoxCox transform
VTransformPower - transform a variable with power transform
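
For reference, the textbook one-parameter Box-Cox transform can be sketched as follows. This is plain Java with hypothetical names; whether the Rapaio filters use exactly this form (for example, whether they also support a shift parameter) is not stated here and should not be assumed.

```java
/** Hypothetical sketch, not the Rapaio API: the textbook one-parameter Box-Cox transform. */
final class BoxCoxSketch {

    /**
     * Returns log(x) when lambda is 0 and (x^lambda - 1) / lambda otherwise.
     * The transform is defined only for strictly positive x.
     */
    static double boxCox(double x, double lambda) {
        if (x <= 0) throw new IllegalArgumentException("x must be strictly positive");
        if (lambda == 0.0) return Math.log(x);
        return (Math.pow(x, lambda) - 1.0) / lambda;
    }

    /** Applies the transform to every value and returns a new array. */
    static double[] boxCox(double[] values, double lambda) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) out[i] = boxCox(values[i], lambda);
        return out;
    }
}
```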

Machine learning

There are many problems which can be solved using methods from the field of machine learning: classification, regression, clustering, time series forecasting, etc.

Model selection and evaluation

Confusion Matrix
CrossValidation for Classifiers (metrics: Accuracy, LogLoss)

Classification

ZeroRule
OneRule
Bayesian: NaiveBayes (Gaussian, KernelDensity, Bernoulli, Multinoulli, Multinomial, Poisson)
Linear: BinaryLogistic (optionally L2 penalization)
Decision Trees - CTree: DecisionStump, ID3, C45, CART
- purity: entropy, infogain, gain ratio, gini index
- weight on instances
- split: numeric binary, nominal binary, nominal full
- missing value handling: ignore, random, majority, weighted
- reduced-error pruning
- variable importance: frequency, gain and permutation based
Ensemble: CForest - Bagging, Random Forests
Boosting: AdaBoost
Boosting: GBT Classifier
SVM: BinarySMO (Platt, Keerthi et al.)
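
Two of the purity functions listed for decision trees, entropy and the Gini index, can be computed from the class counts in a node. The sketch below uses the standard definitions in plain Java with hypothetical names; it is not the CTree implementation and ignores instance weights.

```java
/** Hypothetical sketch, not the Rapaio API: impurity measures computed from class counts. */
final class PuritySketch {

    /** Shannon entropy in bits: -sum(p_i * log2(p_i)); empty classes contribute nothing. */
    static double entropy(int[] counts) {
        double total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Gini index: 1 - sum(p_i^2). */
    static double gini(int[] counts) {
        double total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0;
        double sumSquares = 0;
        for (int c : counts) {
            double p = c / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }
}
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed. Information gain is the parent's entropy minus the weighted average entropy of the children.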

Regression

Simple: ConstantRegression
Simple: L1Regression
Simple: L2Regression
Simple: RandomValueRegression
LinearRegression
RidgeRegression
WeightedLinearRegression
Decision Trees: CART (no pruning), C45 (no pruning), DecisionStumps
Ensemble: RForest
Boost: Gradient Boosting Trees
RVM (Relevance Vector Machine)

Analysis

Principal Components Analysis

Clustering

KMeans
Minkowski Weighted KMeans
KMedians
Cluster Silhouette

Graphics

QQ Plot - quantile to quantile plots
Box Plot - box plots
Bar Plot - bar plots
Histogram - histograms
2D Histogram - 2 dimensional histograms
Function line - function lines
Vertical/horizontal/ab line - simple lines
Plot lines - lines from points
Plot points - scatter plot points
Density line KDE
ROC Curve
Segment2D - line segment
Plot legend - legends
PolyLine, PolyFill - polygons from plots
CorrGram - diagram of correlations
Silhouette - cluster silhouette
Text - simple texts
IsoCurves - iso bands and iso curves
Matrix - plot of a matrix
Image - images

Experimental Features

Most of the features in this section do not meet the production-ready bar. This does not mean that they are unusable; sometimes what is missing is only a small piece, such as complete printing facilities.

However, due to the high likelihood of future changes, they will be kept under this umbrella until enough time and effort is spent on improvements and testing to raise those tools to a production-ready state. Until that happens, these are the experimental features:

Core

Special Math functions

Evaluation: metrics

Receiver Operating Characteristic - ROC curves and ROC Area
Root Mean Square Error
Mean Absolute Error
Gini / Normalized Gini
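
The two error metrics above have simple standard definitions, sketched here in plain Java. The class and method names are hypothetical and are not the Rapaio API.

```java
/** Hypothetical sketch, not the Rapaio API: standard regression error metrics. */
final class RegressionMetricsSketch {

    /** Root mean square error: sqrt(mean((actual - predicted)^2)). */
    static double rmse(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double e = actual[i] - predicted[i];
            sum += e * e;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** Mean absolute error: mean(|actual - predicted|). */
    static double mae(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0;
        for (int i = 0; i < actual.length; i++) sum += Math.abs(actual[i] - predicted[i]);
        return sum / actual.length;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }
}
```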

Analysis

Fast Fourier Transform
Fisher Linear Discriminant Analysis

Classification

Ensemble: SplitClassifier

Regression

NNet: MultiLayer Perceptron Regressor

Time Series

Acf (correlation, covariance)
Pacf
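
The correlation form of the autocorrelation function can be sketched with the usual sample estimator, shown below in plain Java. The names are hypothetical and the exact estimator used by Rapaio is an assumption, not something this list confirms.

```java
/** Hypothetical sketch, not the Rapaio API: sample autocorrelation at a given lag. */
final class AcfSketch {

    /**
     * r(lag) = sum_t (x[t] - mean) * (x[t + lag] - mean) / sum_t (x[t] - mean)^2,
     * for 0 <= lag < x.length. A constant series has a zero denominator and yields NaN.
     */
    static double acf(double[] x, int lag) {
        if (lag < 0 || lag >= x.length) throw new IllegalArgumentException("lag out of range");
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double num = 0;
        double den = 0;
        for (int t = 0; t < x.length; t++) {
            den += (x[t] - mean) * (x[t] - mean);
            if (t + lag < x.length) num += (x[t] - mean) * (x[t + lag] - mean);
        }
        return num / den;
    }
}
```

Dropping the division by the denominator and dividing the numerator by the series length gives the corresponding covariance form.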

Matrices and vectors

Numeric vector operations
Basic matrix operations and matrix decompositions