Complete list of features
The code is classified in two categories of maturity:
- Definitive - This category includes all the code which has a stable API and it was covered carefully with tests, usually with a percentage over 90%
- Experimental - All the code found under the package
rapaio.experimental
. This includes drafts and untested code, which can be used, but production tools cannot rely on this code. Usually, when code migrates outside the experimental package, API can change and sometimes even the philosophy behind the implementation.
Definitive features
Data handling
Any data library needs a way to structure and handle data. Rapaio library defines a high level API for data handling in form of Var and Frame. Variables and frames are an in-memory data model.
- Variables and frames - the in memory data model. There are 3 flavors of variables and frames: solid, mapped or bind. Solid variable and frames defined dense array data type implementations. Mapped or bind variables or arrays are implementations which relays on other data structures for storing data. Those looks like concept of views from relational data bases.
- Data access using streams is available on variables and frames.
- DensityVector - vector of frequencies
- Unique - data structure to collect and manipulate unique values of a variable
- Group - data structure to build and manipulate group by aggregations
- Index - data structure for transforming value domains into dense indexes
Core
Core package contains various basic tools like computing various statistics, various types of random sampling or other sampling strategies for sampling rows.
- Statistics: Maximum, Minimum, Sum, Mean, Variance, Quantiles, GeometricMean, Skewness, Kurtosis
- Online Statistics: minimum, maximum, count, mean, variance, standard deviation, skewness, kurtosis
- Op interface: provides simple way to apply mathematical operations or functions on variables or between them
- WeightedMean, WeightedOnlineStat: weighted variant of statistics
- Pearson product-moment coefficient: linear correlation
- Spearman's rank correlation coefficient: linear correlation on rankings
- SamplingTools
- generates discrete integer samples with/without replacement, weighted/non-weighted
- offers utility methods for bootstraps, simple random, stratified sampling
- RowSampler implementations used in machine learning algorithms: bootstrap, identity, subsampling
- DensityVector onw way discrete density vector tool
- DensityTable two way discrete density table tool
- Distance Matrix
Distributions
This package provides access to some very common statistical distributions in an uniform way
- Bernoulli
- Binomial
- ChiSquare
- Discrete Uniform
- Fisher
- Gamma
- Hypergeometric
- Normal/Gaussian
- Poisson
- Student t
- Continuous Uniform
- Empirical KDE (gaussian, epanechnikov, cosine, tricube, biweight, triweight, triangular, uniform)
Hypothesis Testing
Hypothesis testing framework provides implementations for some common hypothesis tests
- z test
- one sample test for testing the sample mean
- two unpaired samples test for testing difference of the sample means
- two paired samples test for testing sample mean of the differences
- t test
- one sample test for testing the sample mean
- two unpaired samples t test with same variance
- two unpaired samples Welch t test with different variances
- two paired samples test for testing sample mean of differences
- Kolmogorov Smirnoff KS test
- one sample test for testing if a sample belongs to a distribution
- two samples test for testing if both samples comes from the same distribution
- Pearson Chi-Square tests
- goodness of fit
- independence test
- conditional independence test
- Anderson-Darling goodness of fit
- normality test
Filters
Data can be manipulated ad hoc or by using various types of APIs they expose (direct methods, streams, ops interface). All those methods are useful in different contexts. Some of data transformation operations, however, needs to be applied multiple times or ar too complex to be implemented all over again. Those operations can be collected into a chain of transformations and applied to multiple sets of data many times - this is the concept of filter. A filter have two fundamental operations: fit to the data and apply the filter to data. There are filters which can be applied to a single variable or filters which can be applied to multiple variables from a data frame.
Frame filters
- FApplyDouble - apply a function on the double values of variables
- FFillNaDouble - apply a given fill value over missing values
- FIntercept - add an intercept variable to a given data frame
- FJitter - add jitter to data according with a noise distribution
- FMapVars - select some variables according with a VRange pattern
- FOneHotEncoding - encodes nominal variables into multiple 0/1 variables
- FQuantileDiscrete - splits numeric variables into nominal categories based on quantile intervals
- FRandomProjection - project a data frame onto random projections
- FRefSort - sort a data frame based on reference comparators
- FRemoveVars - removes some variables according with a VRange pattern
- FRetainTypes - retain only variables of given types
- FShuffle - shuffle rows from a data frame
- FStandardize - standardize variables from a given data frame
- FToDouble - convert variables to double
- FTransformBoxCox - apply box cox transformation
Var filters
- VApply - apply a function over the stream spots
- VApplyDouble - apply a function over the double values
- VApplyInt - updates a variable using a lambda on int value
- VApplyLabel - updates a variable using a lambda on label value
- VCumSum - builds a numeric vector with a cumulative sum
- VJitter - adds noise to a given numeric vector according with a noise distribution
- VQUantileDiscrete - converts a numerical variable into a nominal based on quantile intervals
- VRefSort - sorts a variable according with a given set of row comparators
- VShuffle - shuffles values from a variable
- VSort - sorts a variable according with default comparator
- VStandardize - standardize values from a given numeric variable
- VToDouble - transforms a variable into double using a lambda
- VToInt - transforms a variable into an int type using a lambda
- VTransformBoxCox - transform a variable with BoxCox transform
- VTransformPower - transform a variable with power transform
Machine learning There are a lot of various problems which can be solved using methods from the field of machine learning: classification, regression, clustering, time series forecasting, etc.
Model selection and evaluation
- Confusion Matrix
- CrossValidation for Classifiers (metrics: Accuracy, LogLoss)
Classification
- ZeroRule
- OneRule
- Bayesian: NaiveBayes (Gaussian, KernelDensity, Bernoulli, Multinoulli, Multinomial, Poisson)
- Linear: BinaryLogistic (optionally L2 penalization)
- Decision Trees - CTree: DecisionStump, ID3, C45, CART
- purity: entropy, infogain, gain ration, gini index
- weight on instances
- split: numeric binary, nominal binary, nominal full
- missing value handling: ignore, random, majority, weighted
- reduced-error pruning
- variable importance: frequency, gain and permutation based
- Ensemble: CForest - Bagging, Random Forests
- Boosting: AdaBoost
- Boosting: GBT Classifier
- SVM: BinarySMO (Platt, Keerthi & all)
Regression
- Simple: ConstantRegression
- Simple: L1Regression
- Simple: L2Regression
- Simple: RandomValueRegression
- LinearRegression
- RidgeRegression
- WeightedLinearRegression
- Decision Trees: CART (no pruning), C45 (no pruning), DecisionStumps
- Ensemble: RForest
- Boost: Gradient Boosting Trees
- RVM (Relevance Vector Machine)
Analysis
- Principal Components Analysis
Clustering
- KMeans
- Minkowski Weighted KMeans
- KMedians
- Cluster Silhouette
Graphics
- QQ Plot - quantile to quantile plots
- Box Plot - boc plots
- Bar Plot - bar plots
- Histogram - histograms
- 2D Histogram - 2 dimensional histograms
- Function line - function lines
- Vertical/horizontal/ab line - simple lines
- Plot lines - lines from points
- Plot points - scatter plot points
- Density line KDE
- ROC Curve
- Segment2D - line segment
- Plot legend - legends
- PolyLine, PolyFill - polygons from plots
- CorrGram - diagram of correlations
- Silhouette - cluster silhouette
- Text - simple texts
- IsoCurves - iso bands and iso curves
- Matrix - plot of a matrix
- Image - images
Experimental Features
Most of the features contained under this section does not meet the production ready bar. This does not mean that most of them are not usable, and sometimes what is missing is only a tiny piece like no complete printing facilities.
However, due to the high likelihood of future changes, they will be kept under this umbrella until enough time and code is spend on improvements and testing to raise those tools to a production ready state. Until that happens, these are the experimental features:
Core
- Special Math functions
Evaluation: metrics
- Receiver Operator Characteristic - ROC curves and ROC Area
- Root Mean Square Error
- Mean Absolute Error
- Gini / Normalized Gini
Analysis
- Fast Fourier Transform
- Fischer Linear Discriminant Analysis
Classification
- Ensemble: SplitClassifier
Regression
- NNet: MultiLayer Perceptron Regressor
Time Series
- Acf (correlation, covariance)
- Pacf
Matrices and vectors
- Numeric vector operations
- Basic matrix operations and matrix decompositions