Variables¶
%load ../../rapaio-bootstrap.ipynb
Add /home/ati/work/rapaio/rapaio-core/target/rapaio-core-5.0.1.jar to classpath
A variable represents a one-dimensional set of observations which come from the same random variable (hence the name). Because the values of a variable are supposed to be generated by the same process, they have the same type and semantics. Variables have names and are of a given type. Var implementations can be categorized as storage variables and view variables. Storage variables are those which directly contain and maintain data. View variables are higher-level constructs obtained by filtering and/or merging other variables. A view variable does not contain data directly; it maintains references to the storage variables it wraps and allows read and update operations in the same way as a storage variable.
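To make the storage/view distinction concrete, here is a minimal plain-Java sketch. It does not use rapaio classes: `MappedView` is a hypothetical name invented for illustration, and the real view implementations may work differently. The view keeps only a reference to the storage array plus a row mapping, so reads and writes pass through to the storage.

```java
// Plain-Java illustration only; MappedView is a hypothetical class, not part of rapaio.
final class MappedView {
    private final double[] storage; // reference to the data owner, not a copy
    private final int[] rows;       // which storage rows this view exposes, in view order

    MappedView(double[] storage, int... rows) {
        this.storage = storage;
        this.rows = rows;
    }

    int size() {
        return rows.length;
    }

    double get(int i) {
        return storage[rows[i]];
    }

    void set(int i, double value) {
        storage[rows[i]] = value; // writes through to the underlying storage
    }

    public static void main(String[] args) {
        double[] storage = {10, 20, 30, 40};
        MappedView view = new MappedView(storage, 3, 1); // exposes rows 3 and 1
        view.set(0, 99);
        System.out.println(storage[3]);  // 99.0
        System.out.println(view.get(1)); // 20.0
    }
}
```

The design choice illustrated here is that a view costs one index per exposed row and never duplicates values, at the price of one extra indirection on each access.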
Storage variables are implemented with different storage types and representations: VarDouble, VarInt, VarNominal, VarBinary, VarLong, VarString and VarInstant.
VarDouble¶
Numeric double variables are implemented by VarDouble and are used to handle discrete or continuous numerical values. Double variables offer value and label representations. All other representations can be used, but with caution, since they can alter the content. For example, the int representation truncates floating point values to integer values; the int setter, however, sets a correct value, since an integer can be converted to a double with no information loss.
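The asymmetry between reading a double as an int and writing an int into a double can be seen with plain Java conversions. This is a sketch of the language rules only; whether the int getter of VarDouble uses exactly this cast is an assumption, and `IntDoubleRoundTrip` is a hypothetical name.

```java
// Plain-Java illustration of lossy and lossless conversions; not rapaio code.
final class IntDoubleRoundTrip {

    // Narrowing cast: the fractional part is discarded (rounding toward zero).
    static int toInt(double value) {
        return (int) value;
    }

    // Widening conversion: every int value is exactly representable as a double.
    static double toDouble(int value) {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toInt(2.9));                                   // 2
        System.out.println(toInt(-2.9));                                  // -2
        System.out.println(toInt(toDouble(Integer.MAX_VALUE)) == Integer.MAX_VALUE); // true
    }
}
```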
Various builders¶
Double variables can be built in various ways, and there are handy shortcuts for various scenarios.
// builds a variable with no data
Var empty1 = VarDouble.empty();
// builds a variable of a given size which contains only missing data
Var empty2 = VarDouble.empty(100);
// a sequence of numbers, starting from 0, ending with 5 with step 1
Var seq1 = VarDouble.seq(5);
// a sequence of numbers, starting from 1, ending with 5 with step 1
Var seq2 = VarDouble.seq(1, 5);
// a sequence of numbers starting at 0, ending at 1 with step 0.1
Var seq3 = VarDouble.seq(0, 1, 0.1);
// build a variable of a given size which contains only zeros
Var fill1 = VarDouble.fill(5);
// builds a variable of a given size which contains only ones
Var fill2 = VarDouble.fill(5, 1);
// numeric variable which contains the values copied from another variable
Var copy1 = VarDouble.copy(seq1);
// numeric variable with values copied from a collection
Normal normal = Normal.std();
List<Double> list1 = DoubleStream.generate(normal::sampleNext).limit(10).boxed().collect(Collectors.toList());
Var copy2 = VarDouble.copy(list1);
// numeric variable with values copied from a double array
Var copy3 = VarDouble.copy(1, 3, 4.0, 7);
// numeric variable with values copied from an int array
Var copy4 = VarDouble.copy(1, 3, 4, 7);
// numeric variables with values generated as the sqrt of the row number
Var from1 = VarDouble.from(10, Math::sqrt);
// numeric variable with values generated using a function which receives a row value
// as parameter and outputs a double value; in this case we generate values as
// a sum of the values of other two variables
Var from2 = VarDouble.from(4, row -> copy3.getDouble(row) + copy4.getDouble(row));
// numeric variable with values generated from values of another variable using
// a transformation provided via a lambda function
Var from3 = VarDouble.from(from1, x -> x + 1);
Wrapper around a double array
This builder creates a new numeric variable instance as a wrapper around a double array of values. Notice that this is not the same as the copy builder: with a wrapper, the variable and the original array share the same storage, so a change made through one is visible through the other. This is not true of the copy builder, which (as its name implies) creates an internal copy of the array.
double[] array = DoubleArrays.newFrom(1, 5, x -> x*x);
Var wrap1 = VarDouble.wrap(array);
wrap1.printString();
array[2] = 17;
wrap1.printString();
VarDouble [name:"?", rowCount:4, values: 1.0, 4.0, 9.0, 16.0]
VarDouble [name:"?", rowCount:4, values: 1.0, 4.0, 17.0, 16.0]
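The same wrap-versus-copy behavior can be reproduced with bare Java arrays, which may help to see why the wrapper reflects later changes. This sketch does not call rapaio; `WrapVersusCopy` and its two helper methods are hypothetical stand-ins for the two builders.

```java
import java.util.Arrays;

// Plain-Java illustration of sharing versus copying an array; not rapaio code.
final class WrapVersusCopy {

    // "Wrapping" keeps a reference to the very same array object.
    static double[] wrap(double[] array) {
        return array;
    }

    // "Copying" allocates a new array with the same values.
    static double[] copy(double[] array) {
        return array.clone();
    }

    public static void main(String[] args) {
        double[] array = {1, 4, 9, 16};
        double[] wrapped = wrap(array);
        double[] copied = copy(array);
        array[2] = 17;
        System.out.println(Arrays.toString(wrapped)); // [1.0, 4.0, 17.0, 16.0]
        System.out.println(Arrays.toString(copied));  // [1.0, 4.0, 9.0, 16.0]
    }
}
```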
Printing variables¶
Most objects which contain information implement the Printable interface. This interface allows one to display a summary of the content of the given object. This is also the case for numerical variables. Additionally, numerical variables implement two other methods: one which displays all the values and another which displays only part of them.
// build a numerical variable with values as the sqrt
// of the first 200 integer values
Var x = VarDouble.from(200, Math::sqrt).name("x");
// prints the text produced by toString
x.printString();
// prints a reasonable part of the values
x.printContent();
// print all values of the variable
x.printFullContent();
// print a summary of the content of the variable
x.printSummary();
VarDouble [name:"x", rowCount:200, values: 0.0, 1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979, 2.449489742783178, 2.6457513110645907, 2.8284271247461903, 3.0, ..., 14.071247279470288, 14.106735979665885]

VarDouble [name:"x", rowCount:200]
row value  row value  row value  row value  row value  row value
[0] 0  [17] 4.123105625617661  [34] 5.830951894845301  [51] 7.14142842854285  [68] 8.246211251235321  [185] 13.601470508735444
[1] 1  [18] 4.242640687119285  [35] 5.916079783099616  [52] 7.211102550927978  [69] 8.306623862918075  [186] 13.638181696985855
[2] 1.4142135623730951  [19] 4.358898943540674  [36] 6  [53] 7.280109889280518  [70] 8.366600265340756  [187] 13.674794331177344
[3] 1.7320508075688772  [20] 4.47213595499958  [37] 6.082762530298219  [54] 7.3484692283495345  [71] 8.426149773176359  [188] 13.711309200802088
[4] 2  [21] 4.58257569495584  [38] 6.164414002968976  [55] 7.416198487095663  [72] 8.48528137423857  [189] 13.74772708486752
[5] 2.23606797749979  [22] 4.69041575982343  [39] 6.244997998398398  [56] 7.483314773547883  [73] 8.54400374531753  [190] 13.784048752090222
[6] 2.449489742783178  [23] 4.795831523312719  [40] 6.324555320336759  [57] 7.54983443527075  [74] 8.602325267042627  [191] 13.820274961085254
[7] 2.6457513110645907  [24] 4.898979485566356  [41] 6.4031242374328485  [58] 7.615773105863909  [75] 8.660254037844387  [192] 13.856406460551018
[8] 2.8284271247461903  [25] 5  [42] 6.48074069840786  [59] 7.681145747868608  [76] 8.717797887081348  [193] 13.892443989449804
[9] 3  [26] 5.0990195135927845  [43] 6.557438524302  [60] 7.745966692414834  [77] 8.774964387392123  [194] 13.92838827718412
[10] 3.1622776601683795  [27] 5.196152422706632  [44] 6.6332495807108  [61] 7.810249675906654  [78] 8.831760866327848  [195] 13.96424004376894
[11] 3.3166247903554  [28] 5.291502622129181  [45] 6.708203932499369  [62] 7.874007874011811  ... ...  [196] 14
[12] 3.4641016151377544  [29] 5.385164807134504  [46] 6.782329983125268  [63] 7.937253933193772  [180] 13.416407864998739  [197] 14.035668847618199
[13] 3.605551275463989  [30] 5.477225575051661  [47] 6.855654600401044  [64] 8  [181] 13.45362404707371  [198] 14.071247279470288
[14] 3.7416573867739413  [31] 5.5677643628300215  [48] 6.928203230275509  [65] 8.06225774829855  [182] 13.490737563232042  [199] 14.106735979665885
[15] 3.872983346207417  [32] 5.656854249492381  [49] 7  [66] 8.12403840463596  [183] 13.527749258468683
[16] 4  [33] 5.744562646538029  [50] 7.0710678118654755  [67] 8.18535277187245  [184] 13.564659966250536

VarDouble [name:"x", rowCount:200]
row value  row value  row value  row value  row value  row value
[0] 0  [34] 5.830951894845301  [68] 8.246211251235321  [102] 10.099504938362077  [136] 11.661903789690601  [170] 13.038404810405298
[1] 1  [35] 5.916079783099616  [69] 8.306623862918075  [103] 10.14889156509222  [137] 11.704699910719626  [171] 13.076696830622021
[2] 1.4142135623730951  [36] 6  [70] 8.366600265340756  [104] 10.198039027185569  [138] 11.74734012447073  [172] 13.114877048604
[3] 1.7320508075688772  [37] 6.082762530298219  [71] 8.426149773176359  [105] 10.246950765959598  [139] 11.789826122551595  [173] 13.152946437965905
[4] 2  [38] 6.164414002968976  [72] 8.48528137423857  [106] 10.295630140987  [140] 11.832159566199232  [174] 13.19090595827292
[5] 2.23606797749979  [39] 6.244997998398398  [73] 8.54400374531753  [107] 10.344080432788601  [141] 11.874342087037917  [175] 13.228756555322953
[6] 2.449489742783178  [40] 6.324555320336759  [74] 8.602325267042627  [108] 10.392304845413264  [142] 11.916375287812984  [176] 13.2664991614216
[7] 2.6457513110645907  [41] 6.4031242374328485  [75] 8.660254037844387  [109] 10.44030650891055  [143] 11.958260743101398  [177] 13.30413469565007
[8] 2.8284271247461903  [42] 6.48074069840786  [76] 8.717797887081348  [110] 10.488088481701515  [144] 12  [178] 13.341664064126334
[9] 3  [43] 6.557438524302  [77] 8.774964387392123  [111] 10.535653752852738  [145] 12.041594578792296  [179] 13.379088160259652
[10] 3.1622776601683795  [44] 6.6332495807108  [78] 8.831760866327848  [112] 10.583005244258363  [146] 12.083045973594572  [180] 13.416407864998739
[11] 3.3166247903554  [45] 6.708203932499369  [79] 8.888194417315589  [113] 10.63014581273465  [147] 12.12435565298214  [181] 13.45362404707371
[12] 3.4641016151377544  [46] 6.782329983125268  [80] 8.94427190999916  [114] 10.677078252031311  [148] 12.165525060596439  [182] 13.490737563232042
[13] 3.605551275463989  [47] 6.855654600401044  [81] 9  [115] 10.723805294763608  [149] 12.206555615733702  [183] 13.527749258468683
[14] 3.7416573867739413  [48] 6.928203230275509  [82] 9.055385138137417  [116] 10.770329614269007  [150] 12.24744871391589  [184] 13.564659966250536
[15] 3.872983346207417  [49] 7  [83] 9.1104335791443  [117] 10.816653826391969  [151] 12.288205727444508  [185] 13.601470508735444
[16] 4  [50] 7.0710678118654755  [84] 9.16515138991168  [118] 10.862780491200215  [152] 12.328828005937952  [186] 13.638181696985855
[17] 4.123105625617661  [51] 7.14142842854285  [85] 9.219544457292887  [119] 10.908712114635714  [153] 12.36931687685298  [187] 13.674794331177344
[18] 4.242640687119285  [52] 7.211102550927978  [86] 9.273618495495704  [120] 10.954451150103322  [154] 12.409673645990857  [188] 13.711309200802088
[19] 4.358898943540674  [53] 7.280109889280518  [87] 9.327379053088816  [121] 11  [155] 12.449899597988733  [189] 13.74772708486752
[20] 4.47213595499958  [54] 7.3484692283495345  [88] 9.38083151964686  [122] 11.045361017187261  [156] 12.489995996796797  [190] 13.784048752090222
[21] 4.58257569495584  [55] 7.416198487095663  [89] 9.433981132056603  [123] 11.090536506409418  [157] 12.529964086141668  [191] 13.820274961085254
[22] 4.69041575982343  [56] 7.483314773547883  [90] 9.486832980505138  [124] 11.135528725660043  [158] 12.569805089976535  [192] 13.856406460551018
[23] 4.795831523312719  [57] 7.54983443527075  [91] 9.539392014169456  [125] 11.180339887498949  [159] 12.609520212918492  [193] 13.892443989449804
[24] 4.898979485566356  [58] 7.615773105863909  [92] 9.591663046625438  [126] 11.224972160321824  [160] 12.649110640673518  [194] 13.92838827718412
[25] 5  [59] 7.681145747868608  [93] 9.643650760992955  [127] 11.269427669584644  [161] 12.68857754044952  [195] 13.96424004376894
[26] 5.0990195135927845  [60] 7.745966692414834  [94] 9.695359714832659  [128] 11.313708498984761  [162] 12.727922061357855  [196] 14
[27] 5.196152422706632  [61] 7.810249675906654  [95] 9.746794344808963  [129] 11.357816691600547  [163] 12.767145334803704  [197] 14.035668847618199
[28] 5.291502622129181  [62] 7.874007874011811  [96] 9.797958971132712  [130] 11.40175425099138  [164] 12.806248474865697  [198] 14.071247279470288
[29] 5.385164807134504  [63] 7.937253933193772  [97] 9.848857801796104  [131] 11.445523142259598  [165] 12.84523257866513  [199] 14.106735979665885
[30] 5.477225575051661  [64] 8  [98] 9.899494936611665  [132] 11.489125293076057  [166] 12.884098726725126
[31] 5.5677643628300215  [65] 8.06225774829855  [99] 9.9498743710662  [133] 11.532562594670797  [167] 12.922847983320086
[32] 5.656854249492381  [66] 8.12403840463596  [100] 10  [134] 11.575836902790225  [168] 12.96148139681572
[33] 5.744562646538029  [67] 8.18535277187245  [101] 10.04987562112089  [135] 11.61895003862225  [169] 13

> summary(name: x, type: DOUBLE)
rows: 200, complete: 200, missing: 0
x [dbl]
Min. : 0.0000000
1st Qu. : 7.0533009
Median : 9.9749372
Mean : 9.3917104
2nd Qu. : 12.2167789
Max. : 14.1067360
WS.image(points(x, VarDouble.from(x.size(), () -> Normal.std().sampleNext()), fill(1), pch.circleFull()), 600, 300);
VarOp interface¶
There are various mathematical operations available under the VarOp interface. Those operators can be reached by calling the op() method on any variable. The following examples use some of those operators.
// computes the sum of all values in variable
x.dv().nansum();
1878.3420754046178
// apply a lambda function on a copy of the variable
x.copy().dv().apply(v -> Math.sqrt(v + 3./8)).printContent();
[0] 0.6123724356957945  [6] 1.6806218321749773  [12] 1.9593625532651568  [18] 2.1488696300891044
[1] 1.1726039399558574  [7] 1.7380308717236845  [13] 1.9951318942526053  [19] 2.175752500524973
[2] 1.337614878196671  [8] 1.7898120361496597  [14] 2.0289547522736777  ... ...
[3] 1.4515683957598682  [9] 1.8371173070873836  [15] 2.061063644385446  [198] 3.8008219215677927
[4] 1.541103500742244  [10] 1.8807651794331954  [16] 2.091650066335189  [199] 3.80548761391571
[5] 1.6158799390733798  [11] 1.9213601407220355  [17] 2.120873788233911
// add a constant to all values of a copy
x.copy().dv().add(Math.E);
DVectorDense{size:200, values:[2.7182818,3.7182818,4.1324954,4.4503326,4.7182818,4.9543498,5.1677716,5.3640331,5.546709,5.7182818,5.8805595,6.0349066,6.1823834,6.3238331,6.4599392,6.5912652,6.7182818,6.8413875,6.9609225,7.0771808,...]}
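For readers who want to check the numbers above independently, the following plain-Java sketch recomputes the sum and the element-wise transformation with standard library calls only. It is not the library's implementation; `VectorOpsSketch` and its method names are hypothetical, and the last printed digits may differ slightly from the outputs above depending on summation order.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;
import java.util.stream.IntStream;

// Plain-Java illustration of sum and element-wise apply; not rapaio code.
final class VectorOpsSketch {

    // Square roots of the row numbers 0..n-1, the same values as the variable x above.
    static double[] sqrtOfRows(int n) {
        return IntStream.range(0, n).mapToDouble(i -> Math.sqrt(i)).toArray();
    }

    // Simple left-to-right sum of all values.
    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Applies f to every element of a new array and leaves the input untouched.
    static double[] applyOnCopy(double[] values, DoubleUnaryOperator f) {
        return Arrays.stream(values).map(f).toArray();
    }

    public static void main(String[] args) {
        double[] x = sqrtOfRows(200);
        System.out.println(sum(x)); // close to the 1878.342... shown above
        double[] y = applyOnCopy(x, v -> Math.sqrt(v + 3.0 / 8));
        System.out.println(y[0]);   // close to 0.6123724..., the first transformed value above
        System.out.println(x[0]);   // 0.0, the original array is unchanged
    }
}
```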
Nominal variables¶
Nominal variables are defined by VarNominal and contain string-valued categories. Nominal variables offer integer and label representations. The label representation describes the categories as labels or texts, while the integer representation is an integer index on categories. The index representation does not imply an order between categories/labels.
Various builders¶
Nominal variables can be built in various ways, and there are handy shortcuts for various scenarios.
// creates an empty nominal variable with provided levels
var nom1 = VarNominal.empty(10, "a", "b");
// note the first label which is a placeholder for missing values
nom1.levels();
[?, a, b]
VarNominal.from(10, row -> row % 2 == 0 ? "even" : "odd").printString();
VarNominal [name:"?", rowCount:10, values: even, odd, even, odd, even, odd, even, odd, even, odd]
VarNominal.copy("a", "b", "c", "b").printContent()
VarNominal [name:"?", rowCount:4]
row value
[0] a
[1] b
[2] c
[3] b
Unique.of(VarNominal.copy("a","b","c","b"))
UniqueLabel{count=3, values=[a:1,b:2,c:1]}
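The relation between the label and integer representations can be sketched as a small dictionary encoding in plain Java. This is only an illustration under the assumption, suggested by the levels() output above, that index 0 is reserved for the missing placeholder `?`; `NominalSketch` is a hypothetical class and not the VarNominal implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of label/index encoding; NominalSketch is hypothetical, not rapaio code.
final class NominalSketch {
    // Index 0 is reserved for the missing-value placeholder.
    private final List<String> levels = new ArrayList<>(List.of("?"));
    private final int[] indexes;

    NominalSketch(String... labels) {
        indexes = new int[labels.length];
        for (int i = 0; i < labels.length; i++) {
            int idx = levels.indexOf(labels[i]);
            if (idx < 0) {               // first time this label is seen: register a new level
                levels.add(labels[i]);
                idx = levels.size() - 1;
            }
            indexes[i] = idx;
        }
    }

    List<String> levels() {
        return levels;
    }

    int getInt(int row) {
        return indexes[row];
    }

    String getLabel(int row) {
        return levels.get(indexes[row]);
    }

    public static void main(String[] args) {
        NominalSketch nom = new NominalSketch("a", "b", "c", "b");
        System.out.println(nom.levels());    // [?, a, b, c]
        System.out.println(nom.getInt(3));   // 2
        System.out.println(nom.getLabel(3)); // b
    }
}
```

The indexes only identify categories; comparing them says nothing about an ordering of the labels.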
Overview of variable types¶
All variables implement a common API, which makes it easy to manipulate their content in a generic way. However, depending on type, some variables might not implement some operations, or might fall back on specific implementations which make sense for that variable type. For example, for a numeric variable it makes sense to set the double value at a specific index in order to change it. For nominal variables the same operation does not make sense as such. Instead of meaning 'change the numerical value at the given position', it has the following semantics: 'change the string value to the category associated with the rounded integer value of the double parameter'. Let's see an example:
// we create a nominal variable with label `a` in the first position and label `b` in the second position
var nom = VarNominal.copy("a", "b");
nom.printString();
// set the value in the first position to the label which corresponds to index 2, which is `b`
nom.setDouble(0, 2.1);
// let's see the result
nom.printString();
VarNominal [name:"?", rowCount:2, values: a, b]
VarNominal [name:"?", rowCount:2, values: b, b]
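The 'rounded integer value' semantics can be traced by hand with a few lines of plain Java. The sketch assumes levels `?`, `a`, `b` at indexes 0, 1, 2 and uses `Math.rint` for rounding; the exact rounding rule used by the library is an assumption, and `SetDoubleOnNominalSketch` is a hypothetical name.

```java
// Plain-Java illustration of setting a nominal value through a double; not rapaio code.
final class SetDoubleOnNominalSketch {

    // Rounds the double to the nearest integer index (rounding rule is an assumption).
    static int indexFor(double value) {
        return (int) Math.rint(value);
    }

    // Stores the rounded index at the given row.
    static void setDouble(int[] indexes, int row, double value) {
        indexes[row] = indexFor(value);
    }

    // Looks up the label for the index stored at the given row.
    static String getLabel(String[] levels, int[] indexes, int row) {
        return levels[indexes[row]];
    }

    public static void main(String[] args) {
        String[] levels = {"?", "a", "b"};
        int[] indexes = {1, 2};                            // labels: a, b
        setDouble(indexes, 0, 2.1);                        // 2.1 rounds to index 2
        System.out.println(getLabel(levels, indexes, 0));  // b
        System.out.println(getLabel(levels, indexes, 1));  // b
    }
}
```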
Missing values¶
All variables offer an API for missing values. A missing value is a special value used as a placeholder for an unspecified value. We can have missing values for various reasons. There are cases when the data set does not contain some values because the experimenter did not collect them. Sometimes having a value does not make sense: for example, a male cannot be pregnant, so any metric related to pregnancy has a missing value for male subjects. Missing values can also appear as an effect of data manipulation operations, like joining two data frames whose rows do not match one-to-one.
Missing values are different for each representation, which makes sense since a double value has a different type than a String.
WS.println(VarDouble.MISSING_VALUE);
WS.println(VarNominal.MISSING_VALUE);
WS.println(VarInt.MISSING_VALUE);
NaN
?
2.147483647E9
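Because the double placeholder is NaN, it cannot be detected with an ordinary equality test. The plain-Java sketch below shows the pitfall and the usual check; `MissingDoubleSketch` and its members are hypothetical names used for illustration, not library API.

```java
// Plain-Java illustration of NaN as a missing-value sentinel; not rapaio code.
final class MissingDoubleSketch {
    static final double MISSING = Double.NaN;

    // Wrong way: NaN is never equal to anything, including itself.
    static boolean equalsMissing(double value) {
        return value == MISSING;
    }

    // Right way: use the dedicated NaN test.
    static boolean isMissing(double value) {
        return Double.isNaN(value);
    }

    public static void main(String[] args) {
        System.out.println(equalsMissing(MISSING)); // false
        System.out.println(isMissing(MISSING));     // true
        System.out.println(isMissing(1.5));         // false
    }
}
```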
Most of the time we do not have to deal directly with the missing value placeholders, since the Var and Frame interfaces offer a way to handle missing values gracefully. Below is an illustrative example:
var x = VarDouble.seq(10).name("x");
x.printString();
VarDouble [name:"x", rowCount:11, values: 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
// set the values at indexes 2, 4 and 6 to missing
x.setMissing(2);
x.setMissing(4);
x.setMissing(6);
x.printString();
VarDouble [name:"x", rowCount:11, values: 0.0, 1.0, ?, 3.0, ?, 5.0, ?, 7.0, 8.0, 9.0, 10.0]
// count the number of non missing values
x.stream().complete().count();
8
// compute the sum of all non missing values
x.dv().nansum();
43.0
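The two results above (8 complete values, sum 43.0) can be reproduced with standard Java streams by filtering out NaN before aggregating. This is a sketch of the idea rather than the library's code; `SkipMissingSketch` and its method names are hypothetical.

```java
import java.util.Arrays;

// Plain-Java illustration of aggregations that skip missing (NaN) values; not rapaio code.
final class SkipMissingSketch {

    // Number of values that are not NaN.
    static long countComplete(double[] values) {
        return Arrays.stream(values).filter(v -> !Double.isNaN(v)).count();
    }

    // Sum of the values that are not NaN.
    static double nanSum(double[] values) {
        return Arrays.stream(values).filter(v -> !Double.isNaN(v)).sum();
    }

    public static void main(String[] args) {
        double[] x = new double[11];
        for (int i = 0; i < x.length; i++) {
            x[i] = i;                       // 0.0, 1.0, ..., 10.0
        }
        x[2] = Double.NaN;
        x[4] = Double.NaN;
        x[6] = Double.NaN;
        System.out.println(countComplete(x)); // 8
        System.out.println(nanSum(x));        // 43.0
    }
}
```

Without the filter, a single NaN would turn the whole sum into NaN, which is why a NaN-aware sum is the convenient default for data with gaps.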
Var Iterators¶
Each Var allows easy data manipulation through iterators. There is a generic construct available for each variable implementation, in the form of VSpot. In the terminology used in rapaio, a spot is a position in a variable which can contain a value. Since that value can have different representations, the VSpot interface is used to manipulate what happens at a given position. Additionally, there are various iterators which can be used for other data representations. The examples below are illustrative:
var d = VarDouble.seq(3).name("d");
// iterate through double values, we can do this because we have the specific type VarDouble
for(double value : d) WS.println(value)
0.0
1.0
2.0
3.0
// for each spot print the string representation
d.forEachSpot(s -> WS.println(s.getLabel()))
0.0
1.0
2.0
3.0
// compute the sum using spot iterator and streaming API
WS.println(d.stream().mapToDouble(VSpot::getDouble).sum());
WS.println(d.dv().nansum());
6.0
6.0
// display the row indexes of all values which are not missing
var y = VarDouble.copy(1, Double.NaN, 2, Double.NaN, 3);
// collect indexes to an int array
int[] indexes = y.stream().filter(s -> !s.isMissing()).mapToInt(s -> s.row()).toArray();
// create a var int wrapper to see the content
VarInt.wrap(indexes).printString();
VarInt [name:"?", rowCount:3, values: 0, 2, 4]
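The last example, collecting the rows that hold a value, has a direct plain-Java counterpart: iterate over row positions rather than values, so that each position can be tested and its index kept. The sketch below does not use VSpot; `CompleteRowsSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Plain-Java illustration of iterating over positions instead of values; not rapaio code.
final class CompleteRowsSketch {

    // Returns the row indexes whose value is not NaN.
    static int[] completeRows(double[] values) {
        return IntStream.range(0, values.length)
                .filter(row -> !Double.isNaN(values[row]))
                .toArray();
    }

    public static void main(String[] args) {
        double[] y = {1, Double.NaN, 2, Double.NaN, 3};
        System.out.println(Arrays.toString(completeRows(y))); // [0, 2, 4]
    }
}
```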