Data Handling Overview¶
%load ../../rapaio-bootstrap.ipynb
Add /home/ati/work/rapaio/rapaio-core/target/rapaio-core-5.0.1.jar to classpath
Data types, arrays and collections¶
Most programming languages have constructs for data and instructions or commands. This is true for both low and high level languages. If the language does not provide those kind of constructs, commonly used libraries, like system libraries supply them. This is true also for Java language. We start our discussion about data handling from what the language itself provides.
Java provides value and reference data types. There are few values types like: double
, int
, char
and so on. For each value type there is also a corresponding refference boxing type like Double
, Integer
, Character
. Since Java language is an object oriented language, it should be apparent the reason for existence of the corresponding refference types: abstractization and encapsulation. If we add how the generics were implemented in language on those reasons we understood why we have collections implemented as they are, with their rich offerings in terms of functionality. Still, value types are part of the language together with the construct called array. They seem so dulled and non idiomatic, so one may ask why there are still part of the language. The main reason of their existence is performance: both in terms of memory and computation.
Rapaio uses under the hood value types and arrays to achieve speed and memory efficience, carrying the burden of writting much more non idiomatic code. For example there are many vaue types array features in utilitary classes like rapaio.util.collection.DoubleArrays
and similar. Those utilitary classes exists in order to offer support for high order constructs with compact and performant implementations.
Working directly with value type arrays, even having at one's disposal a rich set of features, claims a lot of effort and attention. The resulting code would be probably very hard to write or read. Thus, Rapaio uses for most of the algorithms implemented higher constructs like rapaio.data.Var
and rapaio.data.Frame
to offer a rich and flexible set of features, easy to be used and leveraged further. Those constructs hides the complexity and intricaties implied at implementation times dues to working directly with value types vectors, without performance penalties.
Because those abstractions works directly with value type arrays, a lot of care was spent to offer many easy ways to transform data from value type array into variables, frames, vectors and matrices and viceversa, sometimes avoiding copying the data entirely. Java collections are also well conected with Rapaio data structures, thus being very easy to transfer data to and from those collections.
Var and Frame¶
There are two ubiquous data structures used all over the place in this library: variables and frames. A variable is a list of values of the same type which implements the Var
interface. You can think of a variable as a higher abstractization of an array because all values have the same type and there is random acces available. In fact, sometimes, the arrays are the only data storage for a Var
implementation. Another way to look at a variable is like a column of a table.
A set of variables makes a data frame, described by the interface rapaio.data.Frame
. A frame is a table having observations as rows and variables as columns.
Let's take a simple example. We will load the iris data set, which is provided by the the library.
Frame df = Datasets.loadIrisDataset();
df.printSummary();
Frame Summary ============= * rowCount: 150 * complete: 150/150 * varCount: 5 * varNames: 0. sepal-length : dbl | 3. petal-width : dbl | 1. sepal-width : dbl | 4. class : nom | 2. petal-length : dbl | * summary: sepal-length [dbl] sepal-width [dbl] petal-length [dbl] petal-width [dbl] class [nom] Min. : 4.3000000 Min. : 2.0000000 Min. : 1.0000000 Min. : 0.1000000 versicolor : 50 1st Qu. : 5.1000000 1st Qu. : 2.8000000 1st Qu. : 1.6000000 1st Qu. : 0.3000000 setosa : 50 Median : 5.8000000 Median : 3.0000000 Median : 4.3500000 Median : 1.3000000 virginica : 50 Mean : 5.8433333 Mean : 3.0573333 Mean : 3.7580000 Mean : 1.1993333 2nd Qu. : 6.4000000 2nd Qu. : 3.3000000 2nd Qu. : 5.1000000 2nd Qu. : 1.8000000 Max. : 7.9000000 Max. : 4.4000000 Max. : 6.9000000 Max. : 2.5000000
Frame summary is a simple way to see some general information about a data frame. We see the data frame contains $150$ observations/rows and $5$ variables/columns.
The listing continues with name and type of the variables. Notice that there are four double variables and one nominal variable, named class
.
The summary listing ends with a section which describes each variable. For double variables the summary contains the well-known six number summary. We have there the minimum and maximum values, median, first and third quartile and the mean. For nominal values we have an enumeration of the first most frequent levels and the associated counts. For our class
variable we see that there are three levels, each with $50$ instances.
Variables¶
In statistics a variable has multiple meanings. A random variable can be thought as a process which produces values according to a distribution. Var
objects models the concept of values drawn from a unidimensional random variable, in other words a sample of values. As a consequence a Var
object has a size and uses integer indices to access values from sample.
Var
interface declares methods useful for various kinds of tasks:
- manipulate values from the variable by adding, removing, inserting and updating with different representations
- naming a variable offers an alternate way to identify a variable into a frame and it is also useful as output information
- manipulate sets of values by with concatenation and mapping
- streaming allows traversal of variables by java streams
- numeric computations like mathematical operations or variuos statistical interesting characteristics
- unique value and grouping with a flexible grammar and short syntax
- other tools like deep copy, deep compare, summary, etc
VarType: storage and representation of a variable¶
There are two main concepts which have to be understood when working with variables: storage and representation. All the variables are able to store data internally using a Java data types like double
, int
, String
, etc. In the same time, the data from variables can be represented in different ways, all of them being available through the Var
interface for all types of variables.
However not all the representations are possible for all types of variables, because some of them does not make sense. For example double floating values can be represented as strings, which is fine, however strings in general cannot be represented as double values.
These are the following data representations all the Var
-iables can implement:
- double - double
- int - int
- long - long
- instant - Instant
- label - String
The Var
interface offers methods to get/update/insert values for all those data representations. Not all data representations are available for all variables. For example the label representation is available for all sort of variables. This is acceptable, since when storing information into a text-like data format, any data type should be transformed into a string and also should be able to be read from a string representation.
To accomodate all those legal possibilities, the rapaio library has a set of predefined variable types, which can be found in the enum VarType
.
The defined variable types:
VType | Var class | Description |
---|---|---|
BINARY | VarBinary | Binary variable represented as int values, internally uses bitsets for efficient memory usage |
INT | VarInt | Integer variable represented and stored internally as int |
NOMINAL | VarNominal | Categorical variable represented as string from a predefined set, with no ordering (for example: male, female) |
DOUBLE | VarDouble | Double variable represented and stored internally as double precision floating point values |
LONG | VarLong | Long variable represented and stored internally as long 8-byte signed integer values |
INSTANT | VarInstant | Instant variable represented and stored as datetime instant |
STRING | VarString | String variable used for manipulation of text with free form |
A data type is important for the following reasons:
- gives a certain useful meaning for variables in such a way that machine learning or statistical algorithms can leverage to maximum potential the meta information about variables
- encapsulates the stored data type artifacts and hide those details from the user, while allowing the usage of a single unified interface for all variables
Frames¶
The rapaio.data.Frame
is the most common way to handle data. A data frame is a collection of variables organized in rows and columns. One can see the rows of a data frame as instances or observations having various attributes. The columns of a data frame are variables described by rapaio.data.Var
.
There is a rich collection of methods offered by a data frame, allowing one to manipulate data in various ways for many purposes. Among those facilities to manipulate data one can have: views by data filtering, frame composition by merging rows or variables or other frames, grouping and group aggregates, streaming and so on.
Solid frames and view frames¶
There are more than one implementation of a data frame for different purposes. The most encountered data frame is a SolidFrame
. Solid data frames represents data stored in dense solid variables. This allows fast operations since data is stored compact in memory and it is ussually produced when data is loaded from a data storage, by allocating space for new data or by creating a new copy of data from a view frame. A view frame is a type of data frame which does not store itself data, but it is a wrapper on data stored somewhere else, usually in other data frames.
During data handling operations the library tries to not create copies of data, if possible. This give some freedom to the user to decide when data should be copied. We can illustrate with an example:
// create a solid data frame with one variable
var solid = SolidFrame.byVars(VarDouble.copy(1, 2, 3, 4, 5, 6).name("x"));
solid.printString();
// filter some rows based on values and obtain a view filter
var view = solid.stream().filter(s -> s.getDouble("x") % 2 == 0).toMappedFrame();
view.printString();
// we will bind the rows with old data frame
var bound = solid.bindRows(view);
bound.printString();
// we can create o copy of the data to not alter the original one with updates
var copy = bound.copy();
copy.printString();
rapaio.data.SolidFrame(rowCount=6, varCount=1) VarDouble [name:"x", rowCount:6, values: 1.0, 2.0, 3.0, 4.0, 5.0, 6.0] rapaio.data.MappedFrame(rowCount=3, varCount=1) MappedVar[type=dbl, name:x, rowCount:3] rapaio.data.BoundFrame(rowCount=9, varCount=1) BoundVar(type=dbl) [name:"x", rowCount:9, values: 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.0, 4.0, 6.0] rapaio.data.SolidFrame(rowCount=9, varCount=1) VarDouble [name:"x", rowCount:9, values: 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.0, 4.0, 6.0]
Frames allows manipulation of data sets in a flexible way to infer various knowledge from data. Those abjects are used for most of the machine learning models and for input and output facilities.
Variables and frames are also used in almost all tools from statistical hypothesis tests to graphical plots. There is also the possibility to work with vectors and matrices, objects used in linear algebra. Transition between those types of data structures can be done easily, even if sometimes data transformation can happen due to linear algebra object constrains (all values from a vector or matrix should have the same type).