Reading and Writing data

%load ../../rapaio-bootstrap.ipynb

Adding dependency io.github.padreati:rapaio-lib:7.0.1
Solving dependencies
Resolved artifacts count: 1
Add to classpath: /home/ati/work/rapaio-jupyter-kernel/target/mima_cache/io/github/padreati/rapaio-lib/7.0.1/rapaio-lib-7.0.1.jar

CSV - reading and writing

Reading and writing CSV files is an important feature for any system that works with data, because the file format is both simple and popular.

The Rapaio library supports both reading and writing operations, with many features and a lot of flexibility. However, a file is read only into a data frame, and a CSV file is written only from a data frame. This might look like a constraint at first, but it comes naturally since both are tabular data. The only difference is that one lives in the memory of a program and the other is persisted on disk.

Simple read/write data frames from/into csv files

We can read a file with the default options simply by instantiating a rapaio.io.Csv object and calling one of its read methods.

Frame iris = Csv.instance().read(Datasets.class, "iris-r.csv");

We select a few rows and inspect their content:

// use only a few rows
iris.mapRows(0, 1, 50, 51, 100, 101).printContent(textWidth(-1));

    sepal-length sepal-width petal-length petal-width   class    
[0]     5.1          3.5         1.4          0.2         setosa 
[1]     4.9          3           1.4          0.2         setosa 
[2]     7            3.2         4.7          1.4     versicolor 
[3]     6.4          3.2         4.5          1.5     versicolor 
[4]     6.3          3.3         6            2.5      virginica 
[5]     5.8          2.7         5.1          1.9      virginica 

Persisting a data frame in CSV file format is also simple. We instantiate a rapaio.io.Csv object and call one of the write methods:

Csv.instance().write(iris, "/tmp/iris.csv");

If we open the /tmp/iris.csv file in an editor, we find content like the following:

sepal-length,sepal-width,petal-length,petal-width,class
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6,2.5,virginica
5.8,2.7,5.1,1.9,virginica

Various parameters

A lot of customization is possible when reading and writing CSV files. We describe some of the options here:

stripSpaces: Boolean flag which configures white space trimming for field values. If trimming is enabled, white space characters are removed from the start and end of each field value.
header: Boolean flag which, if set, treats the first row of the CSV file as containing the variable names. Default value is false.
quotes: Boolean flag which specifies whether the values are quoted. If enabled, quote characters are trimmed at read time and added at write time. Default value is false.
separatorChar: Character used to separate field values.
escapeChar: Escape character used. If this feature is turned on, the escape characters are discarded at read time. This is useful if the separator character is used inside string field values, for example.
types: Specific field types which override the automatic field type detection.
naValues: Values used to identify missing value placeholders. Default values include: “?”, “”, “ ”, “na”, “N/A”, “NaN”.
defaultTypes: List of field types to be tried, in the given order, during automatic field type detection.
startRow: Specifies the first row number to be collected from the CSV file. By default, this value is 0, which means it will collect starting from the first row. If the value is greater than 0, it will skip the first startRow-1 rows.
endRow: Specifies the last row number to be collected from the CSV file. By default, this value is Integer.MAX_VALUE, which means all rows from the file.
skipRows: Predicate used to filter out rows. All row indexes matched by this predicate will not be read.
skipCols: Predicate used to filter out columns. All column indexes matched by this predicate will not be read.
template: Optional frame template used to define variable names and types for reading. This overrides the automatic detection of field names and field types.
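
To make the field-level options more concrete, here is a minimal sketch in plain Java of how separatorChar, stripSpaces, quotes and naValues could interact when a single line is split into values. This is not Rapaio's parser: the class CsvFieldSketch and its methods are hypothetical names introduced only for illustration, and escape characters and type detection are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: this is NOT Rapaio's CSV parser.
class CsvFieldSketch {

    /**
     * Splits one line into field values.
     * A value found in naValues is returned as null (a missing value).
     */
    static List<String> parseLine(String line, char separatorChar, boolean stripSpaces,
                                  boolean quotes, Set<String> naValues) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (quotes && c == '"') {
                // quote characters delimit a value and are dropped at read time
                insideQuotes = !insideQuotes;
            } else if (c == separatorChar && !insideQuotes) {
                // a separator outside quotes closes the current field
                fields.add(clean(current.toString(), stripSpaces, naValues));
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // the last field has no trailing separator
        fields.add(clean(current.toString(), stripSpaces, naValues));
        return fields;
    }

    // Trims the raw value if requested, then maps NA placeholders to null.
    private static String clean(String raw, boolean stripSpaces, Set<String> naValues) {
        String value = stripSpaces ? raw.trim() : raw;
        return naValues.contains(value) ? null : value;
    }

    public static void main(String[] args) {
        // placeholders taken from the default naValues listed above
        Set<String> naValues = Set.of("?", "", " ", "na", "N/A", "NaN");
        // prints [5.1, null, a;b]
        System.out.println(parseLine(" 5.1 ; ? ;\"a;b\"", ';', true, true, naValues));
    }
}
```

In this sketch the quoted value keeps its embedded separator, while the "?" placeholder becomes a missing value after trimming.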

Various read and write methods for Csv

Java has a nice abstraction over data in the form of input and output streams. This is enough to let any tool read or write data from anywhere. We followed that line of thinking by implementing the following two methods on the Csv class:

public Frame read(InputStream inputStream) throws IOException
public void write(Frame df, OutputStream os) throws IOException

With these two methods we can read from anywhere and write to anywhere.
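
As an illustration of why a stream-based pair is sufficient, the sketch below uses only JDK classes to write some rows to an in-memory stream and read them back. The class StreamPairSketch and its readLines and writeLines methods are hypothetical stand-ins that handle plain lines instead of a Frame. The same calling pattern should carry over to the two Csv methods above, since they accept the same stream types.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream-based reader/writer pair; not Rapaio code.
class StreamPairSketch {

    // Reads all lines from any source that can be exposed as an InputStream.
    static List<String> readLines(InputStream inputStream) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Writes lines to any destination that can be exposed as an OutputStream.
    static void writeLines(List<String> lines, OutputStream os) throws IOException {
        Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);
        for (String line : lines) {
            writer.write(line);
            writer.write('\n');
        }
        writer.flush(); // flush here, but leave closing the stream to the caller
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = List.of("sepal-length,class", "5.1,setosa");

        // the destination is a byte array in memory; a file or socket works the same way
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        writeLines(rows, memory);

        // the source is the same bytes, wrapped as an input stream
        List<String> back = readLines(new ByteArrayInputStream(memory.toByteArray()));
        System.out.println(back.equals(rows)); // prints true
    }
}
```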

To simplify some common tasks there are some specialized forms of read and write:

Read from a file giving a File instance
Read from a file giving a String for path name
Read from a gz archive File instance
Read from a resource giving a Class and a String for the class and the name of the resource (this is useful when loading data from a jar or in tests)
Write …
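
The read forms listed above can be pictured as thin wrappers that open a suitable stream and delegate to the stream-based method. The following sketch shows that pattern with JDK classes only. The class ReadOverloadsSketch and its method names (including readGz) are assumptions made for illustration and do not claim to match the actual Csv signatures. It returns the raw text instead of a Frame.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch of convenience overloads; names are not Rapaio's API.
class ReadOverloadsSketch {

    // Core method: every other form ends up here.
    static String read(InputStream inputStream) throws IOException {
        return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Read from a file, given a File instance.
    static String read(File file) throws IOException {
        try (InputStream is = new FileInputStream(file)) {
            return read(is);
        }
    }

    // Read from a file, given its path name.
    static String read(String fileName) throws IOException {
        return read(new File(fileName));
    }

    // Read from a gz archive: the only change is the decompressing wrapper.
    static String readGz(File file) throws IOException {
        try (InputStream is = new GZIPInputStream(new FileInputStream(file))) {
            return read(is);
        }
    }

    // Read from a classpath resource located relative to the given class.
    static String read(Class<?> clazz, String resource) throws IOException {
        try (InputStream is = clazz.getResourceAsStream(resource)) {
            if (is == null) {
                throw new IOException("Resource not found: " + resource);
            }
            return read(is);
        }
    }
}
```

Keeping the parsing logic in a single stream-based method means each new source only needs a few lines that open the right stream.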

TODO