Data visualization

Data visualization#

We illustrate the core ideas behind the graphic system and to provide enough examples to have a fast and productive start of effective usage.

The design of the graphical components of this library is influenced by many ideas from existing popular graphical systems. Among the main inspirations there are some which deserves appropriate consideration: R standard graphical library, ggplot2 package and matplotlib from Python stack.

Workspace and Printer#

A printer can be instantiated by itself by a handy way of using it is by the means of the utility class called rapaio.sys.WS (which comes from Work Space). This is an utility class which offers static shortcut methods yo allow one to simulate a working session. This class has an instance of a printer implementation (which can be changes by WS.setPrinter method) and shortcuts for text output and drawing.

Note that it is not necessary to work with a workspace, the solely purpose is to offer shortcuts to various objects which one usually does not bother to manage during an interactive session, for example.

Drawing of graphical figures or text output is handled by classes which implements interface rapaio.printer.Printer. There are some built-in implementations of this interface. The default one is StandardPrinter, which allows printing text at console and drawing graphics in a Swing modal window. There are also some graphical tools which allows one to output the grated graph as a BufferedImage which can be desplayed in jupyter notebooks, like this one. Another used implementation is IdeaPrinter, which is similar with standard printer for text output, but for graphics drawing implements a TCP/IP protocol with serialization which allows plug-ins like rapaio-studio to handle graphics drawing (the rapaio-studio is now deprecated in favor or working with jupyter notebooks).

We are interested here in methods for drawing, so the workspace contains methods for this purpose. All of those methods works with a rapaio.graphics.Figure instance which contains all the indications regarding what kind of graphic should be built. Additionally, there are methods which allows one to obtain an image directly from a figure, or store an image directly on a disk file.

Here are the shortcut methods implemented in WS utility:

void draw(Figure figure) - draws a figure in a printing system; the size of drawing is either adaptive of has the default size of a graphical image contained in the printer implementation.
void draw(Figure figure, int width, int height) - draws a figure with dimensions specified by width and height
BufferedImage image(Figure figure) - builds an image from a figure with dimensions specified as default width and default height from printer system
BufferedImage image(Figure figure, int width, int height) - builds an image from a figure with specified dimensions
void saveImage(Figure figure, int width, int height, String fileName) - builds an image with specified dimensions by width and height from a given figure, and save the image into a png file with the specified file name
void saveImage(Figure figure, int width, int height, OutputStream os) - builds an image with specified dimensions by width and height, and send the image in png format to the specified output stream (could be a file or could be sent over network)

%load ../../rapaio-bootstrap.ipynb

Adding dependency io.github.padreati:rapaio-lib:7.0.1
Solving dependencies
Resolved artifacts count: 1
Add to classpath: /home/ati/work/rapaio-jupyter-kernel/target/mima_cache/io/github/padreati/rapaio-lib/7.0.1/rapaio-lib-7.0.1.jar

Now in order to build figures one can instantiate and compose figures directly in a pure Java way. As a general pattern, the libarry offers a shortcut graphical tool to make the life during an interactive analysis as easier as possible. The graphical factory shortcut tool is named rapaio.graphics.Plotter and contains a lot of helper methods. Here is an example:

// instantiate a data set from build in collection
Frame iris = Datasets.loadIrisDataset();

// draw a some overlapped histograms for some variables in the data set
var bins = bins(30);
WS.image(hist(iris.rvar("sepal-length"), 1, 8, bins, fill(NColor.navy), prob(true), alpha(0.5f))
        .hist(iris.rvar("petal-length"), 1, 8, bins, fill(NColor.darkorange), prob(true), alpha(0.5f))
        .legend(6, 20, labels("sepal-length", "petal-length"), fill(NColor.navy, NColor.darkorange))
        , 800, 600);

../_images/7a0cb3aab718c6458d0bd5b2e94e65dded4e9d78d399d1e86bbfed199f233e4c.png

Figure types#

Box plots#

A box plot is a standard way of displaying information about the distribution of a continuous data variable based on five data summary. The five data summary consists of: minimum, first quartile, median (second quartile), third quartile and maximum number summaries.

Box plots might be constructed in different manners by different authors. The common characteristics for all types of box-plots is the box. Which means that in all cases the bottom margin of the box lies on first quartile, the top margin of the box lies on third quartile and the line inside the box lies on median values.

The extensions which comes from the box might differ and rapaio system implements the version which it’s usually named: Tuckey’s box plot. What is specific to this box plot is that the whiskers lies at the datum still within 1.5 IQR (Interquartile range).

Any other points above or below whiskers are outliers. Outliers are of two different types:

extreme outliers - outliers which are at a distance greater or equal than 3*IQR
outliers - outliers which are at a distance greater or equal than 1.5*IQR

Example 1#

Goal: Draw one box plot for each numerical variable from iris data set. We want each box plot to have a different color and we want some de-saturated colors.

Solution:

WS.image(boxplot(iris.mapVars("0~3"), fill(1, 2, 3, 4), alpha(0.5f)))

../_images/ac1937dc4f9140d8b6be9ef63781cb2fea90a57a1b3ac8541547261fbb80bf99.png

iris.mapVars("0~3") - obtain a data set from iris data set, by keeping only the first 4 variables. We do that using the range notation (index of the start variable, concatenation symbol ~, index of the last variable inclusive). Variable indexes are 0 based.
color(1, 2, 3, 4) - we use colors from the current color palette, indexed with the specified integer values.
alpha(0.5f) - we de-saturate the drawing keeping only 0.3 of the actual color.

Example 2#

Goal: In order to identify the overlap between values of sepal-length variable from iris data set, we draw one box plot for each segment of the nominal class variable, and add a title

Solution:

WS.image(boxplot(iris.rvar("sepal-length"), iris.rvar("class"), fill(NColor.tab_blue))
         .title("sepal-length separation"));

../_images/e01212ffc0304924d0f9b3a1ad5fcb38dfd230af94218d504d3ac9bd99bffed3.png

iris.var("sepal-length") - is the variable named sepal-length from
iris.var("class") - specifies the segment discriminator, depending on the levels of this variable, the same number of levels will be created
.title("..") - adds a title to the box plot

XY Scatter plots#

We can study the relation between two numerical variables by drawing one point for each instance.

Example 1#

Goal: Study which is the relation between petal-length and sepal-length from iris data set.

Solution:

WS.image(points(iris.rvar("petal-length"), iris.rvar("petal-width")));

../_images/286d7213494cf4c51407f749f06406946225e22bfdf81d4d9b5a632677a071c8.png

iris.var("petal-length") - variable used to define horizontal axis
iris.var("petal-width") - variable used to define vertical axis

Example 2#

Goal: Study which is the relation between petal-length and sepal-length from iris data set. Color each point with a different color corresponding with value from variable class and add a legend for colors. The values will have additional noise.

Solution:

// make an altered copy of dataset with jitter noise added
Frame df = iris.copy().fapply(Jitter.on(new Random(123), Normal.of(0, 0.1), VarRange.of("0~3")));
// draw scatter plot 
WS.image(points(df.rvar("petal-length"), df.rvar("petal-width"), fill(iris.rvar("class")), pch.circleFull())
         .legend(1.5, 2.2, labels("setosa", "versicolor", "virginica")));

../_images/d23e7b0d08224ea0374c64b68ee17d7216cd345d95588ad1c31c472fb77de3fb.png

iris.getVar("petal-length") - variable used to define horizontal axis
iris.getVar("petal-width") - variable used to define vertical axis
color(iris.getVar("class")) - nominal variable which provides indexes to select colors from current palette
pch(2) - select the type of figure used to draw points (in this case is a circle filled with solid color and a black border)
legend(1.5, 2.2, labels("setosa", "versicolor", "virginica")) - adds a legend at the specific position specified in the data range; labels are specified by parameter, colors are taken as default from current palette starting with 1

Histogram#

A histogram is a graphical representation of the distribution of a continuous variable.

The histogram is only an estimation of the distribution. To construct a histogram you have to bin the range of values from the variable in a sequence of equal length intervals, and later on counting the values from each bin. Histograms can display counts, or can display proportions which are counts divided by the total number of values (use prob(true) parameter to have this effect).

Histograms use bins as input parameter. The bin’s width is computed. To compute the bin’s width we need also the range of values. This is computed automatically from data or it can be specified when the histogram is built.

We can omit the number of bins, in which case its value is estimated also from data. For estimation is used the Freedman-Diaconis rule. See Freedman-Diaconis wikipedia page for more details.

Example 1#

Goal: Build a histogram with default values to estimate the pdf of sepal-length variable from iris data set.

Solution:

WS.image(hist(iris.rvar("sepal-length"), fill(NColor.tab_green)));

../_images/fd4426c9c6f7f11789be778db0f0e7f691bc8f1c0ca922ac21c8e89ae93eba64.png

Example 2#

Goal: Build two overlapped histograms with default values to estimate the pdf of sepal-length and petal-length variables from iris data set. We want to get bins in range (0-10) of width 0.25, colored with red, and blue, with a big transparency for visibility

Solution:

WS.image(plot(alpha(0.7f))
.hist(iris.rvar("sepal-length"), 0, 10, bins(40), fill(NColor.tab_blue), alpha(0.5f))
.hist(iris.rvar("petal-length"), 0, 10, bins(40), fill(NColor.tab_red), alpha(0.5f))
.legend(8, 20, labels("sepal-length", "petal-length"), fill(NColor.tab_blue, NColor.tab_red))
.xLab("variable"));

../_images/a4efc627132d9565be63aea17428da26d8a19bac33008c54d92b32a512399398.png

plot(alpha(0.3f)) - builds an empty plot; this is used only to pass default values for alpha for all plot components, otherwise the plot construct would not be needed
hist - adds a histogram to the current plot
iris.getVar("sepal-length") - variable used to build histogram
0, 10 - specifies the range used to compute bins
bins(40) - specifies the number of bins for histogram
fill(1) - specifies the color to draw the histogram, which is the color indexed with 1 in color palette (in this case is red)
legend(7, 20, ...) - draws a legend at the specified coordinates, values are in the units specified by data
labels(..) - specifies labels for legend
color(1, 2) - specifies color for legend
xLab = specifies label text for horizontal axis

Density line#

A density line is a graphical representation of the distribution of a continuous variable. The density line is similar with a histogram in purpose but it has a different strategy to build the estimate. The density line plot component implements a kernel density estimator which is basically a non-parametric smoothing method, named also Parzen-Rosenblatt window method.

There are two main parameters for a density line: bandwidth and base density kernel function. The default bandwidth is computed according with Silverman’s rule of thumb (more details on Wikipedia kernel density page). The default kernel function is the Gaussian pdf.

Kernel function estimators can be constructed using various kernel functions like Gaussian, uniform, triangular, Epanechnikov, cosine, tricube, triweight, biweight. All of them are available in rapaio library and also some custom can be built.

Example 1#

Goal: Illustrate the process of building the KDE estimation

Solution:

// this is our sample
VarDouble x = VarDouble.wrap(-2.1, -1.3, -0.4, 1.9, 5.1, 6.2);

// declare a bandwidth for smoothing
double bw = 1.25;

// build a plot with a density line artist
Plot p = densityLine(x, bw);

// for each point draw a normal distribution
x.stream().forEach(s -> p.funLine(xi -> Normal.of(s.getDouble(), bw).pdf(xi) / x.size(), color(1), fill(2)));
WS.image(p);

../_images/fd1fd330f8ac9d868237798d523da81ff353e14abeb6a8037aeffa131f87c8b6.png

With red are depicted the kernel functions used to spread probability around sample points. With black is depicted the kernel density estimation which is the sum of all individual kernel functions.

Example 2#

Goal: Estimate density of iris sepal-length variable by histogram and density function

Solution:

WS.image(hist(iris.rvar("sepal-length"), bins(30), prob(true))
.densityLine(iris.rvar("sepal-length"), 0.1));

../_images/d05df34de63a4079e2d561c1498c5931db9e18cbe7ddca688dcc054c9956f2eb.png

hist(..) - builds a histogram
iris.getVar("sepal-length") - variable used to build histogram and also to build the density line
prob(true) - parameter which specifies to a histogram to use probabilities (approximated by frequency ratios)
densityLine(..) - builds a density line

Example 3#

Goal: Build multiple kernel density estimates for various bandwidth values

Solution:

// make a plot
Plot p2 = plot();
// generate a stream of values for bandwidths and for each add a density line with given bandwidth to the plot
DoubleStream.iterate(0.05, xi -> xi + 0.02).limit(20).forEach(v -> p2.densityLine(iris.rvar("sepal-length"), v));
// draw resulting figure
WS.image(p2);

../_images/6ed0c90e6cc66e0dd95235502e9a652e04b1f45e04bdcca7aba397ad443162f2.png

GridLayer - a simple way to create multiple plots in the same figure#

A GridLayer splits the drawing surface in a two dimensional array of cells and allows plots to be added on those cels.

Example 1#

Goal: Display a scatter plot for two variables and marginal densities as histograms

Solution:

var grid = gridLayer(2, 2, widths(0.7, 0.3), heights(0.3, 0.7))
    .add(1, 1, hist(df.rvar("petal-length"), bins(30), horizontal(true)).xLab("").yLab(""))
    .add(0, 0, hist(df.rvar("petal-width"), bins(30)).xLab("").yLab(""))
    .add(1, 0, points(df.rvar("petal-width"), df.rvar("petal-length"), pch.circleFull(), fill(12)));
WS.image(grid, 500, 500);

../_images/8d81c2af0169a4ee8de43154b2e8d4db0c2c410648d1609b3fd11ca079bb7928.png

gridLayer(2, 2, ...) - creates a grid layer with 2 rows and 2 columns
widths(0.7, 0.3) - specifies widths for cells as 70% of space for first columns and 30% for the second one
heights(0.3, 0.7) - specifies rows heights in percentages (see above)
add(x, y, ...) - adds a plot to the cell x and columns y, zero index based

Even if the code required to create and populate a composed grid layer is simple, for some often used constructs the plotter tool offers a shortcut. For example we have scatters, which uses behind a grid layer and some high level code, but a shortcut is offered.

var df = iris.mapVars(VarRange.onlyTypes(VarType.DOUBLE));
WS.image(scatters(df, pch.circleFull(), fill(3)), 1200, 1000);

../_images/091ef32ead5baea8719e0d03536563ee73cffd38e243c28fb20998bbfb061a71.png

Other visual artists#

There are other components as well, besides the one presented. To illustrate some of them we will provide here some examples without pretending to exhaust the list.

Random random = new Random(42);
DArray<Double> m = DArrayManager.base().random(DType.DOUBLE, Shape.of(10, 10), random);
m.apply_(Math::abs);
// 0 is yellow (60), max value is 0 (red)
WS.image(matrix(m, palette(Palette.hue(60, 0, 0, m.amax()))));

../_images/3f56f209b189300dcc733cfbb07e6c534c64d32a70a457f6f9857247999ceb93.png

int steps = 256;
Grid2D mg = Grid2D.fromFunction((x, y) ->
        (1. + Math.pow(x + y + 1, 2) *
                (19 - 14*x + 3*x*x - 14 * y + 6 * x * x + 3 * y * y))
                * (30 + Math.pow(2*x-3*y, 2)*(18 - 32*x + 12 * x * x + 48*y - 36 * x * y + 27 * y* y)), -2, 2, -2, 1, steps);

double[] q = mg.quantiles(VarDouble.seq(0, 1, 0.04).elements());
WS.image(isoCurves(mg, q, palette(Palette.gray(/*0, 240,*/ mg.minValue(), mg.maxValue())), color(NColor.white), lwd(1f)).title("Goldstein iso curves"));

../_images/de1500395dc74cd8947b51c9131b7dccfbe82f63d4ad307a06d00485b811c2f8.png

var x = VarDouble.seq(0, 1, 0.004).name("x");

var green = Normal.of(0.24, 0.08);
var blue = Normal.of(0.37, 0.15);
var orange = Normal.of(0.59, 0.13);
var red = Normal.of(0.80, 0.06);

Color cgreen = Color.decode("0x2ca02c");
Color cblue = Color.decode("0x1f77b4");
Color corange = Color.decode("0xff7f0e");
Color cred = Color.decode("0xd62728");

var ygreen = VarDouble.from(x, green::pdf).name("y");
var yblue = VarDouble.from(x, blue::pdf).name("y");
var yorange = VarDouble.from(x, orange::pdf).name("y");
var yred = VarDouble.from(x, red::pdf).name("y");

float alpha = 0.5f;
float lwd = 5f;

Plot up = plot();

up.polyfill(x, yblue, fill(cblue), alpha(alpha));
up.polyfill(x, yorange, fill(corange), alpha(alpha));
up.polyfill(x, ygreen, fill(cgreen), alpha(alpha));
up.polyfill(x, yred, fill(cred), alpha(alpha));

up.polyline(false, x, yblue, color(cblue), lwd(lwd));
up.polyline(false, x, yorange, color(corange), lwd(lwd));
up.polyline(false, x, ygreen, color(cgreen), lwd(lwd));
up.polyline(false, x, yred, color(cred), lwd(lwd));

up.xLim(0, 1);
up.leftThick(false);
up.leftMarkers(false);
up.bottomThick(false);
up.bottomMarkers(false);

Plot down = plot();

down.leftThick(false);
down.leftMarkers(false);
down.bottomThick(false);
down.bottomMarkers(false);

down.text(0.5, 0.6, "rapaio", font("DejaVu Sans", Font.BOLD, 110),
        halign.center(), valign.center(), color(Color.decode("0x096b87")));
down.xLim(0, 1);
down.yLim(0, 1);

WS.image(gridLayer(2, 1, heights(0.7, 0.3)).add(up).add(down));

../_images/45aaffecd4ff30c3e60a3cc553d38cc9b264cc59f4cf8f050e6bc49447272a47.png

Data visualization

Contents

Data visualization#

Workspace and Printer#

Figure types#

Box plots#

Example 1#

Example 2#

XY Scatter plots#

Example 1#

Example 2#

Histogram#

Example 1#

Example 2#

Density line#

Example 1#

Example 2#

Example 3#

GridLayer - a simple way to create multiple plots in the same figure#

Example 1#

Other visual artists#