Data visualization¶
We illustrate the core ideas behind the graphic system and to provide enough examples to have a fast and productive start of effective usage.
The design of the graphical components of this library is influenced by many ideas from existing popular graphical systems. Among the main inspirations there are some which deserves appropriate consideration: R
standard graphical library, ggplot2
package and matplotlib
from Python stack.
Workspace and Printer¶
A printer can be instantiated by itself by a handy way of using it is by the means of the utility class called rapaio.sys.WS
(which comes from Work Space). This is an utility class which offers static shortcut methods yo allow one to simulate a working session. This class has an instance of a printer implementation (which can be changes by WS.setPrinter
method) and shortcuts for text output and drawing.
Note that it is not necessary to work with a workspace, the solely purpose is to offer shortcuts to various objects which one usually does not bother to manage during an interactive session, for example.
Drawing of graphical figures or text output is handled by classes which implements interface rapaio.printer.Printer
. There are some built-in implementations of this interface. The default one is StandardPrinter
, which allows printing text at console and drawing graphics in a Swing modal window. There are also some graphical tools which allows one to output the grated graph as a BufferedImage
which can be desplayed in jupyter notebooks, like this one. Another used implementation is IdeaPrinter
, which is similar with standard printer for text output, but for graphics drawing implements a TCP/IP protocol with serialization which allows plug-ins like rapaio-studio
to handle graphics drawing (the rapaio-studio is now deprecated in favor or working with jupyter notebooks).
We are interested here in methods for drawing, so the workspace contains methods for this purpose. All of those methods works with a rapaio.graphics.Figure
instance which contains all the indications regarding what kind of graphic should be built. Additionally, there are methods which allows one to obtain an image directly from a figure, or store an image directly on a disk file.
Here are the shortcut methods implemented in WS
utility:
void draw(Figure figure)
- draws a figure in a printing system; the size of drawing is either adaptive of has the default size of a graphical image contained in the printer implementation.void draw(Figure figure, int width, int height)
- draws a figure with dimensions specified by width and heightBufferedImage image(Figure figure)
- builds an image from a figure with dimensions specified as default width and default height from printer systemBufferedImage image(Figure figure, int width, int height)
- builds an image from a figure with specified dimensionsvoid saveImage(Figure figure, int width, int height, String fileName)
- builds an image with specified dimensions by width and height from a given figure, and save the image into a png file with the specified file namevoid saveImage(Figure figure, int width, int height, OutputStream os)
- builds an image with specified dimensions by width and height, and send the image in png format to the specified output stream (could be a file or could be sent over network)
%load ../../rapaio-bootstrap.ipynb
Add /home/ati/work/rapaio/rapaio-core/target/rapaio-core-5.0.1.jar to classpath
Now in order to build figures one can instantiate and compose figures directly in a pure Java way. As a general pattern, the libarry offers a shortcut graphical tool to make the life during an interactive analysis as easier as possible. The graphical factory shortcut tool is named rapaio.graphics.Plotter
and contains a lot of helper methods. Here is an example:
// instantiate a data set from build in collection
Frame iris = Datasets.loadIrisDataset();
// draw a some overlapped histograms for some variables in the data set
var bins = bins(30);
WS.image(hist(iris.rvar("sepal-length"), 1, 8, bins, fill(NColor.navy), prob(true), alpha(0.5f))
.hist(iris.rvar("petal-length"), 1, 8, bins, fill(NColor.darkorange), prob(true), alpha(0.5f))
.legend(6, 0.6, labels("sepal-length", "petal-length"), fill(NColor.navy, NColor.darkorange))
, 800, 600);
Figure types¶
Box plots¶
A box plot is a standard way of displaying information about the distribution of a continuous data variable based on five data summary. The five data summary consists of: minimum, first quartile, median (second quartile), third quartile and maximum number summaries.
Box plots might be constructed in different manners by different authors. The common characteristics for all types of box-plots is the box. Which means that in all cases the bottom margin of the box lies on first quartile, the top margin of the box lies on third quartile and the line inside the box lies on median values.
The extensions which comes from the box might differ and rapaio
system implements the version which it's usually named: Tuckey's box plot. What is specific to this box plot is that the whiskers lies at the datum still within 1.5 IQR (Interquartile range).
Any other points above or below whiskers are outliers. Outliers are of two different types:
- extreme outliers - outliers which are at a distance greater or equal than 3*IQR
- outliers - outliers which are at a distance greater or equal than 1.5*IQR
Example 1¶
Goal: Draw one box plot for each numerical variable from iris data set. We want each box plot to have a different color and we want some de-saturated colors.
Solution:
WS.image(boxplot(iris.mapVars("0~3"), fill(1, 2, 3, 4), alpha(0.5f)))
iris.mapVars("0~3")
- obtain a data set fromiris
data set, by keeping only the first 4 variables. We do that using the range notation (index of the start variable, concatenation symbol~
, index of the last variable inclusive). Variable indexes are 0 based.color(1, 2, 3, 4)
- we use colors from the current color palette, indexed with the specified integer values.alpha(0.5f)
- we de-saturate the drawing keeping only 0.3 of the actual color.
Example 2¶
Goal: In order to identify the overlap between values of sepal-length
variable from iris data set, we draw one box plot for each segment of the nominal class
variable, and add a title
Solution:
WS.image(boxplot(iris.rvar("sepal-length"), iris.rvar("class"), fill(NColor.tab_blue))
.title("sepal-length separation"));
iris.var("sepal-length")
- is the variable namedsepal-length
fromiris.var("class")
- specifies the segment discriminator, depending on the levels of this variable, the same number of levels will be created.title("..")
- adds a title to the box plot
XY Scatter plots¶
We can study the relation between two numerical variables by drawing one point for each instance.
Example 1¶
Goal: Study which is the relation between petal-length
and sepal-length
from iris
data set.
Solution:
WS.image(points(iris.rvar("petal-length"), iris.rvar("petal-width")));
iris.var("petal-length")
- variable used to define horizontal axisiris.var("petal-width")
- variable used to define vertical axis
Example 2¶
Goal: Study which is the relation between petal-length
and sepal-length
from iris
data set. Color each point with a different color corresponding with value from variable class
and add a legend for colors. The values will have additional noise.
Solution:
// make an altered copy of dataset with jitter noise added
Frame df = iris.copy().fapply(Jitter.on(new Random(123), Normal.of(0, 0.1), VarRange.of("0~3")));
// draw scatter plot
WS.image(points(df.rvar("petal-length"), df.rvar("petal-width"), fill(iris.rvar("class")), pch.circleFull())
.legend(1.5, 2.2, labels("setosa", "versicolor", "virginica")));
iris.getVar("petal-length")
- variable used to define horizontal axisiris.getVar("petal-width")
- variable used to define vertical axiscolor(iris.getVar("class"))
- nominal variable which provides indexes to select colors from current palettepch(2)
- select the type of figure used to draw points (in this case is a circle filled with solid color and a black border)legend(1.5, 2.2, labels("setosa", "versicolor", "virginica"))
- adds a legend at the specific position specified in the data range; labels are specified by parameter, colors are taken as default from current palette starting with 1
Histogram¶
A histogram is a graphical representation of the distribution of a continuous variable.
The histogram is only an estimation of the distribution. To construct a histogram you have to bin the range of values from the variable in a sequence of equal length intervals, and later on counting the values from each bin. Histograms can display counts, or can display proportions which are counts divided by the total number of values (use prob(true)
parameter to have this effect).
Histograms use bins as input parameter. The bin's width is computed. To compute the bin's width we need also the range of values. This is computed automatically from data or it can be specified when the histogram is built.
We can omit the number of bins, in which case its value is estimated also from data. For estimation is used the Freedman-Diaconis rule. See Freedman-Diaconis wikipedia page for more details.
Example 1¶
Goal: Build a histogram with default values to estimate the pdf of sepal-length
variable from iris
data set.
Solution:
WS.image(hist(iris.rvar("sepal-length"), fill(NColor.tab_green)));
Example 2¶
Goal: Build two overlapped histograms with default values to estimate the pdf of sepal-length
and petal-length
variables from iris
data set. We want to get bins in range (0-10) of width 0.25, colored with red, and blue, with a big transparency for visibility
Solution:
WS.image(plot(alpha(0.7f))
.hist(iris.rvar("sepal-length"), 0, 10, bins(40), fill(NColor.tab_blue))
.hist(iris.rvar("petal-length"), 0, 10, bins(40), fill(NColor.tab_red))
.legend(8, 20, labels("sepal-length", "petal-length"), color(NColor.tab_blue, NColor.tab_red))
.xLab("variable"));
plot(alpha(0.3f))
- builds an empty plot; this is used only to pass default values for alpha for all plot components, otherwise the plot construct would not be neededhist
- adds a histogram to the current plotiris.getVar("sepal-length")
- variable used to build histogram0, 10
- specifies the range used to compute binsbins(40)
- specifies the number of bins for histogramfill(1)
- specifies the color to draw the histogram, which is the color indexed with 1 in color palette (in this case is red)legend(7, 20, ...)
- draws a legend at the specified coordinates, values are in the units specified by datalabels(..)
- specifies labels for legendcolor(1, 2)
- specifies color for legendxLab
= specifies label text for horizontal axis
Density line¶
A density line is a graphical representation of the distribution of a continuous variable. The density line is similar with a histogram in purpose but it has a different strategy to build the estimate. The density line plot component implements a kernel density estimator which is basically a non-parametric smoothing method, named also Parzen-Rosenblatt window method.
There are two main parameters for a density line: bandwidth and base density kernel function. The default bandwidth is computed according with Silverman's rule of thumb (more details on Wikipedia kernel density page). The default kernel function is the Gaussian pdf.
Kernel function estimators can be constructed using various kernel functions like Gaussian, uniform, triangular, Epanechnikov, cosine, tricube, triweight, biweight. All of them are available in rapaio
library and also some custom can be built.
// this is our sample
VarDouble x = VarDouble.wrap(-2.1, -1.3, -0.4, 1.9, 5.1, 6.2);
// declare a bandwidth for smoothing
double bw = 1.25;
// build a plot with a density line artist
Plot p = densityLine(x, bw);
// for each point draw a normal distribution
x.stream().forEach(s -> p.funLine(xi -> Normal.of(s.getDouble(), bw).pdf(xi) / x.size(), color(1), fill(2)));
WS.image(p);
With red are depicted the kernel functions used to spread probability around sample points. With black is depicted the kernel density estimation which is the sum of all individual kernel functions.
Example 2¶
Goal: Estimate density of iris sepal-length
variable by histogram and density function
Solution:
WS.image(hist(iris.rvar("sepal-length"), bins(30), prob(true))
.densityLine(iris.rvar("sepal-length"), 0.1));
hist(..)
- builds a histogramiris.getVar("sepal-length")
- variable used to build histogram and also to build the density lineprob(true)
- parameter which specifies to a histogram to use probabilities (approximated by frequency ratios)densityLine(..)
- builds a density line
// make a plot
Plot p2 = plot();
// generate a stream of values for bandwidths and for each add a density line with given bandwidth to the plot
DoubleStream.iterate(0.05, xi -> xi + 0.02).limit(20).forEach(v -> p2.densityLine(iris.rvar("sepal-length"), v));
// draw resulting figure
WS.image(p2);
var grid = gridLayer(2, 2, widths(0.7, 0.3), heights(0.3, 0.7))
.add(1, 1, hist(df.rvar("petal-length"), bins(30), horizontal(true)).xLab("").yLab(""))
.add(0, 0, hist(df.rvar("petal-width"), bins(30)).xLab("").yLab(""))
.add(1, 0, points(df.rvar("petal-width"), df.rvar("petal-length"), pch.circleFull(), fill(12)));
WS.image(grid, 500, 500);
gridLayer(2, 2, ...)
- creates a grid layer with 2 rows and 2 columnswidths(0.7, 0.3)
- specifies widths for cells as70%
of space for first columns and30%
for the second oneheights(0.3, 0.7)
- specifies rows heights in percentages (see above)add(x, y, ...)
- adds a plot to the cell x and columns y, zero index based
Even if the code required to create and populate a composed grid layer is simple, for some often used constructs the plotter tool offers a shortcut. For example we have scatters, which uses behind a grid layer and some high level code, but a shortcut is offered.
var df = iris.mapVars(VarRange.onlyTypes(VarType.DOUBLE));
WS.image(scatters(df, pch.circleFull(), fill(3)), 1200, 1000);
Other visual artists¶
There are other components as well, besides the one presented. To illustrate some of them we will provide here some examples without pretending to exhaust the list.
DMatrix m = DMatrix.random(10, 10);
m.apply(Math::abs);
// 0 is yellow (60), max value is 0 (red)
WS.image(matrix(m, palette(Palette.hue(60, 0, 0, m.max(1).max()))));
int steps = 256;
Grid2D mg = Grid2D.fromFunction((x, y) ->
(1. + Math.pow(x + y + 1, 2) *
(19 - 14*x + 3*x*x - 14 * y + 6 * x * x + 3 * y * y))
* (30 + Math.pow(2*x-3*y, 2)*(18 - 32*x + 12 * x * x + 48*y - 36 * x * y + 27 * y* y)), -2, 2, -2, 1, steps);
double[] q = mg.quantiles(VarDouble.seq(0, 1, 0.04).elements());
WS.image(isoCurves(mg, q, palette(Palette.hue(0, 240, mg.minValue(), mg.maxValue())), color(NColor.white), lwd(1f)).title("Goldstein iso curves"));
var x = VarDouble.seq(0, 1, 0.004).name("x");
var green = Normal.of(0.24, 0.08);
var blue = Normal.of(0.37, 0.15);
var orange = Normal.of(0.59, 0.13);
var red = Normal.of(0.80, 0.06);
Color cgreen = Color.decode("0x2ca02c");
Color cblue = Color.decode("0x1f77b4");
Color corange = Color.decode("0xff7f0e");
Color cred = Color.decode("0xd62728");
var ygreen = VarDouble.from(x, green::pdf).name("y");
var yblue = VarDouble.from(x, blue::pdf).name("y");
var yorange = VarDouble.from(x, orange::pdf).name("y");
var yred = VarDouble.from(x, red::pdf).name("y");
float alpha = 0.5f;
float lwd = 5f;
Plot up = plot();
up.polyfill(x, yblue, fill(cblue), alpha(alpha));
up.polyfill(x, yorange, fill(corange), alpha(alpha));
up.polyfill(x, ygreen, fill(cgreen), alpha(alpha));
up.polyfill(x, yred, fill(cred), alpha(alpha));
up.polyline(false, x, yblue, color(cblue), lwd(lwd));
up.polyline(false, x, yorange, color(corange), lwd(lwd));
up.polyline(false, x, ygreen, color(cgreen), lwd(lwd));
up.polyline(false, x, yred, color(cred), lwd(lwd));
up.xLim(0, 1);
up.leftThick(false);
up.leftMarkers(false);
up.bottomThick(false);
up.bottomMarkers(false);
Plot down = plot();
down.leftThick(false);
down.leftMarkers(false);
down.bottomThick(false);
down.bottomMarkers(false);
down.text(0.5, 0.6, "rapaio", font("DejaVu Sans", Font.BOLD, 110),
halign.center(), valign.center(), color(Color.decode("0x096b87")));
down.xLim(0, 1);
down.yLim(0, 1);
WS.image(gridLayer(2, 1, heights(0.7, 0.3)).add(up).add(down));