In [1]:

                
%load ../../rapaio-bootstrap.ipynb

Add /home/ati/work/rapaio/rapaio-core/target/rapaio-core-5.0.1.jar to classpath

Built-in Data sets¶

For learning purposes, some well-known data sets are bundled with the rapaio library. All built-in data sets are available through the rapaio.datasets.Datasets class, a utility class that provides standard data sets used in many statistics and machine learning textbooks. Some of them are described below.

Iris data set¶

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species.

Two of the three species were collected in the Gaspé Peninsula, all from the same pasture, picked on the same day, and measured at the same time by the same person with the same apparatus.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and of the petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
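To make the idea of a linear discriminant concrete, here is a minimal sketch of the two-class, two-feature case: the discriminant direction is the inverse of the within-class scatter matrix applied to the difference of the class means. This is plain Java, not rapaio code. The class name `FisherDiscriminant2D`, its methods, and the toy points are all assumptions made for illustration, and none of the values come from the Iris data.

```java
/** Illustrative two-class, two-feature Fisher discriminant; hypothetical helper, not rapaio code. */
final class FisherDiscriminant2D {

    /** Column means of an n x 2 sample matrix. */
    static double[] mean(double[][] x) {
        double[] m = new double[2];
        for (double[] row : x) {
            m[0] += row[0];
            m[1] += row[1];
        }
        m[0] /= x.length;
        m[1] /= x.length;
        return m;
    }

    /** Adds the scatter of x around its mean m into s = {sxx, sxy, syy}. */
    static void addScatter(double[][] x, double[] m, double[] s) {
        for (double[] row : x) {
            double dx = row[0] - m[0];
            double dy = row[1] - m[1];
            s[0] += dx * dx;
            s[1] += dx * dy;
            s[2] += dy * dy;
        }
    }

    /** Direction w = Sw^{-1} (mean(b) - mean(a)), where Sw is the within-class scatter. */
    static double[] direction(double[][] a, double[][] b) {
        double[] ma = mean(a);
        double[] mb = mean(b);
        double[] s = new double[3];
        addScatter(a, ma, s);
        addScatter(b, mb, s);
        double det = s[0] * s[2] - s[1] * s[1];
        if (det == 0.0) {
            throw new IllegalArgumentException("within-class scatter is singular");
        }
        double dx = mb[0] - ma[0];
        double dy = mb[1] - ma[1];
        // Closed-form inverse of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
        return new double[] {
            (s[2] * dx - s[1] * dy) / det,
            (s[0] * dy - s[1] * dx) / det
        };
    }

    public static void main(String[] args) {
        // Made-up points: two square clusters separated along the first feature.
        double[][] a = {{1, 1}, {2, 1}, {1, 2}, {2, 2}};
        double[][] b = {{4, 1}, {5, 1}, {4, 2}, {5, 2}};
        double[] w = direction(a, b);
        System.out.println(w[0] + ", " + w[1]);
    }
}
```

Projecting a sample onto `w` gives a single score, and a threshold on that score separates the two classes. The Iris problem itself has four features and three species, so this sketch only shows the core computation.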

In [2]:

                
var iris = Datasets.loadIrisDataset();
iris.printSummary();

Frame Summary
=============
* rowCount: 150
* complete: 150/150
* varCount: 5
* varNames: 

0. sepal-length : dbl | 3.  petal-width : dbl | 
1.  sepal-width : dbl | 4.        class : nom | 
2. petal-length : dbl | 

* summary: 
 sepal-length [dbl]      sepal-width [dbl]      petal-length [dbl]      petal-width [dbl]            class [nom] 
       Min. : 4.3000000       Min. : 2.0000000        Min. : 1.0000000       Min. : 0.1000000 versicolor :    50 
    1st Qu. : 5.1000000    1st Qu. : 2.8000000     1st Qu. : 1.6000000    1st Qu. : 0.3000000     setosa :    50 
     Median : 5.8000000     Median : 3.0000000      Median : 4.3500000     Median : 1.3000000  virginica :    50 
       Mean : 5.8433333       Mean : 3.0573333        Mean : 3.7580000       Mean : 1.1993333                    
    2nd Qu. : 6.4000000    2nd Qu. : 3.3000000     2nd Qu. : 5.1000000    2nd Qu. : 1.8000000                    
       Max. : 7.9000000       Max. : 4.4000000        Max. : 6.9000000       Max. : 2.5000000
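
For each numeric variable the summary lists the minimum, two quartiles, the median, the mean and the maximum. The value labelled "2nd Qu." lies between the median and the maximum, so it appears to play the role of the upper quartile. The sketch below shows one way such a summary could be computed in plain Java. The class `ColumnSummary` is hypothetical, and the linear-interpolation quantile rule is an assumption, so results may differ slightly from what rapaio prints.

```java
import java.util.Arrays;

/** Illustrative helper, not part of rapaio: summary statistics for one numeric column. */
final class ColumnSummary {

    /** Quantile by linear interpolation between order statistics (assumed rule). */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns {min, lower quartile, median, mean, upper quartile, max}; needs a non-empty input. */
    static double[] summarize(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(sorted).average().orElse(Double.NaN);
        return new double[] {
            sorted[0],
            quantile(sorted, 0.25),
            quantile(sorted, 0.50),
            mean,
            quantile(sorted, 0.75),
            sorted[sorted.length - 1]
        };
    }

    public static void main(String[] args) {
        // Made-up values, not taken from the iris data.
        double[] toy = {5.0, 1.0, 4.0, 2.0, 3.0};
        System.out.println(Arrays.toString(summarize(toy)));
    }
}
```

The nominal `class` column is summarized differently: the output lists each level with its count, here 50 rows per species.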

Pearson's Height Data¶

This simple data set comes from a famous experiment by Karl Pearson around 1903. It contains 1078 cases, each pairing the height of a father with the height of his son. The original data values were rounded to the nearest $0.1$ inch.

In [3]:

                
var df = Datasets.loadPearsonHeightDataset();
df.printSummary();

Frame Summary
=============
* rowCount: 1078
* complete: 1078/1078
* varCount: 2
* varNames: 

0. Father : dbl | 
1.    Son : dbl | 

* summary: 
   Father [dbl]            Son [dbl]      
   Min. : 59.0000000    Min. : 58.5000000 
1st Qu. : 65.8000000 1st Qu. : 66.9000000 
 Median : 67.8000000  Median : 68.6000000 
   Mean : 67.6868275    Mean : 68.6842301 
2nd Qu. : 69.6000000 2nd Qu. : 70.5000000 
   Max. : 75.4000000    Max. : 78.4000000

In [7]:

                
WS.image(points(df.rvar("Father"), df.rvar("Son"), pch.circleFull(), fill(1)));

Out[7]: [scatter plot of Son height against Father height]

Advertising data set¶

This is one of the first data sets used in the book An Introduction to Statistical Learning to illustrate various topics in linear regression. It contains observations that relate sales, treated as the assumed result, to advertising in several media: TV, radio and newspapers.
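
As a reminder of what the simplest of those regression topics looks like, here is a minimal sketch of one-predictor least squares: the slope is the ratio of the cross-deviation sum to the squared-deviation sum, and the intercept follows from the means. It is plain Java rather than rapaio's regression API. The class `SimpleLeastSquares` and the numbers are hypothetical and are not taken from the Advertising data.

```java
/** Illustrative one-predictor least squares fit; hypothetical helper, not rapaio code. */
final class SimpleLeastSquares {

    /** Returns {intercept, slope} minimizing the sum of squared residuals of y on x. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double xMean = 0.0;
        double yMean = 0.0;
        for (int i = 0; i < n; i++) {
            xMean += x[i];
            yMean += y[i];
        }
        xMean /= n;
        yMean /= n;

        double sxy = 0.0; // sum of (x - xMean) * (y - yMean)
        double sxx = 0.0; // sum of (x - xMean)^2
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xMean;
            sxy += dx * (y[i] - yMean);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("predictor has no variation");
        }
        double slope = sxy / sxx;
        double intercept = yMean - slope * xMean;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        // Made-up advertising amounts and sales that lie exactly on a line.
        double[] advertising = {1.0, 2.0, 3.0, 4.0};
        double[] sales = {3.0, 5.0, 7.0, 9.0};
        double[] coef = fit(advertising, sales);
        System.out.println("intercept=" + coef[0] + ", slope=" + coef[1]);
    }
}
```

With the real data one would pass a column such as TV as the predictor and Sales as the response; using all three media at once requires multiple regression, which this sketch does not cover.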

In [8]:

                
var df = Datasets.loadISLAdvertising()

In [9]:

                
df.printSummary()

Frame Summary
=============
* rowCount: 200
* complete: 200/200
* varCount: 4
* varNames: 

0.        TV : dbl | 
1.     Radio : dbl | 
2. Newspaper : dbl | 
3.     Sales : dbl | 

* summary: 
       TV [dbl]           Radio [dbl]       Newspaper [dbl]           Sales [dbl]      
   Min. :   0.7000000    Min. :  0.0000000     Min. :   0.3000000    Min. :  1.6000000 
1st Qu. :  74.3750000 1st Qu. :  9.9750000  1st Qu. :  12.7500000 1st Qu. : 10.3750000 
 Median : 149.7500000  Median : 22.9000000   Median :  25.7500000  Median : 12.9000000 
   Mean : 147.0425000    Mean : 23.2640000     Mean :  30.5540000    Mean : 14.0225000 
2nd Qu. : 218.8250000 2nd Qu. : 36.5250000  2nd Qu. :  45.1000000 2nd Qu. : 17.4000000 
   Max. : 296.4000000    Max. : 49.6000000     Max. : 114.0000000    Max. : 27.0000000