Wednesday, August 25, 2010

Drawing boxplots and violin plots using R, Part 1

A box plot is a graphical display of the distribution of data, showing the quartiles and any possible outliers. Assuming the box plot is drawn vertically, the lower edge of the rectangular box denotes the first quartile (Q1), while the upper edge denotes the third quartile (Q3). The median is denoted by a line inside the box. Some versions also indicate the position of the arithmetic mean with a cross or a dot. Whiskers extend to the most extreme data points within 1.5 fs of the lower and upper quartiles, where fs is the fourth spread, the difference between Q3 and Q1. Data points beyond these limits are drawn individually and are labelled outliers. The box plot, however, cannot display the shape of the distribution, especially for multimodal data.
A box plot can be drawn for each column of a matrix.
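
As a rough sketch of the whisker and outlier rule described above, the fences can be computed by hand in base R. The sample data and the variable names (h, fs, lower_fence, upper_fence) are made up for illustration. Note that boxplot() uses the hinges returned by fivenum() as its quartile estimates, which may differ slightly from quantile().

```r
# Illustrative data: one normal sample plus a planted extreme value
set.seed(1)
x <- c(rnorm(50, mean = 5, sd = 2), 25)

# fivenum() returns min, lower hinge, median, upper hinge, max
h  <- fivenum(x)
fs <- h[4] - h[2]                 # fourth spread (upper hinge minus lower hinge)
lower_fence <- h[2] - 1.5 * fs
upper_fence <- h[4] + 1.5 * fs

# Points outside the fences are the ones drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() applies the same rule; $stats[c(1, 5)] are the whisker ends
stats <- boxplot.stats(x, coef = 1.5)
```

Here stats$out should list the same points as outliers, and the whisker ends are the most extreme data values that still lie inside the fences, not the fences themselves.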

The violin plot addresses this shortcoming of the box plot by adding a KDE (kernel density estimate) to outline the distribution of the data. Out of the box, R draws only box plots. There are at least two packages that offer violin plots. One is the violinplot function from the UsingR package of Verzani. Another is the vioplot package, which offers the vioplot function.
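
To see what the kernel density estimate adds, here is a minimal base R sketch using density(). The bimodal sample is hypothetical, and counting local maxima of the estimated curve is only a crude way to make the two modes visible in numbers.

```r
# Hypothetical bimodal sample: two well-separated normal components
set.seed(42)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 8, sd = 1))

# Kernel density estimate (Gaussian kernel, default bandwidth)
d <- density(x)

# Local maxima of the estimated density: slope changes from + to -
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
peaks    <- d$x[peak_idx]

# The five-number summary alone gives no hint of the two modes
fivenum(x)
```

A violin plot is essentially this density curve mirrored around a vertical axis, so the two bumps remain visible where a box plot would show only a single wide box.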

Here is an illustration of the differences between boxplot, violinplot and vioplot:

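library(UsingR)   # provides violinplot()
library(vioplot)  # provides vioplot()

png("box-viol.png", width = 6*72, height = 6*72)
# Mixture of three normal samples; c() concatenates them without recycling
X <- c(rnorm(50, 5, 2), rnorm(25, 1), rnorm(10, 3))
violinplot(X, X, X)                            # three identical violins
vioplot(X, at = 2, col = "green", add = TRUE)  # overlay on the second violin
boxplot(X, at = 1, col = "red", add = TRUE)    # overlay on the first violin
dev.off()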
Three violin plots are drawn by violinplot; the boxplot and the vioplot are superimposed on the first and second plots, respectively.
[Figure: Box and Violin Plots example]
Notice that, in the desire to look more like a violin, the vioplot will sometimes cut off at Q3 + 1.5 fs or at Q1 - 1.5 fs, which may hide any outlier points!






            orientation?   positioning?  outliers?  matrix?  dataframe?
boxplot     both           yes           yes        yes      yes
violinplot  vertical only  no*           no         no       yes
vioplot     both           yes           no         no       no


Regarding orientation, the box plot and the vioplot can also be drawn horizontally, and each can be positioned at a specific location on the x or y axis using the graphics parameter at="value". Regarding outliers, as the figure above shows, the vioplot may stop at the fence values, creating a flat top or flat bottom and hiding the extreme values, specifically the minimum and maximum of the data. The violin plot does show the minimum and maximum of the data, but it is hard to tell where the fences (Q1 - 1.5 fs and Q3 + 1.5 fs) lie. Neither violinplot nor vioplot can handle matrix input; you have to pass each column of the matrix to these functions separately.
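
One way to pass each column separately, sketched under the assumption that vioplot() takes its data vectors as leading positional arguments and accepts a names argument, is to split the matrix into a list of columns and use do.call(). The matrix M and the list cols are made-up names, and the behaviour may differ between versions of the vioplot package.

```r
# Illustrative matrix with three named columns
set.seed(2)
M <- cbind(a = rnorm(50, 5, 2), b = rnorm(50, 1), c = rnorm(50, 3))

# boxplot() accepts the matrix directly, one box per column
b <- boxplot(M, plot = FALSE)

# For functions that want one vector per group, split the columns first
cols <- lapply(seq_len(ncol(M)), function(j) M[, j])

# Draw one violin per column if the vioplot package is installed
if (requireNamespace("vioplot", quietly = TRUE)) {
  do.call(vioplot::vioplot, c(cols, list(names = colnames(M))))
}
```

The same list of columns could be handed to violinplot() with do.call() in the same manner.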

The boxplot may have an optional notch to emphasize the location of the median.
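
A short sketch of the notch option in base R follows. The data are made up; the notch limits are the values boxplot() reports in its conf component, which R documents as the median plus or minus 1.58 times the hinge spread divided by sqrt(n).

```r
set.seed(7)
x <- rnorm(100, mean = 5, sd = 2)   # illustrative data

# Draw a notched box plot
boxplot(x, notch = TRUE, col = "red")

# The same notch limits, read from boxplot() and recomputed by hand
b <- boxplot(x, plot = FALSE)
h <- fivenum(x)
notch_limits <- h[3] + c(-1.58, 1.58) * (h[4] - h[2]) / sqrt(length(x))
```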

In my opinion, a violin plot with a box plot superimposed is currently the best way to show the distribution and any multiple modalities of the data.
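
For readers without either package, the combination can be approximated in base R by mirroring a density() curve with polygon() and adding a narrow box plot on top. This is only a hand-rolled sketch with made-up data, not how violinplot or vioplot are implemented; the scaling factor 0.4 and the colours are arbitrary choices.

```r
set.seed(3)
x <- c(rnorm(100, 2, 1), rnorm(100, 7, 1))   # hypothetical bimodal data

d <- density(x)
w <- 0.4 * d$y / max(d$y)    # half-width of the violin, at most 0.4

# Empty frame, then the mirrored density outline centred at position 1
plot(NA, xlim = c(0.5, 1.5), ylim = range(d$x),
     xaxt = "n", xlab = "", ylab = "x")
polygon(c(1 - w, rev(1 + w)), c(d$x, rev(d$x)),
        col = "lightblue", border = "grey40")

# Narrow box plot superimposed at the same position
boxplot(x, at = 1, add = TRUE, boxwex = 0.15, col = "red", axes = FALSE)
```

Because the outline spans the full range of the density estimate, it is not truncated at the fences, while the box plot on top still marks the quartiles and any outliers.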

We are still deciding on the input format for our online solver at extreme-solvers.blogspot.com, which we shall show in Part 2 of this article.

We hope that the developers of these plotting functions will implement the features available in the others, such as data frame support in vioplot.

1 comment:

  1. Help solving system of linear equations using Cramer's rule?
    Well, I don't usually ask for HW help on this site, but I am stuck and my dad won't be back until after this class ends.

    Use Cramer's rule to solve the linear equations.
    5x-4y+4z=18
    -x+3y-2z=0
    4x-2y+7z=3

    I plugged it in to an online solver I found and got the values as x=6, y=0, z=-3. I went to do the problem, but I keep getting x=390/25, y=328/25, and z=-147/25. My answers are pretty off. How do I do this problem?
