R-Sessions 13: Overlapping Data Points


In many cases, multiple points on a scatterplot have exactly the same coordinates. When these are simply plotted, the visual representation of the data may be unsatisfactory. Today’s R-Session is on how to present this type of data in neatly arranged plots in R-Project.

Introduction

When several points share exactly the same coordinates, a plain scatterplot draws them on top of each other, so they look like a single point. For instance, consider the following data and plot:

x <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 3, 4)
y <- c(5, 5, 5, 4, 4, 4, 4, 4, 3, 4, 2, 2, 2)

data.frame(x, y)

> data.frame(x, y)
   x y
1  1 5
2  1 5
3  1 5
4  2 4
5  2 4
6  2 4
7  2 4
8  2 4
9  3 3
10 4 4
11 5 2
12 3 2
13 4 2

plot(x, y, main="Multiple points on coordinate")

No Jitter

On this first plot, only seven points are visible, although there are thirteen data points available for plotting. The reason is that some of the points overlap. There are actually three points on the coordinate [x=1, y=5] and five points on the coordinate [x=2, y=4]. The standard plot function of R simply draws these on top of each other and does nothing to show that they are there.
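The overlap can also be confirmed numerically before plotting. The following is a minimal sketch in base R; the names d, n_total, n_unique and counts are chosen here for illustration only.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 3, 4)
y <- c(5, 5, 5, 4, 4, 4, 4, 4, 3, 4, 2, 2, 2)
d <- data.frame(x, y)

n_total  <- nrow(d)          # number of observations
n_unique <- nrow(unique(d))  # number of distinct coordinates

# Count how often each distinct coordinate occurs
counts <- aggregate(n ~ x + y, data = cbind(d, n = 1), FUN = sum)

# Coordinates that hold more than one point
counts[counts$n > 1, ]
```

If n_unique is smaller than n_total, a plain call to plot() will hide some observations.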

Fortunately, several methods are available for making all data points in a plot visible. The following are described and shown in turn:

  • Jitter {base}
  • Sunflower {graphics}
  • Cluster.overplot {plotrix}
  • Count.overplot {plotrix}
  • Sizeplot {plotrix}

Jitter {base}

The jitter function adds a small amount of random noise to a numeric vector. Some functions, such as stripchart, have a jitter option built in. The general plot function does not, so we have to change the data the plot is based on. Therefore, the jitter function is applied to the x argument of the plot function. This results, as shown, in some variation along the x-axis, thereby revealing all the available data points. This is done by:

plot(jitter(x), y, main="Using Jitter on x-axis")

Jitter X

As we can see, all three data points at x=1 are now clearly visible. However, the points at x=2 still clutter together. So, when too many points overlap, jittering on just one axis may not be enough. Fortunately, we can jitter more than one axis:

plot(jitter(x), jitter(y), main="Using Jitter on x- and y-axis")

Jitter XY

Now the overlapping points vary slightly along both the x-axis and the y-axis, and all of them are clearly visible. Nevertheless, if many more data points were plotted, clutter would occur again. Even then, although not every individual point can be distinguished, jitter still gives a better impression of the density of points in a region.
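Because jitter draws random offsets, the plot looks slightly different on every run. The sketch below shows one way to make the result reproducible and to control the size of the offsets through the amount argument of jitter. The seed value and the amount of 0.1 are arbitrary choices made for this illustration, as are the names jx and jy.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 3, 4)
y <- c(5, 5, 5, 4, 4, 4, 4, 4, 3, 4, 2, 2, 2)

set.seed(42)  # arbitrary seed, so the jittered positions can be reproduced

# With a positive amount, each value is shifted by a uniform random
# offset between -amount and +amount
jx <- jitter(x, amount = 0.1)
jy <- jitter(y, amount = 0.1)

plot(jx, jy, xlab = "x", ylab = "y",
     main = "Using Jitter with a fixed amount")
```

A small amount keeps the points close to their true coordinates; a larger one separates them more clearly but distorts the data further.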

Sunflower {graphics}

sunflowerplot(x, y, main="Using Sunflowers")

Sunflower plots are often seen in the graphics produced by statistical packages. When more than one point is to be drawn on a single coordinate, a number of ‘petals’ of the sunflower are drawn instead of the separate points one would expect. The advantage of this is the increased accuracy, since every point stays on its true coordinate, but the drawback is that it works only when relatively few points need to be drawn on one coordinate. Another drawback of the method is that the sunflowers take up quite a lot of space, so overlap may occur if several points are to be plotted very close to each other.

Sunflower
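The counts behind the sunflowers can be inspected with xyTable() from the grDevices package, which tabulates how many points share each coordinate. The sketch below is only an illustration; the name tab is chosen here, and digits = 6 is passed explicitly to set how many significant digits are compared.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 3, 4)
y <- c(5, 5, 5, 4, 4, 4, 4, 4, 3, 4, 2, 2, 2)

# Distinct coordinates and the number of points on each of them
tab <- xyTable(x, y, digits = 6)
data.frame(x = tab$x, y = tab$y, number = tab$number)

# Coordinates with number > 1 are drawn as sunflowers
sunflowerplot(x, y, main = "Using Sunflowers")
```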

Cluster.overplot {plotrix}

The next three examples use functions from the plotrix package. The first of these functions is cluster.overplot(). This function clusters up to nine overlapping points of data. Therefore, it is best suited to relatively small data sets. Because the clusters are tight, the displacement is not easily mistaken for randomness that is ‘real’ in the data.

The function itself does not plot the data; it returns a list of ‘new’ coordinates that can be plotted subsequently. In the code below, the plotrix package is loaded first. Next, the list of new coordinates is shown. Finally, the cluster.overplot() call is nested in the plot() function, which leads to the following plot:

library(plotrix)
cluster.overplot(x, y)
plot(cluster.overplot(x, y), main="Using cluster.overplot")

Cluster Overplot

Count.overplot {plotrix}

All of the methods described above still rely on drawing each overlapping data point separately. This can still result in very dense plots that are hard to interpret. The next few methods try to solve this problem.

The function count.overplot tries to give a more accurate representation of overlapping data by placing a numerical count of the overlapping data points at their shared coordinate, instead of plotting every point at slightly altered coordinates. The resulting plot is very accurate, but it may still be difficult to interpret: it does not give a feel for the local density of the data and may thereby misrepresent it. It should only be used together with a clear explanation of what the numbers mean.

count.overplot(x, y, main="Using count.overplot", xlab="X-Axis", ylab="Y-Axis")

Count Overplot

Sizeplot {plotrix}

The next method adjusts the size of the plotted points: the more points share a coordinate, the larger the symbol drawn there. Since the relation between the number of overlapping points and the increase in size can be adjusted, this method is suitable for large sets of data.

sizeplot(x, y, main="Using sizeplot", xlab="X-Axis", ylab="Y-Axis")

Sizeplot
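The principle of scaling point size by the number of overlapping points can be sketched in base R. This is not how sizeplot() is implemented; it only illustrates the idea. The exponent pow and the names tab and cex are assumptions made for this example.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 4, 5, 3, 4)
y <- c(5, 5, 5, 4, 4, 4, 4, 4, 3, 4, 2, 2, 2)

# Distinct coordinates and the number of points on each of them
tab <- xyTable(x, y, digits = 6)

# Hypothetical scaling rule (an assumption for illustration):
# symbol size grows with the square root of the count
pow <- 0.5
cex <- tab$number^pow

plot(tab$x, tab$y, cex = cex, pch = 1,
     xlab = "X-Axis", ylab = "Y-Axis",
     main = "Point size scaled by count (base R sketch)")
```

A smaller exponent makes the symbols grow more slowly, which helps when some coordinates hold very many points.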

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –
