Top 50 ggplot2 Visualizations – The Master List (With Full R Code)

What type of visualization should you use for what sort of problem? This tutorial helps you choose the right type of chart for your specific objectives and shows how to implement it in R using ggplot2.

This is part 3 of a three-part tutorial on ggplot2, an aesthetically pleasing (and very popular) graphics framework in R. This tutorial is primarily geared towards those who have some basic knowledge of the R programming language and want to make complex and nice-looking charts with ggplot2.

Top 50 ggplot2 Visualizations – The Master List

An effective chart is one that:

  1. Conveys the right information without distorting facts.
  2. Is simple but elegant. It should not force you to think much in order to get it.
  3. Uses aesthetics that support the information rather than overshadow it.
  4. Is not overloaded with information.

The list below sorts the visualizations by their primary purpose. Broadly, there are 8 types of objectives for which you may construct plots. So, before you actually make the plot, try to figure out what findings and relationships you would like to convey or examine through the visualization. Chances are it will fall under one (or sometimes more) of these 8 categories.

  1. Correlation
  2. Deviation
  3. Ranking
  4. Distribution
  5. Composition
  6. Change
  7. Groups
  8. Spatial

1. Correlation

The following plots help to examine how well correlated two variables are.

Scatterplot

The most frequently used plot for data analysis is undoubtedly the scatterplot. Whenever you want to understand the nature of the relationship between two variables, the scatterplot is invariably the first choice.

It can be drawn using `geom_point()`. Additionally, `geom_smooth()`, which draws a smoothing line by default (loess for fewer than 1,000 observations), can be tweaked to draw the line of best fit by setting `method='lm'`.
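The two layers described above can be combined as in the following minimal sketch. The `midwest` dataset (bundled with ggplot2) and the choice of `area` and `poptotal` as variables are assumptions made for illustration, not necessarily what the original figure used.

```r
library(ggplot2)

# Scatterplot of county area against total population.
# Dataset and variables are illustrative assumptions.
p <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  # method = "lm" replaces the default smoother with a straight line of best fit;
  # se = FALSE hides the confidence band around the line.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Scatterplot", subtitle = "Area vs population",
       x = "Area", y = "Population")

print(p)
```

Dropping the `method` argument falls back to the default smoother.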

ggplot2 Scatterplot

Scatterplot With Encircling

When presenting results, I sometimes encircle a special group of points or a region in the chart to draw attention to those peculiar cases. This can be done conveniently using `geom_encircle()` from the `ggalt` package.

Within `geom_encircle()`, set `data` to a new dataframe that contains only the points (rows) of interest. You can also `expand` the curve so that it passes just outside the points. The `color` and `size` (thickness) of the curve can be modified as well. See the example below.

ggplot2 Scatterplot With Encircling

Jitter Plot

Let’s look at a new dataset to draw the scatterplot. This time, I will use the `mpg` dataset to plot city mileage (`cty`) vs highway mileage (`hwy`).

ggplot2 Scatterplot With Hidden Data points

What we have here is a scatterplot of city and highway mileage in the `mpg` dataset. We have seen a similar scatterplot before; this one looks neat and gives a clear idea of how well city mileage (`cty`) and highway mileage (`hwy`) are correlated.

But this innocent-looking plot is hiding something. Can you find out what?

The original data has 234 data points, but the chart seems to display fewer. What has happened? Many overlapping points appear as a single dot. The fact that both `cty` and `hwy` are integers in the source dataset made it all the more convenient to hide this detail. So be extra careful the next time you make a scatterplot with integers.

So how do we handle this? There are a few options. We can make a jitter plot with `geom_jitter()`. As the name suggests, the overlapping points are randomly jittered around their original positions, with the amount of horizontal jitter controlled by the `width` argument.
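As a minimal sketch of the idea, the snippet below first counts how many distinct (`cty`, `hwy`) pairs exist and then draws the jittered version. The `width` and `size` values are arbitrary choices for illustration.

```r
library(ggplot2)

# cty and hwy are integers, so many rows share the same coordinates.
n_rows <- nrow(mpg)
n_distinct_points <- nrow(unique(mpg[, c("cty", "hwy")]))

# geom_jitter() adds a small random offset to every point;
# width sets the amount of horizontal jitter.
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.5, size = 1) +
  labs(title = "Jittered points", subtitle = "mpg: city vs highway mileage",
       x = "cty", y = "hwy")

print(p)
```

Comparing `n_rows` with `n_distinct_points` shows how many observations a plain scatterplot would have drawn on top of one another.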

ggplot2 Jitter Plot

More points are revealed now. The larger the `width`, the farther the points are jittered from their original positions.

Counts Chart

The second option for overcoming the problem of overlapping data points is what is called a counts chart. Wherever more points overlap, the circle gets bigger.
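A counts chart can be sketched with ggplot2's `geom_count()`, which sizes each circle by the number of observations at that location. The colour and the hidden legend are illustrative choices.

```r
library(ggplot2)

# geom_count() counts the observations at each (cty, hwy) location
# and maps that count to the size of the circle.
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count(col = "tomato3", show.legend = FALSE) +
  labs(title = "Counts plot", subtitle = "mpg: city vs highway mileage",
       x = "cty", y = "hwy")

print(p)
```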

ggplot2 Counts Plot

Bubble plot

While a scatterplot lets you compare the relationship between 2 continuous variables, a bubble chart serves well if you want to understand the relationship within the underlying groups based on:

  1. A categorical variable (by changing the color) and
  2. Another continuous variable (by changing the size of points).

In simpler words, bubble charts are more suitable if you have four-dimensional data: two numeric variables (X and Y), one categorical variable (color) and one more numeric variable (size).

The bubble chart clearly distinguishes the range of `displ` between the manufacturers and how the slope of the lines of best fit varies, providing a better visual comparison between the groups.
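A bubble chart along these lines might look like the sketch below. The four manufacturers kept here and the mapping of `hwy` to point size are assumptions chosen for illustration.

```r
library(ggplot2)

# Keep a handful of manufacturers so the groups stay readable
# (this selection is an arbitrary, illustrative choice).
makers <- c("audi", "ford", "honda", "hyundai")
mpg_select <- mpg[mpg$manufacturer %in% makers, ]

p <- ggplot(mpg_select, aes(x = displ, y = cty)) +
  # colour encodes the categorical variable, size the extra numeric one
  geom_jitter(aes(col = manufacturer, size = hwy)) +
  # one line of best fit per manufacturer
  geom_smooth(aes(col = manufacturer), method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Bubble chart", subtitle = "mpg: displacement vs city mileage",
       x = "displ", y = "cty")

print(p)
```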

ggplot2 Bubble Plot

Animated Bubble chart

An animated bubble chart can be implemented using the `gganimate` package. It is the same as the bubble chart, except that it shows how the values change over a fifth dimension (typically time).

The key is to set `aes(frame)` to the column on which you want to animate. The rest of the plot construction is the same. Once the plot is constructed, you can animate it using `gganimate()` with a chosen `interval`. Note that this describes the original gganimate interface; releases from 1.0.0 onward replaced the `frame` aesthetic and `gganimate()` with `transition_*()` functions and `animate()`.

ggplot2 Animated Bubble Plot

Marginal Histogram / Boxplot

If you want to show the relationship as well as the distribution in the same chart, use the marginal histogram. It has a histogram of the X and Y variables at the margins of the scatterplot.

This can be implemented using the `ggMarginal()` function from the `ggExtra` package. Apart from a `histogram`, you could choose to draw a marginal `boxplot` or `density` plot by setting the respective `type` option.

ggplot2 Marginal Histogram

Correlogram

A correlogram lets you examine the correlation of multiple continuous variables present in the same dataframe. This is conveniently implemented using the `ggcorrplot` package.

ggplot2 Correlogram

2. Deviation

Compare variation in values between a small number of items (or categories) with respect to a fixed reference.

Diverging bars

A diverging bar chart is a bar chart that can handle both negative and positive values. It can be implemented with a small tweak to `geom_bar()`. But the usage of `geom_bar()` can be quite confusing, because it can be used to make a bar chart as well as a histogram. Let me explain.

By default, `geom_bar()` has `stat` set to `count`. That means that when you provide just a continuous X variable (and no Y variable), it tries to make a histogram out of the data.

To make it draw a bar chart instead of a histogram, you need to do two things.

  1. Set `stat = "identity"`.
  2. Provide both `x` and `y` inside `aes()`, where `x` is either character or factor and `y` is numeric.

To get diverging bars instead of plain bars, make sure your categorical variable has 2 categories that change at a certain threshold of the continuous variable. In the example below, `mpg` from the `mtcars` dataset is normalised by computing the z-score. Vehicles with a z-score above zero are marked green and those below are marked red.
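The preparation and plot described above can be sketched as follows. The helper column names (`car_name`, `mpg_z`, `mpg_type`) and the exact colours are assumptions introduced for this illustration.

```r
library(ggplot2)

# Data preparation: z-score of mpg, a two-level flag, and a sorted factor.
mtcars_z <- mtcars
mtcars_z$car_name <- rownames(mtcars_z)
mtcars_z$mpg_z <- round((mtcars_z$mpg - mean(mtcars_z$mpg)) / sd(mtcars_z$mpg), 2)
mtcars_z$mpg_type <- ifelse(mtcars_z$mpg_z < 0, "Below average", "Above average")
mtcars_z <- mtcars_z[order(mtcars_z$mpg_z), ]
# Fixing the factor levels keeps the sorted order in the plot.
mtcars_z$car_name <- factor(mtcars_z$car_name, levels = mtcars_z$car_name)

p <- ggplot(mtcars_z, aes(x = car_name, y = mpg_z)) +
  # stat = "identity" plots the supplied y values instead of counting rows
  geom_bar(aes(fill = mpg_type), stat = "identity", width = 0.5) +
  scale_fill_manual(name = "Mileage",
                    values = c("Above average" = "#00ba38",
                               "Below average" = "#f8766d")) +
  labs(title = "Diverging bars", subtitle = "Normalised mileage from mtcars",
       x = "Car", y = "mpg (z-score)") +
  coord_flip()

print(p)
```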

ggplot2 Diverging Bars

Diverging Lollipop Chart

A lollipop chart conveys the same information as a bar chart or diverging bar chart, except that it looks more modern. Instead of `geom_bar()`, I use `geom_point()` and `geom_segment()` to get the lollipops right. Let’s draw a lollipop chart using the same data I prepared in the previous example of diverging bars.
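A minimal sketch of the lollipop construction follows. To keep the snippet self-contained it repeats a z-score preparation of `mtcars`; the helper column names and point sizes are assumptions for illustration.

```r
library(ggplot2)

# Z-score of mpg, sorted, with factor levels fixed to preserve the order.
mtcars_z <- mtcars
mtcars_z$car_name <- rownames(mtcars_z)
mtcars_z$mpg_z <- round((mtcars_z$mpg - mean(mtcars_z$mpg)) / sd(mtcars_z$mpg), 2)
mtcars_z <- mtcars_z[order(mtcars_z$mpg_z), ]
mtcars_z$car_name <- factor(mtcars_z$car_name, levels = mtcars_z$car_name)

p <- ggplot(mtcars_z, aes(x = car_name, y = mpg_z)) +
  # the stick: a segment from the zero baseline to the value
  geom_segment(aes(x = car_name, xend = car_name, y = 0, yend = mpg_z)) +
  # the head of the lollipop
  geom_point(size = 6) +
  # the value printed inside the head
  geom_text(aes(label = mpg_z), color = "white", size = 2) +
  labs(title = "Diverging lollipop chart", x = "Car", y = "mpg (z-score)") +
  coord_flip()

print(p)
```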

ggplot2 Lollipop Plot

Diverging Dot Plot

A dot plot conveys similar information. The principles are the same as in diverging bars, except that only points are used. The example below uses the same data prepared in the diverging bars example.

ggplot2 Dotplot

Area Chart

Area charts are typically used to visualize how a particular metric (such as % returns from a stock) performed compared to a certain baseline. Other types of % returns or % change data are also commonly used. `geom_area()` implements this.
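As a rough sketch, the snippet below computes a month-over-month percentage change and draws it with `geom_area()`. Using the personal savings rate (`psavert`) from the `economics` dataset and the first 100 rows are assumptions made for illustration; the column `returns_perc` is a name invented here.

```r
library(ggplot2)

# Percentage change of psavert relative to the previous month.
econ <- economics[1:100, ]
prev <- econ$psavert[-nrow(econ)]
econ$returns_perc <- c(0, diff(econ$psavert) / prev) * 100

p <- ggplot(econ, aes(x = date, y = returns_perc)) +
  # the area is drawn relative to the zero baseline
  geom_area() +
  labs(title = "Area chart", subtitle = "Month-over-month % change in psavert",
       x = "Date", y = "% change")

print(p)
```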

ggplot2 Area Chart

3. Ranking

Used to compare the position or performance of multiple items with respect to each other. Actual values matter somewhat less than the ranking.

Ordered Bar Chart

An ordered bar chart is a bar chart that is ordered by the Y axis variable. Just sorting the dataframe by the variable of interest isn’t enough to order the bar chart. For the bar chart to retain the order of the rows, the X axis variable (i.e. the categories) has to be converted into a factor whose levels follow that order.

Let’s plot the mean city mileage for each manufacturer from the `mpg` dataset. First, aggregate the data and sort it before you draw the plot. Finally, convert the X variable to a factor.

Let’s see how that is done.

The X variable is now a `factor`, so let’s plot.
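The aggregate, sort and convert-to-factor steps, followed by the plot, can be sketched like this. The data frame name `cty_mpg` and the styling values are illustrative assumptions.

```r
library(ggplot2)

# 1. Aggregate: mean city mileage per manufacturer.
cty_mpg <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)
# 2. Sort by the value to be ranked.
cty_mpg <- cty_mpg[order(cty_mpg$cty), ]
# 3. Convert to a factor whose levels follow the sorted order.
cty_mpg$manufacturer <- factor(cty_mpg$manufacturer, levels = cty_mpg$manufacturer)

p <- ggplot(cty_mpg, aes(x = manufacturer, y = cty)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato3") +
  labs(title = "Ordered bar chart", subtitle = "Mean city mileage by manufacturer",
       x = "Manufacturer", y = "Mean cty") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

print(p)
```

Without step 3, ggplot2 would order the bars alphabetically regardless of the row order.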

ggplot2 Ordered Barchart

Lollipop Chart

Lollipop charts convey the same information as bar charts. By reducing the thick bars to thin lines, they reduce clutter and lay more emphasis on the value. They look nice and modern.

ggplot2 Lollipop Barchart

Dot Plot

Dot plots are very similar to lollipops, but without the line and flipped to a horizontal position. They emphasize the rank ordering of items with respect to actual values and how far apart the entities are from each other.

ggplot2 Dot Plot

Slope Chart

Slope charts are an excellent way of comparing the positional placements between 2 points in time. At the moment, there is no built-in function to construct this. The following code serves as a pointer to how you may approach this.
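One possible approach, sketched under stated assumptions: draw one `geom_segment()` per item between two fixed x positions and label both ends. The data frame below is entirely hypothetical (made-up items and values), as are the column names.

```r
library(ggplot2)

# Hypothetical values for four items at two points in time.
slope_df <- data.frame(
  item  = c("A", "B", "C", "D"),
  y2010 = c(10, 25, 18, 30),
  y2020 = c(22, 20, 27, 12)
)
# Flag whether each item went up or down, used for colouring.
slope_df$direction <- ifelse(slope_df$y2020 >= slope_df$y2010, "up", "down")

p <- ggplot(slope_df) +
  # one sloped line per item, from x = 1 (first year) to x = 2 (second year)
  geom_segment(aes(x = 1, xend = 2, y = y2010, yend = y2020, col = direction)) +
  geom_vline(xintercept = c(1, 2), linetype = "dashed") +
  geom_text(aes(x = 1, y = y2010, label = paste(item, y2010)), hjust = 1.1) +
  geom_text(aes(x = 2, y = y2020, label = paste(item, y2020)), hjust = -0.1) +
  scale_color_manual(values = c(up = "#00ba38", down = "#f8766d")) +
  scale_x_continuous(breaks = c(1, 2), labels = c("2010", "2020"),
                     limits = c(0.5, 2.5)) +
  labs(title = "Slope chart", x = NULL, y = "Value")

print(p)
```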

ggplot2 Slope Chart

Dumbbell Plot

Dumbbell charts are a great tool if you wish to:

  1. Visualize relative positions (like growth and decline) between two points in time.
  2. Compare distance between two categories.

To get the correct ordering of the dumbbells, the Y variable should be a factor, and the levels of the factor should be in the same order in which they should appear in the plot.

ggplot2 Dumbbell Chart

4. Distribution

When you have lots and lots of data points and want to study where and how the data points are distributed.

Histogram

By default, if only one variable is supplied, `geom_bar()` tries to calculate the count. For it to behave like a bar chart, the `stat = "identity"` option has to be set and `x` and `y` values must be provided.

Histogram on a continuous variable

A histogram on a continuous variable can be drawn using either `geom_bar()` or `geom_histogram()`. When using `geom_histogram()`, you can control the number of bars using the `bins` option. Alternatively, you can set the range covered by each bin using `binwidth`. The value of `binwidth` is on the same scale as the continuous variable on which the histogram is built. Since `geom_histogram()` lets you control both the number of `bins` and the `binwidth`, it is the preferred option for creating histograms on continuous variables.
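The two ways of controlling the bins can be compared side by side in a short sketch. The variable (`displ`), the fill by `class` and the specific values of `binwidth` and `bins` are illustrative assumptions.

```r
library(ggplot2)

g <- ggplot(mpg, aes(x = displ))

# Fix the width of each bin, in the units of displ.
p_binwidth <- g +
  geom_histogram(aes(fill = class), binwidth = 0.1, col = "black") +
  labs(title = "Histogram with a fixed bin width")

# Fix the number of bins and let ggplot2 derive their width.
p_bins <- g +
  geom_histogram(aes(fill = class), bins = 5, col = "black") +
  labs(title = "Histogram with 5 bins")

print(p_binwidth)
print(p_bins)
```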

ggplot2 Histogram on Numeric Variable

ggplot2 Histogram with 5 Bins - Spectral

Histogram on a categorical variable

A histogram on a categorical variable results in a frequency chart showing a bar for each category. By adjusting `width`, you can adjust the thickness of the bars.
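For a categorical variable, `geom_bar()` with its default count statistic is enough, as in this small sketch. The choice of `manufacturer` on the X axis and `class` as fill is an assumption for illustration.

```r
library(ggplot2)

# Only x is supplied, so geom_bar() counts the rows in each category.
p <- ggplot(mpg, aes(x = manufacturer)) +
  geom_bar(aes(fill = class), width = 0.5) +
  labs(title = "Histogram on a categorical variable",
       subtitle = "Number of models per manufacturer, split by class") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

print(p)
```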

ggplot2 Histogram on Categorical Variable

Density plot

ggplot2 Density Plot

Box Plot

The box plot is an excellent tool for studying the distribution. It can also show the distributions within multiple groups, along with the median, range and outliers, if any.

The dark line inside the box represents the median. The top of the box is the 75th percentile and the bottom of the box is the 25th percentile. The lines (aka whiskers) extend to the most extreme data points that lie within 1.5*IQR of the box, where IQR, or interquartile range, is the distance between the 25th and 75th percentiles. Points beyond the whiskers are marked as dots and are normally considered extreme points.

Setting `varwidth=T` makes the width of each box grow with the number of observations it contains (in ggplot2, in proportion to the square root of that number).
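A minimal box plot with variable widths might look like this. Grouping city mileage by vehicle `class` and the fill colour are illustrative assumptions.

```r
library(ggplot2)

# One box per vehicle class; varwidth = TRUE makes boxes for
# larger groups wider than boxes for smaller groups.
p <- ggplot(mpg, aes(x = class, y = cty)) +
  geom_boxplot(varwidth = TRUE, fill = "plum") +
  labs(title = "Box plot", subtitle = "City mileage grouped by class of vehicle",
       x = "Class of vehicle", y = "City mileage")

print(p)
```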

ggplot2 BoxPlot

ggplot2 Grouped BoxPlot

Dot + Box Plot

On top of the summary statistics provided by a box plot, a dot plot laid over it shows the individual observations behind each group. The dots are staggered such that each dot represents one observation. So, in the chart below, the number of dots for a given manufacturer will match the number of rows for that manufacturer in the source data.
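Layering `geom_dotplot()` over `geom_boxplot()` is one way to sketch this combination. The dot size and fill below are arbitrary illustrative values, and ggplot2 may print a message about the default bin width.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = manufacturer, y = cty)) +
  geom_boxplot() +
  # binaxis = "y" bins along the mileage axis;
  # stackdir = "center" staggers the dots symmetrically around each box
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5, fill = "red") +
  labs(title = "Box plot + dot plot",
       subtitle = "City mileage by manufacturer: each dot is one row in the data",
       x = "Manufacturer", y = "City mileage") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

print(p)
```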

ggplot2 Box and DotPlot

Tufte Boxplot

The Tufte box plot, provided by the `ggthemes` package, is inspired by the works of Edward Tufte. It is just a box plot made minimal and visually appealing.

ggplot2 Tufte Boxplot

Violin Plot

A violin plot is similar to a box plot but shows the density within groups. It does not provide as much summary information as a box plot. It can be drawn using `geom_violin()`.
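A short sketch of a violin plot follows, assuming city mileage grouped by vehicle `class` as the example data.

```r
library(ggplot2)

# Each violin shows the estimated density of cty within one class.
p <- ggplot(mpg, aes(x = class, y = cty)) +
  geom_violin() +
  labs(title = "Violin plot", subtitle = "City mileage vs class of vehicle",
       x = "Class of vehicle", y = "City mileage")

print(p)
```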

ggplot2 Violin Plot

Population Pyramid

Population pyramids offer a unique way of visualizing how much of the population, or what percentage of it, falls under a certain category. The pyramid below is an excellent example of how many users are retained at each stage of an email marketing campaign funnel.

Population Pyramid With Ggplot

5. Composition

Waffle Chart

Waffle charts are a nice way of showing the categorical composition of the total population. Though there is no direct function, one can be constructed in ggplot2 using the `geom_tile()` function. The template below should help you create your own waffle.
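One way to build such a waffle is sketched below, on a 10 x 10 grid. Picking 100 evenly spaced observations from the sorted variable is a simplification chosen here so that the tile count is always exactly 100; it is not necessarily how the original template allocated tiles. The variable (`mpg$class`) and all object names are illustrative assumptions.

```r
library(ggplot2)

# Categorical variable to summarise, sorted so equal values are adjacent.
var <- sort(as.character(mpg$class))

# A 10 x 10 grid of tiles, filled column by column.
n_side <- 10
tiles <- expand.grid(y = 1:n_side, x = 1:n_side)

# Take 100 evenly spaced observations so each category receives a share
# of tiles roughly proportional to its frequency.
idx <- round(seq(1, length(var), length.out = n_side^2))
tiles$category <- var[idx]

p <- ggplot(tiles, aes(x = x, y = y, fill = category)) +
  geom_tile(color = "black") +
  coord_equal() +
  labs(title = "Waffle chart", subtitle = "Class of vehicles",
       caption = "Each tile is about 1% of the rows") +
  theme_void()

print(p)
```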

Waffle Chart With Ggplot

Pie Chart

The pie chart, a classic way of showing composition, is equivalent to the waffle chart in terms of the information conveyed, but it is slightly tricky to implement in ggplot2 using `coord_polar()`.
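The trick is to draw a single stacked bar and then bend the Y axis into a circle. Here is a minimal sketch, assuming the `class` column of `mpg` as the categorical variable.

```r
library(ggplot2)

# A single stacked bar (x is a constant) with one segment per class ...
p <- ggplot(mpg, aes(x = "", fill = factor(class))) +
  geom_bar(width = 1) +
  # ... wrapped around the origin: theta = "y" maps bar height to angle.
  coord_polar(theta = "y", start = 0) +
  labs(title = "Pie chart of class", fill = "class", x = NULL, y = NULL) +
  theme(axis.line = element_blank())

print(p)
```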

Pie Chart With Ggplot

Treemap

A treemap is a nice way of displaying hierarchical data using nested rectangles. The `treemapify` package provides the necessary functions to convert the data into the desired format (`treemapify`) as well as draw the actual plot (`ggplotify`).

To create a treemap, the data must be converted to the desired format using `treemapify()`. The important requirement is that your data must have one variable each for the `area` of the tiles, the `fill` color, the tile’s `label` and finally the parent `group`.

Once the data formatting is done, just call `ggplotify()` on the treemapified data. Note that this describes an early treemapify interface; newer releases draw treemaps with `geom_treemap()` layers instead of `ggplotify()`.

Treemap With Ggplot

Bar Chart

By default, `geom_bar()` has `stat` set to `count`. That means that when you provide just a continuous X variable (and no Y variable), it tries to make a histogram out of the data.

To make it draw a bar chart instead of a histogram, you need to do two things.

  1. Set `stat = "identity"`.
  2. Provide both `x` and `y` inside `aes()`, where `x` is either character or factor and `y` is numeric.

A bar chart can be drawn from a categorical column variable or from a separate frequency table. By adjusting `width`, you can adjust the thickness of the bars. If your data source is a frequency table, that is, if you don’t want ggplot to compute the counts, you need to set `stat = "identity"` inside `geom_bar()`.

Bar Chart With Ggplot

It can be computed directly from a column variable as well. In this case, only X is provided and `stat = "identity"` is not set.
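Both routes can be sketched together. The frequency table built from `mpg$manufacturer` and the object names below are illustrative assumptions.

```r
library(ggplot2)

# Route 1: a pre-computed frequency table, so counts are supplied as y.
freq_table <- as.data.frame(table(mpg$manufacturer))
names(freq_table) <- c("manufacturer", "n")

p_table <- ggplot(freq_table, aes(x = manufacturer, y = n)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato3") +
  labs(title = "Bar chart from a frequency table") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

# Route 2: the raw column; geom_bar() computes the counts itself.
p_raw <- ggplot(mpg, aes(x = manufacturer)) +
  geom_bar(aes(fill = class), width = 0.5) +
  labs(title = "Bar chart from a column variable") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

print(p_table)
print(p_raw)
```

In current ggplot2, `geom_col()` is a shorthand for `geom_bar(stat = "identity")`.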

Bar Chart With Multiple Categories in Ggplot

6. Change

Time Series Plot From a Time Series Object (ts)

The `ggfortify` package allows `autoplot()` to plot directly from a time series object (`ts`).

Time series in ggplot with ts object

Time Series Plot From a Data Frame

Using `geom_line()`, a time series (or line chart) can be drawn from a `data.frame` as well. The X axis breaks are generated by default. In the example below, the breaks are formed once every 10 years.
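A minimal sketch with default axis breaks, assuming the `economics` dataset and its `psavert` column as the series to plot:

```r
library(ggplot2)

# date is a Date column, so ggplot2 picks the X axis breaks automatically.
p <- ggplot(economics, aes(x = date)) +
  geom_line(aes(y = psavert)) +
  labs(title = "Time series chart", subtitle = "Personal savings rate",
       x = "Date", y = "psavert")

print(p)
```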

Default X Axis Labels

Time series in ggplot from Dataframe

Time Series Plot For a Monthly Time Series

If you want to set your own time intervals (breaks) on the X axis, you need to set the breaks and labels using `scale_x_date()`.
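Custom monthly breaks can be sketched as follows. Taking the first 24 rows of `economics` and the `%b %Y` label format are assumptions made for illustration; month abbreviations depend on the locale.

```r
library(ggplot2)

# Two years of monthly observations.
economics_m <- economics[1:24, ]

# One break per observation, labelled as abbreviated month and year.
brks <- economics_m$date
lbls <- format(brks, "%b %Y")

p <- ggplot(economics_m, aes(x = date)) +
  geom_line(aes(y = psavert)) +
  scale_x_date(breaks = brks, labels = lbls) +
  labs(title = "Monthly time series", x = "Month", y = "psavert") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

print(p)
```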