Scatter plots are a fundamental tool for examining the relationship between two continuous variables, and R makes it straightforward to create and customize them. In this blog post, we’ll walk through how to use R to tailor the main elements of a scatter plot, followed by an enhanced example that adds data point labels and jitter to reduce overlap.
The Basics of Scatter Plots in R
The mtcars dataset, a staple in R's datasets package, provides a rich playground. With variables like miles per gallon (mpg) and horsepower (hp), it's ideal for practicing scatter plot creation. Here's how R allows you to alter the visual components of your plot:
- Color (col): By changing the col parameter, you can differentiate data points, adding a layer of information. For example, you can use different colors for different categories within your data.
- Shape (pch): R provides a plethora of shapes, accessible via the pch parameter. Whether it's a circle, triangle, or square, each shape can symbolize a unique data category.
- Axes Titles (xlab and ylab): Clarity is key in data visualization. Custom axis titles allow you to clearly state what each axis represents.
- Plot Title (main): The title of your scatter plot provides context and summarizes the narrative of the visualization.
Incorporating these elements not only enhances the aesthetic appeal of your scatter plot but also amplifies its communicative value.
Try this code now, and please look closely at the hashtag comments within to get a better idea of how it works. You can use these concepts in your own R code.
# Load the mtcars dataset
data(mtcars)
# Create a scatter plot using the plot() function
# The first argument (hp) goes on the x-axis, the second (mpg) on the y-axis
# col defines the color, pch defines the shape of points
plot(mtcars$hp, mtcars$mpg,
     col = "blue",                # Change point colors to blue
     pch = 19,                    # Change point shapes to solid circle
     main = "MPG vs Horsepower",  # Add main title
     xlab = "Horsepower",         # Custom x-axis title
     ylab = "Miles/(US) gallon")  # Custom y-axis title
# If you want to have multiple colors and shapes, you can specify a vector for them
# Let's say we want to color cars with more than 4 cylinders in red and others in green
colors <- ifelse(mtcars$cyl > 4, "red", "green")
# And let's have different shapes based on the number of gears
shapes <- ifelse(mtcars$gear == 4, 17, ifelse(mtcars$gear == 5, 18, 8))
# Create the scatter plot with these vectors
plot(mtcars$hp, mtcars$mpg,
     col = colors,                # Set colors based on number of cylinders
     pch = shapes,                # Set shapes based on number of gears
     main = "MPG vs Horsepower",  # Add main title
     xlab = "Horsepower",         # Custom x-axis title
     ylab = "Miles/(US) gallon")  # Custom y-axis title
Here’s what you get:
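Because the colors and shapes now encode cylinders and gears, readers need a key to decode them. Here is a minimal sketch using base R's legend() function; the legend positions ("topright", "right") and the label text are illustrative choices, not part of the original code.

```r
# Recreate the color and shape vectors used in the plot above
colors <- ifelse(mtcars$cyl > 4, "red", "green")
shapes <- ifelse(mtcars$gear == 4, 17, ifelse(mtcars$gear == 5, 18, 8))

plot(mtcars$hp, mtcars$mpg,
     col = colors, pch = shapes,
     main = "MPG vs Horsepower",
     xlab = "Horsepower", ylab = "Miles/(US) gallon")

# One legend per encoding; the positions are illustrative assumptions
legend("topright", title = "Cylinders",
       legend = c("4", "6 or 8"), col = c("green", "red"), pch = 19)
legend("right", title = "Gears",
       legend = c("3", "4", "5"), pch = c(8, 17, 18))
```

If the legends cover data points on your screen, try another position keyword such as "bottomleft".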
Advanced Customization: Labeling and Jitter
While basic customization offers clarity, advanced techniques like point labeling and jitter help with crowded plots. Labels identify data points directly, which is particularly useful when individual points (here, individual car models) matter to the reader.
However, direct labeling can lead to text overlap, especially in densely populated areas of the plot. This is where 'jitter' comes in. Jitter adds a small, random displacement to each point, which separates points that share the same coordinates and makes their labels more legible. Keep in mind that jittered positions are no longer the exact data values. We can also use 'cex' to make the label text smaller and reduce overlap.
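To see what jitter() does to the data itself, here is a small sketch that checks the horsepower column before and after jittering. The seed and the amount value of 2 are illustrative assumptions, chosen only to make the effect easy to inspect.

```r
set.seed(42)  # make the random displacement reproducible

hp <- mtcars$hp

# Several cars share the same horsepower, so their x-coordinates coincide
n_tied <- sum(duplicated(hp))

# With an explicit amount, jitter() adds uniform noise in [-amount, amount];
# amount = 2 is an illustrative choice
hp_jittered <- jitter(hp, amount = 2)

# Each value moves by at most `amount` in either direction
max_shift <- max(abs(hp_jittered - hp))

n_tied                       # how many values were exact repeats
anyDuplicated(hp_jittered)   # 0 means no ties remain after jittering
max_shift                    # never larger than the chosen amount
```

The same idea applies to mpg; a smaller amount is sensible there because mpg values span a much narrower range than hp.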
Let’s rewrite our initial code to include these advanced customization options:
# Load the mtcars dataset
data(mtcars)
# Base scatter plot with jitter
plot(jitter(mtcars$hp), jitter(mtcars$mpg),
     col = "blue",
     pch = 19,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles/(US) gallon")
# Customizing colors and shapes by criteria
colors <- ifelse(mtcars$cyl > 4, "red", "green")
shapes <- ifelse(mtcars$gear == 4, 17, ifelse(mtcars$gear == 5, 18, 8))
# Adding jitter to the points
hp_jitter <- jitter(mtcars$hp)
mpg_jitter <- jitter(mtcars$mpg)
# Create the scatter plot with jittered data
plot(hp_jitter, mpg_jitter,
     col = colors,
     pch = shapes,
     main = "MPG vs Horsepower",
     xlab = "Horsepower",
     ylab = "Miles/(US) gallon")
# Label each data point with its row name
text(hp_jitter, mpg_jitter, labels = rownames(mtcars), cex = 0.6, pos = 4)
And now you get:
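In this code, jitter() is called with its default settings, so each point moves only slightly. The text() call labels every point with its row name: cex = 0.6 shrinks the label text, and pos = 4 places each label to the right of its point. If points and labels still overlap, you can pass a larger value to the amount argument of jitter() to apply more jitter, and set xlim and ylim in plot() so that the extra displacement doesn't push any points outside the plot area. Reducing cex further, for example to 0.5, makes the labels smaller and can decrease the overlap as well.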
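As a rough sketch of those options combined, the variant below sets amount explicitly, widens xlim and ylim to match, and reduces cex to 0.5. The specific amounts (5 for hp, 0.5 for mpg) and the extra 60 units of room on the right for labels are assumptions for illustration, not tuned recommendations.

```r
set.seed(1)  # reproducible jitter

# Illustrative jitter amounts, roughly scaled to each variable's range
hp_amount  <- 5
mpg_amount <- 0.5
hp_jitter  <- jitter(mtcars$hp,  amount = hp_amount)
mpg_jitter <- jitter(mtcars$mpg, amount = mpg_amount)

# Widen the axis limits by the jitter amount so no point can leave the plot;
# the extra 60 units on the right leave room for labels drawn with pos = 4
x_limits <- range(mtcars$hp)  + c(-hp_amount, hp_amount + 60)
y_limits <- range(mtcars$mpg) + c(-mpg_amount, mpg_amount)

plot(hp_jitter, mpg_jitter,
     col = "blue", pch = 19,
     xlim = x_limits, ylim = y_limits,
     main = "MPG vs Horsepower",
     xlab = "Horsepower", ylab = "Miles/(US) gallon")

# Smaller labels (cex = 0.5), placed to the right of each point
text(hp_jitter, mpg_jitter, labels = rownames(mtcars), cex = 0.5, pos = 4)
```

Larger amounts separate points more but move them further from their true values, so it is worth keeping the amount small relative to the axis range.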
If labels still overlap, consider the following additional methods:
- Manually adjust the position of specific labels that overlap.
- Use the ggrepel package, which provides a geom_text_repel function specifically designed to avoid text overlap in plots.
- Increase plotting area size to give more room for labels.
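The first option, manual adjustment, can be sketched in base R by passing a vector to the pos argument of text(), one value per label (1 = below, 2 = left, 3 = above, 4 = right). The rule for high-horsepower cars and the single override for "Merc 280C" are hypothetical choices for illustration; in practice you would pick them by looking at your own plot.

```r
# Default: label to the right of each point (pos = 4)
label_pos <- rep(4, nrow(mtcars))

# Hypothetical rule: put labels to the left (pos = 2) for high-horsepower cars,
# so their names do not run past the right edge of the plot
label_pos[mtcars$hp > 200] <- 2

# Override a single car by name; this particular choice is illustrative
label_pos[rownames(mtcars) == "Merc 280C"] <- 1  # 1 = below the point

plot(mtcars$hp, mtcars$mpg,
     pch = 19, col = "blue",
     main = "MPG vs Horsepower",
     xlab = "Horsepower", ylab = "Miles/(US) gallon")

# pos is vectorized, so each label gets its own placement
text(mtcars$hp, mtcars$mpg, labels = rownames(mtcars),
     cex = 0.6, pos = label_pos)
```

This works well when only a handful of labels collide; for many collisions, an automatic approach is usually less tedious.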
Here is an example using ggrepel with ggplot2, which can handle overlapping text more elegantly:
# Check and install 'ggplot2' if not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
# Check and install 'ggrepel' if not already installed
if (!requireNamespace("ggrepel", quietly = TRUE)) {
  install.packages("ggrepel")
}
# Load the libraries
library(ggplot2)
library(ggrepel)
# Use ggplot2 and ggrepel to create the scatter plot with labeled points
ggplot(mtcars, aes(x = hp, y = mpg, label = rownames(mtcars))) +
  geom_point(aes(color = factor(cyl), shape = factor(gear))) +
  geom_text_repel(size = 3,
                  box.padding = unit(0.35, "lines"),    # Padding around the text
                  point.padding = unit(0.5, "lines")) + # Padding around the points
  scale_color_manual(values = c("4" = "green", "6" = "blue", "8" = "red")) +
  labs(title = "MPG vs Horsepower",  # labs() uses title, not main
       x = "Horsepower",
       y = "Miles/(US) gallon",
       color = "Cylinders",
       shape = "Gears") +
  theme_minimal()
And here is what you get this time:
In the code above:
- The requireNamespace() function is used to check for each package.
- If the package is not installed, install.packages() is called to install it.
- Once the necessary packages are available, the library() function loads them for use.
- The ggplot function then creates the scatter plot, using geom_point to plot the points and geom_text_repel from the ggrepel package to add labels with automatic repulsion to minimize overlap.
- The scale_color_manual() function customizes the color of the points based on the number of cylinders, and labs() is used to add a main title and axis labels.
- Padding is added around the text and points to further minimize overlap.