Introduction to Plotting with R
Welcome to our comprehensive lecture on creating plots from
dataframes in R! Data visualization is a crucial skill in data analysis,
allowing us to communicate complex information clearly and efficiently.
In this session, we’ll explore various plotting techniques using
built-in R dataframes and the powerful ggplot2 library.
Throughout this lecture, we’ll cover different types of plots, their
purposes, and when to use them. Each section will include detailed
explanations and practical exercises to reinforce your learning.
1. Basic Plotting with Base R
Introduction
We’ll start our journey with R’s built-in plotting functions. These
functions provide a quick and straightforward way to visualize data.
While they may not be as flexible as more advanced libraries,
understanding base R plotting is fundamental and can be useful for quick
data exploration.
Purpose and Utility
Scatter plots, which we’ll create in this section, are excellent for
visualizing relationships between two continuous variables. They’re
particularly useful when you want to:
- Identify correlations between variables
- Detect outliers or unusual patterns in your data
- Understand the distribution of data points across two dimensions
Scatter plots are widely used in various fields, including:
- Economics: plotting GDP against life expectancy
- Biology: comparing gene expression levels
- Environmental science: examining the relationship between
  temperature and pollution levels
# Load the mtcars dataset
data(mtcars)
# Create a simple scatter plot
plot(mtcars$wt, mtcars$mpg,
     main = "Car Weight vs. Miles Per Gallon",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "blue")
In this example, we:
1. Load the mtcars dataset, which is built into R.
2. Use the plot() function to create a scatter plot.
3. Set the main title with main, the x-axis label with xlab, and the
   y-axis label with ylab.
4. Use pch = 19 for solid circle points and col = "blue" for blue color.
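One way to make the "identify correlations" use case concrete is to quantify the relationship and draw a fitted line over the scatter plot. The sketch below is an illustrative addition, not part of the original example: it assumes a straight-line fit is a reasonable summary of these data, and it uses base R's cor(), lm(), and abline().

```r
# Weight vs. fuel economy, the same variables as in the scatter plot
data(mtcars)

# Pearson correlation between the two variables
r <- cor(mtcars$wt, mtcars$mpg)

# Simple linear model: mpg as a function of weight (an assumed model form)
fit <- lm(mpg ~ wt, data = mtcars)

plot(mtcars$wt, mtcars$mpg,
     main = "Car Weight vs. Miles Per Gallon",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "blue")

# Overlay the fitted line on the existing plot
abline(fit, col = "red", lwd = 2)

# A negative value means heavier cars tend to get fewer miles per gallon
print(round(r, 2))
```

Points that sit far from the red line are candidates for the outliers or unusual patterns mentioned earlier.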
Exercises
1. Create a scatter plot using the mtcars dataset to visualize the
   relationship between horsepower (hp) and quarter-mile time (qsec).
   Use red triangles for the points.
2. Using the iris dataset (another built-in R dataset), create a
   scatter plot of sepal length vs. sepal width. Color the points based
   on the species. Hint: You’ll need to use the col parameter with a
   vector of colors corresponding to the species.
2. Introduction to ggplot2
Introduction
Now we’ll dive into ggplot2, a powerful and flexible plotting library
in R. ggplot2 is based on the Grammar of Graphics, a coherent system
for describing and building graphs. This system allows for highly
customizable and layered graphics.
Purpose and Utility
The ggplot2 library offers several advantages over base R plotting:
- Consistent and intuitive syntax
- Layered approach to building complex graphics
- Beautiful default aesthetics
- Extensive customization options
ggplot2 is particularly useful when:
- Creating publication-quality graphics
- Building complex, multi-layered plots
- Needing to quickly change aesthetic properties of plots
- Working with large datasets
# Install ggplot2 first if it is not already installed
# install.packages("ggplot2")
library(ggplot2)
# Create a scatter plot using ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Car Weight vs. Miles Per Gallon",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()
Here’s what we did:
1. Load the ggplot2 library.
2. Use ggplot() to initialize the plot, specifying the data and
   aesthetics.
3. Add points with geom_point().
4. Set labels with labs().
5. Apply a minimal theme with theme_minimal().
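The layered approach is easiest to see by storing a plot in a variable and adding to it one step at a time. This is a minimal sketch: the variable names p_base, p_points, and p_trend are illustrative, and geom_smooth() is an extra layer that the example above does not use.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; no layers yet
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer of points
p_points <- p_base + geom_point()

# Step 3: add a second layer, here a linear trend line
p_trend <- p_points + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Each `+` returns a new plot object; the earlier objects are unchanged
print(length(p_base$layers))
print(length(p_points$layers))
print(length(p_trend$layers))

# Printing a plot object is what renders it
print(p_trend)
```

Because every intermediate object is a complete plot specification, you can keep a base plot around and try different layers or themes on top of it without retyping the shared parts.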
Exercises
1. Using ggplot2 and the economics dataset (comes with ggplot2),
   create a line plot of unemployment over time. Use the date column
   for the x-axis and unemploy for the y-axis. Add appropriate labels
   and a title.
2. With the mpg dataset (also included in ggplot2), create a scatter
   plot of engine displacement (displ) vs. highway miles per gallon
   (hwy). Color the points by the class of the vehicle. Add a title and
   appropriate axis labels.
3. Enhancing Plots with Color and Shape
Introduction
In this section, we’ll explore how to enhance our plots by
incorporating additional variables through color and shape. This
technique allows us to display multidimensional data in a
two-dimensional plot, increasing the information density of our
visualizations.
Purpose and Utility
Adding color and shape to plots serves several important purposes:
- Grouping: It helps viewers quickly identify different categories or
  groups within the data.
- Pattern recognition: It makes it easier to spot trends or patterns
  specific to certain groups.
- Information density: It allows for the representation of additional
  variables without adding more dimensions to the plot.
This technique is particularly useful when:
- Comparing multiple categories within a dataset
- Identifying how different factors interact with the main variables
  being plotted
- Presenting complex, multivariable data in a single, comprehensible
  visualization
ggplot(mtcars, aes(x = wt, y = mpg,
                   color = factor(cyl), shape = factor(am))) +
  geom_point(size = 3) +
  labs(title = "Car Weight vs. MPG",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders",
       shape = "Transmission") +
  scale_color_brewer(palette = "Set1") +
  theme_light()
In this example:
1. We use color = factor(cyl) to color points by number of cylinders.
2. shape = factor(am) changes point shapes based on transmission type.
3. scale_color_brewer() applies a color palette from ColorBrewer.
4. theme_light() gives a light background theme.
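The factor() calls matter because cyl and am are stored as plain numbers in mtcars. As a rough illustration (the variable names are made up for this sketch), mapping the raw numeric column to color gives a continuous gradient, while wrapping it in factor() gives one distinct color per category.

```r
library(ggplot2)

# cyl is numeric in mtcars, even though it only takes three values
print(class(mtcars$cyl))
print(levels(factor(mtcars$cyl)))

# Numeric mapping: ggplot2 uses a continuous color gradient
p_continuous <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3)

# Factor mapping: ggplot2 uses a discrete scale with one color per level
p_discrete <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

print(p_continuous)
print(p_discrete)
```

Shape can only be mapped to a discrete variable, which is why am needs factor() as well.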
Exercises
1. Using the diamonds dataset (included in ggplot2), create a scatter
   plot of price vs. carat. Use color to represent the cut quality and
   shape to represent the clarity. Add appropriate labels and a title.
   Note that clarity has eight levels, while ggplot2’s default shape
   scale provides only six shapes, so you’ll need scale_shape_manual()
   to supply enough shapes.
2. With the iris dataset, create a scatter plot of petal length vs.
   petal width. Use color to represent the species. Instead of using
   different shapes, vary the size of the points based on the sepal
   width. Add a legend for both color and size.
4. Creating Bar Plots
Introduction
Bar plots are one of the most common and effective ways to visualize
categorical data. They allow for easy comparison of quantities across
different categories or groups.
Purpose and Utility
Bar plots are particularly useful for:
- Comparing quantities or frequencies across different categories
- Displaying the distribution of a categorical variable
- Showing changes in a quantity over time (when categories are time
  periods)
- Presenting survey results or other categorical data
You might use bar plots when:
- Analyzing market share across different products or companies
- Comparing sales figures across different regions
- Visualizing the distribution of responses in a survey
- Presenting budget allocations across different departments
# Prepare data: count the cars for each cylinder value
cylinders <- as.data.frame(table(mtcars$cyl))
colnames(cylinders) <- c("Cylinders", "Count")
ggplot(cylinders, aes(x = Cylinders, y = Count, fill = Cylinders)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Cars by Cylinder Count",
       x = "Number of Cylinders",
       y = "Count") +
  theme_classic() +
  scale_fill_brewer(palette = "Pastel1")
Here’s what we did:
1. Create a summary dataframe of cylinder counts.
2. Use geom_bar() with stat = "identity" to create bars of specified
   heights.
3. Fill bars with different colors based on cylinder count.
4. Apply a classic theme and a pastel color palette.
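The summary step is not strictly required. As an alternative sketch, geom_bar() with its default stat counts rows itself when given the raw data, and geom_col() is a shorthand for bars whose heights are already in the data. Both are standard ggplot2 functions, though the example above uses neither in this form; the variable names p_counted and p_precomputed are illustrative.

```r
library(ggplot2)

# Option 1: let geom_bar() count the rows per cylinder value
# (its default is stat = "count")
p_counted <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  labs(x = "Number of Cylinders", y = "Count", fill = "Cylinders")

# Option 2: pre-compute the counts, then draw them with geom_col()
cylinders <- as.data.frame(table(mtcars$cyl))
colnames(cylinders) <- c("Cylinders", "Count")

p_precomputed <- ggplot(cylinders,
                        aes(x = Cylinders, y = Count, fill = Cylinders)) +
  geom_col()

print(cylinders)
print(p_counted)
print(p_precomputed)
```

Pre-computing is the more natural choice when the bar heights are something other than row counts, such as means or totals.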
Exercises
1. Using the mpg dataset, create a bar plot showing the count of cars
   for each manufacturer. Order the bars from highest to lowest count.
   Add appropriate labels and a title.
2. With the diamonds dataset, create a stacked bar plot showing the
   proportion of different cuts (fair, good, very good, premium, ideal)
   for each clarity category. Use different colors for each cut. Add a
   legend and appropriate labels.
5. Box Plots for Comparing Distributions
Introduction
Box plots, also known as box-and-whisker plots, are an excellent tool
for visualizing the distribution of a continuous variable across
different categories. They provide a concise summary of the data’s
central tendency, spread, and potential outliers.
Purpose and Utility
Box plots are particularly useful for:
- Comparing distributions across different groups or categories
- Identifying the median, quartiles, and potential outliers in a
  dataset
- Detecting skewness in the data distribution
- Comparing the spread of data across different groups
You might use box plots when:
- Comparing salary distributions across different departments
- Analyzing the distribution of test scores across different schools
- Examining the variability of measurement data in scientific
  experiments
- Comparing the performance of different algorithms or methods
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(title = "Distribution of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
In this example:
1. We create box plots using geom_boxplot().
2. Group and fill by number of cylinders.
3. Remove the legend as it’s redundant with x-axis labels.
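To see which numbers a box plot summarizes, the quartiles can be computed directly. This sketch uses base R's quantile(), median(), and IQR() through tapply(). It is meant to illustrate the idea only, since the exact hinge and whisker rules used by a plotting function may differ slightly.

```r
# Minimum, quartiles, and maximum of mpg within each cylinder group
mpg_summary <- tapply(mtcars$mpg, mtcars$cyl, quantile,
                      probs = c(0, 0.25, 0.5, 0.75, 1))
print(mpg_summary)

# Medians only: the line drawn inside each box
mpg_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(mpg_medians)

# Interquartile range: the height of each box
mpg_iqr <- tapply(mtcars$mpg, mtcars$cyl, IQR)
print(mpg_iqr)
```

Comparing the medians and interquartile ranges across groups gives the same reading as comparing the position and height of the boxes in the plot.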
Exercises
1. Using the diamonds dataset, create a box plot showing the
   distribution of price for each cut category. Add color to the boxes
   based on the cut. Include appropriate labels and a title.
2. With the gapminder dataset (you may need to install the gapminder
   package), create a box plot showing the distribution of life
   expectancy for each continent. Arrange the continents in descending
   order of median life expectancy. Add color and appropriate labels.
6. Histograms and Density Plots
Introduction
Histograms and density plots are powerful tools for visualizing the
distribution of a single continuous variable. They provide insights into
the shape, central tendency, and spread of the data.
Purpose and Utility
Histograms and density plots are particularly useful for:
- Visualizing the overall distribution of a continuous variable
- Identifying the mode(s) of a distribution
- Detecting skewness or unusual patterns in the data
- Comparing the distribution of a variable across different groups
You might use these plots when:
- Analyzing the distribution of ages in a population
- Examining the distribution of response times in a psychology
  experiment
- Investigating the distribution of prices in a real estate market
- Comparing the distribution of a variable before and after an
  intervention
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 fill = "skyblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Miles Per Gallon",
       x = "Miles Per Gallon",
       y = "Density") +
  theme_minimal()
Here’s what we did:
1. Create a histogram with geom_histogram(), setting
   y = after_stat(density) to put the bars on a density scale. This
   replaces the ..density.. notation, which was deprecated in
   ggplot2 3.4.0.
2. Overlay a density curve with geom_density(), using linewidth to set
   the line thickness. The size aesthetic for lines was also deprecated
   in ggplot2 3.4.0.
3. Customize colors and labels for clarity.
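Putting the histogram on a density scale is what lets the bars and the curve share a y-axis: bar heights are rescaled so that the total bar area is 1, matching the area under a density curve. Below is a small base R check of that property. It assumes breaks from 10 to 34 in steps of 2, chosen here to cover the mpg range; ggplot2 may place its bins differently.

```r
# Bin mpg into width-2 bins without drawing anything
breaks <- seq(10, 34, by = 2)
h <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)

# On the count scale, bar heights sum to the number of observations
print(sum(h$counts))

# On the density scale, bar areas (height * width) sum to 1
bar_areas <- h$density * diff(h$breaks)
print(sum(bar_areas))

# The kernel density estimate also integrates to roughly 1
# (simple Riemann-sum approximation over the evaluation grid)
d <- density(mtcars$mpg)
approx_area <- sum(d$y) * mean(diff(d$x))
print(approx_area)
```

A count-scale histogram would dwarf the density curve, because its bar heights are on the order of the number of observations instead of fractions.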
Exercises
1. Using the diamonds dataset, create a histogram of the ‘price’
   variable. Experiment with different bin widths to see how it affects
   the visualization. Add a density curve on top of the histogram.
   Include appropriate labels and a title.
2. With the faithful dataset (built into R), create two density plots
   on the same graph: one for eruption duration and one for waiting
   time between eruptions. Use different colors for each density curve
   and add a legend. Normalize the scales so that both curves use the
   same y-axis. Add appropriate labels and a title.
7. Faceting for Multi-panel Plots
Introduction
Faceting is a powerful technique in data visualization that allows
you to create multiple panels or subplots based on categorical
variables. This approach is particularly useful when you want to compare
patterns across different subgroups of your data.
Purpose and Utility
Faceting is especially useful for:
- Comparing trends or patterns across different categories
- Visualizing how the relationship between variables changes across
  different groups
- Displaying multiple aspects of a dataset in a single, organized
  figure
- Reducing overplotting in complex datasets
You might use faceting when:
- Comparing sales trends across different regions over time
- Analyzing how the relationship between two variables varies across
  different categories
- Visualizing multiple related metrics for different groups
- Exploring how a distribution changes based on one or more categorical
  variables
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~cyl, nrow = 1) +
  labs(title = "Weight vs. MPG by Cylinders and Transmission",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Transmission") +
  theme_bw() +
  scale_color_brewer(palette = "Set1",
                     labels = c("Automatic", "Manual"))
In this example:
1. We use facet_wrap() to create separate panels for each cylinder
   count.
2. Add trend lines with geom_smooth().
3. Color points and lines by transmission type.
4. Customize labels and theme for better readability.
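facet_wrap() splits on one variable and wraps the panels into rows. When two categorical variables are of interest, a related ggplot2 function, facet_grid(), lays the panels out as a row-by-column grid. The sketch below is an illustrative variation on the example above, not a replacement for it. The label vectors am_labels and cyl_labels are assumptions about how you might want the panels named.

```r
library(ggplot2)

# Readable panel labels for the 0/1 transmission code and the cylinder counts
am_labels <- c("0" = "Automatic", "1" = "Manual")
cyl_labels <- c("4" = "4 cylinders", "6" = "6 cylinders", "8" = "8 cylinders")

# Rows are transmission types, columns are cylinder counts
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl,
             labeller = labeller(am = am_labels, cyl = cyl_labels)) +
  labs(title = "Weight vs. MPG by Cylinders and Transmission",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_bw()

print(p_grid)
```

The grid layout gives every combination of the two variables its own panel, whereas the original example separates transmission types by color within each cylinder panel.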
Exercises
1. Using the diamonds dataset, create a scatter plot of price vs.
   carat. Facet the plot by cut, creating a 2x3 grid of subplots. Color
   the points by clarity. Add a smooth trend line to each facet.
   Include appropriate labels and a title.
2. With the mpg dataset, create a box plot of highway fuel efficiency
   (hwy) for different car classes. Facet the plot by the number of
   cylinders (cyl). Color the boxes by the type of drive (drv). Arrange
   the facets in a single row. Add appropriate labels and a title.
Conclusion
This lecture has covered a range of plotting techniques in R, from
basic scatter plots to more complex, multi-layered visualizations.
Remember, the key to effective data visualization is choosing the right
plot type for your data and research question. Practice with different
datasets and experiment with various ggplot2 functions to become
proficient in creating informative and visually appealing plots.
Additional Resources
Happy plotting!
