Introduction to Plotting with R

Welcome to our comprehensive lecture on creating plots from dataframes in R! Data visualization is a crucial skill in data analysis, allowing us to communicate complex information clearly and efficiently. In this session, we’ll explore various plotting techniques using built-in R dataframes and the powerful ggplot2 library.

Throughout this lecture, we’ll cover different types of plots, their purposes, and when to use them. Each section will include detailed explanations and practical exercises to reinforce your learning.

1. Basic Plotting with Base R

Introduction

We’ll start our journey with R’s built-in plotting functions. These functions provide a quick and straightforward way to visualize data. While they may not be as flexible as more advanced libraries, understanding base R plotting is fundamental and can be useful for quick data exploration.

Purpose and Utility

Scatter plots, which we’ll create in this section, are excellent for visualizing relationships between two continuous variables. They’re particularly useful when you want to:

  - Identify correlations between variables
  - Detect outliers or unusual patterns in your data
  - Understand the distribution of data points across two dimensions

Scatter plots are widely used in various fields, including:

  - Economics: plotting GDP against life expectancy
  - Biology: comparing gene expression levels
  - Environmental science: examining the relationship between temperature and pollution levels

# Load the mtcars dataset
data(mtcars)

# Create a simple scatter plot
plot(mtcars$wt, mtcars$mpg, 
     main = "Car Weight vs. Miles Per Gallon",
     xlab = "Weight (1000 lbs)", 
     ylab = "Miles Per Gallon",
     pch = 19, 
     col = "blue")

In this example, we:

  1. Load the mtcars dataset, which is built into R.
  2. Use the plot() function to create a scatter plot.
  3. Set the main title with main, the x-axis label with xlab, and the y-axis label with ylab.
  4. Use pch = 19 for solid circle points and col = "blue" for blue color.
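
To put a number on the relationship the scatter plot suggests, one option is to compute the correlation and overlay a fitted line. The following is a minimal sketch using base R only. The variable names r and fit are ours, and the straight-line fit is an assumption made for illustration, not a claim that the relationship is strictly linear.

```r
# Load the built-in dataset
data(mtcars)

# Pearson correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
print(round(r, 2))

# Fit a simple linear model (illustrative assumption: a straight line)
fit <- lm(mpg ~ wt, data = mtcars)

# Redraw the scatter plot and overlay the fitted line
plot(mtcars$wt, mtcars$mpg,
     main = "Car Weight vs. Miles Per Gallon",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     pch = 19,
     col = "blue")
abline(fit, col = "darkgray", lwd = 2)
```

A strongly negative correlation here matches what the plot shows: heavier cars tend to have lower fuel efficiency.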

Exercises

  1. Create a scatter plot using the mtcars dataset to visualize the relationship between horsepower (hp) and quarter-mile time (qsec). Use red triangles for the points.

  2. Using the iris dataset (another built-in R dataset), create a scatter plot of sepal length vs. sepal width. Color the points based on the species. Hint: You’ll need to use the col parameter with a vector of colors corresponding to the species.

2. Introduction to ggplot2

Introduction

Now we’ll dive into ggplot2, a powerful and flexible plotting library in R. ggplot2 is based on the Grammar of Graphics, a coherent system for describing and building graphs. This system allows for highly customizable and layered graphics.

Purpose and Utility

The ggplot2 library offers several advantages over base R plotting:

  - Consistent and intuitive syntax
  - Layered approach to building complex graphics
  - Beautiful default aesthetics
  - Extensive customization options

ggplot2 is particularly useful when:

  - Creating publication-quality graphics
  - Building complex, multi-layered plots
  - Needing to quickly change aesthetic properties of plots
  - Working with large datasets

# Install and load ggplot2 if not already installed
# install.packages("ggplot2")
library(ggplot2)

# Create a scatter plot using ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Car Weight vs. Miles Per Gallon",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()

Here’s what we did:

  1. Load the ggplot2 library.
  2. Use ggplot() to initialize the plot, specifying the data and aesthetics.
  3. Add points with geom_point().
  4. Set labels with labs().
  5. Apply a minimal theme with theme_minimal().
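
Because a ggplot2 plot is built by adding layers, it can be stored in a variable and extended later. The sketch below illustrates this with the same mtcars plot. The names p and p_trend are ours, and the linear trend line is just one possible extra layer.

```r
library(ggplot2)

# Store the base plot in a variable (the name p is illustrative)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Add a further layer and a theme to the stored plot
p_trend <- p +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Car Weight vs. Miles Per Gallon",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()

# Printing the object draws the plot
print(p_trend)
```

Keeping the base plot in p makes it easy to try several variations (different geoms, themes, or labels) without repeating the data and aesthetic mappings.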

Exercises

  1. Using ggplot2 and the economics dataset (comes with ggplot2), create a line plot of unemployment over time. Use the date column for the x-axis and unemploy for the y-axis. Add appropriate labels and a title.

  2. With the mpg dataset (also included in ggplot2), create a scatter plot of engine displacement (displ) vs. highway miles per gallon (hwy). Color the points by the class of the vehicle. Add a title and appropriate axis labels.

3. Enhancing Plots with Color and Shape

Introduction

In this section, we’ll explore how to enhance our plots by incorporating additional variables through color and shape. This technique allows us to display multidimensional data in a two-dimensional plot, increasing the information density of our visualizations.

Purpose and Utility

Adding color and shape to plots serves several important purposes:

  - Grouping: It helps viewers quickly identify different categories or groups within the data.
  - Pattern recognition: It makes it easier to spot trends or patterns specific to certain groups.
  - Information density: It allows for the representation of additional variables without adding more dimensions to the plot.

This technique is particularly useful when:

  - Comparing multiple categories within a dataset
  - Identifying how different factors interact with the main variables being plotted
  - Presenting complex, multivariable data in a single, comprehensible visualization

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), shape = factor(am))) +
  geom_point(size = 3) +
  labs(title = "Car Weight vs. MPG",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders",
       shape = "Transmission") +
  scale_color_brewer(palette = "Set1") +
  theme_light()

In this example:

  1. We use color = factor(cyl) to color points by number of cylinders.
  2. shape = factor(am) changes point shapes based on transmission type.
  3. scale_color_brewer() applies a color palette from ColorBrewer.
  4. theme_light() gives a light background theme.
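
In the plot above, the shape legend shows the raw codes stored in mtcars (0 and 1 for am). One way to get readable legend entries is to recode the column as a labeled factor before plotting. The sketch below assumes the coding documented for mtcars (am: 0 = automatic, 1 = manual). The data frame name cars_labeled and the new column names are ours.

```r
library(ggplot2)

# Copy the data and add labeled factor columns (names are illustrative)
cars_labeled <- mtcars
cars_labeled$transmission <- factor(cars_labeled$am,
                                    levels = c(0, 1),
                                    labels = c("Automatic", "Manual"))
cars_labeled$cylinders <- factor(cars_labeled$cyl)

# Map the labeled factors to color and shape
p_labeled <- ggplot(cars_labeled,
                    aes(x = wt, y = mpg,
                        color = cylinders, shape = transmission)) +
  geom_point(size = 3) +
  labs(title = "Car Weight vs. MPG",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders",
       shape = "Transmission") +
  scale_color_brewer(palette = "Set1") +
  theme_light()

print(p_labeled)
```

Note the difference between mapping and setting: color inside aes() ties color to a variable and produces a legend, while an argument such as color = "blue" placed directly in geom_point() sets one fixed color for every point.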

Exercises

  1. Using the diamonds dataset (included in ggplot2), create a scatter plot of price vs. carat. Use color to represent the cut quality and shape to represent the clarity. Add appropriate labels and a title.

  2. With the iris dataset, create a scatter plot of petal length vs. petal width. Use color to represent the species. Instead of using different shapes, vary the size of the points based on the sepal width. Add a legend for both color and size.

4. Creating Bar Plots

Introduction

Bar plots are one of the most common and effective ways to visualize categorical data. They allow for easy comparison of quantities across different categories or groups.

Purpose and Utility

Bar plots are particularly useful for:

  - Comparing quantities or frequencies across different categories
  - Displaying the distribution of a categorical variable
  - Showing changes in a quantity over time (when categories are time periods)
  - Presenting survey results or other categorical data

You might use bar plots when:

  - Analyzing market share across different products or companies
  - Comparing sales figures across different regions
  - Visualizing the distribution of responses in a survey
  - Presenting budget allocations across different departments

# Prepare data
cylinders <- as.data.frame(table(mtcars$cyl))
colnames(cylinders) <- c("Cylinders", "Count")

ggplot(cylinders, aes(x = Cylinders, y = Count, fill = Cylinders)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Cars by Cylinder Count",
       x = "Number of Cylinders",
       y = "Count") +
  theme_classic() +
  scale_fill_brewer(palette = "Pastel1")

Here’s what we did:

  1. Create a summary dataframe of cylinder counts.
  2. Use geom_bar() with stat = "identity" to create bars of specified heights.
  3. Fill bars with different colors based on cylinder count.
  4. Apply a classic theme and a pastel color palette.
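
The summary step is not the only route to this plot. By default geom_bar() counts the rows in each category itself, and geom_col() is a shorthand for bars whose heights are already in the data. The following sketch shows both variants side by side. The object names counts, p_count, and p_col are ours.

```r
library(ggplot2)

# Variant 1: let geom_bar() count rows per category (default stat = "count")
p_count <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  labs(title = "Number of Cars by Cylinder Count",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Cylinders") +
  theme_classic() +
  scale_fill_brewer(palette = "Pastel1")

# Variant 2: precompute the counts and draw them with geom_col()
counts <- as.data.frame(table(mtcars$cyl))
colnames(counts) <- c("Cylinders", "Count")

p_col <- ggplot(counts, aes(x = Cylinders, y = Count, fill = Cylinders)) +
  geom_col() +
  theme_classic() +
  scale_fill_brewer(palette = "Pastel1")

print(p_count)
print(p_col)
```

Precomputing is mainly worthwhile when the heights are something other than simple counts, such as means or totals.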

Exercises

  1. Using the mpg dataset, create a bar plot showing the count of cars for each manufacturer. Order the bars from highest to lowest count. Add appropriate labels and a title.

  2. With the diamonds dataset, create a stacked bar plot showing the proportion of different cuts (fair, good, very good, premium, ideal) for each clarity category. Use different colors for each cut. Add a legend and appropriate labels.

5. Box Plots for Comparing Distributions

Introduction

Box plots, also known as box-and-whisker plots, are an excellent tool for visualizing the distribution of a continuous variable across different categories. They provide a concise summary of the data’s central tendency, spread, and potential outliers.

Purpose and Utility

Box plots are particularly useful for:

  - Comparing distributions across different groups or categories
  - Identifying the median, quartiles, and potential outliers in a dataset
  - Detecting skewness in the data distribution
  - Comparing the spread of data across different groups

You might use box plots when:

  - Comparing salary distributions across different departments
  - Analyzing the distribution of test scores across different schools
  - Examining the variability of measurement data in scientific experiments
  - Comparing the performance of different algorithms or methods

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(title = "Distribution of MPG by Number of Cylinders",
       x = "Number of Cylinders",
       y = "Miles Per Gallon") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")

In this example:

  1. We create box plots using geom_boxplot().
  2. Group and fill by number of cylinders.
  3. Remove the legend as it’s redundant with x-axis labels.
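
To connect the boxes to the numbers behind them, it can help to compute the quartiles directly. The sketch below uses base R only and prints the 25th, 50th, and 75th percentiles of mpg for each cylinder group, which correspond to the bottom, middle line, and top of each box. The names quartiles and medians are ours.

```r
# Load the built-in dataset
data(mtcars)

# Quartiles of mpg within each cylinder group
quartiles <- tapply(mtcars$mpg, mtcars$cyl, quantile,
                    probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Medians only, one value per group
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(medians)
```

Comparing these values with the plot is a quick check that you are reading the boxes correctly.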

Exercises

  1. Using the diamonds dataset, create a box plot showing the distribution of price for each cut category. Add color to the boxes based on the cut. Include appropriate labels and a title.

  2. With the gapminder dataset (you may need to install the gapminder package), create a box plot showing the distribution of life expectancy for each continent. Arrange the continents in descending order of median life expectancy. Add color and appropriate labels.

6. Histograms and Density Plots

Introduction

Histograms and density plots are powerful tools for visualizing the distribution of a single continuous variable. They provide insights into the shape, central tendency, and spread of the data.

Purpose and Utility

Histograms and density plots are particularly useful for:

  - Visualizing the overall distribution of a continuous variable
  - Identifying the mode(s) of a distribution
  - Detecting skewness or unusual patterns in the data
  - Comparing the distribution of a variable across different groups

You might use these plots when:

  - Analyzing the distribution of ages in a population
  - Examining the distribution of response times in a psychology experiment
  - Investigating the distribution of prices in a real estate market
  - Comparing the distribution of a variable before and after an intervention

# Note: linewidth requires ggplot2 3.4.0 or later
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, fill = "skyblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Miles Per Gallon",
       x = "Miles Per Gallon",
       y = "Density") +
  theme_minimal()

Here’s what we did:

  1. Create a histogram with geom_histogram(), setting y = after_stat(density) so the bars are drawn on a density scale rather than as counts.
  2. Overlay a density curve with geom_density(), using linewidth to set the line thickness.
  3. Customize colors and labels for clarity.
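
The density scale is what lets the histogram and the curve share one y-axis: on that scale the bar areas add up to 1, just as the area under the density curve does (approximately). The following base R sketch checks this numerically. The break points and the names h, total_area, d, and approx_area are our choices for illustration.

```r
# Load the built-in dataset
data(mtcars)

# Bin mpg in steps of 2 without drawing (break points are our choice)
h <- hist(mtcars$mpg, breaks = seq(10, 34, by = 2), plot = FALSE)

# Area of each bar = height (density) * width; the areas sum to 1
total_area <- sum(h$density * diff(h$breaks))
print(total_area)

# The kernel density estimate also integrates to roughly 1
# (simple Riemann sum over the evenly spaced evaluation grid)
d <- density(mtcars$mpg)
approx_area <- sum(d$y) * diff(d$x[1:2])
print(approx_area)
```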

Exercises

  1. Using the diamonds dataset, create a histogram of the ‘price’ variable. Experiment with different bin widths to see how it affects the visualization. Add a density curve on top of the histogram. Include appropriate labels and a title.

  2. With the faithful dataset (built into R), create two density plots on the same graph: one for eruption duration and one for waiting time between eruptions. Use different colors for each density curve and add a legend. Normalize the scales so that both curves use the same y-axis. Add appropriate labels and a title.

7. Faceting for Multi-panel Plots

Introduction

Faceting is a powerful technique in data visualization that allows you to create multiple panels or subplots based on categorical variables. This approach is particularly useful when you want to compare patterns across different subgroups of your data.

Purpose and Utility

Faceting is especially useful for:

  - Comparing trends or patterns across different categories
  - Visualizing how the relationship between variables changes across different groups
  - Displaying multiple aspects of a dataset in a single, organized figure
  - Reducing overplotting in complex datasets

You might use faceting when:

  - Comparing sales trends across different regions over time
  - Analyzing how the relationship between two variables varies across different categories
  - Visualizing multiple related metrics for different groups
  - Exploring how a distribution changes based on one or more categorical variables

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~cyl, nrow = 1) +
  labs(title = "Weight vs. MPG by Cylinders and Transmission",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Transmission") +
  theme_bw() +
  scale_color_brewer(palette = "Set1", labels = c("Automatic", "Manual"))

In this example:

  1. We use facet_wrap() to create separate panels for each cylinder count.
  2. Add trend lines with geom_smooth().
  3. Color points and lines by transmission type.
  4. Customize labels and theme for better readability.
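
When two categorical variables are of interest, facet_grid() is an alternative that lays panels out in rows and columns, one variable per direction. The sketch below is one possible arrangement of the same data, with transmission in rows and cylinder count in columns. The names cars_labeled and p_grid are ours, and the transmission labels assume the mtcars coding (am: 0 = automatic, 1 = manual).

```r
library(ggplot2)

# Add a labeled transmission column (illustrative name)
cars_labeled <- mtcars
cars_labeled$transmission <- factor(cars_labeled$am,
                                    levels = c(0, 1),
                                    labels = c("Automatic", "Manual"))

# Rows = transmission type, columns = number of cylinders
p_grid <- ggplot(cars_labeled, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(transmission ~ cyl) +
  labs(title = "Weight vs. MPG by Transmission and Cylinders",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_bw()

print(p_grid)
```

Compared with coloring by transmission inside each panel, the grid gives every combination its own panel, which can be easier to read when groups overlap but leaves some panels sparse in a dataset this small.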

Exercises

  1. Using the diamonds dataset, create a scatter plot of price vs. carat. Facet the plot by cut, creating a 2x3 grid of subplots. Color the points by clarity. Add a smooth trend line to each facet. Include appropriate labels and a title.

  2. With the mpg dataset, create a box plot of highway fuel efficiency (hwy) for different car classes. Facet the plot by the number of cylinders (cyl). Color the boxes by the type of drive (drv). Arrange the facets in a single row. Add appropriate labels and a title.

Conclusion

This lecture has covered a range of plotting techniques in R, from basic scatter plots to more complex, multi-layered visualizations. Remember, the key to effective data visualization is choosing the right plot type for your data and research question. Practice with different datasets and experiment with various ggplot2 functions to become proficient in creating informative and visually appealing plots.

Additional Resources

Happy plotting!

