---
title: "Working with Data: A Comprehensive Guide with Exercises"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(lubridate)
library(caret)
library(tidyr)
library(forecast)
library(randomForest) 
```

## Introduction

This notebook covers various aspects of working with data, including data visualization, facets, time series analysis, linear regression, and classification. We'll use R and its popular libraries to demonstrate these concepts, provide detailed explanations, and offer exercises for practice.

## 1. Data Visualization

### Explanation

Data visualization is the graphical representation of information and data. It uses statistical graphics, plots, and other visual elements to communicate complex data relationships and patterns effectively. Trends, clusters, and outliers are often easier to spot in a well-designed plot than in a table of numbers, which makes data visualization an essential tool for understanding and presenting data.

In R, the `ggplot2` package is widely used for creating data visualizations. It's based on the Grammar of Graphics, a layered approach to describing and constructing visualizations.

Key components of a ggplot:

1. Data: The dataset you're visualizing
2. Aesthetics (aes): Mapping of variables to visual properties (e.g., x-axis, y-axis, color, size)
3. Geometries (geom): The type of plot (e.g., points, lines, bars)
4. Scales: How the data is mapped to the visual properties
5. Facets: Division of the plot into subplots based on categorical variables
6. Theme: Overall visual style of the plot
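
As a rough sketch of how the six components above map onto code, the following builds a plot one piece at a time. The choice of columns from the built-in `mpg` dataset, the `Dark2` palette, and the object name `p` are illustrative assumptions, not a prescribed recipe.

```{r}
library(ggplot2)

# 1. Data and 2. aesthetics: map engine size, highway MPG, and drive type
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  # 3. Geometry: draw one point per car
  geom_point() +
  # 4. Scale: control how drive type is mapped to colors
  scale_color_brewer(palette = "Dark2") +
  # 5. Facets: one panel per number of cylinders
  facet_wrap(~ cyl) +
  # 6. Theme: overall look of the plot
  theme_minimal()

p
```

Each `+` adds one component, so a layer can be removed or swapped without touching the rest of the plot.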

### Example

Let's create a multi-layered visualization using the `mtcars` dataset:

```{r}
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  # Map size only in the point layer so geom_smooth() does not inherit it
  geom_point(aes(size = hp), alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_color_viridis_d() +
  labs(title = "Car Weight vs. MPG",
       subtitle = "Colored by number of cylinders, size represents horsepower",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders",
       size = "Horsepower") +
  theme_minimal()
```

This plot shows:

- The relationship between weight (x-axis) and miles per gallon (y-axis)
- The number of cylinders (color)
- The horsepower (size of points)
- A linear regression line for each cylinder group

### Exercises

1. Using the built-in `diamonds` dataset, create a scatter plot of price vs. carat. Color the points by the cut quality and add a title and proper axis labels.

2. With the `mpg` dataset, create a box plot showing the distribution of highway miles per gallon (hwy) for different car classes (class). Add color to the boxes and include a title and axis labels.

3. Using the `economics` dataset, create a line plot showing both personal savings rate (psavert) and median duration of unemployment in weeks (uempmed) over time. Use two different y-axes (hint: look up `sec_axis()`). Include a legend, title, and appropriate labels.

## 2. Facets

### Explanation

Faceting is a technique used to split a plot into multiple subplots based on one or more categorical variables. This allows for easy comparison of patterns across different subgroups in the data. In ggplot2, there are two main faceting functions:

1. `facet_wrap()`: Arranges a series of plots into a 2D grid, "wrapping" around with a specified number of rows or columns.
2. `facet_grid()`: Creates a grid of plots with rows and columns based on different categorical variables.

Faceting is particularly useful when you want to:

- Compare trends or patterns across different categories
- Visualize how relationships change across subgroups
- Display multiple related plots in a compact format
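
As a small sketch of the second function, `facet_grid()` lays panels out by two variables at once. The choice of `drv` for rows and `cyl` for columns, and the object name `p_grid`, are illustrative assumptions.

```{r}
library(ggplot2)

# Rows are drive types (drv), columns are cylinder counts (cyl)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  facet_grid(drv ~ cyl) +
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

p_grid
```

Unlike `facet_wrap()`, the grid keeps a panel for every row and column combination, so combinations with no data appear as empty panels.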

### Example

Let's create a faceted plot using the `mpg` dataset:

```{r}
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  facet_wrap(~ manufacturer, scales = "free") +
  labs(title = "Highway MPG vs. Engine Displacement",
       subtitle = "Faceted by manufacturer, colored by vehicle class",
       x = "Engine Displacement (L)",
       y = "Highway MPG",
       color = "Vehicle Class") +
  theme_minimal() +
  theme(legend.position = "bottom")
```

This plot shows:

- The relationship between engine displacement and highway MPG
- Facets for each car manufacturer
- Colors representing different vehicle classes
- A linear regression line for each manufacturer

### Exercises

1. Using the `diamonds` dataset, create a faceted histogram of diamond prices. Facet by cut quality and fill the histograms by diamond color. Add appropriate labels and a title.

2. With the `mpg` dataset, create a faceted scatter plot of city MPG vs. highway MPG. Facet by the number of cylinders and color the points by the type of drive (drv). Include a regression line for each facet.

3. Using the `txhousing` dataset, create a faceted line plot showing the median housing price over time for different cities. Use `facet_wrap()` to create a 3x3 grid of plots for the top 9 cities by median housing price. Add appropriate labels and a title.

## 3. Time Series

### Explanation

Time series analysis involves studying data points collected over time, typically at regular intervals. It's used to identify trends, seasonal patterns, and other time-dependent structures in the data. A time series is usually described in terms of the following components:

1. Trend: Long-term increase or decrease in the data
2. Seasonality: Repeating patterns at fixed intervals
3. Cyclical patterns: Fluctuations not tied to a fixed period
4. Random variations: Unexplained variability in the data

In R, time series data is often stored in specialized objects like `ts` (base R) or `xts` (from the xts package). The `lubridate` package is useful for handling dates and times, while `forecast` provides functions for time series forecasting.
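
As a minimal sketch of what a `ts` object looks like, the following builds a monthly series from simulated values. The data are hypothetical (random numbers), and the names `monthly_values` and `monthly_ts` are introduced here for illustration only.

```{r}
# Hypothetical monthly values for two years, starting January 2020
set.seed(1)
monthly_values <- round(100 + cumsum(rnorm(24)), 1)

# frequency = 12 marks the series as monthly; start = c(year, month)
monthly_ts <- ts(monthly_values, start = c(2020, 1), frequency = 12)

frequency(monthly_ts)                   # observations per year
start(monthly_ts)                       # first time point (year, month)
end(monthly_ts)                         # last time point (year, month)
window(monthly_ts, start = c(2021, 1))  # subset: 2021 only
```

The `frequency` attribute is what functions such as `decompose()` use to decide the length of one seasonal cycle.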

Common techniques in time series analysis include:

- Moving averages
- Exponential smoothing
- ARIMA (Autoregressive Integrated Moving Average) models
- Decomposition of time series into trend, seasonal, and residual components
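
The first two techniques can be sketched with base R alone. This is one possible approach, assuming a centered 12-month window for the moving average and `HoltWinters()` with trend and seasonality switched off as the exponential smoother; the names `ma_weights`, `ap_ma`, and `ap_ses` are illustrative. `stats::filter()` is called with its namespace because `dplyr`, loaded in the setup chunk, masks `filter()`.

```{r}
# Centered 12-month moving average: half weights on the two end months
ma_weights <- c(0.5, rep(1, 11), 0.5) / 12
ap_ma <- stats::filter(AirPassengers, filter = ma_weights, sides = 2)

# Simple exponential smoothing: Holt-Winters without trend or seasonal terms
ap_ses <- HoltWinters(AirPassengers, beta = FALSE, gamma = FALSE)

# Overlay both smoothers on the original series
plot(AirPassengers, main = "AirPassengers with two smoothers",
     ylab = "Passengers")
lines(ap_ma, col = "red", lwd = 2)
lines(fitted(ap_ses)[, "xhat"], col = "blue", lty = 2)
```

The moving average is undefined for the first and last six months because the centered window needs data on both sides.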

### Example

Let's decompose the monthly `AirPassengers` dataset into its components:

```{r}
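data("AirPassengers")

# Data-frame version of the series (handy for ggplot2; not used by decompose())
ap_df <- data.frame(
  date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "month"),
  passengers = as.numeric(AirPassengers)
)

# Decompose the time series (decompose() is additive by default)
ap_decomp <- decompose(AirPassengers)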

# Plot the original series and its components
par(mfrow = c(4, 1), mar = c(2, 2, 2, 2))
plot(ap_decomp$x, main = "Original Time Series", ylab = "Passengers")
plot(ap_decomp$trend, main = "Trend", ylab = "Trend")
plot(ap_decomp$seasonal, main = "Seasonal", ylab = "Seasonal")
plot(ap_decomp$random, main = "Random", ylab = "Random")
```

This example shows:

- The original time series of airline passengers
- The decomposition of the series into trend, seasonal, and random components
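
The seasonal swings in `AirPassengers` grow as the overall level rises, which an additive decomposition captures only roughly. As a hedged alternative sketch, the same function can be asked for a multiplicative decomposition; the name `ap_mult` is introduced here for illustration.

```{r}
# Multiplicative model: observed = trend * seasonal * random
ap_mult <- decompose(AirPassengers, type = "multiplicative")

# plot() on a decomposed series draws all four panels at once
plot(ap_mult)

# Seasonal factors for the 12 months (values above 1 mark busier months)
round(ap_mult$figure, 2)
```

In the multiplicative case the seasonal component is a set of factors centered on 1 rather than offsets centered on 0.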

### Exercises

1. Using the `economics` dataset, create a time series plot of the personal savings rate (psavert) from 1967 to 2015. Add a trend line using a simple moving average with a window of 12 months.

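2. With the `nhtemp` dataset (average yearly temperatures in New Haven), create a time series plot and add a smoothed trend line. Because the series is annual, it has no seasonal component and `decompose()` will not work on it. To practice decomposition, use the monthly `nottem` dataset (average monthly temperatures at Nottingham) instead: decompose it into trend, seasonal, and random components and plot each component separately.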

3. Using the `co2` dataset (atmospheric CO2 concentrations), create a time series forecast for the next 24 months using an appropriate model (e.g., ARIMA). Plot the original data, the fitted values, and the forecast with prediction intervals.

## 4. Linear Regression

### Explanation

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. In its simplest form (simple linear regression), it assumes a linear relationship between two variables:

y = β₀ + β₁x + ε

Where:

- y is the dependent variable
- x is the independent variable
- β₀ is the y-intercept
- β₁ is the slope
- ε is the error term

Multiple linear regression extends this to multiple independent variables:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

Key concepts in linear regression:

1. Coefficients: Estimated using the method of least squares
2. R-squared: Proportion of the variance in the dependent variable that the model explains
3. Residuals: Differences between observed and predicted values
4. Assumptions: Linearity, independence, homoscedasticity, normality of residuals
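
To make the least-squares idea above concrete, here is a minimal sketch that estimates the coefficients of a simple regression from the normal equations and checks them against `lm()`. The choice of `mpg ~ wt` and the variable names (`X`, `beta_hat`, `r_squared`, `fit`) are illustrative assumptions.

```{r}
# Design matrix with an intercept column, and the response vector
X <- cbind(Intercept = 1, wt = mtcars$wt)
y <- mtcars$mpg

# Least-squares estimate: solve (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residuals and R-squared = 1 - SS_residual / SS_total
fitted_y <- drop(X %*% beta_hat)
res <- y - fitted_y
r_squared <- 1 - sum(res^2) / sum((y - mean(y))^2)

# Compare the manual results with lm()
fit <- lm(mpg ~ wt, data = mtcars)
cbind(manual = drop(beta_hat), lm = coef(fit))
c(manual = r_squared, lm = summary(fit)$r.squared)
```

In practice `lm()` is preferable because it uses a numerically more stable QR decomposition, but both routes should agree here.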

### Example

Let's perform a multiple linear regression using the `mtcars` dataset:

```{r}
# Fit a multiple linear regression model
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Print the summary
summary(model)

# Plot diagnostics
par(mfrow = c(2, 2))
plot(model)
```

This example shows:

- A multiple linear regression model predicting MPG from weight, horsepower, and quarter-mile time
- The summary output with coefficients, R-squared, and p-values
- Diagnostic plots for assessing model assumptions

### Exercises

1. Using the `Boston` dataset from the MASS package, create a simple linear regression model to predict median house value (medv) based on the average number of rooms (rm). Plot the data points and the regression line, and interpret the coefficients.

2. With the `mtcars` dataset, build a multiple linear regression model to predict quarter-mile time (qsec) using horsepower (hp), weight (wt), and transmission type (am). Interpret the results and check the model assumptions using diagnostic plots.

3. Using the `diamonds` dataset, create a linear regression model to predict the price of a diamond based on its carat, cut, color, and clarity. Use dummy variables for categorical predictors (note that `cut`, `color`, and `clarity` are ordered factors, so `lm()` applies polynomial contrasts unless you convert them to unordered factors). Interpret the results and discuss which factors have the largest impact on diamond prices.

## 5. Classification

### Explanation

Classification is a supervised machine learning technique used to predict categorical outcomes. It involves training a model on labeled data to learn the relationship between features and class labels, then using that model to predict the class of new, unseen instances.

Common classification algorithms include:

1. Logistic Regression
2. k-Nearest Neighbors (k-NN)
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks

Key concepts in classification:

- Features: Input variables used to make predictions
- Target variable: The categorical outcome we're trying to predict
- Training and testing sets: Data used to build and evaluate the model
- Confusion matrix: Table showing correct and incorrect predictions
- Accuracy, precision, recall, and F1-score: Metrics for evaluating model performance
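
The metrics in the last bullet can all be derived from the confusion matrix. The following sketch uses a small set of hypothetical labels (made up for illustration, not taken from any dataset) and treats `"yes"` as the positive class; the variable names are assumptions as well.

```{r}
# Hypothetical true and predicted labels for a two-class problem
actual <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
                 levels = c("yes", "no"))
predicted <- factor(c("yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"),
                    levels = c("yes", "no"))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(Predicted = predicted, Actual = actual)
cm

tp <- cm["yes", "yes"]  # true positives
fp <- cm["yes", "no"]   # false positives
fn <- cm["no", "yes"]   # false negatives
tn <- cm["no", "no"]    # true negatives

accuracy  <- (tp + tn) / sum(cm)                            # 0.8
precision <- tp / (tp + fp)                                 # 0.75
recall    <- tp / (tp + fn)                                 # 0.75
f1        <- 2 * precision * recall / (precision + recall)  # 0.75

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Accuracy alone can be misleading when classes are imbalanced, which is why precision and recall are usually reported alongside it.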

### Example

Let's train and evaluate a Random Forest classifier on the `iris` dataset:

```{r}
# Load necessary libraries
library(randomForest)
library(caret)  # provides createDataPartition() and confusionMatrix()

# Split the data
set.seed(123)
train_indices <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Train a Random Forest model
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100)

# Make predictions
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model
confusionMatrix(predictions, test_data$Species)

# Plot feature importance
varImpPlot(rf_model, main = "Feature Importance in Random Forest Model")
```

This example demonstrates:

- Training a Random Forest classifier on the iris dataset
- Making predictions on a test set
- Evaluating the model using a confusion matrix
- Visualizing feature importance

### Exercises

1. Using the `mtcars` dataset, create a logistic regression model to predict whether a car has an automatic or manual transmission (am, where 0 = automatic and 1 = manual) based on other features. Split the data into training and testing sets, fit the model, make predictions, and evaluate its performance using a confusion matrix and ROC curve. Keep in mind that the dataset has only 32 rows, so the test-set estimates will be noisy.

2. With the `wine` dataset from the rattle package, build a k-Nearest Neighbors classifier to predict the wine type. Experiment with different values of k and use cross-validation to select the best model. Evaluate the final model's performance on a held-out test set.

3. Using the `Default` dataset from the ISLR package, create a decision tree classifier to predict whether a customer will default on their credit card payment. Visualize the tree, interpret the results, and evaluate the model's performance. Then, compare its performance to a random forest classifier on the same data.

