---
title: "Working with Data: A Comprehensive Guide with Exercises"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(lubridate)
library(caret)
library(tidyr)
library(forecast)
library(randomForest) 
```

## Introduction

This notebook covers various aspects of working with data, including data visualization, facets, time series analysis, linear regression, and classification. We'll use R and its popular libraries to demonstrate these concepts, provide detailed explanations, and offer exercises for practice.

## 1. Data Visualization

### Explanation

Data visualization is the graphical representation of information and data. It uses statistical graphics, plots, and other visual elements to communicate complex relationships and patterns effectively. Trends, clusters, and outliers are often easier to spot in a plot than in a table of numbers, which makes visualization an essential tool for both exploring and presenting data.

In R, the `ggplot2` package is widely used for creating data visualizations. It's based on the Grammar of Graphics, a layered approach to describing and constructing visualizations.

Key components of a ggplot:

1. Data: The dataset you're visualizing
2. Aesthetics (aes): Mapping of variables to visual properties (e.g., x-axis, y-axis, color, size)
3. Geometries (geom): The type of plot (e.g., points, lines, bars)
4. Scales: How the data is mapped to the visual properties
5. Facets: Division of the plot into subplots based on categorical variables
6. Theme: Overall visual style of the plot
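
The components listed above can be seen by building a plot one layer at a time. The following is a minimal sketch using the built-in `mtcars` dataset; the object names `p_base` and `p_full` are illustrative only, and the choice of palette and faceting variable is arbitrary.

```{r}
library(ggplot2)

# Data and aesthetics only: with no geometry yet, this draws empty axes
p_base <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl)))

# Geometry, scale, facets, and theme are added as separate layers
p_full <- p_base +
  geom_point() +                           # geometry: one point per car
  scale_color_brewer(palette = "Dark2") +  # scale: how cyl maps to colors
  facet_wrap(~ am) +                       # facets: one panel per transmission type
  theme_minimal()                          # theme: overall look of the plot

p_full
```

Printing `p_base` on its own is a quick way to check that the data and aesthetic mappings are what you expect before adding any geometry.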

### Example

Let's build a layered scatter plot with the built-in `mtcars` dataset:

```{r}
# size is mapped inside geom_point() so that geom_smooth() does not inherit it
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(aes(size = hp), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_viridis_d() +
  labs(title = "Car Weight vs. MPG",
       subtitle = "Colored by number of cylinders, size represents horsepower",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon",
       color = "Cylinders",
       size = "Horsepower") +
  theme_minimal()
```

This plot shows:

- The relationship between weight (x-axis) and miles per gallon (y-axis)
- The number of cylinders (color)
- The horsepower (size of points)
- A linear regression line for each cylinder group

### Exercises

1. Using the built-in `diamonds` dataset, create a scatter plot of price vs. carat. Color the points by the cut quality and add a title and proper axis labels.

2. With the `mpg` dataset, create a box plot showing the distribution of highway miles per gallon (hwy) for different car classes (class). Add color to the boxes and include a title and axis labels.

3. Using the `economics` dataset, create a line plot showing both the personal savings rate (`psavert`, in percent) and the median duration of unemployment (`uempmed`, in weeks) over time. Because the two series are measured in different units, use two different y-axes (hint: look up `sec_axis()`). Include a legend, title, and appropriate labels.

## 2. Facets

### Explanation

Faceting is a technique used to split a plot into multiple subplots based on one or more categorical variables. This allows for easy comparison of patterns across different subgroups in the data. In ggplot2, there are two main faceting functions:

1. `facet_wrap()`: Takes a one-dimensional sequence of panels, usually defined by a single variable, and wraps it into a 2D grid with a specified number of rows or columns.
2. `facet_grid()`: Lays out panels in a matrix, with one categorical variable defining the rows and another defining the columns.

Faceting is particularly useful when you want to:

- Compare trends or patterns across different categories
- Visualize how relationships change across subgroups
- Display multiple related plots in a compact format
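
To make the difference between the two faceting functions concrete, here is a small sketch on the `mpg` dataset. The object names (`p`, `p_wrap`, `p_grid`) and the choice of faceting variables are illustrative assumptions, not part of the example that follows.

```{r}
library(ggplot2)

# Shared base plot: engine displacement against highway MPG
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6)

# facet_wrap(): one variable, with the panels wrapped into two columns
p_wrap <- p + facet_wrap(~ class, ncol = 2)

# facet_grid(): drive type defines the rows, cylinder count the columns
p_grid <- p + facet_grid(drv ~ cyl)

p_wrap
p_grid
```

With `facet_grid()`, combinations that have no observations still get an (empty) panel, which is one reason `facet_wrap()` is often preferred for a single variable with many levels.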

### Example

Let's create a faceted plot using the `mpg` dataset:

```{r}
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~ manufacturer, scales = "free") +
  labs(title = "Highway MPG vs. Engine Displacement",
       subtitle = "Faceted by manufacturer, colored by vehicle class",
       x = "Engine Displacement (L)",
       y = "Highway MPG",
       color = "Vehicle Class") +
  theme_minimal() +
  theme(legend.position = "bottom")
```

This plot shows:

- The relationship between engine displacement and highway MPG
- Facets for each car manufacturer
- Colors representing different vehicle classes
- A linear regression line for each manufacturer

### Exercises

1. Using the `diamonds` dataset, create a faceted histogram of diamond prices. Facet by cut quality and fill the histograms by diamond color. Add appropriate labels and a title.

2. With the `mpg` dataset, create a faceted scatter plot of city MPG vs. highway MPG. Facet by the number of cylinders and color the points by the type of drive (drv). Include a regression line for each facet.

3. Using the `txhousing` dataset, create a faceted line plot showing the median housing price over time for different cities. Use `facet_wrap()` to create a 3x3 grid of plots for the top 9 cities by median housing price. Add appropriate labels and a title.

## 3. Time Series

### Explanation

Time series analysis involves studying data points collected over time, typically at regular intervals. It's used to identify trends, seasonal patterns, and other time-dependent structures in the data. Key components of a time series include:

1. Trend: Long-term increase or decrease in the data
2. Seasonality: Repeating patterns at fixed intervals
3. Cyclical patterns: Fluctuations not tied to a fixed period
4. Random variations: Unexplained variability in the data

In R, time series data is often stored in specialized objects like `ts` (base R) or `xts` (from the xts package). The `lubridate` package is useful for handling dates and times, while `forecast` provides functions for time series forecasting.
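
As a quick illustration of the base R `ts` class, the sketch below builds a monthly series from a plain numeric vector. The values in `sales` are made-up numbers used only for demonstration, and the names `sales` and `sales_ts` are hypothetical.

```{r}
# Hypothetical monthly values for two years (made-up numbers for illustration)
sales <- c(12, 14, 13, 17, 19, 22, 25, 24, 20, 16, 14, 13,
           13, 15, 15, 18, 21, 24, 27, 26, 21, 18, 15, 14)

# frequency = 12 marks the data as monthly; start = c(2020, 1) is January 2020
sales_ts <- ts(sales, start = c(2020, 1), frequency = 12)

frequency(sales_ts)                    # observations per year
start(sales_ts)                        # first time point as c(year, month)
window(sales_ts, start = c(2021, 1))   # subset: the second year only
```

Functions such as `decompose()` rely on the `frequency` attribute to know how long one seasonal cycle is, so setting it correctly matters.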

Common techniques in time series analysis include:

- Moving averages
- Exponential smoothing
- ARIMA (Autoregressive Integrated Moving Average) models
- Decomposition of time series into trend, seasonal, and residual components
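
The first two techniques can be sketched with base R alone. The example below smooths the built-in `AirPassengers` series with a centered 12-month moving average and with simple exponential smoothing; the object names (`ma_weights`, `ap_ma`, `ap_ses`) are illustrative, and the smoothing parameter is left for `HoltWinters()` to estimate rather than chosen by hand.

```{r}
# Centered 12-month moving average: half weights on the two end months
# keep the 13-term window symmetric around each observation
ma_weights <- c(0.5, rep(1, 11), 0.5) / 12
ap_ma <- stats::filter(AirPassengers, filter = ma_weights, sides = 2)

# Simple exponential smoothing: no trend (beta) or seasonal (gamma) term
ap_ses <- HoltWinters(AirPassengers, beta = FALSE, gamma = FALSE)

plot(AirPassengers, col = "grey40", ylab = "Passengers",
     main = "Smoothing AirPassengers")
lines(ap_ma, col = "blue", lwd = 2)
lines(fitted(ap_ses)[, "xhat"], col = "red", lty = 2)
legend("topleft",
       legend = c("Observed", "12-month moving average", "Exponential smoothing"),
       col = c("grey40", "blue", "red"), lty = c(1, 1, 2), bty = "n")
```

The moving average is undefined for the first and last six months because the window would extend past the ends of the series.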

### Example

Let's decompose the built-in `AirPassengers` dataset (monthly airline passenger totals, 1949 to 1960) into its components:

```{r}
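data("AirPassengers")

# Data-frame copy of the series with explicit dates. The base R plots below
# don't use it, but it is convenient for ggplot2 or dplyr workflows.
ap_df <- data.frame(
  date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "month"),
  passengers = as.numeric(AirPassengers)
)

# Decompose the time series. decompose() is additive by default; the seasonal
# swings in AirPassengers grow with the level, so type = "multiplicative" is
# worth comparing.
ap_decomp <- decompose(AirPassengers)

# Plot the original series and its components, then restore the plot settings
old_par <- par(mfrow = c(4, 1), mar = c(2, 2, 2, 2))
plot(ap_decomp$x, main = "Original Time Series", ylab = "Passengers")
plot(ap_decomp$trend, main = "Trend", ylab = "Trend")
plot(ap_decomp$seasonal, main = "Seasonal", ylab = "Seasonal")
plot(ap_decomp$random, main = "Random", ylab = "Random")
par(old_par)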
```

This example shows:

- The original time series of airline passengers
- The decomposition of the series into trend, seasonal, and random components

### Exercises

1. Using the `economics` dataset, create a time series plot of the personal savings rate (psavert) from 1967 to 2015. Add a trend line using a simple moving average with a window of 12 months.

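2. With the `nhtemp` dataset (average yearly temperatures in New Haven), create a time series plot and estimate the long-term trend with a moving average or a loess smoother. Because the data are annual (frequency 1), `decompose()` cannot extract a seasonal component; explain why, then plot the trend and the remaining variation separately.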

3. Using the `co2` dataset (atmospheric CO2 concentrations), create a time series forecast for the next 24 months using an appropriate model (e.g., ARIMA). Plot the original data, the fitted values, and the forecast with prediction intervals.

## 4. Linear Regression

### Explanation

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. In its simplest form (simple linear regression), it assumes a linear relationship between two variables:

y = β₀ + β₁x + ε

Where:

- y is the dependent variable
- x is the independent variable
- β₀ is the y-intercept
- β₁ is the slope
- ε is the error term

Multiple linear regression extends this to multiple independent variables:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

Key concepts in linear regression:

1. Coefficients: Estimated using the method of least squares, which minimizes the sum of squared residuals
2. R-squared: Proportion of the variance in the dependent variable that the model explains
3. Residuals: Differences between observed and predicted values
4. Assumptions: Linearity, independence of errors, homoscedasticity (constant error variance), normality of residuals
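
To connect these concepts to the formulas, the sketch below computes the least-squares coefficients and R-squared by hand for a simple regression of `mpg` on `wt` in `mtcars`, and compares them with `lm()`. The variable names (`X`, `y`, `beta_hat`, `res`, `r_squared`, `fit`) are illustrative; in practice `lm()` uses a QR decomposition rather than solving the normal equations directly.

```{r}
# Design matrix with an intercept column, and the response vector
X <- cbind(Intercept = 1, wt = mtcars$wt)
y <- mtcars$mpg

# Least-squares estimate: solve the normal equations (X'X) b = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residuals, and R-squared = 1 - SS_residual / SS_total
res <- y - X %*% beta_hat
r_squared <- 1 - sum(res^2) / sum((y - mean(y))^2)

# Compare with lm()
fit <- lm(mpg ~ wt, data = mtcars)
cbind(by_hand = drop(beta_hat), lm = coef(fit))
c(by_hand = r_squared, lm = summary(fit)$r.squared)
```

The two columns should agree up to floating-point error, which shows that `lm()` is producing the same least-squares estimates described above.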

### Example

Let's perform a multiple linear regression using the `mtcars` dataset:

```{r}
# Fit a multiple linear regression model
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Print the summary
summary(model)

# Plot diagnostics
par(mfrow = c(2, 2))
plot(model)
```

This example shows:

- A multiple linear regression model predicting MPG from weight, horsepower, and quarter-mile time
- The summary output with coefficients, R-squared, and p-values
- Diagnostic plots for assessing model assumptions

### Exercises

1. Using the `Boston` dataset from the MASS package, create a simple linear regression model to predict median house value (`medv`) based on the average number of rooms (`rm`). Plot the data points and the regression line, and interpret the coefficients.

2. With the `mtcars` dataset, build a multiple linear regression model to predict quarter-mile time (qsec) using horsepower (hp), weight (wt), and transmission type (am). Interpret the results and check the model assumptions using diagnostic plots.

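3. Using the `diamonds` dataset, create a linear regression model to predict the price of a diamond based on its carat, cut, color, and clarity. Use dummy variables for the categorical predictors: because `cut`, `color`, and `clarity` are ordered factors, `lm()` applies polynomial contrasts by default, so convert them to unordered factors first (e.g., `factor(cut, ordered = FALSE)`). Interpret the results and discuss which factors have the largest effect on diamond prices.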

## 5. Classification

### Explanation

Classification is a supervised machine learning technique used to predict categorical outcomes. It involves training a model on labeled data to learn the relationship between features and class labels, then using that model to predict the class of new, unseen instances.

Common classification algorithms include:

1. Logistic Regression
2. k-Nearest Neighbors (k-NN)
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks

Key concepts in classification:

- Features: Input variables used to make predictions
- Target variable: The categorical outcome we're trying to predict
- Training and testing sets: Data used to build and evaluate the model
- Confusion matrix: Table showing correct and incorrect predictions
- Accuracy, precision, recall, and F1-score: Metrics for evaluating model performance
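
The evaluation metrics follow directly from the cells of a confusion matrix. The sketch below works through them for a two-class problem; the counts in `cm` are made-up numbers for illustration, and "yes" is assumed to be the positive class.

```{r}
# Hypothetical confusion matrix (made-up counts):
# rows are predicted classes, columns are actual classes
cm <- matrix(c(40,  5,
               10, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("yes", "no"),
                             Actual    = c("yes", "no")))

tp <- cm["yes", "yes"]  # true positives
fp <- cm["yes", "no"]   # false positives
fn <- cm["no", "yes"]   # false negatives

accuracy  <- sum(diag(cm)) / sum(cm)      # share of all predictions that are correct
precision <- tp / (tp + fp)               # share of predicted "yes" that are truly "yes"
recall    <- tp / (tp + fn)               # share of actual "yes" that were found
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

Precision and recall can diverge sharply when classes are imbalanced, which is why accuracy alone is rarely enough to judge a classifier.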

### Example

Let's classify the three species in the `iris` dataset with a Random Forest classifier:

```{r}
# Load necessary libraries
# (caret provides createDataPartition() and confusionMatrix())
library(randomForest)
library(caret)

# Split the data
set.seed(123)
train_indices <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Train a Random Forest model
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100)

# Make predictions
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model
confusionMatrix(predictions, test_data$Species)

# Plot feature importance
varImpPlot(rf_model, main = "Feature Importance in Random Forest Model")
```

This example demonstrates:

- Training a Random Forest classifier on the iris dataset
- Making predictions on a test set
- Evaluating the model using a confusion matrix
- Visualizing feature importance

### Exercises

1. Using the `mtcars` dataset, create a logistic regression model to predict whether a car has an automatic or manual transmission (am) based on other features. Split the data into training and testing sets, fit the model, make predictions, and evaluate its performance using a confusion matrix and ROC curve.

2. With the `wine` dataset from the rattle package, build a k-Nearest Neighbors classifier to predict the wine type. Experiment with different values of k and use cross-validation to select the best model. Evaluate the final model's performance on a held-out test set.

3. Using the `Default` dataset from the ISLR package, create a decision tree classifier to predict whether a customer will default on their credit card payment. Visualize the tree, interpret the results, and evaluate the model's performance. Then, compare its performance to a random forest classifier on the same data.

