R Notebook: Working with DataFrames

Introduction

Welcome to this comprehensive lecture on working with DataFrames in R! DataFrames are one of the most fundamental and versatile data structures in R, making them essential for any data analysis task. In this session, we’ll explore various operations and concepts related to DataFrames using a deck of cards as our primary example.

We’ve chosen a deck of cards as our dataset because it provides a relatable and intuitive structure that can demonstrate many DataFrame operations effectively. Each card in our deck will be represented as a row in our DataFrame, with attributes like face, suit, and value as columns.

Let’s begin by loading the necessary libraries and our deck of cards:

library(dplyr)
library(readr)

# Load the deck of cards
deck <- read_csv(url("https://nayelbettache.github.io/documents/STSCI_2120/deck.csv"))

Let’s break down what’s happening here:

  1. library(dplyr): This loads the dplyr package, which provides a set of tools for efficiently manipulating datasets in R. dplyr is part of the tidyverse ecosystem and offers functions like filter(), select(), and mutate() that we’ll use throughout this notebook.

  2. library(readr): This loads the readr package, which provides a fast and friendly way to read rectangular data (like CSV files). It’s generally faster than base R’s read.csv() function and automatically handles many common CSV formats.

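  3. deck <- read_csv(url("https://nayelbettache.github.io/documents/STSCI_2120/deck.csv")): This line downloads the CSV file containing the deck of cards data from the course website and stores it in a DataFrame called deck. Because read_csv() comes from readr, the result is a tibble, the tidyverse flavor of a data frame.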

Now, let’s take a look at the first few rows of our deck:

head(deck)

The head() function gives us a quick preview of our DataFrame. It shows the first 6 rows by default. This is always a good first step when working with a new dataset to understand its structure and content.

Our deck DataFrame has three columns:

  • face: The name of the card, stored in lowercase (e.g., “ace”, “two”, “king”)
  • suit: The suit of the card, also in lowercase (“spades”, “clubs”, “diamonds”, “hearts”)
  • value: The numerical value of the card (an ace is 1, a king is 13)

Selecting Values

In R, there are multiple ways to select values from a DataFrame. Understanding these methods is crucial for effective data manipulation. Let’s explore three common approaches:

  1. Using column names:
deck$suit
 [1] "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"  
[10] "spades"   "spades"   "spades"   "spades"   "clubs"    "clubs"    "clubs"    "clubs"    "clubs"   
[19] "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "diamonds"
[28] "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds"
[37] "diamonds" "diamonds" "diamonds" "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"  
[46] "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"  
deck[["face"]]
 [1] "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three" "two"  
[13] "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three"
[25] "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four" 
[37] "three" "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five" 
[49] "four"  "three" "two"   "ace"  

Here’s what’s happening:

  • deck$suit: This uses the $ operator to access the “suit” column of the deck DataFrame. It returns a vector containing all the values in the “suit” column.
  • deck[["face"]]: This uses double square brackets [[]] to access the “face” column. Like $, it returns a vector of all values in the “face” column.

The main difference between these two methods is that [[]] can be used with variables containing column names, while $ cannot. For example:

column_name <- "value"
deck[[column_name]]  # This works
 [1] 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7
[34]  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1
# deck$column_name   # This looks for a column literally named "column_name", so it does not return the value column

  2. Using indices:
deck[1:5, 2]  # First 5 rows, second column

This method uses R’s built-in indexing. The format is dataframe[rows, columns]:

  • 1:5 specifies rows 1 through 5
  • 2 specifies the second column

This approach is very flexible:

  • deck[1:5, ] would select the first 5 rows and all columns
  • deck[, 2] would select all rows of the second column
  • deck[1:5, 1:2] would select the first 5 rows of the first two columns

  3. Using dplyr’s select function:
select(deck, face, suit)

The select() function from dplyr provides a more intuitive and readable way to choose columns:

  • The first argument is the DataFrame
  • Subsequent arguments are the names of columns you want to select

This method is particularly useful when you need to select multiple columns or use more complex selection criteria. For example:

# Select columns that start with "s"
select(deck, starts_with("s"))

# Select all columns except "value"
select(deck, -value)
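
One detail of bracket indexing is worth seeing in isolation. On a base data.frame, selecting a single column with [rows, column] drops the result to a plain vector, whereas a tibble (which is what read_csv() returns) keeps it as a one-column table. The sketch below uses a small hand-made data frame, toy_deck, which is an assumption made for illustration and not the course dataset:

```r
# toy_deck is a hypothetical six-card data frame built only for this example
toy_deck <- data.frame(
  face  = c("king", "queen", "jack", "ten", "nine", "eight"),
  suit  = rep("spades", 6),
  value = 13:8,
  stringsAsFactors = FALSE
)

# Single column from a base data.frame: the result is a character vector
suit_vec <- toy_deck[1:5, 2]

# drop = FALSE keeps the result as a one-column data frame
suit_df <- toy_deck[1:5, 2, drop = FALSE]

# Columns can also be selected by name instead of position
first_two <- toy_deck[1:5, c("face", "suit")]

is.character(suit_vec)
is.data.frame(suit_df)
dim(first_two)
```

Selecting by name, as in the last call, tends to be more robust than selecting by position, because it keeps working if the column order changes.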

Exercise 1: Selecting Values

Now, let’s practice these selection methods:

  1. Select all the faces from the deck.
  2. Select the first 10 rows and all columns of the deck.
  3. Use dplyr to select only the ‘value’ column.
# Your code here
Click to see solution
# 1. Select all the faces
deck$face
 [1] "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three" "two"  
[13] "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three"
[25] "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four" 
[37] "three" "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five" 
[49] "four"  "three" "two"   "ace"  
# 2. Select first 10 rows and all columns
deck[1:10, ]

# 3. Use dplyr to select only the 'value' column
select(deck, value)

Explanation:

  1. We use the $ operator to select the ‘face’ column, which returns a vector of all faces.
  2. We use bracket notation [] with 1:10 to select the first 10 rows, and leave the column part empty to select all columns.
  3. We use dplyr’s select() function to choose only the ‘value’ column. This returns a DataFrame with one column, not a vector.

Deal a Card

In card games, dealing is a fundamental operation. Let’s create a function to simulate dealing a card from our deck:

deal_card <- function(deck) {
  card <- deck[sample(nrow(deck), 1), ]
  return(card)
}

dealt_card <- deal_card(deck)
print(dealt_card)

Let’s break down this deal_card() function:

  1. function(deck): This defines a function that takes one argument, our deck DataFrame.

  2. sample(nrow(deck), 1):

    • nrow(deck) returns the number of rows in the deck
    • sample(n, size), when n is a single positive number, randomly samples size values from 1 to n
    • So this generates one random row number
  3. deck[sample(nrow(deck), 1), ]: This selects the randomly chosen row from the deck. The comma with nothing after it means we select all columns for this row.

  4. The selected row (representing a single card) is assigned to card.

  5. return(card) sends this randomly selected card back as the output of the function.

When we call deal_card(deck), it runs this function and returns a single, randomly selected card from our deck.
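
Because deal_card() relies on sample(), every call gives a different card. When you want a reproducible draw (for example, to debug a game), you can fix the random seed first. Here is a minimal sketch, assuming a small hypothetical toy_deck and an arbitrary seed value of 42:

```r
# toy_deck is a hypothetical six-card deck built only for this example
toy_deck <- expand.grid(
  face = c("king", "queen", "jack"),
  suit = c("spades", "hearts"),
  stringsAsFactors = FALSE
)

deal_card <- function(deck) {
  deck[sample(nrow(deck), 1), ]
}

set.seed(42)                  # the seed value itself is arbitrary
card_a <- deal_card(toy_deck)

set.seed(42)                  # resetting the seed replays the same draw
card_b <- deal_card(toy_deck)

identical(card_a, card_b)
```

Without the calls to set.seed(), the two draws would usually differ.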

Exercise 2: Dealing Cards

Let’s extend our dealing functionality:

  1. Modify the deal_card function to deal multiple cards at once.
  2. Deal a hand of 5 cards and display them.
# Your code here
Click to see solution
# 1. Modify deal_card function
deal_cards <- function(deck, n) {
  cards <- deck[sample(nrow(deck), n), ]
  return(cards)
}

# 2. Deal a hand of 5 cards
hand <- deal_cards(deck, 5)
print(hand)

Explanation:

  1. We modify the function to take an additional argument n, which is the number of cards to deal.
  2. We use sample(nrow(deck), n) to get n random row numbers.
  3. We select these rows from the deck to create our hand of cards.
  4. We then use this new function to deal a hand of 5 cards and display it.

Shuffle the Deck

Shuffling is another crucial operation in card games. In DataFrame terms, this means randomly reordering our rows:

shuffle_deck <- function(deck) {
  shuffled_deck <- deck[sample(nrow(deck)), ]
  rownames(shuffled_deck) <- NULL
  return(shuffled_deck)
}

shuffled_deck <- shuffle_deck(deck)
head(shuffled_deck)

Let’s break down this shuffle_deck() function:

  1. sample(nrow(deck)): This creates a random permutation of the numbers from 1 to the number of rows in the deck. For example, if we have 52 cards, this might return something like [23, 7, 52, 1, 18, ...].

  2. deck[sample(nrow(deck)), ]: This uses the random permutation to reorder all rows of the deck. It’s equivalent to randomly shuffling the cards.

  3. rownames(shuffled_deck) <- NULL: This resets the row names of the shuffled deck. A base data.frame keeps the original row names when its rows are reordered, which can be confusing; setting them to NULL makes R fall back to sequential numbers. A tibble, such as the one returned by read_csv(), does not carry row names, so the line has no visible effect there.

  4. The function returns this shuffled deck.

After shuffling, we use head() to view the first few rows of the shuffled deck, demonstrating that the order has indeed changed.
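
A quick sanity check for any shuffle is that it only changes the order: the shuffled deck should contain exactly the same cards as the original. The sketch below illustrates this on a small hypothetical toy_deck; the helper card_id() is introduced here purely for the comparison:

```r
# toy_deck is a hypothetical six-card deck built only for this example
toy_deck <- expand.grid(
  face = c("king", "queen", "jack"),
  suit = c("spades", "hearts"),
  stringsAsFactors = FALSE
)

shuffle_deck <- function(deck) {
  shuffled <- deck[sample(nrow(deck)), ]
  rownames(shuffled) <- NULL
  shuffled
}

shuffled <- shuffle_deck(toy_deck)

# card_id() is a hypothetical helper: one sorted label per card
card_id <- function(d) sort(paste(d$face, d$suit))

# Same set of cards before and after shuffling
same_cards <- identical(card_id(shuffled), card_id(toy_deck))
same_cards

# Row names were reset to 1, 2, 3, ...
rownames(shuffled)
```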

Exercise 3: Shuffling

Now, let’s combine our shuffling and dealing operations:

  1. Shuffle the deck and deal the top 3 cards.
  2. Create a function that shuffles the deck and deals a specified number of hands with a specified number of cards each.
# Your code here
Click to see solution
# 1. Shuffle and deal top 3 cards
top_3 <- head(shuffle_deck(deck), 3)

# 2. Function to shuffle and deal multiple hands
shuffle_and_deal <- function(deck, num_hands, cards_per_hand) {
  shuffled <- shuffle_deck(deck)
  hands <- list()
  for (i in 1:num_hands) {
    start <- (i - 1) * cards_per_hand + 1
    end <- i * cards_per_hand
    hands[[i]] <- shuffled[start:end, ]
  }
  return(hands)
}

# Example usage:
game_hands <- shuffle_and_deal(deck, 4, 5)  # 4 hands, 5 cards each

Explanation:

  1. We first shuffle the deck using our shuffle_deck() function, then use head() to get the first 3 cards.
  2. For the second part:
    • We create a function that takes the deck, number of hands, and cards per hand as arguments.
    • We shuffle the deck first.
    • We create an empty list to store the hands.
    • We use a for loop to deal the appropriate number of cards to each hand.
    • We use list indexing to add each hand to our list of hands.
    • Finally, we return the list of hands.

Dollar Signs and Double Brackets

In R, we can access DataFrame columns using $ or [[]]. Understanding the differences between these methods is crucial for effective R programming:

# Using $
deck$suit
 [1] "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"   "spades"  
[10] "spades"   "spades"   "spades"   "spades"   "clubs"    "clubs"    "clubs"    "clubs"    "clubs"   
[19] "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "clubs"    "diamonds"
[28] "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds"
[37] "diamonds" "diamonds" "diamonds" "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"  
[46] "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"   "hearts"  
# Using [[]]
deck[["face"]]
 [1] "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three" "two"  
[13] "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four"  "three"
[25] "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five"  "four" 
[37] "three" "two"   "ace"   "king"  "queen" "jack"  "ten"   "nine"  "eight" "seven" "six"   "five" 
[49] "four"  "three" "two"   "ace"  
# Using [[]] with variables
column_name <- "value"
deck[[column_name]]
 [1] 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7
[34]  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1

Let’s break down these methods:

  1. deck$suit:
    • This uses the $ operator to directly access the “suit” column.
    • It’s quick and intuitive, but has limitations.
    • It doesn’t work with variable column names.
    • On base data frames it matches column names partially (for example, deck$val can silently return the value column), which can hide typos in longer scripts and functions.
  2. deck[["face"]]:
    • This uses double square brackets [[]] to access the “face” column.
    • It’s more flexible than the $ operator.
    • It works with variable column names (as shown in the third example).
    • It’s generally safer in function calls and complex operations.
    • It clearly indicates that you’re extracting a single column.
  3. deck[[column_name]]:
    • This demonstrates using [[]] with a variable containing the column name.
    • This flexibility is particularly useful when writing functions that need to work with different columns.

Both $ and [[]] return a vector of values from the specified column. The main difference is in their flexibility and how they behave in certain contexts (like inside functions or with variable column names).
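
To make the difference concrete, here is a small sketch on a base data.frame. The data frame toy and the helper col_mean() are assumptions made for this example and are not part of the notebook’s dataset:

```r
# toy is a hypothetical two-card data frame built only for this example
toy <- data.frame(
  face  = c("king", "queen"),
  suit  = c("spades", "spades"),
  value = c(13, 12),
  stringsAsFactors = FALSE
)

column_name <- "value"

# [[ ]] evaluates the variable and looks up the column called "value"
via_brackets <- toy[[column_name]]

# $ takes the name literally: there is no column called "column_name"
via_dollar <- toy$column_name

# col_mean() is a hypothetical helper: [[ ]] is what makes it work
# for any column name passed in as a string
col_mean <- function(df, col) mean(df[[col]])

via_brackets
is.null(via_dollar)
col_mean(toy, "value")
```

On a base data.frame the failed $ lookup quietly returns NULL, which is why the mistake can go unnoticed.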

Exercise 4: Column Access

Let’s practice using these access methods:

  1. Create a function that takes a column name as an argument and returns the unique values in that column.
# Your code here
Click to see solution
get_unique_values <- function(df, column_name) {
  unique(df[[column_name]])
}

# Example usage:
unique_suits <- get_unique_values(deck, "suit")
print(unique_suits)
[1] "spades"   "clubs"    "diamonds" "hearts"  

Explanation:

  • We define a function get_unique_values that takes two arguments: a DataFrame df and a column_name.
  • Inside the function, we use df[[column_name]] to access the specified column. We can’t use $ here because we need to use the variable column_name.
  • We wrap this in the unique() function, which returns only the unique values from the vector.
  • This function is flexible: it can be used with any DataFrame and any column name.
  • In the example usage, we get the unique suits from our deck.

Modifying Values

Modifying values in a DataFrame is a common operation in data cleaning and transformation. Let’s look at how we can change values in our deck:

# Change the value of the first card to 100
deck$value[1] <- 100

# Change the face of the last card to "Joker"
deck$face[nrow(deck)] <- "Joker"

# View the changes
head(deck, 1)
tail(deck, 1)

Let’s break down what’s happening here:

  1. deck$value[1] <- 100:
    • This accesses the ‘value’ column of the deck using $.
    • [1] selects the first element of this column.
    • We assign the value 100 to this element, changing the value of the first card.
  2. deck$face[nrow(deck)] <- "Joker":
    • Similar to the first operation, but we’re changing the ‘face’ column.
    • nrow(deck) gives us the number of rows in the deck, effectively selecting the last row.
    • We change the face of the last card to “Joker”.
  3. We use head(deck, 1) and tail(deck, 1) to view the first and last rows of the deck, confirming our changes.

This method of direct assignment is straightforward but should be used cautiously. It’s easy to accidentally modify data you didn’t intend to change.
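
One simple safeguard is to keep a copy of the data before modifying it. In R, assigning a data frame to a new name behaves like making a copy: later changes to one do not affect the other. The sketch below assumes a small hypothetical data frame toy:

```r
# toy is a hypothetical three-card data frame built only for this example
toy <- data.frame(
  face  = c("king", "queen", "jack"),
  value = c(13, 12, 11),
  stringsAsFactors = FALSE
)

original <- toy                  # keep an untouched copy

toy$value[1] <- 100              # modify the working version only
toy$face[nrow(toy)] <- "joker"

toy$value[1]                     # the working copy has changed
original$value[1]                # the saved copy has not
original$face[nrow(original)]
```

If a modification turns out to be a mistake, the saved copy lets you start again without reloading the data.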

Exercise 5: Value Modification

Now, let’s try some more complex modifications:

  1. Change all the “jack” cards to have a value of 11.
  2. Add a new column called “color” based on the suit (red for Hearts and Diamonds, black for Clubs and Spades).
# Your code here
Click to see solution
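# 1. Change jack values to 11 (faces are stored in lowercase in this deck)
deck$value[deck$face == "jack"] <- 11

# 2. Add color column (suits are stored in lowercase as well)
deck$color <- ifelse(deck$suit %in% c("hearts", "diamonds"), "Red", "Black")

Explanation:

  1. deck$value[deck$face == "jack"] <- 11:
    • deck$face == "jack" creates a logical vector, TRUE for jack cards, FALSE for others. The comparison is case-sensitive, so the string must match the spelling used in the data.
    • We use this to index deck$value, selecting only the values for jack cards.
    • We assign 11 to these selected values.
  2. deck$color <- ifelse(deck$suit %in% c("hearts", "diamonds"), "Red", "Black"):
    • deck$suit %in% c("hearts", "diamonds") is TRUE for hearts and diamonds and FALSE for clubs and spades.
    • ifelse() returns "Red" where the condition is TRUE and "Black" elsewhere, and the result is stored in a new column called color.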

Changing Values in Place

mutate() is a verb function in dplyr that allows you to add new columns or modify existing ones in a data frame. The basic syntax is:

mutate(data, new_column = expression)

Where:

  • data is the data frame you want to modify.
  • new_column is the name of the new column you want to create or the existing column you want to modify.
  • expression is the operation you want to perform to create or modify the column.

Let’s use dplyr’s mutate function:

deck <- mutate(deck, 
               value = ifelse(face == "king", 13, value),
               value = ifelse(face == "queen", 12, value))

# View the changes (faces are stored in lowercase in this deck)
filter(deck, face %in% c("king", "queen"))

Using mutate to Update the value Column

The code uses the mutate function from the dplyr package to update the value column in the deck dataframe.

The mutate Function

The mutate function takes a dataframe as input and returns a new dataframe with the modified columns. In this case, the mutate function is used to update the value column in the deck dataframe.

The ifelse Function

The ifelse function is a vectorized conditional statement that checks a condition and returns one value if the condition is TRUE and another value if the condition is FALSE.

In the first ifelse statement, the condition is face == "king". Where this condition is TRUE, the value returned is 13; otherwise, the original value in the value column is returned.

In the second ifelse statement, the condition is face == "queen". Where this condition is TRUE, the value returned is 12; otherwise, the current value in the value column is returned (which may already have been updated by the previous ifelse statement).
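
The vectorized behavior of ifelse() is easiest to see on plain vectors, outside of any data frame. The sketch below uses two short hypothetical vectors, face and value, and applies the same two updates one after the other:

```r
# Hypothetical vectors standing in for two columns of a deck
face  <- c("king", "queen", "jack", "ten")
value <- c(0, 0, 11, 10)

# Each call checks every element and returns a vector of the same length
value <- ifelse(face == "king", 13, value)
value <- ifelse(face == "queen", 12, value)

value
```

Only the elements whose condition is TRUE are replaced; all other elements keep the value they had before the call.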

The mutate Statement

The mutate statement is used to update the value column in the deck dataframe. The value column is updated twice, first to assign a value of 13 to the “king” cards, and then to assign a value of 12 to the “queen” cards.

Within a single mutate() call, the expressions are evaluated in order, and each one sees the results of those before it. The second ifelse statement therefore operates on the value column produced by the first. Because a card cannot be both a king and a queen, the two conditions never overlap, and the order of the two updates does not affect the result. Note that this call also resets the first card, a king whose value we changed to 100 earlier, back to 13.

The filter Function

The filter function is used to select a subset of rows from the deck dataframe where the face column is either “king” or “queen”. This allows us to view the changes made to the value column for these specific cards.

Exercise 6: In-Place Modifications

  1. Use mutate to add a new column “is_face_card” that is TRUE for Jack, Queen, and King, and FALSE otherwise.
# Your code here
Click to see solution
deck <- mutate(deck,
               is_face_card = face %in% c("jack", "queen", "king"))

Logical Subsetting

Let’s practice logical subsetting:

# Get all hearts (suits and faces are stored in lowercase in this deck)
hearts <- deck[deck$suit == "hearts", ]

# Get all face cards
face_cards <- deck[deck$face %in% c("jack", "queen", "king"), ]

# View results
head(hearts)
head(face_cards)

The code uses logical subsetting to extract specific rows from the deck dataframe based on certain conditions.

Getting All Hearts

The first line of code selects all rows where the suit column is “hearts”, using deck[deck$suit == "hearts", ].

Here’s what’s happening:

  • deck$suit == "hearts" is a logical expression that checks, row by row, whether the value in the suit column is equal to “hearts”. It returns a vector of TRUE and FALSE values, where TRUE indicates that the row has a suit of “hearts”. The comparison is case-sensitive, so the string must match the lowercase spelling used in the data.
  • The square brackets [] are used to subset the deck dataframe based on this logical vector. Placing it before the comma selects rows; with no comma at all, a single index would be interpreted as a column selection instead.
  • The resulting subset of rows is assigned to a new dataframe called hearts.

Getting All Face Cards

The second line of code selects all rows where the face column is “jack”, “queen”, or “king”, using deck[deck$face %in% c("jack", "queen", "king"), ].

Here’s what’s happening:

  • deck$face %in% c("jack", "queen", "king") is a logical expression that checks whether the value in the face column is one of the values in the vector c("jack", "queen", "king"). It returns a vector of TRUE and FALSE values, where TRUE indicates that the row has a face that is one of the specified values.
  • The rest of the syntax is the same as before: the square brackets [] are used to subset the deck dataframe based on this logical vector, and the resulting subset of rows is assigned to a new dataframe called face_cards.

Viewing the Results

The final two lines of code use the head() function to view the first few rows of the hearts and face_cards dataframes.

This allows us to verify that the subsetting worked correctly and see the resulting dataframes.
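
It can help to look at the logical vector itself before using it as an index. The sketch below rebuilds a 52-card deck by hand (toy_deck is an assumption that mirrors the lowercase faces and suits printed earlier, not the downloaded file) and counts matches with sum(), which treats TRUE as 1:

```r
# toy_deck is a hypothetical stand-in for the course deck
suits <- c("spades", "clubs", "diamonds", "hearts")
faces <- c("king", "queen", "jack", "ten", "nine", "eight", "seven",
           "six", "five", "four", "three", "two", "ace")
toy_deck <- data.frame(
  face  = rep(faces, times = 4),
  suit  = rep(suits, each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row; sum() counts the TRUE values
is_heart <- toy_deck$suit == "hearts"
sum(is_heart)

hearts <- toy_deck[is_heart, ]

# Conditions can be combined with & (and) or | (or)
is_red  <- toy_deck$suit %in% c("hearts", "diamonds")
is_face <- toy_deck$face %in% c("jack", "queen", "king")
red_face_cards <- toy_deck[is_red & is_face, ]
nrow(red_face_cards)
```

Counting the TRUE values first is a cheap way to catch a condition that matches nothing, for example because of a capitalization mismatch.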

Exercise 7: Logical Subsetting

  1. Create a subset of the deck containing only cards with values greater than 10.
  2. Create a subset of red cards (Hearts and Diamonds) with odd values.
# Your code here
Click to see solution
# 1. Cards with values > 10
high_cards <- deck[deck$value > 10, ]

# 2. Red cards with odd values (suits are stored in lowercase in this deck)
red_odd_cards <- deck[deck$suit %in% c("hearts", "diamonds") & deck$value %% 2 == 1, ]

Missing Information

Let’s introduce and handle missing values:

# Introduce some NA values
deck$value[sample(nrow(deck), 5)] <- NA

# Count NA values
sum(is.na(deck$value))
[1] 5
# Remove rows with NA values
deck_clean <- na.omit(deck)

# View results
sum(is.na(deck_clean$value))
[1] 0

The first line of code introduces some missing values (NA) into the value column of the deck dataframe.

Here’s what’s happening:

  • sample(nrow(deck), 5) generates a random sample of 5 row indices from the deck dataframe.
  • deck$value[…] selects the corresponding values in the value column.
  • <- NA assigns NA values to these selected positions.

Counting NA Values

The next line of code counts the number of NA values in the value column.

Here’s what’s happening:

  • is.na(deck$value) checks which values in the value column are NA, returning a logical vector (TRUE for NA, FALSE otherwise).
  • sum(…) sums up the number of TRUE values in this logical vector, effectively counting the number of NA values.

Removing Rows with NA Values

The next line of code creates a new dataframe, deck_clean, by removing rows with NA values from the deck dataframe.

Here’s what’s happening:

  • na.omit(deck) removes rows with NA values from the deck dataframe. By default, na.omit() removes rows with NA values in any column.
  • The resulting dataframe is assigned to a new variable, deck_clean.

Verifying the Results

The final line of code verifies that the NA values have been removed from the deck_clean dataframe.

This should return 0, indicating that there are no NA values in the value column of deck_clean.
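
Missing values also affect calculations, not just row counts. The sketch below uses a short hypothetical vector, values, to show how NA propagates through mean() unless na.rm = TRUE is set, and how complete.cases() gives the same row filter as na.omit():

```r
# Hypothetical values with two missing entries
values <- c(13, NA, 11, 10, NA)

mean(values)                 # NA: a single missing value makes the mean unknown
mean(values, na.rm = TRUE)   # mean of the three known values

# toy is a hypothetical data frame built only for this example
toy <- data.frame(
  face  = c("king", "queen", "jack", "ten", "nine"),
  value = values,
  stringsAsFactors = FALSE
)

# complete.cases() flags rows with no NA in any column
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
nrow(na.omit(toy))
```

Whether to drop rows or to ignore NA values inside a calculation depends on the analysis; dropping rows discards the other columns of those rows as well.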

Exercise 8: Handling Missing Data

  1. Replace all NA values in the ‘value’ column with the mean value of the non-NA entries.
# Your code here
Click to see solution
deck$value[is.na(deck$value)] <- mean(deck$value, na.rm = TRUE)

Environments

An environment in R is a self-contained space where you can store and manage variables, functions, and other objects. Environments are useful for organizing your code and data, and for avoiding naming conflicts. Let’s explore environments:

# Create a new environment
card_env <- new.env()

# Assign a variable to the new environment
card_env$ace_value <- 1

# Access the variable
card_env$ace_value
[1] 1

Creating a New Environment

To create a new environment, you can use the new.env() function.

This creates a new, empty environment and assigns it to the variable card_env.

Assigning a Variable to the New Environment

To assign a variable to the new environment, you can use the $ operator.

This assigns the value 1 to the variable ace_value in the card_env environment.

Accessing the Variable

To access the variable, you can use the $ operator again. This returns the value of the ace_value variable in the card_env environment, which is 1.

Environment Details

Here are some additional details about the card_env environment.

The ls() function lists the objects in the environment, which in this case is just the ace_value variable. The typeof() and class() functions return the type and class of the environment, respectively.
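
The calls described above can be sketched as follows. The last part goes slightly beyond the text: it illustrates that environments are modified by reference, which is what makes them convenient for tracking game state. The name alias is hypothetical and used only for this illustration:

```r
card_env <- new.env()
card_env$ace_value <- 1

ls(card_env)       # names of the objects stored in the environment
typeof(card_env)
class(card_env)

# A second name for the same environment, not a copy of it
alias <- card_env
alias$ace_value <- 11

# The change is visible through the original name as well
card_env$ace_value
```

This is different from data frames and lists, where assigning to a new name and then modifying it leaves the original untouched.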

Working with Environments

Let’s use environments to manage game state:

game_env <- new.env()

with(game_env, {
  deck <- shuffle_deck(deck)
  player_hand <- deal_card(deck)
})

# Access variables from the environment
game_env$player_hand

Exercise 9: Environment Usage

  1. Create a function that initializes a game environment with a shuffled deck and dealt hands for a specified number of players.
# Your code here
Click to see solution
initialize_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
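    for (i in seq_len(num_players)) {
      # Deal from the top of the shuffled deck, then remove those cards,
      # so that no card can be dealt to two players
      players[[paste0("Player", i)]] <- head(deck, cards_per_hand)
      deck <- deck[-(1:cards_per_hand), ]
    }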
  })
  
  return(game_env)
}

# Usage:
game <- initialize_game(4, 5)
game$players$Player1

Scoping Rules

Let’s explore R’s lexical scoping:

outer_var <- 10

example_function <- function() {
  inner_var <- 5
  outer_var + inner_var
}

example_function()  # Returns 15
[1] 15
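
Lexical scoping also means that a function remembers the environment in which it was created. This is the basis of closures, which the final exercise relies on. Here is a minimal sketch; make_counter() is a hypothetical helper introduced only for this illustration:

```r
# make_counter() is a hypothetical helper: it returns a function that
# remembers its own private `count` variable
make_counter <- function() {
  count <- 0
  function() {
    # <<- modifies `count` in the enclosing environment
    # instead of creating a new local variable
    count <<- count + 1
    count
  }
}

draws <- make_counter()
first  <- draws()
second <- draws()

# A second counter gets its own, independent `count`
other <- make_counter()
other_first <- other()

c(first, second, other_first)
```

Each call to make_counter() creates a fresh environment, so the two counters do not interfere with each other.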

Assignment

Let’s practice different assignment methods:

# Using <-
x <- 5

# Using =
y = 10

# Using ->
15 -> z

# View results
c(x, y, z)
[1]  5 10 15
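
At the top level, <- and = behave the same way, but inside a function call they differ: = names an argument, while <- performs an assignment in the calling environment. The sketch below shows this with head(); the variable n it creates is a side effect shown for illustration, not a recommended style:

```r
# `=` inside a call names the argument n; no variable is created by this line
first_two <- head(letters, n = 2)

# `<-` inside a call creates a variable n in the calling environment
# and passes its value (3) as the second argument
first_three <- head(letters, n <- 3)

first_two
first_three
n
```

This difference is one reason many style guides suggest <- for assignment and = only for naming function arguments.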

Exercise 10: Assignment Practice

  1. Create a function that takes a deck as input and returns a list with two elements: the red cards and the black cards. Use different assignment operators for each.
# Your code here

Final Exercise: Putting It All Together

Create a card game simulation that uses environments, closures, and DataFrames. The game should:

  1. Initialize a shuffled deck
  2. Deal hands to players
  3. Allow players to draw and discard cards
  4. Keep track of the game state using an environment
# Your code here
Click to see solution
create_card_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
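    for (i in seq_len(num_players)) {
      # Deal consecutive blocks from the top of the shuffled deck,
      # so that hands never overlap and match the cards removed below
      rows <- ((i - 1) * cards_per_hand + 1):(i * cards_per_hand)
      players[[paste0("Player", i)]] <- deck[rows, ]
    }
    deck <- deck[-(1:(num_players * cards_per_hand)), ]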
    
    turn <- 1
    
    draw_card <- function(player_index) {
      if (nrow(deck) == 0) {
        stop("No more cards in the deck!")
      }
      card <- deck[1, ]
      deck <<- deck[-1, ]
      players[[player_index]] <<- rbind(players[[player_index]], card)
      return(card)
    }
    
    discard_card <- function(player_index, card_index) {
      discarded <- players[[player_index]][card_index, ]
      players[[player_index]] <<- players[[player_index]][-card_index, ]
      return(discarded)
    }
    
    next_turn <- function() {
      turn <<- (turn %% num_players) + 1
    }
  })
  
  return(game_env)
}

# Usage:
game <- create_card_game(4, 5)
game$draw_card(1)
game$discard_card(1, 3)
game$next_turn()
---
title: "Working with DataFrames in R: A Comprehensive Guide"
author: Nayel Bettache
output:
  html_notebook:
    code_folding: hide
---

# R Notebook: Working with DataFrames

## Introduction

Welcome to this comprehensive lecture on working with DataFrames in R! DataFrames are one of the most fundamental and versatile data structures in R, making them essential for any data analysis task. In this session, we'll explore various operations and concepts related to DataFrames using a deck of cards as our primary example.

We've chosen a deck of cards as our dataset because it provides a relatable and intuitive structure that can demonstrate many DataFrame operations effectively. Each card in our deck will be represented as a row in our DataFrame, with attributes like face, suit, and value as columns.

Let's begin by loading the necessary libraries and our deck of cards:

```{r setup, message=FALSE, echo = TRUE, results = "hide"}
library(dplyr)
library(readr)

# Load the deck of cards
deck <- read_csv(url("https://nayelbettache.github.io/documents/STSCI_2120/deck.csv"))
```

Let's break down what's happening here:

1.  `library(dplyr)`: This loads the dplyr package, which provides a set of tools for efficiently manipulating datasets in R. dplyr is part of the tidyverse ecosystem and offers functions like `filter()`, `select()`, and `mutate()` that we'll use throughout this notebook.

2.  `library(readr)`: This loads the readr package, which provides a fast and friendly way to read rectangular data (like CSV files). It's generally faster than base R's read.csv() function and automatically handles many common CSV formats.

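3.  `deck <- read_csv(url("https://nayelbettache.github.io/documents/STSCI_2120/deck.csv"))`: This line downloads the CSV file containing the deck of cards data from the course website and stores it in a DataFrame called `deck`. Because `read_csv()` comes from readr, the result is a tibble, the tidyverse flavor of a data frame.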

Now, let's take a look at the first few rows of our deck:

```{r}
head(deck)
```

The `head()` function gives us a quick preview of our DataFrame. It shows the first 6 rows by default. This is always a good first step when working with a new dataset to understand its structure and content.

Our deck DataFrame has three columns:

-   `face`: The name of the card, stored in lowercase (e.g., "ace", "two", "king")
-   `suit`: The suit of the card, also in lowercase ("spades", "clubs", "diamonds", "hearts")
-   `value`: The numerical value of the card (an ace is 1, a king is 13)

## Selecting Values

In R, there are multiple ways to select values from a DataFrame. Understanding these methods is crucial for effective data manipulation. Let's explore three common approaches:

1.  Using column names:

```{r}
deck$suit
```

```{r}
deck[["face"]]
```

Here's what's happening:

-   `deck$suit`: This uses the `$` operator to access the "suit" column of the deck DataFrame. It returns a vector containing all the values in the "suit" column.
-   `deck[["face"]]`: This uses double square brackets `[[]]` to access the "face" column. Like `$`, it returns a vector of all values in the "face" column.

The main difference between these two methods is that `[[]]` can be used with variables containing column names, while `$` cannot. For example:

```{r}
column_name <- "value"
deck[[column_name]]  # This works
# deck$column_name   # This would not work as expected
```

2.  Using indices:

```{r}
deck[1:5, 2]  # First 5 rows, second column
```

This method uses R's built-in indexing. The format is `dataframe[rows, columns]`:

-   `1:5` specifies rows 1 through 5
-   `2` specifies the second column

This approach is very flexible:

-   `deck[1:5, ]` would select the first 5 rows and all columns
-   `deck[, 2]` would select all rows of the second column
-   `deck[1:5, 1:2]` would select the first 5 rows of the first two columns

3.  Using dplyr's select function:

```{r echo = T, results = 'hide'}
select(deck, face, suit)
```

The `select()` function from dplyr provides a more intuitive and readable way to choose columns:

-   The first argument is the DataFrame
-   Subsequent arguments are the names of columns you want to select

This method is particularly useful when you need to select multiple columns or use more complex selection criteria. For example:

```{r}
# Select columns that start with "s"
select(deck, starts_with("s"))

# Select all columns except "value"
select(deck, -value)
```

### Exercise 1: Selecting Values

Now, let's practice these selection methods:

1.  Select all the faces from the deck.
2.  Select the first 10 rows and all columns of the deck.
3.  Use dplyr to select only the 'value' column.

```{r}
# Your code here
```

<details>

<summary>Click to see solution</summary>

```{r}
# 1. Select all the faces
deck$face

# 2. Select first 10 rows and all columns
deck[1:10, ]

# 3. Use dplyr to select only the 'value' column
select(deck, value)
```

Explanation:

1.  We use the `$` operator to select the 'face' column, which returns a vector of all faces.
2.  We use bracket notation `[]` with `1:10` to select the first 10 rows, and leave the column part empty to select all columns.
3.  We use dplyr's `select()` function to choose only the 'value' column. This returns a DataFrame with one column, not a vector.

</details>

## Deal a Card

In card games, dealing is a fundamental operation. Let's create a function to simulate dealing a card from our deck:

```{r}
deal_card <- function(deck) {
  card <- deck[sample(nrow(deck), 1), ]
  return(card)
}

dealt_card <- deal_card(deck)
print(dealt_card)
```

Let's break down this `deal_card()` function:

1.  `function(deck)`: This defines a function that takes one argument, our deck DataFrame.

2.  `sample(nrow(deck), 1)`:

    -   `nrow(deck)` returns the number of rows in the deck
    -   `sample(x, size)`, when `x` is a single positive number, randomly draws `size` distinct numbers from 1 to `x`
    -   So this generates one random row number

3.  `deck[sample(nrow(deck), 1), ]`: This selects the randomly chosen row from the deck. The comma with nothing after it means we select all columns for this row.

4.  The selected row (representing a single card) is assigned to `card`.

5.  `return(card)` sends this randomly selected card back as the output of the function.

When we call `deal_card(deck)`, it runs this function and returns a single, randomly selected card from our deck.
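
Because the deal relies on `sample()`, each call normally returns a different card. The sketch below shows how `set.seed()` can make a random deal repeatable, which helps when debugging. The `toy_deck` data and the `deal_one()` helper are hypothetical stand-ins for the lecture's `deck` and `deal_card()`, and the seed value is arbitrary.

```{r}
# Hypothetical toy data, used only for illustration
toy_deck <- data.frame(
  face = c("Ace", "Two", "Three", "Four"),
  suit = "Spades",
  stringsAsFactors = FALSE
)

# Same logic as a single-card deal: pick one random row
deal_one <- function(cards) {
  cards[sample(nrow(cards), 1), ]
}

set.seed(42)                       # fix the random number generator's state
first_draw <- deal_one(toy_deck)

set.seed(42)                       # same seed, so the same row is chosen again
second_draw <- deal_one(toy_deck)

identical(first_draw, second_draw)
```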

### Exercise 2: Dealing Cards

Let's extend our dealing functionality:

1.  Modify the `deal_card` function to deal multiple cards at once.
2.  Deal a hand of 5 cards and display them.

```{r}
# Your code here
```

<details>

<summary>Click to see solution</summary>

```{r}
# 1. Modify deal_card function
deal_cards <- function(deck, n) {
  cards <- deck[sample(nrow(deck), n), ]
  return(cards)
}

# 2. Deal a hand of 5 cards
hand <- deal_cards(deck, 5)
print(hand)
```

Explanation:

1.  We modify the function to take an additional argument `n`, which is the number of cards to deal.
2.  We use `sample(nrow(deck), n)` to get `n` random row numbers.
3.  We select these rows from the deck to create our hand of cards.
4.  We then use this new function to deal a hand of 5 cards and display it.

</details>

## Shuffle the Deck

Shuffling is another crucial operation in card games. In DataFrame terms, this means randomly reordering our rows:

```{r}
shuffle_deck <- function(deck) {
  shuffled_deck <- deck[sample(nrow(deck)), ]
  rownames(shuffled_deck) <- NULL
  return(shuffled_deck)
}

shuffled_deck <- shuffle_deck(deck)
head(shuffled_deck)
```

Let's break down this `shuffle_deck()` function:

1.  `sample(nrow(deck))`: This creates a random permutation of the numbers from 1 to the number of rows in the deck. For example, if we have 52 cards, this might return something like `[23, 7, 52, 1, 18, ...]`.

2.  `deck[sample(nrow(deck)), ]`: This uses the random permutation to reorder all rows of the deck. It's equivalent to randomly shuffling the cards.

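3.  `rownames(shuffled_deck) <- NULL`: This resets the row names of the shuffled deck. When we reorder the rows of a base data.frame, R keeps the original row names, which can be confusing; setting them to NULL makes R use sequential numbers again. Because `read_csv()` returns a tibble, which always numbers its rows sequentially, this line has no visible effect on our `deck`, but it keeps the function safe to use with base data.frames.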

4.  The function returns this shuffled deck.

After shuffling, we use `head()` to view the first few rows of the shuffled deck, demonstrating that the order has indeed changed.
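
To see why indexing with `sample(n)` shuffles without losing or duplicating anything, the following sketch checks that the result is a true permutation. The deck size, the seed, and the `toy_faces` vector are hypothetical values chosen for illustration.

```{r}
set.seed(1)                      # arbitrary seed so the sketch is repeatable
n_cards <- 10                    # hypothetical small deck size
permutation <- sample(n_cards)

# Every index from 1 to n_cards appears exactly once
identical(sort(permutation), seq_len(n_cards))

# Applying the permutation to a vector reorders it without losing anything
toy_faces <- LETTERS[seq_len(n_cards)]
shuffled_faces <- toy_faces[permutation]
setequal(shuffled_faces, toy_faces)
```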

### Exercise 3: Shuffling

Now, let's combine our shuffling and dealing operations:

1.  Shuffle the deck and deal the top 3 cards.
2.  Create a function that shuffles the deck and deals a specified number of hands with a specified number of cards each.

```{r}
# Your code here
```

<details>

<summary>Click to see solution</summary>

```{r}
# 1. Shuffle and deal top 3 cards
top_3 <- head(shuffle_deck(deck), 3)

# 2. Function to shuffle and deal multiple hands
shuffle_and_deal <- function(deck, num_hands, cards_per_hand) {
  shuffled <- shuffle_deck(deck)
  hands <- list()
  for (i in seq_len(num_hands)) {  # seq_len() also behaves correctly when num_hands is 0
    start <- (i - 1) * cards_per_hand + 1
    end <- i * cards_per_hand
    hands[[i]] <- shuffled[start:end, ]
  }
  return(hands)
}

# Example usage:
game_hands <- shuffle_and_deal(deck, 4, 5)  # 4 hands, 5 cards each
```

Explanation:

1.  We first shuffle the deck using our `shuffle_deck()` function, then use `head()` to get the first 3 cards.
2.  For the second part:
    -   We create a function that takes the deck, number of hands, and cards per hand as arguments.
    -   We shuffle the deck first.
    -   We create an empty list to store the hands.
    -   We use a for loop to deal the appropriate number of cards to each hand.
    -   We use list indexing to add each hand to our list of hands.
    -   Finally, we return the list of hands.

</details>

## Dollar Signs and Double Brackets

In R, we can access DataFrame columns using `$` or `[[]]`. Understanding the differences between these methods is crucial for effective R programming:

```{r}
# Using $
deck$suit

# Using [[]]
deck[["face"]]

# Using [[]] with variables
column_name <- "value"
deck[[column_name]]
```

Let's break down these methods:

1.  `deck$suit`:
    -   This uses the `$` operator to directly access the "suit" column.
    -   It's quick and intuitive, but has limitations.
    -   It doesn't work with variable column names.
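    -   On a base data.frame it silently accepts partial names (`df$val` can match a column called `value`), which can hide typos; a tibble instead returns `NULL` with a warning when the name does not match exactly.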
2.  `deck[["face"]]`:
    -   This uses double square brackets `[[]]` to access the "face" column.
    -   It's more flexible than the `$` operator.
    -   It works with variable column names (as shown in the third example).
    -   It's generally safer in function calls and complex operations.
    -   It clearly indicates that you're extracting a single column.
3.  `deck[[column_name]]`:
    -   This demonstrates using `[[]]` with a variable containing the column name.
    -   This flexibility is particularly useful when writing functions that need to work with different columns.

Both `$` and `[[]]` return a vector of values from the specified column. The main difference is in their flexibility and how they behave in certain contexts (like inside functions or with variable column names).
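
One concrete difference in behavior is partial name matching. The sketch below uses a hypothetical base data.frame called `toy_deck` (not a tibble, where `$` behaves more strictly) to show that `$` accepts an abbreviated column name while `[[ ]]` requires the exact name.

```{r}
# Hypothetical toy data in a base data.frame (not a tibble)
toy_deck <- data.frame(value = c(1, 2, 3))

partial_dollar <- toy_deck$val        # `$` partially matches "val" to "value"
exact_brackets <- toy_deck[["val"]]   # `[[ ]]` needs the exact name, so this is NULL

partial_dollar
is.null(exact_brackets)
```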

### Exercise 4: Column Access

Let's practice using these access methods:

1.  Create a function that takes a column name as an argument and returns the unique values in that column.

```{r}
# Your code here
```

<details>

<summary>Click to see solution</summary>

```{r}
get_unique_values <- function(df, column_name) {
  unique(df[[column_name]])
}

# Example usage:
unique_suits <- get_unique_values(deck, "suit")
print(unique_suits)
```

Explanation:

-   We define a function `get_unique_values` that takes two arguments: a DataFrame `df` and a `column_name`.
-   Inside the function, we use `df[[column_name]]` to access the specified column. We can't use `$` here because we need to use the variable `column_name`.
-   We wrap this in the `unique()` function, which returns only the unique values from the vector.
-   This function is flexible: it can be used with any DataFrame and any column name.
-   In the example usage, we get the unique suits from our deck.

</details>

## Modifying Values

Modifying values in a DataFrame is a common operation in data cleaning and transformation. Let's look at how we can change values in our deck:

```{r}
# Change the value of the first card to 100
deck$value[1] <- 100

# Change the face of the last card to "Joker"
deck$face[nrow(deck)] <- "Joker"

# View the changes
head(deck, 1)
tail(deck, 1)
```

Let's break down what's happening here:

1.  `deck$value[1] <- 100`:
    -   This accesses the 'value' column of the deck using `$`.
    -   `[1]` selects the first element of this column.
    -   We assign the value 100 to this element, changing the value of the first card.
2.  `deck$face[nrow(deck)] <- "Joker"`:
    -   Similar to the first operation, but we're changing the 'face' column.
    -   `nrow(deck)` gives us the number of rows in the deck, effectively selecting the last row.
    -   We change the face of the last card to "Joker".
3.  We use `head(deck, 1)` and `tail(deck, 1)` to view the first and last rows of the deck, confirming our changes.

This method of direct assignment is straightforward but should be used cautiously. It's easy to accidentally modify data you didn't intend to change.
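
One cautious habit is to keep a copy before modifying. The sketch below relies on R's copy-on-modify behavior for data frames: assigning a data frame to a second name and then editing one of them leaves the other untouched. The `toy_deck` and `toy_backup` names are hypothetical.

```{r}
# Hypothetical toy data, used only for illustration
toy_deck <- data.frame(
  face  = c("Ace", "Two", "Three"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep a backup, then modify the working copy
toy_backup <- toy_deck
toy_deck$value[1] <- 100

toy_deck$value     # modified
toy_backup$value   # still the original values
```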

### Exercise 5: Value Modification

Now, let's try some more complex modifications:

1.  Change all the "Jack" cards to have a value of 11.
2.  Add a new column called "color" based on the suit (red for Hearts and Diamonds, black for Clubs and Spades).

```{r}
# Your code here
```

<details>

<summary>Click to see solution</summary>

```{r}
# 1. Change Jack values to 11
deck$value[deck$face == "Jack"] <- 11

# 2. Add color column
deck$color <- ifelse(deck$suit %in% c("Hearts", "Diamonds"), "Red", "Black")
```

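Explanation:

1.  `deck$value[deck$face == "Jack"] <- 11`:
    -   `deck$face == "Jack"` creates a logical vector, TRUE for Jack cards, FALSE for others.
    -   We use this to index `deck$value`, selecting only the values for Jack cards.
    -   We assign 11 to these selected values.
2.  `deck$color <- ifelse(deck$suit %in% c("Hearts", "Diamonds"), "Red", "Black")`:
    -   `deck$suit %in% c("Hearts", "Diamonds")` is TRUE for the two red suits and FALSE for the others.
    -   `ifelse()` returns "Red" where the condition is TRUE and "Black" elsewhere.
    -   Assigning the result to `deck$color` creates the new column.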

</details>

## Changing Values in Place

`mutate()` is a dplyr verb that adds new columns to a data frame or modifies existing ones. The basic syntax is:

```{r eval = FALSE}
# Template only: `data`, `new_column`, and `expression` are placeholders
mutate(data, new_column = expression)
```

Where:

- `data` is the data frame you want to modify.
- `new_column` is the name of the new column you want to create or the existing column you want to modify.
- `expression` is the operation you want to perform to create or modify the column.

Let's use dplyr's `mutate` function:

```{r}
deck <- mutate(deck, 
               value = ifelse(face == "King", 13, value),
               value = ifelse(face == "Queen", 12, value))

# View the changes
filter(deck, face %in% c("King", "Queen"))
```

##### Using mutate to Update the value Column

The code uses the `mutate()` function from the dplyr package to update the `value` column in the `deck` dataframe.

##### The mutate Function

`mutate()` takes a dataframe as input and returns a new dataframe with the modified columns. It does not change `deck` by itself, which is why the result is assigned back to `deck`.

##### The ifelse Function

`ifelse()` is a vectorized conditional: it checks a condition for every element and returns one value where the condition is TRUE and another where it is FALSE.

In the first `ifelse()` call, the condition is `face == "King"`. Where it is TRUE, the result is 13; otherwise, the original value in the `value` column is kept.

In the second `ifelse()` call, the condition is `face == "Queen"`. Where it is TRUE, the result is 12; otherwise, the current value in the `value` column is kept (which may already have been updated by the first call).

##### The mutate Statement

The `value` column is updated twice within one `mutate()` call: first to assign 13 to the "King" cards, and then to assign 12 to the "Queen" cards.

One subtlety is that `mutate()` evaluates its arguments in order, so the second `ifelse()` sees the `value` column as already modified by the first. Because no card is both a King and a Queen, the two updates never touch the same row, and the order does not affect the result here.

##### The filter Function

`filter()` selects the rows of `deck` where the `face` column is either "King" or "Queen". This lets us view the changes made to the `value` column for these specific cards.
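
When several conditions update the same column, chained `ifelse()` calls can be replaced by dplyr's `case_when()`. The sketch below is one possible alternative rather than the method used above; it assumes dplyr is installed and works on a hypothetical `toy_deck` so that the lecture's `deck` is left alone.

```{r}
library(dplyr)

# Hypothetical toy data, used only for illustration
toy_deck <- data.frame(
  face  = c("King", "Queen", "Jack", "Ace"),
  value = c(0, 0, 11, 1),
  stringsAsFactors = FALSE
)

toy_deck <- mutate(
  toy_deck,
  value = case_when(
    face == "King"  ~ 13,
    face == "Queen" ~ 12,
    TRUE            ~ value   # every other card keeps its current value
  )
)

toy_deck
```

Conditions are checked top to bottom and the first match wins, so the final `TRUE` line acts as the fallback.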


### Exercise 6: In-Place Modifications

1. Use `mutate` to add a new column "is_face_card" that is TRUE for Jack, Queen, and King, and FALSE otherwise.

```{r}
# Your code here
```

<details>
<summary>Click to see solution</summary>

```{r}
deck <- mutate(deck,
               is_face_card = face %in% c("Jack", "Queen", "King"))
```
</details>

## Logical Subsetting

Let's practice logical subsetting:

```{r}
# Get all Hearts
hearts <- deck[deck$suit == "Hearts", ]

# Get all face cards
face_cards <- deck[deck$face %in% c("Jack", "Queen", "King"), ]

# View results
head(hearts)
head(face_cards)
```

The code uses logical subsetting to extract specific rows from the `deck` dataframe based on certain conditions.

##### Getting All Hearts

The first line, `hearts <- deck[deck$suit == "Hearts", ]`, gets all rows where the `suit` column is "Hearts".

Here's what's happening:

- `deck$suit == "Hearts"` is a logical expression that checks if the value in the `suit` column is equal to "Hearts". This returns a vector of TRUE and FALSE values, where TRUE indicates that the row has a suit of "Hearts".
- The square brackets `[]` subset the `deck` dataframe based on this logical vector. Because the vector comes before the comma, it selects rows, and the empty position after the comma keeps all columns (without the comma, R would interpret the vector as a column selection).
- The resulting subset of rows is assigned to a new dataframe called `hearts`.

##### Getting All Face Cards

The second line, `face_cards <- deck[deck$face %in% c("Jack", "Queen", "King"), ]`, gets all rows where the `face` column is "Jack", "Queen", or "King".

Here's what's happening:

- `deck$face %in% c("Jack", "Queen", "King")` is a logical expression that checks if the value in the `face` column is one of the values in the vector `c("Jack", "Queen", "King")`. This returns a vector of TRUE and FALSE values, where TRUE indicates that the row has one of the specified faces.
- The rest of the syntax is the same as before: the square brackets `[]` subset the `deck` dataframe based on this logical vector, and the resulting subset of rows is assigned to a new dataframe called `face_cards`.

##### Viewing the Results

The final two lines, `head(hearts)` and `head(face_cards)`, show the first few rows of the `hearts` and `face_cards` dataframes. This allows us to verify that the subsetting worked correctly and see the resulting dataframes.
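
Logical vectors can also be stored, counted, and combined before they are used for subsetting. The following sketch illustrates this on a hypothetical four-card `toy_deck`; the variable names are illustrative only.

```{r}
# Hypothetical toy data, used only for illustration
toy_deck <- data.frame(
  face  = c("King", "Two", "Queen", "Seven"),
  suit  = c("Hearts", "Hearts", "Spades", "Clubs"),
  value = c(13, 2, 12, 7),
  stringsAsFactors = FALSE
)

is_heart <- toy_deck$suit == "Hearts"   # one TRUE/FALSE per row
is_high  <- toy_deck$value > 10

sum(is_heart)                           # TRUE counts as 1, so this counts Hearts

high_hearts   <- toy_deck[is_heart & is_high, ]   # both conditions must hold (AND)
heart_or_high <- toy_deck[is_heart | is_high, ]   # either condition may hold (OR)

high_hearts
heart_or_high
```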

### Exercise 7: Logical Subsetting

1. Create a subset of the deck containing only cards with values greater than 10.
2. Create a subset of red cards (Hearts and Diamonds) with odd values.

```{r}
# Your code here
```

<details>
<summary>Click to see solution</summary>

```{r}
# 1. Cards with values > 10
high_value_cards <- deck[deck$value > 10, ]

# 2. Red cards with odd values
red_odd_cards <- deck[deck$suit %in% c("Hearts", "Diamonds") & deck$value %% 2 == 1, ]
```
</details>

## Missing Information

Let's introduce and handle missing values:

```{r}
# Introduce some NA values
deck$value[sample(nrow(deck), 5)] <- NA

# Count NA values
sum(is.na(deck$value))

# Remove rows with NA values
deck_clean <- na.omit(deck)

# View results
sum(is.na(deck_clean$value))
```

The first line of code, `deck$value[sample(nrow(deck), 5)] <- NA`, introduces some missing values (NA) into the `value` column of the `deck` dataframe.

Here's what's happening:

- `sample(nrow(deck), 5)` generates a random sample of 5 row indices from the `deck` dataframe.
- `deck$value[...]` selects the corresponding values in the `value` column.
- `<- NA` assigns NA values to these selected positions.

##### Counting NA Values

The next line of code, `sum(is.na(deck$value))`, counts the number of NA values in the `value` column.

Here's what's happening:

- `is.na(deck$value)` checks which values in the `value` column are NA, returning a logical vector (TRUE for NA, FALSE otherwise).
- `sum(...)` adds up the TRUE values in this logical vector, effectively counting the number of NA values.

##### Removing Rows with NA Values

The next line of code, `deck_clean <- na.omit(deck)`, creates a new dataframe, `deck_clean`, by removing rows with NA values from the `deck` dataframe.

Here's what's happening:

- `na.omit(deck)` removes every row that has an NA in any column, not only in `value`.
- The resulting dataframe is assigned to a new variable, `deck_clean`.

##### Verifying the Results

The final line of code, `sum(is.na(deck_clean$value))`, verifies that the NA values have been removed from the `deck_clean` dataframe. This should return 0, indicating that there are no NA values in the `value` column of `deck_clean`.
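
Since `na.omit()` looks at every column, it can drop more rows than intended. The sketch below contrasts it with filtering on a single column, and also shows how NA propagates through `mean()`. The `toy_deck` data, including its `note` column, is hypothetical.

```{r}
# Hypothetical toy data with NAs in two different columns
toy_deck <- data.frame(
  face  = c("Ace", "Two", "Three", "Four"),
  value = c(1, NA, 3, NA),
  note  = c(NA, "ok", "ok", "ok"),
  stringsAsFactors = FALSE
)

# Arithmetic with NA propagates unless na.rm = TRUE is given
mean(toy_deck$value)
mean(toy_deck$value, na.rm = TRUE)

# Drop rows only where `value` is missing, ignoring NAs in other columns
value_known <- toy_deck[!is.na(toy_deck$value), ]

# na.omit() drops a row if ANY column is missing
fully_complete <- na.omit(toy_deck)

nrow(value_known)
nrow(fully_complete)
```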



### Exercise 8: Handling Missing Data

1. Replace all NA values in the 'value' column with the mean value of the non-NA entries.

```{r}
# Your code here
```

<details>
<summary>Click to see solution</summary>

```{r}
deck$value[is.na(deck$value)] <- mean(deck$value, na.rm = TRUE)
```
</details>

## Environments

An environment in R is a self-contained space where you can store and manage variables, functions, and other objects. Environments are useful for organizing your code and data, and for avoiding naming conflicts.

Let's explore environments:

```{r}
# Create a new environment
card_env <- new.env()

# Assign a variable to the new environment
card_env$ace_value <- 1

# Access the variable
card_env$ace_value
```

##### Creating a New Environment

To create a new environment, you can use the `new.env()` function. The line `card_env <- new.env()` creates a new, empty environment and assigns it to the variable `card_env`.

##### Assigning a Variable to the New Environment

To assign a variable to the new environment, you can use the `$` operator. The line `card_env$ace_value <- 1` assigns the value 1 to the variable `ace_value` in the `card_env` environment.

##### Accessing the Variable

To access the variable, you can use the `$` operator again. The expression `card_env$ace_value` returns the value of the `ace_value` variable in the `card_env` environment, which is 1.

##### Environment Details

Here are some additional details about the `card_env` environment. `ls(card_env)` lists the objects in the environment, which in this case is just the `ace_value` variable. `typeof(card_env)` and `class(card_env)` return the type and class of the environment, respectively; both are `"environment"`.
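
Unlike data frames, environments are not copied when passed to a function, so changes made inside the function remain visible afterwards. The sketch below illustrates this with a hypothetical `toy_env` and a hypothetical helper `add_one()`.

```{r}
toy_env <- new.env()
toy_env$count <- 0

# The function receives a reference to the environment, not a copy
add_one <- function(env) {
  env$count <- env$count + 1
  invisible(env)
}

add_one(toy_env)
add_one(toy_env)

ls(toy_env)        # names stored in the environment
toy_env$count      # the changes made inside the function are visible here
typeof(toy_env)
```

This reference behavior is what makes environments a natural place to keep state that several functions need to update.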

## Working with Environments

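Let's use environments to manage game state. `with()` evaluates the block inside `game_env`, so the assignments create `deck` and `player_hand` there, while `shuffle_deck()`, `deal_card()`, and the original `deck` are looked up in the enclosing environment: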

```{r}
game_env <- new.env()

with(game_env, {
  deck <- shuffle_deck(deck)
  player_hand <- deal_card(deck)
})

# Access variables from the environment
game_env$player_hand
```
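
To make the lookup and assignment behavior concrete, the following sketch repeats the pattern with simple numbers. The names `demo_env`, `outside_value`, and `inside_value` are hypothetical.

```{r}
demo_env <- new.env()
outside_value <- 10                   # lives in the calling environment

with(demo_env, {
  inside_value <- outside_value * 2   # found via the parent, stored in demo_env
})

exists("inside_value", envir = demo_env, inherits = FALSE)
demo_env$inside_value
```

Reading falls back to the parent environment when a name is not found locally, while plain `<-` always assigns in the environment where the code is evaluated.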

### Exercise 9: Environment Usage

1. Create a function that initializes a game environment with a shuffled deck and dealt hands for a specified number of players.

```{r}
# Your code here
```

<details>
<summary>Click to see solution</summary>

```{r}
initialize_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
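    for (i in seq_len(num_players)) {
      # Deal from the top of the shuffled deck, then remove those same cards
      # so that no card ends up in two hands
      players[[paste0("Player", i)]] <- head(deck, cards_per_hand)
      deck <- deck[-seq_len(cards_per_hand), ]
    }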
  })
  
  return(game_env)
}

# Usage:
game <- initialize_game(4, 5)
game$players$Player1
```
</details>

## Scoping Rules

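Let's explore R's lexical scoping. When a function uses a variable that it does not define itself, R looks for it in the environment where the function was created: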

```{r}
outer_var <- 10

example_function <- function() {
  inner_var <- 5
  outer_var + inner_var
}

example_function()  # Returns 15
```
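
Lexical scoping is also what makes closures possible: a function created inside another function keeps access to the variables of the call that created it. The sketch below is a minimal illustration with a hypothetical `make_counter()` factory; it uses `<<-`, which assigns to an existing variable in an enclosing environment instead of creating a local one.

```{r}
make_counter <- function() {
  count <- 0                  # lives in the environment of this call
  function() {
    count <<- count + 1       # `<<-` updates `count` in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()   # a separate closure with its own `count`

first_a  <- counter_a()
second_a <- counter_a()
first_b  <- counter_b()

c(first_a, second_a, first_b)
```

Each call to `make_counter()` creates a fresh environment, so the two counters do not interfere with each other.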

## Assignment

Let's practice different assignment methods:

```{r}
# Using <-
x <- 5

# Using =
y = 10

# Using ->
15 -> z

# View results
c(x, y, z)
```
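
At the top level `<-` and `=` behave the same, but inside a function call they differ: `=` names an argument, while `<-` performs an assignment and passes the value along. The following sketch shows the difference; `toy_values` and the result names are hypothetical.

```{r}
# Inside a call, `=` names an argument; it does not create a variable
mean_named <- mean(x = c(2, 4, 6))

# Inside a call, `<-` assigns first and then passes the value along
mean_assigned <- mean(toy_values <- c(2, 4, 6))

mean_named
mean_assigned
toy_values     # exists because of the `<-` inside the call
```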

### Exercise 10: Assignment Practice

1. Create a function that takes a deck as input and returns a list with two elements: the red cards and the black cards. Use different assignment operators for each.

```{r}
# Your code here
```
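
<details>
<summary>Click to see one possible solution</summary>

One way to approach this exercise, sketched on a small hypothetical `toy_deck`. The function name `split_by_color()` and the assumption that suits are spelled "Hearts", "Diamonds", "Clubs", and "Spades" are illustrative choices, not part of the exercise statement.

```{r}
split_by_color <- function(cards) {
  is_red <- cards$suit %in% c("Hearts", "Diamonds")

  red_cards <- cards[is_red, ]      # assignment with <-
  black_cards = cards[!is_red, ]    # assignment with =

  list(red = red_cards, black = black_cards) -> result   # assignment with ->
  result
}

# Hypothetical toy data, used only for illustration
toy_deck <- data.frame(
  face = c("Ace", "King", "Two", "Ten"),
  suit = c("Hearts", "Spades", "Diamonds", "Clubs"),
  stringsAsFactors = FALSE
)

toy_split <- split_by_color(toy_deck)
toy_split$red
toy_split$black
```

</details>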

### Final Exercise: Putting It All Together

Create a card game simulation that uses environments, closures, and DataFrames. The game should:

1. Initialize a shuffled deck
2. Deal hands to players
3. Allow players to draw and discard cards
4. Keep track of the game state using an environment

```{r}
# Your code here
```

<details>
<summary>Click to see solution</summary>

```{r}
create_card_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
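    for (i in seq_len(num_players)) {
      # Deal each hand from the top of the shuffled deck and remove those
      # same cards, so hands never overlap with each other or the draw pile
      players[[paste0("Player", i)]] <- head(deck, cards_per_hand)
      deck <- deck[-seq_len(cards_per_hand), ]
    }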
    
    turn <- 1
    
    draw_card <- function(player_index) {
      if (nrow(deck) == 0) {
        stop("No more cards in the deck!")
      }
      card <- deck[1, ]
      deck <<- deck[-1, ]
      players[[player_index]] <<- rbind(players[[player_index]], card)
      return(card)
    }
    
    discard_card <- function(player_index, card_index) {
      discarded <- players[[player_index]][card_index, ]
      players[[player_index]] <<- players[[player_index]][-card_index, ]
      return(discarded)
    }
    
    next_turn <- function() {
      turn <<- (turn %% num_players) + 1
    }
  })
  
  return(game_env)
}

# Usage:
game <- create_card_game(4, 5)
game$draw_card(1)
game$discard_card(1, 3)
game$next_turn()
```
</details>


