R Notebook: Working with DataFrames
Introduction
Welcome to this comprehensive lecture on working with DataFrames in
R! DataFrames are one of the most fundamental and versatile data
structures in R, making them essential for any data analysis task. In
this session, we’ll explore various operations and concepts related to
DataFrames using a deck of cards as our primary example.
We’ve chosen a deck of cards as our dataset because it provides a
relatable and intuitive structure that can demonstrate many DataFrame
operations effectively. Each card in our deck will be represented as a
row in our DataFrame, with attributes like face, suit, and value as
columns.
Let’s begin by loading the necessary libraries and our deck of
cards:
library(dplyr)
library(readr)
# Load the deck of cards
deck <- read_csv(url("https://nayelbettache.github.io/documents/STSCI_2120/deck.csv"))
Let’s break down what’s happening here:
- library(dplyr): This loads the dplyr package, which provides a set
of tools for efficiently manipulating datasets in R. dplyr is part of
the tidyverse ecosystem and offers functions like filter(), select(),
and mutate() that we’ll use throughout this notebook.
- library(readr): This loads the readr package, which provides a fast
and friendly way to read rectangular data (like CSV files). It’s
generally faster than base R’s read.csv() function, guesses column
types automatically, and returns a tibble (the tidyverse flavor of a
DataFrame).
- deck <- read_csv(url(...)): This line downloads the CSV file
containing the deck of cards data from the URL shown above and stores
it in a DataFrame called deck. The read_csv() function from readr is
being used here.
Now, let’s take a look at the first few rows of our deck:
head(deck)
The head() function gives us a quick preview of our DataFrame. It
shows the first 6 rows by default. This is always a good first step
when working with a new dataset to understand its structure and
content.
Our deck DataFrame has three columns:
- face: The face of the card, stored as a lowercase word (e.g.,
“ace”, “two”, “king”)
- suit: The suit of the card (“spades”, “clubs”, “diamonds”,
“hearts”)
- value: The numerical value of the card (ace is 1, king is 13)
Because the strings are stored in lowercase, comparisons such as
face == "king" must use lowercase as well.
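If the course URL is unreachable, a stand-in deck with the same layout can be built by hand. The sketch below assumes the row order visible in the printed outputs of this notebook (spades, clubs, diamonds, hearts; king down to ace). The name toy_deck is hypothetical, and the result is a base data.frame rather than the tibble that read_csv() returns.

```r
# Stand-in for the downloaded deck (toy_deck is a hypothetical name).
faces <- c("king", "queen", "jack", "ten", "nine", "eight", "seven",
           "six", "five", "four", "three", "two", "ace")
suits <- c("spades", "clubs", "diamonds", "hearts")

toy_deck <- data.frame(
  face  = rep(faces, times = 4),   # 13 faces repeated once per suit
  suit  = rep(suits, each = 13),   # each suit covers 13 consecutive rows
  value = rep(13:1, times = 4),    # king = 13 down to ace = 1
  stringsAsFactors = FALSE
)

dim(toy_deck)       # number of rows and columns
str(toy_deck)       # column names and types
head(toy_deck, 3)   # first three cards
```

dim() and str() complement head(): they report the size and the column types without printing the data itself.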
Selecting Values
In R, there are multiple ways to select values from a DataFrame.
Understanding these methods is crucial for effective data manipulation.
Let’s explore three common approaches:
- Using column names:
deck$suit
[1] "spades" "spades" "spades" "spades" "spades" "spades" "spades" "spades" "spades"
[10] "spades" "spades" "spades" "spades" "clubs" "clubs" "clubs" "clubs" "clubs"
[19] "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "diamonds"
[28] "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds"
[37] "diamonds" "diamonds" "diamonds" "hearts" "hearts" "hearts" "hearts" "hearts" "hearts"
[46] "hearts" "hearts" "hearts" "hearts" "hearts" "hearts" "hearts"
deck[["face"]]
[1] "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three" "two"
[13] "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three"
[25] "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four"
[37] "three" "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five"
[49] "four" "three" "two" "ace"
Here’s what’s happening:
- deck$suit: This uses the $ operator to access the “suit” column of
the deck DataFrame. It returns a vector containing all the values in
the “suit” column.
- deck[["face"]]: This uses double square brackets [[]] to access the
“face” column. Like $, it returns a vector of all values in the “face”
column.
The main difference between these two methods is that [[]] can be
used with variables containing column names, while $ cannot. For
example:
column_name <- "value"
deck[[column_name]] # This works
[1] 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7
[34] 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1
# deck$column_name # This would not work as expected
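The difference can be checked on a small stand-in deck. This is a sketch, assuming a hypothetical base data.frame toy_deck with the same three columns as deck. With $, R looks for a column literally called column_name and finds none; a tibble behaves the same way and also emits a warning.

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

column_name <- "value"
via_brackets <- toy_deck[[column_name]]  # looks up the column named "value"
via_dollar   <- toy_deck$column_name     # looks for a column called "column_name"

is.null(via_dollar)   # TRUE: there is no such column
head(via_brackets)    # 13 12 11 10 9 8
```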
- Using indices:
deck[1:5, 2] # First 5 rows, second column
This method uses R’s built-in indexing. The format is
dataframe[rows, columns]:
- 1:5 specifies rows 1 through 5
- 2 specifies the second column
This approach is very flexible:
- deck[1:5, ] would select the first 5 rows and all columns
- deck[, 2] would select all rows of the second column
- deck[1:5, 1:2] would select the first 5 rows of the first two
columns
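One detail worth knowing: selecting a single column by index behaves differently for base data.frames and tibbles. The sketch below uses a hypothetical base data.frame toy_deck, where the result drops to a vector unless drop = FALSE is given. A tibble, such as the one read_csv() returns, keeps a one-column table in both cases.

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

as_vector <- toy_deck[1:5, 2]                # base data.frame drops to a vector
as_table  <- toy_deck[1:5, 2, drop = FALSE]  # drop = FALSE keeps a data.frame

is.character(as_vector)   # TRUE
is.data.frame(as_table)   # TRUE
```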
- Using dplyr’s select function:
select(deck, face, suit)
The select() function from dplyr provides a more intuitive and
readable way to choose columns:
- The first argument is the DataFrame
- Subsequent arguments are the names of columns you want to select
This method is particularly useful when you need to select multiple
columns or use more complex selection criteria. For example:
# Select columns that start with "s"
select(deck, starts_with("s"))
# Select all columns except "value"
select(deck, -value)
Exercise 1: Selecting Values
Now, let’s practice these selection methods:
- Select all the faces from the deck.
- Select the first 10 rows and all columns of the deck.
- Use dplyr to select only the ‘value’ column.
# Your code here
Click to see solution
# 1. Select all the faces
deck$face
[1] "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three" "two"
[13] "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three"
[25] "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four"
[37] "three" "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five"
[49] "four" "three" "two" "ace"
# 2. Select first 10 rows and all columns
deck[1:10, ]
# 3. Use dplyr to select only the 'value' column
select(deck, value)
Explanation:
1. We use the $ operator to select the ‘face’ column, which returns a
vector of all faces.
2. We use bracket notation [] with 1:10 to select the first 10 rows,
and leave the column part empty to select all columns.
3. We use dplyr’s select() function to choose only the ‘value’ column.
This returns a DataFrame with one column, not a vector.
Deal a Card
In card games, dealing is a fundamental operation. Let’s create a
function to simulate dealing a card from our deck:
deal_card <- function(deck) {
  card <- deck[sample(nrow(deck), 1), ]
  return(card)
}
dealt_card <- deal_card(deck)
print(dealt_card)
Let’s break down this deal_card() function:
- function(deck): This defines a function that takes one argument,
our deck DataFrame.
- sample(nrow(deck), 1):
  - nrow(deck) returns the number of rows in the deck
  - sample(x, size) randomly samples size numbers from 1 to x when x
  is a single positive number
  - So this generates one random row number
- deck[sample(nrow(deck), 1), ]: This selects the randomly chosen row
from the deck. The comma with nothing after it means we select all
columns for this row.
- The selected row (representing a single card) is assigned to card.
- return(card) sends this randomly selected card back as the output of
the function.
When we call deal_card(deck), it runs this function and returns a
single, randomly selected card from our deck.
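Because sample() is random, each call usually returns a different card. When a repeatable result is needed (for example, when checking an exercise), set.seed() can be called first. A minimal sketch, assuming a hypothetical stand-in toy_deck and an arbitrary seed value:

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

deal_card <- function(deck) {
  deck[sample(nrow(deck), 1), ]
}

set.seed(2120)                       # any fixed integer works; 2120 is arbitrary
first_draw <- deal_card(toy_deck)
set.seed(2120)                       # same seed again
second_draw <- deal_card(toy_deck)

identical(first_draw, second_draw)   # TRUE: same seed, same card
```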
Exercise 2: Dealing Cards
Let’s extend our dealing functionality:
- Modify the deal_card function to deal multiple cards at once.
- Deal a hand of 5 cards and display them.
# Your code here
Click to see solution
# 1. Modify deal_card function
deal_cards <- function(deck, n) {
  cards <- deck[sample(nrow(deck), n), ]
  return(cards)
}
# 2. Deal a hand of 5 cards
hand <- deal_cards(deck, 5)
print(hand)
Explanation:
1. We modify the function to take an additional argument n, which is
the number of cards to deal.
2. We use sample(nrow(deck), n) to get n random row numbers.
3. We select these rows from the deck to create our hand of cards.
4. We then use this new function to deal a hand of 5 cards and display
it.
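By default sample() draws without replacement, so a hand never contains the same card twice, and asking for more cards than the deck holds is an error. The sketch below checks both properties on a hypothetical stand-in toy_deck:

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

deal_cards <- function(deck, n) {
  deck[sample(nrow(deck), n), ]
}

hand <- deal_cards(toy_deck, 5)
anyDuplicated(hand) == 0   # TRUE: no card appears twice in the hand

# Requesting 53 cards from a 52-card deck fails; tryCatch() captures the error.
too_many <- tryCatch(deal_cards(toy_deck, 53), error = function(e) "error")
too_many                   # "error"
```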
Shuffle the Deck
Shuffling is another crucial operation in card games. In DataFrame
terms, this means randomly reordering our rows:
shuffle_deck <- function(deck) {
  shuffled_deck <- deck[sample(nrow(deck)), ]
  rownames(shuffled_deck) <- NULL
  return(shuffled_deck)
}
shuffled_deck <- shuffle_deck(deck)
head(shuffled_deck)
Let’s break down this shuffle_deck() function:
- sample(nrow(deck)): This creates a random permutation of the numbers
from 1 to the number of rows in the deck. For example, if we have 52
cards, this might return something like 23, 7, 52, 1, 18, ...
- deck[sample(nrow(deck)), ]: This uses the random permutation to
reorder all rows of the deck. It’s equivalent to randomly shuffling
the cards.
- rownames(shuffled_deck) <- NULL: This resets the row names of the
shuffled deck. When we reorder the rows of a base data.frame, R keeps
the original row names, which can be confusing. Setting them to NULL
causes R to use sequential numbers as row names. A tibble, such as the
one read_csv() returns, already renumbers its rows, so the line has no
visible effect there.
- The function returns this shuffled deck.
After shuffling, we use head() to view the first few rows of the
shuffled deck, demonstrating that the order has indeed changed.
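The effect of resetting row names is easiest to see on a base data.frame. The sketch below uses a hypothetical stand-in toy_deck and also checks that shuffling only reorders the cards; card_id() is a helper introduced here for illustration.

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

shuffled <- toy_deck[sample(nrow(toy_deck)), ]
head(rownames(shuffled))     # original row numbers, now out of order
rownames(shuffled) <- NULL
head(rownames(shuffled))     # "1" "2" "3" "4" "5" "6"

# Hypothetical helper: one label per card, e.g. "king spades".
card_id <- function(d) paste(d$face, d$suit)
setequal(card_id(shuffled), card_id(toy_deck))   # TRUE: same 52 cards
```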
Exercise 3: Shuffling
Now, let’s combine our shuffling and dealing operations:
- Shuffle the deck and deal the top 3 cards.
- Create a function that shuffles the deck and deals a specified
number of hands with a specified number of cards each.
# Your code here
Click to see solution
# 1. Shuffle and deal top 3 cards
top_3 <- head(shuffle_deck(deck), 3)
# 2. Function to shuffle and deal multiple hands
shuffle_and_deal <- function(deck, num_hands, cards_per_hand) {
  shuffled <- shuffle_deck(deck)
  hands <- list()
  for (i in 1:num_hands) {
    start <- (i - 1) * cards_per_hand + 1
    end <- i * cards_per_hand
    hands[[i]] <- shuffled[start:end, ]
  }
  return(hands)
}
# Example usage:
game_hands <- shuffle_and_deal(deck, 4, 5) # 4 hands, 5 cards each
Explanation:
1. We first shuffle the deck using our shuffle_deck() function, then
use head() to get the first 3 cards.
2. For the second part:
- We create a function that takes the deck, number of hands, and cards
per hand as arguments.
- We shuffle the deck first.
- We create an empty list to store the hands.
- We use a for loop to deal the appropriate number of cards to each
hand. Hand i receives rows (i - 1) * cards_per_hand + 1 through
i * cards_per_hand, so no card is dealt twice.
- We use list indexing to add each hand to our list of hands.
- Finally, we return the list of hands.
Dollar Signs and Double Brackets
In R, we can access DataFrame columns using $ or [[]]. Understanding
the differences between these methods is crucial for effective R
programming:
# Using $
deck$suit
[1] "spades" "spades" "spades" "spades" "spades" "spades" "spades" "spades" "spades"
[10] "spades" "spades" "spades" "spades" "clubs" "clubs" "clubs" "clubs" "clubs"
[19] "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "clubs" "diamonds"
[28] "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds" "diamonds"
[37] "diamonds" "diamonds" "diamonds" "hearts" "hearts" "hearts" "hearts" "hearts" "hearts"
[46] "hearts" "hearts" "hearts" "hearts" "hearts" "hearts" "hearts"
# Using [[]]
deck[["face"]]
[1] "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three" "two"
[13] "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four" "three"
[25] "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five" "four"
[37] "three" "two" "ace" "king" "queen" "jack" "ten" "nine" "eight" "seven" "six" "five"
[49] "four" "three" "two" "ace"
# Using [[]] with variables
column_name <- "value"
deck[[column_name]]
[1] 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1 13 12 11 10 9 8 7
[34] 6 5 4 3 2 1 13 12 11 10 9 8 7 6 5 4 3 2 1
Let’s break down these methods:
deck$suit:
- This uses the $ operator to directly access the “suit” column.
- It’s quick and intuitive, but has limitations.
- It doesn’t work with variable column names.
- It can sometimes cause issues in complex operations or function
calls.
deck[["face"]]:
- This uses double square brackets [[]] to access the “face” column.
- It’s more flexible than the $ operator.
- It works with variable column names (as shown in the third
example).
- It’s generally safer in function calls and complex operations.
- It clearly indicates that you’re extracting a single column.
deck[[column_name]]:
- This demonstrates using [[]] with a variable containing the column
name.
- This flexibility is particularly useful when writing functions that
need to work with different columns.
Both $ and [[]] return a vector of values from the specified column.
The main difference is in their flexibility and how they behave in
certain contexts (like inside functions or with variable column
names).
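One concrete case of the issues mentioned above: on a base data.frame, $ accepts an abbreviated column name, while [[]] requires the exact name. A minimal sketch, assuming a hypothetical base data.frame toy_deck (a tibble, such as the one read_csv() returns, does not match partial names and returns NULL with a warning instead):

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

partial <- toy_deck$val        # "val" is silently completed to "value"
exact   <- toy_deck[["val"]]   # exact matching only: no column called "val"

identical(partial, toy_deck$value)   # TRUE
is.null(exact)                       # TRUE
```

A typo that happens to be a prefix of a real column name therefore goes unnoticed with $, which is one reason [[]] is the safer choice inside functions.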
Exercise 4: Column Access
Let’s practice using these access methods:
- Create a function that takes a column name as an argument and
returns the unique values in that column.
# Your code here
Click to see solution
get_unique_values <- function(df, column_name) {
  unique(df[[column_name]])
}
# Example usage:
unique_suits <- get_unique_values(deck, "suit")
print(unique_suits)
[1] "spades" "clubs" "diamonds" "hearts"
Explanation:
- We define a function get_unique_values that takes two arguments: a
DataFrame df and a column_name.
- Inside the function, we use df[[column_name]] to access the
specified column. We can’t use $ here because we need to use the
variable column_name.
- We wrap this in the unique() function, which returns only the unique
values from the vector.
- This function is flexible: it can be used with any DataFrame and any
column name.
- In the example usage, we get the unique suits from our deck.
Modifying Values
Modifying values in a DataFrame is a common operation in data
cleaning and transformation. Let’s look at how we can change values in
our deck:
# Change the value of the first card to 100
deck$value[1] <- 100
# Change the face of the last card to "Joker"
deck$face[nrow(deck)] <- "Joker"
# View the changes
head(deck, 1)
tail(deck, 1)
Let’s break down what’s happening here:
deck$value[1] <- 100:
- This accesses the ‘value’ column of the deck using $.
- [1] selects the first element of this column.
- We assign the value 100 to this element, changing the value of the
first card.
deck$face[nrow(deck)] <- "Joker":
- Similar to the first operation, but we’re changing the ‘face’
column.
- nrow(deck) gives us the number of rows in the deck, effectively
selecting the last row.
- We change the face of the last card to “Joker”.
We use head(deck, 1) and tail(deck, 1) to view the first and last rows
of the deck, confirming our changes.
This method of direct assignment is straightforward but should be
used cautiously. It’s easy to accidentally modify data you didn’t intend
to change.
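One way to experiment safely is to modify a copy. In R, assigning a data frame to a new name and then changing the new object leaves the original untouched. A minimal sketch, assuming a hypothetical stand-in toy_deck; scratch is also a name introduced here:

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

scratch <- toy_deck        # work on a copy
scratch$value[1] <- 100    # only the copy changes

scratch$value[1]           # 100
toy_deck$value[1]          # 13: the original is untouched
```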
Exercise 5: Value Modification
Now, let’s try some more complex modifications:
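- Change all the “jack” cards to have a value of 11.
- Add a new column called “color” based on the suit (red for hearts
and diamonds, black for clubs and spades).
# Your code here
Click to see solution
# 1. Change jack values to 11 (face strings in this deck are lowercase)
deck$value[deck$face == "jack"] <- 11
# 2. Add color column (suit strings are lowercase too)
deck$color <- ifelse(deck$suit %in% c("hearts", "diamonds"), "Red", "Black")
Explanation:
1. deck$value[deck$face == "jack"] <- 11:
- deck$face == "jack" creates a logical vector, TRUE for jack cards,
FALSE for others.
- We use this to index deck$value, selecting only the values for jack
cards.
- We assign 11 to these selected values.
2. deck$color <- ifelse(...):
- deck$suit %in% c("hearts", "diamonds") is TRUE for the red suits and
FALSE for the black ones.
- ifelse() returns "Red" where the condition is TRUE and "Black"
elsewhere, and the result is stored in a new color column.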
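String comparison in R is case-sensitive, and the printed outputs earlier in this notebook show the face and suit strings in lowercase. Counting the matching rows before assigning is a quick way to catch a spelling that matches nothing. A sketch on a hypothetical stand-in toy_deck:

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

n_capitalized <- sum(toy_deck$face == "Jack")   # 0: nothing matches "Jack"
n_lowercase   <- sum(toy_deck$face == "jack")   # 4: one jack per suit

# tolower() makes a comparison robust to capitalization in the search term.
n_normalized  <- sum(toy_deck$face == tolower("Jack"))   # 4
```

An assignment through an all-FALSE index changes nothing and raises no error, so a mismatch in case fails silently.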
Changing Values in Place
mutate() is a verb function in dplyr that allows you to add new
columns or modify existing ones in a data frame. The basic syntax
is:
mutate(data, new_column = expression)
Where:
- data is the data frame you want to modify.
- new_column is the name of the new column you want to create or the
existing column you want to modify.
- expression is the operation you want to perform to create or modify
the column.
Let’s use dplyr’s mutate function:
deck <- mutate(deck,
               value = ifelse(face == "king", 13, value),
               value = ifelse(face == "queen", 12, value))
# View the changes
filter(deck, face %in% c("king", "queen"))
Using mutate to Update the value Column
The code uses the mutate function from the dplyr package to update
the value column in the deck dataframe. The face strings in this deck
are lowercase, so the comparisons use “king” and “queen”.
The mutate Function
The mutate function takes a dataframe as input and returns a new
dataframe with the modified columns. It does not change the original
object, which is why the result is assigned back to deck.
The ifelse Function
The ifelse function is a vectorized conditional statement that checks
a condition and returns one value if the condition is TRUE and another
value if the condition is FALSE.
In the first ifelse statement, the condition is face == "king". If
this condition is TRUE, the value returned is 13; otherwise, the
original value in the value column is returned.
In the second ifelse statement, the condition is face == "queen". If
this condition is TRUE, the value returned is 12; otherwise, the
current value in the value column is returned (which may have already
been updated by the previous ifelse statement).
The mutate Statement
The value column is updated twice, first to assign a value of 13 to
the “king” cards, and then to assign a value of 12 to the “queen”
cards.
The two updates are applied in order, so the second ifelse statement
sees the result of the first. Here the two conditions are mutually
exclusive (a card cannot be both a king and a queen), so the order
makes no difference. If the conditions overlapped, the later update
would win for the rows matching both.
The filter Function
The filter function is used to select a subset of rows from the deck
dataframe where the face column is either “king” or “queen”. This allows
us to view the changes made to the value column for these specific
cards.
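The sequential behavior of the two ifelse() calls can be checked on plain vectors, without dplyr. This is only a sketch with made-up starting values; the vector names are hypothetical:

```r
# Three example cards with deliberately wrong starting values for king and queen.
face  <- c("king", "queen", "jack")
value <- c(10, 10, 11)

# Step 1 fixes the king; step 2 starts from the result of step 1.
after_king  <- ifelse(face == "king", 13, value)         # 13 10 11
after_queen <- ifelse(face == "queen", 12, after_king)   # 13 12 11

# Applying the updates in the opposite order gives the same result,
# because no card satisfies both conditions.
swapped <- ifelse(face == "king", 13, ifelse(face == "queen", 12, value))
identical(after_queen, swapped)   # TRUE
```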
Exercise 6: In-Place Modifications
- Use mutate to add a new column “is_face_card” that is TRUE for jack,
queen, and king, and FALSE otherwise.
# Your code here
Click to see solution
# Face strings in this deck are lowercase
deck <- mutate(deck,
               is_face_card = face %in% c("jack", "queen", "king"))
Logical Subsetting
Let’s practice logical subsetting:
# Get all hearts (suit strings in this deck are lowercase)
hearts <- deck[deck$suit == "hearts", ]
# Get all face cards
face_cards <- deck[deck$face %in% c("jack", "queen", "king"), ]
# View results
head(hearts)
head(face_cards)
The code uses logical subsetting to extract specific rows from the
deck dataframe based on certain conditions.
Getting All Hearts
The first assignment keeps the rows where the suit column is “hearts”.
Here’s what’s happening:
- deck$suit == "hearts" is a logical expression that checks if the
value in the suit column is equal to “hearts”. This will return a vector
of TRUE and FALSE values, where TRUE indicates that the row has a suit
of “hearts”.
- The square brackets [] are used to subset the deck dataframe based
on this logical expression. The condition before the comma selects
rows, and the empty position after the comma keeps every column
(without the comma, R would treat the vector as a column selector).
- The resulting subset of rows is assigned to a new dataframe called
hearts.
Getting All Face Cards
The second assignment keeps the rows where the face column is “jack”,
“queen”, or “king”.
Here’s what’s happening:
- deck$face %in% c("jack", "queen", "king") is a logical expression
that checks if the value in the face column is one of the values in the
vector c("jack", "queen", "king"). This will return a vector of TRUE and
FALSE values, where TRUE indicates that the row has a face that is one
of the specified values.
- The rest of the syntax is the same as before: the square brackets []
are used to subset the deck dataframe based on this logical expression,
and the resulting subset of rows is assigned to a new dataframe called
face_cards.
Viewing the Results
The final two lines of code use the head() function to view the first
few rows of the hearts and face_cards dataframes.
This allows us to verify that the subsetting worked correctly and see
the resulting dataframes.
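Logical vectors can also be combined before subsetting: & keeps rows where both conditions hold, and | keeps rows where at least one holds. A minimal sketch on a hypothetical stand-in toy_deck:

```r
# toy_deck is a hypothetical stand-in with the same columns as deck.
toy_deck <- data.frame(
  face  = rep(c("king", "queen", "jack", "ten", "nine", "eight", "seven",
                "six", "five", "four", "three", "two", "ace"), times = 4),
  suit  = rep(c("spades", "clubs", "diamonds", "hearts"), each = 13),
  value = rep(13:1, times = 4),
  stringsAsFactors = FALSE
)

is_heart <- toy_deck$suit == "hearts"
is_face  <- toy_deck$face %in% c("jack", "queen", "king")

heart_faces   <- toy_deck[is_heart & is_face, ]   # hearts that are face cards
heart_or_face <- toy_deck[is_heart | is_face, ]   # hearts, face cards, or both

nrow(heart_faces)     # 3
nrow(heart_or_face)   # 22 (13 hearts + 12 face cards - 3 counted twice)
```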
Exercise 7: Logical Subsetting
- Create a subset of the deck containing only cards with values
greater than 10.
- Create a subset of red cards (Hearts and Diamonds) with odd
values.
# Your code here
Click to see solution
# 1. Cards with values > 10
high_cards <- deck[deck$value > 10, ]
# 2. Red cards with odd values (suit strings are lowercase)
red_odd_cards <- deck[deck$suit %in% c("hearts", "diamonds") & deck$value %% 2 == 1, ]
Environments
An environment in R is a self-contained space where you can store and
manage variables, functions, and other objects. Environments are useful
for organizing your code and data, and for avoiding naming conflicts.
Let’s explore environments:
# Create a new environment
card_env <- new.env()
# Assign a variable to the new environment
card_env$ace_value <- 1
# Access the variable
card_env$ace_value
[1] 1
Creating a New Environment
To create a new environment, you can use the new.env() function.
This creates a new, empty environment and assigns it to the variable
card_env.
Assigning a Variable to the New Environment
To assign a variable to the new environment, you can use the $
operator.
This assigns the value 1 to the variable ace_value in the card_env
environment.
Accessing the Variable
To access the variable, you can use the $ operator again. This
returns the value of the ace_value variable in the card_env environment,
which is 1.
Environment Details
Here are some additional details about the card_env environment.
The ls() function lists the objects in the environment, which in this
case is just the ace_value variable. The typeof() and class() functions
return the type and class of the environment, respectively.
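The three inspection functions named above can be called on card_env as follows. This is a self-contained sketch that recreates the environment first:

```r
# Recreate the environment from the example above.
card_env <- new.env()
card_env$ace_value <- 1

env_objects <- ls(card_env)       # names of the objects stored in the environment
env_type    <- typeof(card_env)   # internal type
env_class   <- class(card_env)    # class attribute

env_objects   # "ace_value"
env_type      # "environment"
env_class     # "environment"
```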
Working with Environments
Let’s use environments to manage game state:
game_env <- new.env()
with(game_env, {
  deck <- shuffle_deck(deck)
  player_hand <- deal_card(deck)
})
# Access variables from the environment
game_env$player_hand
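Environments suit game state because they are not copied when passed to a function: a change made inside the function is visible to the caller. The sketch below illustrates this with a hypothetical counter; game_state and draw_one() are names introduced here, not part of the notebook.

```r
game_state <- new.env()
game_state$cards_left <- 52

# Hypothetical helper: modifies the environment it receives.
draw_one <- function(state) {
  state$cards_left <- state$cards_left - 1
  invisible(state)
}

draw_one(game_state)
draw_one(game_state)
game_state$cards_left   # 50: both calls changed the same environment
```

A data frame passed the same way would be unchanged after the call, since the function would only modify its own copy.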
Exercise 9: Environment Usage
- Create a function that initializes a game environment with a
shuffled deck and dealt hands for a specified number of players.
# Your code here
Click to see solution
initialize_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
    for (i in 1:num_players) {
      # Deal from the top of the shuffled deck, then remove those same
      # cards so that no card ends up in two hands
      players[[paste0("Player", i)]] <- deck[1:cards_per_hand, ]
      deck <- deck[-(1:cards_per_hand), ]
    }
  })
  return(game_env)
}
# Usage:
game <- initialize_game(4, 5)
game$players$Player1
Scoping Rules
Let’s explore R’s lexical scoping:
outer_var <- 10
example_function <- function() {
  inner_var <- 5
  outer_var + inner_var
}
example_function() # Returns 15
[1] 15
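Two further consequences of lexical scoping are sketched below: assigning inside a function creates a local variable rather than changing the outer one, and a function looks up free variables where it was defined, not where it was called. All function names here are hypothetical.

```r
outer_var <- 10

# Assignment with <- inside a function creates a local variable.
shadow_example <- function() {
  outer_var <- 99
  outer_var
}
shadow_result <- shadow_example()   # 99
outer_var                           # still 10

# show_outer() is defined next to outer_var, so it sees that value
# even when called from a function with its own outer_var.
show_outer <- function() outer_var
caller <- function() {
  outer_var <- -1
  show_outer()
}
caller_result <- caller()           # 10, not -1
```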
Assignment
Let’s practice different assignment methods:
# Using <-
x <- 5
# Using =
y = 10
# Using ->
15 -> z
# View results
c(x, y, z)
[1] 5 10 15
Exercise 10: Assignment Practice
- Create a function that takes a deck as input and returns a list with
two elements: the red cards and the black cards. Use different
assignment operators for each.
# Your code here
Final Exercise: Putting It All Together
Create a card game simulation that uses environments, closures, and
DataFrames. The game should:
- Initialize a shuffled deck
- Deal hands to players
- Allow players to draw and discard cards
- Keep track of the game state using an environment
# Your code here
Click to see solution
create_card_game <- function(num_players, cards_per_hand) {
  game_env <- new.env()
  with(game_env, {
    deck <- shuffle_deck(deck)
    players <- list()
    for (i in 1:num_players) {
      # Each player gets the next block of cards from the shuffled deck
      rows <- ((i - 1) * cards_per_hand + 1):(i * cards_per_hand)
      players[[paste0("Player", i)]] <- deck[rows, ]
    }
    # Remove the dealt cards so they cannot be drawn again
    deck <- deck[-(1:(num_players * cards_per_hand)), ]
    turn <- 1
    draw_card <- function(player_index) {
      if (nrow(deck) == 0) {
        stop("No more cards in the deck!")
      }
      card <- deck[1, ]
      deck <<- deck[-1, ]
      players[[player_index]] <<- rbind(players[[player_index]], card)
      return(card)
    }
    discard_card <- function(player_index, card_index) {
      discarded <- players[[player_index]][card_index, ]
      players[[player_index]] <<- players[[player_index]][-card_index, ]
      return(discarded)
    }
    next_turn <- function() {
      turn <<- (turn %% num_players) + 1
    }
  })
  return(game_env)
}
# Usage:
game <- create_card_game(4, 5)
game$draw_card(1)
game$discard_card(1, 3)
game$next_turn()
