2 Data types and datasets

2.1 Data types in R

There are five main types of data in R:

  • integers (whole numbers)
  • numerics (also called doubles or floats, numbers that allow decimal points)
  • characters (also known as strings; have quotes around them)
  • logicals (TRUE or FALSE)
  • complex numbers (remember i, the imaginary number? like 5+5i – you probably won’t come across these)

You can find out which type of data R thinks you’re working with using the function typeof(). IMPORTANT: R assumes that all elements in a vector are of the same type. A given vector cannot contain both character and numeric values: R will force all the values to a single type (in this case, characters). This will become important when we start thinking of individual columns in a dataset as vectors.

typeof(x)
myIntegers <- 1:5
typeof(myIntegers)

myCharacters <- c("cat", "dog", "idunno")
typeof(myCharacters)

myLogicals <- c(TRUE, FALSE)
myLogicals
typeof(myLogicals)

myLogicals <- c("TRUE", "FALSE")
typeof(myLogicals)

test <- 1 == 1
typeof(test)

test2 <- 1+2i
typeof(test2)
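One distinction that is easy to miss is integer versus numeric (double). As a small sketch (the variable names here are made up for illustration), a whole number typed on its own is stored as a double unless you add the L suffix, while the colon operator produces integers:

```r
# Hypothetical example values, not part of the mcm dataset
a <- 5        # typed without a suffix: stored as a double
b <- 5L       # the L suffix asks R for an integer
s <- 1:5      # the colon operator produces integers

typeof(a)     # "double"
typeof(b)     # "integer"
typeof(s)     # "integer"

# Both integers and doubles count as "numeric" for is.numeric()
is.numeric(a) # TRUE
is.numeric(b) # TRUE
```

In everyday use the difference rarely matters, but it explains why typeof() sometimes reports "double" for what looks like a whole number.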

2.1.1 Data types with respect to vectors

Recall that a vector is a one-dimensional list of elements, like a column in a spreadsheet. The elements in a vector in R must all be of the same type. If there is any ambiguity in the type of element in the vector, R uses the following ranking to ensure uniformity in the data type:

character > complex > numeric (double) > integer > logical

This means that if just one element in the vector is a character, then all the elements will be characters (even if you thought you were entering integers or numerics).

x <- c(5, "cat", 62.1)
typeof(x)
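To see the ranking at work, here is a minimal sketch that mixes two types at a time and checks which one wins. The vector name demo is just a placeholder for this example:

```r
# Each pair mixes two types; the higher-ranked type wins
typeof(c(TRUE, 2L))      # logical + integer   -> "integer"
typeof(c(2L, 3.5))       # integer + double    -> "double"
typeof(c(3.5, 1+2i))     # double + complex    -> "complex"
typeof(c(1+2i, "cat"))   # complex + character -> "character"

# One character element turns everything into characters
demo <- c(TRUE, 2L, 3.5, "cat")
demo                     # "TRUE" "2" "3.5" "cat"
typeof(demo)             # "character"
```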

2.1.2 Type coercion

Vectors can be coerced into other data types using the as.X functions (as.numeric, as.character, as.integer, as.logical):

y <- c(5, "NA", 62.1)
as.numeric(y)
y <- as.numeric(y)
y/2

You can check the data type using typeof() or is.X (is.numeric, is.character, etc.):

is.numeric(y)
is.character(y)
typeof(y)

A note on NA: NA means “not available”. It marks a missing value and will be treated as an empty cell. It doesn’t need quotes and it can coexist with numbers. It’s a special value (not a string) that takes on the type of the vector it sits in, so R won’t be annoyed when it appears among numerics. However, if NA is in quotes, like “NA”, it will be treated as a character.

The following vector will be a character vector because of the quotes around NA:

x <- c(100, 200.2, 120.6, "NA")
typeof(x)

We can force that vector to be numeric and NA is readily interpretable as simply an empty cell:

x <- as.numeric(x)
typeof(x)
new_x <- x - 5
new_x
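For comparison, here is a short sketch of what happens when NA is entered without quotes in the first place. The vector names v and shifted are invented for this example. The vector stays numeric from the start, and is.na() tells you where the empty cells are:

```r
# NA without quotes: no coercion needed
v <- c(100, 200.2, NA)
typeof(v)          # "double"

# is.na() returns TRUE wherever a value is missing
is.na(v)           # FALSE FALSE  TRUE
sum(is.na(v))      # 1 missing value

# Arithmetic carries the NA through instead of failing
shifted <- v - 5
shifted            # 95.0 195.2 NA
```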

When coercing a vector to a numeric, sometimes you need to wrap the vector in as.character() before doing the numeric conversion. This arises when the vector is actually a FACTOR in R. You don’t need to know what a factor is yet, but just know that if you try to do numeric type coercion, and R returns a bunch of unexpected 1, 2, 3, etc. to you (these are the factor’s internal category codes, not your original values), then you should try the as.numeric(as.character()) line of code instead:

z <- c(2, 20, 200)
z <- factor(z)
z
notgood <- as.numeric(z)
notgood

good <- as.numeric(as.character(z))
good
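If you are curious why this happens, the following sketch peeks inside a factor. The values are made up and deliberately out of order. A factor stores its categories as text labels (the levels) plus an integer code per element, and as.numeric() hands back the codes:

```r
# Hypothetical values, entered out of order on purpose
f <- factor(c(200, 2, 20, 2))

levels(f)                     # "2" "20" "200" (the category labels)
as.integer(f)                 # 3 1 2 1 (position of each element's label)

# Going through as.character() recovers the labels first, then the numbers
as.numeric(as.character(f))   # 200 2 20 2
```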

2.2 Datasets

2.2.1 Loading in a dataset

There are a few ways to read in a dataset. One is to use the GUI (Graphical User Interface) in RStudio:

From the Environment pane in RStudio (upper right corner), select Import Dataset –> From Text (base)…. This will let you navigate to the file and choose a name for the dataset. Give it the name “mcm”.

The alternative is to type the code and the path directly. (Note that using the Import option will automatically produce this code for you.)

An important note about paths: The path to a file is the computer address of your file. It is important that you get the address right so the computer can locate the file. A useful life tip is to keep your project files, and more generally your computer, very organized so that the paths are easy (or at least intuitive) to type out. If you have folders with spaces in their names, it seems that R can handle these. (In general, though, I would avoid labeling anything on your computer with spaces. It can make life very difficult when working with different programming languages.)

mcm <- read.csv("/Users/eleanorchodroff/A folder with spaces/QuantitativeMethods/datasets/McMurrayJongmanFricativeAcoustics.csv")
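If you want to try read.csv() without the real file, here is a self-contained sketch. It writes a tiny made-up dataset to a temporary folder and reads it back. The names toy, toy_in, and toy_example.csv are invented for this example, and the temporary location is only an assumption so that the code runs on any computer:

```r
# A tiny made-up dataset (not the real mcm data)
toy <- data.frame(Talker = c("F1", "M3"), F0 = c(210, 95))

# Build a path from its parts; file.path() inserts the separators
csv_path <- file.path(tempdir(), "toy_example.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Check that the path really points at a file before reading it
file.exists(csv_path)   # TRUE

toy_in <- read.csv(csv_path)
toy_in
```

Checking file.exists() first is a quick way to tell a mistyped path apart from a problem with the file itself.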

2.2.2 Viewing the dataset

Take a look at the dataset using View():

View(mcm)

While we won’t use these functions too frequently, you can see the top of the dataset in the console using head() and bottom using tail(). These functions are important if you ever find yourself in a situation where you can’t use RStudio and have to use a basic console.

head(mcm)
tail(mcm)

2.2.3 Getting a summary of the dataset

Get the dimensions of the dataset using nrow() (number of rows) and ncol() (number of columns):

nrow(mcm)
ncol(mcm)
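As a quick sketch on a made-up data frame (toy is a placeholder, not the mcm data), dim() gives both numbers at once, rows first and then columns:

```r
# Hypothetical three-row dataset
toy <- data.frame(Talker = c("F1", "F1", "M3"), F0 = c(210, 190, 95))

nrow(toy)   # 3
ncol(toy)   # 2
dim(toy)    # 3 2 (rows, then columns)
```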

Get a summary of all the columns in the dataset using summary(). If a column is numeric, R will tell you the quartiles, the mean, and the number of NAs, if any.

summary(mcm)

2.3 Extracting parts of the dataset

Remember that to get an element in a specific location from a vector, we could give it the index in square brackets. So if we wanted to get the 3rd element in the vector x, then we would type:

x[3]

We can extend this logic to datasets, but now we are in a two-dimensional space. The dataset has Rows and Columns, so we need to provide the coordinates in square brackets, with Rows before Columns (R before C because we’re using R). When both the row and the column are specified, R will return the individual cell.

mcm[3,2]
mcm[1000, 20]

We can also extract an entire row by leaving the column specification blank (the location AFTER the comma).

mcm[1000,]
myrow <- mcm[1000,]
View(myrow)

We can extract a subset of the dataset by specifying a RANGE of rows using the colon (:):

myrow <- mcm[1000:1100,]
View(myrow)

We can also extract an entire column by leaving the row specification blank (the location BEFORE the comma):

mycol <- mcm[,30]
View(mycol)

We can get an arbitrary set of columns using our vector notation c(). We can get columns 1, 5 and 7 by specifying it as follows:

mycol <- mcm[,c(1,5,7)]
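The same indexing patterns can be tried out on a small made-up data frame, where it is easy to check the answers by eye. This is only a sketch, and toy and its columns are invented for illustration:

```r
# Hypothetical four-row dataset
toy <- data.frame(
  Talker    = c("F1", "F1", "M3", "M3"),
  Fricative = c("s", "f", "s", "f"),
  F0        = c(210, 190, 95, 110)
)

toy[3, 3]          # one cell: row 3, column 3 (95)
toy[3, ]           # all of row 3
toy[2:3, ]         # rows 2 through 3
toy[, 3]           # all of column 3, returned as a vector
toy[, c(1, 3)]     # columns 1 and 3, returned as a smaller dataset
```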

2.3.1 The dollar sign: $

The dollar sign is incredibly useful when it comes to dealing with columns in a dataset! The dollar sign allows you to name the column that you want to refer to. For example, we can identify and extract the column “Talker” from the mcm dataset by calling mcm$Talker:

mycol <- mcm$Talker
View(mycol)
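As a brief sketch on a made-up data frame (toy is a placeholder), the dollar sign gives the same result as asking for the column by name inside brackets, so you can use whichever you find easier to read:

```r
# Hypothetical dataset
toy <- data.frame(Talker = c("F1", "M3"), F0 = c(210, 95))

toy$F0          # column by name with the dollar sign
toy[, "F0"]     # same column, name in single brackets
toy[["F0"]]     # same column, name in double brackets

identical(toy$F0, toy[, "F0"])   # TRUE
```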

2.3.2 Creating new columns in a dataset

You can create a new column based on a pre-existing column using the dollar sign and then giving the new column a name. In this example, we’ll create a new column for the duration of a fricative in seconds by taking the column with the duration of the fricative in milliseconds and dividing by 1000:

mcm$dur_f_s <- mcm$dur_f/1000
View(mcm)

2.4 Deleting columns

Delete a column by assigning NULL to it:

mcm$dur_f_s <- NULL
View(mcm)
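Here is the full create-then-delete cycle as a small sketch on a made-up data frame. The names toy, dur_ms, and dur_s are invented for this example. Checking names() before and after is a quick way to confirm what changed:

```r
# Hypothetical durations in milliseconds
toy <- data.frame(dur_ms = c(120, 85, 240))

# Create a new column from an existing one
toy$dur_s <- toy$dur_ms / 1000
names(toy)      # "dur_ms" "dur_s"
toy$dur_s       # 0.120 0.085 0.240

# Delete it again by assigning NULL
toy$dur_s <- NULL
names(toy)      # "dur_ms"
```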

2.5 Pasting strings together

We can paste strings together using the paste() function. In this example, we’ll create a column specifying the diphone (the consonant and the following vowel), by pasting together the fricative and the vowel columns with an underscore separating the two elements:

mcm$diphone <- paste(mcm$Fricative, mcm$Vowel, sep="_")
View(mcm)

The “sep=” argument refers to the character that should separate the two elements being pasted together. Since we have quotes wrapped around an underscore (“_”), the underscore will separate the two elements. If we wanted nothing separating the two elements, then we could have quotes wrapped around nothing:

mcm$diphone <- paste(mcm$Fricative, mcm$Vowel, sep="")
View(mcm)
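A minimal sketch with two made-up vectors (fric and vow are placeholders) shows that paste() works element by element, and that paste0() is a built-in shortcut for pasting with nothing in between:

```r
# Hypothetical consonants and vowels
fric <- c("s", "f", "sh")
vow  <- c("i", "a", "u")

paste(fric, vow, sep = "_")   # "s_i" "f_a" "sh_u"
paste(fric, vow, sep = "")    # "si" "fa" "shu"
paste0(fric, vow)             # "si" "fa" "shu" (same as sep = "")
```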

2.6 Simple if-else statements

If-else statements are a staple of almost all programming languages. As their name suggests, these are statements that make a certain change to an element if a condition is met; otherwise (or else), they make a different change. For example, we could create a new column in the dataset that tries to guess the gender of a speaker based on their f0. If the f0 is less than, say, 200 Hz, let’s call it “male”, otherwise (else), we’ll call it “female”.

There are a few ways to write if-else statements in R. One way is using the ifelse() function. ifelse() takes three arguments: the condition that needs to be met (mcm$F0 < 200), what to do if that condition is TRUE (indicate “male”), and what to do if the condition is FALSE (indicate “female”). Note that R is case-sensitive, so the column name must be typed exactly as it appears in the dataset (F0, not f0). Here we create a new column where we guess the gender of the speaker based on their F0:

mcm$guessG <- ifelse(mcm$F0 < 200, "male", "female")
View(mcm)
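To see exactly what ifelse() does, here is a small sketch on a made-up vector of f0 values (f0_vals and guess are invented names). The condition is checked for each element separately, and a missing value stays missing rather than being labeled either way:

```r
# Hypothetical f0 values in Hz, including one missing value
f0_vals <- c(110, 220, 95, 250, NA)

# The condition on its own gives one TRUE/FALSE per element
f0_vals < 200        # TRUE FALSE TRUE FALSE NA

guess <- ifelse(f0_vals < 200, "male", "female")
guess                # "male" "female" "male" "female" NA
```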

2.7 Cross tabulation

For character vectors, we can get the overall number of cells per category using the xtabs() function. xtabs() tabulates, or counts up, the number of cells with each unique label. Here we can look at how many rows were specified as “male” and how many as “female” in our new guessG column. Note the use of the tilde (~) in this formulation. The use of the tilde is a bit idiosyncratic in the language, and frequently means “as a function of”, but just go with it. Like natural languages, computer languages sometimes also have idiosyncratic features.

xtabs(~guessG, mcm)

We can get further breakdowns of the counts using the plus sign. The following example breaks down the counts by guessed gender and the talker who produced those values:

xtabs(~guessG + Talker, mcm)

We can also count how many tokens we have per fricative:

xtabs(~Fricative, mcm)

And how many tokens we have per fricative and talker:

xtabs(~Fricative + Talker, mcm)

And how many tokens we have per talker, fricative, and vowel:

xtabs(~Talker + Fricative + Vowel, mcm)
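On a dataset small enough to count by hand, you can check that xtabs() is doing what you expect. This is only a sketch, and toy, counts, and two_way are invented names. The result can be saved to a variable, and individual counts can be pulled out by label:

```r
# Hypothetical five-token dataset
toy <- data.frame(
  Talker    = c("F1", "F1", "M3", "M3", "M3"),
  Fricative = c("s", "f", "s", "s", "f")
)

# One-way table: tokens per fricative (f = 2, s = 3)
counts <- xtabs(~Fricative, toy)
counts

# Two-way table: fricative (rows) by talker (columns)
two_way <- xtabs(~Fricative + Talker, toy)
two_way

# Look up a single count by its labels: "s" tokens from talker M3
two_way["s", "M3"]   # 2
```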

2.8 Subsetting data

You can take a subset of the dataset using the subset() function. subset() takes two arguments: the name of the dataset, and the condition to subset on. Let’s say we just want to look at the data from Talker “M3”. We can take the subset in the following way. Note the use of the double equals sign (==), which tests whether two values are equal; a single equals sign would not work here.

m3 <- subset(mcm, Talker == "M3")
View(m3)

Let’s say we instead want to look at the dataset where the f0 value is below 75 Hz:

lowf0 <- subset(mcm, F0 < 75)
View(lowf0)

You can combine subset conditions using the ampersand (&), which means “AND”:

lowf0 <- subset(mcm, F0 < 200 & Talker == "M3")
View(lowf0)

lowf0 <- subset(mcm, F0 < 100 & Talker == "M3")
View(lowf0)
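The sketch below runs the same kind of subsetting on a made-up data frame (toy, m3_toy, low_toy, and low_toy2 are invented names). It also shows that subset() matches the square-bracket indexing from earlier, with the condition placed in the row position:

```r
# Hypothetical four-row dataset
toy <- data.frame(
  Talker = c("F1", "M3", "M3", "M3"),
  F0     = c(210, 95, 150, 70)
)

# One condition: keep only rows from talker M3
m3_toy <- subset(toy, Talker == "M3")
nrow(m3_toy)      # 3

# Two conditions joined with &: both must be TRUE for a row to be kept
low_toy <- subset(toy, F0 < 100 & Talker == "M3")
low_toy$F0        # 95 70

# The same rows using bracket notation (condition before the comma)
low_toy2 <- toy[toy$F0 < 100 & toy$Talker == "M3", ]
low_toy2$F0       # 95 70
```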

2.9 Practice

Create an R script to save the answers to these questions.

2.9.1 Practice with data types and type coercion

  1. Create a vector called ‘myNumbers’ and store three numbers in it.
  2. Create a new vector called ‘myCharacters’ that turns the myNumbers vector into characters.
  3. Create a new vector called ‘myNumbers2’ that turns myCharacters back into numbers.
  4. Create a vector called ‘mixed’ that contains a mixture of strings, numbers, and/or logicals.
  5. Figure out what the data type of the ‘mixed’ vector is.

2.9.2 Practice with datasets: basics

  1. Import the dataset ‘L2_English_Lexical_Decision_Data.csv’ into R and call it ‘lex’ for lexical decision. This data set contains reaction times (RT) in milliseconds to words and nonwords of English from L2 English speaking participants. More info about the data and project can be found here.
  2. Get the number of rows in the dataset using the console (don’t just look at the Environment window).
  3. Get the number of columns in the dataset using the console (don’t just look at the Environment window).
  4. Get a summary of the dataset.

2.9.3 Practice with datasets: extracting parts

  1. Create a new variable called ‘myRows’ that contains rows 1 to 10 from the new dataset.
  2. Create a new variable called ‘myCols’ that contains columns 3 to 5 from the new dataset.
  3. Create a vector called ‘accuracy’ that contains the accuracy column in the dataset (‘acc’). Use the dollar sign when referring to this column.

2.9.4 Practice with datasets: new columns and xtabs

  1. Let’s say your measurement of reaction time (RT) was systematically off by 20 ms. For example, the program recording RT may have started 20 ms after some event, but for the participant, the 20 ms was still technically part of the RT. Create a new column in ‘lex’ called ‘updatedRT’ and add 20 to lex$RT.
  2. Now delete the old RT column (lex$RT). If things go horribly wrong here, just reload the dataset back in to start over. :)
  3. Create another column in lex called ‘gender’. If the sex of the participant is equal to 1, then the new value in the ‘gender’ column should be “m” for male, otherwise the new value should be “f”. (See the if-else code).
  4. Get a count of how many male and female participants there are using cross-tabulation on the gender column.

2.9.5 Practice with datasets: subsets

  1. Create a subset of the lex data that includes only data from male participants. Call this new dataset lex_m.
  2. Create another subset of the lex data that includes only reaction times greater than 3000 ms. Call this new dataset sus_long for suspiciously long. How many rows (data points) are there? What is the range of this data? (use the summary() function).
  3. Create another subset of the lex data that includes only reaction times less than 100 ms. Call this new dataset sus_short for suspiciously short. How many rows (data points) are there? What is the range of this data? (use the summary() function).

You can find the answer key here.