Reading data to a data frame and performing basic operations
Overview
Teaching: 30 min
Exercises: 0 minQuestions
What is a data frame?
How do I access subsets of a data frame?
How do I calculate simple statistics like mean and median?
How can I access documentation?
How do I install a package?
How do I reshape data?
Objectives
Select individual values and subsections from data.
Perform operations on a data frame of data.
Be able to install a package from CRAN.
Be aware of useful packages for working with data frames.
Read data into R
Now that we know how to assign things to variables and use functions, let’s read some yeast OD growth data into R using read.table
and briefly examine the dataset.
growth <- read.table(file = "data/yeast-growth.csv", header = TRUE, sep = ",")
Loading Data Without Headers
What happens if you put
header = FALSE
? The default value isheader = TRUE
?. What do you expect will happen if you leave the default value? Before you run any code, think about what will happen to the first few rows of your data frame, and its overall size. Then run the following code and see if your expectations agree:head(read.table(file = "data/yeast-growth.csv", header = FALSE, sep = ","))
V1 V2 V3 V4 V5 1 well timepoint od concentration_level concentration 2 a 1 0.017 low 0.01 3 b 1 0.017 low 0.03 4 c 1 0.018 medium 1 5 d 1 0.017 medium 3 6 e 1 0.017 medium 30
Where is that file? Or, what is my working directory?
R is always running inside a specific directory, the working directory. Paths can be given relative to that directory so with
data/yeast-growth.csv
we mean ‘the fileyeast-growth.csv
in thedata
directory that is right at the working directory. Set the working directory using RStudioSession
>Set Working Directory..
orsetwd()
Reading Files from Different ‘Locales’
Depending on what countrys standard your computer is set to (the ‘locale’), software such as Excel will use different characters to separate fields. E.g., the default for a computer with UK defaults will be to use ‘;’ to separate fields and ‘,’ to separate thousands. Try finding the right arguments to
read.table
to get something sensible out ofdata/example-us.txt
anddata/example-dk.txt
.Solution
read.table('data/example-uk.txt', sep=',', header=TRUE)
name height age income Ian 183 27 12 1e-02 Peter 162 28 11 5e-01 Bernhard 173 30 10 0e+00 Steven 163 32 12 5e+02
read.table('data/example-dk.txt', sep=';', dec=',', header=TRUE)
name height age income 1 Ian 183 27 12000.01 2 Peter 162 28 11100.50 3 Bernhard 173 30 11000.00 4 Steven 163 32 12500.00
Reading excel files is not natively supported in R so we need to use a special package for that, readxl
is recommended.
library(readxl)
read_xlsx('data/example-dk.xlsx')
# A tibble: 4 x 4
name height age income
<chr> <dbl> <dbl> <dbl>
1 Ian 183 27 12000
2 Peter 162 28 11100
3 Bernhard 173 30 11000
4 Steven 163 32 12500
Working with data frames
Now that our data is loaded in memory, we can start doing things with it.
First, let’s ask what type of thing growth
is:
head(growth)
well timepoint od concentration_level concentration
1 a 1 0.017 low 1e-02
2 b 1 0.017 low 3e-02
3 c 1 0.018 medium 1e+00
4 d 1 0.017 medium 3e+00
5 e 1 0.017 medium 3e+01
6 f 1 0.016 high 1e+02
str(growth) # what data types are the different columns?
'data.frame': 455 obs. of 5 variables:
$ well : Factor w/ 7 levels "a","b","c","d",..: 1 2 3 4 5 6 7 1 2 3 ...
$ timepoint : int 1 1 1 1 1 1 1 2 2 2 ...
$ od : num 0.017 0.017 0.018 0.017 0.017 0.016 0.015 0.015 0.018 0.021 ...
$ concentration_level: Factor w/ 3 levels "high","low","medium": 2 2 3 3 3 1 1 2 2 3 ...
$ concentration : num 1e-02 3e-02 1e+00 3e+00 3e+01 1e+02 3e+02 1e-02 3e-02 1e+00 ...
class(growth)
[1] "data.frame"
The output tells us that is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
We can see the shape, or dimensions, of the data frame with the function dim
:
dim(growth)
[1] 455 5
This tells us that our data frame, growth
, has 455 rows and 5 columns.
If we want to get a single value from the data frame, we can provide an index in square brackets, just as we do in math:
# first value in dat
growth[1, 1]
[1] a
Levels: a b c d e f g
# middle value in dat
growth[30, 2]
[1] 5
An index like [30, 2]
selects a single element of a data frame, but we can select whole sections as well.
For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
growth[1:4, 1:2]
well timepoint
1 a 1
2 b 1
3 c 1
4 d 1
We can use the function c
, which stands for concatenate, to select non-contiguous values:
growth[c(3, 8, 37, 56), c(1, 3)]
well od
3 c 0.018
8 a 0.015
37 b 0.020
56 g 0.024
We also don’t have to provide a subset for either the rows or the columns.
If we don’t include a subset for the rows, R returns all the rows; if we don’t include a subset for the columns, R returns all the columns.
If we don’t provide a subset for either rows or columns, e.g. growth[, ]
, R returns the full data frame.
growth[5, ]
well timepoint od concentration_level concentration
5 e 1 0.017 medium 30
Addressing Columns by Name (the better way)
Columns can also be addressed by name, with either the
$
operator (ie.growth$medium
) or square brackets (ie.growth[,"medium"]
). You can learn more about subsetting by column name in this supplementary lesson.
Particularly useful is also to user other vectors as filters and only return the rows that evaluate to TRUE
. Here, growth$well == "a"
gives a vector with TRUE
or FALSE
for every element in growth$well
that is equal to "a"
.
head(growth[growth$well == "e",])
well timepoint od concentration_level concentration
5 e 1 0.017 medium 30
12 e 2 0.019 medium 30
19 e 3 0.016 medium 30
26 e 4 0.022 medium 30
33 e 5 0.022 medium 30
40 e 6 0.022 medium 30
Now let’s perform some common mathematical operations to learn about our growth curves.
max(growth[growth$well == "e", "od"])
[1] 0.065
Forcing Conversion
Note that R may return an error when you attempt to perform similar calculations on subsetted rows of data frames. This is because some functions in R automatically convert the object type to a numeric vector, while others do not (e.g.
max(growth[1, ])
works as expected, whilemean(growth[1, ])
returns an error). You can fix this by including an explicit call toas.numeric()
, e.g.mean(as.numeric(growth[1, ]))
(but mostly this is not what you want to do). By contrast, calculations on subsetted columns always work as expected, since columns of data frames are already defined as vectors.
Particularly useful is also to user other vectors as filters and only return the rows that evaluate to TRUE
. Here, growth$well
R also has functions for other common calculations, e.g. finding the minimum, mean, and standard deviation of the data:
min(growth[growth$well == "e", "od"])
[1] 0.016
mean(growth[growth$well == "e", "od"])
[1] 0.04347692
sd(growth[growth$well == "e", "od"])
[1] 0.01259974
We may want to compare the different wells and for that we can use the split-apply approach which is very common in R. A common approach is to first split the data:
splitData <- split(growth$od, growth$well)
## and then apply a function
sapply(splitData, max)
a b c d e f g
0.237 0.233 0.231 0.221 0.065 0.040 0.034
## or in one go
tapply(growth$od, growth$well, max)
a b c d e f g
0.237 0.233 0.231 0.221 0.065 0.040 0.034
There are many more apply
style functions among which lapply
for applying functions to elements of lists, apply
for applying functions to rows or columns of a matrix.
Gentle introduction to dplyr and tidyr
Two great packages for doing much more advanced things with data frame are dplyr
and tidyr
which together overlaps a lot with Python pandas but it is not practical to use R completely without these so let’s cover the basics.
As an example, let’s take some mildly messy data and make it easier to use in R. The exact process for cleaning up data is of course entirely dependent on the problem with your data but this example demonstrates the basic process.
Original data looks like this
messy <- read.csv("data/yeast-growth-messy.csv")
messy[, 1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Test1 low 1e-02 0.017 0.015 0.016 0.018 0.022 0.021 0.025
2 Test1 low 3e-02 0.017 0.018 0.015 0.019 0.021 0.020 0.024
3 Test1 medium 1e+00 0.018 0.021 0.016 0.020 0.024 0.023 0.025
4 Test1 medium 3e+00 0.017 0.015 0.017 0.018 0.022 0.022 0.024
5 Test1 medium 3e+01 0.017 0.019 0.016 0.022 0.022 0.022 0.024
6 Test1 high 1e+02 0.016 0.015 0.015 0.018 0.021 0.020 0.023
7 Test1 high 3e+02 0.015 0.016 0.015 0.018 0.021 0.019 0.021
Problems
- OD measurements for each replicate are on different columns
- There are no sensical header so we would have to work with “messy$V1” etc
Let’s fix this using dplyr and tidyr.
library(dplyr)
library(tidyr)
Both these packages make use of the %>%
operator which allows you to chain functions with each other, e.g. instead of.
mean(rnorm(10))
we can write
rnorm(10) %>%
mean()
For this simple example, the first might look easier but with many calls it really helps readability.
tidy <- as.tbl(messy) %>%
dplyr::mutate(timepoint=1:7) %>%
tidyr::gather(well, od, -V1, -V2, -V3, -timepoint) %>%
dplyr::rename(concentration_level=V2) %>%
dplyr::rename(concentration=V3) %>%
dplyr::select(-V1)
tidy
# A tibble: 455 x 5
concentration_level concentration timepoint well od
<fctr> <dbl> <int> <chr> <dbl>
1 low 1e-02 1 V4 0.017
2 low 3e-02 2 V4 0.017
3 medium 1e+00 3 V4 0.018
4 medium 3e+00 4 V4 0.017
5 medium 3e+01 5 V4 0.017
6 high 1e+02 6 V4 0.016
7 high 3e+02 7 V4 0.015
8 low 1e-02 1 V5 0.015
9 low 3e-02 2 V5 0.018
10 medium 1e+00 3 V5 0.021
# ... with 445 more rows
How could we reverse the process if we for some reason wanted a wide format again?
unite(tidy, key, concentration_level, concentration, timepoint) %>%
spread(key, od)
# A tibble: 65 x 8
well high_100_6 high_300_7 low_0.01_1 low_0.03_2 medium_1_3 medium_3_4
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 V10 0.023 0.021 0.025 0.024 0.025 0.024
2 V11 0.024 0.024 0.028 0.027 0.027 0.026
3 V12 0.023 0.022 0.026 0.025 0.025 0.025
4 V13 0.025 0.024 0.028 0.028 0.028 0.028
5 V14 0.027 0.025 0.030 0.031 0.029 0.030
6 V15 0.029 0.027 0.032 0.032 0.031 0.032
7 V16 0.029 0.027 0.033 0.033 0.035 0.034
8 V17 0.032 0.028 0.037 0.037 0.038 0.036
9 V18 0.033 0.028 0.041 0.038 0.038 0.038
10 V19 0.035 0.031 0.042 0.042 0.041 0.041
# ... with 55 more rows, and 1 more variables: medium_30_5 <dbl>
Key Points
Use
read.table
andwrite.table
to import / export data.The function
str
describes the data frame.Use
object[x, y]
to select a single element from a data frame.Use
from:to
to specify a sequence that includes the indices fromfrom
toto
.All the indexing and slicing that works on data frames also works on vectors.
Use
#
to add comments to programs.Use
mean
,max
,min
andsd
to calculate simple statistics.Use split-apply to calculate statistics across the groups in a data frame.
Use dplyr/tidyr in R for manipulating data frames