R Language Notes
Notes on how to install and use the R statistical language.
Installing R and RStudio
The quickest way to install or update R on Debian is to run the following commands as root (or prefixed with sudo):
apt-get update
apt-get install r-base r-base-dev
Note: r-base-dev is only required to compile R packages or software that depends on R.
Having done that, download and install the latest RStudio package as follows.
Copy the link to the latest version of RStudio from https://www.rstudio.com/products/rstudio/download/.
For example, the link will look like https://download1.rstudio.org/rstudio-1.1.453-amd64.deb.
Download it using wget, for example:
wget https://download1.rstudio.org/rstudio-1.1.453-amd64.deb
Finally, install it using dpkg as follows:
sudo dpkg -i rstudio-1.1.453-amd64.deb
Installing a package
Use install.packages("package_name"), with the package name in quotes.
Loading a package
Use library(package_name). Here the package name does not need to be quoted.
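The two steps above are often combined so that a package is installed only when it is missing. The following is a minimal sketch of that pattern; the package name MASS is just an illustrative choice and not something these notes depend on.

```r
# "MASS" is only an example package name chosen for illustration
pkg <- "MASS"

# requireNamespace() returns FALSE (instead of failing) when the package is absent
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept the name stored in a variable
library(pkg, character.only = TRUE)
```

Because the name is held in a string, the same lines can be reused for any package by changing pkg.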
Viewing data available in a package
Use data(package='package_name')
Accessing data in a package
Use data(dataset_name, package='package_name')
After running this command, you can type dataset_name to view the contents of the data set.
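As a concrete illustration of the two data() calls above, the sketch below uses the datasets package that ships with R and its mtcars data set. The variable name available is introduced here only for the example.

```r
# List the data sets that ship with the built-in 'datasets' package
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load one of them into the current environment and look at the first rows
data(mtcars, package = "datasets")
head(mtcars)
```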
Viewing data
Use View(variable)
Listing objects in current environment
Use ls() or objects() to get a character vector listing the names of the objects in the current environment.
If these functions are called inside a function, they list only the names of objects local to that function.
For example, after defining:
test <- function() {internal_variable <- 'testing'; ls()}
calling test() will output:
[1] "internal_variable"
Creating a vector
Use the c() function to combine values into a list or vector, for example, ages <- c(12,14,19,28,42).
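A short sketch of a few things that follow from building vectors with c(): indexing starts at 1, vectors can be extended by combining them with further values, and mixing types coerces everything to a common type. The names more_ages and mixed are made up for this example.

```r
ages <- c(12, 14, 19, 28, 42)

length(ages)   # number of elements: 5
ages[2]        # R indexes from 1, so this is 14

# c() also appends: combine an existing vector with new values
more_ages <- c(ages, 50, 61)

# Mixing types coerces all elements to one type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
```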
Removing objects from an environment
Use the rm() or remove() functions as follows:
To remove a single object type:
rm(object_name)
To remove multiple objects type:
rm(list=c('an_object','other_object','final_object'))
To remove multiple objects matching a specific pattern, say starting with pa, type:
rm(list=ls()[grep('^pa',ls())])
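The pattern match can also be done by ls() itself through its pattern argument, which avoids the separate grep() call. Below is a small sketch; the object names pa_one, pa_two and keep_me are hypothetical and exist only for this demonstration.

```r
# Hypothetical objects created only for this demonstration
pa_one <- 1
pa_two <- 2
keep_me <- 3

# ls(pattern = ...) filters object names by regular expression
to_remove <- ls(pattern = "^pa")
print(to_remove)

# Remove only the matching objects; keep_me is left untouched
rm(list = to_remove)
```

Printing the names before calling rm() is a cheap safeguard, since removed objects cannot be recovered.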
Computing five-number summary plus mean
Suppose we have a vector of ages as follows:
ages = c(18,22,21,24,19,22,20,20,30,42)
To compute its five-number summary plus mean, just type summary(ages), which gives us:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    20.0    21.5    23.8    23.5    42.0
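R also provides fivenum() and quantile() for the five numbers on their own. The sketch below compares them on the same ages vector; note that fivenum() uses Tukey's hinges, so its upper value can differ slightly from the 3rd quartile reported by summary(). The names five_tukey and five_quant are chosen only for this example.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum)
five_tukey <- fivenum(ages)
five_tukey   # 18.0 20.0 21.5 24.0 42.0

# Default quantiles, which is what summary() reports as the quartiles
five_quant <- quantile(ages)
five_quant   # 0%: 18.0, 25%: 20.0, 50%: 21.5, 75%: 23.5, 100%: 42.0
```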
Computing sample mean and standard deviation
Using the above vector ages, you can easily compute the sample mean using mean(ages) and sample standard deviation using sd(ages), which return 23.8 and 7.223 respectively.
We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.
\[\bar x=\frac{\sum_{i=1}^{n}x_i}{n}\]
\[s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}\]
> ages
[1] 18 22 21 24 19 22 20 20 30 42
> x_bar = sum(ages)/length(ages)
> x_bar
[1] 23.8
> var = sum((ages - x_bar)^2)/(length(ages) - 1)
> var
[1] 52.17778
> s = sqrt(var)
> s
[1] 7.223419
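Rather than comparing the printed digits by eye, the manual results can be checked against the built-in functions programmatically. The following is a sketch using all.equal(), which compares numbers up to a small floating-point tolerance; the name manual_var is introduced here to avoid reusing the name of the built-in var() function.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Manual computation, following the formulas for the sample mean and sample variance
x_bar <- sum(ages) / length(ages)
manual_var <- sum((ages - x_bar)^2) / (length(ages) - 1)
s <- sqrt(manual_var)

# Compare with the built-in functions, allowing for floating-point error
all.equal(x_bar, mean(ages))
all.equal(manual_var, var(ages))
all.equal(s, sd(ages))
```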
Quick introduction to R
A comprehensive quick introduction to the R language can be found at https://www.statmethods.net.
If you prefer a more hands-on approach, you might find DataCamp’s interactive free course, Introduction to R, more interesting.
Basic data frame filtering
Suppose we have a data frame, people, with 2 columns, country and age.
> country = c("Italy","Greece","Spain","Italy","France","Italy")
> age = c(20,43,16,20,25,34)
> people = data.frame(country,age)
> people
  country age
1   Italy  20
2  Greece  43
3   Spain  16
4   Italy  20
5  France  25
6   Italy  34
To extract only the records for a specific country, say Italy, do the following:
> people[people$country == "Italy", ]
  country age
1   Italy  20
4   Italy  20
6   Italy  34
Since nothing is specified after the comma, R retrieves all columns, i.e. all record fields. If we need only the age of people from Italy, we specify the column age after the comma, as follows:
> people[people$country == "Italy", "age"]
[1] 20 20 34
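The same bracket syntax extends to combined conditions, and base R's subset() offers a more compact way to write such filters. Below is a sketch on the same people data frame; the result names italy_over_25 and south are made up for the example.

```r
people <- data.frame(
  country = c("Italy", "Greece", "Spain", "Italy", "France", "Italy"),
  age     = c(20, 43, 16, 20, 25, 34)
)

# Combine conditions with & (and) or | (or); subset() evaluates column names directly
italy_over_25 <- subset(people, country == "Italy" & age > 25)
italy_over_25

# %in% matches against several values at once
south <- people[people$country %in% c("Italy", "Spain"), ]
south
```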
Check proportion of samples falling within range
There is more than one method to compute the proportion of values falling within a specific range. Each method uses the fact that R generates a boolean vector whenever a vector is compared to a value. We will use the mtcars built-in dataset in the code below.
> data("mtcars")
> dim(mtcars)
[1] 32 11
> mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
If we compare the mpg vector against 20 to identify all values below 20 mpg, R will generate a boolean vector as follows, with TRUE in place of values that satisfy our condition.
> mtcars$mpg < 20
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[14]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[27] FALSE FALSE  TRUE  TRUE  TRUE FALSE
If we call sum on this boolean vector, it counts the number of TRUE values, since TRUE is treated as 1 and FALSE as 0.
> sum(mtcars$mpg < 20)
[1] 18
If we divide the above sum by the length of the vector, we get the proportion of values satisfying a condition.
> sum(mtcars$mpg < 20) / length(mtcars$mpg)
[1] 0.5625
The same approach can be used to compute proportions of values falling within a specific range. All we need to do is use logical operators such as &. Below we compute the proportion of cars that have miles per gallon (mpg) falling in the range $\left[20,25\right]$.
> sum(mtcars$mpg >= 20 & mtcars$mpg <= 25) / length(mtcars$mpg)
[1] 0.25
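This could also be computed using the mean function, since the mean of a boolean vector is the proportion of TRUE values. Subtracting the proportion of values below 20 from the proportion of values at or below 25 leaves the proportion in the range, as follows.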
> mean(mtcars$mpg <= 25) - mean(mtcars$mpg < 20)
[1] 0.25
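If this check is needed repeatedly, it can be wrapped in a small helper. The function below is a hypothetical sketch (prop_in_range is not a built-in R function) that takes the mean of the boolean vector directly, treating both bounds as inclusive.

```r
# Hypothetical helper: proportion of values in the closed range [lower, upper]
prop_in_range <- function(x, lower, upper) {
  # The mean of a logical vector is the share of TRUE values
  mean(x >= lower & x <= upper)
}

data("mtcars")
prop_in_range(mtcars$mpg, 20, 25)   # 0.25
mean(mtcars$mpg < 20)               # 0.5625
```

If the data may contain missing values, passing na.rm = TRUE to mean() inside the helper would be one way to ignore them.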