R Language Notes
Notes on how to install and use the R statistical language.
Installing R and RStudio
The easiest and quickest way to install or update R on Debian is to execute the following commands:
sudo apt-get update
sudo apt-get install r-base r-base-dev
Note: r-base-dev is only required to compile R packages or software that depends on R.
Having done that, download and install the latest RStudio package as follows.
Copy the link to the latest version of RStudio from https://www.rstudio.com/products/rstudio/download/.
The link will look something like https://download1.rstudio.org/rstudio-1.1.453-amd64.deb.
Download it using wget, for example: wget https://download1.rstudio.org/rstudio-1.1.453-amd64.deb
Finally, install it using dpkg as follows: sudo dpkg -i rstudio-1.1.453-amd64.deb
Installing a package
Use install.packages("package name")
Loading a package
Use library(package_name), where package_name is the unquoted name of an installed package.
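The two steps above are often combined into an "install only if missing" pattern. The following is a minimal sketch of that idea; the package name stats is an assumption chosen only because it ships with every R installation, so nothing is downloaded when the example runs.

```r
# "stats" is a stand-in package name (always installed with R)
pkg <- "stats"

# requireNamespace() returns FALSE instead of failing when a package is missing
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
```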
Viewing data available in a package
Use data(package='package name')
Accessing data in a package
Use data(dataset_name, package='package name')
Following this command, you can type dataset_name to view the data set contents.
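As a concrete sketch of the two commands above, the example below uses the datasets package, which ships with base R. The choice of mtcars is only for illustration.

```r
# List the data sets bundled with the 'datasets' package
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])

# Load one data set by name and inspect it
data(mtcars, package = "datasets")
head(mtcars)   # first six rows
str(mtcars)    # column types and a preview of the values
```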
Viewing data
Use View(variable)
Listing objects in current environment
Use ls() or objects() to get a character vector listing the names of the objects in the current environment.
When called inside a function, they list only the names of the objects defined within that function.
For example, given:
test <- function() {internal_variable <- 'testing'; ls()}
calling test() will output:
[1] "internal_variable"
Creating a vector
Use the c() function to combine values into a vector or list, for example, ages <- c(12,14,19,28,42).
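A short sketch of how c() behaves may help here. The variable names mixed and more_ages are made up for this illustration. Note that an atomic vector holds a single type, so mixing types makes R coerce every element to the most general one.

```r
ages <- c(12, 14, 19, 28, 42)
length(ages)                     # 5
ages[2]                          # 14 (indexing starts at 1)

# Mixing types coerces all elements to one common type
mixed <- c(1, "a", TRUE)
class(mixed)                     # "character"

# c() also concatenates existing vectors with new values
more_ages <- c(ages, 50, 61)
```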
Removing objects from an environment
Use the rm() or remove() function as follows:
To remove a single object type:
rm(object_name)
To remove multiple objects type:
rm(list=c('an_object','other_object','final_object'))
To remove multiple objects matching a specific pattern, say starting with pa, type:
rm(list=ls()[grep('^pa',ls())])
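Pattern-based removal is easy to get wrong, so it can be worth trying it in a throwaway environment first. The sketch below assumes a scratch environment and three made-up object names. It also uses the pattern argument of ls(), which filters names by regular expression and is an alternative to the grep call above.

```r
# 'scratch' is a sandbox so the global environment stays untouched
scratch <- new.env()
assign("pa_one", 1, envir = scratch)
assign("pa_two", 2, envir = scratch)
assign("other", 3, envir = scratch)

# Remove every object whose name starts with "pa"
rm(list = ls(envir = scratch, pattern = "^pa"), envir = scratch)

remaining <- ls(envir = scratch)   # only "other" is left
```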
Computing five-number summary plus mean
Suppose we have a vector of ages as follows:
ages = c(18,22,21,24,19,22,20,20,30,42)
To compute its five-number summary plus mean, just type summary(ages), which gives us:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    20.0    21.5    23.8    23.5    42.0
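The quartiles reported by summary() come from quantile() with its default settings. Base R also provides fivenum(), which uses Tukey's hinges and can therefore give slightly different quartiles on small samples. The sketch below compares the two on the same vector.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)

# Default quantiles, as used by summary()
q <- quantile(ages)   # 0%: 18, 25%: 20, 50%: 21.5, 75%: 23.5, 100%: 42

# Tukey's five-number summary; the upper hinge here is 24, not 23.5
f <- fivenum(ages)    # 18.0 20.0 21.5 24.0 42.0
```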
Computing sample mean and standard deviation
Using the above vector ages, you can easily compute the sample mean using mean(ages) and the sample standard deviation using sd(ages), which return 23.8 and 7.223 respectively.
We can double-check these statistics by computing them using the following formulas for sample mean $\bar x$ and standard deviation $s$.
\[\bar x=\frac{\sum_{i=1}^{n}x_i}{n}\]
\[s=\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}}\]
> ages
[1] 18 22 21 24 19 22 20 20 30 42
> x_bar = sum(ages)/length(ages)
> x_bar
[1] 23.8
> var = sum((ages - x_bar)^2)/(length(ages) - 1)
> var
[1] 52.17778
> s = sqrt(var)
> s
[1] 7.223419
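The same check can be written more compactly with all.equal(), which compares numbers within a small tolerance. This is a sketch only. The names manual_sd and pop_sd are made up for illustration, and the population version (dividing by n) is included just to show why the n - 1 denominator matters.

```r
ages <- c(18, 22, 21, 24, 19, 22, 20, 20, 30, 42)
n <- length(ages)

# Sample standard deviation from the formula (divides by n - 1)
manual_sd <- sqrt(sum((ages - mean(ages))^2) / (n - 1))

all.equal(manual_sd, sd(ages))     # TRUE
all.equal(var(ages), sd(ages)^2)   # TRUE: var() is the built-in sample variance

# Population standard deviation divides by n, so it is slightly smaller
pop_sd <- sqrt(sum((ages - mean(ages))^2) / n)
```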
Quick introduction to R
A comprehensive quick introduction to the R language can be found at https://www.statmethods.net.
If you prefer a more hands-on approach, you might find DataCamp’s free interactive course, Introduction to R, more interesting.
Basic data frame filtering
Suppose we have a data frame, people, with 2 columns, country and age.
> country = c("Italy","Greece","Spain","Italy","France","Italy")
> age = c(20,43,16,20,25,34)
> people = data.frame(country,age)
> people
  country age
1   Italy  20
2  Greece  43
3   Spain  16
4   Italy  20
5  France  25
6   Italy  34
To extract only the records for a specific country, say Italy, do the following:
> people[people$country == "Italy", ]
  country age
1   Italy  20
4   Italy  20
6   Italy  34
Since nothing is specified after the comma, R retrieves all columns, i.e. all record fields. If we need only the age of people from Italy, we specify the column age after the comma, as follows:
> people[people$country == "Italy", "age"]
[1] 20 20 34
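A few variations on the same filtering idea are sketched below, using the people data frame from above. The result names (older_italians, italy_ages_df, italy_ages) are made up for this example.

```r
country <- c("Italy", "Greece", "Spain", "Italy", "France", "Italy")
age <- c(20, 43, 16, 20, 25, 34)
people <- data.frame(country, age)

# Combine conditions with & (and) or | (or)
older_italians <- people[people$country == "Italy" & people$age > 25, ]

# drop = FALSE keeps a one-column data frame instead of a plain vector
italy_ages_df <- people[people$country == "Italy", "age", drop = FALSE]

# subset() expresses the same filter without repeating people$
italy_ages <- subset(people, country == "Italy", select = age)
```

Bracket indexing is usually the safer choice inside functions, while subset() tends to read better in interactive work.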
Check proportion of samples falling within range
There is more than one method to compute the proportion of values falling within a specific range. Each method uses the fact that R generates a boolean vector whenever a vector is compared to a value. We will use the mtcars built-in dataset in the code below.
> data("mtcars")
> dim(mtcars)
[1] 32 11
> mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
If we compare the mpg vector against 20 to identify all values below 20 mpg, R generates a boolean vector as follows, with TRUE in place of the values that satisfy our condition.
> mtcars$mpg < 20
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[14] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[27] FALSE FALSE TRUE TRUE TRUE FALSE
If we call sum on this boolean vector, it counts the number of TRUE values, since TRUE is treated as 1 while FALSE is treated as 0.
> sum(mtcars$mpg < 20)
[1] 18
If we divide the above sum by the length of the vector, we get the proportion of values satisfying a condition.
> sum(mtcars$mpg < 20) / length(mtcars$mpg)
[1] 0.5625
The same approach can be used to compute proportions of values falling within a specific range. All we need to do is use logical operators such as &. Below we compute the proportion of cars that have miles per gallon (mpg) falling in the range $\left[20,25\right]$.
> sum(mtcars$mpg >= 20 & mtcars$mpg <= 25) / length(mtcars$mpg)
[1] 0.25
This could also be computed using the mean function, since the mean of a boolean vector is the proportion of its TRUE values:
> mean(mtcars$mpg <= 25) - mean(mtcars$mpg < 20)
[1] 0.25
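The range check can be wrapped in a small helper. The function below is a hypothetical sketch, not part of base R. Its name prop_in_range and its arguments are made up for illustration. It also shows one thing the examples above do not cover: a missing value makes the proportion NA unless it is explicitly dropped.

```r
# prop_in_range is a hypothetical helper, not a base R function.
# It returns the share of values in the closed interval [lower, upper].
prop_in_range <- function(x, lower, upper, na.rm = FALSE) {
  mean(x >= lower & x <= upper, na.rm = na.rm)
}

prop_in_range(mtcars$mpg, 20, 25)                    # 0.25

# A missing value propagates unless na.rm = TRUE
prop_in_range(c(21, NA, 30), 20, 25)                 # NA
prop_in_range(c(21, NA, 30), 20, 25, na.rm = TRUE)   # 0.5
```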