Knowing how to reach the help files is important for using any package. There are two easy ways to do this in R. The first is to type, for example,
help.search("sort")
which will list all functions that contain the word "sort" in their description. The second is to type
help(sort)
when you already know the name of a function, the one named "sort" in this case, and you want to know what it can do and how you have to supply the arguments.
Try to search for some statistical terms you know, and take a look at the functions where they appear.
While the help pages are not intended to be an encyclopedia on statistics, they contain much helpful commentary on the methods whose implementation they document. It can help enormously, before launching into use of an R function, to check the relevant help page!
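A small sketch of two related lookups, assuming a standard R installation. The functions apropos() and args() are part of base R but are not used elsewhere in this tutorial, so treat them as optional extras:

```r
# Names of all visible objects whose name contains "sort"
sort_names <- apropos("sort")
print(sort_names)

# Show the arguments of a function without opening its help page
sort_args <- args(sort)
print(sort_args)
```

apropos() searches object names only, whereas help.search() also searches the descriptions in the help pages.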
It is quite easy to type data directly into R, or to construct lists of observations on a variable. In general, missing values in R are coded "NA". For instance, let's type in data on numbers of beetles caught in pitfall traps:
trapped<-c(0,1,2,1,0,3,3)
We type c() around the observations because c() combines them into a single vector (the "c" stands for "combine"). Another way of doing the same uses the scan() function. In that case, you type in the observations yourself at the prompt. Try it by copying and pasting the code below into the R console [the function scan cannot be used in Rweb].
testtrapped<-scan()
Supply the observations separated by spaces or returns. Hitting return on an empty line (that is, hitting return twice) ends the input.
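Because scan() normally waits for keyboard input, it is awkward to demonstrate in a script. A minimal sketch that feeds it the same numbers from a string instead, using the text argument of scan() (the values are the beetle counts used in this tutorial):

```r
# Read numbers from a character string instead of from the keyboard.
# quiet = TRUE suppresses the "Read 7 items" message.
testtrapped <- scan(text = "0 1 2 1 0 3 3", quiet = TRUE)

testtrapped          # the seven observations
length(testtrapped)  # number of observations read
```

Typed interactively, scan() would return the same vector once the input is ended with an empty line.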
Different variables can have similar names. You can, for example, use a dot to give the names a bit more structure. Assume that you have a second set of trappings c(4,1,2,1,0,3,3) that differs only in a single observation from the first one. An economical way to make the second vector is:
Typing ls() lists all the objects stored in memory, which isn't very much because Rweb doesn't preserve data between calls.
So from now on we will turn to the R console instead of Rweb, by starting R on our computer and copying and pasting the R code from this document into the R command window [console].
By writing trapped.day2[1] we access the first element of the vector trapped.day2.
trapped<-c(0,1,2,1,0,3,3)
trapped.day1 <- trapped # make a new variable from old
trapped.day2 <- trapped.day1 # make a second from old
trapped.day2[1] <- 4 # modify trapped.day2
trapped.day2 # take a look at it
trapped.day2[2] # prints element two.
trapped.day2[c(1,2,3)] # prints elements 1,2,3
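A short sketch of a few more ways to index a vector, under the assumption that the vectors are built as in the lines above. The names first_three, without_first and large are made up for this illustration:

```r
trapped <- c(0, 1, 2, 1, 0, 3, 3)
trapped.day2 <- trapped          # copy the vector
trapped.day2[1] <- 4             # change the copy only

first_three   <- trapped.day2[1:3]               # elements 1, 2, 3
without_first <- trapped.day2[-1]                # everything except element 1
large         <- trapped.day2[trapped.day2 > 2]  # elements larger than 2

trapped[1]   # the original vector is unchanged
```

A negative index drops elements, and a logical condition inside the brackets keeps only the elements for which it is TRUE.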
Beware: most datasets in R are stored as data matrices (data frames), and for those a single index such as [1] no longer selects a single observation.
We see below how we can access the variables within such a data matrix. R also has a list of datasets that come with the program. You can view the list by typing
data()
Or, if you downloaded a lot of R data packages from the internet, you can list them all by typing:
data(package = .packages(all.available = TRUE))
To read in a dataset from that list, say the dataset "trees", type data(trees).
The dataframe trees contains three variables. We can access the first variable "Girth" by copying into the R console:
data(trees)
ls() # now contains trees
summary(trees)
trees$Girth
or
trees[,1] # note the comma
A data matrix has rows and columns, therefore two indices (separated by a comma) are necessary to code each element. Each variable has a separate column usually, and each observation is on a row.
trees[1,] # gives observation one
trees[1,1] # returns the first variable of observation one.
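A brief sketch of row and column indexing on the built-in trees data, assuming the datasets package that ships with R is available. The object names (by_name, first_rows, tall and so on) are made up for this illustration:

```r
data(trees)
dim(trees)     # number of rows and columns
names(trees)   # variable names

# Three equivalent ways to get the first variable
by_name     <- trees$Girth
by_position <- trees[, 1]
by_label    <- trees[, "Girth"]
identical(by_name, by_position)

# Rows 1 to 3 of two selected variables
first_rows <- trees[1:3, c("Girth", "Height")]

# All observations (rows) with Height above 80
tall <- trees[trees$Height > 80, ]
```

Leaving the part before or after the comma empty means "all rows" or "all columns" respectively.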
You can also use R to build a data-frame interactively by typing:
datanew <- edit(data.frame())
at the command prompt. R will open a new window with a spreadsheet-like data editor in which you can add or delete columns, insert or delete rows, and enter your data.
We now give an easy way to read data into R and, at the same time, save them in a data object.
Prepare the datafile in, for example, the Excel software program. In that spreadsheet, the first line of the data matrix will contain the names of the variables (single words). This line is called the header line.
Every observation is written on a separate line below the header line.
Important -> Save your spreadsheet as a text file (e.g. "filename.txt").
In R: Change the working directory to the directory where the data are.
Use the R menu Misc [Change Working Directory on Mac, Unix or Linux] or File [Change dir on Windows] for that. Or you can use the R function setwd on the command line.
You can check which files are present in the working directory by typing:
dir()
Now type the following command at the command line:
dataset<-read.table("filename.txt", header=TRUE)
You can also use the full path to the file, like so:
dataset<-read.table("C:\\directory\\subdirectory\\filename.txt", header=T)
Note the double back-slash ( \\ ) to separate the names of the directories. A single back-slash is used in R as an escape character, for instance to indicate a new line command in "\n", or a tab as in "\t".
Also note the argument TRUE (or simply T) for the header, indicating that the first row in the data matrix contains the names of the variables!
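To try read.table() without preparing a spreadsheet first, here is a self-contained sketch that writes a tiny tab-separated text file to a temporary directory and reads it back. The file contents and the variable names loc and count are invented for this illustration:

```r
# Hypothetical data: a header line followed by three observations
datafile <- file.path(tempdir(), "filename.txt")
writeLines(c("loc\tcount",
             "A\t0",
             "B\t3",
             "C\t1"), datafile)

# header = TRUE: the first line holds the variable names
dataset <- read.table(datafile, header = TRUE)

dim(dataset)     # rows (observations) and columns (variables)
names(dataset)   # variable names taken from the header line
dataset$count    # one variable
```

On Windows, forward slashes also work as directory separators in a path (for example "C:/directory/subdirectory/filename.txt"), which avoids the doubled back-slashes.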
The data will be read in as a data-frame with name "dataset". You can check that this object exists by typing:
ls()
The function ls prints all objects in your workspace. Dataset should be among the objects listed. Or get a summary of what dataset contains as follows:
summary(dataset)
The command
dim(dataset)
gives the number of rows (observations) and columns (variables) in it.
If you type the following command, the variables in the object dataset can be accessed directly by giving their names:
attach(dataset)
Then you can list the variables in the object named dataset:
ls("dataset")
or
names(dataset)
Try this for a variable within dataset you have read in. For instance type loc at the prompt, if "dataset" contains a variable with that name. You can also type:
Girth # not found ....
# let's take a look at the names of the variables present in trees
summary(trees)
attach(trees) # and attach them...
Girth # aha! now we can see the content of Girth...
The function detach() removes a dataset from the search path as soon as you no longer need direct access to its variables. This avoids mixing up variables from different data objects.
detach(trees)
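The effect of attach() and detach() can be checked with exists() and search(), both base R functions. A minimal sketch, assuming no object called Girth is already present in your workspace:

```r
data(trees)

before <- exists("Girth")           # FALSE: Girth is hidden inside trees
attach(trees)
during <- exists("Girth")           # TRUE: variables are now visible by name
on_path <- "trees" %in% search()    # trees is listed on the search path
detach(trees)
after <- exists("Girth")            # FALSE again after detaching

c(before = before, during = during, after = after)
```

Without attaching, the same variable is always reachable as trees$Girth.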
See also Reading Excel spreadsheets [CRAN docs]
You can save all the work [commands, results] you did in the Console as a text file by opening the File menu and clicking Save or Save As... Use the latter option the first time you save your session, as it allows you to give the session a name.
Another option is to save your work as an R Workspace by using the Workspace menu. All the data that you entered and manipulated are saved, and you can load this workspace later if you want to continue your work: the data will still be there.
See also: Export to text files [CRAN docs]
dataset <- read.table("data.txt", header=T) # read data from working directory
attach(dataset) # attach the names in first row as variable names
jpeg(filename="barpl.jpg") # write results of next graphic command to a jpg file
barplot(table(dataset[,2])) # make a barplot from tabulated second variable
dev.off() # close the graphics device
You can use the png() function if you want to write a file in .png format, or postscript() for a file in .eps format [see help(postscript)].
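The same open, plot, close pattern works for every file-based graphics device. A self-contained sketch using pdf(), which is swapped in here only because it needs no extra system libraries; the output file name and the use of a temporary directory are assumptions for this illustration:

```r
trapped <- c(0, 1, 2, 1, 0, 3, 3)

# Open a pdf device that writes to a file in the temporary directory
outfile <- file.path(tempdir(), "barpl.pdf")
pdf(file = outfile)

barplot(table(trapped))   # the plot goes to the file, not to the screen

invisible(dev.off())      # close the device so the file is completed

file.exists(outfile)      # check that the file was written
```

Forgetting dev.off() is a common cause of empty or unreadable graphics files.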
On the other hand, when you are on a Mac, you can use the Save as... command from the File menu in the menu-bar to save the picture in pdf format.
When on a Windows PC, you can also use the Save as... command from the File menu in the menu-bar of the graphics window, which offers a choice of formats.
There are a lot of data available for R on the internet. Sometimes these data are used as examples for a book, for instance the DAAG package for the book Data Analysis and Graphics using R by John Maindonald & John Braun. Or the data are generated in a research project, and made available when the results of the project are published in a scientific journal.
The DAAG package is made available at this address: http://www.stats.uwo.ca/DAAG/. You can download the Windows version here http://www.stats.uwo.ca/DAAG/DAAG_0.82.zip. Click with right mouse button on the link, and save to disk.
There is a menu bar in your R console. Click on the Packages menu item, and select the last item: Install package(s) from local zip files. Go find the downloaded zip file, select it, and R will install it in the right spot [ C:\Program Files\R\R-2.4.0\library\ ]. You can use the same Packages menu item to load the data into your workspace.
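From the command line, you can check whether a package is already installed before trying to use its data. A hedged sketch using the base functions requireNamespace() and data(); whether DAAG is installed on your machine, and which datasets your version contains, is not something this example can know:

```r
pkg <- "DAAG"
items <- character(0)   # will hold the dataset names, if any

if (requireNamespace(pkg, quietly = TRUE)) {
  # List the datasets that come with the installed package
  items <- data(package = pkg)$results[, "Item"]
  print(head(items))
} else {
  message("Package ", pkg, " is not installed; ",
          "install.packages(\"", pkg, "\") would fetch it from CRAN.")
}
```

install.packages() needs an internet connection, which is why it is only mentioned in the message here and not run.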
It is relatively easy to manipulate datasets within R. Concatenating different datasets, extracting parts of them, and so on, is also possible.
For instance, consider the object
trapped.day2
We can access element number eight of it by typing:
trapped.day2[8]
and replace it by typing:
trapped.day2[8]<-5
such that trapped.day2 becomes
trapped.day2
Replacing several elements at a time can be done as well:
trapped.day2[9:11]<-rep(1,3) #replaces elements 9 to 11 by ones.
See
help(rep) #for explanation on the rep() function used here.
Try to figure out what the ":" does for you.
trapped.day2[c(1,8)]<-c(1,2) #replaces elements 1, 8 by 1 and 2 resp.
We can also add observations, even ones with missing values:
trapped.day2[12]<-NA
trapped.day2[13]<-12
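What happens when you assign beyond the end of a vector can be seen in this self-contained sketch. It assumes a seven-element starting vector named x, which is a made-up stand-in for trapped.day2:

```r
x <- c(4, 1, 2, 1, 0, 3, 3)   # seven observations

x[8]                  # NA: there is no eighth element yet
x[8] <- 5             # assigning to position 8 extends the vector
x[9:11] <- rep(1, 3)  # 9:11 is the sequence 9, 10, 11
x[13] <- 12           # skipping position 12 fills it with NA

x
length(x)
```

R grows the vector as needed and fills any skipped positions with NA.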
Spreadsheet facilities are available by using
data.entry(trapped.day2)
or we can use another way of editing still,
edit(trapped.day2)
The function edit() can use different text editors such as vi, pico, emacs etc... For details, take a look at
help(edit)
Using data.entry() to type in a new dataset is a little bit tricky. Suppose you want to make trapped.day3. This does not work:
data.entry(trapped.day3)
But the following does work, since we initialize trapped.day3 immediately
data.entry(trapped.day3=c(NA))
Concatenating column vectors goes as follows:
trapped.all<-c(trapped.day1,trapped.day2)
For combining data matrices, you use rbind() or cbind().
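A small sketch of the difference between cbind() and rbind(), assuming the two vectors of trap counts used earlier in this tutorial. The names by_column and by_row are made up:

```r
trapped.day1 <- c(0, 1, 2, 1, 0, 3, 3)
trapped.day2 <- c(4, 1, 2, 1, 0, 3, 3)

by_column <- cbind(trapped.day1, trapped.day2)  # each vector becomes a column
by_row    <- rbind(trapped.day1, trapped.day2)  # each vector becomes a row

dim(by_column)    # 7 rows, 2 columns
dim(by_row)       # 2 rows, 7 columns
by_column[1, ]    # first observation of both days
```

The vector names are kept as column names (cbind) or row names (rbind).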
Some functions return logical TRUE or FALSE, for instance
trapped.day2==1
returns TRUE for all elements of trapped.day2 that equal 1, FALSE otherwise.
Sometimes you just want the positions where that occurs:
which(trapped.day2==3)
The function is.na() is also extremely handy. It tests for the occurrence of missing values ("NA"):
is.na(trapped.all)
help(is.na)
You can for instance replace missing values by zeroes as follows:
trapped.all[is.na(trapped.all)]<-0
trapped.all
If trapped.all contained any missing values, they are now replaced by zeroes.
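The logical tests above can be tried on a small vector that contains one missing value. A self-contained sketch; the vector x and the other names are invented for this illustration:

```r
x <- c(4, 1, 2, 1, 0, 3, 3, NA, 12)

is_three  <- x == 3           # TRUE/FALSE per element, NA where x is NA
pos       <- which(x == 3)    # positions where the test is TRUE
n_missing <- sum(is.na(x))    # TRUE counts as 1, so this counts the NAs

x[is.na(x)] <- 0              # replace missing values by zero
x
```

Note that a comparison with NA yields NA rather than FALSE, while which() reports only the positions that are TRUE.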
Obviously, you can use R as a pocket calculator, but there's much more. We can do calculations on entire vectors:
trapped.day2-trapped.day1 # calculates the elementwise differences between them
The following functions do what you would expect already from reading their names:
length(trapped.all) # length of the vector
diff(trapped.all) # differences between successive observations
diff(c(1:10)) # Predict what this will do...
Several sample statistics are available as well:
median(trapped.all)
mean(trapped.all)
var(trapped.all)
max(trapped.all)
sum(trapped.all)
If your data vector contains missing values (NA), you can add the argument na.rm=TRUE to remove them before the statistic is calculated; see help(mean) for this.
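A short illustration of how missing values affect sample statistics, using a made-up vector x with one NA:

```r
x <- c(0, 1, 2, NA, 3)

mean(x)                 # NA: one missing value makes the mean unknown
mean(x, na.rm = TRUE)   # mean of the four observed values
max(x, na.rm = TRUE)    # the same argument works for max, sum, var, median
```

By default R returns NA here instead of silently ignoring the missing observation.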
The function cumsum() returns a vector whose elements are the cumulative sums of the argument:
cumsum(trapped.all)
cummax(trapped.all) # Similarly, cumulative maxima.
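To see what these functions return, here is a worked sketch on the first day's seven trap counts. The result names are made up for this illustration:

```r
x <- c(0, 1, 2, 1, 0, 3, 3)

steps   <- diff(x)     # 1 1 -1 -1 3 0 : one element shorter than x
running <- cumsum(x)   # 0 1 3 4 4 7 10 : running total
highest <- cummax(x)   # 0 1 2 2 2 3 3 : largest value seen so far

steps
running
highest
```

The last element of cumsum(x) always equals sum(x).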
It is also possible to calculate a statistic on a subset of a datavector:
mean(subset(trapped.day2,trapped.day2>2))
Beware, the following command calculates the proportion of observations with values larger than 2:
mean(trapped.day2>2)
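The difference between the two commands is easy to miss, so here is a self-contained comparison. It assumes the seven-element version of the second day's counts, stored in a made-up vector x:

```r
x <- c(4, 1, 2, 1, 0, 3, 3)

# Mean of the observations that are larger than 2: (4 + 3 + 3) / 3
mean_of_large <- mean(subset(x, x > 2))

# Proportion of observations larger than 2: 3 out of 7
# (x > 2 is TRUE/FALSE, and TRUE counts as 1 in the mean)
prop_large <- mean(x > 2)

mean_of_large
prop_large
```

The first command averages the selected values; the second averages the TRUE/FALSE results of the test itself.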
Suppose you want the standard deviation of a variable and not just the variance, for which we used var() in the last section.
std(trapped.all) # This does not work.
Let's make a function that calculates a standard deviation ourselves. We will call it std. For a vector x, it has to calculate the square root of the variance. It must become a function with one argument x that applies sqrt() to var(x). Voilà:
std<-function(x)sqrt(var(x))
std(trapped.all) # This works!
We could have avoided writing this function by looking a bit harder for:
help.search("standard deviation")
Then we would have seen that this is available:
sd(trapped.all)
Which produces the same result.
Now, can you write a function to compute the Coefficient of Variation of a variable?
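One possible answer, offered as a sketch and not as the only solution. It assumes the usual definition of the coefficient of variation as the standard deviation divided by the mean; the function name cv is made up:

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

cv(c(1, 2, 3))              # sd is 1 and mean is 2, so this gives 0.5
cv(c(0, 1, 2, 1, 0, 3, 3))  # the first day's trap counts
```

The result is only meaningful when the mean is not zero, and like sd() and mean() it returns NA if x contains missing values.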
The function table() builds a contingency table of the counts. For instance, when we do that for trapped.all, we get a table listing the frequencies of counts.
table(trapped.all)
help(table)
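A worked sketch of table() on the first day's seven trap counts, so the output can be checked by hand. The names x and counts are made up:

```r
x <- c(0, 1, 2, 1, 0, 3, 3)

counts <- table(x)
counts            # value 0 occurs twice, 1 twice, 2 once, 3 twice

names(counts)     # the distinct values, as character strings
counts[["3"]]     # how often the value 3 occurs
```

The names of the table are character strings, so a single count is looked up with "3" and not with the number 3.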
Much of the material in this short tutorial comes from:
Using R for data analysis and graphics. An Introduction [.pdf] by J. H. Maindonald (2001)
Using R for introductory Statistics [.pdf] by John Verzani (2002)