R: Back to Basics and Definitions

Before I learn new software or a new skill, I often like to do some homework and ask the silly questions like:  what, when, why, and how, to give me a base understanding of the software.  So, let’s work through these questions for R.

What is R?

R is a system that is used for statistical computation and graphics.  It has a number of aspects to it that include a programming language, graphics, interfaces or connection opportunities with other languages, and debugging capabilities.  I have found that many do not refer to R as a statistical software package, because it can do so much more.

What does this all mean?  It means that R is a very robust program that folks use for a variety of reasons.  It’s not just for statistical analysis!

Where did R come from?

The history of software packages can be quite interesting to learn about.  For instance, R has been described as a “dialect” of S, a language developed at Bell Labs in the 1970s.  Some of you may remember the statistical software called S-Plus?  It was a commercial version of S, and R is its open-source relative.  R itself was first developed in the early 1990s and has been one of the fastest growing open-source software packages since.

What does “open-source” mean?

I’m sure you’ve heard of this term in the past or in different contexts.  One thing that you will hear when people talk about R is that it is free or that it is open-source.  Keep in mind that open-source means that it is freely available for people to use, modify, and redistribute.  This usually translates to: there is no cost to acquire and use the R software!  Another aspect of open-source is that it is, or rather can be, community-driven.  So, any and all modifications to the software and subsequent documentation (if it exists) are driven by the community.

Please note that R has matured over the years.  Today’s R community is extremely strong and encourages documentation for anything that is released, making it a very desirable product.  This may not always be the case with open-source software.

Who uses R?

Business, academia, statisticians, data miners, students, and the list goes on.  Maybe we should ask the question, who is NOT using R, and then ask the question Why?

There are so many different statistical software options today and which one you choose to use will depend on several different factors:

  • What does your field of study use?
  • If you are a graduate student, what does your supervisor suggest and use?
  • What type of analyses are you looking to perform and does your program of choice offer those analyses?
  • What types of support do you have access to?

How does R work?

If you’re looking for a statistical package that is point and click, R is not for you!  R is driven by coding.  YES!  You will have to learn how to write syntax in R.  You can use R interactively by using RStudio, and you may never reach a point in your studies or your research where you will move away from the interactive capabilities of R – so no big worries!  Besides, today there are a lot of resources available to help you learn how to use R.  So don’t let that stop you!

Base R and Packages

When you download and install R, the Base R program is installed.  To run many of the analyses you may be required to install a package.  What is a package?  It is a collection of functions, data, and documentation that extends the current capabilities of the Base R program.  Packages are what make R so versatile!  As we work through our workshops and associated code, I will provide you with the name of the package.  There are a number of ways to acquire and install packages, and we will review these as we work through them.  Please note that there may be several packages that perform a similar analysis, so please read all the documentation before selecting a package to use.
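As a minimal sketch of the install-then-load pattern described above: the package name readxl is only an illustrative choice, and the install line is commented out because it needs an internet connection.  The stats package is used for loading because it ships with Base R.

```r
# Install a package from CRAN once per computer (needs an internet connection).
# The package name "readxl" is only an illustrative choice.
# install.packages("readxl")

# Load an installed package at the start of each R session.
# "stats" ships with Base R, so this call works on any installation.
library(stats)

# Check whether a package is installed without attaching it.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats
```

Installing is done once per computer, while library( ) is run in every session in which you want to use the package.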

I will add a page to this Blog in the near future (Summer 2018) that will list the packages and associated documentation that I have used and recommend.

How do I acquire R? Where can I download it?

Visit the Comprehensive R Archive Network (CRAN) website to download the R software.  https://cran.r-project.org/   Please note that this is also the website used to download the packages used in future analyses.

To download RStudio, visit the RStudio website at https://www.rstudio.com/

Both websites have comprehensive instructions to assist you with the installation on your own computers.

Let’s get started by reviewing some definitions

When you think about conducting any statistical analysis, your starting point is data.  So let’s start with a few definitions of the different data types observed in R.

Numeric, Character, or Logical

A quick overview of the different types of data you can work with in R.

  • Numeric = numbers
  • Character = words
  • Logical = TRUE or FALSE – not all data is in the form of numbers or letters, sometimes you might have data that has been collected as matching a criterion (TRUE) or not matching a criterion (FALSE).  We’ll work through examples of this in another session, for now just be aware that this type of data is commonly used in R.
  • How do you find out what form your data are in?
    • class(…)
    • The results of this statement will tell you exactly what form your data are in.
    • Example:

testform <- c(12, 13, 15)
class(testform)

> class(testform)
[1] "numeric"
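The same check works for the other two data types.  A small sketch, where the object names testwords and testlogic are made up for illustration:

```r
# A character vector and a logical vector (names are illustrative)
testwords <- c("green", "blue")
testlogic <- c(TRUE, FALSE)

class(testwords)   # "character"
class(testlogic)   # "logical"
```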

Numeric Classes in R

Numbers are handled in a couple of ways in R.  These are referred to as the Numeric Classes of R, and the two that we will use are known as integer and double.  Having a basic understanding of these different numeric classes will come in handy.

  • Integer:
    • If you think back to high school math, you’ll probably remember the term “integer”.  The first thing that comes to my mind when I think of an integer is a whole number: no fractions, no decimal places.
    • As you can imagine storing numeric data as integers does not require a lot of space.  So, in terms of computing, if you do not foresee your analysis needing decimals and precision numbers, then integers are the way to go.
  • Double:
    • Double precision floating point numbers – think of this as the decimals side of your numeric data.
    • Storing Double numeric data takes up more space than Integer data.  But sometimes you’re just not sure what you will need, so R will switch between the 2 numeric classes as it is required for your analysis.
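A short sketch of the two numeric classes, using made-up object names.  Note that a number typed without a suffix is stored as a double, and the L suffix is how you ask R for an integer:

```r
x_double  <- 12     # typed without a suffix: stored as double
x_integer <- 12L    # the L suffix asks R to store an integer

typeof(x_double)    # "double"
typeof(x_integer)   # "integer"

# class() reports doubles as "numeric"
class(x_double)     # "numeric"

# Division needs decimals, so R returns a double even for integer inputs
half <- x_integer / 5L
typeof(half)        # "double"
```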

Data Types in R

Let’s review the different data types available to you in R.

Vectors

  • Let’s not panic at some of these terms, but work through examples of each.  Think of a vector as a column of data or one variable.
  • Vectors can be numeric, characters, or logical format.
  • How to create a vector:

# a numeric vector
a = c(2, 4.5, 6, 12)

# a character vector
b = c("green", "blue", "yellow")

# a logical vector (the c( ) function is needed here too)
c = c(TRUE, TRUE, FALSE, TRUE)

Coding Explanation:

a = ; b = ; c = ;  creating vectors called a, b, c respectively.  Please note that a <- is the same as a =

c(x, x, x  )  tells R that we are creating a vector or a column with the contents found in the parentheses.  The , tells R to drop to the next row in the vector/column being created.

Character values must be enclosed in " ", but logical values are not.

Matrices (matrix)

  • Think of a matrix as an object made up of rows and columns.
  • The vectors within a matrix must all be the same type, so all numeric, or all character, or all logical.
  • How to create a matrix:

# creates a 5 x 4 numeric matrix – 5 rows by 4 columns
y <- matrix(1:20, nrow=5,ncol=4)

Coding Explanation:

y = or y <- creates a matrix called y
matrix(  )  – call the function matrix to create the matrix y
1:20 – the values of the matrix
nrow =  lets R know how many rows are in the matrix that you are creating
ncol =  lets R know how many columns are in the matrix that you are creating.

Resulting matrix y will look like:

> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Arrays

  • Arrays are very similar to matrices.  Think of an array as a matrix with an added dimension.  For example, we may have a matrix that contains data for 2015.  We want to add in the same data for 2016 in the same format.  So we can create an array made up of one matrix that contains the 2015 data and a second matrix that contains the 2016 data.
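A minimal sketch of the 2015/2016 idea, with made-up values and object names.  The two matrices are stacked along a third dimension that is labelled by year:

```r
# Two small matrices in the same format (values are made up)
data2015 <- matrix(1:6,  nrow = 2, ncol = 3)
data2016 <- matrix(7:12, nrow = 2, ncol = 3)

# Combine them into an array: 2 rows x 3 columns x 2 years
yearly <- array(c(data2015, data2016),
                dim = c(2, 3, 2),
                dimnames = list(NULL, NULL, c("2015", "2016")))

dim(yearly)          # 2 3 2
yearly[, , "2016"]   # pulls the 2016 matrix back out
```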

Data Frames

  • A Data Frame is a general form of a matrix.  What this really means, is that a data frame is like a dataset that we use in other programs such as SAS and SPSS.  The columns or variables do not need to be the same type as is required in a matrix.
  • We can have one vector/column/variable in a data frame that is integer (numeric), followed by a second one that is character, followed by a third that is logical.  But in a matrix, all three vectors/columns/variables must be the same type: numeric, character, or logical.
  • How to create a data frame:

d <- c(10, 12, 31, 4)
e <- c("blue", "green", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
sampledata <- data.frame(d, e, f)
names(sampledata) <- c("ID", "Colour", "Passed") # variable names

Coding Explanation:

sampledata <- or sampledata = name of the data frame that we are creating
data.frame(  )  calling on the function that creates a data frame
d, e, f  tells R that we are creating the data frame with the 3 vectors in the order of d, followed by e, followed by f

names(sampledata) – providing variable names within the data frame
c("ID", "Colour", "Passed")  – creating or identifying the 3 variable names within the data frame:  ID, Colour, Passed are the variable names
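Once a data frame exists, it is worth checking that each column has the type you expect.  The sketch below rebuilds the same sampledata, this time naming the columns inside data.frame( ), which is an alternative to the separate names( ) step and not the approach used above:

```r
d <- c(10, 12, 31, 4)
e <- c("blue", "green", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)

# Name the columns directly when creating the data frame
sampledata <- data.frame(ID = d, Colour = e, Passed = f)

# One line per column: name, type, and the first few values
str(sampledata)

# The class of every column at once
sapply(sampledata, class)
```

How the Colour column is reported depends on your version of R, since older versions convert text columns to factors by default.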

Lists

  • an ordered collection of objects.
  • objects in the list do not have to be the same type.
  • You can create a list of objects and store them under one name.
  • How to create a list:

# a string, a numeric vector, a matrix, and a scalar
wlist <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

Coding Explanation:

wlist <- or wlist =  creating a list called wlist
list(  )  – calling the function to create a list
name="Fred", mynumbers=a, mymatrix=y, age=5.3  values that are to be contained in the list called wlist
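Because everything is stored under one name, you need a way to pull the individual pieces back out.  A small sketch, recreating a and y from the earlier examples so that it runs on its own:

```r
a <- c(2, 4.5, 6, 12)
y <- matrix(1:20, nrow = 5, ncol = 4)
wlist <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)

length(wlist)          # 4 objects in the list
wlist$name             # "Fred"
wlist[["mynumbers"]]   # the numeric vector a
wlist$mymatrix[2, 3]   # row 2, column 3 of the matrix: 12
```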

Factors

Factors are categorical variables in your data.  You can have a nominal factor or you can have an ordinal factor.  Yup, those words again – remember nominal and ordinal data are categorical pieces of data, so each observation falls into one group or another.  With nominal data, there is no relationship or order to the categories, whereas with ordinal data there is an order to the different levels.
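A minimal sketch of both kinds of factor, using made-up data.  The factor( ) function creates a nominal factor by default, and adding ordered = TRUE with the levels listed from lowest to highest creates an ordinal one:

```r
# Nominal factor: the categories have no order
colour <- factor(c("blue", "green", "red", "blue"))
levels(colour)       # "blue" "green" "red"

# Ordinal factor: list the levels from lowest to highest
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
is.ordered(size)     # TRUE
size[1] < size[2]    # TRUE, because small comes before large
```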

Questions or Homework for Self-study work:

  1. Create examples of a vector, matrix, data frame, and a list.
  2. Using the following files, identify the type of data:
    • cars sample found in R
  3. Create a data frame with the following information:
    • column 1:  13, 14, 15, 12
    • column 2:  Male, Female, Male, Male
    • column 3: TRUE, TRUE, FALSE, FALSE
    • column 4: 26, 44, 77, 31
  4. Can I create a matrix with the information listed in #3 above?  Why or Why not?

 

R: Keyboard Shortcuts

As I continue to learn about R and RStudio in particular, I will add to this ongoing list of keyboard shortcuts.  Some of these may also be listed in other posts.  But I’ll try to update this one as I learn new ones.

Keyboard Shortcut             Function

Ctrl-Enter                    Submit code
Ctrl-1                        Move to Source window
Ctrl-2                        Move to Console window
Ctrl-L                        Clear Console window
Alt-Minus (Alt and -)         Insert <-

R: Using RStudio and Importing Data

Where do you start?

Learning a new program can be scary and overwhelming at times.  So let me share a few of the shortcuts I’ve learned these past few months on using RStudio.

Navigating the Windows in RStudio

When you open RStudio you’ll see 4 windows or 4 sections on your screen:  editor, console, history, and environments with tabs.  Let’s start with the environments window – you should see 6 tabs:  Environment, Files, Plots, Packages, Help, and Viewer.   The Environment tab lists the objects, such as datasets, that are being used during the current project.  The Files tab allows you to view all the files that are available in your working directory.  The Plots tab will show any plots that are created during your session.  The Packages tab will list all packages that you have installed, with a checkmark beside the ones that are currently loaded.  The Help tab is self-explanatory.  A quick sidenote, the Help window is great!  Please take advantage of it by using the search function in the Help tab.

The History window will list all the lines of code that you have run until you clear it out.  A great way to see what you have done – especially if you encounter troubles along the way.

That leaves the editor and the console.  The editor is where you open an R script file and the console is where you run your code as you type it in.  To run code that is in your editor – select the bits of code and hit Ctrl-Enter to run it.  In the console, you type the line, hit enter and it runs immediately.  I use these two windows in tandem.  To move between these two windows – Ctrl-2 moves you to the Console window and Ctrl-1 brings you back to the editor window.  Of course, a mouse works great too!

One more quick tip – the console window can fill up quite quickly and to me, can feel very cluttered.  Remember the History window will keep a history of your code, so it would be ok to clear out the console as you see fit.  In order to do this, use Ctrl-L to clear it out.

Working Directory

Sometimes having your program always refer to the same directory, when saving files or when opening files, can be very handy.  You’ll always know where your files are!  R makes this very easy to accomplish.

First, let’s see what the current working directory of your RStudio is by typing:

getwd()

To change the working directory for the current project you are working on type:

setwd("C:/Users/edwardsm/Documents/Workshops/R")

Of course, you’ll want to make this a directory on your computer 😉   But as you look at this – do you notice anything odd about this statement???  You’ll notice that the slashes / go in the opposite direction to the backslashes \ you normally see on a Windows machine.  R treats a single \ inside quotes as the start of a special character, so a path copied straight from Windows will not work as is.  Changing these manually can be a time-consuming effort.  One way around this is to add an extra \ after every one in your location.  See below:

setwd("C:\\Users\\edwardsm\\Documents\\Workshops\\R")

Always double-check your working directory by running getwd().  Are the results what you were expecting?  If not, try it again.
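The set-then-check routine can be sketched as follows.  Here tempdir() is only a stand-in for your own project folder so that the example runs on any computer, and the object names are made up:

```r
old_wd <- getwd()      # remember where we started

setwd(tempdir())       # tempdir() stands in for your own project folder
getwd()                # double-check: should now show the temporary folder

setwd(old_wd)          # go back to the original working directory

# Each \\ typed in the code is stored as a single backslash
win_path <- "C:\\Users\\edwardsm\\Documents\\Workshops\\R"
cat(win_path, "\n")

# file.path() builds the same location with forward slashes
built_path <- file.path("C:", "Users", "edwardsm", "Documents", "Workshops", "R")
built_path
```

Building the location with file.path( ) avoids typing either kind of slash by hand.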

More ways to set your working directory:

  • In RStudio, Session -> Set Working Directory in the menu bar provides 3 options for setting your working directory:
    • To Source File location (the directory where you save your R script and program files)
    • To Files Pane Location – in the Files Pane – navigate to the location you want to have as your Working Directory.  Once you have it selected in the Files Pane, then choose Session -> Set Working Directory -> Files Pane location.  You will see the new working directory appear in your console and it should match what you select in the Files Pane.
    • Choose Directory – will open a windows dialogue box where you navigate and select the directory of choice.
  • While you are in the Files Pane location – navigate to the directory that you would like to set as your working directory, then in the Files Pane – select More -> Set Working Directory.  This option is very similar to the Files Pane Location option under the Session menu of RStudio.

Importing Data

Every statistical package has a number of ways to bring data in.  R is no different!  Now most of us will use Excel to enter our data, then we’ll clean it up in Excel before we import it into our preferred statistical package.  For the purposes of this session, I will use an Excel worksheet as an example.

The first step is to save our Excel worksheet as a CSV (Comma separated values) file.   In Excel, File -> Save As -> Select CSV as the file format.  You will be asked several questions regarding the format of the file you want to save.  Please note that when you save an Excel file or a worksheet as a CSV, it will only save the worksheet that you have selected and NOT the entire Excel file (which may have several worksheets).

There are a couple of ways to import the CSV file.  But, the first thing you’ll need to do is decide on a name for the data once it is in R.  Please note that in R, you can use a “.” in the name.  For more information on best practices for filenames and variables, please visit the Best Practices for entering your Research Data using Excel.

The following code will import a CSV file called Example and save it in R as a data frame called my.data .  For this piece of code, the file, Example.csv, must be in the working directory that you’ve set earlier  OR you will need to provide R with the full location of the file – this includes the drive and directory structure.  The header=TRUE option lets R know that the first line of the datafile has a header or contains the names of the variables.

my.data=read.csv("Example.csv", header=TRUE)

If you have files that are not in the working directory or you don’t want to provide the full location of the file, then you can use the following piece of code.  Personally, I much prefer this next piece.

my.data2=read.table(file.choose(), header=TRUE, sep=",")

Now my data in R will be called my.data2  .  Using this code, a dialogue box will open and will allow me to navigate to the directory that holds my files.  This coding option provides me more opportunities than the first, in my opinion.  The header=TRUE holds the same meaning as above.  However, this time we need to specify the delimiter, or the item that is separating the variables in the CSV file, which is a comma – depicted as sep="," in the code.

Why do you have to use a sep="," option in the second case and not in the first, when we are reading the same CSV file in both cases?  The first import coding option is using a function called read.csv – so R already knows that it will be reading a CSV or comma separated file.  Whereas in the second case, read.table – R has no idea what type of data it will be reading, therefore we need to specify what the delimiter or separator is in the datafile.
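To see the two import options side by side without needing a file of your own, the sketch below first writes a small made-up CSV to a temporary location and then reads it back both ways.  The object names and the data are invented for illustration, and tempfile() replaces file.choose() so that no dialogue box is needed:

```r
# Write a small, made-up CSV file to a temporary location
csv_file <- tempfile(fileext = ".csv")
example <- data.frame(ID = c(10, 12, 31, 4),
                      Colour = c("blue", "green", "red", "blue"),
                      Passed = c(TRUE, TRUE, TRUE, FALSE))
write.csv(example, csv_file, row.names = FALSE)

# Option 1: read.csv already assumes a comma separator
from_csv <- read.csv(csv_file, header = TRUE)

# Option 2: read.table needs to be told the separator
from_table <- read.table(csv_file, header = TRUE, sep = ",")

# Compare the two imports
all.equal(from_csv, from_table)
```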

There are other ways to import data into R, but I have found these two, with preference for the second one, to be quite direct and straightforward.  It also encourages the user to maintain a data Master file in Excel, with a text copy of your data to use for the analysis.  Remember the text format will be a great option for preserving and sharing once your project is complete.

Two other packages that import data were discussed during our session today.  The first reads Excel files specifically and the second handles many data formats.  These are:

  • readxl package
  • tidyverse package

Look to future sessions for more details about these packages.

If you have suggestions or hints for other methods to import data into R, please leave a comment below or send me an email and I can add them here.

Communities of Practice: Coming Fall 2017

“Communities of practice are groups of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly.”  – wenger-trayner.com

The OAC Stats Support Service will facilitate Communities of Practice (COP) to engage the OAC research community and assist with statistical analyses and statistical software. Our researchers use a variety of statistical approaches and statistical software packages to conduct their research. By meeting, sharing perspectives, and learning new aspects of our software and/or statistical approaches as a community, we can create enriched learning environments for all.

Fall 2017 will see the creation and revitalization of four COPs:

  • SASsy Fridays
  • Crimes of Statistics
  • OAC R Users Group
  • OAC Data Visualization

SASsy Fridays

SASsy Fridays started as a COP in W14 in response to the growing interest in SAS-specific topics beyond what was being taught in the workshops. If you use SAS and are interested in learning and sharing new approaches to using the software or new statistical approaches in SAS, this is the COP for you!  For past topics please review the SASsy Fridays blog.  If you have a topic you would like to present or would like more information about, please email oacstats@uoguelph.ca. SASsy Fridays sessions will take place in the Crop Science Lab Rm 121A on the following dates and times:

  • Friday, October 13  from 12:30-1:20 p.m.
  • Friday, October 27 from 12:30-1:20 p.m.
  • Friday, November 10 from 12:30-1:20 p.m.
  • Friday, November 24 from 12:30-1:20 p.m.
  • Friday, December 8 from 12:30-1:20 p.m.

Crimes of Statistics

Many of us conduct experiments and run the appropriate statistical analysis, but sometimes we can get caught up in questioning the basics of the theoretical background: topics such as replication, sampling, power, p-values, and many more. This COP will meet to discuss these and other topics. A short presentation on the topic du jour will be followed by a discussion of situations you may have encountered. The Crimes of Statistics COP will meet in the OAC Boardroom (Johnston Hall) on the following dates and times:

  • Tuesday, August 22 from 10:00-10:50 a.m.
  • Tuesday, September 5 from 10:00-10:50 a.m.
  • Tuesday, October 3 from 10:00-10:50 a.m.
  • Thursday, November 2 from 10:00-10:50 a.m.
  • Thursday, November 30 from 10:00-10:50 a.m.
  • Tuesday, December 12 from 10:00-10:50 a.m.

The first meeting on August 22 will be an information gathering session. Please bring any topics you would like to see discussed to this session.

OAC R Users Group

R is growing in popularity and is gaining international acceptance in the research community. The goal of this group will be to exchange knowledge about R-packages and R-libraries that your research field or your lab uses. A short presentation or demonstration of a practical application of an R-package or R-library will be followed by questions and exploration of other uses for the presented material. The OAC R User Group meetings will take place in Crop Science Lab Rm 121A on the following dates and times:

  • Friday, October 20 from 12:30-1:20 p.m.
  • Friday, November 3 from 12:30-1:20 p.m.
  • Friday, November 17 from 12:30-1:20 p.m.
  • Friday, December 1 from 12:30-1:20 p.m.
  • Friday, December 15 from 12:30-1:20 p.m.

Data Visualization

You have been collecting data for a project and now it’s time to do something with it! What do you do? How do you present it? Should it be a table? A graph? A chart? This COP will discuss different ways of presenting data, the pros and cons of different formats, and will encourage the community to demonstrate their favourite data visualization formats. The Data Visualization COP will meet in the OAC Boardroom (Johnston Hall) on the following dates and times:

  • Tuesday, October 17 from 12:00-12:50 p.m.
  • Tuesday, October 31 from 12:00-12:50 p.m.
  • Tuesday, November 14 from 12:00-12:50 p.m.
  • Tuesday, November 28 from 12:00-12:50 p.m.
  • Tuesday, December 12 from 12:00-12:50 p.m.