S17 RDM Workshop: Data Preservation: The Legacy of your Research Data

PowerPoint presentation used during the workshop – June 14, 2017

This workshop was a continuation of the first RDM workshop, where we concentrated on creating new datasets with variable names that follow Best Practices for reading your data into a statistical analysis package.  This workshop looked at how to handle data that has been passed on to you by another individual, and the challenges that accompany “inheriting” or acquiring data.

There were 2 exercises in this workshop.  In the first exercise, you were provided with a small dataset along with a number of variables and titles for each.  You were asked to determine what types of questions you would need answered before you could work with this dataset effectively.  Below are links to the 4 datasets along with a sample of questions that you may need to ask.

Exercise1_Group1

Exercise1_Group2

Exercise1_Group3

Exercise1_Group4

In the second exercise, you were provided with one of two directories that held a number of files.  You were asked to review the directory structure and the file naming conventions, and to propose new ones consistent with the recommended Best Practices presented in the workshop.  Below are links to proposed answer keys for both directories.

Group1_folder_directory_answersheet

Group2_folder_directory_answersheet

To reiterate, the primary goal of these workshops is that you take all the recommended Best Practices presented and implement them with your own data.

S17 RDM Workshop: Best Practices for entering your Research Data using Excel

PowerPoint used during the 20170607 workshop

Commonly Used Statistical Packages

For the purposes of this workshop, the following statistical packages were reviewed:

  • SAS
  • SPSS
  • Stata
  • R
  • Matlab

It is recognized that many more packages are available and used in the OAC community.  If you have questions regarding other packages not included here, please email oacstats@uoguelph.ca.

Commonly Used Statistical Packages:  Variable name restrictions and limits

LENGTH OF THE VARIABLE NAME

SAS – 32 characters long
SPSS – 64 bytes long
• 64 characters in English
• 32 characters in Chinese
Stata – 32 characters long
R – 10,000 characters long
Matlab – 63 characters long

1ST CHARACTER OF THE VARIABLE NAME

SAS – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
SPSS – 1st character MUST be:
• a letter (English) OR
• “@”,“#”,“$”
Stata – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
R – 1st character MUST be:
• a letter OR
• a period “.” that is NOT followed by a number
Matlab – 1st character MUST be:
• a letter

BLANKS IN VARIABLE NAMES

SAS – NO Blanks!
SPSS – NO Blanks!
Stata – NO Blanks!
R – NO Blanks!
Matlab – NO Blanks!

SPECIAL CHARACTERS IN VARIABLE NAMES

SAS – NO Special characters with the exception of:
• “_”
SPSS – NO Special characters with the exception of:
• “_”
• “.”
• “@”,“#”,“$”
Stata – NO Special characters with the exception of:
• “_”
R – NO Special characters with the exception of:
• “_”
• “.”
Matlab – NO Special characters with the exception of:
• “_”

CASE IN VARIABLE NAMES

SAS – Mixed case – for presentation only
SPSS – Mixed case – for presentation only
Stata – Case sensitive
R – Case sensitive
Matlab – Case sensitive

NAMES/WORDS TO AVOID IN VARIABLE NAMES

SAS – SAS keywords
SPSS – SPSS Reserve words
Stata – Stata reserved names (for example _all, _n, _N, if, in, using, with)
R – R function words
Matlab – Function names

GENERAL NOTES ABOUT VARIABLE NAMES

SAS – Libref names can only be 8 characters long
SPSS – #variable – is a scratch variable used in syntax
• $variable – is a system variable
• Do NOT end variable name with a “.” OR “_”
Stata – NA
R – R Community recommends that you develop a naming convention for your data
• Use of “_” is faster (10-20%) than the use of “.”
Matlab – NA
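The length limits listed above can be collected into a small lookup to see which packages would accept a candidate name. The following is a minimal sketch in Python, which is not one of the packages reviewed in the workshop. The names `MAX_NAME_LENGTH` and `packages_allowing_length` are hypothetical, and the sketch assumes the SPSS byte limit is measured on UTF-8 text. It checks length only, not the character rules.

```python
# Length limits as listed in the workshop notes above.
# SPSS counts bytes; the other limits are treated here as character counts.
MAX_NAME_LENGTH = {
    "SAS": 32,
    "SPSS": 64,  # bytes
    "Stata": 32,
    "R": 10_000,
    "Matlab": 63,
}


def packages_allowing_length(name):
    """Return the packages whose length limit the name satisfies."""
    allowed = []
    for package, limit in MAX_NAME_LENGTH.items():
        # Assumption: the SPSS byte length is measured on UTF-8 encoded text.
        length = len(name.encode("utf-8")) if package == "SPSS" else len(name)
        if length <= limit:
            allowed.append(package)
    return allowed


print(packages_allowing_length("wt28"))
print(packages_allowing_length("weight_in_kilograms_at_28_days_of_age"))
```

The second name is 37 characters long, so it would fit in SPSS, R, and Matlab but not in SAS or Stata, which is one reason the recommendations that follow settle on a 32-character maximum.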

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Names

LENGTH RECOMMENDATION

  • Maximum length: 32 characters
  • Keep the variable names short and use a variable label to provide more information. Remember you need to type these variable names in and you will need to remember them.

1ST CHARACTER OF A VARIABLE NAME

  • ALWAYS start variable names with a letter

VARIABLE NAMES AND SPECIAL CHARACTERS

  • Numbers may be used anywhere in the variable name AFTER the first character
  • Only use underscores “_”
  • Do NOT use BLANKS – replace blanks with an underscore “_”

CASE

  • Use lowercase
  • Case doesn’t matter in SAS or SPSS.
  • Stata, R, and Matlab are case sensitive, so wt28 and WT28 are different variables in those packages.  If you use lowercase as a Best Practice, you won’t forget which ones are Capitals and which ones are NoT.
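The recommendations so far (at most 32 characters, start with a letter, underscores only, no blanks, lowercase) can be applied mechanically to existing spreadsheet column headers. The sketch below is one possible way to do that in Python. The function name `clean_variable_name` and the choice to prefix a "v" when a header does not start with a letter are assumptions made for illustration, not part of the workshop material.

```python
import re


def clean_variable_name(header):
    """Turn a column header into a name that follows the recommendations above.

    Hypothetical helper for illustration only.
    """
    name = header.strip().lower()            # use lowercase
    name = re.sub(r"\s+", "_", name)         # replace blanks with "_"
    name = re.sub(r"[^a-z0-9_]", "", name)   # drop all other special characters
    if not name or not name[0].isalpha():
        # Assumption: prefix "v" so the name always starts with a letter.
        name = "v" + name
    return name[:32]                         # keep to 32 characters at most


print(clean_variable_name("Weight (kg) Day 28"))
print(clean_variable_name("28 Day Wt"))
```

Automatic cleaning like this can produce duplicate or unclear names, so the result still needs a quick review by the person who knows the data.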

FAMILIARITY WITH STATISTICAL PACKAGE NOMENCLATURE

  • As you work with a particular package you will become familiar with keywords or reference words that are reserved for the program to use.
  • As a general rule, keep away from statistical terms as variable names.  If you REALLY want to use “mean”, qualify it with your data, for example wt_mean or concentration_mean.
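The recommended Best Practices above can also be written as a checklist that reports what is wrong with a proposed variable name. This is a minimal sketch in Python. The function name `check_variable_name` is hypothetical, and `STAT_TERMS` is a short, assumed list of statistical terms; each package has its own, longer list of keywords and reserved words.

```python
import re

# Assumption: a short, illustrative list of statistical terms to avoid.
STAT_TERMS = {"mean", "median", "sum", "min", "max", "std", "var"}


def check_variable_name(name):
    """Return a list of problems; an empty list means the name passes."""
    problems = []
    if len(name) > 32:
        problems.append("longer than 32 characters")
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with a letter")
    if " " in name:
        problems.append("contains blanks")
    if re.search(r"[^A-Za-z0-9_ ]", name):
        problems.append("contains special characters other than underscore")
    if name != name.lower():
        problems.append("contains uppercase letters")
    if name.lower() in STAT_TERMS:
        problems.append("is a bare statistical term")
    return problems


for candidate in ["wt_mean", "mean", "Wt 28", "28wt"]:
    print(candidate, "->", check_variable_name(candidate) or "OK")
```

Only a name that is exactly a statistical term is flagged, so qualified names such as wt_mean pass, in line with the advice above.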

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Labels

Variable names are often short and may not reflect the contents of the data collected.  Trying to create a variable name that is a descriptive summary of the data can be extremely challenging.  The recommendation is to create short, concise variable names and a descriptive variable label for each one.

Variable name:  wt28

Variable label: Weight (kg) at 28 days of age 

CREATING VARIABLE LABELS – SAS

Data first;
  Infile …;
  Input …;
  Label wt28 = "Weight (kg) at 28 days of age";
Run;

 CREATING VARIABLE LABELS – SPSS

  1. Variable View in the SPSS Data viewer
    • Find the variable called wt28
    • In the Column called Label – Type: Weight (kg) at 28 days of age
  2. Syntax Window:

VARIABLE LABELS wt28 "Weight (kg) at 28 days of age".

CREATING VARIABLE LABELS – R

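R does not have a built-in variable label, so the approach depends on the object you are working with (data frame, vector, etc.).  One common option is to store the label as an attribute of the variable, shown here for a data frame with the placeholder name mydata:

attr(mydata$wt28, "label") <- "Weight (kg) at 28 days of age"

To label several variables at once, apply the same step to each column with a function such as lapply.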

CREATING VARIABLE LABELS – STATA

label variable wt28 "Weight (kg) at 28 days of age"

 

CREATING VARIABLE LABELS – MATLAB

T.Properties.VariableDescriptions{'wt28'} = 'Weight (kg) at 28 days of age';
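The SAS, SPSS, and Stata label statements above follow a regular pattern, so they can be generated from a single list of variable names and labels kept alongside the data. The sketch below shows one way to do this in Python. The names `LABELS` and `label_statements` are hypothetical, and it assumes the label text contains no double quotes. The output uses straight double quotes, which is what these packages expect in syntax; the SAS line still has to be placed inside a DATA step.

```python
# Hypothetical data dictionary: variable name -> descriptive label.
LABELS = {
    "wt28": "Weight (kg) at 28 days of age",
}


def label_statements(labels):
    """Return label commands per package, following the syntax shown above.

    Assumption: label text does not contain double quotes.
    """
    return {
        "SAS": [f'label {name} = "{text}";' for name, text in labels.items()],
        "SPSS": [f'VARIABLE LABELS {name} "{text}".' for name, text in labels.items()],
        "Stata": [f'label variable {name} "{text}"' for name, text in labels.items()],
    }


for package, lines in label_statements(LABELS).items():
    print(package)
    for line in lines:
        print("  " + line)
```

Keeping the names and labels in one place means the same descriptions travel with the data, whichever package the next person uses.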