Agreement statistics – Inter- and Intra-observer reliability

This is a topic that comes up every now and again 🙂  So let’s try to tackle it in a way that will be helpful.  First let’s define the difference between inter- and intra-.  This may be trivial for some folks, but I know I can get them mixed up on occasion.

Inter – between observers – number of different people
Intra – within observer – same person

Now, there are also 4 terms that are often associated with inter- and intra-observer reliability.  These are:

  • Accuracy
  • Precision
  • Validity
  • Reliability

Different ways of thinking about these terms – which term means what?

  1. Are you measuring the correct thing?  Are you taking a valid measurement?  Are you measuring what you think you’re measuring?  This is the ability to capture the TRUE measure or value of what is being measured.  VALIDITY or ACCURACY!
  2. Are the measurements you are taking repeatable and consistent?  Are you taking a good measure?  How consistent are the results of a test?  This is the RELIABILITY or the PRECISION of a measure.

I saw this diagram and I think it provides a great visual to help us distinguish between these two terms.

Precision and Accuracy target diagram

Precision or Reliability – the ability to repeat our measures or our results – consistency
Accuracy or Validity – the ability to measure the “correct” measure

How do we measure Inter- or Intra-observer reliability?

Let’s start with Intra-.  In this situation, you have several measurements that were taken by the same individual, and we want to be able to assess how precise or reliable these measures are.  Essentially, was the individual measuring the same thing every time they took that measure?  Were the test results the same or extremely similar every time they were taken?  We are NOT interested in whether we were measuring the “correct” thing at this point, only whether we were able to measure the same trait over and over and over again.

Inter-observer reliability is the same idea.  Except now we’re trying to determine whether all the observers are taking the measures in the same way.  As you can imagine, there is another aspect to Inter-observer reliability, and that is to ensure that all the observers understand what to measure and how to measure it.  Are they measuring the SAME thing?  We cannot assess this per se – but we can do our best in a research project to mitigate this problem by training all the observers, documenting the procedures on how to take the measures, and reviewing the procedures along with update training on occasion.

What statistical tests are available?

The type of data you are working with will determine the most appropriate statistical test to use.  The two most common tests are:

  • Nominal (categorical) variables – Kappa, a chance-corrected measure of % agreement
  • Continuous variables – correlation coefficient

Example for Kappa calculation:

Instantaneous    Observer 1    Observer 2    Observer 3
Sample Time

 1               R             R             R
 2               R             R             R
 3               F             W             F
 4               F             F             F
 5               W             W             F
 6               R             R             F
 7               W             W             W
 8               F             F             W
 9               R             R             R
10               R             R             R

Where R=Resting; F=Feeding; W=Walking

Let’s calculate Kappa between Observer 1 and 2 together and then you can try the same set of calculations between Observer 1 and 3.

We first need to create a cross tabulation between Observer 1 and 2, listing the different combinations of behaviours observed during the 10 sampling times.

Observer 2                     Observer 1              Prop. of total for Obs 2
Measures:             F            R            W
F                     2            0            0       2/10 = 0.2
R                     0            5            0       5/10 = 0.5
W                     1            0            2       3/10 = 0.3
Proportion of total
for Observer 1       0.3          0.5          0.2

Kappa = (Pobserved – Pchance) / (1 – Pchance)

Pobserved = sum of diagonal entries/total number of entries = 9/10 = 0.90

Pchance = Sum of (Proportion for Observer 1 x Proportion for Observer2)
= (0.2 x 0.3) + (0.5 x 0.5) + (0.3 x 0.2) = 0.37

Kappa = (0.90 – 0.37) / (1 – 0.37) = 0.53/0.63 = 0.84

Now isn’t that nice and easy to calculate by hand?  Give it a try for Observer 1 and Observer 3.
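If you’d rather check the arithmetic programmatically (including your own answer for Observer 1 vs Observer 3), here is a minimal Python sketch of the same calculation.  The `cohens_kappa` helper is just an illustrative name, not from any package:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(rater1)
    # Observed agreement: proportion of items where the two raters match
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over the behaviour categories
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum((c1[k] / n) * (c2[k] / n) for k in c1.keys() | c2.keys())
    return (p_observed - p_chance) / (1 - p_chance)

observer1 = ["R", "R", "F", "F", "W", "R", "W", "F", "R", "R"]
observer2 = ["R", "R", "W", "F", "W", "R", "W", "F", "R", "R"]
print(round(cohens_kappa(observer1, observer2), 3))  # 0.841
```

This reproduces the hand calculation above: Pobserved = 0.90, Pchance = 0.37, Kappa = 0.53/0.63 ≈ 0.84.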

Kappa in SAS

Data kappa_test;
    input timept observer1 $ observer2 $ observer3 $;
    datalines;
1 R R R
2 R R R
3 F W F
4 F F F
5 W W F
6 R R F
7 W W W
8 F F W
9 R R R
10 R R R
;
Run;

Proc freq data=kappa_test;
    table observer1*observer2 / agree;
Run;

To view the results please download the following PDF document.

Kappa in R

I used the R package called irr (Various Coefficients of Interrater Reliability and Agreement).  Looking at only Observer1 and Observer2, here is the R script I used:

install.packages("irr")
library(irr)

observer1 <- c("R", "R", "F", "F", "W", "R", "W", "F", "R", "R")
observer2 <- c("R", "R", "W", "F", "W", "R", "W", "F", "R", "R")

x <- cbind(observer1, observer2)
x

kappa2(x, weight = "unweighted", sort.levels = FALSE)

The results are:

Cohen’s Kappa for 2 Raters (Weights: unweighted)

Subjects = 10
Raters = 2
Kappa = 0.841

z = 3.78
p-value = 0.000159

Guide to Interpreting the Kappa Statistic

The Kappa statistic can range from -1 to 1, where 0 indicates agreement no better than chance.  A value greater than 0.70 is generally considered a sign of good agreement.

There are other statistics, such as Kendall’s coefficient of concordance, that can be used to calculate reliability, but Kappa and correlation coefficients are probably the most common.


Crimes of Statistics: Is it RANDOM or is it FIXED?

This is a topic that comes up a lot these days during my consultation appointments.  Deciding whether our treatments are FIXED or RANDOM is easy, but when we combine experiments – something that is commonly done in the Plant Agriculture field – are years, trials, and environments FIXED or RANDOM?

I’d like to propose that we talk about this one question during our session this coming week, and I’m proposing that you read the following paper in preparation for our discussion.  Moore and Dixon do a great job of digging into this topic, but there is still room for a discussion, especially how it relates to your own trials.  See you on Wednesday, February 14, 2018.

Moore, K.J. and Dixon, P.M. (2015). Analysis of Combined Experiments Revisited.  Agronomy Journal 107(2): 763-771. doi:10.2134/agronj13.0485


Crimes of Statistics: Longitudinal Studies or Repeated Measures – What are the implications?

What is a longitudinal or repeated measures study?

Let’s take a little step back first and recall the conversation we had back in the Fall semester about the experimental unit – the unit to which the treatment is applied.  This is a VERY crucial concept and definition when we talk about a repeated measures study.

Like the term says, repeated measures means the researcher is taking the same measurements on “some unit” repeatedly.  We often think of this in terms of time.  I’m going to take weight measures or height measures every month during the summer growing period.  The question that needs to be answered is: on “what” unit?  Is it the same experimental unit?  If yes, then we have a classic repeated measures study.  If no, then we have replicates.

Longitudinal study is a term often used in the social sciences.  We tend to think of a longitudinal study – again in terms of time – and usually in the context of a longitudinal survey.  The experimental units, in this case, are the survey respondents, and they will be answering the same survey several times in a year or across many years.

Bottom line: a longitudinal or repeated measures study is a study where the experimental unit is measured more than once.

Examples of longitudinal or repeated measures studies

  • An educational survey where students answer the survey after high school, after their 1st year of University, 2nd year, 3rd year, and after graduating
  • A dairy lactation study, where the same cows in a herd are milked and measured each day during their first 3 lactations.
  • A new diet trial, where feed consumed by dogs is measured every day for a 21 day trial
  • A new herbicide trial, where plots in a field are measured every week for weed counts
  • A soil texture trial, where texture is measured at 4 depths of a soil core.

Challenges with a longitudinal or repeated measures study

The goal of many studies is to examine or determine whether differences exist between treatments of interest in the study.  We gather our data and conduct the statistical analysis to look at the variation between our treatments in that study.  When we enter our data, chances are we will enter an observation every time we take a measurement.  For example, if we have 20 dogs on our trial and we are measuring their feed intake for 21 days, we will have 420 lines of data.  OR we may have a dataset that has 20 lines, with each line containing 21 measures for each dog.  Either way, we have 21 measurements for each experimental unit.  The big challenge of a repeated measures analysis is to recognize that the variation within the experimental unit, dog in this example, needs to be accounted for, before looking at the differences between the treatments, diets in this case.

Let me use my data cloud visual to try to explain.  We have 420 measures in our experiment – let’s throw this data up and think of it as a big cloud of data.  With our analysis, the goal is to partition that cloud into the treatment groups and hopefully be able to see distinct treatment groups.  However, we have 21 measurements for each dog, and we want to ensure that when we start to look at treatment effects, we keep those 21 measures for each dog together as a unit.  Remember we only have 20 experimental units, and that’s where we should be concentrating when we look for treatment effects.  We do NOT have 420 experimental units!

No matter what statistical software package you use, there will be options to identify your experimental unit!  You need to find it.
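To make the bookkeeping concrete, here is a small Python sketch of the dog example.  The diet labels, means, and variances are all made up for illustration; the point is simply that the 420 rows of data collapse to 20 experimental-unit summaries before any treatment comparison is made:

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical long-format dataset: 20 dogs, 2 diets, 21 daily intake measures
rows = []
for dog in range(20):
    diet = "A" if dog < 10 else "B"
    dog_level = random.gauss(400 if diet == "A" else 420, 15)  # between-dog variation
    for day in range(1, 22):
        rows.append({"dog": dog, "diet": diet, "day": day,
                     "intake": dog_level + random.gauss(0, 5)})  # within-dog variation

print(len(rows))  # 420 lines of data, but NOT 420 experimental units

# Collapse to one summary value per experimental unit (dog) before
# comparing diets, so the 21 within-dog measures stay together
by_dog = {}
for r in rows:
    by_dog.setdefault((r["dog"], r["diet"]), []).append(r["intake"])

diet_means = {"A": [], "B": []}
for (dog, diet), intakes in by_dog.items():
    diet_means[diet].append(mean(intakes))

print(len(diet_means["A"]), len(diet_means["B"]))  # 10 dogs per diet
```

In a real analysis you would let a mixed model account for the within-dog correlation rather than simply averaging, but either way the treatment comparison rests on 20 dogs, not 420 rows.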

Can you think of trials or studies that you have done in the past or will be doing in the future?  Are they longitudinal or repeated measures studies?

Questions to ask to help you determine if you have a longitudinal or repeated measures study:

  1. What is your experimental unit?
  2. Is your experimental unit being measured more than once?


Crimes of Statistics: History and controversy surrounding the p-value

For this week’s Crimes of Statistics COP session, I’d like to have an open discussion about the history and the controversy surrounding the p-value.  I’ll write up my thoughts and some quotes here to discuss with you.  If you are unable to make this session and want to add your thoughts, please add a comment to this post, and we can start the discussion on-line as well.

So, how did the p-value come into existence?  Are you aware of the history of this commonly used aspect of statistics?  We all feel compelled to report the p-values from our analyses – ok, let’s be honest, in order to get our research published, we NEED to publish our p-values.  But how many of us have conducted research that we have NOT published because it didn’t measure up to the 0.05 mark?  We have been taught, and maybe brain-washed, to believe that our results must have a p-value that is less than 0.05.  I cannot count the number of times that I’ve worked with students who are so disappointed when their results do not yield this “significant” value – so many years of work and NOTHING is significant, and they feel that they can no longer publish their results except in their thesis.

But, let’s take a closer look at this p-value and how it came about.  I’ll admit that I’ve always been fascinated by the p-value and the “magical” powers of “0.05”.  Many of you have heard me say that there is nothing magical about 0.05, and I’m sure you’ve thought, oh…  she’s off her rocker – I’ve been taught from Day 1 in 2nd year statistics that in order for us to talk about our stats, we need that p-value that is < 0.05.  Well…  let’s talk about this for a bit.

Here is a famous quote from Fisher (1926) that started it all:

“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 percent point), or one in a hundred (the 1 percent point).  Personally, the writer prefers to set a low standard of significance at the 5 percent point, and ignore entirely all results which fail to reach this level.  A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”

Yup!  This is how it all started!  If you read more about this history, you will discover that Fisher did not intend for his statement to be interpreted and used the way it is today.

It is amazing that a statement made in a journal article in 1926 still guides us today – we still use the 5% rule.  We have lots of research from the past century (yes, 92 years!) that has been hidden away because it never made the cut.  What does this mean?  What are the implications??

Let’s continue down that road for a moment: we recognize and acknowledge that most publications report results where p < 0.05.  In other words, we tend to publish only results that are significant, right?  Who is really interested in reading a study where there are no significant results?  Well, how about this quote from Moore (1979) talking about this very challenge or problem:

“Such a publication policy impedes the spread of knowledge.  If a researcher has good reason to suspect that an effect is present, and then fails to find significant evidence of it, that may be interesting news.  Perhaps more interesting than if evidence in favour of the effect at 5% level had been found.”

Hmm…  we are dealing with the same issue in 2018.  What are your thoughts?  Let’s talk about the implications of this.  And how do we define 0.05?  If we stick to that rigid line of 0.05, how do YOU handle a p-value of 0.045 or one of 0.054?

 

I’m looking forward to a great discussion!  See you tomorrow, Wednesday, January 18, 2018 in ANNU Rm 101 at 9:30am.


Fisher, R.A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain 33: 503-513.

Moore, D.S. (1979). Statistics: Concepts and Controversies. San Francisco: Freeman.

Crimes of Statistics: Power

To consider the POWER of your statistical analysis, we need to take a step back and talk briefly about Hypothesis tests and their relationship with POWER.

Remember how you start your research?  With a hypothesis.  For our little example we will have a hypothesis statement that says the mean height of cats is equal to the mean height of dogs.  The alternate hypothesis would then say that the mean height of cats is not equal to the mean height of dogs.

Ho: µcats  = µdogs
Ha: µcats  ≠ µdogs

We are using an alpha value of 5%, therefore our significance threshold is 0.05.  We went out to measure 4 cats and 4 dogs and their height measurements (inches) are:
Cats:  11, 13, 11, 14
Dogs:  24, 21, 18, 28

The mean height for cats is 12.3 with a standard deviation of 1.5
The mean height for dogs is 22.8 with a standard deviation of 4.3

I can conduct a t-test and it provides me with a p-value of 0.02.  With data such as this I can also calculate the variation around the mean, such that I have 10.8-13.8 (12.3 ± 1.5) for the cats and 18.5-27.1 (22.8 ± 4.3) for the dogs.  Do the ranges overlap? No.

What conclusion do we draw?
That we will reject the Null hypothesis and state that dogs are significantly taller than cats by an average of 10″.

Sounds great right?  We did expect that the dogs would be taller than cats.  So right from the beginning, in this example, our experience and knowledge of cats and dogs, told us  that the Null hypothesis was false – and with our little sample we proved it!
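As a side check on that conclusion, here is a short Python sketch of an exact permutation test on the same heights.  This is an alternative to the t-test, not the test used above, but with 4 animals per group it is easy to enumerate every possible relabelling and see how extreme our observed difference really is:

```python
from itertools import combinations
from statistics import mean

cats = [11, 13, 11, 14]
dogs = [24, 21, 18, 28]
pooled = cats + dogs
observed = abs(mean(dogs) - mean(cats))  # observed mean difference in inches

# Exact permutation test: relabel every possible group of 4 of the 8 animals
# as "cats" and count how often the group difference is at least as extreme
extreme = total = 0
for idx in combinations(range(len(pooled)), 4):
    group1 = [pooled[i] for i in idx]
    group2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(mean(group2) - mean(group1)) >= observed:
        extreme += 1

print(extreme, total, round(extreme / total, 3))  # 2 70 0.029
```

Because every dog is taller than every cat, only the original split (and its mirror image) is as extreme as what we observed, giving p = 2/70 ≈ 0.029 – the same ballpark as the t-test p-value of 0.02.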

Let’s review this table – in our case we were working with a Ho that we knew to be false and we rejected the Ho – so we have NO ERROR.

                               Ho is TRUE           Ho is FALSE
REJECT the NULL Hypothesis     Type I error         No error
                               (ALPHA)              (POWER = 1-BETA)
ACCEPT the NULL Hypothesis     No error             Type II error
                               (1-ALPHA)            (BETA)

We’re going to repeat this experiment and measure another 8 animals – 4 cats and 4 dogs.

Ho: µcats  = µdogs
Ha: µcats  ≠ µdogs

We are again using an alpha value of 5%, therefore our significance threshold is 0.05.  We have height measurements (inches) of 4 cats and 4 dogs:
Cats:  21, 13, 11, 14
Dogs:  23, 21, 18, 14

The mean height for cats is 14.8 with a standard deviation of 4.3
The mean height for dogs is 19.0 with a standard deviation of 3.9

I can conduct a t-test and it provides me with a p-value of 0.19.  With data such as this I can calculate the variation around the mean, such that I have 10.5-19.1 (14.8 ± 4.3) for the cats and 15.1-22.9 (19.0 ± 3.9) for the dogs.  Do the ranges overlap?  Yes.

What conclusion do we draw?
That we will NOT reject the Null hypothesis and state that the average height of cats and dogs is the same.

Are we comfortable with this?  If you review the table presented above – we still have a FALSE Ho, and this time around we did NOT reject the Null hypothesis – leading us to commit a Type II or Beta error.

A Type II error is directly related to the POWER of the test.  By definition, the power of a statistical test is the probability that the test will correctly reject the null hypothesis when it is false.

POWER is related to a number of factors:

  • sample size
  • effect size – or the size of the difference between treatment groups
  • variation of our outcome variable
  • level of significance – alpha

Consider our example above: what factors could be changed to increase the POWER of our test and ensure that we won’t see results like those from the second time we collected data?

  • Sample size – measure more than 4 animals per group

There are several ways to calculate the POWER of a statistical test.  SAS has 2 PROCs – Proc POWER and Proc GLMPOWER.  Review the SASsy Fridays post on these.  There are many links to online calculators as well.  Please choose one that is defensible.
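To see how those factors interact, here is a rough Python sketch of a normal-approximation power calculation for a two-sided, two-sample comparison.  The `approx_power` helper is illustrative, not from SAS or any package, and because it uses z rather than t critical values it slightly overstates power at very small sample sizes:

```python
import math
from statistics import NormalDist

def approx_power(mean_diff, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample comparison.

    Rough sketch only: uses z critical values instead of t, so it
    overstates power somewhat when n per group is very small.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)          # two-sided cutoff
    se = sd * math.sqrt(2 / n_per_group)                  # SE of the difference
    noncentrality = abs(mean_diff) / se                   # standardized effect
    return 1 - NormalDist().cdf(z_crit - noncentrality)

# Second cat/dog experiment: difference ~4.2", pooled sd ~4.1, n = 4 per group
print(approx_power(4.2, 4.1, 4))
# Same effect size and variation, but 20 animals per group
print(approx_power(4.2, 4.1, 20))
```

With only 4 animals per group the approximate power is around 30%, so failing to reject Ho is not surprising; bumping the sample size to 20 per group pushes power up toward 90% for the same effect size and variation.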