Lots of people have problems with stats. Why? For one thing, algebra is a prerequisite for doing statistics. If you have trouble with algebra, especially with how to read formulas and do all the operations in the right order, you’re going to have trouble with statistics. Another problem is that statistics uses numbers to represent concepts that represent things in the real world. Many people are good at math but don’t understand the concepts, or understand the concepts but can’t do the math. This guide is designed to help students having trouble with statistics, mostly by reviewing the concepts involved. The math is just an extension of the concepts, so learn the concepts first and the math will be a lot easier. Your goal in this guide should be to read each section until the stuff seems like common sense to you.
What We Can Learn From Stats
We can do two things with statistics: describe things we do know or infer things that we don’t know.
If you have a bunch of balloons and you count how many there are of each color, that’s a descriptive use of statistics.
Imagine that there’s a dark room, and it’s full of thousands of helium balloons. You can go pull some balloons out and look at them, but you can’t get them all. You’ve only got a sample and you’ve got to figure out what the sample says about the whole. Let’s say you pull out 20 balloons: 10 are red and 10 are black. You can infer that about half the balloons in the dark room are red and about half of them are black. That’s inferential statistics. It sounds a lot like the kind of guessing everybody does all the time, and they don’t need to use math for it. There’s one important difference: when you use statistics to make that guess, you know the probability that your guess is wrong.
Test: Which is an inferential use of statistics?
- A. Making people feel stupid.
- B. Pulling all the change out of your pockets and counting it.
- C. Asking 100 kids if they’re on drugs and guessing what percentage of kids around the country are on drugs.
- D. Giving kids change to buy you drugs so you can feel stupid.
Types of Measurements and Distributions
If we want to do stats, we have to find a way of measuring observations that other people will understand. No one wants to read a scientific paper about how many people are “skanky” unless they know what your exact criterion for calling people skanky is. There are 4 ways of classifying observations that scientists like:
Nominal– A nominal value is a name. It has no numeric value (though it may be a number). For instance, say you have two personality types, named “introvert” and “extrovert”, and you define those types in such a way that every person you’re studying is either one or the other. Those types are nominal values. You can tell it’s nominal because you can change the name, and it wouldn’t affect your study. You could call them “shy” and “talkative” or “type A” and “type B” or “pickle” and “monkey” and the stats would be exactly the same.
Ordinal– This is a measure of how an observation ranks among other observations. An ordinal measurement has no meaning on its own, only compared to the other scores. Like in a race: you know you came in fifth, but that doesn’t tell you how many miles-per-hour you ran, just that you ran faster than the sixth person and slower than the fourth. You can tell that a measurement is ordinal because the number loses its meaning if you take it outside of the current sample. For example, you know that someone placed third in the Miss Sideshow Freak Pageant. Does that tell you anything about their “value” compared with Miss America contestants? No, because it’s an ordinal measurement.
Interval– This is a measure of value on an actual “scale”. It’s called interval because equal intervals have equal value. Date is an interval scale, and that means that the difference between 1400AD and 1405AD (a 5 year interval) is the same as the difference between 2000AD and 2005AD. In that way, interval is like ratio (see below). Yet interval scales do not have a “true zero” that represents zero value. The zero point of the calendar doesn’t mean zero time, it’s just another point on the scale. 1000AD is not twice as much time as 500AD.
Ratio– Like interval, a ratio has equal intervals. Unlike an interval measurement, a ratio does have a true zero that represents zero value. Height is a ratio value. 0 ft. means no distance. And thus a 10 ft. tall person is twice as tall as a 5 ft. tall person.
So here are the tests: Can you replace the labels that you call things, and still have basically the same data? If so, then it’s nominal. Do the labels rank the subject in comparison to your other subjects, but not in comparison to anything else? If so, then it’s ordinal. Do 5 points of difference mean the same thing at any point on the scale? If so, then it’s either interval or ratio. If 40 on the scale has twice the value of 20, then it’s ratio, and if not then it’s interval.
Whatever you’re trying to measure is the variable. The variable is a value that stays the same regardless of what system you use to measure it. I could measure your height by labeling you as tall or short (nominal) or by ranking you against other people (ordinal) or by actually counting the number of feet (ratio), but your actual real height, which is the real variable, stays the same.
Variables can be continuous, which means there are infinite possibilities between any two points on the scale. Between five feet and six feet, there are an infinite number of possibilities. Or, variables can be discrete, which means there are a limited number of possibilities. When your variable is the number of balloons, there are only a few possibilities between five balloons and ten balloons.
Test: Which of these is an interval measurement?
- A. Where a candidate ranks in popularity among other candidates.
- B. How much more or less the candidate’s budget is compared to the historical average for US candidates.
- C. Number of votes received in an election.
- D. Democrat or Republican.
A distribution is just a way of writing out a bunch of statistics. It’s a list, like an inventory. There are a bunch of types of distributions, but the most common one is a “frequency” distribution. This type lists every possibility, and shows how many of each type there are. Let’s say you own a store, and you need to do an inventory. This might be your distribution for cans of soup:
|type of soup |number of cans

|size of tree |number of trees
You’ll notice that because height of tree is a continuous value (also a ratio), we can’t list every possible tree height, so we create arbitrary ranges.
f is a symbol that means frequency of occurrence, which means “how many times have we seen it” or “how many of these do we have”. So, if you want to make a smartass distribution, instead of saying “number of cans” say “f”. Another thing you can do to fancy up your table is to add a relative frequency. Relative just means compared to the whole. In a relative distribution, instead of just counting how many times each thing occurs, we calculate what percentage of the total it is. Like this:
|type of soup
If you want to make it even fancier, you can make it a cumulative frequency distribution. When you do this, you put in a column that says how many things are equal to or less than the current group. You can even add a cumulative percentage frequency, which is just what percentage the cumulative frequency is of the whole. As an example:
|size of tree
|cumulative relative f
(17 trees are this size or less)
(7.5% of trees are this size or less)
(140 trees are this size or less)
(61.6% of trees are this size or less)
(227 trees are this size or less)
(100% of the trees are this size or less)
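The three kinds of columns described above (f, relative f, and the cumulative versions) can be sketched in a few lines of Python. This is only an illustration: the soup counts and the tree-height ranges are made-up numbers, not the ones from the original tables.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical inventory: one entry per can on the shelf.
cans = ["tomato"] * 12 + ["chicken noodle"] * 6 + ["minestrone"] * 2

f = Counter(cans)                       # frequency: how many of each type
total = sum(f.values())                 # the total number of cans
relative_f = {soup: count / total for soup, count in f.items()}

# Cumulative columns only make sense when the groups have an order, so this
# part uses hypothetical tree-height ranges listed from shortest to tallest.
tree_f = {"0-10 ft": 5, "10-20 ft": 10, "20-30 ft": 5}
cumulative_f = dict(zip(tree_f, accumulate(tree_f.values())))
n_trees = sum(tree_f.values())
cumulative_relative_f = {size: c / n_trees for size, c in cumulative_f.items()}

for size in tree_f:
    print(f"{size:>9}  f={tree_f[size]:>2}  cumulative f={cumulative_f[size]:>2}  "
          f"cumulative relative f={cumulative_relative_f[size]:.0%}")
```

The last row of any cumulative column always equals the whole: the total count for cumulative f, and 100% for cumulative relative f.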
Measures of Central Tendency
Sometimes it is useful to have a number which represents what the majority of the scores are doing. When a yokel sees a basketball team and says “these guys are pretty damn tall!” that’s a measure of central tendency. The measures of central tendency scientists like are mean, median and mode.
Mean is just another word for average. Add up all the scores and divide that number by n (the number of scores). If you want to find the average height in your family, add up the height of everyone in your family and divide by the number of people in your family. This is the most commonly used measure of central tendency because it is affected by every member of the population. For exactly the same reason, though, it can screw you up. Let’s say most of the people in your family are around 5 ft. tall, but you have a sister who is 20 ft. tall. This one very odd “outlier” would “skew” the measurement towards tallness.
Median is the middle score when the scores are put in order of value. If there is an odd number of scores, the median is the ((n+1)/2)th score. If there is an even number of scores, the median is the value halfway between the (n/2)th and (n/2+1)th scores. The median doesn’t use as much of the information in the scores as the mean does, but it is not affected by extreme scores. If you’re using this method, you would line your whole family up in order of height, pick the center person, measure that person, and that would be your median.
Mode is the value that is found most frequently in the distribution. If two values are tied for the highest frequency, then the scores are bimodal. If everyone in your family, except your 20 ft. sister, is between 5 and 6 ft., then that range is the mode. Hardly anyone uses mode; the only reason we mention it is because statisticians like saying “mean, median, mode.”
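To see how one outlier drags the mean around while leaving the median and mode alone, here is a small sketch using Python’s standard `statistics` module. The family heights are invented for the example.

```python
import statistics

# Hypothetical family heights in feet, including the 20 ft. sister.
heights = [5.0, 5.0, 5.5, 6.0, 20.0]

mean = statistics.mean(heights)      # every score counts, so the outlier pulls it up
median = statistics.median(heights)  # middle score once the list is sorted
mode = statistics.mode(heights)      # the most frequent value

print(f"mean={mean:.1f} ft, median={median:.1f} ft, mode={mode:.1f} ft")
```

With these numbers the mean lands above 8 ft., taller than four of the five family members, while the median and mode stay down where most of the family actually is.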
Test: There are 10 people in my family. The mean number of monkeys owned by each person is 50. The mode is 0 monkeys and the median is 0 monkeys. What does that tell you about my family?
- A. That most people in your family have about 50 monkeys.
- B. That some people in your family have no monkeys, some have a lot, and the majority have only a few.
- C. That most people in your family have no monkeys, but there is a minority that have a whole lot.
- D. That your family members are actually monkeys.
Measures of Variability
Once we’ve established the central tendency of a group of scores, we’ve got to wonder how much our scores vary from that central tendency. The average person in a group might be 5’9″, but are the rest of the people approximately the same or are they all very short people and basketball players? The simplest and most obvious measure of variability is the range: the distance between the lowest and highest scores. However, this only measures the most extreme members of the group.
Statisticians prefer the sum of squares. To find the sum of squares, you figure out how far everyone is from the mean. The average person in your family is 5’9″. One person might be 1 ft. less than that, another might be 2 inches less, another might be 6 inches more, another might be 14 ft. more. Now, take each of those differences and square it (multiply it by itself). One thing you’ll notice is that the negative numbers become positive (a negative squared is positive). Now, add everything up and you have the sum of squares. In mathematical terms:

$$SS_X = \sum (X - \bar{X})^2$$

The $\sum$ means “sum of,” $\bar{X}$ means the mean of the scores and, of course, $SS_X$ is the sum of squares. Of course, this is a total, so as n gets bigger, it gets bigger. If you are measuring the sum of squares for a group of four people, it might be 8 square feet altogether, but if you are measuring the sum of squares for a hundred people there might be 90 square feet worth of variation.
So, take your sum of squares and divide by the number of scores you’re measuring. Then, since we had squared everything earlier, now we take the square root of the whole thing. You’re basically measuring the typical distance between each member of this group and the average of the whole group. What you get is the standard deviation, symbolized by s. Here’s the mathematical formula for finding s:

$$s = \sqrt{\frac{SS_X}{n}}$$

Test: I’m testing vitamin supplements by measuring the average change in lifespan (in years) of the people who take each one. Which one do you want?
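The recipe above (deviations from the mean, square them, add them up, divide by n, take the square root) can be followed step by step in Python. The heights are made up. Note that this sketch divides by n, as the guide does; some textbooks divide by n - 1 when working from a sample, which is a different formula.

```python
import math
import statistics

heights = [5.0, 5.5, 6.0, 6.5, 7.0]   # hypothetical heights in feet

n = len(heights)
mean = sum(heights) / n
deviations = [x - mean for x in heights]       # how far each score is from the mean
sum_of_squares = sum(d ** 2 for d in deviations)
s = math.sqrt(sum_of_squares / n)              # standard deviation, dividing by n

# The standard library's population standard deviation uses the same formula.
library_s = statistics.pstdev(heights)

print(f"SS = {sum_of_squares}, s = {s:.4f}, pstdev = {library_s:.4f}")
```

Comparing against `statistics.pstdev` is just a sanity check that the hand-rolled steps match the usual divide-by-n definition.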
- A. One with a mean of +3 and s of 12
- B. One with a mean of +1 and s of .001
- C. One with a mean of -6 and s of 6
- D. One with a mean of 0 and s of -5
Having trouble with measures of variability? Try checking out Measures of Central Tendency.
Proportions
Many statistical applications use proportions. Proportions are very simple: they’re just like percentages except without the hundred. 0% is equal to a proportion of 0, 50% is equal to .5, 100% is equal to 1, 250% is equal to 2.5, etc. We use proportions because they are easier to work with mathematically. If .37 of a population of 500 likes Jell-O, then we can say that .37×500, or 185 people, like Jell-O. If .37 of the population likes Jell-O, then we can also say that 1 - .37, or .63, of the people don’t like Jell-O.
Test: Out of ten people, 2 are allergic to monkeys. Using proportions, which is true?
- A. .2 are allergic to monkeys, .8 are not.
- B. 20% are allergic to monkeys, 40% are not
- C. 5.0 are allergic to monkeys, 0.2 are not.
- D. Monkeys!
Probability
Statisticians are very interested in the probability that a certain thing will happen, for reasons we will go into later. So there is a language in statistics for defining probabilities. All probabilities range between 0 and 1. A probability of 0 means the thing will never happen, .5 means it will happen half of the time and 1 means it will always happen. For a coin, the probability of getting heads is .5. So, if you flip a coin 20 times, a statistician’s best bet is that you’ll get 10 heads (20 x .5).
To figure out a probability, figure out all the possible outcomes (heads or tails), find out how many of those outcomes meet your criterion for success (in this case, only heads gives us a success) and then divide the number of good outcomes by the number of possible outcomes: 1/2 = .5. This works as long as every outcome is equally likely. Let’s look at another example: on a normal die with the possible values of 1, 2, 3, 4, 5 and 6, what are the chances of getting an even number? There are 3 outcomes that fit our criterion, divided by 6 possible, which is .5. Or what if you were asked to flip 2 coins: what is the probability of getting 2 heads? The possible outcomes are Head-Head, Head-Tail, Tail-Tail and Tail-Head. That’s one good outcome out of four possible, or .25. The same answer holds true if you were asked the chance of flipping a coin and getting a head and then flipping it again and getting another head.
In probability, there’s a big difference between AND and OR. Let’s say you’re flipping 3 coins. I could say “you have to get them ALL heads, or it’s into the pit of doom” or my less-evil twin could say “you have to get at least one of them heads or it’s into the pit of doom.” What I’m really saying is that you have to get “a head AND a head AND a head” but what my less-evil twin is saying is that you have to get “a head OR a head OR a head.”
When you’re figuring out probability, whenever there is an AND, you multiply (as long as the two events don’t affect each other). The probability of getting struck by lightning is, let’s say, .000005, while the probability of winning the lottery is .000002. What is the chance of both happening to you? In other words, what is the chance that you will get hit by lightning AND win the lottery? Well, AND means multiply, so multiply .000005 by .000002, and you’ll get .00000000001.
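The counting method above (good outcomes divided by possible outcomes) is easy to check by having Python list every outcome. This is a sketch that assumes every outcome is equally likely; the helper name `probability` is made up for the example.

```python
from fractions import Fraction
from itertools import product

def probability(outcomes, is_success):
    """Good outcomes divided by possible outcomes (all assumed equally likely)."""
    outcomes = list(outcomes)
    good = sum(1 for outcome in outcomes if is_success(outcome))
    return Fraction(good, len(outcomes))

# One coin: only heads counts as a success.
p_heads = probability("HT", lambda o: o == "H")

# One six-sided die: any even number counts as a success.
p_even = probability(range(1, 7), lambda o: o % 2 == 0)

# Two coins: the four outcomes are HH, HT, TH, TT; only HH is a success.
p_two_heads = probability(product("HT", repeat=2), lambda o: o == ("H", "H"))

print(p_heads, p_even, p_two_heads)
```

Using `Fraction` keeps the answers exact (1/2, 1/2 and 1/4) instead of rounded decimals.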
On the other hand, what if the question was OR? There are two ways to deal with ORs. You have to figure out whether you’re dealing with an exclusive OR or a non-exclusive OR. In an exclusive OR, when one condition happens, it means the other condition can’t happen. These are the simplest because you can simply add probabilities. In a non-exclusive OR, one condition can happen, or another condition could happen, or they could both happen. These are slightly more complicated to deal with.
Here is an exclusive OR: If you pick one marble out of a bag containing 2 red, 1 blue and 1 black marble, what are the chances the marble will be either blue OR black? We know that the chance of getting a blue marble is .25 and the chance of getting a black marble is .25 so we can simply add the two and get .5.
Here is a non-exclusive OR: If you flip a coin three times, what is the chance that you’ll get at least one head? To figure out the answer, we’ve got to find the hidden AND in that question. Flip the question around: instead of asking “what are the chances you will get at least one head,” ask “what are the chances you won’t get any heads,” or, in other words, “what are the chances you will get all tails,” or “what are the chances you will get tails AND tails AND tails.” Well, we know how to figure that out: it’s .5 times .5 times .5, or 0.125. That’s the chance we’ll fail to get at least one head. But what is the chance we’ll succeed? Well, it would be 1 minus 0.125, or 0.875.
Don’t be fooled into thinking that probability is reality. There is no such thing as a .5 chance that can be measured. Probability only exists in an imaginary world where you can do something an infinite number of times and figure out what proportion of those times a particular thing happened. Probability is a fiction, but it is an incredibly useful fiction.
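The AND and OR rules from the last few paragraphs can be checked against a brute-force count. The sketch below assumes a fair coin and uses the marble bag from the exclusive OR example; the variable names are just for illustration.

```python
from itertools import product

# AND: multiply. Tails AND tails AND tails with a fair coin.
p_tail = 0.5
p_all_tails = p_tail * p_tail * p_tail

# Non-exclusive OR ("at least one head"): 1 minus the chance of no heads at all.
p_at_least_one_head = 1 - p_all_tails

# Brute-force check: list all 8 three-flip sequences and count those with a head.
sequences = list(product("HT", repeat=3))
with_a_head = sum(1 for seq in sequences if "H" in seq)
brute_force = with_a_head / len(sequences)

# Exclusive OR: one marble can't be both blue and black, so just add.
bag = ["red", "red", "blue", "black"]
p_blue_or_black = bag.count("blue") / len(bag) + bag.count("black") / len(bag)

print(p_all_tails, p_at_least_one_head, brute_force, p_blue_or_black)
```

Seven of the eight possible sequences contain at least one head, which matches 1 minus 0.125.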
Test: I have developed three automated gambling machines. You put in money, and it either gives back money or it gives you nothing. Which machine do you put your money in?
- A. You put in $20 and it has a probability of 1 of giving you $20 back.
- B. You put in $20 and it has a .25 probability of giving you $50 back.
- C. You put in $20. It has a .25 chance of giving you the $20 back. It has a .5 chance of giving you $40 back.
- D. You put in $20 and it has a probability of 0 of giving you ten million dollars.
Test: You’re going to flip a coin three times. Which formula will tell you the probability of getting either 3 heads in a row or 3 tails in a row?
- A. ( .5 x .5 x .5 ) + ( .5 x .5 x .5 ) = .25
- B. 1 / 3 = .33
- C. ( .5 x .5 x .5 ) x ( .5 x .5 x .5 ) = .01
- D. ( 1 + .5 + .25 ) + ( 1 + .5 + .25 ) = 3.5
Having trouble with probability? Are you sure you’ve got Proportions down?
Sampling
As I mentioned before, the strong point of statistics is that when we make educated guesses we know the chances our guesses are right. This is in part because we know how well the information we do know represents the information we don’t know. If a space alien wants to know about college students and grabs one, how much will studying that one college student tell him about all college students? What are the chances that the college student he chose is so abnormal that what the alien learns about college students in general will be completely wrong? What if he lands his spaceship in the quad, opens the doors and says “anyone want to come in and get probed?” Will the people who come into the ship accurately represent college students as a whole? What if he landed at Harvard or Bakersfield Junior College: will those colleges give him a good representation of college students worldwide? These are the types of questions we have to worry about when we do sampling. Specifically, we have to worry about how many are selected and how they are selected.
We sample a population because it is inconvenient to make observations about the entire population. The problem is that there’s always a chance of picking a bunch of weirdoes that do not accurately represent the central tendency of the population. If you had a bag of red and black marbles and you reached in and the four you pulled out were all red, you might assume that all the marbles in the bag were red, which would be an error. Scientists are very afraid of this kind of error. They would rather admit to not knowing what the marbles in the bag are than make an assumption about that bag that turns out to be wrong. If we increase the number that we take out, it decreases the chance of us getting all red marbles. As we saw in probability, the chance of getting 10 red marbles from a big bag of half red and half black marbles is about .5 to the 10th power, or 0.00097656.
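Here is a quick sketch of how fast the chance of an all-red sample shrinks as the sample grows. It assumes each draw has the same .5 chance of being red (a very large bag, or putting each marble back before the next draw); the function name is invented for the example.

```python
def chance_all_red(sample_size, p_red=0.5):
    """Chance that every marble in the sample is red, assuming each draw
    independently has probability p_red of being red."""
    return p_red ** sample_size

for sample_size in (1, 4, 10, 20):
    print(f"{sample_size:>2} marbles: {chance_all_red(sample_size):.8f}")
```

With four marbles the misleading all-red sample still turns up about 6% of the time; with ten it is already down to roughly one time in a thousand.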
The more variance we have in a population, the larger a sample we need in order to be accurate. For instance: a space alien who took ten random college students might be able to make some pretty valid guesses as to what all college students are like. But let’s say the space alien took 10 animals at random, would that really be a big enough sample for the alien to say something about all life on this planet? Chances are, they would all be bugs and the alien would decide that the planet has no intelligent life.
Another thing we have to worry about is whether the sample is random, because if it isn’t then you aren’t really taking a sample of the population but a sample of the people in the population who are likely to be sampled. If our space alien lands and asks people to come in, he’s going to get a good sample of people willing to wander aboard an alien spaceship but not a good sample of all people. The problem is, we can take an unrepresentative sample and not even know it. If a researcher stands in the quad asking people questions, he or she might unconsciously choose good looking people to approach, and what if good looking people have a different central tendency than the population as a whole in terms of the thing you’re researching? The only way we can be confident that we’ve got a good sample is if our sample is completely random. Completely random means that every member of the population has an equal chance of being chosen. A truly random sample is an ideal that is hard to actually attain. Say you’re trying to study the effects of drugs on random members of the population. In order for your sample to be truly random, you’d have to make sure that: 1- A hermit living in a cave in Tibet has an equal chance of being picked for your study as a guy hanging around your research facility. And 2- People who are chosen randomly have to participate: they can’t say no. If they can say no, then it’s no longer truly random.
Having trouble understanding sampling? Maybe you should review probability and measures of variability.
Imagine you flipped five pennies and counted the number of heads and the number of tails. You might get all heads, or you might get all tails, but probably you’d get some kind of mix. To check this, suppose you flipped five coins a hundred times, gave heads a value of +1 and tails a value of -1, added up the five values each time and put all your results on a frequency distribution graph. You might have a few on the -5 end and a few on the +5 end, but most would be right in the middle. It would look something like a jagged hump, tallest in the middle.
The more times you flipped, the less jagged the curve would be. And if you used an infinite number of coins and flipped them an infinite number of times, you would get a perfect normal curve (sometimes called a bell curve).
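The coin experiment is easy to simulate. The sketch below flips five coins a hundred times, scores heads as +1 and tails as -1, and prints a sideways bar chart of how often each total came up. The function name and the fixed random seed are choices made for the example, and the exact bar lengths depend on the seed.

```python
import random
from collections import Counter

def flip_scores(trials, coins=5, seed=0):
    """Flip `coins` coins `trials` times; each trial's score is +1 per head, -1 per tail."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    scores = Counter()
    for _ in range(trials):
        score = sum(rng.choice((1, -1)) for _ in range(coins))
        scores[score] += 1
    return scores

scores = flip_scores(100)

# With five coins the only possible totals are -5, -3, -1, +1, +3 and +5.
for value in range(-5, 6, 2):
    print(f"{value:+d} {'#' * scores[value]}")
```

Raising `trials` (or `coins`) should make the shape smoother and more bell-like, which is the point the text makes about the curve getting less jagged.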
So why are normal curves so great? Normal curves (or at least approximations of normal curves) show up in a lot of places, from height to test scores. Think of something that occurs a lot, go out and measure it and graph it, and you’ll probably end up with a normal curve. Statisticians have come up with tables that tell us all about normal curves. We can say that on this particular point of the curve, 95% of the population will be greater and 5% will be less.
In order to fit a normal curve to some real live thing, we have to know two things: the mean and the standard deviation. The mean tells us where the center of our curve is. If the average height of people is 5’2″ then 5’2″ will be dead in the center of our normal curve. The standard deviation tells us the scale we’re working with. Let’s take two normal curves: the height of American five year olds, and the height of all minors. They might have the same mean: the average height of these two groups might be exactly the same. Yet the variance is very different: 95% of all 5 year olds will probably be within 6 inches of the mean, while 95% of all minors will probably be within 2 ft. of the mean. On a normal curve graph, you’ve got little measurements on the bottom. When there’s more variance, those measurements “stretch out” and when there’s less they “squish together.”
Now we need to know our Z-Score. The Z-Score was invented so we can say where on the normal curve a score is, no matter what kind of scale is being used, from height to SAT. A Z-Score tells us how far away a score is from the mean, measured in standard deviations. If a score is one and a half standard deviations below the mean, then its Z score is -1.5. Z-Scores let us know exactly where our score is on a normal curve table.
Okay, let’s say we figured out the height of all people ever. The mean (average) is 5’2″. The standard deviation is 4 inches. So, someone who is 5’10” is 8 inches more than the average, or 2 standard deviations more than the average, so that person’s Z score would be +2.
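The height example works out like this in code. Heights are converted to inches first (5’2″ is 62 inches, 5’10″ is 70 inches); the function name is just for illustration.

```python
def z_score(score, mean, standard_deviation):
    """How many standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / standard_deviation

# Mean height 5'2" (62 in.), standard deviation 4 in., person is 5'10" (70 in.).
z = z_score(70, mean=62, standard_deviation=4)
print(z)
```

A person shorter than the mean would get a negative Z score from the same function.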
A normal curve table usually gives us two pieces of information about our score: what proportion of cases lie beyond the score and what proportion lie between the score and the mean. And really, that’s all we need to know; anything else we can figure out. The table doesn’t know whether we’re above or below the mean, it just tells us the proportion in the direction of the nearest tail (proportion beyond) or the proportion of scores that go all the way to the mean. This lets us divide the one half of the normal curve we’re working with into two parts. And remember, the normal curve is perfectly symmetrical so either side has exactly half of the scores. So we know everything we need to know about the curve.
There are a few handy facts we can memorize about normal curves: about 68% of scores fall between Z scores of -1 and +1, about 95% fall between -2 and +2, and about 99.7% fall between -3 and +3.
Okay, so say you know your buddy Jim’s height is a +2 Z score (2 standard deviations above the mean). We grab the nearest normal curve table, and it says a Z score of 2 means .977. What does that tell us? That tells us that 97.7% of people are shorter than our buddy Jim. We can also subtract .977 from 1 and get .023. Only 2.3% of people are taller than Jim. If we really want to impress Jim, we can divide 1 by .023 (if you’re lucky, there’s a 1/x button on your calculator that will do this for you) and we can tell Jim that approximately 1 out of 43 people is taller than him.
Sometimes we will want to change a score from one scoring system to another. Say, for some reason, you wanted to convert your SAT score to a GPA. First, you would find out what the mean SAT score is, then find out the standard deviation and find out how many standard deviations above or below the mean your score is. In other words, find your Z score. Then find the mean and standard deviation for GPA. Multiply the Z score by the GPA standard deviation to find out how many points above or below the mean GPA your converted score lies, then add that to the mean GPA. As an equation, it would look like this:

$$Y = \bar{Y} + Z_X \, s_Y, \qquad Z_X = \frac{X - \bar{X}}{s_X}$$

Here X is the score you have (SAT) and Y is the equivalent score on the other scale (GPA).
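If there’s no normal curve table handy, Python’s standard library can stand in for one. The sketch below uses `statistics.NormalDist` (available in Python 3.8 and later) to look up Jim’s Z score of +2 and to check the handy percentages for 1, 2 and 3 standard deviations.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # the curve that Z scores live on

# Jim's Z score is +2.
shorter_than_jim = standard_normal.cdf(2)     # proportion below Z = 2
taller_than_jim = 1 - shorter_than_jim        # proportion beyond Z = 2
one_in = 1 / taller_than_jim

print(f"shorter: {shorter_than_jim:.3f}, taller: {taller_than_jim:.3f}, "
      f"about 1 in {one_in:.0f}")

# Proportion of scores within 1, 2 and 3 standard deviations of the mean.
for z in (1, 2, 3):
    within = standard_normal.cdf(z) - standard_normal.cdf(-z)
    print(f"between -{z} and +{z}: {within:.1%}")
```

Because this works from the unrounded proportion rather than .023, the “one in” figure comes out closer to 44 than to 43; the difference is only rounding.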
Of course, SAT score and GPA aren’t really equal. What you’re really saying is “if my GPA was as uncommonly good, or uncommonly bad, as my SAT score, what would it be?”
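The conversion is two steps: turn the score into a Z score on its own scale, then turn that Z score back into points on the other scale. Here is a sketch; the SAT and GPA means and standard deviations are invented for the example, not real figures.

```python
def convert_score(score, from_mean, from_sd, to_mean, to_sd):
    """Re-express a score on another scale by keeping its Z score the same."""
    z = (score - from_mean) / from_sd    # how unusual the score is on its own scale
    return to_mean + z * to_sd           # equally unusual score on the new scale

# Hypothetical numbers: SAT mean 1000, s = 200; GPA mean 3.0, s = 0.5.
equivalent_gpa = convert_score(1300, from_mean=1000, from_sd=200,
                               to_mean=3.0, to_sd=0.5)
print(equivalent_gpa)
```

With these made-up figures, an SAT score 1.5 standard deviations above the mean maps to a GPA 1.5 standard deviations above the mean.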
Test: 34% of subjects lie between z = 0 and z = 1. If the average person can bench-press 100 lbs., the standard deviation is 50 lbs. and you can bench-press 150 lbs., then what does that mean about you?
- A. 34% of people can bench press more than me.
- B. 34% of people are better than average at bench-pressing but can’t bench press more than me.
- C. 34% of people can bench press less than me.
- D. 34% of people are less bad than I am good but less good than I am good.
Correlation and Prediction
Are fat people really jollier than skinny people? Statisticians have nothing better to do than sit around all day wondering about silly questions like this and finding the answers. What a statistician would do is get a random sample of people, measure their weight and take an objective measure of their “jolliness.” Certainly, since this is people we’re talking about, we would expect a lot of variation. What we want to know is: do fat people tend to be jollier, and how strong is that tendency? What we do is run our data through a Pearson’s coefficient of correlation test. That test gives us a number called “r” which is the answer to our question. The value of r can range from -1 to +1. Let’s examine what each score would mean:
r=0: If we got an r of 0, then we would know that jolliness and weight have absolutely nothing to do with each other. We can’t tell anything about a person’s jolliness by knowing their weight and vice versa. Whether they’re skinny or fat, our best bet about a person’s jolliness is “average.” If you drew a graph of weight and jolliness, it would look like a shapeless cloud of points with no upward or downward slope.
r=1: Jolliness can predict weight without any error and vice versa. If you drew a graph of weight and jolliness, every point would fall on a single straight line sloping upward. Of course, one pound is probably not equal to one jolliness point, but it doesn’t need to be because we know the standard deviation for weight and jolliness. If someone is 1 standard deviation above the average in weight, then we can be sure that they’re one standard deviation above the mean in jolliness. As long as we’re comfortable with Z scores, we can predict any one score from any other score. If we know someone’s jolliness and nothing else about them, we can predict their weight and be 100% certain of being right.
r=-1: This is exactly the opposite. More weight equals less jolliness and vice versa. If a person is 1 standard deviation below the mean in weight they’ll be 1 standard deviation above the mean in jolliness. A correlation of -1 has the same strength as +1, meaning we can predict jolliness from weight just as exactly with a -1 as with a +1. On a graph of scores, every point would fall on a single straight line sloping downward.
r=+.62: Jolliness and weight are somewhat related. If we know one, we can make a guess about the other. It wouldn’t be a perfect guess but it would be better than nothing. Graphed, it would look something like a loose cloud of points that slopes upward. If we were to find a correlation between weight and jolliness, it would probably be somewhere in this area. Most correlations are not a perfect 1 or -1. Children’s age and height would probably be around .9 because we can make very good guesses about a child’s height from their age but we can’t make perfect guesses. Gender and mathematical ability for college students would probably be around .07: we know that there is some relationship between the two, but that relationship is very weak. We’d only be a tiny bit more successful trying to guess mathematical ability from gender than if we just guessed the average.
Still, if an r isn’t zero, we can try to guess one score from another, and if all we know is the r and the person’s weight, our guess on jolliness is the best guess we can make at the moment. First, let’s call the score we do know X and the score we want to guess Y. We first find the Z score for the score we do know and multiply that by the r for our correlation. This result is our best guess as to how far our predicted score differs from the mean of all scores of that type, in standard deviations. So, take the result, multiply it by the standard deviation of Y and add that to the mean of the Y’s. As an equation, that would look like this:

$$Y' = \bar{Y} + r \, Z_X \, s_Y$$
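The guide doesn’t spell out how r is calculated, so the sketch below uses the standard definition of Pearson’s r: convert both sets of scores to Z scores, multiply each pair together, and average the products. The function name and the weight and jolliness numbers are made up for the example.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's r as the average product of paired Z scores
    (using divide-by-n standard deviations)."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys))
    return sum(products) / n

weights = [120, 150, 180, 210, 240]       # hypothetical weights in pounds

print(pearson_r(weights, [1, 2, 3, 4, 5]))    # perfectly in step: r = +1
print(pearson_r(weights, [5, 4, 3, 2, 1]))    # perfectly opposite: r = -1
print(pearson_r(weights, [2, 1, 4, 3, 5]))    # related, but not perfectly
```

The third call shows the in-between case: jolliness mostly rises with weight, so r is positive but below 1.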
Y’ means predicted Y.
Note that this whole equation works equally well for a negative r or if the Z score is negative. If you use an r of 1, the equation turns into $Y' = \bar{Y} + Z_X \, s_Y$, the formula for transforming a score from one system to another. If you use an r of zero, you get $Y' = \bar{Y}$; in other words, our best guess is simply the mean of all Y scores.
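The prediction recipe (Z score of the known score, times r, times the standard deviation of Y, plus the mean of Y) can be sketched as a small function. The means and standard deviations for weight and jolliness below are invented for the example.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Best guess for Y given X and the correlation r between them."""
    z_x = (x - mean_x) / sd_x        # Z score of the score we do know
    return mean_y + r * z_x * sd_y   # shrink toward the mean of Y when r is weak

# Hypothetical scales: weight mean 160 lbs., s = 30; jolliness mean 50, s = 10.
# The person weighs 190 lbs., which is 1 standard deviation above the mean.
for r in (0.62, 1.0, 0.0, -1.0):
    guess = predict_y(190, mean_x=160, sd_x=30, mean_y=50, sd_y=10, r=r)
    print(f"r={r:+.2f}: predicted jolliness = {guess:.1f}")
```

With r = 1 the guess is a full standard deviation above the jolliness mean, with r = 0 it is just the mean, and with r = .62 it falls in between.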
Test: Imagine you’re at a party looking to get a little action. I’ve been measuring several aspects of people’s behavior at the party. I’ve found:
- r=-.8 between amount of alcohol drunk and getting someone to go home with you.
- r=0 between time spent dancing and getting someone to go home with you.
- r=.5 between number of cigarettes smoked and going home alone.
- r=-.7 between time spent in the hot tub and going home alone.
Which do you do?
- A. Drink more booze!
- B. Get my ass on the dance floor!
- C. Smoke up a storm!
- D. Dive into that hot tub!
Having trouble with correlation and prediction? It’s really important to understand Z-Scores.
Basic Statistical Tests
If there was no science and no statistics, all we would have is opinions. You might have a theory that fat people are jollier, a friend might believe that they aren’t. Both of you could yell “I’m right” for hours, but neither of you would have hard evidence to show you’re right. Science is an objective language for proving that your theory is right. The whole hypothesis, experiment, results crap is just a standardized way of saying “I think I’m right because…” Statistics is the language of saying “I’m probably right, and here’s the chance that I’m right.” For the most part, if you’re a scientist then other scientists don’t want to hear about your discovery unless you can show there’s a 95% chance that you’re right (more precisely, less than a 5% chance that results like yours would turn up by luck alone). That’s “significant results” in the science biz.
Statisticians have created these statistical tests that will let researchers plug in their numbers and find out if they met that 95% mark. The tests don’t tell you if the research was screwed up, only if the numbers seem to be telling the story that the researcher wants them to.
The problem these tests have to deal with is random variation. What if you grab a group of people with heart problems and put them randomly into two groups: the control group and the leech group. The control group just sits there for an hour a day, the leech group sits there for an hour a day with leeches on them. What if all the people in the leech group get noticeably better, while the people in the non-leech group don’t? Before you know it, every heart patient in the world is going in for leech treatments three times a day, then you find out that it was just the luck of the draw, that the people who got selected for the leech group, by chance, were the people who were going to get better. It had nothing to do with leeches, only individual variation. Now, you’ve put people around the world through a disgusting experience, endangered lives, and worst of all you’ve made scientists look like a bunch of bumbling dolts. That’s a type I error. You know in vampire movies when crosses burn vampires? Type I errors do the same thing to scientists.
What if, on the other hand, the people you put in the leech group just happen to be the really bad patients who are doomed no matter what, and you find that the leech group is no better than the control group, even though leeches are really a life-saving miracle cure? Now, you’ve ruled out a cure which might have saved millions of lives, but since you probably won’t publish negative results like this, you’re less likely to make scientists look like a bunch of idiots. We call this a type II error, and scientists don’t mind these nearly as much. You’re free to make as many type II errors as you want. (Of course, it’s not really you making the errors, it’s the statistical tests).
The difference between type I and type II errors is sort of like that old adage that the US justice system is designed such that “We would rather see 100 guilty men go free than see one innocent man put in prison.” Putting an innocent man in jail would be like a type I error, letting 100 serial killers loose would be a type II error. When we run a statistical test, we can actually set the chances of a type I error occurring. It’s usually set at .05 (95% chance you aren’t making a type I error) but you can set it anywhere you want. This variable is named α (alpha).
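To get a feel for what alpha buys you, here is a rough simulation sketch in Python. Everything in it is an assumption made for illustration: the population (mean 50, standard deviation 10), the group size of 20, and the use of a simple z test with a known standard deviation rather than any particular test from this guide. It runs the leech study thousands of times in a world where leeches do nothing, and counts how often the test cries “significant” anyway.

```python
import random
from statistics import NormalDist, mean

def false_alarm_rate(alpha=0.05, n=20, trials=2000, seed=1):
    """Share of do-nothing experiments that still look 'significant'."""
    rng = random.Random(seed)
    mu, sigma = 50.0, 10.0                       # made-up population
    z_cut = NormalDist().inv_cdf(1 - alpha / 2)  # two tailed cutoff
    se = sigma * (2 / n) ** 0.5                  # std. error of the difference
    hits = 0
    for _ in range(trials):
        # Both groups come from the SAME population: leeches do nothing.
        control = [rng.gauss(mu, sigma) for _ in range(n)]
        leeches = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (mean(leeches) - mean(control)) / se
        if abs(z) > z_cut:                       # a type I error
            hits += 1
    return hits / trials

print(false_alarm_rate(alpha=0.05))  # should land near .05
print(false_alarm_rate(alpha=0.20))  # should land near .20
```

The false alarm rate tracks whatever alpha you pick, which is the whole point: you choose in advance how often you are willing to be burned by a type I error.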
You can’t take a multiple choice test where the possible answers are “A.” All tests have options. Most statistical tests have 2 options, and decide between the two. One option is called the null hypothesis and the other is called the alternative hypothesis. The names are kind of stupid, but just think of it as a euphemism that scientists made up because they’re so afraid of type I errors that they don’t even want to refer to them by name.
The null hypothesis (HO) is the opposite of what we’re trying to prove. Most of the time, the null hypothesis will be that our independent variable does nothing and means nothing. In our leech study, the null hypothesis would be that leeches don’t do anything to help heart patients.
The alternate hypothesis (HA) is the opposite of the null hypothesis. The alternate hypothesis in our leech case is that leeches do have some effect on the health of heart patients. When you’re formulating your null and alternate hypotheses, make sure that you have every possibility covered: the leeches either do something, or they don’t, there’s no other option.
If your HA is that X is more than 5, then the HO would be that X is less than or equal to 5.
The statistical test chooses between the two hypotheses. If it chooses the HA when it shouldn’t have, that’s a type I error. If it chooses the HO when it shouldn’t have, that’s a type II error. So, as with our above adage, the HO is that the suspect is innocent, the HA is that the suspect is guilty.
There are some situations where we can choose between a one tailed or two tailed test. A one tailed test means we’re only looking in one direction. In a one tailed test, we don’t care if leeches make heart patients worse, or if they don’t change them at all, we only want to know if the leeches make the patients better. Our HO is that leeches do not make heart patients better. Our HA is that they do. In a two tailed test, we care about any difference, positive or negative. Our HO is that leeches don’t make any difference in the health of heart patients. Our HA is that they make some change, for better or worse. Two tailed tests are the most commonly used.
Once we’ve got that figured out, the alpha, the null and alternative hypotheses, whether we’re using a one or two tailed test, the rest of the work is done by the statistical test which is appropriate for our particular situation. But we have to figure out all this before our test, because if we don’t, then it will be us making the decision and not the statistical test, so we can’t be sure that the result is impartial.
The statistical test will tell us either that it’s too likely that our results mean nothing and occurred by random chance, or that it’s so unlikely that our results were chance that we can assume that they mean something. We start out before the test with the inherent assumption that HO is true, kind of like the “innocent until proven guilty” ideal. If the test says we don’t have enough difference to be sure it’s not random chance, then we can’t accept our HA and we’re left with the HO. HO wasn’t proven to be true, we just failed to prove anything else, just like when we let a suspect go, we haven’t proven that they’re innocent, we just failed to prove that they were guilty. If the test says our differences were too great to have occurred by random chance, though, then we can actually reject HO, and if HO isn’t true, the only other possibility is that HA is true. If we can prove beyond a reasonable doubt that the suspect is not innocent of the crime, we have to accept that he’s guilty.
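Here is a small Python sketch of that decision, and of how the one tailed versus two tailed choice changes it. It assumes the test result has already been boiled down to a z score on a normal curve. The function names `p_value` and `decide` and the z of 1.8 are hypothetical, picked only to show a case where the two kinds of test disagree.

```python
from statistics import NormalDist

def p_value(z, tails=2):
    """Chance of a z at least this extreme if HO is true."""
    if tails == 1:
        # One tailed: only the "patients got better" direction counts.
        return 1 - NormalDist().cdf(z)
    # Two tailed: a big difference in either direction counts.
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(z, alpha=0.05, tails=2):
    """Reject HO only if results this extreme are rarer than alpha."""
    return "reject HO" if p_value(z, tails) < alpha else "fail to reject HO"

z = 1.8  # hypothetical result
print(round(p_value(z, tails=1), 3), decide(z, tails=1))  # 0.036 reject HO
print(round(p_value(z, tails=2), 3), decide(z, tails=2))  # 0.072 fail to reject HO
```

Same data, different answers, which is exactly why the number of tails has to be chosen before you look at the results. Note that the function never returns “accept HO”. The best HO can do is not get rejected.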
Test: You want to prove that punching people in the face changes their friendliness. You have a test that measures friendliness on a scale from 1 to 100, 1 being the least friendly and 100 being the most friendly. You take 10 people and sort them into two groups of five. One group, you just give them the test. The other group, you punch each person in the face before giving them the test. Which would be your null and alternative hypotheses?
- A. HO: The average friendliness of the punched group is different from the average friendliness of the not-punched group. HA: We don’t know whether or not one group is different from the other.
- B. HO: The average friendliness of the punched group is different from the average friendliness of the not-punched group. HA: The average friendliness of the punched group is equal to the average friendliness of the not-punched group.
- C. HO: The average friendliness of the punched group is less than the average friendliness of the not-punched group. HA: The average friendliness of the punched group is more than the average friendliness of the not-punched group.
- D. HO: The average friendliness of the punched group is equal to the average friendliness of the not-punched group. HA: The average friendliness of the punched group is different from the average friendliness of the not-punched group.
Test: In the above study, what would be a type I error?
- A. Declaring that being punched changes friendliness, when it really doesn’t.
- B. Declaring that being punched doesn’t change friendliness, when it really does.
- C. Declaring that you don’t have enough information to decide whether being punched changes friendliness.
- D. Declaring that being punched changes friendliness, but really it only makes friendliness go down.
Can’t get your mind around Statistical Tests? Maybe you should review Sampling.
The most commonly used Statistical Test is the t-test. The purpose of a t-test is to figure out if the difference between two means is caused by random chance or by an actual difference between the two populations. Or, more precisely, the t-test determines if the probability that the differences we found are due to simple variation is more or less than our alpha.
Let’s say we have two means. One is “average amount of improvement among heart patients that sat in a room by themselves,” another is “average amount of improvement among heart patients who sat in a room with leeches on them.” The first is 7, the second is 15. Are we 95% sure that this wasn’t just random variation between the two groups of people that had nothing to do with leeches? A t-test will tell us.
Now, I can hear you asking “Wow, how does the t-test do that?” The t-test can do that because it knows a special magic trick that I’m going to reveal to you. Zando the magician is at a party with a bunch of kids and adults. Now, let’s say that the ages of the people at the party are not a normal curve. Let’s say it looks something like this: [graph: frequency distribution of ages at the party] Now, what Zando does is put everyone’s name in a hat. He picks out two names, averages the ages of the two people, then puts the names back in the hat. He graphs that average on a frequency distribution. Then he does it again and again about 100 times. Now, no matter what the frequency graph of the original population looked like, what he’s going to end up with will look a lot more like a beautiful normal curve, and the more names he averages on each draw, the closer it gets. He turned a non-normal curve into a normal curve. How does he do it? There’s no trick, it’s actual magic! (Statisticians call it the central limit theorem.)
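You can try Zando’s trick yourself. The Python sketch below is only an illustration: the guest list (30 kids, 6 adults) is invented, and `zando` and `text_histogram` are hypothetical helper names. It pulls names out of the hat, averages the ages, puts the names back, and repeats.

```python
import random
from statistics import mean, pstdev

# Invented party: lots of little kids and a handful of adults (not normal!).
ages = [6] * 10 + [7] * 12 + [8] * 8 + [30, 32, 35, 38, 40, 45]

def zando(ages, names_per_draw=2, draws=100, seed=0):
    """Average the ages on a few drawn names, put them back, repeat."""
    rng = random.Random(seed)
    return [mean(rng.sample(ages, names_per_draw)) for _ in range(draws)]

def text_histogram(values, bin_width=5):
    """Crude frequency graph: one # per average that falls in each bin."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    for start in sorted(counts):
        print(f"{start:>3} to {start + bin_width - 1:<3} {'#' * counts[start]}")

text_histogram(zando(ages, names_per_draw=2))   # still fairly lumpy
print()
text_histogram(zando(ages, names_per_draw=10))  # much closer to a bell shape
print(round(pstdev(ages), 1), round(pstdev(zando(ages, 10, 2000)), 1))
```

The last line compares the spread of the raw ages with the spread of the averages. The averages huddle much closer to the middle, and that shrinkage is what lets a test judge how unusual a sample mean is.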
Say Zando wanted to prove that the people drinking Kool-Aid were younger than the people drinking tequila at the party. He takes a random sample of two Kool-Aid drinkers and two tequila drinkers. He finds that the mean age for the Kool-Aid drinkers sample was less than the mean age for the tequila drinkers sample, 7 versus 30. It seems like he might be right, but what if the results he got are due to random variation? Assuming HO, that there is no difference between Kool-Aid and tequila drinkers, what is the chance of picking a sample with a mean of 7 and a sample with a mean of 30? All Zando has to do is look on his magical normal curve and see where 7 and 30 are. Using a normal curve table, he can see what proportion of cases lie below 7 and above 30. And, if you remember probabilities, the probability of something happening is the proportion that it happens in an infinite number of tries, which is exactly what our normal curve is telling us. So, let’s say that our normal curve tells us that the chance of getting a 7 or less and then a 30 or more (using the multiplication rule of probability) is .0371, less than our pre-set alpha of .05. Now, we can be sure at a .05 level of significance that there is a difference in age between tequila and Kool-Aid drinkers.
This is pretty much what a t-test does: you feed it your samples, it assumes a normal curve for possible samples if the HO is true, then it goes on to find out how unlikely your results are. If they’re really unlikely, you can reject HO and accept HA.
There are a few types of t-tests which are pretty much the same. The first is a one-sample t-test. We use this when we want to compare the mean of our sample with a mean for the population that we already know is true. For instance, what if we only wanted to prove that tequila drinkers are older than the mean age of the population of the party, and we’ve already figured out what the mean age for the population is: it’s 10. Our HO would be that $\mu \le 10$ and our HA that $\mu > 10$. To run this test, we need to know the mean of X (our sample scores for tequila drinkers), the population mean we’re comparing it to (10), and the estimated standard error of the mean (which is the sample’s standard deviation divided by the square root of the n of the sample). Given all this, we can find out the t, which is sort of like the z score of our sample mean in the normal curve of possible sample means. Once we have a t-score, we can go to a t-score table and find out what t-score we need at our sample size and alpha to be able to reject HO. If the t-score we calculated is larger than the t-score in the table, we can reject HO and accept HA. T-scores are listed on a t-score table by degrees of freedom. For single mean t-tests, the degrees of freedom (df) is equal to n-1.
Our equations for this test are:
$$s_{\bar{X}} = \frac{s}{\sqrt{n}}$$
$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$$
$\mu$ equals the population mean and the mean our sample should be, as predicted by HO.
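As a rough sketch of the one-sample recipe in Python: the five ages below are invented, `one_sample_t` is a hypothetical helper name, and the cutoff of 2.132 is the value a standard t-score table lists for a one tailed test at an alpha of .05 with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """Return (t, df) for comparing a sample mean with a known population mean."""
    n = len(sample)
    std_error = stdev(sample) / sqrt(n)  # estimated standard error of the mean
    t = (mean(sample) - mu) / std_error  # how many standard errors from mu
    return t, n - 1

tequila_ages = [25, 31, 28, 40, 36]      # made-up sample, mean of 32
t, df = one_sample_t(tequila_ages, mu=10)
T_NEEDED = 2.132                         # table value: one tailed, alpha .05, df 4

print(round(t, 2), df)                   # 8.14 4
print("reject HO" if t > T_NEEDED else "fail to reject HO")  # reject HO
```

A sample mean of 32 sits more than 8 standard errors above the party’s mean of 10, far beyond the 2.132 the table asks for.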
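The next type of t-test we can do is a t-test for unrelated means. This is the same as our test to find out whether the tequila drinkers at the birthday party had a higher age than the Kool-Aid drinkers (or, merely a different age if we want to use a two-tailed test). In this test we compare the means of two samples to see if they’re different enough to reject HO. To do this test, we need to know the mean of both samples ($\bar{X}_1$ and $\bar{X}_2$), the standard error of the difference between the means ($s_{\bar{X}_1 - \bar{X}_2}$), the sum of scores for each sample ($\sum X_1$ and $\sum X_2$), the sum of the squares of each sample ($\sum X_1^2$ and $\sum X_2^2$), and the pooled sum of squares (SSp). The equations we need will look like this:
$$SS_1 = \sum X_1^2 - \frac{(\sum X_1)^2}{n_1} \qquad SS_2 = \sum X_2^2 - \frac{(\sum X_2)^2}{n_2}$$
$$SS_p = SS_1 + SS_2$$
$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{SS_p}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{s_{\bar{X}_1 - \bar{X}_2}}$$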
The zero you see above represents the difference between means predicted by the HO (usually none). The degrees of freedom for a t-test of unrelated means is the sum of the n’s of both samples minus 2.
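Here is a minimal Python sketch of a t-test for unrelated means, assuming the usual pooled sum of squares form of the calculation. The improvement scores are invented so that the group means come out to 15 and 7 like the leech example, and `unrelated_t` is a hypothetical helper name. The cutoff of 2.306 is the standard table value for a two tailed test at an alpha of .05 with 8 degrees of freedom.

```python
from math import sqrt

def sum_of_squares(scores):
    """SS = sum of squared scores minus (sum of scores) squared over n."""
    return sum(x * x for x in scores) - sum(scores) ** 2 / len(scores)

def unrelated_t(sample1, sample2, predicted_diff=0):
    """Return (t, df) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    ss_pooled = sum_of_squares(sample1) + sum_of_squares(sample2)
    df = n1 + n2 - 2
    # Standard error of the difference between the two means.
    se_diff = sqrt(ss_pooled / df * (1 / n1 + 1 / n2))
    mean_diff = sum(sample1) / n1 - sum(sample2) / n2
    return (mean_diff - predicted_diff) / se_diff, df

leeches = [15, 18, 12, 16, 14]  # made-up improvement scores, mean 15
control = [7, 9, 5, 8, 6]       # made-up improvement scores, mean 7
t, df = unrelated_t(leeches, control)

print(round(t, 2), df)          # 6.53 8
print("reject HO" if abs(t) > 2.306 else "fail to reject HO")  # reject HO
```

The comparison uses `abs(t)` because this is a two tailed test: a big difference in either direction counts against HO.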
The third type of t-test we’ll consider right now is a t-test for related means. This is where we reduce the variance by matching scores together into pairs. This can be done in two ways. The first is to use one subject and measure them twice under two conditions. If we want to find out if drinking jello shots changes intelligence, we can give subjects an intelligence test and record their score, give them a few jello shots, then test them again. The before and after scores for each subject would be a pair. Or, what if we want to find out if a person’s intelligence is different while they’re being hit on the head with a mallet? We find subjects who score the same on intelligence tests and match them up in pairs, then give one an intelligence test with no mallets, and give the other an intelligence test while we hit them on the head with a mallet. The non-mallet score and mallet score would be a pair of scores. To do a t-test for related (dependent) means, we need to know the n (this time it’s the number of pairs of scores), the difference between each pair of scores (D), the square of those differences ($D^2$), the sum of D ($\sum D$), the sum of the squares of D ($\sum D^2$) and the estimate of the standard error of D ($s_{\bar{D}}$). These are the equations we’ll use:
$$SS_D = \sum D^2 - \frac{(\sum D)^2}{n}$$
$$s_{\bar{D}} = \sqrt{\frac{SS_D}{n(n - 1)}}$$
$$t = \frac{\bar{D} - 0}{s_{\bar{D}}}$$
Once again, the 0 is the difference between means predicted by the HO. The degrees of freedom is the number of pairs minus 1.
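A related means test can be sketched the same way in Python. The before and after intelligence scores below are invented, `related_t` is a hypothetical helper name, and the calculation assumes the usual sum of squares of the differences. The cutoff of 2.776 is the standard table value for a two tailed test at an alpha of .05 with 4 degrees of freedom.

```python
from math import sqrt

def related_t(first_scores, second_scores, predicted_diff=0):
    """Return (t, df) for paired scores; n is the number of pairs."""
    diffs = [b - a for a, b in zip(first_scores, second_scores)]  # each D
    n = len(diffs)
    ss_d = sum(d * d for d in diffs) - sum(diffs) ** 2 / n  # SS of the D's
    se_d = sqrt(ss_d / (n - 1)) / sqrt(n)  # estimated standard error of mean D
    mean_d = sum(diffs) / n
    return (mean_d - predicted_diff) / se_d, n - 1

before_shots = [100, 110, 95, 120, 105]  # made-up intelligence scores
after_shots = [96, 104, 93, 113, 100]    # same five subjects, after jello shots
t, df = related_t(before_shots, after_shots)

print(round(t, 2), df)                   # -5.58 4
print("reject HO" if abs(t) > 2.776 else "fail to reject HO")  # reject HO
```

The t comes out negative only because scores went down after the shots. Every subject dropped by a similar amount, so the differences barely vary, and that is how pairing reduces the variance the test has to fight through.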
Test: You’ve developed a drug which lets people see beyond the curtain of this reality and see the inhuman monsters that stalk us and eat our souls when we die. You’ve taken 100 people, given half of them the drug, and asked them all to report the number of monsters they see in a week. The group who didn’t take the drug saw, on average, .6 monsters each. The group who did take the drug saw, on average, 2.1 monsters each. Your HO is that the drug does not change monster sightings. Your HA is that the drug does change monster sightings. What are you hoping for?
- A. That the sample size is small enough, the variation between the two groups and within each group is large enough that our T score will be higher than .95 so we can reject both the null and alternate hypotheses.
- B. That the sample size is large enough, the variance between the two groups is small enough, the variance within the groups is small enough, that our T score will be less than that required by an alpha of .05 and we can reject the alternate hypothesis.
- C. That the variance within the two groups is low enough, the difference between the two groups is high enough and the sample size is large enough to give us a T score that meets an alpha of .05 and we can reject the null hypothesis.
- D. To burn all my research notes, destroy all samples of the drug and drink until I forget about this damned experiment.
Having trouble understanding t-scores? You might try reviewing Basic Statistical Tests and Normal Curves. If you’re having trouble actually doing the t-test, you might want to review Measures of Variability and Measures of Central Tendency.