Introduction to Statistics

TERMS / NOTATION

Frequency Distribution Table: AIDS cases by exposure category and sex reported July 1993 through June 1994 in the United States

EXPOSURE CATEGORY                                       MALE f    RELATIVE PROPORTION   CUM. PROPORTION   FEMALE f   RELATIVE PROPORTION   CUM. PROPORTION
GAY OR LESBIAN SEXUAL RELATIONS                         42,146    .602                  .602              0          .000                  .000
INJECTING DRUG USE                                      17,441    .249                  .852              6,138      .429                  .429
HEMOPHILIA/COAGULATION DISORDER                         586       .008                  .860              17         .001                  .430
HETEROSEXUAL CONTACT                                    2,838     .041                  .901              5,457      .381                  .811
RECEIPT OF BLOOD TRANSFUSION, BLOOD COMP., OR TISSUE    498       .007                  .908              375        .026                  .837
RISK NOT REPORTED OR IDENTIFIED                         6,438     .092                  1.000             2,322      .162                  .999
TOTAL                                                   69,955                                            14,309

SOURCE: HIV/AIDS Surveillance Report, Vol. 6:1.

Frequency distribution table: a summary of univariate samples

frequency (f): Number of cases that fall into a certain delineated category.

variable: something which varies (e.g. belief in god, age, gender, etc.)

categories: the subsets a variable is divided into (e.g. the categories of gender are male and female).

total (N): Total number of cases in sample.

relative proportion: the proportion of cases in a variable category relative to the total number in the sample. Relative proportion equals f/N.

cumulative proportion: the sum of the relative proportion of current variable category and all preceding categories.
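The two definitions above can be checked with a short Python sketch. It uses the FEMALE frequencies from the AIDS table; the variable names are illustrative only and are not part of the lesson.

```python
from itertools import accumulate

# Female frequencies (f) from the AIDS exposure table, in table order.
female_f = [0, 6138, 17, 5457, 375, 2322]

n = sum(female_f)  # total (N)

# Relative proportion: f / N for each category.
relative = [f / n for f in female_f]

# Cumulative proportion: running total of f up to and including
# the current category, divided by N.
cumulative = [running / n for running in accumulate(female_f)]

for f, rel, cum in zip(female_f, relative, cumulative):
    print(f"{f:>6} {rel:.3f} {cum:.3f}")
```

Because this sketch accumulates the unrounded counts, its later cumulative values can differ in the third decimal place from a table built by adding up proportions that were already rounded.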

Crosstabulation Table: SEX BY BELIEF IN GOD

SEX               DON'T BELIEVE   BELIEVE IN HIGHER POWER   DO BELIEVE   ROW SUBTOTAL
MALE              79              91                        463          633
FEMALE            28              83                        753          864
COLUMN SUBTOTAL   107             174                       1216         1497

SOURCE: GSS91 SURVEY SUBSAMPLE

Crosstabulation table (also called a contingency table): A summary of the relationship between two or more variables.

data: detailed information of any kind.

cell: Indicated by the shaded section (the MALE / DON'T BELIEVE cell). Each cell contains the number of cases described by both the row category to its left and the column category above it. In our shaded example, 79 cases are both male and do not believe in a god.

subtotals (n): Total number of cases in a particular row or column.

SEX                       DON'T BELIEVE   BELIEVE IN HIGHER POWER   DO BELIEVE   ROW SUBTOTAL
MALE              count   79              91                        463          633
                  ROW%    12.5            14.4                      73.1
                  COL%    73.8            52.3                      38           42.3
FEMALE            count   28              83                        753          864
                  ROW%    3.2             9.6                       87.2
                  COL%    26.1            47.7                      61.9         57.7
COLUMN SUBTOTAL   count   107             174                       1216         1497
                  ROW%    7.2             11.6                      81.2

row percentages: Frequency divided by row total. This shows the proportion of the cases in the row category that are the column category. In the example above 12.5% of the males and 3.2% of the females do not believe in a god.

column percentages: Frequency divided by column total. This shows the proportion of the cases in the column category that are the row category. In the example above 73.8% of those that do not believe in a god are male.
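As a rough check on the two definitions above, the following Python sketch computes row and column percentages from the cell counts of the SEX BY BELIEF IN GOD table. The names `row_pct` and `col_pct` are hypothetical helpers introduced for this illustration.

```python
# Cell counts from the SEX BY BELIEF IN GOD table (GSS91 subsample).
columns = ["DON'T BELIEVE", "BELIEVE IN HIGHER POWER", "DO BELIEVE"]
counts = {
    "MALE": [79, 91, 463],
    "FEMALE": [28, 83, 753],
}

# Subtotals (n) for each row and each column.
row_totals = {sex: sum(cells) for sex, cells in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]


def row_pct(sex, j):
    """Percentage of the row category that falls in column j."""
    return 100 * counts[sex][j] / row_totals[sex]


def col_pct(sex, j):
    """Percentage of column j that falls in the row category."""
    return 100 * counts[sex][j] / col_totals[j]


for sex in counts:
    for j, name in enumerate(columns):
        print(f"{sex:<6} {name:<24} "
              f"row% {row_pct(sex, j):5.1f}  col% {col_pct(sex, j):5.1f}")
```

The first printed line reproduces the figures quoted in the definitions: 12.5% of the males do not believe, and 73.8% of those who do not believe are male.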

raw data: "Raw" means nothing has been done to the data yet, as in the case listing that follows:

PROCESSING RAW DATA

Person Age Sex Marital Status

Joe 21 M M (married)

Ann 13 F S (single)

Sue 72 F M

Bill 54 M D (divorced)

Sam 18 M M

Kay 12 F S

1. Raw data comes from questions on questionnaires.

a) Open-ended questions -- allow the respondent to write an answer to the question.

b) Close-ended questions -- give the respondent a set of choices to indicate an answer.

Example: sex (circle one) M F.

2. Processing raw data

a) Computers are extremely helpful when processing a great deal of data.

b) Processing by hand -- individual must make frequency distributions and/or tables.

Constructing a frequency distribution table (using raw data above)

Sex f % cum %

M 3 50.0 50.0

F 3 50.0 100.0

6 = N

Identify the categories of the variable.
Count how many cases fall into each category.
To compute the relative frequency, divide each category's frequency by N (multiply by 100 to express it as a percentage).
To compute the cumulative frequency, add the category's relative frequency to the cumulative frequency of the preceding category. If no preceding category exists, the cumulative frequency equals the relative frequency.
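The steps above can be sketched in Python for the Sex variable of the raw data listing. This is only one possible way to do the tally; `collections.Counter` is a standard-library class, and the variable names are assumptions made for the example.

```python
from collections import Counter

# Sex of each case in the raw data listing (Joe, Ann, Sue, Bill, Sam, Kay).
sexes = ["M", "F", "F", "M", "M", "F"]

freq = Counter(sexes)      # identify the categories and count the cases
n = sum(freq.values())     # N, the total number of cases

rows = []
cum_pct = 0.0
for category in ["M", "F"]:              # same category order as the table
    pct = 100 * freq[category] / n       # relative frequency as a percentage
    cum_pct += pct                       # add to the preceding cumulative value
    rows.append((category, freq[category], pct, cum_pct))

for category, f, pct, cum in rows:
    print(f"{category} {f} {pct:.1f} {cum:.1f}")
```

The printed rows match the hand-built table: each sex has f = 3 and 50.0%, and the cumulative percentage reaches 100.0 at the last category.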

Constructing a crosstabulation table

Examine the data; each variable will have a number of categories. Count the categories and construct a table using a grid large enough for all the categories. If the data is nominal, the placement of a variable on the top (columns) or side (rows) is arbitrary (although it makes sense to put the variable with the larger number of categories across the top). For the other levels of measurement, usually the independent variable, or the variable with the higher level of measurement if there is no prediction, goes on top.

Begin with Joe -- he is male and married.
Make a mark in the appropriate cell.
Continue with the remaining data, then indicate the total in each cell.
Compute the totals for columns and rows and check that they equal N (total in sample).

Table after tallying Joe (male, married):

Marital Status

Gender M S D

M |

F

Completed table:

Marital Status

Gender M S D row total

M 2 0 1 3

F 1 2 0 3

column total 3 2 1 6 = N
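The same tallying procedure can be sketched in Python using the six cases from the raw data listing. The list and variable names here are illustrative assumptions, not part of the lesson.

```python
from collections import Counter

# (sex, marital status) for Joe, Ann, Sue, Bill, Sam, Kay from the raw data.
cases = [("M", "M"), ("F", "S"), ("F", "M"), ("M", "D"), ("M", "M"), ("F", "S")]

sex_categories = ["M", "F"]            # rows
marital_categories = ["M", "S", "D"]   # columns

# One "mark" per case in the cell for its (sex, marital status) pair.
cells = Counter(cases)

# Lay the cell counts out as a grid; cells with no marks count as 0.
table = [[cells[(s, m)] for m in marital_categories] for s in sex_categories]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Check that both sets of totals equal N.
n = len(cases)
assert sum(row_totals) == n and sum(col_totals) == n

for sex, row, total in zip(sex_categories, table, row_totals):
    print(sex, *row, total)
print("column total", *col_totals, n)
```

The final check mirrors the last step of the procedure: the row totals and the column totals must each add up to N.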

LEVEL OF MEASUREMENT (Types of data)

Nominal

categories

Ordinal

categories
order among categories

Interval/ratio

categories
order among categories
equal intervals between points on measuring scale
a true zero point*

*The requirement of a true zero point is the difference between interval and ratio. This distinction is not necessary in basic statistics.

(nominal)               (ordinal)        (interval)
"Region of residence"   "Social Class"   "Height"
N                       upper upper      6'2"
S                       lower upper      6'1"
N                       upper middle     6'1"
E                       lower middle     6'0"
W                       upper lower      5'11"

*For both ordinal and interval data, the categories must have order, but do not necessarily have to be in ordered form.
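A small Python sketch may help show what each level of measurement allows. It uses the example values from the table above; the ranking of the social class labels (highest to lowest) is an assumption made for illustration, as are all the variable names.

```python
from collections import Counter

# Nominal: categories only, so counting cases per category is meaningful.
regions = ["N", "S", "N", "E", "W"]
region_counts = Counter(regions)

# Ordinal: categories have an order, but the order must be supplied.
# ASSUMPTION: this ranking runs from highest to lowest class.
class_order = ["upper upper", "lower upper", "upper middle",
               "lower middle", "upper lower"]
rank = {name: position for position, name in enumerate(class_order)}

# The data need not arrive in order; the defined order lets us sort it.
observed = ["upper middle", "upper lower", "upper upper",
            "lower middle", "lower upper"]
sorted_classes = sorted(observed, key=lambda name: rank[name])


# Interval/ratio: equal intervals, so differences between values are meaningful.
def to_inches(feet, inches):
    return 12 * feet + inches


heights = [to_inches(6, 2), to_inches(6, 1), to_inches(6, 1),
           to_inches(6, 0), to_inches(5, 11)]
spread = max(heights) - min(heights)   # difference in inches

print(region_counts["N"], sorted_classes[0], spread)
```

Note that subtracting two social class ranks would not be meaningful, because nothing guarantees that the steps between ordinal categories are equal.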

PURPOSE OF STATISTICS

A. Description ("summarizing")

1. univariate distribution ("one variable")

central tendency ("average," "typical") Example: average age at SDSU = 21.4yrs
dispersion ("spread," "variation," "range") Example: ages of SDSU students: 14 - 89.3 years

2. multivariate distribution (several variables) ("relationship," "association," "correlation")

Example: (bivariate = 2 variables) relationship between education and income.

B. Inference ("generalize from sample to population")

1. univariate ("setting confidence intervals")

2. one bivariate sample (2 variables) ("testing the significance of association")

3. two or more univariate samples ("testing the significance of differences")
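To make the descriptive measures for a univariate distribution concrete, here is a minimal Python sketch using the six ages from the raw data listing earlier in the lesson (not the SDSU figures, which are not given as raw data). It relies on the standard-library `statistics` module; the variable names are illustrative.

```python
import statistics

# Ages of Joe, Ann, Sue, Bill, Sam, Kay from the raw data listing.
ages = [21, 13, 72, 54, 18, 12]

# Central tendency: the "average" or "typical" value.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Dispersion: the spread of the values, here as a simple range.
youngest, oldest = min(ages), max(ages)
age_range = oldest - youngest

print(f"mean {mean_age:.1f}, median {median_age}, "
      f"range {youngest} - {oldest} ({age_range} years)")
```

The mean and the median differ noticeably here because two of the six ages are much higher than the rest, which is one reason a measure of dispersion is reported alongside an average.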

USING THE CHART

Example: "Ever had sex with someone other than the person you were married to?"

Sex Yes No

M 106 369 475

F 89 613 702

195 982 1,177

GSS91 survey subsample

Look at the data and see what level of measurement it is.
Look at the question asked, see the chart for type of statistic to use.
Different statistics have different purposes and assumptions. It is important that the correct statistics for the type of data are chosen.
For these data (two nominal variables with two categories each), the statistic Phi would be ideal for measuring the relationship.
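As an illustration, Phi for a 2x2 table with cells a, b (first row) and c, d (second row) can be computed as (ad - bc) divided by the square root of the product of the four row and column subtotals. The sketch below applies that formula to the example table; the cell labels a, b, c, d are a naming assumption, not notation from the lesson.

```python
import math

# 2x2 table from the example: rows = sex (M, F), columns = answer (Yes, No).
a, b = 106, 369   # male: yes, no
c, d = 89, 613    # female: yes, no

# Phi = (ad - bc) / sqrt(product of the two row and two column subtotals).
numerator = a * d - b * c
denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
phi = numerator / denominator

print(f"phi = {phi:.3f}")
```

For this table Phi comes out at roughly 0.13. Phi ranges from -1 to 1, with 0 meaning no association, so this indicates a weak relationship between sex and the answer given.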