# Data Treatment for Biologists

Part A : Workshop

The aim of this workshop is to give you experience applying standard statistical treatments to data using statistical software. The package we will use is Minitab. It is the standard statistical software for the Mathematics and Statistics courses at RMIT, and Chemistry also has a site licence for it. Most of the analyses in this workshop can be done in Excel, but Excel is not very user-friendly for statistical analyses. In Minitab the analyses can be run from simple pull-down menus, and there is a good on-line help facility. You can copy and paste data from Excel into Minitab.

For the assessment, enter your results into the attached pro forma. You may submit the assignment in hard copy or by email as an attachment.

Case Study 1

| 18% protein diet | 5% protein diet |
|---|---|
| 13.3 | 5.1 |
| 16.3 | 8.7 |
| 9.9 | 8.7 |
| 9.3 | 8.5 |
| 16.1 | 8.1 |
| 9.7 | 6.9 |
| 9.7 | 6.9 |
| 14.1 | 12.3 |

It is believed that nutritional deprivation affects various components of the immune system, such as tuberculin skin reactivity. In this study a sample of 8 male rats was fed a normal diet of 18% protein. Another sample of rats was fed a diet of only 5% protein. After 4 weeks, the rats were given an intradermal injection of 25 µg of purified protein derivative of tuberculin. The above table gives the skin reactivity, measured as the diameter of erythema and induration (in mm), for the 2 groups.

(a) Determine the mean, variance, standard deviation and 95% confidence interval for each data set

| Basic Statistics | 18% Diet | 5% Diet |
|---|---|---|
| Mean | | |
| Standard Deviation | | |
| Variance | | |
| Confidence Interval | | |

(b) verify the assumption that the two populations are (i) normally distributed and (ii) have equal variance

Are both data sets normally distributed? ……………….. Reason?…………………..

Equal variance?……………………………….. Reason?…………………………..
(c) display the data using a box-and-whisker plot

append a copy of the box-and-whisker plot

(d) use a t-test to determine if there is a significant difference between the tuberculin reactivity of normal and malnourished rats

| Test | Null Hypothesis | p | Significant? |
|---|---|---|---|
| Reactivity difference between normal and malnourished rats | | | |

(e) in Excel, create a bar graph for each group to compare the means. Include ‘error bars’ showing the confidence intervals

append a copy of the Excel graph

Case Study 2

| | Germinated | Did not germinate | Total |
|---|---|---|---|
| Old Strain | 125 | 15 | 140 |
| New Strain | 152 | 8 | 160 |
| Total | 277 | 23 | 300 |

The above table compares the germination rate of a new strain of a plant with that of an old strain of the same plant.
Test whether there is a significant difference between the germination rates of the two strains (at the 95% level).
Null Hypothesis(Ho) ……………………………………

Alternative Hypothesis(H1) ………………………………………..
p ……………… significant? ……………………………………

Case Study 3

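Fertilizer Blend (columns U-Z)

| Farm | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| 1 | 1130 | 1125 | 1350 | 1375 | 1225 | 1235 |
| 2 | 1115 | 1120 | 1375 | 1200 | 1250 | 1200 |
| 3 | 1145 | 1170 | 1235 | 1175 | 1225 | 1155 |
| 4 | 1200 | 1230 | 1140 | 1325 | 1275 | 1215 |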

A trial of 6 different blends of fertilizer (U-Z) has been carried out on a linseed crop on 4 different farms (1-4). The crop yields of linseed are given in the table. Carry out a 2-way ANOVA:
(a) is there a significant difference between farms?
(b) is there a significant difference between fertilizers?

| | F | p | Significant? |
|---|---|---|---|
| Fertilizer Blend | | | |
| Farm | | | |

Case Study 4

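| SBP (y) | DBP (x) | SBP (y) | DBP (x) |
|---|---|---|---|
| 112 | 63 | 156 | 100 |
| 120 | 69 | 124 | 82 |
| 135 | 70 | 99 | 56 |
| 142 | 82 | 105 | 65 |
| 132 | 76 | 124 | 73 |
| 115 | 67 | 144 | 89 |
| 119 | 71 | 134 | 76 |
| 128 | 73 | | |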

Systolic arterial blood pressure (SBP) and diastolic arterial blood pressure (DBP) are tabulated above for 15 men aged 40-65
(a) carry out linear regression on this data
(b) give the 95% confidence intervals of the slope and intercept
(c) test whether there is a significant relationship between SBP and DBP for this group
(d) use the regression equation to estimate the expected SBP of a man aged 40-65 whose DBP is 75
(e) Linear regression predicts y from x. To do ‘inverse regression’ we need to vary the approach. Use the ‘invcalib’ spreadsheet to predict the DBP of a man with SBP of 120, with the confidence interval

Predicted equation (model)

p for hypothesis (gradient = 0)

Significant? i.e. is the gradient non-zero?

Standard deviation of slope (sb)

t (from tables)

Confidence intervals for gradient (+/- tsb)

Standard deviation of intercept (sa)

Confidence intervals for intercept (+/- tsa)

Predicted SBP for DBP = 75

‘Inverse Calibration problem’

Predicted x for y = 120 Predicted CI (m=1) Predicted CI (m=5)

Notes for Part A:

Analysis: Basic Statistics (Question 1)

1. Open Minitab (select it from the Start Menu, Programs, under SAS)
2. When you open the program you will notice it is divided into two areas – the data area (lower screen) and the output area. Enter data from the above table in columns C1 and C2.
Warning: make sure you start entering data in row 1 NOT in the cell immediately below the column heading (C1 etc). This cell is reserved for column labels (you may put a label here like ‘18% diet’). Also make sure you don’t enter a column label in row 1. The whole column will then be formatted as text (C1-T) and cannot be used for analysis. If this happens delete the whole column and start again (clicking on ‘C1’ will highlight the whole column).
3. To get descriptive statistics click on Stat => Basic Statistics => Display Descriptive Statistics to get the basic statistics dialog box. Highlight C1 and C2 on the left and then click ‘Select’. Alternatively you can click in the Variable box and type C1 C2 . Click ‘Statistics’ then check ‘variance’. Then click OK and the output will appear in the output window. From the output data enter the values in the pro forma.

Confidence Intervals

1. The confidence intervals for the mean can be obtained as follows: Stat => Basic Statistics => 1-sample t. Select columns 1 and 2.
2. The confidence interval is of the form (low value, high value). To express the interval in the form ‘mean +/- deviation’, calculate the deviation as 0.5*(high – low).
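
The same quantities can be cross-checked outside Minitab. The following is a minimal Python sketch, not part of the workshop software: the variable and function names are illustrative, and the critical t value for 7 degrees of freedom is an assumption copied from standard t tables rather than computed.

```python
import math
import statistics

# Skin reactivity diameters (mm) from the Case Study 1 table.
diet_18 = [13.3, 16.3, 9.9, 9.3, 16.1, 9.7, 9.7, 14.1]
diet_5 = [5.1, 8.7, 8.7, 8.5, 8.1, 6.9, 6.9, 12.3]

# Two-sided 95% critical t value for n - 1 = 7 degrees of freedom,
# taken from standard t tables (an assumption of this sketch).
T_CRIT_7_DF = 2.365


def describe(sample, t_crit):
    """Return mean, sample variance, sample SD and 95% CI of the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # divides by n - 1
    sd = math.sqrt(variance)
    half_width = t_crit * sd / math.sqrt(n)
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "ci": (mean - half_width, mean + half_width),
    }


stats_18 = describe(diet_18, T_CRIT_7_DF)
stats_5 = describe(diet_5, T_CRIT_7_DF)

for label, result in (("18% diet", stats_18), ("5% diet", stats_5)):
    low, high = result["ci"]
    # 'mean +/- deviation' form: deviation = 0.5 * (high - low)
    print(label, result, "deviation:", 0.5 * (high - low))
```

The results should agree with the Minitab output to the number of decimal places Minitab displays; a disagreement usually points to a data entry error.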
Normally Distributed Data

1. A normality test can be carried out as follows:- Stat => Basic Statistics => Normality Test. Select the first column and accept the other defaults. Repeat the test for the second column.

Test for Equal Variance – F test

1. Use Stat => Basic Statistics => 2 Variances. Check ‘samples in different columns’. Select column C1 as the first and C2 as the second (note that for an F test the variable with the larger variance must be the first one selected). Use the defaults, but under Graphs select boxplot (box-and-whisker plot).
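
As a rough cross-check of the F test, the variance ratio can be computed by hand. This Python sketch is illustrative only: the names are hypothetical, and the critical value (upper 2.5% point of F with 7 and 7 degrees of freedom, for a two-tailed test at the 5% level) is an assumption read from standard F tables.

```python
import statistics

# Skin reactivity diameters (mm) from the Case Study 1 table.
diet_18 = [13.3, 16.3, 9.9, 9.3, 16.1, 9.7, 9.7, 14.1]
diet_5 = [5.1, 8.7, 8.7, 8.5, 8.1, 6.9, 6.9, 12.3]

var_18 = statistics.variance(diet_18)
var_5 = statistics.variance(diet_5)

# The larger variance goes in the numerator so that F >= 1.
f_ratio = max(var_18, var_5) / min(var_18, var_5)

# Upper 2.5% point of F(7, 7), from standard F tables (assumed value).
F_CRIT_7_7 = 4.99

# If F is below the critical value, equal variances cannot be rejected.
equal_variance_plausible = f_ratio < F_CRIT_7_7
print("F =", round(f_ratio, 3), "equal variance plausible:", equal_variance_plausible)
```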

Hypothesis Testing

We now want to test whether the samples for the 2 diets differ significantly from each other. We need to formulate the null hypothesis (Ho). In all statistical testing, the probability (p) is then calculated of obtaining a difference at least as large as the one observed if the null hypothesis were true. If this probability is low (usually p < 0.05), the null hypothesis is rejected.

Comparison of Means

For question (d) we apply a t test to compare two means: Stat => Basic Statistics => 2-sample t. Click on ‘Samples in different columns’. Click the ‘First’ box and then double-click on C1 in the variables column, and similarly for C2 as ‘Second’. Check ‘assume equal variances’. The 95% confidence interval given in the output is for the difference between the two means. The p value for the null hypothesis that this difference is zero is given at the end of the output.
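
The pooled (equal-variance) t statistic that this menu option reports can be reproduced by hand. The Python sketch below is a minimal illustration with hypothetical names; the critical t value for 14 degrees of freedom is an assumption taken from standard t tables, and the sketch compares against that value instead of computing an exact p value.

```python
import math
import statistics

# Skin reactivity diameters (mm) from the Case Study 1 table.
diet_18 = [13.3, 16.3, 9.9, 9.3, 16.1, 9.7, 9.7, 14.1]
diet_5 = [5.1, 8.7, 8.7, 8.5, 8.1, 6.9, 6.9, 12.3]

n1, n2 = len(diet_18), len(diet_5)
mean_diff = statistics.mean(diet_18) - statistics.mean(diet_5)

# Pooled variance: weighted average of the two sample variances.
pooled_var = (
    (n1 - 1) * statistics.variance(diet_18)
    + (n2 - 1) * statistics.variance(diet_5)
) / (n1 + n2 - 2)

std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = mean_diff / std_err
df = n1 + n2 - 2

# Two-sided 95% critical t value for 14 df, from t tables (assumed value).
T_CRIT_14_DF = 2.145

significant = abs(t_stat) > T_CRIT_14_DF
diff_ci = (mean_diff - T_CRIT_14_DF * std_err, mean_diff + T_CRIT_14_DF * std_err)
print("t =", round(t_stat, 3), "df =", df, "significant:", significant)
print("95% CI for the difference in means:", diff_ci)
```

If the confidence interval for the difference excludes zero, the same conclusion follows as from comparing t with the critical value.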

The Chi Squared Test (Case Study 2)

Enter the data into 2 columns in Minitab. Do not enter the totals – just the counts. You should have a 2×2 table.
Stat => Tables => Chi-Square Test (Two-Way Table in Worksheet)
Select the 2 columns of data, OK

How is the data analysed? The chi-squared statistic is calculated as:-

$$\chi^2 = \sum \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}}$$

but how do we calculate the expected cell count?

Expected cell count = (row total)*(column total)/grand total
The null hypothesis in this case is Ho: the proportions in each column category are the same for every row, e.g. if you were comparing voting preferences of men and women, the proportion voting liberal would be the same for each sex.
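
The expected counts and the chi-squared statistic for the Case Study 2 table can be worked through as follows. This is a minimal Python sketch with illustrative names; the critical value for 1 degree of freedom at the 5% level is an assumption taken from standard chi-squared tables, and no continuity correction is applied.

```python
# Observed counts from Case Study 2 (rows: old, new strain;
# columns: germinated, did not germinate). Totals are not entered.
observed = [
    [125, 15],
    [152, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected cell count = (row total) * (column total) / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Sum of (observed - expected)^2 / expected over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# 5% critical value of chi-squared for 1 df, from tables (assumed value).
CHI2_CRIT_1_DF = 3.841

significant = chi_squared > CHI2_CRIT_1_DF
print("expected:", expected)
print("chi-squared =", round(chi_squared, 3), "df =", df, "significant:", significant)
```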

2 way Analysis of Variance (Case Study 3)

In this study there are two variables or factors: farm and fertilizer blend. The data needs to be set out as follows:-

1. In one column enter all 24 crop yields (1130 …1215)
2. You also need two coding columns. Make one column the code for farm and give a code (1 – 4) for each farm.
3. Enter in a third column the code (1-6) for the fertilizer blend. Thus the first value (1130) will have 1130, 1, 1 in the three columns, while the last value (1215) will have 1215, 4, 6 (i.e. farm 4 and blend Z).
4. Carry out the two way ANOVA:- Stat=> ANOVA => 2-Way. In the response field enter the column for yields and enter the other two variables in the row and column boxes. Check the ‘display means’ boxes.
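
The three-column layout and the sums of squares behind the ANOVA table can be sketched in Python as below. This is an illustration under stated assumptions, not Minitab's implementation: the names are hypothetical, and with one observation per farm/blend cell the residual term is used as the error term, so a separate interaction effect is not estimated here.

```python
# Yields from Case Study 3: one list per farm, ordered by blend U..Z.
yields_by_farm = {
    1: [1130, 1125, 1350, 1375, 1225, 1235],
    2: [1115, 1120, 1375, 1200, 1250, 1200],
    3: [1145, 1170, 1235, 1175, 1225, 1155],
    4: [1200, 1230, 1140, 1325, 1275, 1215],
}

# Three-column layout: (yield, farm code 1-4, blend code 1-6).
rows = [
    (y, farm, blend_code)
    for farm, ys in yields_by_farm.items()
    for blend_code, y in enumerate(ys, start=1)
]

n_farms, n_blends = 4, 6
grand_mean = sum(y for y, _, _ in rows) / len(rows)


def level_means(index):
    """Mean yield for each level of the factor stored at rows[i][index]."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[0])
    return {level: sum(vals) / len(vals) for level, vals in groups.items()}


farm_means = level_means(1)
blend_means = level_means(2)

ss_total = sum((y - grand_mean) ** 2 for y, _, _ in rows)
ss_farm = n_blends * sum((m - grand_mean) ** 2 for m in farm_means.values())
ss_blend = n_farms * sum((m - grand_mean) ** 2 for m in blend_means.values())
ss_error = sum(
    (y - farm_means[f] - blend_means[b] + grand_mean) ** 2 for y, f, b in rows
)

df_farm, df_blend = n_farms - 1, n_blends - 1
df_error = df_farm * df_blend
ms_error = ss_error / df_error

f_farm = (ss_farm / df_farm) / ms_error
f_blend = (ss_blend / df_blend) / ms_error
print("F (farm) =", round(f_farm, 3), "with", df_farm, "and", df_error, "df")
print("F (blend) =", round(f_blend, 3), "with", df_blend, "and", df_error, "df")
```

The F values can be compared with the F column of the Minitab ANOVA table, or with critical values from F tables for the stated degrees of freedom.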

The output should be a typical ANOVA table (see the notes: Chemometrics Unit 1 for a full explanation of the ANOVA table). The key values are again the p values. Because there are two variables, there is now a null hypothesis for each variable (e.g. no significant difference between farms, i.e. mean [yield] for farm 1 = mean [yield] for farm 2 …). As with all our previous testing, we reject Ho if p is low (p < 0.05).

Minitab gives a diagram which can help in interpreting the results, showing each mean and confidence interval. Two results differ significantly if their CI’s don’t overlap. Note, however, that Minitab uses a pooled CI so they are all the same size. The diagram is thus just an indication but is still quite useful.

In 2 way ANOVA the possibility of variable interaction is also tested. An interaction means, for example, that blend differences depend on the farm. If we see blend differences with farm 1 but not farm 2 this would be an interaction effect. The diagram of means and CIs can be an indication of where differences occur.

Linear Regression (Case Study 4)

The analysis can be carried out as follows:-
Enter the data into 2 columns
Stat => Regression => Regression. Enter the Y column in the response box and the X column in the predictors box. Click on options and in the ‘prediction intervals for new responses’ enter ‘75’ (note if you have more than one X for prediction you can enter them in a new column and put the column in this box).

The output gives you the model (the regression equation) and the values of the intercept (constant) and gradient (predictor), with statistical information on these parameters. A full ANOVA table is also shown. For full interpretation of this output you should consult the ‘Calibration and Modelling of Data’ notes.

The t tests determine whether the gradient and the intercept are significantly non-zero.
(Again, check the p values)

The confidence intervals for the intercept and gradient can be determined as $\pm\, t_{n-2,\,0.05}\, s_a$ and $\pm\, t_{n-2,\,0.05}\, s_b$ respectively. sa and sb are the standard deviations of the intercept and gradient respectively (in Minitab, called the ‘standard error’ in the regression table). t is the critical t value for n-2 degrees of freedom (n = number of pairs of data) at the 0.05 significance level (two-tailed). This value can be obtained from t tables.

At the end of the output is the predicted Y when X = 75, along with the confidence (CI) and prediction (PI) intervals. The full meaning of these terms is explained in the regression notes.
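
For comparison with the Minitab regression table, the least-squares estimates can be computed directly. The Python sketch below uses hypothetical names and the usual textbook formulas for the standard deviations of the intercept (sa) and gradient (sb); the critical t value for 13 degrees of freedom is an assumption read from standard t tables.

```python
import math

# Case Study 4 data: DBP is x, SBP is y (15 men aged 40-65).
dbp = [63, 69, 70, 82, 76, 67, 71, 73, 100, 82, 56, 65, 73, 89, 76]
sbp = [112, 120, 135, 142, 132, 115, 119, 128, 156, 124, 99, 105, 124, 144, 134]

n = len(dbp)
x_bar = sum(dbp) / n
y_bar = sum(sbp) / n
s_xx = sum((x - x_bar) ** 2 for x in dbp)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dbp, sbp))

slope = s_xy / s_xx                  # gradient b
intercept = y_bar - slope * x_bar    # intercept a

# Residual standard deviation with n - 2 degrees of freedom.
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(dbp, sbp))
s_yx = math.sqrt(residual_ss / (n - 2))

s_b = s_yx / math.sqrt(s_xx)                                  # SD of gradient
s_a = s_yx * math.sqrt(sum(x * x for x in dbp) / (n * s_xx))  # SD of intercept

# Two-sided 95% critical t value for n - 2 = 13 df, from tables (assumed).
T_CRIT_13_DF = 2.160

slope_ci = (slope - T_CRIT_13_DF * s_b, slope + T_CRIT_13_DF * s_b)
intercept_ci = (intercept - T_CRIT_13_DF * s_a, intercept + T_CRIT_13_DF * s_a)

t_slope = slope / s_b                # t statistic for Ho: gradient = 0
predicted_sbp = intercept + slope * 75

print("SBP =", round(intercept, 2), "+", round(slope, 4), "* DBP")
print("gradient CI:", slope_ci, "intercept CI:", intercept_ci)
print("t for gradient:", round(t_slope, 2), "predicted SBP at DBP 75:", round(predicted_sbp, 1))
```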

‘The Inverse Calibration Problem’

This question is typical of the sort of problem frequently encountered in analytical calibrations. We cannot proceed as in (d) because we now wish to determine X from a known Y (the ‘inverse’ problem). Least squares analysis assumes the error is in the Y determinations. However the error in X determined from Y can be estimated from the standard deviation of the interpolated X0:

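$$s_{x_0} = \frac{s_{y/x}}{b}\left\{\frac{1}{m} + \frac{1}{n} + \frac{(y_0 - \bar{y})^2}{b^2 \sum_i (x_i - \bar{x})^2}\right\}^{0.5}$$

where $s_{y/x}$ is the residual standard deviation of the regression, $b$ is the gradient, $n$ is the number of calibration points, $m$ is the number of replicate measurements averaged to give $y_0$, and $\bar{x}$ and $\bar{y}$ are the means of the calibration X and Y values.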

A spreadsheet has been set up to carry out this calculation. It can be found on the Blackboard site for your Research Methods course.

Carry out the determination of the prediction and confidence intervals as follows:-

(i) Copy the (X,Y) calibration values for Q4 from Minitab. Paste them into the invcalib spreadsheet (inverse calibration sheet) in the X and Y columns at the left.
(ii) Enter ‘120’ in the yo cell (highlighted in green) and ‘1’ in the highlighted cell for ‘m’. This then gives the predicted value for x and the CI if 120 is a single measurement.
(iii) Change the ‘m’ value to 5 (i.e. 120 is the average of 5 measurements). See what effect it has on the CI.
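
The effect of m can also be explored with a short script. The Python sketch below assumes the standard textbook expression for the standard deviation of an interpolated x value; whether the invcalib spreadsheet uses exactly this expression is an assumption, and the function name and the tabulated t value for 13 degrees of freedom are illustrative.

```python
import math

# Case Study 4 data: DBP is x, SBP is y.
dbp = [63, 69, 70, 82, 76, 67, 71, 73, 100, 82, 56, 65, 73, 89, 76]
sbp = [112, 120, 135, 142, 132, 115, 119, 128, 156, 124, 99, 105, 124, 144, 134]

# Two-sided 95% critical t value for n - 2 = 13 df, from tables (assumed).
T_CRIT_13_DF = 2.160


def inverse_prediction(x, y, y0, m, t_crit):
    """Predict x0 from y0 (the mean of m readings) with an approximate CI."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(residual_ss / (n - 2))

    x0 = (y0 - intercept) / slope
    # Standard deviation of the interpolated x0 (assumed textbook form).
    s_x0 = (s_yx / slope) * math.sqrt(
        1 / m + 1 / n + (y0 - y_bar) ** 2 / (slope ** 2 * s_xx)
    )
    return x0, s_x0, (x0 - t_crit * s_x0, x0 + t_crit * s_x0)


x0_single, s_single, ci_single = inverse_prediction(dbp, sbp, 120, 1, T_CRIT_13_DF)
x0_mean5, s_mean5, ci_mean5 = inverse_prediction(dbp, sbp, 120, 5, T_CRIT_13_DF)

print("m = 1:", round(x0_single, 2), ci_single)
print("m = 5:", round(x0_mean5, 2), ci_mean5)
```

The predicted x is the same for both values of m; only the width of the interval changes, which is the effect step (iii) asks you to observe.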
