Introduction
The key purpose of this project is to demonstrate the student’s mastery of descriptive statistics. It involves the analysis of a chosen dataset using the technology tool STATKEY. The study carries out descriptive statistics and graphical analysis of one quantitative variable, one categorical variable, one quantitative variable paired with one categorical variable, two categorical variables, and two quantitative variables. The dataset chosen for this project is the “Nutrition Study” data file, which contains 315 cases and 17 variables. The variables are case ID, age, smoke, Quetelet, vitamin, calories, fat, fiber, alcohol, cholesterol, beta diet, retinol diet, beta plasma, retinol plasma, gender, vitamin use, and prior smoke. The table below describes each variable in the dataset.
Variable Name | Variable Description | Variable Type |
---|---|---|
ID | Numbers the cases; each case is assigned its own number | Quantitative |
Age | Age of each subject in the nutrition study; ages range from 19 to 83 | Quantitative |
Smoke | Whether the subject smokes; takes the values yes and no | Categorical (2 levels) |
Quetelet | The subject’s Quetelet index (body mass index) | Quantitative |
Vitamin | Vitamin intake of each subject in the study | Quantitative |
Calories | The subject’s calorie intake | Quantitative |
Fat | The subject’s dietary fat intake | Quantitative |
Fiber | The subject’s fiber intake | Quantitative |
Alcohol | The subject’s alcohol intake | Quantitative |
Cholesterol | The subject’s cholesterol intake | Quantitative |
Beta Diet | Dietary beta-carotene intake | Quantitative |
Retinol Diet | Dietary retinol intake | Quantitative |
Beta Plasma | Plasma beta-carotene level | Quantitative |
Retinol Plasma | Plasma retinol level | Quantitative |
Gender | Gender of the subject | Categorical (2 levels) |
Vitamin Use | Frequency of vitamin use | Categorical (3 levels) |
Prior Smoke | Smoking status prior to the period of study | Quantitative |
Analysis
Analysis of One Quantitative Variable
Analysis of Calories
The following table shows the summary of descriptive statistics calculated using STATKEY.
Summary Statistics
Statistic | Value |
---|---|
Sample Size | 315 |
Mean | 1796.655 |
Standard Deviation | 680.347 |
Minimum | 445.2 |
Q1 | 1338.000 |
Median | 1666.800 |
Q3 | 2100.450 |
Maximum | 6662.2 |
The results in the table show that the calories variable in the Nutrition Study dataset has a mean of 1796.655; that is, the average calorie intake across all subjects in the study is 1796.655. The standard deviation measures how far the data points typically deviate from the mean (Dini, 2016). The standard deviation of 680.347 means that, on average, the calorie values deviate from the mean by about 680.347 units. The lowest calorie intake is 445.2 and the highest is 6662.2, so the variable has a very wide range of 6217.0.
Dot Plot of Calories (Quantitative Variable)
Skewness
The data range from 445.2 to 6662.2, but most values are concentrated between 1000 and 3000. As seen in the dot plot, the calorie values cluster toward the left of the graph while a long tail of larger values stretches to the right; this is evidence that the calories variable is skewed to the right (positively skewed). The mean, 1796.655, is also greater than the median, 1666.800. The median (the middle value of the dataset) lies to the left of the mean because the few very large values pull the mean upward. The fact that mean > median is therefore an indicator of skewness to the right.
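The comparison of mean and median can be turned into a rough numerical check. The sketch below uses Pearson's median skewness coefficient, 3 × (mean − median) / standard deviation, which is not part of the original STATKEY output and is included here only as an illustration; the function names are hypothetical. A positive value points to a longer right tail and a negative value to a longer left tail.

```python
def pearson_median_skewness(mean: float, median: float, std_dev: float) -> float:
    """Pearson's median skewness coefficient: 3 * (mean - median) / std_dev."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return 3 * (mean - median) / std_dev


def tail_direction(coefficient: float) -> str:
    """Describe which side the longer tail is on, based on the sign."""
    if coefficient > 0:
        return "right"
    if coefficient < 0:
        return "left"
    return "neither"


# Summary statistics for calories, copied from the table in this section.
skew = pearson_median_skewness(mean=1796.655, median=1666.800, std_dev=680.347)
print(f"skewness coefficient = {skew:.3f}, longer tail: {tail_direction(skew)}")
```

This is only a summary-level check; the dot plot remains the primary evidence for the shape of the distribution.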
Outliers
Outliers in a dataset can be identified using z-scores or by setting limits on the data with lower and upper fences (Dini, 2016). In this case, we use the lower and upper fence method. The fences are determined from the quartiles, using the formulas given below.
Upper Fence = Q3 + 1.5 × IQR
Lower Fence = Q1 – 1.5 × IQR
where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range:
IQR = Q3 – Q1
In this case:
IQR = 2100.450 – 1338.000
= 762.450
Upper Fence = 2100.450 + 1.5 × 762.450
= 3244.125
Lower Fence = 1338.000 – 1.5 × 762.450
= 194.325
Calorie values are expected to fall between 194.325 and 3244.125. A data point less than 194.325 or greater than 3244.125 is considered an outlier. The minimum value is 445.2, so no value lies below the lower fence of 194.325. However, there are values greater than 3244.125, as shown in the table below.
Calories Outliers
ID | Calories Value |
---|---|
62 | 6662.2 |
75 | 3457.2 |
77 | 3258.3 |
95 | 3711 |
152 | 4373.6 |
212 | 3328.4 |
269 | 3449.7 |
294 | 3511.1 |
The presence of outliers to the right of the third quartile can also be seen in the dot plot above. There are data points that stretch too far from the third quartile.
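The fence calculation in this section can be sketched in a few lines of Python. This is a minimal illustration, not part of the STATKEY workflow: the function name `tukey_fences` is hypothetical, and the list of values contains only the minimum, the maximum, and the outlier values reported above rather than the full dataset.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower fence, upper fence) computed as Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for calories, copied from the summary statistics table.
lower, upper = tukey_fences(q1=1338.000, q3=2100.450)

# Minimum value plus the values listed in the outlier table (not the full file).
values = [445.2, 6662.2, 3457.2, 3258.3, 3711.0, 4373.6, 3328.4, 3449.7, 3511.1]
outliers = [v for v in values if v < lower or v > upper]

print(f"lower fence = {lower:.3f}, upper fence = {upper:.3f}")
print(f"{len(outliers)} of the listed values fall outside the fences")
```

With the full data file loaded, the same filter would be applied to the entire calories column instead of this short list.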
Analysis of One Categorical Variable
Analysis of Smoke
The smoke variable is a categorical variable with two levels, yes and no. The following table shows the frequency and relative frequency of each level.
Summary Statistics
Smoke | Count | Proportion |
---|---|---|
No | 272 | 0.863 |
Yes | 43 | 0.137 |
Total | 315 | 1.000 |
A large majority of the subjects in the nutrition study are non-smokers. Those who responded YES to the smoke variable number 43 out of 315 cases, or 13.7% of the total; non-smokers make up the remaining 86.3%. In other words, roughly nine out of every ten subjects are non-smokers, so the study was conducted mostly on non-smokers. The figure below shows a bar graph of the number of smokers and non-smokers. The YES respondents are represented by the shorter bar and the NO respondents by the taller bar. Bar graphs are well suited to categorical variables because they clearly show the levels and their frequencies.
Categorical Variable (Smoke)
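The proportions in the smoke table follow directly from the counts. As a minimal sketch (the variable names are illustrative, and the counts are copied from the table above):

```python
# Counts for the smoke variable, copied from the summary statistics table.
counts = {"No": 272, "Yes": 43}

total = sum(counts.values())

# Relative frequency of each level, rounded to three decimals as in the table.
proportions = {level: round(n / total, 3) for level, n in counts.items()}

for level, share in proportions.items():
    print(f"{level}: {counts[level]} of {total} ({share:.1%})")
```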
Analysis of One Relationship Between Two Categorical Variables
Relationship between Gender and Smoke
Summary Statistics
Smoke \ Gender | Female | Male | Total |
---|---|---|---|
No | 237 | 35 | 272 |
Yes | 36 | 7 | 43 |
Total | 273 | 42 | 315 |
Smoke \ Gender | Female | Male | Total |
---|---|---|---|
No | 0.752 | 0.111 | 0.863 |
Yes | 0.114 | 0.022 | 0.137 |
Total | 0.867 | 0.133 | 1 |
The tables above give the summary statistics for the relationship between two categorical variables (gender and smoke): the first shows counts and the second shows proportions of all 315 cases. Out of the 273 female subjects, 237 were non-smokers, and 35 of the 42 male subjects were non-smokers. Put differently, 7 of 42 (16.7%) male subjects are smokers, while 36 of 273 (13.2%) female subjects are smokers, so the percentage of smokers is somewhat larger among males than among females. Females make up 86.7% of the cases, which indicates a gender imbalance in the study; it was conducted mostly on female non-smokers. The data suggest a slight association between gender and smoking, since the percentage of male smokers is higher than that of female smokers. However, we cannot draw firm conclusions from this study because of the gender imbalance: only 42 males were studied. A fairer comparison would require a similar number of male and female subjects.
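The comparison between male and female smokers relies on conditional proportions, that is, the share of smokers within each gender rather than within the whole sample. A minimal sketch of that calculation, using the counts from the two-way table (the dictionary layout and function name are illustrative assumptions):

```python
# Two-way table of counts, copied from the summary statistics above.
table = {
    "Female": {"No": 237, "Yes": 36},
    "Male": {"No": 35, "Yes": 7},
}


def smoking_rate(cell_counts: dict[str, int]) -> float:
    """Proportion of smokers within one gender (conditional on that gender)."""
    return cell_counts["Yes"] / sum(cell_counts.values())


rates = {gender: smoking_rate(cells) for gender, cells in table.items()}

for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%} smokers")
```

Dividing by the column total for each gender, rather than by 315, is what makes the two percentages comparable despite the very different group sizes.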
Analysis of One Relationship Between Categorical Variable and Quantitative Variable
Analysis of Relationship between Cholesterol and Vitamin Use
Summary Statistics
Statistics | Regular | Occasional | No | Overall |
---|---|---|---|---|
Sample Size | 122 | 82 | 111 | 315 |
Mean | 236.691 | 245.443 | 246.599 | 242.461 |
Standard Deviation | 151.098 | 99.628 | 131.330 | 131.992 |
Minimum | 59.2 | 84 | 37.7 | 37.7 |
Q1 | 141.10 | 171.20 | 154.85 | 155.00 |
Median | 194.20 | 227.65 | 211.70 | 206.30 |
Q3 | 283.30 | 308.80 | 333.40 | 308.85 |
Maximum | 900.7 | 574.2 | 718.8 | 900.7 |
The summary statistics table above describes the relationship between the categorical variable (vitamin use) and the quantitative variable (cholesterol). The mean cholesterol is 236.691 for regular vitamin users, 245.443 for occasional vitamin users, and 246.599 for non-users. The overall mean cholesterol is 242.461. Regular vitamin users have the lowest mean cholesterol, followed by occasional users, and non-users have the highest. This suggests a relationship between the categorical variable (vitamin use) and the quantitative variable (cholesterol).
More frequent vitamin use is associated with lower mean cholesterol, which suggests a negative association between vitamin use and cholesterol. However, the group means differ by only about 10 units, which is small compared with standard deviations of roughly 100 to 150, so the association is weak. The relationship can also be seen by comparing each group with the overall mean: the mean cholesterol for regular vitamin users is lower than the overall average, while the mean for non-users is higher.
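The overall mean in the table is the average of the group means weighted by group size, which explains why it sits between the regular-use mean and the no-use mean. A short check of that arithmetic, using the sample sizes and means from the table (the variable names are illustrative):

```python
# (sample size, mean cholesterol) for each vitamin-use group, from the table.
groups = {
    "Regular": (122, 236.691),
    "Occasional": (82, 245.443),
    "No": (111, 246.599),
}

total_n = sum(n for n, _ in groups.values())

# Weighted average of the group means, with the group sizes as weights.
overall_mean = sum(n * mean for n, mean in groups.values()) / total_n

print(f"n = {total_n}, overall mean = {overall_mean:.3f}")
```

Because the group means in the table are already rounded, the result can differ slightly in the last decimal place from the value STATKEY reports.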
Relationship between Cholesterol and Vitamin Use
The graph above shows the relationship between the quantitative variable (cholesterol) and the categorical variable (vitamin use). The data for all three groups (no, regular, and occasional vitamin use) are skewed to the right: in each group the mean exceeds the median, and the larger values trail off toward the right. The graph also suggests a high number of outliers on the right for regular vitamin users.
Analysis of Relationship Between Two Quantitative Variables
Analysis of Relationship between Beta Diet and Cholesterol Intake
A scatter plot is a graphical representation of the relationship between two quantitative variables. In the above scatter plot of beta diet against cholesterol, the points are widely scattered and show no clear trend, so there is no strong relationship between the two variables. The points do, however, drift slightly upward, which suggests a slight positive relationship. The linear association is too weak to rely on.
The weak positive relationship between beta diet and cholesterol is also reflected in the correlation coefficient. STATKEY reports a correlation coefficient of 0.116 between beta diet and cholesterol. This value is close to zero and therefore represents a weak relationship: an increase in cholesterol is associated with only a slight increase in beta diet. From the correlation coefficient and the pattern in the scatter plot, we can conclude that there is at most a very weak positive linear relationship between the cholesterol variable and the beta diet variable.
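For readers who want to see what the correlation coefficient measures, the sketch below computes Pearson's r from first principles. The function name `pearson_r` and the five demo pairs are hypothetical and are not taken from the Nutrition Study file; only the value 0.116 comes from the analysis above. Squaring r gives the share of variation in one variable that a straight-line fit on the other would account for.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical demo values, NOT taken from the Nutrition Study data.
cholesterol_demo = [150.0, 200.0, 250.0, 300.0, 350.0]
beta_diet_demo = [2100.0, 1800.0, 2600.0, 1900.0, 2500.0]
r_demo = pearson_r(cholesterol_demo, beta_diet_demo)

# Correlation reported in the analysis, and the share of variation it implies.
r_reported = 0.116
share_explained = r_reported ** 2

print(f"demo r = {r_demo:.3f}")
print(f"reported r = {r_reported}, r squared = {share_explained:.4f}")
```

An r of 0.116 corresponds to an r squared of about 0.013, so a linear fit would account for little more than 1% of the variation, which is consistent with the diffuse scatter plot.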
I expected this kind of relationship between beta diet and cholesterol. With cholesterol as the independent variable and beta diet as the dependent variable, a relationship exists only if a change in cholesterol is associated with a change in beta diet. Because the two are barely related, a scatter plot of this kind was expected.
Conclusion
As a student, I found that this project helped me build strong knowledge of descriptive analysis using STATKEY. Statistical analysis is easier in STATKEY than in Excel, since summary statistics and graphs can be obtained without performing complex calculations. In this project, for instance, we analyzed the Nutrition Study file using STATKEY. The project shows how descriptive statistical analysis of both categorical and quantitative variables is possible in STATKEY. For quantitative variables, it is possible to calculate descriptive statistics such as the mean, standard deviation, median, quartiles, minimum, and maximum. For categorical variables, counts and proportions (relative frequencies) can be determined. In this project, I was able to analyze the relationship between two categorical variables, two quantitative variables, and a quantitative and a categorical variable. The use of summary statistics and graphs in STATKEY provides a comprehensive statistical overview of each variable and of its relationships with other variables.
Nutrition study analysis is relevant to our daily lives, because such studies show how different lifestyles affect our health. To make such studies more helpful, it is important to show the relationship between the nutrition variables and lifestyle diseases. I therefore think a “lifestyle diseases” variable should be gathered next time to increase the depth of this analysis. Gathering this variable would make it possible to examine the relationship between various nutrition habits and lifestyle diseases. Scatter plots and correlation coefficients for such relationships would show whether a given nutrition habit is associated with an increased or decreased risk of lifestyle diseases. Also, the dataset is gender imbalanced: a large portion of the subjects are female. Nutrition is a sensitive topic that cuts across all genders. It is therefore important to gather more data so that the study includes a similar number of females and males.
References
Dini, L. (2016). EDTC 810: Statistics for Educational Research Dr. Glazer May 6, 2016.
Tintle, N., Chance, B. L., Cobb, G. W., Rossman, A. J., Roy, S., Swanson, T., & VanderStoep, J. (2015). Introduction to Statistical Investigations: High School Binding. John Wiley.