How to Analyze Data Using Descriptive Statistics

Introduction

The purpose of this project is to demonstrate the student’s mastery of descriptive statistics. It analyzes a dataset of choice using the technology tool STATKEY. The study performs descriptive statistics and graphical analysis of one quantitative variable, one categorical variable, one quantitative and one categorical variable, two categorical variables, and two quantitative variables. The dataset chosen for this project is the “Nutrition Study” data file, which contains three hundred and fifteen cases and seventeen variables. The variables are case ID, age, smoke, Quetelet, vitamin, calories, fat, fiber, alcohol, cholesterol, beta diet, retinol diet, beta plasma, retinol plasma, gender, vitamin use, and prior smoke. The table below describes each variable in the dataset.

Variable Name	Variable Description	Variable Type
ID	Numbers the cases; assigns each case a number	Quantitative
Age	Ages of the subjects of the nutrition study, ranging from 19 to 83	Quantitative
Smoke	Whether the subject smokes; takes the values yes and no	Categorical (level 2)
Quetelet	The subject’s Quetelet index (body mass index)	Quantitative
Vitamin	Vitamin intake of each subject of the study	Quantitative
Calories	The subject’s calorie intake	Quantitative
Fat	The subject’s fat intake	Quantitative
Fiber	Represents the level of fiber intake	Quantitative
Alcohol	The subject’s level of alcohol intake	Quantitative
Cholesterol	The subject’s level of cholesterol intake	Quantitative
Beta Diet	Dietary beta-carotene intake	Quantitative
Retinol Diet	Dietary retinol intake	Quantitative
Beta Plasma	Beta-carotene concentration in the blood plasma	Quantitative
Retinol Plasma	Retinol concentration in the blood plasma	Quantitative
Gender	Gender of the subject under study	Categorical (level 2)
Vitamin Use	The frequency of vitamin usage	Categorical (level 3)
Prior Smoke	Smokes prior to the period of study	Quantitative


Analysis

Analysis of One Quantitative Variable

Analysis of Calories

The following table shows the summary of descriptive statistics calculated using STATKEY.

Summary Statistics

Statistic	Value
Sample Size	315
Mean	1796.655
Standard Deviation	680.347
Minimum	445.2
Q1	1338.000
Median	1666.800
Q3	2100.450
Maximum	6662.2

The results in the table show that the calories variable in the Nutrition Study dataset has a mean of 1796.655; that is, the average calorie intake across all subjects under study is 1796.655. The standard deviation measures how far the data points typically deviate from the mean (Dini, 2016). The standard deviation of 680.347 means that, on average, the data points for the calories variable deviate from the mean by about 680.347 units. The lowest calorie intake is 445.2 and the highest is 6662.2, so the calories variable has a very wide range.
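As a rough illustration of how wide that range is, the short Python sketch below works only from the summary values in the table (the raw data are not reproduced here). It computes the range and expresses the maximum as a z-score, the number of standard deviations above the mean. The dictionary keys and the function name are illustrative choices, not part of STATKEY.

```python
# Summary statistics for calories, copied from the table above.
summary = {
    "mean": 1796.655,
    "sd": 680.347,
    "min": 445.2,
    "max": 6662.2,
}

def range_and_max_z(s):
    """Return the range and the z-score of the maximum value."""
    data_range = s["max"] - s["min"]
    max_z = (s["max"] - s["mean"]) / s["sd"]
    return data_range, max_z

data_range, max_z = range_and_max_z(summary)
print(f"range = {data_range:.1f}")
print(f"z-score of the maximum = {max_z:.2f}")
```

The maximum lies about seven standard deviations above the mean, which is consistent with the very wide range noted above.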

Dot Plot of Calories (Quantitative Variable)

Skewness

The data range from 445.2 to 6662.2, but most values are concentrated between 1000 and 3000. As seen in the graph, the calories data points are concentrated towards the left of the graph, with a long tail stretching to the right; this is evidence that the calories variable is skewed to the right. Also, the mean, 1796.655, is greater than the median, 1666.800. The median (the middle value of the data set) lies to the left of the mean because the few very large values in the right tail pull the mean upward. The fact that mean > median is, therefore, an indicator of skewness to the right.
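The direction of skew can be cross-checked numerically from the summary table alone. The sketch below uses two standard summary-based measures, Pearson's second skewness coefficient and the quartile-based (Bowley) coefficient; neither is reported by the analysis above, so they are offered only as an additional check, and the function names are illustrative.

```python
def pearson_second_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Quartile-based (Bowley) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Values copied from the calories summary table.
pearson = pearson_second_skewness(mean=1796.655, median=1666.800, sd=680.347)
bowley = bowley_skewness(q1=1338.000, median=1666.800, q3=2100.450)

# Positive values point to a longer tail on the right (high-calorie) side.
print(f"Pearson's second coefficient: {pearson:.2f}")
print(f"Bowley coefficient: {bowley:.2f}")
```

Both coefficients are positive for the calories variable, meaning the longer tail is on the high-calorie side.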

Outliers

Outliers in a data set can be identified using z-scores or by setting data limits with a lower and an upper fence (Dini, 2016). In this case, we will use the lower and upper fence method. The fences are determined from the quartiles, using the formulas given below.

Upper Fence = Q3 + 1.5 × IQR

Lower Fence = Q1 – 1.5 × IQR

where Q1 is the first quartile, Q3 is the third quartile, and IQR is the inter-quartile range:

IQR = Q3 – Q1

In this case:

IQR = 2100.450 – 1338.000 = 762.450

Upper Fence = 2100.450 + 1.5 × 762.450 = 3244.125

Lower Fence = 1338.000 – 1.5 × 762.450 = 194.325

Calories values between 194.325 and 3244.125 are considered typical. A data point less than 194.325 or greater than 3244.125 is considered an outlier. The minimum value is 445.2, so no value falls below the lower fence of 194.325. However, there are values greater than 3244.125, as shown in the table below.

Calories Outliers
ID	Calories Value
62	6662.2
75	3457.2
77	3258.3
95	3711
152	4373.6
212	3328.4
269	3449.7
294	3511.1

The presence of outliers to the right of the third quartile can also be seen in the dot plot above. There are data points that stretch too far from the third quartile.
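The fence calculation can be reproduced in a few lines of Python. This is a minimal sketch that uses only the quartiles and the values listed in the outlier table, since the full dataset is not reproduced here; the function name `fences` is an illustrative choice.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = fences(q1=1338.000, q3=2100.450)

# Calories values from the outlier table, keyed by case ID.
candidates = {
    62: 6662.2, 75: 3457.2, 77: 3258.3, 95: 3711.0,
    152: 4373.6, 212: 3328.4, 269: 3449.7, 294: 3511.1,
}

# Keep only the cases that fall outside the fences.
outliers = {case_id: value for case_id, value in candidates.items()
            if value < lower or value > upper}

# The minimum (445.2) is checked separately against the lower fence.
minimum_is_outlier = 445.2 < lower

print(f"lower fence = {lower:.3f}, upper fence = {upper:.3f}")
print(f"outlier IDs: {sorted(outliers)}")
print(f"minimum flagged as outlier: {minimum_is_outlier}")
```

All eight listed cases fall above the upper fence, and the minimum is not flagged, which matches the table.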

Analysis of One Categorical Variable

Analysis of Smoke

The smoke variable is a categorical variable with two levels, yes and no. The following table combines the frequency and relative frequency columns.

Summary Statistics

	Count	Proportion
No	272	0.863
Yes	43	0.137
Total	315	1.000

A large majority of the subjects in the nutrition study are non-smokers. Those who responded YES to the smoke variable number 43 of the 315 cases, or 13.7% of the total; non-smokers make up the remaining 86.3%. Roughly nine out of every ten subjects are non-smokers, so the study was conducted mostly on non-smokers. The figure below shows the number of smokers and non-smokers as a bar graph. The YES respondents are represented by the shorter bar and the NO respondents by the taller bar. Bar graphs are well suited to categorical variables because they clearly show the levels of the variable and their frequencies.

Categorical Variable (Smoke)
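The proportions in the frequency table follow directly from the counts. A minimal Python sketch, using only the two counts reported above:

```python
# Counts from the frequency table for the smoke variable.
counts = {"No": 272, "Yes": 43}

total = sum(counts.values())
proportions = {level: n / total for level, n in counts.items()}

for level, p in proportions.items():
    print(f"{level}: {counts[level]} of {total} = {p:.3f}")
```

The two proportions necessarily sum to 1, which is a quick way to check a relative frequency table for arithmetic slips.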

Analysis of One Relationship Between Two Categorical Variables

Relationship between Gender and Smoke

Summary Statistics

Smoke \ Gender	Female	Male	Total
No	237	35	272
Yes	36	7	43
Total	273	42	315

Smoke \ Gender	Female	Male	Total
No	0.752	0.111	0.863
Yes	0.114	0.022	0.137
Total	0.867	0.133	1

The tables above summarize the relationship between two categorical variables (Gender and Smoke): the first gives counts and the second gives proportions of all 315 cases. Of the 273 female subjects in the study, 237 were non-smokers; of the 42 male subjects, 35 were non-smokers. Thus 7 out of 42 (16.7%) male subjects are smokers, while 36 out of 273 (13.2%) female subjects are smokers, so the percentage of smokers is somewhat larger among males than among females. Females made up 86.7% of the cases, so the sample is heavily unbalanced by gender; it consists mostly of female non-smokers. The data suggest a slight association between gender and smoking, since the percentage of male smokers is higher than the percentage of female smokers. However, we cannot draw firm conclusions from this study because of the gender imbalance: with only 42 male subjects, the male percentage is estimated far less precisely than the female percentage. A more balanced sample of male and female subjects would be needed to support this conclusion.
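The comparison of smoking rates rests on proportions computed within each gender rather than proportions of all 315 cases. The sketch below computes them from the counts in the two-way table; the nested-dictionary layout and the function name are illustrative choices.

```python
# Counts from the two-way table, indexed as counts[smoke][gender].
counts = {
    "No": {"Female": 237, "Male": 35},
    "Yes": {"Female": 36, "Male": 7},
}

def smoking_rate_by_gender(table):
    """Return the proportion of smokers within each gender."""
    rates = {}
    for gender in table["Yes"]:
        gender_total = sum(table[smoke][gender] for smoke in table)
        rates[gender] = table["Yes"][gender] / gender_total
    return rates

rates = smoking_rate_by_gender(counts)
for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%} smokers")
```

Dividing by the gender totals (273 and 42) rather than by 315 is what makes the two rates comparable despite the unequal group sizes.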

Analysis of One Relationship Between Categorical Variable and Quantitative Variable

Analysis of Relationship between Cholesterol and Vitamin Use

Summary Statistics

Statistics	Regular	Occasional	No	Overall
Sample Size	122	82	111	315
Mean	236.691	245.443	246.599	242.461
Standard Deviation	151.098	99.628	131.330	131.992
Minimum	59.2	84	37.7	37.7
Q1	141.10	171.20	154.85	155.00
Median	194.20	227.65	211.70	206.30
Q3	283.30	308.80	333.40	308.85
Maximum	900.7	574.2	718.8	900.7

The summary statistics table above describes the relationship between the categorical variable (vitamin use) and the quantitative variable (cholesterol). The mean cholesterol is 236.691 for regular vitamin users, 245.443 for occasional vitamin users, and 246.599 for non-users. The overall mean cholesterol is 242.461. Regular vitamin users have the lowest mean cholesterol, followed by occasional users, and non-users have the highest. This suggests a relationship between the categorical variable (vitamin use) and the quantitative variable (cholesterol).

The group means decrease as the frequency of vitamin use increases, which suggests a negative association between vitamin use and cholesterol. The differences are small, however, compared with the standard deviations within each group (roughly 100 to 151), so the association is weak. The same pattern appears when the group means are compared with the overall mean: the mean cholesterol for regular vitamin users is lower than the overall average, while the mean for non-users is higher.
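Two quick checks on the table can be sketched in Python. The first confirms that the overall mean is the sample-size-weighted average of the three group means; the second expresses the gap between non-users and regular users in units of the overall standard deviation. The variable names are illustrative, and all numbers are copied from the summary table.

```python
# Sample sizes and mean cholesterol per vitamin-use group, from the table.
groups = {
    "Regular": {"n": 122, "mean": 236.691},
    "Occasional": {"n": 82, "mean": 245.443},
    "No": {"n": 111, "mean": 246.599},
}
overall_sd = 131.992

# Weighted average of the group means, using group sizes as weights.
total_n = sum(g["n"] for g in groups.values())
weighted_mean = sum(g["n"] * g["mean"] for g in groups.values()) / total_n

# Gap between the highest and lowest group means, in SD units.
gap = groups["No"]["mean"] - groups["Regular"]["mean"]
gap_in_sd = gap / overall_sd

print(f"weighted overall mean = {weighted_mean:.3f}")
print(f"gap = {gap:.3f} ({gap_in_sd:.3f} standard deviations)")
```

The weighted mean reproduces the overall mean reported in the table, and the gap between the extreme groups is well under a tenth of a standard deviation.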

Relationship between Cholesterol and Vitamin Use

The graph above shows the relationship between the quantitative variable (cholesterol) and the categorical variable (vitamin use). The distributions for no vitamin use, regular vitamin use, and occasional vitamin use are all skewed to the right; in each group the mean exceeds the median. The graph also shows a large number of potential outliers on the right for regular vitamin use.

Analysis of Relationship Between Two Quantitative Variables

Analysis of Relationship between Beta Diet and Cholesterol Intake

A scatter plot is a graphical representation of the relationship between two quantitative variables. In the above scatter plot of beta diet against cholesterol, the points are widely scattered with no clear trend, so there is no strong relationship between the two variables. The points do, however, drift slightly upward, which indicates a slight positive relationship. The linear association is so weak that we cannot rely on it.

The weak positive relationship between beta diet and cholesterol is also seen in the value of the correlation. The results computed by the STATKEY program show that the correlation coefficient between beta diet and cholesterol is 0.116. This value is close to zero and therefore represents a weak relationship between the two variables: an increase in the cholesterol variable is associated with only a slight increase in the beta diet variable. From the correlation coefficient and the pattern in the scatter plot, we can conclude that there is no meaningful linear relationship between the cholesterol variable and the beta diet variable.
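For readers who want to see how a correlation coefficient is computed, the sketch below implements the Pearson formula in plain Python. The raw beta diet and cholesterol values are not reproduced here, so the function is demonstrated on a small set of made-up numbers (`demo_x` and `demo_y` are hypothetical, not study data). The last lines square the reported r = 0.116 to show the share of variation a linear fit would account for.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical illustration data, NOT taken from the Nutrition Study.
demo_x = [1, 2, 3, 4, 5]
demo_y = [2, 1, 4, 3, 5]
demo_r = pearson_r(demo_x, demo_y)
print(f"r for the made-up data = {demo_r:.2f}")

# Correlation reported by STATKEY for beta diet vs. cholesterol.
reported_r = 0.116
r_squared = reported_r ** 2
print(f"r^2 for the reported correlation = {r_squared:.4f}")
```

With r = 0.116, the squared correlation is a little over 1%, which is another way of seeing how weak the linear relationship is.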

I expected this kind of relationship between beta diet and cholesterol. With cholesterol as the independent variable and beta diet as the dependent variable, a relationship exists only if a change in cholesterol is associated with a change in beta diet. Since the two are at most weakly related, this kind of scatter plot was expected.

Conclusion

As a student, I found that this project helped me build strong knowledge of descriptive analysis using STATKEY. Statistical analysis in STATKEY is easier than in Excel, since summary statistics and graphs can be obtained without performing complex calculations. For instance, in this project, we analyzed the Nutrition Study file using STATKEY. The project shows how descriptive statistical analysis of both categorical and quantitative variables is possible in the STATKEY program. For quantitative variables, it is possible to calculate descriptive statistics such as the mean, standard deviation, median, quartiles, minimum, and maximum. For categorical variables, counts and proportions (relative frequencies) can be determined. In this project, I was able to analyze the relationship between two categorical variables, two quantitative variables, and a quantitative and a categorical variable. The use of summary statistics and graphs in STATKEY provides a comprehensive statistical overview of each variable and its relationship with other variables.

Nutrition study analysis is relevant to our daily lives because it shows how different lifestyles affect our health. To make such studies more helpful, it is important to relate the nutrition variables to lifestyle diseases. I therefore think a “lifestyle diseases” variable should be gathered next time to increase the depth of this analysis. Gathering this variable would make it possible to examine the relationship between various nutrition habits and lifestyle diseases. Scatter plots and correlation coefficients for such relationships would show whether a given nutrition habit is associated with an increased or decreased risk of lifestyle diseases. Also, the dataset is unbalanced by gender; a large portion of the subjects under study are female. Nutrition is a sensitive topic that cuts across all genders, so future data collection should aim for roughly equal numbers of female and male subjects.

References

Dini, L. (2016). EDTC 810: Statistics for Educational Research Dr. Glazer May 6, 2016.

Tintle, N., Chance, B. L., Cobb, G. W., Rossman, A. J., Roy, S., Swanson, T., & VanderStoep, J. (2015). Introduction to Statistical Investigations: High School Binding. John Wiley.
