To illustrate certain statistical concepts we procured a publicly available, open-sourced data set from the U.S. Department of Health and Human Services listing the leading causes of death in the United States from 1999 to 2015. The observations recorded covered fifty-two state-level entries, categorized on a yearly basis by cause of death. The full set of observations is not reproduced in this paper; our main focus was on the total number of deaths caused by either kidney diseases or cancer.
From these two data sets we then took sample observations to identify and test the underlying concepts of statistical inference: the population mean, the sample mean, the standard error, and the various testing procedures defined herein. The appendix tables present the sample data sets in question; for each disease we procured fifty-one observations on which to run the desired metrics.
Estimates of Sample & Population Means
To estimate how closely the sample mean approximates the given population mean, we took sample data from the kidney-disease spreadsheet and tabulated the results in Appendix A, Table A1. The observations noted were the total deaths caused by kidney disease, across all underlying states, for the year 2005. We captured the same quantities for cancer; however, to eliminate any selection bias, we moved one year forward and took the total number of deaths for 2006.
For both these sample sets we then calculated the sample mean using the following notation: x̄ = (x₁ + x₂ + … + xₙ) / n, where n is the number of observations in the sample.
The sample means for kidney diseases and cancer were recorded as 860.80 and 10,978.20 respectively, while their population means were 874.20 for kidney diseases and 11,532.00 for cancer. The notation used to calculate the population mean is as follows: μ = (x₁ + x₂ + … + x_N) / N, where N is the total number of observations in the population.
The estimates for both sets of observations are reasonably accurate: the sample mean for the values in Table A1 of Appendix A is within 98.46 percent of the overall population mean for kidney diseases, while the sample mean for the data set in Table A2 is about 95 percent accurate when compared against the full set of recorded deaths caused by cancer.
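These accuracy figures follow directly from the means quoted above; a minimal sketch (the helper name `percent_accuracy` is ours, not from the paper):

```python
# Percent accuracy of a sample mean relative to the population mean,
# using the figures reported in the text.
def percent_accuracy(sample_mean, population_mean):
    return (1 - abs(sample_mean - population_mean) / population_mean) * 100

kidney = percent_accuracy(860.80, 874.20)      # ~98.47 percent
cancer = percent_accuracy(10978.20, 11532.00)  # ~95.2 percent
print(round(kidney, 2), round(cancer, 2))
```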
Report on Standard Error
To normalize a sample data set of any size, researchers calculate the standard error of their measured statistic. This is done by relating the standard deviation of the sample data to a statistic such as the mean or the median. The statistic in question need not be a mean or a median: any estimate whose variability can be expressed through a standard deviation can be assigned a standard error.
The standard error of any data set is also inversely proportional to the square root of the size of the sample procured from the total population. The larger the number of observations selected from the total population, the closer the sample estimate will be to the actual value for the entire data set.
The standard error is calculated by dividing the standard deviation of a set of observations by the square root of the number of observations. For our standard errors on both Tables A1 and A2, we plugged in the following equation: SE = σ / √n
Where σ is the standard deviation, calculated from the following notation: σ = √( Σ(xᵢ − x̄)² / (n − 1) )
And n = the total number of observations in the collected sample set.
Based on these expressions, and using the observations collected in Tables A1 and A2, our two data sets had recorded standard errors of ±112.32 and ±1,562.51 respectively. Both values are small enough to temper the effect of outliers and normalize the data sets for mapping any given policy by the U.S. Department of Health & Human Services.
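The mean-to-standard-error pipeline described above can be sketched in a few lines; the death counts below are illustrative placeholders, not the actual Table A1 figures:

```python
import math
import statistics

# Illustrative state-level death counts (not the actual Table A1 data).
deaths = [412, 655, 980, 1203, 760, 890, 1105, 534, 699, 872]

n = len(deaths)
mean = statistics.mean(deaths)
sd = statistics.stdev(deaths)   # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)          # standard error of the mean
print(f"mean={mean:.2f} sd={sd:.2f} se={se:.2f}")
```

Note that `statistics.stdev` uses the n − 1 (sample) denominator; `statistics.pstdev` would give the population form.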
Determining the Confidence Limits
Confidence limits are a specified range of values that tell us where a parameter of our sample set is likely to lie in a normally distributed data set. To calculate the confidence limits we first obtain the critical value, denoted z(α/2), where α is one minus the chosen confidence level. This critical value, read from a Z table (and denoted t below), is then multiplied by the standard error of the data set.
To get the final interval, the margin thus procured is added to and subtracted from the sample mean, giving the upper and lower bounds. The following expression describes the confidence limits: x̄ ± t × (σ / √n)
Where t is the critical value obtained by mapping z(α/2) on a Z distribution chart. Applying this notation to our sample data sets at a 98 percent confidence level, kidney diseases would have confidence limits of 860.80 ± 261.70 while cancer would be at 10,978.20 ± 3,640.64.
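The kidney-disease limits above can be reproduced from the reported sample mean and standard error; z = 2.33 here is the rounded Z-table critical value for 98 percent confidence:

```python
# Reproducing the paper's 98 percent confidence limits from its reported
# figures. z = 2.33 is the rounded Z-table value for alpha/2 = 0.01.
z = 2.33

kidney_mean, kidney_se = 860.80, 112.32
margin = z * kidney_se                       # ~261.71
lower, upper = kidney_mean - margin, kidney_mean + margin
print(round(lower, 2), round(upper, 2))
```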
The Use of Z Scores
In a normal distribution, Z scores are used to express how far a given observation lies from the sample mean, measured in standard deviations. They can be notated as follows: z = (x − x̄) / σ
Now if we apply the same Z-score parameters to our Tables A1 and A2, we can see exactly how much each value deviates from the sample mean. As an example, we can isolate the total deaths caused in Wisconsin by kidney diseases, 930 in our sample data set; using the preceding formula, this value is 0.086 standard deviations away from the sample mean.
Similarly, we can calculate a Z score for the North Carolina cancer deaths, which land 0.57 standard deviations away from the sample mean. Providing Z scores for a normally distributed data set helps identify outliers in a set of observations and thereby enables the governing body to make adjustments to policy as well.
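The Wisconsin figure can be checked from numbers already given in the text: with SE = σ/√n and n = 51 observations, σ can be back-derived from the reported standard error of 112.32:

```python
import math

# Back-deriving sigma from the reported standard error (SE = sigma / sqrt(n),
# n = 51 observations), then computing the Z score for the Wisconsin
# kidney-death count of 930 quoted in the text.
n = 51
se = 112.32
sample_mean = 860.80
sigma = se * math.sqrt(n)

z = (930 - sample_mean) / sigma
print(round(z, 3))   # matches the 0.086 reported above
```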
Different Methods of Testing
Accuracy is paramount in any statistical computation; a researcher should therefore validate their findings by running them through either a one-tailed or a two-tailed hypothesis test. Since Tables A1 and A2 each contain more than forty observations, it is important that multiple testing methods be used to obtain the most accurate measure of the calculated mean. The following general hypothesis-test table can be used to identify and compare the calculated mean of the sample data set:
Set | Null Hypothesis | Alternative Hypothesis | Number of Tails
1   | X = M           | X ≠ M                  | 2
2   | X ≥ M           | X < M                  | 1
3   | X ≤ M           | X > M                  | 1
Where X is the mean of the sample data set and M is the hypothesized value it is tested against.
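The difference between the one- and two-tailed sets in the table comes down to how much of the normal curve is counted against the null hypothesis; a stdlib-only sketch of both p-values (the function name is ours):

```python
import math

# One- versus two-tailed p-values for a standard-normal test statistic,
# using only the standard library (math.erf gives the normal CDF).
def p_values(z):
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Z <= z)
    one_tailed = 1 - phi                           # upper-tail test (e.g. set 3)
    two_tailed = 2 * min(phi, 1 - phi)             # two-sided test (set 1)
    return one_tailed, two_tailed

print(p_values(1.96))
```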
F-Tests
F-tests are conducted when sets of observations may include very small sample sizes, which can bias calculations done on the data. A statistician would typically conduct an F-test by obtaining different sample data sets from the same overall population and comparing their respective variances to verify his or her findings. For our sample data we procured fifty-two values, approximately 6 percent of the overall population, covering each demographic for the entire region in question; running an F-test was therefore not strictly necessary.
Application of Z-Tests
In a z-test, a researcher compares the mean of a set of observations against a hypothesized value to either reject or fail to reject a null hypothesis. For the sample data sets provided, the variances are similar across both groups of observations, which is why the mean is taken as the reference point to refute or support (Ho).
In health-care research, z-tests can be used to compare demographics based on age, but since that metric is not available in our sample data, we are unable to use this particular testing method on Tables A1 and A2.
Application of T-Tests
A t-test is conducted when the mean values of two different data sets are close to each other and the statistician needs to differentiate them using a T distribution and the corresponding P value (probability) for both groups of observations. In most cases the variances for a t-test are unknown, and the P value is used to contest a null hypothesis (Ho).
For our Tables A1 and A2 we can conduct a t-test by summing all the deaths caused per year and tabulating the result on a spreadsheet. This gives us overall deaths per year, and using these totals along with the probability procured by a t-test, we can effectively mark which of the two diseases in question has a higher progression rate.
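A minimal version of that comparison can be sketched with Welch's two-sample t statistic (which does not assume equal variances); the yearly totals below are hypothetical placeholders, not the actual Table A1/A2 sums:

```python
import math
import statistics

# Welch's two-sample t statistic and degrees of freedom (unequal variances).
def welch_t(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # n - 1 denominators
    se2 = va / len(a) + vb / len(b)
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / len(a)) ** 2 / (len(a) - 1)
                     + (vb / len(b)) ** 2 / (len(b) - 1))
    return t, df

kidney_totals = [44500, 45200, 46100, 45800, 46900]      # hypothetical per-year sums
cancer_totals = [559000, 562400, 566000, 564100, 569500]  # hypothetical per-year sums
t, df = welch_t(kidney_totals, cancer_totals)
print(round(t, 1), round(df, 1))
```

A library such as SciPy would report the accompanying P value directly; the statistic above is the same quantity it computes.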
One-Tailed Test
One-tailed tests are used when a researcher is trying to prove or refute a null hypothesis (Ho) using a directional statement which can be either true or false. For example, from Table A1 we can draw up a null hypothesis stating that all kidney-disease values for the state of Iowa are less than the sample mean calculated for the entire data set.
This type of claim can easily be put to the test by checking the condition X < x̄ for every observation, where x̄ is the sample mean and X represents the observed value. If all values fulfil this condition then (Ho) stands.
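The check itself is a one-liner; the Iowa figures below are hypothetical placeholders, while the sample mean is the one reported earlier for the 2005 kidney data:

```python
# Checking the directional claim: every observed value must fall below
# the sample mean. The Iowa figures here are hypothetical placeholders.
sample_mean = 860.80   # sample mean reported for the 2005 kidney data
iowa_values = [512, 498, 530, 545, 561]

h0_holds = all(x < sample_mean for x in iowa_values)
print(h0_holds)   # True for these placeholder values
```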
Two-Tailed Test
A two-tailed test comprises a range of conditions that need to be fulfilled for a statistician to claim or refute (Ho). For instance, age demographics could be used to identify whether cancer or kidney diseases are more prevalent in a specific age range, and health-care policy could be drawn up using those findings.
A two-tailed test covers both sides of a normally distributed data set and may therefore have a lower error percentage than a one-tailed test. However, since Tables A1 and A2 do not contain any age-related metrics, we have not included this test in our paper.
Type 1 & Type 2 Error
The notion of a 'false positive' relates to a type 1 error, which occurs when a statistician infers something that does not fundamentally exist in the population data set. Conversely, a 'false negative', or type 2 error, occurs when a researcher infers something to be absent from the data set when it duly exists in the recorded observations. Our health-care data in Tables A1 and A2 are devoid of any type 1 or type 2 errors, since no predictive calculations are made on their account throughout the paper.
References
Wasserman, L., 2005. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics, pp. 119-148.
Altman, D. G., 2015. Practical Statistics for Medical Research. Chapman & Hall [Hardback].
U.S. Department of Health & Human Services, 2015. NCHS – Leading Causes of Death: United States (Dataset 1999-2015). Centers for Disease Control and Prevention.
Westfall, P., Henning, K. S. S., 2013. Understanding Advanced Statistical Methods (1st Edition). Chapman & Hall/CRC Texts in Statistical Science, pp. 37-77.
Hatcher, L., 2013. Advanced Statistics in Research: Reading, Understanding, and Writing Up Data Analysis Results. Shadow Finch Media LLC.
Haberman, S. J., 1996. Advanced Statistics. Springer Series in Statistics, pp. 1-86.
Foster, J. J., Barkus, E., Yavorsky, C., 2006. Understanding and Using Advanced Statistics: A Practical Guide for Students (First Edition). SAGE Publications, pp. 70-85 (Factor Analysis).