30 Nov 2022

R-Studio: Data Recovery Software

1. Selection of Dataset

The study used the default of credit card clients data set and aimed to examine customer default payments in Taiwan. The continuous variables selected for the study are the amount of the given credit in NT dollars (“LIMIT_BAL”), the age of the customer (“AGE”), and the amount of the bill statement in September 2005 (“BILL_AMT1”). The categorical variables are the education level of the customer (1 = graduate school; 2 = university; 3 = high school; 4 = others), marital status (1 = married; 2 = single; 3 = others), and gender (1 = male; 2 = female). The dichotomous variable is the default payment (Yes = 1, No = 0).
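The coding scheme described above can be attached to the numeric codes with factor(). The following is a minimal sketch on a few hypothetical rows (the values are illustrative, not taken from the data set); the data frame name clients is an assumption. Any code outside the declared levels becomes NA, which matters here because the summary statistics later in the paper show EDUCATION ranging from 0 to 6 and MARRIAGE starting at 0.

```r
# Toy rows only (hypothetical values); the real data come from the credit card file.
clients <- data.frame(
  SEX       = c(1, 2, 2, 1),
  EDUCATION = c(1, 2, 3, 4),
  MARRIAGE  = c(1, 2, 3, 2),
  default.payment.next.month = c(0, 1, 0, 0)
)

# Attach the labels given in the text to the numeric codes.
clients$SEX <- factor(clients$SEX, levels = 1:2,
                      labels = c("male", "female"))
clients$EDUCATION <- factor(clients$EDUCATION, levels = 1:4,
                            labels = c("graduate school", "university",
                                       "high school", "others"))
clients$MARRIAGE <- factor(clients$MARRIAGE, levels = 1:3,
                           labels = c("married", "single", "others"))
clients$default.payment.next.month <- factor(
  clients$default.payment.next.month, levels = 0:1, labels = c("No", "Yes")
)

# Inspect the recoded structure.
str(clients)
```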

2. Descriptive Statistics 

Amount of the given credit (NT dollar) 

The histogram above shows the distribution of the amount of the given credit. The data are skewed to the right.

Age 

The graph above shows the age distribution of the credit card clients. The data are not normally distributed; they are skewed to the right, since most clients are younger than the mean age.

Amount of Bill statement in September, 2005 

The bar plot above indicates that the number of females is greater than the number of males. Males are coded as 1 and females as 2.

The bar graph above shows the education level of the credit card clients. Most clients were at the university and graduate school levels, while the fewest were in the high school and others categories.

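The pie chart shows the marital status of the credit card clients. As labeled, it suggests that most clients are “others”, followed by singles, with married clients the smallest group. The summary statistics for MARRIAGE (median 2, mean 1.552, minimum 0) point instead to codes 1 (married) and 2 (single) being the most common, so the chart labels are probably shifted by one position because the data include an undocumented code 0.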

The pie chart shows the default payment status among credit card clients. The mean of the default indicator is 0.2212, so about 22% of the clients defaulted on payment while the majority did not.

Summary Measures

The study uses measures of centrality and dispersion. The measures of centrality consist of the mean, median, and mode, while the measures of dispersion comprise the range and standard deviation. The following are the descriptive statistics for the data.
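As an illustration of the measures named above, here is a minimal sketch on a hypothetical vector of ages (not values from the data set). Base R has no function for the statistical mode, so the helper stat_mode is an assumption introduced only for this example.

```r
# Hypothetical toy ages, not taken from the data set.
ages <- c(24, 28, 28, 34, 41, 55)

# R has no built-in statistical mode, so a small helper is defined here.
# It returns the most frequent value (the first one in case of ties).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Measures of centrality.
centrality <- c(mean = mean(ages), median = median(ages), mode = stat_mode(ages))

# Measures of dispersion: range as max minus min, and sample standard deviation.
dispersion <- c(range = diff(range(ages)), sd = sd(ages))

print(centrality)
print(dispersion)
```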

default.payment.next.month   LIMIT_BAL          SEX             EDUCATION
Min.   :0.0000               Min.   :  10000    Min.   :1.000   Min.   :0.000
1st Qu.:0.0000               1st Qu.:  50000    1st Qu.:1.000   1st Qu.:1.000
Median :0.0000               Median : 140000    Median :2.000   Median :2.000
Mean   :0.2212               Mean   : 167484    Mean   :1.604   Mean   :1.853
3rd Qu.:0.0000               3rd Qu.: 240000    3rd Qu.:2.000   3rd Qu.:2.000
Max.   :1.0000               Max.   :1000000    Max.   :2.000   Max.   :6.000

MARRIAGE        AGE             BILL_AMT1
Min.   :0.000   Min.   :21.00   Min.   :-165580
1st Qu.:1.000   1st Qu.:28.00   1st Qu.:   3559
Median :2.000   Median :34.00   Median :  22382
Mean   :1.552   Mean   :35.49   Mean   :  51223
3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.:  67091
Max.   :3.000   Max.   :79.00   Max.   : 964511

The mean age of the clients is 35.49, with maximum and minimum ages of 79 and 21, respectively. The mean amount of credit given is 167484 NT dollars, with maximum and minimum amounts of 1000000 and 10000, respectively. Further, the average bill statement amount in September 2005 is 51223.

3. Inferential Statistics

T-test 

> t.test(AGE, LIMIT_BAL, paired = TRUE, alternative = "two.sided")

        Paired t-test

data: AGE and LIMIT_BAL
t = -223.54, df = 29999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -168917.1 -165980.6
sample estimates:
mean of the differences
              -167448.8

The test statistic is t = -223.54 with p < 2.2e-16. Since the p-value is smaller than alpha = 0.05, we reject the null hypothesis and conclude that there is a significant difference between the mean age and the mean amount of credit given.
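The paired t statistic is the mean of the pairwise differences divided by its standard error. The sketch below reproduces that calculation by hand on hypothetical toy vectors (x and y are stand-ins, not the study data) and compares it with t.test(). Because age is measured in years and credit in NT dollars, a large statistic of this kind mostly reflects the difference in scale between the two variables.

```r
# Hypothetical toy vectors standing in for AGE and LIMIT_BAL.
x <- c(25, 32, 41, 29, 37)
y <- c(20000, 50000, 120000, 30000, 80000)

# Pairwise differences and their count.
d <- x - y
n <- length(d)

# Paired t statistic: mean difference over its standard error.
t_manual <- mean(d) / (sd(d) / sqrt(n))

# The same statistic from the built-in test, with n - 1 degrees of freedom.
t_builtin <- unname(t.test(x, y, paired = TRUE)$statistic)

print(c(manual = t_manual, builtin = t_builtin))
```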

ANOVA 

We conduct an ANOVA with age as the response and the balance limit and the amount of the bill statement in September 2005 as the explanatory variables.

> summary(da)
               Df  Sum Sq Mean Sq F value  Pr(>F)
LIMIT_BAL       1   53381   53381  641.79 < 2e-16 ***
BILL_AMT1       1     619     619    7.44 0.00638 **
Residuals   29997 2495008      83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For the balance limit, F(1, 29997) = 641.79, p < 2e-16; for the amount of the bill statement, F(1, 29997) = 7.44, p = 0.00638. Since both p-values are smaller than alpha = 0.01, we reject the null hypothesis for each term. Hence, we conclude that mean age differs significantly with both the balance limit and the amount of the bill statement in September 2005.
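Each F value in the ANOVA table is the term's mean square divided by the residual mean square. The sketch below recomputes both from the sums of squares printed in the table; since those are rounded, the results agree with the reported values only up to rounding. The variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table (rounded as printed).
ss_limit <- 53381
ss_bill  <- 619
ss_resid <- 2495008
df_resid <- 29997

# Residual mean square (printed as 83 in the table after rounding).
ms_resid <- ss_resid / df_resid

# Each term has 1 degree of freedom, so its mean square equals its sum of squares.
f_limit <- ss_limit / ms_resid
f_bill  <- ss_bill / ms_resid

# Upper-tail p-value for the bill statement term.
p_bill <- pf(f_bill, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(c(F_LIMIT_BAL = f_limit, F_BILL_AMT1 = f_bill, p_BILL_AMT1 = p_bill), 4))
```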

Chi-square test

We conducted a chi-square test to establish the association between education level and marital status.

        Pearson's Chi-squared test

data: My$EDUCATION and My$MARRIAGE
X-squared = 1187.6, df = 18, p-value < 2.2e-16

The test statistic is χ² = 1187.6 (df = 18), with p-value < 2.2e-16. Since the p-value is smaller than alpha = 0.05, there is a significant association between education level and marital status.
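The chi-square statistic compares the observed counts with the counts expected under independence. The following is a minimal sketch on a hypothetical 2 x 2 table (the counts and category names are illustrative only). The degrees of freedom are (rows - 1) x (columns - 1); the reported df = 18 is consistent with a 7 x 4 table, which would arise if EDUCATION takes the seven codes 0 to 6 and MARRIAGE the four codes 0 to 3.

```r
# Hypothetical 2 x 2 counts, for illustration only.
observed <- matrix(c(30, 20, 10, 40), nrow = 2,
                   dimnames = list(education = c("A", "B"),
                                   marital   = c("X", "Y")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom.
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Built-in test without continuity correction, for comparison.
builtin <- chisq.test(observed, correct = FALSE)

print(c(manual = chi_sq, builtin = unname(builtin$statistic), df = df))
```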

Scatter plot 

Correlation test 

        Pearson's product-moment correlation

data: My$AGE and My$BILL_AMT1
t = 9.7559, df = 29998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.04495120 0.06751151
sample estimates:
       cor
0.05623853

The correlation coefficient is 0.05623853 with p-value < 2.2e-16, which indicates a statistically significant correlation between age and the amount of the bill statement in September 2005. The relationship is positive but very weak, since the coefficient is close to zero.
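The t statistic in the correlation test follows directly from the coefficient and the sample size, t = r * sqrt(n - 2) / sqrt(1 - r^2). A short sketch using the reported values (n = 30000 is inferred from df = 29998) shows why such a small coefficient is still significant: the sample is very large, while r^2 indicates that well under 1% of the variance is shared.

```r
# Reported correlation and sample size (df = n - 2 = 29998).
r <- 0.05623853
n <- 30000

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Share of variance in one variable associated with the other.
r_squared <- r^2

print(round(c(t = t_stat, r_squared = r_squared), 4))
```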

Regression Analysis

Simple regression 

Call:
lm(formula = AGE ~ LIMIT_BAL)

Residuals:
    Min      1Q  Median      3Q     Max
-15.904  -7.409  -1.904   5.769  40.713

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.376e+01  8.599e-02  392.65   <2e-16 ***
LIMIT_BAL   1.028e-05  4.059e-07   25.33   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.121 on 29998 degrees of freedom
Multiple R-squared: 0.02094, Adjusted R-squared: 0.02091
F-statistic: 641.6 on 1 and 29998 DF, p-value: < 2.2e-16

Multiple regression 

Call:
lm(formula = AGE ~ LIMIT_BAL + BILL_AMT1)

Residuals:
    Min      1Q  Median      3Q     Max
-16.724  -7.386  -1.908   5.774  40.033

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.371e+01  8.784e-02 383.825  < 2e-16 ***
LIMIT_BAL   9.951e-06  4.234e-07  23.501  < 2e-16 ***
BILL_AMT1   2.035e-06  7.461e-07   2.728  0.00638 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.12 on 29997 degrees of freedom
Multiple R-squared: 0.02118, Adjusted R-squared: 0.02112
F-statistic: 324.6 on 2 and 29997 DF, p-value: < 2.2e-16

Regression with Dummy variables 

Call:
lm(formula = AGE ~ EDUCATION + SEX)

Residuals:
    Min      1Q  Median      3Q     Max
-24.079  -7.033  -1.090   5.910  44.207

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.49551    0.21528  160.23   <2e-16 ***
EDUCATION    2.05725    0.06601   31.17   <2e-16 ***
SEX         -1.75987    0.10666  -16.50   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.035 on 29997 degrees of freedom
Multiple R-squared: 0.03937, Adjusted R-squared: 0.0393
F-statistic: 614.6 on 2 and 29997 DF, p-value: < 2.2e-16
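In the output above, EDUCATION and SEX each receive a single slope, which means R treated the numeric codes as continuous values. As an alternative (not the approach used in this study), the sketch below shows how wrapping the codes in factor() makes lm() create one indicator (dummy) column per non-reference category. The toy data frame and its values are hypothetical.

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  AGE       = c(25, 31, 42, 38, 29, 47, 35, 52),
  EDUCATION = c(1, 2, 3, 2, 1, 3, 2, 4),
  SEX       = c(1, 2, 2, 1, 2, 1, 2, 1)
)

# Convert the numeric codes to factors so they are treated as categories.
toy$EDUCATION <- factor(toy$EDUCATION)
toy$SEX <- factor(toy$SEX)

# lm() now expands each factor into dummy variables,
# using the first level of each factor as the reference category.
fit_dummy <- lm(AGE ~ EDUCATION + SEX, data = toy)

# Column names of the design matrix show the generated dummy variables.
dummy_columns <- colnames(model.matrix(fit_dummy))
print(dummy_columns)
```

Each coefficient on a dummy column is then read as the difference in mean age relative to the reference category, holding the other variable fixed.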

Regression with interaction 

Call:
lm(formula = AGE ~ EDUCATION * LIMIT_BAL + SEX)

Residuals:
    Min      1Q  Median      3Q     Max
-28.416  -6.750  -1.623   5.513  40.536

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          2.907e+01  2.855e-01  101.84   <2e-16 ***
EDUCATION            3.744e+00  1.090e-01   34.35   <2e-16 ***
LIMIT_BAL            2.618e-05  9.846e-07   26.59   <2e-16 ***
SEX                 -1.800e+00  1.044e-01  -17.23   <2e-16 ***
EDUCATION:LIMIT_BAL -7.024e-06  5.127e-07  -13.70   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.836 on 29995 degrees of freedom
Multiple R-squared: 0.08136, Adjusted R-squared: 0.08124
F-statistic: 664.1 on 4 and 29995 DF, p-value: < 2.2e-16

Logistic regression when output is dichotomous 

Call:
glm(formula = default.payment.next.month ~ AGE + LIMIT_BAL +
    SEX + EDUCATION + BILL_AMT1 + MARRIAGE, family = "binomial",
    data = My)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0157  -0.7683  -0.6471  -0.4215   2.6574

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.777e-01  1.111e-01  -2.498 0.012480 *
AGE          5.712e-03  1.679e-03   3.402 0.000668 ***
LIMIT_BAL   -3.755e-06  1.415e-07 -26.545  < 2e-16 ***
SEX         -1.602e-01  2.888e-02  -5.546 2.92e-08 ***
EDUCATION   -7.063e-02  1.931e-02  -3.658 0.000254 ***
BILL_AMT1    1.177e-06  2.221e-07   5.297 1.18e-07 ***
MARRIAGE    -1.843e-01  2.993e-02  -6.158 7.36e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 31705 on 29999 degrees of freedom
Residual deviance: 30792 on 29993 degrees of freedom
AIC: 30806

Number of Fisher Scoring iterations: 4
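Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below applies exp() to the estimates reported above; the variable names and the choice of a 100,000 NT dollar step for the credit limit are introduced here for illustration.

```r
# Coefficient estimates copied from the logistic regression output.
coefs <- c(AGE = 5.712e-03, LIMIT_BAL = -3.755e-06, SEX = -1.602e-01,
           EDUCATION = -7.063e-02, BILL_AMT1 = 1.177e-06, MARRIAGE = -1.843e-01)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# LIMIT_BAL is in NT dollars, so a one-unit change is tiny;
# a 100,000 NT dollar increase is a more readable step.
or_limit_100k <- exp(coefs[["LIMIT_BAL"]] * 1e5)
print(round(or_limit_100k, 3))
```

An odds ratio below 1 means the odds of default fall as the predictor rises. For example, with SEX entered as a numeric code, moving from 1 (male) to 2 (female) multiplies the estimated odds of default by roughly 0.85, other variables held fixed.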

Applying F-test or AIC to find the best model 

We use AIC to find the best model; the best model is the one with the lowest AIC value. A stepwise analysis is conducted by removing variables from the model one at a time and comparing the AIC values (Schabenberger & Gotway, 2017). We start with model 1, shown above. For model 2, we remove one variable, which increases the AIC. For model 3, removing a second variable raises the AIC again. The AIC keeps increasing as further variables are removed. Therefore, model 1 is the best model for the study.
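For a logistic regression on 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the values reported for model 1 (seven coefficients including the intercept) and then confirms the identity on a small hypothetical data set; toy_fit, x, and y are illustrative names, not part of the study.

```r
# Model 1: residual deviance 30792 with 7 estimated coefficients.
k <- 7
aic_reported <- 30792 + 2 * k
print(aic_reported)

# Hypothetical toy data to confirm the identity with a fitted model.
x <- 1:10
y <- c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)
toy_fit <- glm(y ~ x, family = binomial)

# AIC computed by hand from the deviance and the number of coefficients.
aic_manual <- deviance(toy_fit) + 2 * length(coef(toy_fit))
print(c(manual = aic_manual, builtin = AIC(toy_fit)))
```

Because the penalty term grows with each added coefficient, dropping a variable lowers the AIC only when the resulting rise in deviance is smaller than 2.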

References 

Schabenberger, O., & Gotway, C. A. (2017). Statistical methods for spatial data analysis. CRC Press.

R-codes 

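> library(ggplot2)  # needed for ggplot() and geom_bar()
> My <- read.csv(file.choose(), header = TRUE)
> attach(My)  # allows AGE, LIMIT_BAL, etc. to be used without the My$ prefix
> names(My)
> # Histograms of the continuous variables
> hist(LIMIT_BAL)
> hist(AGE)
> hist(AGE, col = "red")
> hist(BILL_AMT1)
> # Bar plots of the categorical variables
> ggplot(My) + geom_bar(aes(x = SEX))
> ggplot(My) + geom_bar(aes(x = EDUCATION))
> # Pie charts: labels are matched to the codes by name so they cannot shift.
> # Code 0 is not documented for MARRIAGE; "unknown" is an assumed label.
> L <- c("0" = "unknown", "1" = "married", "2" = "single", "3" = "others")
> A <- table(My$MARRIAGE)
> pie(A, labels = L[names(A)])
> D <- c("0" = "NO", "1" = "YES")
> G <- table(My$default.payment.next.month)
> pie(G, labels = D[names(G)])
> # Descriptive statistics
> summary(My)
> # Inferential statistics
> t.test(AGE, LIMIT_BAL, paired = TRUE, alternative = "two.sided")
> da <- aov(AGE ~ LIMIT_BAL + BILL_AMT1, data = My)
> summary(da)
> chisq.test(My$EDUCATION, My$MARRIAGE)
> plot(AGE, BILL_AMT1)
> cor.test(My$AGE, My$BILL_AMT1)
> # Linear regression models
> fit <- lm(AGE ~ LIMIT_BAL)
> summary(fit)
> fit1 <- lm(AGE ~ LIMIT_BAL + BILL_AMT1)
> summary(fit1)
> fit2 <- lm(AGE ~ EDUCATION + SEX)
> summary(fit2)
> fit3 <- lm(AGE ~ EDUCATION * LIMIT_BAL + SEX)
> summary(fit3)
> # Logistic regression (model 1)
> fit4 <- glm(default.payment.next.month ~ AGE + LIMIT_BAL + SEX + EDUCATION + BILL_AMT1 + MARRIAGE, data = My, family = "binomial")
> summary(fit4)
> # AIC comparison: fit5 repeats model 1; later models drop variables in turn
> fit5 <- glm(default.payment.next.month ~ AGE + LIMIT_BAL + SEX + EDUCATION + BILL_AMT1 + MARRIAGE, data = My, family = "binomial")
> summary(fit5)
> fit51 <- glm(default.payment.next.month ~ AGE + LIMIT_BAL + SEX + EDUCATION + BILL_AMT1, data = My, family = "binomial")
> summary(fit51)
> fit61 <- glm(default.payment.next.month ~ AGE + LIMIT_BAL + SEX + EDUCATION, data = My, family = "binomial")
> summary(fit61)
> fit611 <- glm(default.payment.next.month ~ AGE + LIMIT_BAL, data = My, family = "binomial")
> summary(fit611)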

Cite this page

StudyBounty. (2023, September 14). R-Studio: Data Recovery Software.
https://studybounty.com/r-studio-data-recovery-software-research-paper
