1. Selection of Dataset
The study used the "default of credit card clients" database to examine customer default payments in Taiwan. The continuous variables selected for this study are the amount of credit given in NT dollars, denoted “LIMIT_BAL”; the age of the customer, denoted “AGE”; and the amount of the bill statement in September 2005, denoted “BILL_AMT1”. The categorical variables are the education level of the customer (1 = graduate school; 2 = university; 3 = high school; 4 = others), marital status (1 = married; 2 = single; 3 = others) and gender (1 = male; 2 = female). The dichotomous variable is the default payment (Yes = 1, No = 0).
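The coding scheme above can be made explicit in R by converting the numeric codes to labelled factors. The following is a minimal sketch on a small invented data frame named `toy`; only the column names and code labels come from the description above, and the values are illustrative rather than taken from the dataset.

```r
# Toy data frame using the column names described above; values are invented.
toy <- data.frame(
  SEX       = c(1, 2, 2, 1),
  EDUCATION = c(1, 2, 3, 4),
  MARRIAGE  = c(1, 2, 3, 2)
)

# Map the documented numeric codes to readable labels.
toy$SEX <- factor(toy$SEX, levels = 1:2, labels = c("male", "female"))
toy$EDUCATION <- factor(toy$EDUCATION, levels = 1:4,
                        labels = c("graduate school", "university",
                                   "high school", "others"))
toy$MARRIAGE <- factor(toy$MARRIAGE, levels = 1:3,
                       labels = c("married", "single", "others"))

# Codes outside the documented range would become NA, which makes them easy to spot.
print(table(toy$SEX))
print(table(toy$EDUCATION))
```

Treating these columns as factors also keeps plots and models from reading the codes as if they were quantities.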
2. Descriptive Statistics
Amount of the given credit (NT dollar)
The histogram above shows the distribution of the amount of credit given. The data are skewed to the right.
Age
The graph above shows the age distribution of the credit card clients. The data are not normally distributed and are skewed to the right, since most clients are younger than the mean age.
Amount of Bill statement in September, 2005
The bar plot above indicates that there are more female clients than male clients. Males are coded as 1 and females as 2.
The bar graph above shows the education level of the credit card clients. Most clients were educated to university or graduate school level, while the fewest fell into the high school and others categories.
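The pie chart above shows the marital status of the credit card clients. As labelled, most of the clients are “others”, followed by single clients, and the fewest are married. The slice labels may be offset by one category, however: MARRIAGE takes four codes (0 to 3) in the summary statistics, while only three labels were supplied to the chart, and a median code of 2 points to single clients as the largest group.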
The pie chart above shows default payment among the credit card clients. The mean of the 0/1 default indicator is 0.2212, so about 22% of the customers defaulted on payment while the large majority did not.
Summary Measures
The study uses measures of central tendency and dispersion. The measures of central tendency are the mean, median and mode, while the measures of dispersion are the range and standard deviation. The descriptive statistics for the data are as follows.
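As a minimal sketch of how these measures can be computed in R, the block below uses a short invented vector `x`. Base R has no built-in statistical mode, so `stat_mode` is a hypothetical helper written for this illustration.

```r
# Illustrative values only; not taken from the dataset.
x <- c(21, 28, 34, 34, 41, 79)

# Hypothetical helper: returns the most frequent value
# (the smallest such value if several are tied).
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

# Measures of central tendency.
centre <- c(mean = mean(x), median = median(x), mode = stat_mode(x))

# Measures of dispersion: range (max minus min) and standard deviation.
spread <- c(range = diff(range(x)), sd = sd(x))

print(centre)
print(spread)
```

Note that `summary()` reports quartiles and the mean but not the mode or standard deviation, so those two need separate calls like the ones sketched here.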
 default.payment.next.month   LIMIT_BAL           SEX             EDUCATION
 Min.   :0.0000               Min.   :  10000     Min.   :1.000   Min.   :0.000
 1st Qu.:0.0000               1st Qu.:  50000     1st Qu.:1.000   1st Qu.:1.000
 Median :0.0000               Median : 140000     Median :2.000   Median :2.000
 Mean   :0.2212               Mean   : 167484     Mean   :1.604   Mean   :1.853
 3rd Qu.:0.0000               3rd Qu.: 240000     3rd Qu.:2.000   3rd Qu.:2.000
 Max.   :1.0000               Max.   :1000000     Max.   :2.000   Max.   :6.000
 MARRIAGE          AGE             BILL_AMT1
 Min.   :0.000     Min.   :21.00   Min.   :-165580
 1st Qu.:1.000     1st Qu.:28.00   1st Qu.:   3559
 Median :2.000     Median :34.00   Median :  22382
 Mean   :1.552     Mean   :35.49   Mean   :  51223
 3rd Qu.:2.000     3rd Qu.:41.00   3rd Qu.:  67091
 Max.   :3.000     Max.   :79.00   Max.   : 964511
The mean age of the clients is 35.49 years, with maximum and minimum ages of 79 and 21 respectively. The mean amount of credit given is NT$167,484, with maximum and minimum amounts of NT$1,000,000 and NT$10,000 respectively. Further, the average bill statement amount in September 2005 is NT$51,223.
3. Inferential Statistics
T-test
> t.test(AGE,LIMIT_BAL, paired = TRUE, alternative = "two.sided")
Paired t-test
data: AGE and LIMIT_BAL
t = -223.54, df = 29999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-168917.1 -165980.6
sample estimates:
mean of the differences
-167448.8
The test statistic is t = -223.54 with p < 2.2e-16. Since the p-value is smaller than alpha = 0.05, we reject the null hypothesis. Therefore, we conclude that there is a significant difference between the mean age and the mean amount of credit given.
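To show where a paired t statistic comes from, the sketch below computes it by hand on two short invented vectors and compares the result with `t.test`. The names `a`, `b`, `t_manual` and `p_manual` are hypothetical and the values are not from the dataset.

```r
# Invented paired measurements; not taken from the dataset.
a <- c(5.1, 4.8, 6.0, 5.5, 5.9)
b <- c(4.7, 4.9, 5.2, 5.0, 5.1)

d <- a - b                                  # per-pair differences
n <- length(d)

# Paired t statistic: mean difference divided by its standard error.
t_manual <- mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(a, b, paired = TRUE, alternative = "two.sided")
print(c(manual = t_manual, t.test = unname(res$statistic)))
```

Because the statistic is built from the pairwise differences, the two variables should be measured on the same scale for the mean difference to be interpretable.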
ANOVA
We conduct an ANOVA of age on the balance limit and the amount of the bill statement in September 2005.
> summary(da)
               Df  Sum Sq Mean Sq F value  Pr(>F)
LIMIT_BAL       1   53381   53381  641.79 < 2e-16 ***
BILL_AMT1       1     619     619    7.44 0.00638 **
Residuals   29997 2495008      83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For LIMIT_BAL, F(1, 29997) = 641.79 with p < 2e-16; for BILL_AMT1, F(1, 29997) = 7.44 with p = 0.00638. Since both p-values are smaller than alpha = 0.01, we reject the null hypothesis in each case. Hence, we conclude that both the balance limit and the amount of the bill statement in September 2005 are significantly related to age.
Chi-square test
We conducted a chi-square test to establish the association between education level and marital status.
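Each F value in the ANOVA table is the term's mean square divided by the residual mean square. The sketch below reproduces that arithmetic from the printed sums of squares; because the table values are rounded, the results agree only approximately. The variable names are chosen for this illustration.

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_limit <- 53381      # Sum Sq for LIMIT_BAL
ss_bill  <- 619        # Sum Sq for BILL_AMT1
ss_resid <- 2495008    # residual Sum Sq
df_resid <- 29997

# Residual mean square (printed as 83 in the table).
ms_resid <- ss_resid / df_resid

# Each term has 1 degree of freedom, so its mean square equals its sum of squares.
f_limit <- (ss_limit / 1) / ms_resid
f_bill  <- (ss_bill / 1) / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_bill <- pf(f_bill, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(c(F_LIMIT_BAL = f_limit, F_BILL_AMT1 = f_bill), 2))
print(signif(p_bill, 3))
```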
Pearson's Chi-squared test
data: My$EDUCATION and My$MARRIAGE
X-squared = 1187.6, df = 18, p-value < 2.2e-16
The test statistic is X-squared = 1187.6 (df = 18), with p-value < 2.2e-16. Since the p-value is smaller than alpha = 0.05, there is a significant association between education level and marital status.
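The degrees of freedom of a chi-square test of association are (rows - 1) × (columns - 1). The reported df = 18 is consistent with a 7 × 4 table, which matches the code ranges in the summary statistics (EDUCATION 0 to 6, MARRIAGE 0 to 3). The sketch below checks the rule and the statistic on a small invented table; the counts and category names are illustrative only.

```r
# Invented counts for a 3 x 2 contingency table; not from the dataset.
tab <- matrix(c(30, 20,
                25, 35,
                15, 40),
              nrow = 3, byrow = TRUE,
              dimnames = list(education = c("A", "B", "C"),
                              marriage  = c("X", "Y")))

res <- chisq.test(tab)

# Degrees of freedom are (rows - 1) * (columns - 1).
df_manual <- (nrow(tab) - 1) * (ncol(tab) - 1)

# The statistic sums (observed - expected)^2 / expected over all cells.
x2_manual <- sum((tab - res$expected)^2 / res$expected)
print(c(df = df_manual, X2 = x2_manual))

# Same rule applied to a 7 x 4 table (7 education codes, 4 marriage codes).
df_report <- (7 - 1) * (4 - 1)
print(df_report)
```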
Scatter plot
Correlation test
Pearson's product-moment correlation
data: My$AGE and My$BILL_AMT1
t = 9.7559, df = 29998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.04495120 0.06751151
sample estimates:
cor
0.05623853
The correlation coefficient between age and the amount of the bill statement in September 2005 is 0.05623853, with p-value < 2.2e-16. The correlation is therefore statistically significant, although at r ≈ 0.06 the linear relationship is very weak.
Regression Analysis
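The t statistic in the correlation output follows directly from the coefficient and the sample size, t = r·sqrt(n - 2) / sqrt(1 - r²). The sketch below reproduces it from the printed values, taking n = 30000 from the reported df = 29998; the variable names are chosen for this illustration.

```r
# Values copied from the correlation output above.
r <- 0.05623853
n <- 30000            # reported df = n - 2 = 29998

# Test statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# Squared correlation: share of variance linearly shared by the two variables.
r_squared <- r^2

print(round(c(t = t_stat, r_squared = r_squared), 4))
```

With a sample this large, even a very small correlation produces a tiny p-value, which is why the squared correlation is worth reading alongside the test.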
Simple regression
Call:
lm(formula = AGE ~ LIMIT_BAL)
Residuals:
Min 1Q Median 3Q Max
-15.904 -7.409 -1.904 5.769 40.713
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.376e+01  8.599e-02  392.65   <2e-16 ***
LIMIT_BAL   1.028e-05  4.059e-07   25.33   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.121 on 29998 degrees of freedom
Multiple R-squared: 0.02094, Adjusted R-squared: 0.02091
F-statistic: 641.6 on 1 and 29998 DF, p-value: < 2.2e-16
Multiple regression
Call:
lm(formula = AGE ~ LIMIT_BAL + BILL_AMT1)
Residuals:
Min 1Q Median 3Q Max
-16.724 -7.386 -1.908 5.774 40.033
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.371e+01  8.784e-02 383.825  < 2e-16 ***
LIMIT_BAL   9.951e-06  4.234e-07  23.501  < 2e-16 ***
BILL_AMT1   2.035e-06  7.461e-07   2.728  0.00638 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.12 on 29997 degrees of freedom
Multiple R-squared: 0.02118, Adjusted R-squared: 0.02112
F-statistic: 324.6 on 2 and 29997 DF, p-value: < 2.2e-16
Regression with Dummy variables
Call:
lm(formula = AGE ~ EDUCATION + SEX)
Residuals:
Min 1Q Median 3Q Max
-24.079 -7.033 -1.090 5.910 44.207
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.49551    0.21528  160.23   <2e-16 ***
EDUCATION    2.05725    0.06601   31.17   <2e-16 ***
SEX         -1.75987    0.10666  -16.50   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.035 on 29997 degrees of freedom
Multiple R-squared: 0.03937, Adjusted R-squared: 0.0393
F-statistic: 614.6 on 2 and 29997 DF, p-value: < 2.2e-16
Regression with interaction
Call:
lm(formula = AGE ~ EDUCATION * LIMIT_BAL + SEX)
Residuals:
Min 1Q Median 3Q Max
-28.416 -6.750 -1.623 5.513 40.536
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          2.907e+01  2.855e-01  101.84   <2e-16 ***
EDUCATION            3.744e+00  1.090e-01   34.35   <2e-16 ***
LIMIT_BAL            2.618e-05  9.846e-07   26.59   <2e-16 ***
SEX                 -1.800e+00  1.044e-01  -17.23   <2e-16 ***
EDUCATION:LIMIT_BAL -7.024e-06  5.127e-07  -13.70   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.836 on 29995 degrees of freedom
Multiple R-squared: 0.08136, Adjusted R-squared: 0.08124
F-statistic: 664.1 on 4 and 29995 DF, p-value: < 2.2e-16
Logistic regression when output is dichotomous
Call:
glm(formula = default.payment.next.month ~ AGE + LIMIT_BAL +
SEX + EDUCATION + BILL_AMT1 + MARRIAGE, family = "binomial",
data = My)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0157 -0.7683 -0.6471 -0.4215 2.6574
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.777e-01  1.111e-01  -2.498 0.012480 *
AGE          5.712e-03  1.679e-03   3.402 0.000668 ***
LIMIT_BAL   -3.755e-06  1.415e-07 -26.545  < 2e-16 ***
SEX         -1.602e-01  2.888e-02  -5.546 2.92e-08 ***
EDUCATION   -7.063e-02  1.931e-02  -3.658 0.000254 ***
BILL_AMT1    1.177e-06  2.221e-07   5.297 1.18e-07 ***
MARRIAGE    -1.843e-01  2.993e-02  -6.158 7.36e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 31705 on 29999 degrees of freedom
Residual deviance: 30792 on 29993 degrees of freedom
AIC: 30806
Number of Fisher Scoring iterations: 4
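Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the coefficients printed above. The 100,000 NT dollar step for LIMIT_BAL is an illustrative choice made here, not something the original analysis specifies.

```r
# Coefficients copied from the logistic regression output above.
coefs <- c(AGE = 5.712e-03, LIMIT_BAL = -3.755e-06, SEX = -1.602e-01,
           EDUCATION = -7.063e-02, BILL_AMT1 = 1.177e-06, MARRIAGE = -1.843e-01)

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios <- exp(coefs)

# LIMIT_BAL is in NT dollars, so a larger step is easier to interpret.
or_limit_100k <- exp(coefs[["LIMIT_BAL"]] * 1e5)

print(round(odds_ratios, 4))
print(round(or_limit_100k, 3))
```

An odds ratio below 1 (for example LIMIT_BAL, SEX, EDUCATION and MARRIAGE here) means higher values of that predictor go with lower odds of default, while a ratio above 1 (AGE, BILL_AMT1) means higher odds.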
Applying F-test or AIC to find the best model
We use AIC to find the best model; the best model is the one with the lowest AIC value. A backward stepwise analysis is conducted by removing variables from the model one at a time and comparing the AIC values (Schabenberger & Gotway, 2017). We start with model 1, the full logistic regression shown above. Removing one variable to form model 2 increases the AIC, and removing a second variable to form model 3 increases it again. The AIC keeps increasing as further variables are removed. Therefore, model 1 is the best model for the study.
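For a logistic model with a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients, which reproduces the value reported for model 1 (30792 + 2 × 7 = 30806). The sketch below checks that arithmetic and then compares a full and a reduced model by AIC on simulated data. The simulated variables `x1`, `x2` and `y` are invented for illustration and are unrelated to the dataset.

```r
# Reported values for model 1 (the full logistic model above).
resid_dev  <- 30792
k          <- 7                     # intercept + 6 predictors
aic_model1 <- resid_dev + 2 * k

# Simulated data (invented) to compare a full and a reduced model by AIC.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x1))

full    <- glm(y ~ x1 + x2, family = "binomial")
reduced <- glm(y ~ x1, family = "binomial")

# Same identity checked on a fitted model.
aic_check <- deviance(full) + 2 * length(coef(full))

# Lower AIC indicates the preferred model among those compared.
print(aic_model1)
print(AIC(full, reduced))
```

Base R's `step()` function automates this kind of backward comparison, though fitting and comparing the candidate models explicitly, as in the appendix, makes each AIC value visible.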
References
Schabenberger, O., & Gotway, C. A. (2017). Statistical methods for spatial data analysis. CRC Press.
R-codes
> My=read.csv(file.choose(), header = TRUE)
> attach(My)
> names(My)
> hist(LIMIT_BAL)
> hist(AGE)
> hist(AGE,col = "red")
> hist(BILL_AMT1)
> library(ggplot2)  # required for ggplot() and geom_bar()
> ggplot(My) + geom_bar(aes(x = SEX))
> ggplot(My) + geom_bar(aes(x = EDUCATION))
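> L=c("code 0 (undocumented)","married","single","others")
> A=table(factor(My$MARRIAGE, levels = 0:3, labels = L))  # summary(My) shows codes 0 to 3, so each code gets its own label
> pie(A, labels = L)
> D=c("NO","YES")  # table() orders the codes 0, 1, so "NO" must come first
> G=table(My$default.payment.next.month)  # column name as it appears in summary(My)
> pie(G, labels = D)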
> summary(My)
> t.test(AGE,LIMIT_BAL, paired = TRUE, alternative = "two.sided")
> da=aov(AGE~LIMIT_BAL+BILL_AMT1,data=My)
> summary(da)
> chisq.test(My$EDUCATION,My$MARRIAGE)
> plot(AGE,BILL_AMT1)
> cor.test(My$AGE,My$BILL_AMT1)
> fit=lm(AGE~LIMIT_BAL)
> summary(fit)
> fit1=lm(AGE~LIMIT_BAL+BILL_AMT1)
> summary(fit1)
> fit2=lm(AGE~EDUCATION+SEX)
> summary(fit2)
> fit3=lm(AGE~EDUCATION*LIMIT_BAL+SEX)
> summary(fit3)
> fit4=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1+MARRIAGE,data = My,family = "binomial")
> summary(fit4)
> fit5=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1+MARRIAGE,data = My,family = "binomial")
> summary(fit5)
> fit51=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1,data = My,family = "binomial")
> summary(fit51)
> fit61=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION,data = My,family = "binomial")
> summary(fit61)
> fit611=glm(default.payment.next.month~AGE+LIMIT_BAL,data = My,family = "binomial")
> summary(fit611)