R-Studio: Data Recovery Software

1. Selection of Dataset

The study used the database of the default of credit card clients. The study aimed to examine customer default payments in Taiwan. The continuous variables selected for this study are the amount of the given credit (NT dollar), denoted as “LIMIT_BAL”; the age of the customer, denoted as “AGE”; and the amount of the bill statement in September 2005, denoted as “BILL_AMT1”. The categorical variables are the education level of the customer (1 = graduate school; 2 = university; 3 = high school; 4 = others), marital status (1 = married; 2 = single; 3 = others) and gender (1 = male; 2 = female). The dichotomous variable is the default payment (Yes = 1, No = 0).
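The coding scheme described above can be made explicit in R by converting the integer codes to labelled factors. The following is a minimal sketch, not the study's own code: the data frame `toy` and its values are invented for illustration, and only the column names and the code-to-label mappings come from the description above.

```r
# Toy data with the same column names and integer codes as described above.
# The values are invented for illustration only (not the study data).
toy <- data.frame(
  SEX       = c(1, 2, 2, 1),
  EDUCATION = c(1, 2, 3, 4),
  MARRIAGE  = c(1, 2, 3, 2)
)

# Attach the labels from the data description to each integer code.
toy$SEX <- factor(toy$SEX, levels = 1:2, labels = c("male", "female"))
toy$EDUCATION <- factor(toy$EDUCATION, levels = 1:4,
                        labels = c("graduate school", "university",
                                   "high school", "others"))
toy$MARRIAGE <- factor(toy$MARRIAGE, levels = 1:3,
                       labels = c("married", "single", "others"))

# Any code outside the listed levels would become NA instead of being
# silently treated as a valid category.
print(table(toy$EDUCATION))
```

Working with labelled factors keeps plots and frequency tables readable and makes undocumented codes visible as missing values.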

2. Descriptive Statistics

Amount of the given credit (NT dollar)

The histogram above represents the distribution of the amount of the given credit. It shows that the data are skewed to the right.


Age

The graph above shows the age of the credit card clients. The graph indicates that the data are not normally distributed and are skewed to the right, since most clients are younger than the mean age.

Amount of Bill statement in September, 2005

The bar plot above indicates that the number of females is greater than the number of males. Males are represented by 1 and females by 2.

The bar graph above is for the education level of the credit card clients. It shows that most of the clients were at the university and graduate school levels, while the fewest were in the high school and others categories.

The pie chart shows the marital status of the credit card clients. Going by the summary statistics for MARRIAGE (median 2, mean 1.552), single and married clients make up most of the sample, while “others” accounts for a minority.

The pie chart shows the default payment status among credit card clients. About 22% of the customers defaulted on payment (the mean of the 0/1 default indicator is 0.2212), while the majority did not.

Summary Measures

The study will use measures of centrality and dispersion. The measures of centrality consist of the mean, median and mode, while the measures of dispersion comprise the range and standard deviation. The following are the descriptive statistics for the data.

 default.payment.next.month   LIMIT_BAL          SEX             EDUCATION
 Min.   :0.0000               Min.   :  10000    Min.   :1.000   Min.   :0.000
 1st Qu.:0.0000               1st Qu.:  50000    1st Qu.:1.000   1st Qu.:1.000
 Median :0.0000               Median : 140000    Median :2.000   Median :2.000
 Mean   :0.2212               Mean   : 167484    Mean   :1.604   Mean   :1.853
 3rd Qu.:0.0000               3rd Qu.: 240000    3rd Qu.:2.000   3rd Qu.:2.000
 Max.   :1.0000               Max.   :1000000    Max.   :2.000   Max.   :6.000

 MARRIAGE        AGE             BILL_AMT1
 Min.   :0.000   Min.   :21.00   Min.   :-165580
 1st Qu.:1.000   1st Qu.:28.00   1st Qu.:   3559
 Median :2.000   Median :34.00   Median :  22382
 Mean   :1.552   Mean   :35.49   Mean   :  51223
 3rd Qu.:2.000   3rd Qu.:41.00   3rd Qu.:  67091
 Max.   :3.000   Max.   :79.00   Max.   : 964511

The mean age of the clients is 35.49, with maximum and minimum ages of 79 and 21 respectively. The mean amount of credit given is 167484, with maximum and minimum amounts of 1000000 and 10000 respectively. Further, 51223 is the average amount of the bill statements in September 2005.
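The `summary()` output above reports the mean, median and quartiles, but not the mode, range or standard deviation named as summary measures. A minimal sketch of how those could be computed in R follows; the vector `age` holds invented values and the helper `stat_mode` is a hypothetical name introduced here, since base R has no built-in function for the statistical mode.

```r
# Hypothetical ages, invented for illustration (not the study data).
age <- c(23, 25, 25, 29, 34, 41, 58)

# Helper returning the most frequent value; ties resolve to the
# smallest of the tied values because table() sorts its names.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Measures of centrality.
central <- c(mean = mean(age), median = median(age), mode = stat_mode(age))

# Measures of dispersion: range as max - min, and the sample
# standard deviation.
spread <- c(range = diff(range(age)), sd = sd(age))

print(central)
print(spread)
```

On the study data the same calls would be applied to columns such as `My$AGE` or `My$LIMIT_BAL`.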

3. Inferential Statistics

T-test

> t.test(AGE,LIMIT_BAL, paired = TRUE, alternative = "two.sided")

Paired t-test

data: AGE and LIMIT_BAL

t = -223.54, df = 29999, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-168917.1 -165980.6

sample estimates:

mean of the differences

-167448.8

The test statistic is t = -223.54, with p < 2.2e-16. Since the p-value is smaller than alpha = 0.05, we reject the null hypothesis. Therefore, we deduce that there is a significant difference between the mean age and the mean amount of credit given.
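As a consistency check on the output above, the t statistic can be recovered from the reported mean difference and confidence interval. This is a sketch that uses only the printed (rounded) numbers, so the result agrees with the reported value only up to rounding; the variable names are chosen here for illustration.

```r
# Values copied from the paired t-test output above.
mean_diff <- -167448.8
ci        <- c(-168917.1, -165980.6)
df        <- 29999

# The 95% interval is mean_diff +/- qt(0.975, df) * SE, so the standard
# error can be backed out from the interval's half-width.
se <- diff(ci) / (2 * qt(0.975, df))

# t statistic = mean difference / standard error of the differences.
t_stat <- mean_diff / se
print(round(t_stat, 1))
```

The recovered value is close to the reported t = -223.54, which confirms that the interval and the statistic describe the same test.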

ANOVA

We will conduct ANOVA for Age, balance limit and the amount of bill statement in September, 2005.

> summary(da)

               Df  Sum Sq Mean Sq F value  Pr(>F)
LIMIT_BAL       1   53381   53381  641.79 < 2e-16 ***
BILL_AMT1       1     619     619    7.44 0.00638 **
Residuals   29997 2495008      83

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For LIMIT_BAL, F(1, 29997) = 641.79, p < 2e-16; for BILL_AMT1, F(1, 29997) = 7.44, p = 0.00638. Given that both p-values are smaller than alpha = 0.01, we reject the null hypothesis for each term. Hence, we deduce that age differs significantly with both the balance limit and the amount of the bill statement in September 2005.
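Each F value in the ANOVA table is the term's mean square divided by the residual mean square. The sketch below reproduces the two F values and the second p-value from the printed sums of squares; because the table shows rounded numbers, agreement is approximate. The variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_limit <- 53381    # LIMIT_BAL, 1 df
ss_bill  <- 619      # BILL_AMT1, 1 df
ss_resid <- 2495008
df_resid <- 29997

# Residual mean square (printed as 83 after rounding).
ms_resid <- ss_resid / df_resid

# With 1 df per term, the mean square equals the sum of squares.
f_limit <- ss_limit / ms_resid
f_bill  <- ss_bill / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_bill <- pf(f_bill, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(c(f_limit = f_limit, f_bill = f_bill), 2))
print(signif(p_bill, 3))
```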

Chi-square test

We conducted a Chi-square test to establish the association between education level and marital status.

Pearson's Chi-squared test

data: My$EDUCATION and My$MARRIAGE

X-squared = 1187.6, df = 18, p-value < 2.2e-16

The test statistic is X-squared = 1187.6 (df = 18), with p-value < 2.2e-16. There is a significant association between education level and marital status since the p-value is smaller than alpha = 0.05.
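The 18 degrees of freedom follow from the size of the contingency table: (rows - 1) x (columns - 1). The sketch below assumes that every integer code between the minimum and maximum shown in the summary table occurs at least once (EDUCATION 0 to 6, MARRIAGE 0 to 3); that assumption is not stated in the text but is consistent with the reported df.

```r
# Number of categories, assuming all integer codes in the observed
# ranges are present (EDUCATION: 0..6, MARRIAGE: 0..3).
n_education <- length(0:6)
n_marriage  <- length(0:3)

# Degrees of freedom for a test of independence.
df <- (n_education - 1) * (n_marriage - 1)

# Upper-tail probability for the reported statistic.
p_value <- pchisq(1187.6, df = df, lower.tail = FALSE)

print(df)
print(p_value < 2.2e-16)
```

Note that the data description lists fewer categories than the data contain (codes such as 0 are undocumented), which is why the table is larger than 4 x 3.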

Scatter plot

Correlation test

Pearson's product-moment correlation

data: My$AGE and My$BILL_AMT1

t = 9.7559, df = 29998, p-value < 2.2e-16

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.04495120 0.06751151

sample estimates:

cor

0.05623853

The correlation coefficient is 0.05623853, with p-value < 2.2e-16, which indicates a statistically significant but weak positive correlation between age and the amount of the bill statement in September 2005.
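The t value in the correlation output is a function of the correlation coefficient and the degrees of freedom, t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes it from the printed numbers and also shows the share of variance explained; the variable names are illustrative.

```r
# Values copied from the correlation test output above.
r  <- 0.05623853
df <- 29998

# Test statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Proportion of variance in one variable explained by the other.
r_squared <- r^2

print(round(t_stat, 4))
print(round(r_squared, 4))
```

With about 30000 observations even a very small correlation yields a large t value, so statistical significance here says little about the strength of the relationship.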

Regression Analysis

Simple regression

Call:

lm(formula = AGE ~ LIMIT_BAL)

Residuals:

Min 1Q Median 3Q Max

-15.904 -7.409 -1.904 5.769 40.713

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.376e+01 8.599e-02 392.65 <2e-16 ***

LIMIT_BAL 1.028e-05 4.059e-07 25.33 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.121 on 29998 degrees of freedom

Multiple R-squared: 0.02094, Adjusted R-squared: 0.02091

F-statistic: 641.6 on 1 and 29998 DF, p-value: < 2.2e-16
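To illustrate how the fitted line above is read, the following sketch plugs credit limits into the estimated equation AGE = 33.76 + 1.028e-05 x LIMIT_BAL. The function name `predict_age` is hypothetical, and the coefficients are the rounded values printed in the output.

```r
# Coefficients copied from the simple regression output above.
intercept <- 3.376e+01
slope     <- 1.028e-05

# Predicted age for a given credit limit (NT dollars).
predict_age <- function(limit_bal) intercept + slope * limit_bal

# Predictions at the minimum, mean and maximum credit limit
# reported in the summary table.
preds <- predict_age(c(10000, 167484, 1000000))
print(round(preds, 2))

# Change in predicted age per additional NT$100,000 of credit.
print(slope * 1e5)
```

At the mean credit limit the prediction is close to the mean age of 35.49, as expected for a least-squares line, and the R-squared of about 0.02 indicates that the credit limit explains only a small part of the variation in age.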

Multiple regression

Call:

lm(formula = AGE ~ LIMIT_BAL + BILL_AMT1)

Residuals:

Min 1Q Median 3Q Max

-16.724 -7.386 -1.908 5.774 40.033

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.371e+01 8.784e-02 383.825 < 2e-16 ***

LIMIT_BAL 9.951e-06 4.234e-07 23.501 < 2e-16 ***

BILL_AMT1 2.035e-06 7.461e-07 2.728 0.00638 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.12 on 29997 degrees of freedom

Multiple R-squared: 0.02118, Adjusted R-squared: 0.02112

F-statistic: 324.6 on 2 and 29997 DF, p-value: < 2.2e-16

Regression with Dummy variables

Call:

lm(formula = AGE ~ EDUCATION + SEX)

Residuals:

Min 1Q Median 3Q Max

-24.079 -7.033 -1.090 5.910 44.207

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 34.49551 0.21528 160.23 <2e-16 ***

EDUCATION 2.05725 0.06601 31.17 <2e-16 ***

SEX -1.75987 0.10666 -16.50 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.035 on 29997 degrees of freedom

Multiple R-squared: 0.03937, Adjusted R-squared: 0.0393

F-statistic: 614.6 on 2 and 29997 DF, p-value: < 2.2e-16

Regression with interaction

Call:

lm(formula = AGE ~ EDUCATION * LIMIT_BAL + SEX)

Residuals:

Min 1Q Median 3Q Max

-28.416 -6.750 -1.623 5.513 40.536

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.907e+01 2.855e-01 101.84 <2e-16 ***

EDUCATION 3.744e+00 1.090e-01 34.35 <2e-16 ***

LIMIT_BAL 2.618e-05 9.846e-07 26.59 <2e-16 ***

SEX -1.800e+00 1.044e-01 -17.23 <2e-16 ***

EDUCATION:LIMIT_BAL -7.024e-06 5.127e-07 -13.70 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.836 on 29995 degrees of freedom

Multiple R-squared: 0.08136, Adjusted R-squared: 0.08124

F-statistic: 664.1 on 4 and 29995 DF, p-value: < 2.2e-16
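With an interaction term, the slope of LIMIT_BAL is no longer a single number: it depends on the education code. The sketch below computes that slope for the four documented education levels from the printed coefficients. It treats EDUCATION as numeric, as the fitted model does; the function name `slope_by_education` is hypothetical.

```r
# Coefficients copied from the interaction model output above.
b_limit <- 2.618e-05    # LIMIT_BAL
b_inter <- -7.024e-06   # EDUCATION:LIMIT_BAL

# Slope of LIMIT_BAL at a given education code:
# d(AGE)/d(LIMIT_BAL) = b_limit + b_inter * EDUCATION
slope_by_education <- function(education) b_limit + b_inter * education

slopes <- slope_by_education(1:4)
names(slopes) <- c("graduate school", "university", "high school", "others")

# Expressed as years of age per additional NT$100,000 of credit.
print(round(slopes * 1e5, 3))
```

The negative interaction coefficient means the positive association between credit limit and age weakens as the education code increases.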

Logistic regression when output is dichotomous

Call:

glm(formula = default.payment.next.month ~ AGE + LIMIT_BAL +

SEX + EDUCATION + BILL_AMT1 + MARRIAGE, family = "binomial",

data = My)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.0157 -0.7683 -0.6471 -0.4215 2.6574

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.777e-01 1.111e-01 -2.498 0.012480 *

AGE 5.712e-03 1.679e-03 3.402 0.000668 ***

LIMIT_BAL -3.755e-06 1.415e-07 -26.545 < 2e-16 ***

SEX -1.602e-01 2.888e-02 -5.546 2.92e-08 ***

EDUCATION -7.063e-02 1.931e-02 -3.658 0.000254 ***

BILL_AMT1 1.177e-06 2.221e-07 5.297 1.18e-07 ***

MARRIAGE -1.843e-01 2.993e-02 -6.158 7.36e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 31705 on 29999 degrees of freedom

Residual deviance: 30792 on 29993 degrees of freedom

AIC: 30806

Number of Fisher Scoring iterations: 4
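The logistic coefficients above are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does that with the printed estimates and also checks the reported AIC, which for a 0/1 response equals the residual deviance plus twice the number of estimated parameters. Variable names are illustrative.

```r
# Slope estimates copied from the logistic regression output above.
coefs <- c(AGE       =  5.712e-03,
           LIMIT_BAL = -3.755e-06,
           SEX       = -1.602e-01,
           EDUCATION = -7.063e-02,
           BILL_AMT1 =  1.177e-06,
           MARRIAGE  = -1.843e-01)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# AIC check: residual deviance + 2 * (slopes + intercept).
residual_deviance <- 30792
n_params <- length(coefs) + 1
aic <- residual_deviance + 2 * n_params
print(aic)
```

An odds ratio below 1 (for example for SEX, coded 1 = male and 2 = female) indicates lower odds of default as the code increases, holding the other predictors fixed.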

Applying F-test or AIC to find the best model

We will use AIC to find the best model; the best model is the one with the lowest AIC value. A backward stepwise analysis is conducted by removing variables from the model one at a time and comparing the AIC values (Schabenberger & Gotway, 2017). We start with model 1, the full logistic regression shown above. For model 2, we remove one variable, which increases the AIC. For model 3, we remove a second variable, and the AIC rises again. The AIC keeps increasing as further variables are removed. Therefore, model 1 is the best model for the study.
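The backward comparison described above can be sketched in R as follows. This example runs on simulated data, not the credit card data: the data frame `sim`, its columns `x1` to `x3` and `y`, and the coefficients used to generate it are all invented for illustration. It fits a full model, drops one variable at a time, compares the AIC values, and then shows that `step()` automates the same search.

```r
# Simulated data, invented for illustration only.
set.seed(1)
n   <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 0.8 * sim$x1 - 0.6 * sim$x2))

# Model 1: full model. Models 2 and 3: one variable removed at a time.
full <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)
m2   <- update(full, . ~ . - x3)
m3   <- update(m2, . ~ . - x2)

# Side-by-side AIC comparison; the lowest value is preferred.
aic_table <- AIC(full, m2, m3)
print(aic_table)

# step() performs the backward elimination automatically.
best <- step(full, direction = "backward", trace = 0)
print(formula(best))
```

Applied to the study, the full model would be the logistic regression fitted above, and the reduced models would be compared in the same way.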

References

Schabenberger, O., & Gotway, C. A. (2017). Statistical methods for spatial data analysis. CRC Press.

R-codes

> My=read.csv(file.choose(), header = T)

> attach(My)

> names(My)

> hist(LIMIT_BAL)

> hist(AGE)

> hist(AGE,col = "red")

> hist(BILL_AMT1)

> library(ggplot2)

> ggplot(My) + geom_bar(aes(x = SEX))

> ggplot(My) + geom_bar(aes(x = EDUCATION))

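> L=c("married","single","others")

> A=table(factor(My$MARRIAGE, levels = 1:3, labels = L))  # undocumented code 0 is dropped so labels match slices

> pie(A, labels = L)

> D=c("NO","YES")  # table() orders the codes 0, 1, so "NO" comes first

> G=table(My$default.payment.next.month)

> pie(G, labels = D)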

> summary(My)

> t.test(AGE,LIMIT_BAL, paired = TRUE, alternative = "two.sided")

> da=aov(AGE~LIMIT_BAL+BILL_AMT1,data=My)

> summary(da)

> chisq.test(My$EDUCATION,My$MARRIAGE)

> plot(AGE,BILL_AMT1)

> cor.test(My$AGE,My$BILL_AMT1)

> fit=lm(AGE~LIMIT_BAL)

> summary(fit)

> fit1=lm(AGE~LIMIT_BAL+BILL_AMT1)

> summary(fit1)

> fit2=lm(AGE~EDUCATION+SEX)

> summary(fit2)

> fit3=lm(AGE~EDUCATION*LIMIT_BAL+SEX)

> summary(fit3)

> fit4=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1+MARRIAGE,data = My,family = "binomial")

> summary(fit4)

> fit5=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1+MARRIAGE,data = My,family = "binomial")

> summary(fit5)

> fit51=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION+BILL_AMT1,data = My,family = "binomial")

> summary(fit51)

> fit61=glm(default.payment.next.month~AGE+LIMIT_BAL+SEX+EDUCATION,data = My,family = "binomial")

> summary(fit61)

> fit611=glm(default.payment.next.month~AGE+LIMIT_BAL,data = My,family = "binomial")

> summary(fit611)
