
1. Linear Regression CODE [0] Always Start with EDA

by EXUPERY 2021. 2. 2.

 

 

Always Start with EDA

Linear Regression CODE 

 

 


0. Data Description

Always check the data description first.
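A first look at the columns can be sketched as below. This is a minimal sketch, assuming the data is already loaded into a pandas DataFrame; the helper name `describe_columns` and the toy columns are hypothetical and do not come from the original post.

```python
import pandas as pd

def describe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "rooms": [2, 3, None, 4],
    "city": ["Seoul", "Busan", "Seoul", "Seoul"],
    "target": [10.0, 12.5, 9.0, 15.0],
})
summary = describe_columns(toy)
print(summary)
```

Reading this table before any plotting makes it easier to see which columns need imputation or encoding later.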

 

1. Profiling

pip install -U pandas-profiling

from pandas_profiling import ProfileReport  # importing the package also adds df.profile_report()
df.profile_report()  # renders the report inline in a notebook

 

 

2. EDA

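## Correlation coefficients
df_cor = df.corr().copy()
# The 5 largest correlations with 'target' (the first entry is 'target' itself)
print(df_cor.sort_values('target',ascending=False).target.head(5))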
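For reference, `df.corr()` uses the Pearson coefficient by default. The following is a hand-written sketch of that formula in plain Python, only to show what each cell of `df_cor` measures; the function name `pearson` and the toy lists are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance without the 1/n factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: one perfectly increasing and one perfectly decreasing relation
feature = [1, 2, 3, 4, 5]
target_up = [2, 4, 6, 8, 10]
target_down = [10, 8, 6, 4, 2]
r_up = pearson(feature, target_up)
r_down = pearson(feature, target_down)
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong linear relation with the target, which is why the snippets in this section sort by the `target` column.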

## Only Heatmap
import seaborn as sns
import matplotlib.pyplot as plt
df_cor = df.corr().copy()
fig, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation of features')
sns.heatmap(df_cor,cmap='gist_earth',linewidths=0.25, linecolor='k', annot=True)
plt.show()

## Only Bar (Seaborn)
plt.figure(figsize=(8,4))
cor_desc = df_cor.sort_values('target',ascending=False).target
sns.barplot(x=cor_desc.values, y=cor_desc.index, orient='h')  # recent seaborn versions require x= and y= as keywords
plt.title('Pearson Correlation(barh)')
plt.show()

## Heatmap& Bar(Mat)
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1,2,figsize=(10,5))
plt.subplots_adjust(wspace=0.5)
sns.heatmap(df_cor,ax=axes[0], cmap='Blues')
axes[0].set_title('Pearson Correlation(Heatmap)')
cor_asc = df_cor.sort_values('target',ascending=True).target
axes[1].barh(cor_asc.index, cor_asc)  # bar length = correlation with 'target'
axes[1].set_title('Pearson Correlation(barh)')
plt.show()

## Check for outliers with a scatter plot
plt.figure(figsize=(5,5))
sns.scatterplot(x='variable', y='target', data=df, color='red', alpha=0.5)  # replace 'variable' with the feature to inspect
plt.grid()
plt.show() 
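The scatter plot is a visual check. As a numeric companion, the common 1.5 × IQR rule can flag the same kind of points. This is a swapped-in technique rather than anything from the original post, and the name `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical toy target with one extreme value
toy_target = pd.Series([1, 2, 3, 4, 100])
outliers = iqr_outliers(toy_target)
print(outliers)
```

Whether a flagged point should be dropped still depends on the data description, so this sketch only narrows down which rows to inspect.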

## Pairplot
sns.pairplot(df)  # pairplot creates its own figure, so no plt.figure() call is needed
plt.show()

## Countplot
sns.countplot(x='age_5', hue='cardio', data=df_1, palette="Set2")

 

 

2. OneHotEncoding

! pip install category_encoders

from category_encoders import OneHotEncoder
encoder = OneHotEncoder(use_cat_names = True) # use_cat_names: use the category values in the new column names

df_OneHot = encoder.fit_transform(df) # fit & transform

print(df.shape)
print(df_OneHot.shape)
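To see why the two printed shapes differ, here is a minimal sketch on toy data. It uses `pandas.get_dummies` instead of `category_encoders`, so it is an alternative illustration and not the encoder used above; the toy columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({"color": ["red", "blue", "red"], "Price": [10, 20, 30]})

# Each distinct category value becomes its own indicator column
toy_onehot = pd.get_dummies(toy, columns=["color"])

print(toy.shape)
print(toy_onehot.shape)
print(list(toy_onehot.columns))
```

The row count stays the same while the column count grows by the number of categories minus one for each encoded column.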

 

3. train_test_split

from sklearn.model_selection import train_test_split
X = df_OneHot.drop(columns='Price').copy()
y = df_OneHot.Price
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.2 , random_state=1)
print(df_OneHot.shape, X_train.shape, X_test.shape)
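What the split does can be sketched by hand as a seeded shuffle followed by a cut. This is an illustration only, not scikit-learn's implementation, and it will not reproduce the same rows as `train_test_split` for a given `random_state`; the name `simple_train_test_split` is hypothetical.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=1):
    """Shuffle row indices with a fixed seed, then hold out a test_size fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = math.ceil(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

# Hypothetical toy rows
rows = list(range(10))
train, test = simple_train_test_split(rows, test_size=0.2, seed=1)
print(len(rows), len(train), len(test))
```

Fixing the seed plays the same role as `random_state=1` above: rerunning the code yields the same train and test rows, which keeps later model comparisons fair.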

 

 

 
