3. Tabular Data Modeling (Deep Learning)

Juni_DEV

Artificial Intelligence

junni :p 2022. 7. 18. 14:22

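Deep learning model workflow

Import libraries
Load the data
Explore the data
Preprocess the data
Split into train and test sets
Normalize the data
Build the deep learning model
Evaluate model performance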

1. Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Load the data

df = pd.read_csv('data.csv') # read data.csv into the DataFrame df

3. Explore the data

df.info() # index, column names, non-null counts, dtypes
df.head() # first 5 rows
df.tail() # last 5 rows
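
Beyond info(), head() and tail(), two quick checks are often useful before preprocessing: how many values are missing per column, and how balanced the target classes are. A minimal sketch on a small made-up DataFrame (the column names age, city and target are hypothetical and do not come from data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 37],
    "city": pd.Series(["Seoul", "Busan", "Seoul", None, "Busan", "Seoul"], dtype=object),
    "target": [0, 1, 0, 0, 1, 0],
})

missing_per_column = df.isnull().sum()                   # number of missing values in each column
class_ratio = df["target"].value_counts(normalize=True)  # share of each class in the target

print(missing_per_column)
print(class_ratio)
```

A strongly skewed class ratio is one reason to pass stratify=y when splitting the data later on.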

4. Preprocess the data

All values must be numeric, so object-type columns are converted to numbers.
Object columns are one-hot encoded with pandas' get_dummies function.

cols = df.select_dtypes('object').columns.values # collect the names of the object columns
df1 = pd.get_dummies(data=df, columns=cols) # one-hot encoding
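
To see what the one-hot encoding step produces, here is a small sketch on made-up data. The columns age and color are hypothetical, and dtype=int is an addition of mine so the new columns hold 0/1 instead of booleans (recent pandas versions return booleans by default):

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one object column
df = pd.DataFrame({
    "age": [25, 32, 41],
    "color": pd.Series(["red", "blue", "red"], dtype=object),
})

cols = df.select_dtypes("object").columns.values        # names of the object columns, here only 'color'
df1 = pd.get_dummies(data=df, columns=cols, dtype=int)  # one new 0/1 column per category

print(df1.columns.tolist())
print(df1)
```

Each category becomes its own column named "<original column>_<category>", and the original object column is dropped.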

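5. Split into train and test sets

from sklearn.model_selection import train_test_split

X = df1.drop(columns='컬럼1') # features: every column except the target ('컬럼1' is a placeholder name)
y = df1['컬럼1'].values # target to predict

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, # 30% of the rows go to the test set
                                                    stratify=y, # keep the class ratio of y in both sets
                                                    random_state=42) # fixed seed so repeated runs give the same split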
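
The effect of stratify=y is easiest to see on imbalanced data. The sketch below uses made-up arrays (80 samples of class 0, 20 of class 1) and checks that both splits keep the 20% share of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

print(len(y_train), len(y_test))      # sizes of the two sets
print(y_train.mean(), y_test.mean())  # share of class 1 in each set
```

Without stratify, the share of the minority class in the test set depends on the random draw, which can distort the evaluation on small datasets.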

6. Normalize the data

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
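
The scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into preprocessing. A small sketch with made-up numbers shows what MinMaxScaler computes and why scaled test values can land outside the [0, 1] range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])  # 40 is above the training maximum

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=10 and max=30 from the training set
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

# The same transform written out by hand: (x - min) / (max - min)
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
manual_test = (X_test - x_min) / (x_max - x_min)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```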

7. Build the deep learning model

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

tf.random.set_seed(100) # fix the seed used by TensorFlow's random operations

# Build the model
model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(39,))) # 39 = number of input features; must equal X_train.shape[1]
model.add(Dropout(0.3))
model.add(Dense(3, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid')) # binary classification

# Inspect the model
model.summary()

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy', # binary classification
              metrics=['accuracy'])

# Train
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=10)
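
As a sanity check on model.summary(), the parameter counts of the Dense layers can be worked out by hand: each layer has inputs × units weights plus one bias per unit, and Dropout adds no parameters. The helper dense_params below is a hypothetical name used only for this sketch:

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Layer widths of the binary model above: 39 inputs -> 4 -> 3 -> 1
widths = [39, 4, 3, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths[:-1], widths[1:])]
total = sum(per_layer)

print(per_layer)  # [160, 15, 4]
print(total)      # 179
```

If the summary reports different numbers, the most likely cause is that input_shape does not match the number of columns in X_train.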

+ Multi-class classification model

model = Sequential()
model.add(Dense(5, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax')) # multi-class: one output unit per class

# Inspect the model
model.summary()

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', # multi-class with integer labels
              metrics=['accuracy'])

# Train
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=16)
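
To make the softmax output and the sparse_categorical_crossentropy loss less of a black box, here is a simplified NumPy sketch of what they compute. It is not Keras's implementation, and the logits and labels are made up. The key point is that the labels are integer class indices rather than one-hot vectors:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(probability assigned to the true class); y_true holds integer labels."""
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(np.clip(picked, eps, 1.0)).mean())

# Hypothetical raw outputs (logits) for 3 samples and 2 classes
logits = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_true = np.array([0, 1, 0])  # integer labels, not one-hot vectors

probs = softmax(logits)
loss = sparse_categorical_crossentropy(y_true, probs)

print(probs.sum(axis=1))  # every row sums to 1
print(loss)
```

If the labels were one-hot encoded instead, the matching Keras loss would be categorical_crossentropy.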

Callbacks: early stopping and model checkpointing

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Monitor val_loss and stop early if it has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', mode='min',
                           verbose=1, patience=5)

# Save the model every time val_loss reaches a new minimum
check_point = ModelCheckpoint('best_model.h5',
                              monitor='val_loss', mode='min', save_best_only=True, verbose=1)

# Train
history = model.fit(x=X_train, y=y_train,
          epochs=50, batch_size=20,
          validation_data=(X_test, y_test), verbose=1,
          callbacks=[early_stop, check_point])
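
The patience logic can be sketched in plain Python. This is a simplified model of the idea, not Keras's actual code (it ignores options such as min_delta and restore_best_weights), and the function name and val_loss values are hypothetical:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best val_loss), both 0-based."""
    best = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stops after epoch 2
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.5]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stopped_at, best_epoch)  # 7 2
```

In this example training stops at epoch 7, so the better value at epoch 8 is never reached. The best epoch (2) is the one a checkpoint with save_best_only=True would have saved last.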

8. Evaluate model performance

result = pd.DataFrame(history.history)
result.head()
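
history.history is a dict with one list per metric and one value per epoch, which is why it converts cleanly into a DataFrame. The sketch below uses made-up numbers in place of a real training run to show how the epoch with the lowest val_loss can be looked up:

```python
import pandas as pd

# Hypothetical stand-in for history.history (one value per epoch)
fake_history = {
    "loss": [0.70, 0.55, 0.48, 0.44, 0.41],
    "val_loss": [0.68, 0.57, 0.52, 0.53, 0.55],
    "accuracy": [0.55, 0.70, 0.76, 0.79, 0.81],
    "val_accuracy": [0.58, 0.69, 0.73, 0.72, 0.71],
}
result = pd.DataFrame(fake_history)

best_epoch = int(result["val_loss"].idxmin())  # 0-based row index of the lowest val_loss
print(best_epoch, result.loc[best_epoch, "val_loss"])
```

A training loss that keeps falling while val_loss rises, as in these made-up numbers, is the usual sign of overfitting.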

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

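pred = model.predict(X_test) # predicted probabilities
y_pred = np.argmax(pred, axis=1) # softmax output: pick the most probable class
# For the sigmoid (binary) model, use instead: y_pred = (pred > 0.5).astype(int).ravel()

# Accuracy
accuracy_score(y_test, y_pred)
# Precision
precision_score(y_test, y_pred)
# Recall
recall_score(y_test, y_pred)
# F1 score
f1_score(y_test, y_pred)

# All metrics at once
print(classification_report(y_test, y_pred))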
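
How the probabilities become class labels depends on the output layer. A softmax output of shape (n, classes) is decoded with argmax. A sigmoid output of shape (n, 1) needs a threshold, because argmax over a single column always returns 0. The sketch below uses made-up predictions and also computes the metrics by hand from the confusion counts:

```python
import numpy as np

# Hypothetical model outputs for 8 test samples
sigmoid_out = np.array([[0.9], [0.2], [0.4], [0.8], [0.1], [0.7], [0.6], [0.3]])  # shape (n, 1)
softmax_out = np.hstack([1 - sigmoid_out, sigmoid_out])                           # shape (n, 2)
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])

y_pred_sigmoid = (sigmoid_out > 0.5).astype(int).ravel()  # threshold a single probability
y_pred_softmax = np.argmax(softmax_out, axis=1)           # pick the most probable class
wrong_for_sigmoid = np.argmax(sigmoid_out, axis=1)        # all zeros: only one column to choose from

y_pred = y_pred_softmax
tp = int(((y_pred == 1) & (y_test == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_test == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_test == 1)).sum())  # false negatives

accuracy = float((y_pred == y_test).mean())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```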

+ Visualizing the results

result[['loss', 'val_loss']].plot()
result[['loss', 'val_loss', 'accuracy', 'val_accuracy']].plot()

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()
