Python Machine Learning

Deep Learning: Predicting Heart Disease with an Artificial Neural Network

삐멜 2019. 4. 28. 15:43

In this post, we will write machine learning code that predicts heart disease using Kaggle's heart disease data and an artificial neural network. For convenience, the post is written in the form of a Jupyter notebook. The dataset used is Kaggle's Heart Disease UCI. An advantage of this dataset is that every field is numeric and there are no None values, so no data sanitization/cleaning was needed. The downside is that the dataset is so small that it is hard to attach much significance to the prediction model. Still, it seemed like a good dataset for practicing artificial neural networks, so I gave it a try.



Checking the data

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
['heart.csv']

Load 'heart.csv' into memory with the pandas library

In [2]:
dataset_path = '../input/heart.csv'
dataset = pd.read_csv(dataset_path)
print(dataset.head())
   age  sex  cp  trestbps  chol   ...    oldpeak  slope  ca  thal  target
0   63    1   3       145   233   ...        2.3      0   0     1       1
1   37    1   2       130   250   ...        3.5      0   0     2       1
2   41    0   1       130   204   ...        1.4      2   0     2       1
3   56    1   1       120   236   ...        0.8      2   0     2       1
4   57    0   0       120   354   ...        0.6      2   0     2       1

[5 rows x 14 columns]
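As a quick sanity check on the claim above that no cleaning is needed, pandas can confirm the dataset has no missing values and only numeric columns. This is a minimal sketch added for illustration, not a cell from the original notebook:

# Sanity check (not in the original notebook): no missing values,
# and every column is numeric.
print(dataset.isnull().sum())  # expect 0 for each of the 14 columns
print(dataset.dtypes)          # expect only int64 / float64 dtypes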

Splitting X and y

Since every value is binary or numeric, no categorization is needed. Split the dataset into X (features) and y (target).

In [3]:
X_len = len(dataset.columns) - 1  # number of feature columns (all but the last)
y_len = len(dataset.columns) - 1  # index of the target column ('target')
X = dataset.iloc[:, 0:X_len]      # features: columns 0..12
y = dataset.iloc[:, y_len]        # target: the 14th column

Split the dataset into a training set and a test set at an 8 : 2 ratio.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
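With only 303 rows in total (242 for training and 61 for testing, as the training log below shows), the class balance of a purely random split can drift between runs. One hedged variant, my own suggestion rather than what the notebook does, is to stratify the split on y:

# Hypothetical variant (not in the original notebook): keep the same
# class proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)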

Feature Scaling

In [5]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/opt/conda/lib/python3.6/site-packages/sklearn/base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:4: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  after removing the cwd from sys.path.
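The DataConversionWarning messages above are harmless: StandardScaler simply cast the mixed int64/float64 columns to float64. If you want to silence them, one possible fix (an assumption on my part, not in the original notebook) is to cast the features before splitting:

# Optional (not in the original notebook): cast features to float64
# up front, before train_test_split, so StandardScaler has nothing to convert.
X = X.astype('float64')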

Building the artificial neural network

The network has three layers. The number of units, the batch size, and the number of epochs were chosen by experimenting and keeping whatever performed best.

In [6]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 25, kernel_initializer = 'uniform', activation = 'relu', input_dim = 13))
# Adding the second hidden layer
classifier.add(Dense(units = 15, kernel_initializer = 'uniform', activation = 'relu'))
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 40, epochs = 200)
Using TensorFlow backend.

WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating: Colocations handled automatically by placer.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.cast instead.
Epoch 1/200
242/242 [==============================] - 1s 3ms/step - loss: 0.6928 - acc: 0.6405
Epoch 2/200
242/242 [==============================] - 0s 44us/step - loss: 0.6921 - acc: 0.7810
Epoch 3/200
242/242 [==============================] - 0s 44us/step - loss: 0.6909 - acc: 0.8264
Epoch 4/200
242/242 [==============================] - 0s 42us/step - loss: 0.6889 - acc: 0.8182
Epoch 5/200
242/242 [==============================] - 0s 43us/step - loss: 0.6857 - acc: 0.8388
Epoch 6/200
242/242 [==============================] - 0s 42us/step - loss: 0.6806 - acc: 0.8306
... (omitted)
Epoch 196/200
242/242 [==============================] - 0s 45us/step - loss: 0.2321 - acc: 0.9132
Epoch 197/200
242/242 [==============================] - 0s 41us/step - loss: 0.2310 - acc: 0.9132
Epoch 198/200
242/242 [==============================] - 0s 47us/step - loss: 0.2300 - acc: 0.9132
Epoch 199/200
242/242 [==============================] - 0s 44us/step - loss: 0.2291 - acc: 0.9132
Epoch 200/200
242/242 [==============================] - 0s 47us/step - loss: 0.2282 - acc: 0.9174

Out[6]:
<keras.callbacks.History at 0x7f5dbd208f28>
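Training for a fixed 200 epochs on 242 samples makes overfitting a real risk. As a sketch of one common safeguard (my addition, assuming Keras 2.2+, not the original notebook's method), an EarlyStopping callback with a validation split stops training once validation loss stops improving:

# Hypothetical alternative fit (not in the original notebook): hold out
# 20% of the training data for validation and stop early when val_loss
# has not improved for 20 epochs.
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)
classifier.fit(X_train, y_train, batch_size=40, epochs=200,
               validation_split=0.2, callbacks=[early_stop])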

Predicting y with the model

In [7]:
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
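The call to predict returns a (61, 1) array of sigmoid probabilities, and comparing with 0.5 turns it into boolean labels. Flattening it into a 1-D 0/1 vector is an optional extra step (my assumption, not in the original notebook) that keeps the metrics code below tidy:

# Optional (not in the original notebook): flatten the boolean column
# vector into a 1-D array of 0/1 class labels.
y_pred = y_pred.ravel().astype(int)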

Confusion Matrix

Check how much the predictions differ from the actual y.

In [8]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
loss, accuracy = classifier.evaluate(X_test, y_test)
61/61 [==============================] - 0s 678us/step

Checking the confusion matrix, loss, and accuracy

In [9]:
print('Loss : ', loss)
print('Accuracy : ', accuracy)
print('Confusion Matrix : \n', cm)

Loss :  0.3430554387999363
Accuracy :  0.9016393550106736
Confusion Matrix : 
 [[23  4]
 [ 2 32]]


The accuracy comes out around 90%, which is quite high, and the confusion matrix also looks reasonable. In scikit-learn's convention, rows are actual classes and columns are predicted classes, so the top-left entry (23) counts true negatives and the bottom-right entry (32) counts true positives; the larger these two diagonal values are, the more accurate the predictions. The off-diagonal entries are the errors: 4 false positives and 2 false negatives.
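For reference, here is a minimal follow-up sketch (not a cell from the original notebook) that unpacks the matrix and computes sensitivity and specificity, the two numbers that matter most for a medical screening task:

# Hypothetical follow-up (not in the original notebook): unpack the
# confusion matrix and derive per-class recall.
tn, fp, fn, tp = cm.ravel()        # 23, 4, 2, 32
sensitivity = tp / (tp + fn)       # recall on patients with disease: ~0.94
specificity = tn / (tn + fp)       # recall on healthy patients: ~0.85
print('Sensitivity : ', sensitivity)
print('Specificity : ', specificity)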