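Predicting Interview Attendance with a Deep Learning Artificial Neural Network (ANN)

Python Machine Learning 2019. 2. 13. 15:59

Last year I took an online deep learning course, and I wanted to try a small project on my own, so I downloaded a dataset from Kaggle and trained an artificial neural network on it. The data source is Kaggle::The Interview Attendance Problem; you can download it by clicking the download button next to New Kernel. This post is not a tutorial, so please treat it only as a reference, and leave a comment if you have questions. I don't know the theory well enough to write this up as a lecture-style post.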

Libraries and Tools

TensorFlow
Keras
Jupyter
Python 2.7

References

Sequential (Keras documentation)
relu activation function (relu on Wikipedia)
sigmoid activation function (sigmoid on Wikipedia)
Keras Optimizer (Keras Optimizer documentation)
Confusion Matrix (Confusion Matrix on Wikipedia)

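Experiments

Layer configuration: [Input layer units = 20, 1st hidden layer units = 30, 2nd hidden layer units = 20, Output layer units = 1]

Experiment 1: With the location data removed, the model reached 79% accuracy during training and 63% accuracy on the test set.
Experiment 2: With Interview_Venue, Follow-Up_Call_OK, and Venue_Clear excluded, accuracy was 79% in training and 64.8% on the test set.
Experiment 3: On training data without the location data, accuracy was 82% in training and 62% on the test set.
Experiment 4: With only Date and Candidate ID excluded and epochs set to 500, accuracy was 82% on the training set and 65% on the test set.
Experiment 5: Under the same conditions as Experiment 4 but with epochs raised to 1000, accuracy was 85% on the training set and 61% on the test set.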
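
The feature-removal experiments above amount to dropping a set of columns before encoding. The following is only a minimal sketch, assuming the renamed column names that appear in the experiment notes (Interview_Venue, Follow-Up_Call_OK, Venue_Clear). The helper drop_features and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def drop_features(df, columns):
    """Return a copy of df without the given columns (unknown names are ignored)."""
    present = [c for c in columns if c in df.columns]
    return df.drop(columns=present)


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Interview_Venue': ['Hosur', 'Chennai'],
    'Follow-Up_Call_OK': ['Yes', 'No'],
    'Venue_Clear': ['Yes', 'Yes'],
    'Gender': ['Male', 'Female'],
})

# Hypothetical equivalent of the column set excluded in Experiment 2
reduced = drop_features(toy, ['Interview_Venue', 'Follow-Up_Call_OK', 'Venue_Clear'])
print(list(reduced.columns))
```

Returning a copy keeps the original frame intact, so several column subsets can be tried from the same loaded data.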

Layer configuration: [Input layer units = 20, 1st hidden layer units = 30, 2nd hidden layer units = 40, 3rd hidden layer units = 20, Output layer units = 1]

Experiment 6: With epoch=1000, accuracy was 83% on the training set and 65.9% on the test set.
Experiment 7: With epoch=500, accuracy was 83% on the training set and 68% on the test set.
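
One way to read the seven results is to look at the gap between training and test accuracy, since a wide gap is a common sign of overfitting. The sketch below only re-tabulates the numbers reported above; the variable names are my own.

```python
# (train accuracy %, test accuracy %) as reported in the experiment notes
results = {
    1: (79, 63),
    2: (79, 64.8),
    3: (82, 62),
    4: (82, 65),
    5: (85, 61),
    6: (83, 65.9),
    7: (83, 68),
}

# Train/test gap in percentage points, rounded to one decimal place
gaps = {k: round(train - test, 1) for k, (train, test) in results.items()}

for k in sorted(gaps):
    print('Experiment {}: gap = {} points'.format(k, gaps[k]))

widest = max(gaps, key=gaps.get)
print('Widest gap: Experiment {}'.format(widest))
```

By this measure the 1000-epoch run in Experiment 5 has the widest gap, which fits the observation that training longer raised training accuracy while test accuracy dropped.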

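Kaggle kernels for this dataset report test-set accuracy of roughly 62~75%, so I think 68% is not a bad result. It may also be that this problem is not well suited to an artificial neural network.

Predicting Interview Attendance

Let's take a rough look at the data.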

Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type, Name(Cand ID),Gender,Candidate Current Location,Candidate Job Location,Interview Venue, Candidate Native location,Have you obtained the necessary permission to start at the required time, Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark., Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status,,,,,
13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1, Male,Chennai,Hosur,Hosur,Hosur,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,No,Single

The columns look a little odd. To check them exactly, see Kaggle:The Interview Attendance Problem.

# Importing the libraries
import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('interview.csv')
print(dataset.head())

# remove unnecessary columns
dataset.pop('Date of Interview')
dataset.pop('Name(Cand ID)')

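# whitespace/lower/uppercase fix
dataset = dataset.replace('Scheduled Walkin', 'Scheduled Walk In')
# drop the row at position 1233
dataset = dataset.drop(dataset.index[1233])

First I imported numpy and pandas and read the training/test file with read_csv. Then I used pop to remove the columns that looked useless. After that, I cleaned up values that mean the same thing but differ only in upper/lower case and the like. (Pre-processing)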
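
The clean-up step described above handles one value by hand. A more general version might strip whitespace and lowercase every text column first, then merge the remaining spelling variants. This is a sketch under my own assumptions: normalize_text_columns is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def normalize_text_columns(df):
    """Strip surrounding whitespace and lowercase every text column."""
    out = df.copy()
    for col in out.columns:
        is_text = out[col].dtype == object or pd.api.types.is_string_dtype(out[col])
        if is_text:
            out[col] = out[col].str.strip().str.lower()
    return out


# Toy column with three spellings of the same category (made-up values)
toy = pd.DataFrame({
    'Interview Type': ['Scheduled Walkin', 'Scheduled Walk In', 'scheduled walkin '],
})

clean = normalize_text_columns(toy)
# Merge the remaining spelling variant into a single category
clean = clean.replace('scheduled walkin', 'scheduled walk in')
print(clean['Interview Type'].unique())
```

Normalizing before one-hot encoding matters because get_dummies would otherwise create a separate column for each spelling.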

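# Features: the first 21 columns, minus the target column
# (assumes the target column is named 'Observed_Attendance')
X = dataset.iloc[:, 0:21].copy()
X.pop('Observed_Attendance')
# One-hot encode every categorical column
X = pd.get_dummies(X, columns=list(X.columns), drop_first=False).values
# Target: 1 = attended, 0 = did not attend (any other spelling maps to NaN)
y = dataset.loc[:, 'Observed_Attendance'].map({'Yes': 1, 'No': 0, 'No ': 0}).values

Split the data into X and y to pass to the model. I used get_dummies to encode the categorical values, and sanitized the y column so that it contains only 1 and 0.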

X
array([[0, 0, 0, ..., 1, 0, 1],
       [0, 0, 0, ..., 1, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]], dtype=uint8)

y
array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

This confirms that the categorical values in X and y above have all been encoded.
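
The float dtype of the y array above hints that some labels may not have matched the mapping: pandas map returns NaN for any value that is not a key in the dictionary, and a single NaN forces a float result. Casting NaN to int afterwards gives unpredictable values, so it seems worth checking for unmapped labels before training. The sketch below is an assumption-laden illustration; encode_attendance is a hypothetical helper and the label values are made up.

```python
import pandas as pd


def encode_attendance(labels):
    """Map yes/no labels to 1/0 after trimming whitespace and lowercasing."""
    normalized = labels.str.strip().str.lower()
    encoded = normalized.map({'yes': 1, 'no': 0})
    unmapped = labels[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError('Unmapped labels: {}'.format(unmapped))
    return encoded.astype(int).values


# Made-up labels with mixed case and trailing whitespace
raw = pd.Series(['Yes', 'No', 'No ', 'yes', 'NO'])

# Exact-match mapping as in the post: unmatched spellings become NaN
strict = raw.map({'Yes': 1, 'No': 0, 'No ': 0})
print(strict.isna().sum())

# Normalized mapping: every label resolves to 0 or 1
print(encode_attendance(raw))
```

Raising on unmapped labels makes the problem visible at encoding time instead of surfacing later as strange training metrics.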

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
y_train = y_train.astype(int)
y_test = y_test.astype(int)

Next I split the data into training and test sets at a 7:3 ratio. I chose 7:3 because there is not much data.
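
To see what a 7:3 split means in row counts, here is a small sketch on placeholder arrays. The row count of 1233 is my assumption, chosen because it is consistent with the 863 training samples that appear in the first epochs of the training log; the arrays themselves are dummies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1233                    # assumed number of rows after clean-up
X_demo = np.zeros((n_rows, 3))   # placeholder features
y_demo = np.arange(n_rows) % 2   # placeholder 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

print(len(X_tr), len(X_te))
```

Fixing random_state makes the split reproducible, so accuracy differences between experiments come from the model or feature changes and not from a different shuffle.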

# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.optimizers import Adam
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units=20, kernel_initializer='uniform', activation='relu', input_dim=159))

# Adding the second hidden layer
classifier.add(Dense(units=30, kernel_initializer='uniform', activation='relu'))

# Adding the third hidden layer
classifier.add(Dense(units=30, kernel_initializer='uniform', activation='relu'))

# Adding the fourth hidden layer
classifier.add(Dense(units=20, kernel_initializer='uniform', activation='relu'))

# Adding the output layer
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
# Compiling the ANN
classifier.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, epochs=500, batch_size=25)

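I loaded the Keras library and added each neural network layer to the Sequential classifier. The hidden layers use relu as the activation function and the last layer uses sigmoid; at compile time the loss is set to binary cross-entropy.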
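
To make the activation and loss choices concrete, here is a plain NumPy sketch of relu, sigmoid, and binary cross-entropy. It is an illustration of the standard formulas, not Keras's internal implementation, and the input values are made up. Note that with labels of exactly 0 or 1 this loss can never be negative, so a negative loss in a training log suggests that the labels contain something other than 0 and 1.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: max(0, z) element-wise."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


logits = np.array([-2.0, 0.0, 3.0])   # made-up pre-activation values
labels = np.array([0, 1, 1])          # made-up 0/1 targets

probs = sigmoid(logits)
print(relu(logits))
print(probs)
print(binary_crossentropy(labels, probs))
```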

Epoch 1/500
863/863 [==============================] - 0s - loss: 637716903367269.0000 - acc: 0.6350
Epoch 2/500
863/863 [==============================] - 0s - loss: 491167284111941.5625 - acc: 0.6443     
Epoch 3/500
863/863 [==============================] - 0s - loss: 273787114647282.4375 - acc: 0.6512     
Epoch 4/500
863/863 [==============================] - 0s - loss: 112584550060568.9688 - acc: 0.7265
...
Epoch 496/500
986/986 [==============================] - 0s - loss: -150774026755377760.0000 - acc: 0.8266 
Epoch 497/500
986/986 [==============================] - 0s - loss: -150774026755377760.0000 - acc: 0.8316 
Epoch 498/500
986/986 [==============================] - 0s - loss: -150774026755377760.0000 - acc: 0.8316 
Epoch 499/500
986/986 [==============================] - 0s - loss: -150774026755377760.0000 - acc: 0.8306 
Epoch 500/500
986/986 [==============================] - 0s - loss: -150774026755377760.0000 - acc: 0.8296 
 32/247 [==>...........................] - ETA: 10s

With that, training finished at roughly 82% accuracy on the training set.

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

I ran the trained classifier on the test set. The predictions then have to be compared with the actual values, so I used a Confusion Matrix for the comparison.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
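
Test accuracy can be read directly off a confusion matrix as the correct predictions on the diagonal divided by the total. The sketch below uses made-up labels and predictions, not the results from this post.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0, 1])
y_pred_demo = np.array([1, 0, 0, 1, 1, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, scikit-learn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm_demo.ravel()
accuracy = (tp + tn) / float(cm_demo.sum())

print(cm_demo)
print('accuracy = {:.3f}'.format(accuracy))
```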

[Image: screenshot of the printed confusion matrix]

As in the experiments, it looks like sixty-something percent. On checking again, though, the training data seems to have changed a bit, so this…
