
Predicting Pulsars with Decision Trees and XGBoost


Predicting Pulsars with Decision Trees and XGBoost is a supervised machine learning classification project that uses Decision Trees and XGBoost to classify radio signals as either pulsars or noise.

HTRU2 is a dataset describing a sample of pulsar candidates collected during the High Time Resolution Universe Survey.

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.

As pulsars rotate, their emission beams sweep across the sky, producing a detectable pattern of broadband radio emission each time they cross our line of sight. Because pulsars rotate rapidly, this pattern repeats periodically. Searching for pulsars therefore involves looking for periodic radio signals with large radio telescopes.

Each pulsar produces a slightly different emission pattern, which also varies from one rotation to the next. A detection of a potential signal is known as a 'candidate' and is averaged over many rotations of the pulsar, as determined by the length of the observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find. Legitimate pulsar examples are therefore a minority positive class, and spurious examples the majority negative class.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).

Each candidate is described by 8 continuous variables, and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve.

* Mean of the integrated profile.
* Standard deviation of the integrated profile.
* Excess kurtosis of the integrated profile.
* Skewness of the integrated profile.
* Mean of the DM-SNR curve.
* Standard deviation of the DM-SNR curve.
* Excess kurtosis of the DM-SNR curve.
* Skewness of the DM-SNR curve.
* Class

HTRU 2 Summary:

* 17,898 total examples
        * 1,639 positive examples
        * 16,259 negative examples

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed the latest version of Jupyter Notebook
  • You have a Windows, Linux, or macOS machine.

Running the Predicting Pulsars with Decision Trees and XGBoost Project

To run this project locally, follow these steps:

In the command line/terminal:

git clone https://github.com/hakkeray/predicting-pulsars-with-decision-trees-and-xgboost
cd predicting-pulsars-with-decision-trees-and-xgboost
jupyter notebook

Summary

Pipeline

To begin the analysis, I used pipelines to determine the most accurate models for predicting a pulsar. After splitting the dataset into training and test sets, I ran Standard Scaling, PCA, and a classifier inside a pipeline for each of Logistic Regression, Support Vector Machines, Decision Trees, and XGBoost. All were fairly accurate, with Decision Trees and XGBoost topping the list for accuracy scores.

Decision Trees

Following this, I proceeded with a Decision Tree classifier with balanced class weights, which did fairly well, scoring 96% accuracy. However, because of the imbalanced classes, the F1 score is the more meaningful measure of model performance, and there the Decision Tree classifier scored 82%.

XGBoost

Moving on to XGBoost, I scored 98% accuracy with an 89% F1 score. The model successfully identified 466 pulsars, missing only 78 that it mistakenly classified as noise.

RECOMMENDATIONS

 * Focus on Kurtosis Integrated Profile

 * Focus on Standard Deviation DM-SNR Curve

 * Validate model predictions with analysis of other celestial objects producing cosmic rays to see if they show the same attributes.

FUTURE WORK

  1. Improve the model by trying other scaling approaches and balancing class weights.

  2. Look at stars right before they die and predict whether they will become pulsars (likely impractical, considering stars live for billions of years…).

Contact

If you want to contact me you can reach me at [email protected].

License

This project uses the following license: MIT License.

#         _ __ _   _
#  /\_/\ | '__| | | |
#  [===] | |  | |_| |
#   \./  |_|   \__,_|
#

Process Outline

* IMPORT PACKAGES + LIBRARIES

* OBTAIN DATA

* PRE-PROCESSING

* EDA + VISUALIZATIONS

* MODELING:
    * MODEL 1: DECISION TREES
    * MODEL 2: RANDOM FOREST
    * MODEL 3: XGBOOST
    * MODEL 4: GRIDSEARCH CV (ALL)
    
* INTERPRET RESULTS

* CONCLUSION + SUMMARY

* FUTURE WORK

Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.externals import joblib  # in newer scikit-learn: import joblib
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import (accuracy_score, roc_curve, auc,
                             average_precision_score, recall_score,
                             precision_score, f1_score,
                             classification_report, confusion_matrix)
from IPython.display import Image
from pydotplus import graph_from_dot_data
from xgboost import XGBClassifier

Dataset

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.562500 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | 0 |
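The notebook works with shortened column names (MEAN_IP, STD_IP, and so on, as shown in the tables below). A minimal loading sketch, assuming the Kaggle release of HTRU2 (the filename pulsar_stars.csv is an assumption; the column order matches the feature list above):

# load the CSV and shorten the column names used throughout the notebook
import pandas as pd
df = pd.read_csv('pulsar_stars.csv')  # hypothetical filename
df.columns = ['MEAN_IP', 'STD_IP', 'KURTOSIS_IP', 'SKEWNESS_IP',
              'MEAN_CURVE', 'STD_CURVE', 'KURTOSIS_CURVE', 'SKEWNESS_CURVE', 'TARGET']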

Descriptive Statistics

| | MEAN_IP | STD_IP | KURTOSIS_IP | SKEWNESS_IP | MEAN_CURVE | STD_CURVE | KURTOSIS_CURVE | SKEWNESS_CURVE | TARGET |
|---|---|---|---|---|---|---|---|---|---|
| count | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 |
| mean | 111.079968 | 46.549532 | 0.477857 | 1.770279 | 12.614400 | 26.326515 | 8.303556 | 104.857709 | 0.091574 |
| std | 25.652935 | 6.843189 | 1.064040 | 6.167913 | 29.472897 | 19.470572 | 4.506092 | 106.514540 | 0.288432 |
| min | 5.812500 | 24.772042 | -1.876011 | -1.791886 | 0.213211 | 7.370432 | -3.139270 | -1.976976 | 0.000000 |
| 25% | 100.929688 | 42.376018 | 0.027098 | -0.188572 | 1.923077 | 14.437332 | 5.781506 | 34.960504 | 0.000000 |
| 50% | 115.078125 | 46.947479 | 0.223240 | 0.198710 | 2.801839 | 18.461316 | 8.433515 | 83.064556 | 0.000000 |
| 75% | 127.085938 | 51.023202 | 0.473325 | 0.927783 | 5.464256 | 28.428104 | 10.702959 | 139.309331 | 0.000000 |
| max | 192.617188 | 98.778911 | 8.069522 | 68.101622 | 223.392140 | 110.642211 | 34.539844 | 1191.000837 | 1.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
MEAN_IP           17898 non-null float64
STD_IP            17898 non-null float64
KURTOSIS_IP       17898 non-null float64
SKEWNESS_IP       17898 non-null float64
MEAN_CURVE        17898 non-null float64
STD_CURVE         17898 non-null float64
KURTOSIS_CURVE    17898 non-null float64
SKEWNESS_CURVE    17898 non-null float64
TARGET            17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB

EXPLORE

Exploratory Data Analysis (EDA)

Comparing Attributes

hotmap()

def hotmap(df, figsize=(10,8)):
    ##### correlation heatmap
    corr = df.corr()
    fig, ax = plt.subplots(figsize=figsize)
    # mask the upper triangle (it duplicates the lower one)
    mask = np.zeros_like(corr, dtype=bool)
    idx = np.triu_indices_from(mask)
    mask[idx] = True

    sns.heatmap(np.abs(corr), square=True, mask=mask, annot=True,
                cmap=sns.color_palette("magma"), ax=ax, linewidth=2, edgecolor="k")
    ax.set_ylim(len(corr), -0.5)

    plt.title("CORRELATION BETWEEN VARIABLES")
    plt.show()

    ##### descriptive statistics heatmap
    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(df.describe()[1:].transpose(), annot=True, ax=ax,
                linecolor="w", linewidth=2, cmap=sns.color_palette("Set2"))
    ax.set_ylim(len(corr), -0.5)
    plt.title("Data summary")
    plt.show()

    plt.figure(figsize=(13,8))

    ##### compare proportion of target classes
    plt.subplot(121)
    ax = sns.countplot(y=df["TARGET"],
                       palette=["b","lime"],
                       linewidth=1,
                       edgecolor="k")
    for i,j in enumerate(df["TARGET"].value_counts().values):
        ax.text(.7, i, j, weight="bold", fontsize=27)
    plt.title("Count for target variable in dataset")

    plt.subplot(122)
    plt.pie(df["TARGET"].value_counts().values,
            labels=["not pulsars","pulsars"],
            autopct="%1.0f%%", wedgeprops={"linewidth":2,"edgecolor":"white"})
    circ = plt.Circle((0,0), .7, color="white")
    plt.gca().add_artist(circ)
    plt.subplots_adjust(wspace=.2)
    plt.title("Proportion of target variable in dataset")
    plt.show()

hotmap(df, figsize=(10,8))

[figures: correlation heatmap, data summary heatmap, and target class count/proportion plots]

Target class values are highly differentiated for the following features:

* Kurtosis Integrated Profile
* Skewness Integrated Profile

Other candidates include:

* Mean Curve
* Standard Deviation Curve
* Kurtosis Curve
* Skewness Curve

Least likely to be important in distinguishing pulsars and RFI include:

* Mean Integrated Profile
* Standard Deviation IP

Comparing Pulsar vs. Non-Pulsar: Mean and Standard Deviation

[GREEN == PULSAR , BLUE == NON-PULSAR]

[figure: per-feature mean and standard deviation, pulsar vs. non-pulsar]

The mean and standard deviation of the Skewness Curve are also good candidate predictors for our target class.

Additional EDA figures rendered in the notebook:

* Distribution of Target Class Variables
* Pairplot
* Scatterplot
* Boxplot
* Stackplot
* 3D Plot
* Jointplots
* Barplots

MODEL

Split Data

# create our feature set X and labels y:

y = df['TARGET'].copy()
X = df.drop(columns=['TARGET']).copy()
display(y.shape, X.shape)
(17898,)

(17898, 8)
# Hold out 33% of the dataset for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
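Given the heavy class imbalance (~9% positives), it is often safer to stratify the split so both sets preserve the class proportions. A minimal sketch, assuming a fixed random_state for reproducibility (the original notebook does neither):

# stratified 67/33 split that preserves the ~9% positive rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)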

Construct Pipelines

  1. Scale data using StandardScaler()
  2. Construct Pipelines
# logistic regression
pipe_lr = Pipeline([('scl', StandardScaler()),
                   ('pca', PCA(n_components=2)),
                   ('clf', LogisticRegression(class_weight='balanced'))])

# support vector
pipe_svm = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', svm.SVC(class_weight='balanced'))])


# decision tree
pipe_dt = Pipeline([('scl', StandardScaler()),
                   ('pca', PCA(n_components=2)),
                    ('clf', tree.DecisionTreeClassifier(class_weight='balanced'))])

# xgboost
# (note: XGBClassifier has no class_weight parameter; the keyword is ignored --
# scale_pos_weight is the XGBoost equivalent)
pipe_xgb = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', XGBClassifier(class_weight='balanced'))])
# List of pipelines for ease of iteration
pipelines = [pipe_lr, pipe_svm, pipe_dt, pipe_xgb]
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 
             1: 'Support Vector Machine', 
             2: 'Decision Tree', 
             3: 'XG Boost'}
# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)
# Compare accuracies
for idx, val in enumerate(pipelines):
    print('%s pipeline training accuracy: %.3f' % (pipe_dict[idx], val.score(X_train, y_train)))
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx], val.score(X_test, y_test)))
Logistic Regression pipeline training accuracy: 0.935
Logistic Regression pipeline test accuracy: 0.937
Support Vector Machine pipeline training accuracy: 0.955
Support Vector Machine pipeline test accuracy: 0.955
Decision Tree pipeline training accuracy: 1.000
Decision Tree pipeline test accuracy: 0.959
XG Boost pipeline training accuracy: 0.976
XG Boost pipeline test accuracy: 0.973
# Identify the most accurate model on test data
best_acc = 0.0
best_clf = 0
best_pipe = ''


for idx, val in enumerate(pipelines):
    if val.score(X_test, y_test) > best_acc:
        best_acc = val.score(X_test, y_test)
        best_pipe = val
        best_clf = idx
        
print('Classifier with best accuracy: %s' % pipe_dict[best_clf])

# Save pipeline to file
joblib.dump(best_pipe, 'best_pipeline.pkl', compress=1)
print('Saved %s pipeline to file' % pipe_dict[best_clf])
Classifier with best accuracy: XG Boost
Saved XG Boost pipeline to file
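The saved pipeline bundles the scaler, PCA, and classifier, so it can be reloaded and applied to raw feature rows in one step. A brief sketch (variable names are assumptions):

# reload the persisted pipeline and score new candidates
best_pipe = joblib.load('best_pipeline.pkl')
new_preds = best_pipe.predict(X_test)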

Decision Tree

Standardize

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

Create Instance

## Create an instance of decision tree classifier
dt_clf = DecisionTreeClassifier(criterion='entropy', class_weight='balanced')
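The entropy criterion measures class impurity at each split as H = -Σ p_i * log2(p_i) over the classes. As a quick worked example at the root node: with HTRU2's roughly 9.2% positive rate, H ≈ -(0.092 × log2 0.092 + 0.908 × log2 0.908) ≈ 0.44 bits, while a perfectly pure node scores 0.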

Fit

# Fit the training data to the model
dt_clf.fit(X_train_transformed, y_train)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

DOT Graph

# Create DOT data
dot_data = export_graphviz(dt_clf, out_file=None, 
                           feature_names=X.columns,  
                           class_names=np.unique(y).astype('str'), 
                           filled=True, rounded=True, special_characters=True)

# Draw graph
graph = graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())


Make Predictions

# Make predictions for test data
y_pred = dt_clf.predict(X_test_transformed)

Evaluate

Accuracy

# Check the accuracy, AUC, and create a confusion matrix

acc = accuracy_score(y_test,y_pred) * 100
print('Accuracy is :{0}'.format(acc))
Accuracy is :96.56339935669544

AUC

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is :{0}'.format(round(roc_auc, 2)))
AUC is :0.91

Confusion Matrix

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
Confusion Matrix
----------------
| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 5242 | 121 | 5363 |
| 1 | 82 | 462 | 544 |
| All | 5324 | 583 | 5907 |
# Print confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cnf_matrix)
Confusion Matrix:
 [[5242  121]
 [  82  462]]


TRUE POSITIVES: 462 pulsars correctly identified
TRUE NEGATIVES: 5242 correctly classified as noise
FALSE POSITIVES: 121 RFI/noise misclassified as pulsars
FALSE NEGATIVES: 82 pulsars misclassified as noise
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):

    import itertools
    # If normalize is set to True, normalize the raw
    # confusion matrix (row-wise) before visualizing
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=(10,10))
    plt.imshow(cm, cmap=cmap, aspect='equal')

    # Add title and axis labels
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    # Add appropriate axis scales
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Text formatting
    fmt = '.2f' if normalize else 'd'
    # Add labels to each cell
    thresh = cm.max() / 2.
    # iterate through the matrix and append labels
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='darkgray' if cm[i, j] > thresh else 'black',
                 size=14, weight='bold')

    # Add a colorbar
    plt.colorbar()
    plt.show()
# Plot normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=True,
                      title='Normalized confusion matrix')
Normalized confusion matrix


Parameter Tuning

* Create an array for max_depth values ranging from 1 - 32
* In a loop, train the classifier for each depth value (32 runs) 
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results

Max Depth

# Check for the best depth parameter
# (cast to int: DecisionTreeClassifier expects an integer max_depth)
max_depths = np.linspace(1, 32, 32, endpoint=True).astype(int)
train_results = []
test_results = []

# Identify the optimal tree depth for given data
for max_depth in max_depths:
    dt = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
   
    # Add auc score to previous train results
    train_results.append(roc_auc)
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
   
    # Add auc score to previous test results
    test_results.append(roc_auc)

# PLOT AUC curve
plt.figure(figsize=(11,8))
plt.plot(max_depths, train_results, 'k', label='Train AUC')
plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.title('MAX TREE DEPTH')
plt.legend()
plt.show()


Test AUC does not improve beyond a max tree depth of 3.

Min Sample Split

Now we'll check for the best min_samples_splits parameter for our decision tree.

* Create an array for min_samples_splits values ranging from 0.1 to 1 with an increment of 0.1
* In a loop, train the classifier for each min_samples_splits value (10 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Identify the optimal min-samples-split for given data

min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
train_results = []
test_results = []

for min_samples_split in min_samples_splits:
    dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=min_samples_split, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

plt.figure(figsize=(11,8))
plt.plot(min_samples_splits, train_results, 'k', label='Train AUC')
plt.plot(min_samples_splits, test_results, 'r', label='Test AUC')
plt.xlabel('Min. Sample splits')
plt.ylabel('AUC score')
plt.title('MIN SAMPLES SPLITS')
plt.legend()
plt.show()


Test AUC does not improve beyond a min_samples_split of 0.2.

Minimum Sample Leafs

Now we'll check for the best min_samples_leafs parameter value for our decision tree.

* Create an array for min_samples_leafs values ranging from 0.1 to 0.5 with an increment of 0.1
* In a loop, train the classifier for each min_samples_leafs value (5 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Calculate the optimal value for minimum sample leafs
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
train_results = []
test_results = []

for min_samples_leaf in min_samples_leafs:
    dt = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=min_samples_leaf, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

# PLOT
plt.figure(figsize=(11,7))
plt.plot(min_samples_leafs, train_results, 'b', label='Train AUC')
plt.plot(min_samples_leafs, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Min. Sample Leafs')
plt.title('MIN SAMPLE LEAFS')
plt.legend()
plt.show()


AUC for both train and test data is highest at a min_samples_leaf of 0.10.

Maximum Features

Now we'll check for the best max_features parameter value for our decision tree.

* Create an array for max_features values ranging from 1 to 7 (one feature up to all but one of the 8)
* In a loop, train the classifier for each max_features value (7 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Find the best value for optimal maximum feature size
max_features = list(range(1, X_train.shape[1]))
train_results = []
test_results = []

for max_feature in max_features:
    dt = DecisionTreeClassifier(criterion='entropy', max_features=max_feature, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

plt.figure(figsize=(13,8))
plt.plot(max_features, train_results, 'b', label='Train AUC')
plt.plot(max_features, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('max features')
plt.title('MAX FEATURES')
plt.legend()
plt.show()


Increasing max_features has no clear effect on the training data (flat AUC). The optimal value for the test data is 5.

Retrain Classifier

We'll now feed the best values identified in each tuning run above back into the classifier and see whether predictive performance improves.

* Train the classifier with optimal values identified
* Compare the AUC with vanilla DT AUC
* Interpret the results of comparison
# Re-train DT classifier with optimal values identified above
dt_clf = DecisionTreeClassifier(criterion='entropy', class_weight='balanced',
                                max_features=5,
                                max_depth=3,
                                min_samples_split=0.2,
                                min_samples_leaf=0.1)
# fit model
dt_clf.fit(X_train_transformed, y_train)

# make predictions
y_pred = dt_clf.predict(X_test_transformed)

# roc_auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
0.9257396472014128
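Tuning lifts the test AUC from 0.91 for the vanilla decision tree to roughly 0.93.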

DOT Graph

# Create DOT data
dot_data = export_graphviz(dt_clf, out_file=None, 
                           feature_names=X.columns,  
                           class_names=np.unique(y).astype('str'), 
                           filled=True, rounded=True, special_characters=True)

# Draw graph
graph = graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())


def modelX(algorithm, X_train, y_train, X_test, y_test, of_type):
    from sklearn.metrics import classification_report
    print("**********"*7)
    print("MODEL X")
    print("**********"*7)
    algorithm.fit(X_train, y_train)
    y_pred = algorithm.predict(X_test)

    print(algorithm)
    print("\n accuracy_score :", accuracy_score(y_test, y_pred))

    print("\nclassification report :\n", (classification_report(y_test, y_pred)))

    plt.figure(figsize=(13,10))
    plt.subplot(221)
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", linecolor="k", linewidths=3)
    plt.title("CONFUSION MATRIX", fontsize=20)

    pred_probs = algorithm.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, pred_probs)
    plt.subplot(222)
    plt.plot(fpr, tpr, label="Area under the curve : {:.3f}".format(auc(fpr, tpr)), color="r")
    plt.plot([1,0], [1,0], linestyle="dashed", color="k")
    plt.legend(loc="best")
    plt.title("ROC - CURVE & AREA UNDER CURVE", fontsize=20)

    if of_type == "feat":
        dataframe = pd.DataFrame(algorithm.feature_importances_, X_train.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features", 0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients", ascending=False)
        plt.subplot(223)
        ax = sns.barplot(x="coefficients", y="features", data=dataframe, palette="husl")
        plt.title("FEATURE IMPORTANCES", fontsize=20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011, i, j, weight="bold")

    elif of_type == "coef":
        try:
            dataframe = pd.DataFrame(algorithm.coef_.ravel(), X_train.columns).reset_index()
            dataframe = dataframe.rename(columns={"index":"features", 0:"coefficients"})
            dataframe = dataframe.sort_values(by="coefficients", ascending=False)
            plt.subplot(223)
            ax = sns.barplot(x="coefficients", y="features", data=dataframe, palette="husl")
            plt.title("FEATURE IMPORTANCES", fontsize=20)
            for i,j in enumerate(dataframe["coefficients"]):
                ax.text(.011, i, j, weight="bold")
        except AttributeError:
            print(f"{type(algorithm).__name__} has no coef argument")
# UNSCALED DATA
modelX(dt_clf, X_train, y_train, X_test, y_test, "feat")
**********************************************************************
MODEL X
**********************************************************************
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, max_features=5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=0.1, min_samples_split=0.2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

 accuracy_score : 0.9656339935669545

classification report :
               precision    recall  f1-score   support

           0       0.99      0.97      0.98      5363
           1       0.78      0.88      0.82       544

    accuracy                           0.97      5907
   macro avg       0.88      0.93      0.90      5907
weighted avg       0.97      0.97      0.97      5907


Kurtosis Integrated Profile ('KURTOSIS_IP') is by far the most important classifying feature when it comes to identifying Pulsars. Let's double check the other metrics with our scaled/transformed data:

# SCALED DATA
modelX(dt_clf, X_train_transformed, y_train, X_test_transformed, y_test, "coef")
**********************************************************************
MODEL X
**********************************************************************
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, max_features=5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=0.1, min_samples_split=0.2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

 accuracy_score : 0.9481970543423057

classification report :
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      5363
           1       0.69      0.81      0.74       544

    accuracy                           0.95      5907
   macro avg       0.83      0.88      0.86      5907
weighted avg       0.95      0.95      0.95      5907

DecisionTreeClassifier has no coef argument


F1 Score

The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.

Harmonic mean: F1 = 2 * (P * R) / (P + R)
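As a worked check against the decision tree confusion matrix above: P = 462/583 ≈ 0.79 and R = 462/544 ≈ 0.85, so F1 = 2 × (0.79 × 0.85) / (0.79 + 0.85) ≈ 0.82, matching the class-1 F1 in the classification report.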

f1 = f1_score(y_test, y_pred)
f1
0.8245462402765774

Because the data involves imbalanced classes, the F1 score is the most important metric for validating the model's accuracy. Let's compare the Decision Tree classifier's performance to XGBoost next.

XGBoost

Moving ahead with XGBoost.

Create Instance and Fit

# Fit XGBoost model
# Instantiate XGBClassifier with balanced class weights
# (note: XGBClassifier has no class_weight parameter, so the keyword is
# carried along but ignored -- scale_pos_weight is the XGBoost equivalent)
xgb_clf = XGBClassifier(class_weight='balanced')

# Fit XGBClassifier
xgb_clf.fit(X_train_transformed, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', class_weight='balanced',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
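Note that the fitted model above still shows scale_pos_weight=1: since class_weight is not a real XGBClassifier parameter, no reweighting actually happens. A minimal sketch of the XGBoost-native equivalent, not part of the original notebook (variable names are assumptions):

# reweight the positive (pulsar) class the XGBoost way:
# scale_pos_weight is conventionally the negative/positive count ratio (~9.9 for HTRU2)
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
xgb_weighted = XGBClassifier(scale_pos_weight=ratio)
xgb_weighted.fit(X_train_transformed, y_train)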

Make Predictions

# Predict on training and test sets
training_preds = xgb_clf.predict(X_train_transformed)
test_preds = xgb_clf.predict(X_test_transformed)

Evaluate

Accuracy

# Accuracy of training and test sets
training_accuracy = accuracy_score(y_train, training_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(test_accuracy * 100))
Training Accuracy: 98.38%
Validation accuracy: 98.02%

ROC_AUC and Confusion Matrix

# SCALED DATA
modelX(xgb_clf, X_train_transformed, y_train, X_test_transformed, y_test, "coef")
**********************************************************************
MODEL X
**********************************************************************
XGBClassifier(base_score=0.5, booster='gbtree', class_weight='balanced',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

 accuracy_score : 0.9801929913661758

classification report :
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      5363
           1       0.92      0.86      0.89       544

    accuracy                           0.98      5907
   macro avg       0.95      0.92      0.94      5907
weighted avg       0.98      0.98      0.98      5907

XGBClassifier has no coef argument


GridSearchCV

Tuning XGBoost with a parameter grid search

param_grid = {
    'learning_rate': [0.1, 0.2, 0.3],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.3, 0.5, 0.7],
    'n_estimators': [100],
}
grid_clf = GridSearchCV(xgb_clf, param_grid, scoring='accuracy', cv=None, n_jobs=1)
grid_clf.fit(X_train_transformed, y_train)

best_parameters = grid_clf.best_params_

print('Grid Search found the following optimal parameters: ')
for param_name in sorted(best_parameters.keys()):
    print('%s: %r' % (param_name, best_parameters[param_name]))

training_preds = grid_clf.predict(X_train_transformed)
test_preds = grid_clf.predict(X_test_transformed)
training_accuracy = accuracy_score(y_train, training_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print('')
print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(test_accuracy * 100))
Grid Search found the following optimal parameters: 
learning_rate: 0.3
max_depth: 6
min_child_weight: 1
n_estimators: 100
subsample: 0.7

Training Accuracy: 99.96%
Validation accuracy: 97.78%
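GridSearchCV refits the winning parameter combination on the full training set; the refit model and its cross-validated score are available through standard attributes. A brief sketch (the variable name is an assumption):

# retrieve the refit best model and its mean cross-validated accuracy
best_xgb = grid_clf.best_estimator_
print('Best CV accuracy: {:.4f}'.format(grid_clf.best_score_))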

Evaluate Model

AUC

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_preds)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is :{0}'.format(round(roc_auc, 2)))
AUC is :0.92
# Compare ROC curves across a few candidate regularization weights.
# Note: DecisionTreeClassifier takes no C/regularization parameter, so the
# predictions (and hence the AUC) are identical on every iteration -- the
# output below confirms this.
C_param_range = [0.005, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8]
colors = sns.color_palette('Set2', n_colors=len(C_param_range))
lw = 2

plt.figure(figsize=(10, 8))

for n, c in enumerate(C_param_range):

    # Predict (dt_clf is unaffected by c)
    y_pred = dt_clf.predict(X_test_transformed)

    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    print('----------------------------------------------')
    print('AUC for {}: {}'.format(c, auc(fpr, tpr)))
    plt.plot(fpr, tpr, color=colors[n],
             lw=lw, label='ROC curve Normalization Weight: {}'.format(c))

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
----------------------------------------------
AUC for 0.005: 0.8839272493446382
----------------------------------------------
AUC for 0.1: 0.8839272493446382
----------------------------------------------
AUC for 0.2: 0.8839272493446382
----------------------------------------------
AUC for 0.3: 0.8839272493446382
----------------------------------------------
AUC for 0.5: 0.8839272493446382
----------------------------------------------
AUC for 0.6: 0.8839272493446382
----------------------------------------------
AUC for 0.7: 0.8839272493446382
----------------------------------------------
AUC for 0.8: 0.8839272493446382


Confusion matrix

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, test_preds, rownames=['True'], colnames=['Predicted'], margins=True)
Confusion Matrix
----------------
| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 5324 | 39 | 5363 |
| 1 | 78 | 466 | 544 |
| All | 5402 | 505 | 5907 |
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print confusion matrix
cnf_matrix = confusion_matrix(y_test, test_preds)
print('Confusion Matrix:\n', cnf_matrix)
Confusion Matrix:
 [[5324   39]
 [  78  466]]
# Plot normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=True,
                      title='Normalized confusion matrix')
Normalized confusion matrix


# Plot non-normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=False,
                      title='Confusion matrix, without normalization')
Confusion matrix, without normalization


MSE and R2

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

# Evaluate (note: y_pred still holds the tuned decision tree's predictions;
# on 0/1 labels, MSE is just the misclassification rate, i.e. 1 - accuracy)
print('MSE score:', mse(y_test, y_pred))
print('R-sq score:', r2_score(y_test, y_pred))
MSE score: 0.05180294565769426
R-sq score: 0.38044238299459265

Feature Importance

# Feature importance
xgb_clf.feature_importances_
array([0.04194515, 0.03668287, 0.68801844, 0.03353313, 0.02772439,
       0.09195759, 0.03966612, 0.04047231], dtype=float32)
importance = pd.Series(data=xgb_clf.feature_importances_, index=X_train.columns)
importance.sort_values(ascending=False)
KURTOSIS_IP       0.688018
STD_CURVE         0.091958
MEAN_IP           0.041945
SKEWNESS_CURVE    0.040472
KURTOSIS_CURVE    0.039666
STD_IP            0.036683
SKEWNESS_IP       0.033533
MEAN_CURVE        0.027724
dtype: float32
def plot_feature_importances(model):
    # sort features by importance and draw a horizontal bar chart
    importance = pd.Series(data=model.feature_importances_, index=X_train.columns)
    importance = importance.sort_values(ascending=True)
    plt.figure(figsize=(8,8))
    plt.barh(importance.index, importance.values, align='center')
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(xgb_clf)


print("Testing Accuracy for XG Boost Classifier: {:.4}%".format(accuracy_score(y_test, y_pred) * 100))
Testing Accuracy for XG Boost Classifier: 94.82%
print("Testing F1 Score for XG Boost Classifier: {:.4}%".format(f1_score(y_test, y_pred) * 100))
Testing F1 Score for XG Boost Classifier: 74.11%

INTERPRET RESULTS

# Cross-Validation
from sklearn.model_selection import cross_val_score

xgb_cv_score = cross_val_score(xgb_clf, X_train_transformed, y_train, cv=3)
mean_xgb_cv_score = np.mean(xgb_cv_score)

print(f"Mean Cross Validation Score: {mean_xgb_cv_score :.2%}")
Mean Cross Validation Score: 98.02%
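cross_val_score defaults to the estimator's accuracy; since F1 is the metric this project leans on, it can be cross-validated directly via the scoring argument. A short sketch (the variable name is an assumption):

# cross-validate on F1 instead of accuracy
xgb_cv_f1 = cross_val_score(xgb_clf, X_train_transformed, y_train, cv=3, scoring='f1')
print(f"Mean Cross Validation F1: {np.mean(xgb_cv_f1):.2%}")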
# Feature Importance
from xgboost import plot_importance

plot_importance(booster=xgb_clf)

