
Predicting Pulsars with Decision Trees and XGBoost


Predicting Pulsars with Decision Trees and XGBoost is a supervised machine learning classification project that uses Decision Trees and XGBoost to classify radio signals as either pulsars or noise.

HTRU2 is a dataset describing a sample of pulsar candidates collected during the High Time Resolution Universe Survey.

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.

As pulsars rotate, their emission beams sweep across the sky, producing a detectable pattern of broadband radio emission each time they cross our line of sight. Because pulsars rotate rapidly, this pattern repeats periodically. Searching for pulsars therefore involves looking for periodic radio signals with large radio telescopes.

Each pulsar produces a slightly different emission pattern, which also varies from one rotation to the next. A detection of a potential signal is known as a 'candidate' and is averaged over many rotations of the pulsar, as determined by the length of the observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find. Legitimate pulsar examples are therefore a minority positive class, and spurious examples the majority negative class.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).

Each candidate is described by 8 continuous variables, and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve.

* Mean of the integrated profile.
* Standard deviation of the integrated profile.
* Excess kurtosis of the integrated profile.
* Skewness of the integrated profile.
* Mean of the DM-SNR curve.
* Standard deviation of the DM-SNR curve.
* Excess kurtosis of the DM-SNR curve.
* Skewness of the DM-SNR curve.
* Class

HTRU 2 Summary:

* 17,898 total examples
        * 1,639 positive examples
        * 16,259 negative examples

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed the latest version of Jupyter Notebook
  • You have a Windows, Linux, or macOS machine.

Running the Predicting Pulsars with Decision Trees and XGBoost Project

To run this project locally, follow these steps:

In the command line/terminal:

git clone https://github.com/hakkeray/predicting-pulsars-with-decision-trees-and-xgboost
cd predicting-pulsars-with-decision-trees-and-xgboost
jupyter notebook

Summary

Pipeline

To begin the analysis, I used pipelines to determine the most accurate models for predicting a pulsar. After splitting the dataset into training and test sets, I ran Standard Scaling, PCA, and a classifier inside a pipeline for each of Logistic Regression, Support Vector Machines, Decision Trees, and XGBoost. All were fairly accurate, with Decision Trees and XGBoost topping the list for accuracy scores.

Decision Trees

Following this, I proceeded with a Decision Tree classifier with balanced class weights, which did fairly well, scoring 96% accuracy. However, because of the imbalanced classes, the F1 score is the more meaningful measure of model performance, and there the Decision Tree classifier scored 82%.

XGBoost

Moving on to XGBoost, I scored 98% accuracy with an 89% F1 score. The model successfully identified 466 pulsars, missing only 78 that it mistakenly classified as noise.

RECOMMENDATIONS

 * Focus on Kurtosis Integrated Profile

 * Focus on Standard Deviation DM-SNR Curve

 * Validate model predictions with analysis of other celestial objects producing cosmic rays to see if they show the same attributes.

FUTURE WORK

  1. Improve the model by trying other scaling approaches and balancing class weights.

  2. Look at stars right before they die and predict whether they will become pulsars (likely impractical, considering stars live for billions of years…).

Contact

If you want to contact me you can reach me at [email protected].

License

This project uses the following license: MIT License.

#         _ __ _   _
#  /\_/\ | '__| | | |
#  [===] | |  | |_| |
#   \./  |_|   \__,_|
#

Process Outline

* IMPORT PACKAGES + LIBRARIES

* OBTAIN DATA

* PRE-PROCESSING

* EDA + VISUALIZATIONS

* MODELING:
    * MODEL 1: DECISION TREES
    * MODEL 2: RANDOM FOREST
    * MODEL 3: XGBOOST
    * MODEL 4: GRIDSEARCH CV (ALL)
    
* INTERPRET RESULTS

* CONCLUSION + SUMMARY

* FUTURE WORK

Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.externals import joblib  # in newer scikit-learn: import joblib
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import (accuracy_score, roc_curve, auc,
                             average_precision_score, recall_score,
                             precision_score, f1_score,
                             classification_report, confusion_matrix)
from IPython.display import Image
from pydotplus import graph_from_dot_data
from xgboost import XGBClassifier

Dataset

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.562500 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | 0 |
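The notebook works with shortened column names (MEAN_IP, STD_IP, and so on, as shown in the tables below). A minimal loading sketch, assuming the Kaggle release of HTRU2 (the filename pulsar_stars.csv is an assumption; the column order matches the feature list above):

# load the CSV and shorten the column names used throughout the notebook
import pandas as pd
df = pd.read_csv('pulsar_stars.csv')  # hypothetical filename
df.columns = ['MEAN_IP', 'STD_IP', 'KURTOSIS_IP', 'SKEWNESS_IP',
              'MEAN_CURVE', 'STD_CURVE', 'KURTOSIS_CURVE', 'SKEWNESS_CURVE', 'TARGET']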

Descriptive Statistics

| | MEAN_IP | STD_IP | KURTOSIS_IP | SKEWNESS_IP | MEAN_CURVE | STD_CURVE | KURTOSIS_CURVE | SKEWNESS_CURVE | TARGET |
|---|---|---|---|---|---|---|---|---|---|
| count | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 | 17898.000000 |
| mean | 111.079968 | 46.549532 | 0.477857 | 1.770279 | 12.614400 | 26.326515 | 8.303556 | 104.857709 | 0.091574 |
| std | 25.652935 | 6.843189 | 1.064040 | 6.167913 | 29.472897 | 19.470572 | 4.506092 | 106.514540 | 0.288432 |
| min | 5.812500 | 24.772042 | -1.876011 | -1.791886 | 0.213211 | 7.370432 | -3.139270 | -1.976976 | 0.000000 |
| 25% | 100.929688 | 42.376018 | 0.027098 | -0.188572 | 1.923077 | 14.437332 | 5.781506 | 34.960504 | 0.000000 |
| 50% | 115.078125 | 46.947479 | 0.223240 | 0.198710 | 2.801839 | 18.461316 | 8.433515 | 83.064556 | 0.000000 |
| 75% | 127.085938 | 51.023202 | 0.473325 | 0.927783 | 5.464256 | 28.428104 | 10.702959 | 139.309331 | 0.000000 |
| max | 192.617188 | 98.778911 | 8.069522 | 68.101622 | 223.392140 | 110.642211 | 34.539844 | 1191.000837 | 1.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
MEAN_IP           17898 non-null float64
STD_IP            17898 non-null float64
KURTOSIS_IP       17898 non-null float64
SKEWNESS_IP       17898 non-null float64
MEAN_CURVE        17898 non-null float64
STD_CURVE         17898 non-null float64
KURTOSIS_CURVE    17898 non-null float64
SKEWNESS_CURVE    17898 non-null float64
TARGET            17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB

EXPLORE

Exploratory Data Analysis (EDA)

Comparing Attributes

hotmap()

def hotmap(df, figsize=(10,8)):
    ##### correlation heatmap
    corr = df.corr()
    fig, ax = plt.subplots(figsize=figsize)
    # mask the upper triangle (it duplicates the lower one)
    mask = np.zeros_like(corr, dtype=bool)
    idx = np.triu_indices_from(mask)
    mask[idx] = True

    sns.heatmap(np.abs(corr), square=True, mask=mask, annot=True,
                cmap=sns.color_palette("magma"), ax=ax, linewidth=2, edgecolor="k")
    ax.set_ylim(len(corr), -0.5)

    plt.title("CORRELATION BETWEEN VARIABLES")
    plt.show()

    ##### descriptive statistics heatmap
    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(df.describe()[1:].transpose(), annot=True, ax=ax,
                linecolor="w", linewidth=2, cmap=sns.color_palette("Set2"))
    ax.set_ylim(len(corr), -0.5)
    plt.title("Data summary")
    plt.show()

    plt.figure(figsize=(13,8))

    ##### compare proportion of target classes
    plt.subplot(121)
    ax = sns.countplot(y=df["TARGET"],
                       palette=["b","lime"],
                       linewidth=1,
                       edgecolor="k")
    for i,j in enumerate(df["TARGET"].value_counts().values):
        ax.text(.7, i, j, weight="bold", fontsize=27)
    plt.title("Count for target variable in dataset")

    plt.subplot(122)
    plt.pie(df["TARGET"].value_counts().values,
            labels=["not pulsars","pulsars"],
            autopct="%1.0f%%", wedgeprops={"linewidth":2,"edgecolor":"white"})
    circ = plt.Circle((0,0), .7, color="white")
    plt.gca().add_artist(circ)
    plt.subplots_adjust(wspace=.2)
    plt.title("Proportion of target variable in dataset")
    plt.show()

hotmap(df, figsize=(10,8))

[figures: correlation heatmap, data summary heatmap, and target class count/proportion plots]

Target class values are highly differentiated for the following features:

* Kurtosis Integrated Profile
* Skewness Integrated Profile

Other candidates include:

* Mean Curve
* Standard Deviation Curve
* Kurtosis Curve
* Skewness Curve

Least likely to be important in distinguishing pulsars and RFI include:

* Mean Integrated Profile
* Standard Deviation IP

Comparing Pulsar vs. Non-Pulsar: Mean and Standard Deviation

[GREEN == PULSAR , BLUE == NON-PULSAR]

[figure: per-feature mean and standard deviation, pulsar vs. non-pulsar]

The mean and standard deviation of the Skewness Curve are also good candidate predictors for our target class.

Additional EDA figures rendered in the notebook:

* Distribution of Target Class Variables
* Pairplot
* Scatterplot
* Boxplot
* Stackplot
* 3D Plot
* Jointplots
* Barplots

MODEL

Split Data

# create our feature set X and labels y:

y = df['TARGET'].copy()
X = df.drop(columns=['TARGET']).copy()
display(y.shape, X.shape)
(17898,)

(17898, 8)
# Hold out 33% of the dataset for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
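Given the heavy class imbalance (~9% positives), it is often safer to stratify the split so both sets preserve the class proportions. A minimal sketch, assuming a fixed random_state for reproducibility (the original notebook does neither):

# stratified 67/33 split that preserves the ~9% positive rate in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)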

Construct Pipelines

  1. Scale data using StandardScaler()
  2. Construct Pipelines
# logistic regression
pipe_lr = Pipeline([('scl', StandardScaler()),
                   ('pca', PCA(n_components=2)),
                   ('clf', LogisticRegression(class_weight='balanced'))])

# support vector
pipe_svm = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', svm.SVC(class_weight='balanced'))])


# decision tree
pipe_dt = Pipeline([('scl', StandardScaler()),
                   ('pca', PCA(n_components=2)),
                    ('clf', tree.DecisionTreeClassifier(class_weight='balanced'))])

# xgboost
# (note: XGBClassifier has no class_weight parameter; the keyword is ignored --
# scale_pos_weight is the XGBoost equivalent)
pipe_xgb = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', XGBClassifier(class_weight='balanced'))])
# List of pipelines for ease of iteration
pipelines = [pipe_lr, pipe_svm, pipe_dt, pipe_xgb]
# Dictionary of pipelines and classifier types for ease of reference
pipe_dict = {0: 'Logistic Regression', 
             1: 'Support Vector Machine', 
             2: 'Decision Tree', 
             3: 'XG Boost'}
# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)
# Compare accuracies
for idx, val in enumerate(pipelines):
    print('%s pipeline training accuracy: %.3f' % (pipe_dict[idx], val.score(X_train, y_train)))
    print('%s pipeline test accuracy: %.3f' % (pipe_dict[idx], val.score(X_test, y_test)))
Logistic Regression pipeline training accuracy: 0.935
Logistic Regression pipeline test accuracy: 0.937
Support Vector Machine pipeline training accuracy: 0.955
Support Vector Machine pipeline test accuracy: 0.955
Decision Tree pipeline training accuracy: 1.000
Decision Tree pipeline test accuracy: 0.959
XG Boost pipeline training accuracy: 0.976
XG Boost pipeline test accuracy: 0.973
# Identify the most accurate model on test data
best_acc = 0.0
best_clf = 0
best_pipe = ''


for idx, val in enumerate(pipelines):
    if val.score(X_test, y_test) > best_acc:
        best_acc = val.score(X_test, y_test)
        best_pipe = val
        best_clf = idx
        
print('Classifier with best accuracy: %s' % pipe_dict[best_clf])

# Save pipeline to file
joblib.dump(best_pipe, 'best_pipeline.pkl', compress=1)
print('Saved %s pipeline to file' % pipe_dict[best_clf])
Classifier with best accuracy: XG Boost
Saved XG Boost pipeline to file
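The saved pipeline bundles the scaler, PCA, and classifier, so it can be reloaded and applied to raw feature rows in one step. A brief sketch (variable names are assumptions):

# reload the persisted pipeline and score new candidates
best_pipe = joblib.load('best_pipeline.pkl')
new_preds = best_pipe.predict(X_test)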

Decision Tree

Standardize

# Standardize the data
std = StandardScaler()
X_train_transformed = std.fit_transform(X_train)
X_test_transformed = std.transform(X_test)

Create Instance

## Create an instance of decision tree classifier
dt_clf = DecisionTreeClassifier(criterion='entropy', class_weight='balanced')
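The entropy criterion measures class impurity at each split as H = -Σ p_i * log2(p_i) over the classes. As a quick worked example at the root node: with HTRU2's roughly 9.2% positive rate, H ≈ -(0.092 × log2 0.092 + 0.908 × log2 0.908) ≈ 0.44 bits, while a perfectly pure node scores 0.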

Fit

# Fit the training data to the model
dt_clf.fit(X_train_transformed, y_train)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

DOT Graph

# Create DOT data
dot_data = export_graphviz(dt_clf, out_file=None, 
                           feature_names=X.columns,  
                           class_names=np.unique(y).astype('str'), 
                           filled=True, rounded=True, special_characters=True)

# Draw graph
graph = graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())


Make Predictions

# Make predictions for test data
y_pred = dt_clf.predict(X_test_transformed)

Evaluate

Accuracy

# Check the accuracy, AUC, and create a confusion matrix

acc = accuracy_score(y_test,y_pred) * 100
print('Accuracy is :{0}'.format(acc))
Accuracy is :96.56339935669544

AUC

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is :{0}'.format(round(roc_auc, 2)))
AUC is :0.91

Confusion Matrix

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
Confusion Matrix
----------------
| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 5242 | 121 | 5363 |
| 1 | 82 | 462 | 544 |
| All | 5324 | 583 | 5907 |
# Print confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cnf_matrix)
Confusion Matrix:
 [[5242  121]
 [  82  462]]


TRUE POSITIVES: 462 pulsars correctly identified
TRUE NEGATIVES: 5242 correctly classified as noise
FALSE POSITIVES: 121 RFI/noise misclassified as pulsars
FALSE NEGATIVES: 82 pulsars misclassified as noise
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):

    import itertools
    # If normalize is set to True, normalize the raw
    # confusion matrix (row-wise) before visualizing
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=(10,10))
    plt.imshow(cm, cmap=cmap, aspect='equal')

    # Add title and axis labels
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    # Add appropriate axis scales
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Text formatting
    fmt = '.2f' if normalize else 'd'
    # Add labels to each cell
    thresh = cm.max() / 2.
    # iterate through the matrix and append labels
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='darkgray' if cm[i, j] > thresh else 'black',
                 size=14, weight='bold')

    # Add a colorbar
    plt.colorbar()
    plt.show()
# Plot normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=True,
                      title='Normalized confusion matrix')
Normalized confusion matrix


Parameter Tuning

* Create an array for max_depth values ranging from 1 - 32
* In a loop, train the classifier for each depth value (32 runs) 
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results

Max Depth

# Check for the best depth parameter
# (cast to int: DecisionTreeClassifier expects an integer max_depth)
max_depths = np.linspace(1, 32, 32, endpoint=True).astype(int)
train_results = []
test_results = []

# Identify the optimal tree depth for given data
for max_depth in max_depths:
    dt = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
   
    # Add auc score to previous train results
    train_results.append(roc_auc)
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
   
    # Add auc score to previous test results
    test_results.append(roc_auc)

# PLOT AUC curve
plt.figure(figsize=(11,8))
plt.plot(max_depths, train_results, 'k', label='Train AUC')
plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.title('MAX TREE DEPTH')
plt.legend()
plt.show()


Test AUC does not improve beyond a max tree depth of 3.

Min Sample Split

Now we'll check for the best min_samples_splits parameter for our decision tree.

* Create an array for min_samples_splits values ranging from 0.1 to 1 with an increment of 0.1
* In a loop, train the classifier for each min_samples_splits value (10 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Identify the optimal min-samples-split for given data

min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
train_results = []
test_results = []

for min_samples_split in min_samples_splits:
    dt = DecisionTreeClassifier(criterion='entropy', min_samples_split=min_samples_split, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

plt.figure(figsize=(11,8))
plt.plot(min_samples_splits, train_results, 'k', label='Train AUC')
plt.plot(min_samples_splits, test_results, 'r', label='Test AUC')
plt.xlabel('Min. Sample splits')
plt.ylabel('AUC score')
plt.title('MIN SAMPLES SPLITS')
plt.legend()
plt.show()


Test AUC does not improve beyond a min_samples_split of 0.2.

Minimum Sample Leafs

Now we'll check for the best min_samples_leafs parameter value for our decision tree.

* Create an array for min_samples_leafs values ranging from 0.1 to 0.5 with an increment of 0.1
* In a loop, train the classifier for each min_samples_leafs value (5 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Calculate the optimal value for minimum sample leafs
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
train_results = []
test_results = []

for min_samples_leaf in min_samples_leafs:
    dt = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=min_samples_leaf, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

# PLOT
plt.figure(figsize=(11,7))
plt.plot(min_samples_leafs, train_results, 'b', label='Train AUC')
plt.plot(min_samples_leafs, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('Min. Sample Leafs')
plt.title('MIN SAMPLE LEAFS')
plt.legend()
plt.show()


AUC for both train and test data is highest at a min_samples_leaf of 0.10.

Maximum Features

Now we'll check for the best max_features parameter value for our decision tree.

* Create an array for max_features values ranging from 1 to 7 (one feature up to all but one of the 8)
* In a loop, train the classifier for each max_features value (7 runs)
* Calculate the training and test AUC for each run
* Plot a graph to show under/over fitting and optimal value
* Interpret the results
# Find the best value for optimal maximum feature size
max_features = list(range(1, X_train.shape[1]))
train_results = []
test_results = []

for max_feature in max_features:
    dt = DecisionTreeClassifier(criterion='entropy', max_features=max_feature, class_weight='balanced')
    dt.fit(X_train_transformed, y_train)
    train_pred = dt.predict(X_train_transformed)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    y_pred = dt.predict(X_test_transformed)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

plt.figure(figsize=(13,8))
plt.plot(max_features, train_results, 'b', label='Train AUC')
plt.plot(max_features, test_results, 'r', label='Test AUC')
plt.ylabel('AUC score')
plt.xlabel('max features')
plt.title('MAX FEATURES')
plt.legend()
plt.show()


Increasing max_features has no clear effect on the training data (flat AUC). The optimal value for the test data is 5.

Retrain Classifier

We'll now feed the best values identified in each tuning run above back into the classifier and see whether predictive performance improves.

* Train the classifier with optimal values identified
* Compare the AUC with vanilla DT AUC
* Interpret the results of comparison
# Re-train DT classifier with optimal values identified above
dt_clf = DecisionTreeClassifier(criterion='entropy', class_weight='balanced',
                                max_features=5,
                                max_depth=3,
                                min_samples_split=0.2,
                                min_samples_leaf=0.1)
# fit model
dt_clf.fit(X_train_transformed, y_train)

# make predictions
y_pred = dt_clf.predict(X_test_transformed)

# roc_auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
0.9257396472014128
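Tuning lifts the test AUC from 0.91 for the vanilla decision tree to roughly 0.93.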

DOT Graph

# Create DOT data
dot_data = export_graphviz(dt_clf, out_file=None, 
                           feature_names=X.columns,  
                           class_names=np.unique(y).astype('str'), 
                           filled=True, rounded=True, special_characters=True)

# Draw graph
graph = graph_from_dot_data(dot_data)  

# Show graph
Image(graph.create_png())


def modelX(algorithm, X_train, y_train, X_test, y_test, of_type):
    from sklearn.metrics import classification_report
    print("**********"*7)
    print("MODEL X")
    print("**********"*7)
    algorithm.fit(X_train, y_train)
    y_pred = algorithm.predict(X_test)

    print(algorithm)
    print("\n accuracy_score :", accuracy_score(y_test, y_pred))

    print("\nclassification report :\n", (classification_report(y_test, y_pred)))

    plt.figure(figsize=(13,10))
    plt.subplot(221)
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", linecolor="k", linewidths=3)
    plt.title("CONFUSION MATRIX", fontsize=20)

    pred_probs = algorithm.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, pred_probs)
    plt.subplot(222)
    plt.plot(fpr, tpr, label="Area under the curve : {:.3f}".format(auc(fpr, tpr)), color="r")
    plt.plot([1,0], [1,0], linestyle="dashed", color="k")
    plt.legend(loc="best")
    plt.title("ROC - CURVE & AREA UNDER CURVE", fontsize=20)

    if of_type == "feat":
        dataframe = pd.DataFrame(algorithm.feature_importances_, X_train.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features", 0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients", ascending=False)
        plt.subplot(223)
        ax = sns.barplot(x="coefficients", y="features", data=dataframe, palette="husl")
        plt.title("FEATURE IMPORTANCES", fontsize=20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011, i, j, weight="bold")

    elif of_type == "coef":
        try:
            dataframe = pd.DataFrame(algorithm.coef_.ravel(), X_train.columns).reset_index()
            dataframe = dataframe.rename(columns={"index":"features", 0:"coefficients"})
            dataframe = dataframe.sort_values(by="coefficients", ascending=False)
            plt.subplot(223)
            ax = sns.barplot(x="coefficients", y="features", data=dataframe, palette="husl")
            plt.title("FEATURE IMPORTANCES", fontsize=20)
            for i,j in enumerate(dataframe["coefficients"]):
                ax.text(.011, i, j, weight="bold")
        except AttributeError:
            print(f"{type(algorithm).__name__} has no coef argument")
# UNSCALED DATA
modelX(dt_clf, X_train, y_train, X_test, y_test, "feat")
**********************************************************************
MODEL X
**********************************************************************
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, max_features=5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=0.1, min_samples_split=0.2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

 accuracy_score : 0.9656339935669545

classification report :
               precision    recall  f1-score   support

           0       0.99      0.97      0.98      5363
           1       0.78      0.88      0.82       544

    accuracy                           0.97      5907
   macro avg       0.88      0.93      0.90      5907
weighted avg       0.97      0.97      0.97      5907


Kurtosis Integrated Profile ('KURTOSIS_IP') is by far the most important classifying feature when it comes to identifying Pulsars. Let's double check the other metrics with our scaled/transformed data:

# SCALED DATA
modelX(dt_clf, X_train_transformed, y_train, X_test_transformed, y_test, "coef")
**********************************************************************
MODEL X
**********************************************************************
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, max_features=5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=0.1, min_samples_split=0.2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

 accuracy_score : 0.9481970543423057

classification report :
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      5363
           1       0.69      0.81      0.74       544

    accuracy                           0.95      5907
   macro avg       0.83      0.88      0.86      5907
weighted avg       0.95      0.95      0.95      5907

DecisionTreeClassifier has no coef argument


F1 Score

The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.

Harmonic mean: F1 = 2 * (P * R) / (P + R)
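As a worked check against the decision tree confusion matrix above: P = 462/583 ≈ 0.79 and R = 462/544 ≈ 0.85, so F1 = 2 × (0.79 × 0.85) / (0.79 + 0.85) ≈ 0.82, matching the class-1 F1 in the classification report.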

f1 = f1_score(y_test, y_pred)
f1
0.8245462402765774

Because the data involves imbalanced classes, the F1 score is the most important metric for validating the model's accuracy. Let's compare the Decision Tree classifier's performance to XGBoost next.

XGBoost

Moving ahead with XGBoost.

Create Instance and Fit

# Fit XGBoost model
# Instantiate XGBClassifier with balanced class weights
# (note: XGBClassifier has no class_weight parameter, so the keyword is
# carried along but ignored -- scale_pos_weight is the XGBoost equivalent)
xgb_clf = XGBClassifier(class_weight='balanced')

# Fit XGBClassifier
xgb_clf.fit(X_train_transformed, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', class_weight='balanced',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
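Note that the fitted model above still shows scale_pos_weight=1: since class_weight is not a real XGBClassifier parameter, no reweighting actually happens. A minimal sketch of the XGBoost-native equivalent, not part of the original notebook (variable names are assumptions):

# reweight the positive (pulsar) class the XGBoost way:
# scale_pos_weight is conventionally the negative/positive count ratio (~9.9 for HTRU2)
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
xgb_weighted = XGBClassifier(scale_pos_weight=ratio)
xgb_weighted.fit(X_train_transformed, y_train)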

Make Predictions

# Predict on training and test sets
training_preds = xgb_clf.predict(X_train_transformed)
test_preds = xgb_clf.predict(X_test_transformed)

Evaluate

Accuracy

# Accuracy of training and test sets
training_accuracy = accuracy_score(y_train, training_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(test_accuracy * 100))
Training Accuracy: 98.38%
Validation accuracy: 98.02%

ROC_AUC and Confusion Matrix

# SCALED DATA
modelX(xgb_clf, X_train_transformed, y_train, X_test_transformed, y_test, "coef")
**********************************************************************
MODEL X
**********************************************************************
XGBClassifier(base_score=0.5, booster='gbtree', class_weight='balanced',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

 accuracy_score : 0.9801929913661758

classification report :
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      5363
           1       0.92      0.86      0.89       544

    accuracy                           0.98      5907
   macro avg       0.95      0.92      0.94      5907
weighted avg       0.98      0.98      0.98      5907

XGBClassifier has no coef argument


GridSearchCV

Tuning XGBoost with a parameter grid search

param_grid = {
    'learning_rate': [0.1, 0.2, 0.3],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.3, 0.5, 0.7],
    'n_estimators': [100],
}
grid_clf = GridSearchCV(xgb_clf, param_grid, scoring='accuracy', cv=None, n_jobs=1)
grid_clf.fit(X_train_transformed, y_train)

best_parameters = grid_clf.best_params_

print('Grid Search found the following optimal parameters: ')
for param_name in sorted(best_parameters.keys()):
    print('%s: %r' % (param_name, best_parameters[param_name]))

training_preds = grid_clf.predict(X_train_transformed)
test_preds = grid_clf.predict(X_test_transformed)
training_accuracy = accuracy_score(y_train, training_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print('')
print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Validation accuracy: {:.4}%'.format(test_accuracy * 100))
Grid Search found the following optimal parameters: 
learning_rate: 0.3
max_depth: 6
min_child_weight: 1
n_estimators: 100
subsample: 0.7

Training Accuracy: 99.96%
Validation accuracy: 97.78%
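GridSearchCV refits the winning parameter combination on the full training set; the refit model and its cross-validated score are available through standard attributes. A brief sketch (the variable name is an assumption):

# retrieve the refit best model and its mean cross-validated accuracy
best_xgb = grid_clf.best_estimator_
print('Best CV accuracy: {:.4f}'.format(grid_clf.best_score_))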

Evaluate Model

AUC

# Check the AUC for predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, test_preds)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is :{0}'.format(round(roc_auc, 2)))
AUC is :0.92
# Compare ROC curves across a few candidate regularization weights.
# Note: DecisionTreeClassifier takes no C/regularization parameter, so the
# predictions (and hence the AUC) are identical on every iteration -- the
# output below confirms this.
C_param_range = [0.005, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8]
colors = sns.color_palette('Set2', n_colors=len(C_param_range))
lw = 2

plt.figure(figsize=(10, 8))

for n, c in enumerate(C_param_range):

    # Predict (dt_clf is unaffected by c)
    y_pred = dt_clf.predict(X_test_transformed)

    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    print('----------------------------------------------')
    print('AUC for {}: {}'.format(c, auc(fpr, tpr)))
    plt.plot(fpr, tpr, color=colors[n],
             lw=lw, label='ROC curve Normalization Weight: {}'.format(c))

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
----------------------------------------------
AUC for 0.005: 0.8839272493446382
----------------------------------------------
AUC for 0.1: 0.8839272493446382
----------------------------------------------
AUC for 0.2: 0.8839272493446382
----------------------------------------------
AUC for 0.3: 0.8839272493446382
----------------------------------------------
AUC for 0.5: 0.8839272493446382
----------------------------------------------
AUC for 0.6: 0.8839272493446382
----------------------------------------------
AUC for 0.7: 0.8839272493446382
----------------------------------------------
AUC for 0.8: 0.8839272493446382


Confusion matrix

# Create and print a confusion matrix 
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, test_preds, rownames=['True'], colnames=['Predicted'], margins=True)
Confusion Matrix
----------------
| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 5324 | 39 | 5363 |
| 1 | 78 | 466 | 544 |
| All | 5402 | 505 | 5907 |
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print confusion matrix
cnf_matrix = confusion_matrix(y_test, test_preds)
print('Confusion Matrix:\n', cnf_matrix)
Confusion Matrix:
 [[5324   39]
 [  78  466]]
# Plot normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=True,
                      title='Normalized confusion matrix')
Normalized confusion matrix


# Plot non-normalized confusion matrix
plot_confusion_matrix(cnf_matrix, classes=['Non-Pulsar', 'Pulsar'], normalize=False,
                      title='Confusion matrix, without normalization')
Confusion matrix, without normalization


MSE and R2

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

# Evaluate (note: y_pred still holds the tuned decision tree's predictions;
# on 0/1 labels, MSE is just the misclassification rate, i.e. 1 - accuracy)
print('MSE score:', mse(y_test, y_pred))
print('R-sq score:', r2_score(y_test, y_pred))
MSE score: 0.05180294565769426
R-sq score: 0.38044238299459265

Feature Importance

# Feature importance
xgb_clf.feature_importances_
array([0.04194515, 0.03668287, 0.68801844, 0.03353313, 0.02772439,
       0.09195759, 0.03966612, 0.04047231], dtype=float32)
importance = pd.Series(data=xgb_clf.feature_importances_, index=X_train.columns)
importance.sort_values(ascending=False)
KURTOSIS_IP       0.688018
STD_CURVE         0.091958
MEAN_IP           0.041945
SKEWNESS_CURVE    0.040472
KURTOSIS_CURVE    0.039666
STD_IP            0.036683
SKEWNESS_IP       0.033533
MEAN_CURVE        0.027724
dtype: float32
def plot_feature_importances(model):
    # sort features by importance and draw a horizontal bar chart
    importance = pd.Series(data=model.feature_importances_, index=X_train.columns)
    importance = importance.sort_values(ascending=True)
    plt.figure(figsize=(8,8))
    plt.barh(importance.index, importance.values, align='center')
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(xgb_clf)


print("Testing Accuracy for XG Boost Classifier: {:.4}%".format(accuracy_score(y_test, y_pred) * 100))
Testing Accuracy for XG Boost Classifier: 94.82%
print("Testing F1 Score for XG Boost Classifier: {:.4}%".format(f1_score(y_test, y_pred) * 100))
Testing F1 Score for XG Boost Classifier: 74.11%

INTERPRET RESULTS

# Cross-Validation
from sklearn.model_selection import cross_val_score

xgb_cv_score = cross_val_score(xgb_clf, X_train_transformed, y_train, cv=3)
mean_xgb_cv_score = np.mean(xgb_cv_score)

print(f"Mean Cross Validation Score: {mean_xgb_cv_score :.2%}")
Mean Cross Validation Score: 98.02%
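cross_val_score defaults to the estimator's accuracy; since F1 is the metric this project leans on, it can be cross-validated directly via the scoring argument. A short sketch (the variable name is an assumption):

# cross-validate on F1 instead of accuracy
xgb_cv_f1 = cross_val_score(xgb_clf, X_train_transformed, y_train, cv=3, scoring='f1')
print(f"Mean Cross Validation F1: {np.mean(xgb_cv_f1):.2%}")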
# Feature Importance
from xgboost import plot_importance

plot_importance(booster=xgb_clf)

