
Monday, May 18, 2020

Logistic Regression in Python by Mirko Stojiljkovic (plus some updates)


0_MacOS_Python_setup.txt
# Install on Terminal of MacOS


#pip3 install -U numpy

#pip3 install -U scikit-learn

#pip3 install -U matplotlib

#pip3 install -U statsmodels



1_MacOS_Terminal.txt
########## Run Terminal on MacOS and execute
### TO UPDATE
cd "YOUR_WORKING_DIRECTORY"

python3


#Logistic Regression in Python With scikit-learn: Example 1
from lr1 import *


#Logistic Regression in Python With scikit-learn: Example 2
from lr2 import *


#Logistic Regression in Python With StatsModels: Example
from lr3 import *


#Logistic Regression in Python: Handwriting Recognition
from lr4 import *




0. Logistic Regression in Python (Intro without executable code)

lr0.py
#################### Logistic Regression in Python (Intro without executable code) ####################

#Source:
#Logistic Regression in Python by Mirko Stojiljkovic
#https://realpython.com/logistic-regression-python/


########## Classification

#Keywords
#logistic regression, classification, supervised machine learning

#The dependent variable (or output, or response) y is modeled as a function of the mutually independent variables (or features, inputs, or predictors): f(x) = b0 + b1*x1 + b2*x2 + ...
#Each set of input values x corresponds to a single observation.
#Supervised machine learning algorithms analyze a number of observations and build a model, i.e., a mathematical representation of dependence between inputs x and outputs y.

#Regression problems have continuous and usually UNBOUNDED outputs. (e.g., salary as an output, experience and education as inputs)
#Classification problems, such as logistic regression, have discrete and finite outputs called classes or categories. (An example of binary/binomial classification: predicting whether or not an employee is going to be promoted.)


########## Logistic Regression Overview

#Logistic regression is essentially a method for binary classification, but it can also be applied to multiclass problems.
#b0, b1, b2, ... above are estimators of the regression coefficients, which are also called the predicted weights or just coefficients.

# Logistic regression function: p(x) = 1/(1 + exp(-f(x))), where f(x) = b0 + b1*x1 + ... + br*xr is the linear predictor
# p(x) is the sigmoid function of f(x) and represents the predicted probability that the output for a given x equals 1. It's often close to either 0 or 1.

#Logistic regression determines the best predicted weights b0, b1, ..., br such that the function p(x) is as close as possible to all actual responses yi, i = 1, ..., n, where n is the number of observations. The process of calculating the best weights using available observations is called model training or fitting.

#To get the best weights, you usually maximize the log-likelihood function (LLF) for all observations i = 1, ..., n. This method is called maximum likelihood estimation, and the LLF is the sum over all observations of yi*log(p(xi)) + (1 - yi)*log(1 - p(xi)).
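#
#A minimal NumPy sketch of the sigmoid and the LLF (added here, not part of the original
#article), using illustrative weights b0, b1 rather than fitted values:
#
#import numpy as np
#
#def sigmoid(f):
#    return 1 / (1 + np.exp(-f))
#
#x = np.arange(10)
#y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
#b0, b1 = -1.0, 0.5                # illustrative weights, not fitted values
#p = sigmoid(b0 + b1 * x)          # p(x), the predicted probability that y = 1
#llf = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))    # LLF to be maximized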


#####Classification Performance

#You usually evaluate the performance of your classifier by comparing the actual and predicted outputs and counting the correct and incorrect predictions.
#
#(classification) accuracy = number of correct predictions / total number of predictions (or observations)
#positive predictive value = number of true positives / sum of the numbers of true and false positives
#negative predictive value = number of true negatives / sum of the numbers of true and false negatives
#sensitivity (aka recall or true positive rate) = number of true positives / number of actual positives
#specificity (or true negative rate) = number of true negatives / number of actual negatives
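#
#A minimal sketch (added, not in the original article) computing these metrics from
#illustrative counts of true/false positives and negatives:
#
#tp, tn, fp, fn = 6, 3, 1, 0           # hypothetical counts
#accuracy    = (tp + tn) / (tp + tn + fp + fn)
#precision   = tp / (tp + fp)          # positive predictive value
#npv         = tn / (tn + fn)          # negative predictive value
#sensitivity = tp / (tp + fn)          # recall / true positive rate
#specificity = tn / (tn + fp)          # true negative rate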


#####Regularization
#Overfitting is one of the most serious kinds of problems related to machine learning. It occurs when a model learns the training data too well. The model then learns not only the relationships among data but also the noise in the dataset. Overfitted models tend to have good performance with the data used to fit them (the training data), but they behave poorly with unseen data (or test data, which is data not used to fit the model).

#Overfitting usually occurs with complex models. Regularization normally tries to reduce or penalize the complexity of the model. Regularization techniques applied with logistic regression mostly tend to penalize large coefficients b0, b1, ..., br:
#
# - L1 regularization penalizes the LLF with the scaled sum of the absolute values of the weights: |b0| + |b1| + ... + |br|.
# - L2 regularization penalizes the LLF with the scaled sum of the squares of the weights: b0^2 + b1^2 + ... + br^2.
# - Elastic-net regularization is a linear combination of L1 and L2 regularization.
#
#Regularization can significantly improve model performance on unseen data.
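#
#A minimal sketch (added) of the three penalty terms for a hypothetical weight vector b:
#
#import numpy as np
#
#b = np.array([-1.0, 0.5, 0.25])                 # hypothetical weights b0, b1, b2
#l1 = np.sum(np.abs(b))                          # L1 penalty term
#l2 = np.sum(b ** 2)                             # L2 penalty term
#alpha = 0.5                                     # assumed mixing parameter
#elastic_net = alpha * l1 + (1 - alpha) * l2     # one possible linear combination of L1 and L2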


########## Logistic Regression in Python

# You'll see the following:
# - Python packages for logistic regression (NumPy, scikit-learn, StatsModels, and Matplotlib) (as in lr1.py, lr2.py, lr3.py, and lr4.py)
# - Two illustrative examples of logistic regression solved with scikit-learn (as in lr1.py and lr2.py)
# - One conceptual example solved with StatsModels (as in lr3.py)
# - One real-world example of classifying handwritten digits (as in lr4.py)




1. Logistic Regression in Python With scikit-learn: Example 1

lr1.py

########## Logistic Regression in Python With scikit-learn: Example 1

# a single-variate binary classification problem

# There are several general steps you’ll take when you’re preparing your classification models:
#1. IMPORT packages, functions, and classes
#2. GET DATA to work with and, if appropriate, transform it
#3. CREATE A CLASSIFICATION MODEL and train (or fit) it with your existing data
#4. EVALUATE YOUR MODEL to see if its performance is satisfactory



#####Step 1: Import Packages, Functions, and Classes

import matplotlib.pyplot as plt    #visualization
import numpy as np    # array operations
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix



#####Step 2: Get Data
x = np.arange(10).reshape(-1, 1)    #.reshape() to make x two-dimensional, with the arguments -1 to get as many rows as needed and 1 to get one column
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

#The array x is required to be two-dimensional and have one column for each input (i.e., a single input, not multiple-input), and the number of rows (10 in this case) should be equal to the number of observations.
x
#array([[0],
#       [1],
#       [2],
#       [3],
#       [4],
#       [5],
#       [6],
#       [7],
#       [8],
#       [9]])

#y is one-dimensional with ten items in this case. Each item corresponds to one observation. It contains only zeros and ones since this is a binary classification problem.
y
#array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])



#####Step 3: Create a Model and Train It

model = LogisticRegression(solver='liblinear', random_state=0)
#LogisticRegression has several optional parameters that define the behavior of the model and approach:
# - penalty (omitted here) is a string ('l2' by default) that decides whether there is regularization and which approach to use. Other options are 'l1', 'elasticnet', and 'none'.
# - C is a positive floating-point number (1.0 by default) that defines the relative strength of regularization. Smaller values indicate stronger regularization.
# - random_state (=0 here) is an integer, an instance of numpy.RandomState, or None (default) that defines what pseudo-random number generator to use.
# - solver is a string ('liblinear' by default and is used here) that decides what solver to use for fitting the model. Other options are 'newton-cg', 'lbfgs', 'sag', and 'saga'.
# -- 'liblinear' solver doesn’t work without regularization.
# -- 'newton-cg', 'sag', 'saga', and 'lbfgs' don’t support L1 regularization.
# -- 'saga' is the only solver that supports elastic-net regularization.
# - max_iter is an integer (100 by default) that defines the maximum number of iterations by the solver during model fitting.
# There are other optional parameters. Please refer to the original document.
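# Added illustration (not in the original article): a few solver/penalty combinations
# allowed by the rules above. These instances are for reference only and are not used below.
#LogisticRegression(penalty='l1', solver='liblinear', C=1.0, random_state=0)
#LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, random_state=0)
#LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=1000, random_state=0)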

# model fitting: a process of determining the coefficients b0, b1, ..., br that correspond to the best value of the cost function.
model.fit(x, y)

#You can use the fact that .fit() returns the model instance and chain the last two statements. They are equivalent to the following line of code:
model = LogisticRegression(solver='liblinear', random_state=0).fit(x, y)

#the array of distinct values that y takes:
model.classes_
#array([0, 1])
#This is the example of binary classification, and y can be 0 or 1, as indicated above.

# the value of the intercept b0 of the linear function
model.intercept_
#array([-1.04608067])
#
# the value of the slope b1 of the linear function
model.coef_
#array([[0.51491375]])
#
#As you can see, b0 is given inside a one-dimensional array, while b1 is inside a two-dimensional array.


#####Step 4: Evaluate the Model

model.predict_proba(x)
#array([[0.74002157, 0.25997843],
#       [0.62975524, 0.37024476],
#       [0.5040632 , 0.4959368 ],
#       [0.37785549, 0.62214451],
#       [0.26628093, 0.73371907],
#       [0.17821501, 0.82178499],
#       [0.11472079, 0.88527921],
#       [0.07186982, 0.92813018],
#       [0.04422513, 0.95577487],
#       [0.02690569, 0.97309431]])
#
#
#In the matrix above, each row corresponds to a single observation. The first column is the probability of the predicted output being zero, that is 1 - p(x). The second column is the probability that the output is one, or p(x).
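
#Added check (not in the original article): applying the sigmoid to the fitted linear
#function reproduces the second column of the probability matrix above.
p_manual = 1 / (1 + np.exp(-(model.intercept_ + x @ model.coef_.T)))
np.allclose(p_manual.ravel(), model.predict_proba(x)[:, 1])
#True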

#You can get the actual predictions, based on the probability matrix and the values of p(x), with .predict():
model.predict(x)
#array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
#This function returns the predicted output values as a one-dimensional array.
#
#Actual responses and correct predictions
#y
#array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
#
#model.predict(x)
#array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
#
#array([0, 0, 0, X, 1, 1, 1, 1, 1, 1])
#The X above marks the one incorrect prediction, which was wrongly classified as 1. All other values are predicted correctly.
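
#You can also locate the misclassified observation programmatically (added, not in the original article):
np.where(model.predict(x) != y)[0]
#array([3])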

#(classification) accuracy = number of correct predictions / total number of predictions (or observations)
#9/10 = 0.9
model.score(x, y)
#0.9

#confusion matrix
# In the case of binary classification, the confusion matrix shows the numbers of the following:
# - True negatives in the upper-left position
# - False negatives in the lower-left position
# - False positives in the upper-right position
# - True positives in the lower-right position
#
# True  negative     False positive
# False negative    True  positive
#
confusion_matrix(y, model.predict(x))
#array([[3, 1],
#       [0, 6]])


cm = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')

# show a figure
#plt.show()
#
# save a figure as png
plt.savefig('figure_ex1_1.png')


print(classification_report(y, model.predict(x)))
#              precision    recall  f1-score   support
#
#           0       1.00      0.75      0.86         4
#           1       0.86      1.00      0.92         6
#
#    accuracy                           0.90        10
#   macro avg       0.93      0.88      0.89        10
#weighted avg       0.91      0.90      0.90        10

#Note: It’s usually better to evaluate your model with the data you didn’t use for training. That’s how you avoid bias and detect overfitting. You’ll see an example later in this tutorial.
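
#A minimal sketch of such an evaluation (added; with only ten observations this split is
#purely illustrative, and the names below are not part of the original example):
#from sklearn.model_selection import train_test_split
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
#holdout_model = LogisticRegression(solver='liblinear', random_state=0).fit(x_train, y_train)
#holdout_model.score(x_test, y_test)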



#####Improve the Model

#You can improve your model by setting different parameters. For example, let’s work with the regularization strength C equal to 10.0, instead of the default value of 1.0:
model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)

model.fit(x, y)

model.intercept_
#array([-3.51335372])
#
model.coef_
#array([[1.12066084]])
#
#As you can see, the absolute values of the intercept b0 and the coefficient b1 are larger. This is the case because the larger value of C means weaker regularization, or weaker penalization related to high values of b0 and b1.

model.predict_proba(x)
#array([[0.97106534, 0.02893466],
#       [0.9162684 , 0.0837316 ],
#       [0.7810904 , 0.2189096 ],
#       [0.53777071, 0.46222929],
#       [0.27502212, 0.72497788],
#       [0.11007743, 0.88992257],
#       [0.03876835, 0.96123165],
#       [0.01298011, 0.98701989],
#       [0.0042697 , 0.9957303 ],
#       [0.00139621, 0.99860379]])

model.predict(x)
#array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
#
#Actual responses and correct predictions
#y
#array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model.score(x, y)
#1.0

confusion_matrix(y, model.predict(x))
#array([[4, 0],
#       [0, 6]])

print(classification_report(y, model.predict(x)))
#              precision    recall  f1-score   support
#
#           0       1.00      1.00      1.00         4
#           1       1.00      1.00      1.00         6
#
#    accuracy                           1.00        10
#   macro avg       1.00      1.00      1.00        10
#weighted avg       1.00      1.00      1.00        10


cm2 = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm2)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm2[i, j], ha='center', va='center', color='red')

# show a figure
#plt.show()
#
# save a figure as png
plt.savefig('figure_ex1_2.png')







2. Logistic Regression in Python With scikit-learn: Example 2

lr2.py


######### Logistic Regression in Python With scikit-learn: Example 2

##### Step 1: Import packages, functions, and classes
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


##### Step 2: Get data
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])


##### Step 3: Create a model and train it
model = LogisticRegression(solver='liblinear', C=10.0, random_state=0)
model.fit(x, y)


##### Step 4: Evaluate the model
p_pred = model.predict_proba(x)
y_pred = model.predict(x)
score_ = model.score(x, y)
conf_m = confusion_matrix(y, y_pred)
report = classification_report(y, y_pred)


print('x:', x, sep='\n')

print('y:', y, sep='\n', end='\n\n')

print('intercept:', model.intercept_)

print('coef:', model.coef_, end='\n\n')

print('p_pred:', p_pred, sep='\n', end='\n\n')

print('y_pred:', y_pred, end='\n\n')

print('score_:', score_, end='\n\n')

print('conf_m:', conf_m, sep='\n', end='\n\n')

print('report:', report, sep='\n')




#####added
import matplotlib.pyplot as plt    #visualization

#use derived conf_m above

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(conf_m)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, conf_m[i, j], ha='center', va='center', color='red')

# show a figure
#plt.show()
#
# save a figure as png
plt.savefig('figure_ex2.png')





3. Logistic Regression in Python With StatsModels: Example

lr3.py



######### Logistic Regression in Python With StatsModels: Example

#####Step 1: Import Packages
import numpy as np

#Execute the following script on Terminal first:
#pip3 install -U statsmodels
import statsmodels.api as sm


#####Step 2: Get Data

#StatsModels doesn’t take the intercept b0 into account, and you need to include the additional column of ones in x. You do that with add_constant():

x = np.arange(10).reshape(-1, 1)

y = np.array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])

#add_constant() takes the array x as the argument and returns a new array with the additional column of ones.
x = sm.add_constant(x)

x
#array([[1., 0.],
#       [1., 1.],
#       [1., 2.],
#       [1., 3.],
#       [1., 4.],
#       [1., 5.],
#       [1., 6.],
#       [1., 7.],
#       [1., 8.],
#       [1., 9.]])
#
#The first column of x corresponds to the intercept b0. The second column contains the original values of x.

y
#array([0, 1, 0, 0, 1, 1, 1, 1, 1, 1])


#####Step 3: Create a Model and Train It
model = sm.Logit(y, x)

result = model.fit(method='newton')
#or, if you want to apply L1 regularization,
#result = model.fit_regularized(method='l1')

result.params
#array([-1.972805  ,  0.82240094])
#intercept b0, slope b1


#####Step 4: Evaluate the Model

result.predict(x)
#array([0.12208792, 0.24041529, 0.41872657, 0.62114189, 0.78864861,
#       0.89465521, 0.95080891, 0.97777369, 0.99011108, 0.99563083])
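
#Added check (not in the original article): these probabilities are the sigmoid of the
#linear predictor x @ result.params, i.e. 1/(1 + exp(-(b0 + b1*x))).
np.allclose(result.predict(x), 1 / (1 + np.exp(-x @ result.params)))
#True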

(result.predict(x) >= 0.5).astype(int)
#array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

result.pred_table()
#array([[2., 1.],
#       [1., 6.]])

result.summary()

result.summary2()






4. Logistic Regression in Python: Handwriting Recognition

lr4.py

######### Logistic Regression in Python: Handwriting Recognition

# This example is about image recognition. To be more precise, you'll work on the recognition of handwritten digits. You'll use a dataset with 1797 observations, each of which is an image of one handwritten digit. Each image has 64 pixels, with a width of 8 px and a height of 8 px.

#The inputs (x) are vectors with 64 dimensions or values. Each input vector describes one image. Each of the 64 values represents one pixel of the image. The input values are integers between 0 and 16, depending on the shade of gray for the corresponding pixel.
#The output (y) for each observation is an integer between 0 and 9, consistent with the digit on the image. There are ten classes in total, each corresponding to one digit.


#####Step 1: Import Packages

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

##### import pandas library
import pandas as pd


#####Step 2a: Get Data
x, y = load_digits(return_X_y=True)

x
#type(x)
#<class 'numpy.ndarray'>
# pd.DataFrame(data=x, dtype ="Int64").to_csv("x.csv")
pd.DataFrame(data=x, dtype="Int64").to_csv("x.csv")

xtmp = pd.read_csv("x.csv", index_col=0)
xtmp = xtmp.values
#type(xtmp)
#<class 'numpy.ndarray'>
#xtmp
#type(xtmp)
#<class 'numpy.ndarray'>
#pd.DataFrame(data=xtmp, dtype="Int64").to_csv("xtmp.csv")
x = xtmp


y
#type(y)
#<class 'numpy.ndarray'>
pd.DataFrame(data=y, dtype="Int64").to_csv("y.csv")

ytmp = pd.read_csv("y.csv", index_col=0)
ytmp = ytmp.values
#type(ytmp)
#<class 'numpy.ndarray'>
#ytmp
#type(ytmp)
#<class 'numpy.ndarray'>
#pd.DataFrame(data=ytmp, dtype="Int64").to_csv("ytmp.csv")
y = ytmp.ravel()    # flatten back to one dimension (the CSV round trip returns a column vector)
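
# Added sketch (not in the original article): visualize one observation by reshaping its
# 64 pixel values into the 8x8 grid; the output file name below is just an assumption.
fig_digit, ax_digit = plt.subplots(figsize=(2, 2))
ax_digit.imshow(x[0].reshape(8, 8), cmap='gray_r')
ax_digit.set_title('label: {}'.format(int(y[0])))
plt.savefig('figure_ex4_digit0.png')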


#####Step 2b: Split Data
x_train, x_test, y_train, y_test =\
    train_test_split(x, y, test_size=0.2, random_state=0)


#####Step 2c: Scale Data
#
#Standardization is the process of transforming data in a way such that the mean of each column becomes equal to zero, and the standard deviation of each column is one. This way, you obtain the same scale for all columns.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)    #Now, x_train is a standardized input array.
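
# Added check (not in the original article): after standardization each non-constant column
# has mean ~0 and standard deviation ~1 (all-zero pixel columns keep a std of 0).
x_train.mean(axis=0).round(6)
x_train.std(axis=0).round(6)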


#####Step 3: Create a Model and Train It

model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr', random_state=0)
#multi_class parameter of LogisticRegression:
# - 'ovr' to make the binary fit for each class
# - 'multinomial' to apply the multinomial loss fit.

model.fit(x_train, y_train)


#####Step 4: Evaluate the Model

x_test = scaler.transform(x_test)    #standardize x_test

y_pred = model.predict(x_test)

model.score(x_train, y_train)
#0.964509394572025
#
model.score(x_test, y_test)
#0.9416666666666667
#The training set accuracy being higher than the test set accuracy might indicate overfitting. The test set accuracy is more relevant for evaluating the performance on unseen data since it isn't biased.

confusion_matrix(y_test, y_pred)


cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
### added font_size
font_size = 10
ax.set_xlabel('Predicted outputs', fontsize=font_size, color='black')
ax.set_ylabel('Actual outputs', fontsize=font_size, color='black')
ax.xaxis.set(ticks=range(10))
ax.yaxis.set(ticks=range(10))
ax.set_ylim(9.5, -0.5)
for i in range(10):
    for j in range(10):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')

# show a figure
#plt.show()
#
# save a figure as png
plt.savefig('figure_ex4.png')



print(classification_report(y_test, y_pred))




Reference

Logistic Regression in Python by Mirko Stojiljkovic
https://realpython.com/logistic-regression-python/
