Supervised Learning

Data points have a known outcome (label).

The goal is to predict the nature of the relationship between the input parameters and the target variable.

Parameters: the variables of a machine learning model that change their values as the model learns. The number of parameters can range from very few to trillions.

Hyperparameters: parameters that are not learned directly from the data but relate to the implementation; they are set before training.

Two types of problems: regression (continuous target) and classification (categorical target).


Some terms of interest are:


Interpretation vs Prediction Objective:


Linear Regression

Measures of error (these can be used for any regression model):
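A quick sklearn illustration of the common ones (the toy values below are purely for demonstration):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])   # hypothetical true values
y_pred = np.array([2.8, 2.9, 4.2, 5.0])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)    # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error
mae = mean_absolute_error(y_true, y_pred)   # mean absolute error
r2 = r2_score(y_true, y_pred)               # coefficient of determination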


It is not a requirement that the target variable is normally distributed, though a normally distributed target variable often gives better results. What is required is that the errors (residuals) be normally distributed.

If the target variable is not normally distributed, you can make it so by transforming it, and then fit the regression to the transformed values.

To see whether the target variable is normally distributed, we can inspect it visually (e.g. a histogram or Q-Q plot) or run a statistical test.

To transform a variable (the target variable) to make it normally distributed, commonly used techniques are the log transform, the square-root transform, and the Box-Cox transform (used below).


How to use (a consolidated code sketch follows these steps):

  1. Import the estimator: from sklearn.linear_model import LinearRegression
  2. Create an object as LR = LinearRegression() . You can also pass many other hyperparameters when creating the object.
  3. Create the X dataframe from the actual dataset by dropping the target variable column, so that it is easy for further computation. Similarly, create y by grabbing just the target variable column.
  4. Fit and transform the X data with a polynomial features object.
    • First create an object as pf = PolynomialFeatures(degree=2, include_bias=False) , from sklearn.preprocessing . include_bias is False because LinearRegression will take care of the intercept later on.
    • Fit and transform as X_pf = pf.fit_transform(X) . X_pf now has many more columns than X had.
  5. Train-test split:
    • x_train, x_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, random_state=some_int) , where train_test_split is from sklearn.model_selection .
  6. Now apply a standard scaler to the training data.
    • s = StandardScaler()
    • x_train_s = s.fit_transform(x_train)
  7. Now, to bring the target variable closer to a normal distribution, use boxcox (from scipy.stats) as discussed above.
    • y_train_bc = boxcox(y_train)[0]
    • We also need the lambda value for later, when we need to compute the inverse: lam = boxcox(y_train)[1]
  8. Fit the training data as LR = LR.fit(x_train_s, y_train_bc) . Here we used:
    • the StandardScaler fit-and-transformed x data,
    • the Box-Cox transformed y data.
  9. Since the x data used for modelling was transformed, transform the x_test data with the already-fitted StandardScaler (transform only, no refitting), i.e. x_test_s = s.transform(x_test)
  10. The predicted values will be on the Box-Cox scale: y_pred_bc = LR.predict(x_test_s) .
  11. To bring y_pred_bc back to the same scale as y, apply the inverse Box-Cox: y_pred = inv_boxcox(y_pred_bc, lam) , where inv_boxcox is from scipy.special .
  12. Finally, compute the R2 score using R2 = r2_score(y_test, y_pred) .
  13. Using Box-Cox on the target variable can improve the R2 score (higher is better). In the above example, if done without Box-Cox, the R2 score would be lower.
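Putting the steps above together as one runnable sketch (this assumes a pandas DataFrame df with a strictly positive target column named 'target'; the names are illustrative):

import pandas as pd
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = df.drop(columns=['target'])            # input features
y = df['target']                           # target variable (must be positive for Box-Cox)

pf = PolynomialFeatures(degree=2, include_bias=False)
X_pf = pf.fit_transform(X)                 # adds squared and interaction terms

x_train, x_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, random_state=42)

s = StandardScaler()
x_train_s = s.fit_transform(x_train)       # fit the scaler on the training data only
x_test_s = s.transform(x_test)             # apply the same scaling to the test data

y_train_bc, lam = boxcox(y_train)          # Box-Cox transform the target; keep lambda for the inverse

LR = LinearRegression()
LR = LR.fit(x_train_s, y_train_bc)

y_pred_bc = LR.predict(x_test_s)           # predictions are on the Box-Cox scale
y_pred = inv_boxcox(y_pred_bc, lam)        # invert back to the original scale

R2 = r2_score(y_test, y_pred)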

Data Splits and Cross Validation

Split the data to create a hold-out set, so that it can be used for (cross-)validation.

Training data:

Test data:

Syntax for test train split
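A minimal sketch, assuming X and y already exist and using an arbitrary seed:

from sklearn.model_selection import train_test_split

# hold out 30% of the rows as a test set; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)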


Categorical data is one-hot encoded. The number of columns produced as a result of one-hot encoding is equal to the number of category values minus 1 (when the first level is dropped to avoid collinearity, e.g. drop_first=True in pd.get_dummies). The process is outlined below:

Using sklearn.preprocessing.OneHotEncoder , instead of pd.get_dummies
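A minimal OneHotEncoder sketch (the column name 'col1' is illustrative; get_feature_names_out is available in recent sklearn versions):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')          # pass drop='first' to mimic the k-1 column behaviour
encoded = ohe.fit_transform(df[['col1']]).toarray()   # fit_transform returns a sparse matrix by default
columns = ohe.get_feature_names_out()                 # names of the generated indicator columns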

Another way of dealing with categorical data is the LabelEncoder class in sklearn. This encoder should be used for encoding the target value y, not the input variables X. It transforms the column into numerical values from 0 to (number of unique values - 1).

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['col1'] = le.fit_transform(df['col1'])


When applying a scaler like StandardScaler or MinMaxScaler, make sure to call fit_transform() on the training data but only transform() on the test data. This should be done before applying LinearRegression. An example is given below.
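A minimal sketch of that pattern, assuming x_train, x_test, and y_train already exist:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

s = StandardScaler()
x_train_s = s.fit_transform(x_train)   # learn the mean/std from the training data and scale it
x_test_s = s.transform(x_test)         # apply the same mean/std to the test data (no refitting)

LR = LinearRegression().fit(x_train_s, y_train)
y_pred = LR.predict(x_test_s)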

Cross Validation

Here we use multiple validation sets. These are test sets that are disjoint from each other. For each validation set, the training data can be a subset of the remaining data. There should be no overlap between the validation (test) splits across iterations of the experiment; the training data can overlap.

The average error across all these validation sets is the cross-validation result.

As the model gets more complex, the training error keeps decreasing. With cross-validation, however, there is an inflection point: as complexity increases further, the error starts to increase. This is because an overly complex model will not generalize properly and overfits. Thus we should stop increasing the complexity as soon as the cross-validation error starts to increase.

Coding example:
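A minimal cross-validation sketch with cross_val_score (X and y assumed to exist; the fold count is illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
scores = cross_val_score(lr, X, y, cv=4, scoring='r2')   # one score per disjoint validation fold
print(scores.mean())                                     # the average is the cross-validation result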

Using Pipeline and Cross-Validation


Using pipeline
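A minimal pipeline sketch, combined with cross-validation as in the heading above (step names are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

estimator = Pipeline([
    ('scaler', StandardScaler()),                # refit on each training fold only
    ('linear_regression', LinearRegression()),
])
scores = cross_val_score(estimator, X, y, cv=4)  # the pipeline keeps the validation fold out of the scaling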


Hyperparameter Tuning, Lasso Regression, PolynomialFeatures

You can add PolynomialFeatures to the pipeline. What PolynomialFeatures does is well explained at this link. In a nutshell, it raises the input variables to a polynomial degree, so after applying PolynomialFeatures the input dimension increases.

The above example thus becomes:
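A sketch of the extended pipeline, scanning a few alpha values (degree and the alpha grid are illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

alphas = np.geomspace(0.001, 10, 10)
scores = []
for alpha in alphas:
    estimator = Pipeline([
        ('polynomial_features', PolynomialFeatures(degree=2, include_bias=False)),
        ('scaler', StandardScaler()),
        ('lasso_regression', Lasso(alpha=alpha, max_iter=10000)),
    ])
    scores.append(cross_val_score(estimator, X, y, cv=4).mean())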

After going through the different scores for the various alpha values, we can find the best among them, and finally train the model as follows:
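Continuing the sketch above (reusing alphas, scores, and the same imports):

best_alpha = alphas[np.argmax(scores)]   # alpha with the highest average cross-validation score
best_estimator = Pipeline([
    ('polynomial_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('lasso_regression', Lasso(alpha=best_alpha, max_iter=10000)),
])
best_estimator.fit(X, y)                 # final fit on all the training data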

Stratified cross-validation is not equivalent to k-fold cross-validation with k = N-1, where N is the number of features.

For a linear regression model, stratified cross-validation with the same k will not increase the variance of the estimated parameters, compared to k-fold cross-validation.

For k-fold cross-validation, the variance of the estimated model parameters across subsamples increases as k increases.


Grid Search CV
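A minimal GridSearchCV sketch over the same kind of pipeline (the parameter grid values are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('polynomial_features', PolynomialFeatures(include_bias=False)),
    ('scaler', StandardScaler()),
    ('lasso_regression', Lasso(max_iter=10000)),
])
params = {
    'polynomial_features__degree': [1, 2, 3],           # step-name__parameter naming convention
    'lasso_regression__alpha': [0.01, 0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipeline, params, cv=4)
grid.fit(X, y)
grid.best_params_, grid.best_score_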

Polynomial Regression

Following are some of the approaches for dealing with the fundamental problems of prediction and interpretation:

We want to capture higher-order features of the data by adding polynomial features. The relationship is still linear in the coefficients, but includes higher-degree terms.

This can also include variable interactions, like x1*x2, in addition to x1^2 and x2^2 from x1 and x2.

Using polynomial features as in the example above, we do the following:
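A minimal sketch with two toy features x1 and x2 (values are purely illustrative):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})

pf = PolynomialFeatures(degree=2, include_bias=False)
X_pf = pf.fit_transform(X)
pf.get_feature_names_out()   # ['x1', 'x2', 'x1^2', 'x1 x2', 'x2^2']; note the interaction term x1 x2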

This will create the interaction terms also.


Bias and Variance

Bias: a tendency to miss the target. High-bias predictions share a similar systematic error; they generally come with low variance, with the data points consistently off in the same way.

Variance: a tendency to be inconsistent. It can be thought of as a measure of the model's sensitivity to the particular training data.

The image above shows an example of bias and variance in terms of shooting at a target. Ideally, we would want our model to produce the top-left outcome, with small bias and the predictions (or shots) very close to the target.

Three sources of error in a model: bias, variance, and irreducible error (noise).

For polynomial regression, a higher degree means a more complex model, hence higher variance and lower bias.


Regularization and Model Selection

Regularization is an approach to handle over-fitting.

Add a new term to the old cost function, weighted by a regularization strength parameter. Thus an adjusted cost function is created as M(w) + lambda * R(w) , where M(w) is the model error, R(w) is a function of the estimated parameters, and lambda is the strength parameter.

The adjustable regularization strength parameter can be used to penalize the model if it is too complex; it can be used to dumb down the model. lambda adds a penalty proportional to the size of the estimated model parameters.

Regularization strength parameter allows us to manage complexity tradeoffs:


Ridge Regression
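A minimal Ridge sketch, reusing the scaled train/test splits from the linear regression example above (the alpha value is illustrative):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)           # alpha is the regularization strength (the lambda above)
ridge.fit(x_train_s, y_train)      # the L2 penalty shrinks coefficients toward zero
y_pred = ridge.predict(x_test_s)
ridge.coef_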


Lasso Regression
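A minimal Lasso sketch under the same assumptions:

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, max_iter=10000)   # the L1 penalty can drive some coefficients exactly to zero
lasso.fit(x_train_s, y_train)
y_pred = lasso.predict(x_test_s)
lasso.coef_                                # sparse coefficients double as feature selection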


Elastic Net
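A minimal ElasticNet sketch under the same assumptions:

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)   # l1_ratio blends the L1 (Lasso) and L2 (Ridge) penalties
enet.fit(x_train_s, y_train)
y_pred = enet.predict(x_test_s)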


Recursive Feature Elimination (RFE): RFE is an approach that combines model fitting with feature selection.

RFE repeatedly applies the model, measures feature importance and recursively removes the less important features.

RFE example is as follows:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

est = LinearRegression()                 # any estimator that exposes coef_ or feature_importances_
rfe = RFE(est, n_features_to_select=5)   # keep the 5 most important features
rfe = rfe.fit(X_train, y_train)
y_pred = rfe.predict(X_test)

The class RFECV uses RFE with cross-validation.


So, the overall steps so far for LinearRegression are:


For Lasso regression, we also introduce PolynomialFeatures with a certain degree.


Logistic Regression

Used for classification. An extension of Linear Regression, but it handles what Linear Regression gets wrong for classification.

Sigmoid Function
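sigmoid(z) = 1 / (1 + exp(-z)). It squashes any real-valued input into the range (0, 1), which can be read as the probability of the positive class.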

The use of the sigmoid is useful for classification with clearly separable classes (for example), as the image below presents.

We can see how it can classify between two classes with a clear decision boundary.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', C=10.0)
lr = lr.fit(x_train, y_train)
y_predict = lr.predict(x_test)
lr.coef_

Above, C is the inverse regularization strength (a higher C means a smaller lambda, i.e. weaker regularization).

LogisticRegression also comes with LogisticRegressionCV.


Confusion Matrix :
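A minimal sketch, assuming y_test and y_predict from the logistic regression example above:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_predict)   # rows = true classes, columns = predicted classes
tn, fp, fn, tp = cm.ravel()                # binary case: true negatives, false positives, false negatives, true positives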


ROC Curve: Receiver Operating Characteristic

What is the right approach for choosing the classifier:

In sklearn
from sklearn.metrics import accuracy_score
accuracy_value = accuracy_score(y_test,y_pred)

Other accuracy metrics:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, precision_recall_curve


Logistic regression example
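A fuller end-to-end sketch, assuming a feature matrix X and a binary label vector y already exist (all other names are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

s = StandardScaler()
x_train_s = s.fit_transform(x_train)
x_test_s = s.transform(x_test)

lr = LogisticRegression(penalty='l2', C=10.0)
lr = lr.fit(x_train_s, y_train)
y_pred = lr.predict(x_test_s)

print(accuracy_score(y_test, y_pred),
      precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))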


k-Nearest Neighbour Classification

We can also do regression using kNN; the predicted value is the mean of the neighbors' target values.

Distance measurement in kNN: commonly Euclidean (L2) or Manhattan (L1) distance.

from sklearn.neighbors import KNeighborsClassifier

kNN = KNeighborsClassifier(n_neighbors=3)
kNN = kNN.fit(X_train, y_train)
y_predict = kNN.predict(X_test)

For regression, KNeighborsRegressor can be used instead; make sure that y_train and y_test contain continuous values.


Support Vector Machines (SVM)

Non-linear decision boundaries using kernels.
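A minimal kernel-SVM sketch (hyperparameter values are illustrative; X_train etc. as before):

from sklearn.svm import SVC

svc = SVC(kernel='rbf', C=1.0, gamma='scale')   # the RBF kernel gives a non-linear decision boundary
svc = svc.fit(X_train, y_train)
y_predict = svc.predict(X_test)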

When to choose what?

Features      | Data                | Model
Many (~10K)   | Small (~1K rows)    | Simple: Logistic Regression or LinearSVC
Few (~100)    | Medium (~10K rows)  | SVC with RBF kernel
Few (~100)    | Many (>100K rows)   | Add features; Logistic Regression, LinearSVC, or kernel approximation


Decision Trees
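A minimal decision tree sketch (max_depth is illustrative; limiting depth is one way to control overfitting):

from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier(max_depth=5)
DT = DT.fit(X_train, y_train)
y_predict = DT.predict(X_test)
DT.feature_importances_                  # relative importance of each input feature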


Ensemble Methods and Bagging (MetaClassifiers)
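A minimal bagging sketch (the default base learner is a decision tree; n_estimators is illustrative):

from sklearn.ensemble import BaggingClassifier

# bagging: fit many base learners on bootstrapped samples and aggregate their predictions by voting
BC = BaggingClassifier(n_estimators=50)
BC = BC.fit(X_train, y_train)
y_predict = BC.predict(X_test)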

Random Forest

from sklearn.ensemble import RandomForestClassifier

RC = RandomForestClassifier(n_estimators=50)
RC = RC.fit(X_train, y_train)
y_predict = RC.predict(X_test)

For regression, use RandomForestRegressor.

In cases where random forest does not reduce the variance enough, we can introduce more randomness.

Boosting

Bagging | Boosting
Uses a subsample of the data, i.e. fits only on a bootstrapped sample. | Can fit on the entire data set.
Each base learner (small tree) is independent of the others. | Base trees are created successively; each learner builds on top of the previous steps.
Uses only bootstrapped data. | Uses all data, including residuals from the previous model.
No weighting used. | Incorrect or misclassified points are weighted heavily.
No overfitting possible. | There is a chance of overfitting.

from sklearn.ensemble import GradientBoostingClassifier

GBC = GradientBoostingClassifier(learning_rate=0.1, max_features=2, subsample=0.5, n_estimators=200)
GBC = GBC.fit(X_train, y_train)
y_predict = GBC.predict(X_test)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ABC = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), learning_rate=0.1, n_estimators=200)
ABC = ABC.fit(X_train, y_train)
y_predict = ABC.predict(X_test)

Stacking

from sklearn.ensemble import VotingClassifier

VC = VotingClassifier(estimator_list)   # estimator_list is a list of (name, estimator) pairs
VC = VC.fit(X_train, y_train)
y_predict = VC.predict(X_test)
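The VotingClassifier above combines the base models by voting; for stacking in the stricter sense, sklearn also provides StackingClassifier, sketched below (estimator_list is the same list of (name, estimator) pairs):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# a meta-model (here logistic regression) is trained on the base estimators' predictions
SC = StackingClassifier(estimators=estimator_list, final_estimator=LogisticRegression())
SC = SC.fit(X_train, y_train)
y_predict = SC.predict(X_test)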


Unbalanced Classes