Data Preprocessing in Data Science

Data analysis in data science begins with a process in which a machine learns from training data (the training set). If the training data contains dirty data, the information extracted from it is dirty too, which means the conclusions of the analysis are weak. Cleaning the collected raw data before it is processed is therefore a crucial step in data analysis, and the quality of the training data largely determines the quality of the resulting analysis. In a machine learning pipeline, data preprocessing is the first stage of the whole data science process, as shown in the following figure.

Overview

Dealing with missing data
Handling categorical data
Partitioning a dataset in training and test sets
Bringing features onto the same scale
Selecting meaningful features
- Sparse solutions with L1 regularization
- Sequential feature selection algorithms
Assessing feature importance with random forests
Summary

Data Preprocessing

In data science, data preprocessing covers several kinds of tasks:

Handling missing data (missing values)
Feature selection
Feature extraction
Data transformation
Data scaling

https://medium.com/datadriveninvestor/finding-outliers-in-dataset-using-python-efc3fce6ce32 (finding outliers)

Handling Missing Data

Missing data can be handled in two ways:
- Eliminating the samples or features that contain missing values
- Imputing the missing values

Data can go missing for several reasons, including:

Errors in data collection
Malfunctioning measurement instruments

Many machine learning algorithms do not work reliably when data is missing. We therefore need to resolve missing data before training a model.

We use pandas (a Python library for data analysis) to deal with missing data, as in the following example:

from IPython.display import Image

%matplotlib inline
# Added version check for recent scikit-learn 0.18 checks
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,,,8.0
10.0,,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	NaN	NaN	8.0
2	10.0	NaN	12.0	NaN

There are 4 columns, the features A, B, C and D.

Rows 0, 1 and 2 are the data samples (objects).

NaN marks a missing value: no number was entered for that cell.

The following command returns True for every cell that holds a missing value and False otherwise:

df.isnull()

	A	B	C	D
0	False	False	False	False
1	False	True	True	False
2	False	True	False	True

The following command counts the missing values in each feature (column):

df.isnull().sum(axis=0)

A    0
B    2
C    1
D    1
dtype: int64

Eliminating samples or features with missing values

One simple strategy is to drop the samples (table rows) or features (table columns) that contain missing values, based on some criterion.

# drop samples (rows) that contain any NaN
df.dropna()

	A	B	C	D
0	1.0	2.0	3.0	4.0

# drop features (columns) that contain any NaN
df.dropna(axis=1)

	A
0	1.0
1	5.0
2	10.0

# only drop rows where all columns/features are NaN
df.dropna(how='all')

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	NaN	NaN	8.0
2	10.0	NaN	12.0	NaN

# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

	A	B	C	D
0	1.0	2.0	3.0	4.0

# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

	A	B	C	D
0	1.0	2.0	3.0	4.0
2	10.0	NaN	12.0	NaN

Dropping data may be undesirable because it leaves us with less data. An alternative is to impute the missing values, that is, to fill them in with estimates. One common choice is to replace each missing value with a statistic of its column.

import numpy as np
from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer replaces it and always imputes column by column.
# Available strategies include 'mean', 'median' and 'most_frequent' (the mode)

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  2. ,  7.5,  8. ],
       [10. ,  2. , 12. ,  6. ]])

In the output above, 7.5 is the mean of 3 and 12, and 6 is the mean of 4 and 8.

df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5., nan, nan,  8.],
       [10., nan, 12., nan]])

We can do better than this by interpolating from only the rows/objects that are most similar to the incomplete one, rather than from all rows. This is how recommender systems can work, for example when predicting the rating of a film or a book. See Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran; it is a very good book on recommender systems and search engines.
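As a rough illustration of interpolating from similar rows only, the sketch below fills each missing value with the mean of that column over the k most similar rows. The function name knn_impute, the toy matrix and the distance (mean squared difference over the features both rows have observed) are assumptions made for this example, not something taken from the text or from a library.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the column mean over the k most similar rows (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        for j in np.where(missing)[0]:
            candidates = []
            for m, other in enumerate(X):
                # skip the row itself and rows that also lack feature j
                if m == i or np.isnan(other[j]):
                    continue
                shared = ~np.isnan(row) & ~np.isnan(other)
                if not shared.any():
                    continue
                # similarity: mean squared difference over shared features
                dist = np.mean((row[shared] - other[shared]) ** 2)
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda t: t[0])
                filled[i, j] = np.mean([v for _, v in candidates[:k]])
    return filled

X_toy = np.array([[1.0, 2.0, 3.0],
                  [1.1, 2.1, np.nan],
                  [9.0, 8.0, 30.0],
                  [1.2, 1.9, 3.4]])
X_filled = knn_impute(X_toy, k=2)
print(X_filled[1, 2])  # mean of 3.0 and 3.4, the values of the two closest rows
```

A plain column mean would be pulled towards 30.0 by the dissimilar third row, whereas the neighbor-based estimate ignores it.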

Understanding the scikit-learn estimator API

Transformer class for data transformation
- e.g. the imputer

Key methods
- fit() for fitting from (training) data
- transform() for transforming future data based on the fitted data

Good API designs are consistent. For example, the fit() method has similar meanings for different classes, such as transformer and estimator.
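To make the fit()/transform() convention concrete, here is a minimal sketch of a transformer written from scratch. The class ColumnMeanImputer is hypothetical and only mimics the convention; it is not a scikit-learn class.

```python
import numpy as np

class ColumnMeanImputer:
    """Hypothetical transformer that follows the fit/transform convention."""

    def fit(self, X):
        # learn the per-column means from the training data, ignoring NaN
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        # apply the means learned in fit() to any (possibly new) data
        X = np.array(X, dtype=float)  # np.array makes a copy
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]
        return X

X_fit = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
X_new = np.array([[np.nan, 10.0], [7.0, np.nan]])

imputer = ColumnMeanImputer().fit(X_fit)
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

The point of the split is that the statistics come from the training data only; new data is transformed with those stored values rather than with its own means.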

Transformer

Estimator

Handling different types of data

Features come in different types: numerical and categorical. Numerical features are numbers, often continuous, such as real numbers. Categorical features are discrete and are either nominal or ordinal:
- Ordinal values are discrete but carry a numerical meaning, so they can be sorted or ranked
- Nominal values have no numerical meaning

In the following example:
- warna (color) is nominal (no numerical meaning)
- ukuran (size) is ordinal (it can be sorted)
- harga (price) is numerical

A dataset can contain several of these types at once, so it is important to handle them carefully. We cannot treat nominal values as numbers without first mapping them appropriately.

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'kelas1'],
                   ['red', 'L', 13.5, 'kelas2'],
                   ['blue', 'XL', 15.3, 'kelas1']])

df.columns = ['warna', 'ukuran', 'harga', 'kelas']
df

	warna	ukuran	harga	kelas
0	green	M	10.1	kelas1
1	red	L	13.5	kelas2
2	blue	XL	15.3	kelas1

Converting data

Some classification methods, such as decision trees, handle one feature at a time. For these, leaving the features unconverted causes no problem.

Other classification methods need to handle several features jointly, so we have to convert the features into a suitable form before further processing:
1. convert categorical values to numerical values
2. scale/normalize the numerical values

Mapping ordinal features

Ordinal features can be converted to numbers, but the conversion always depends on the semantics, so the mapping has to be specified manually by a person rather than derived automatically by the machine.

In the following example we map sizes to numbers. Typically, larger sizes are represented by larger numbers.

We use a Python dictionary to define the mapping:

pemetaan_ukuran = {'XL': 3,
                   'L': 2,
                   'M': 1}

df['ukuran'] = df['ukuran'].map(pemetaan_ukuran)
df

	warna	ukuran	harga	kelas
0	green	1	10.1	kelas1
1	red	2	13.5	kelas2
2	blue	3	15.3	kelas1

# inverse mapping: from numbers back to size labels
inv_pemetaan_ukuran = {v: k for k, v in pemetaan_ukuran.items()}
df['ukuran'].map(inv_pemetaan_ukuran)

0     M
1     L
2    XL
Name: ukuran, dtype: object

Encoding class labels

Machine learning libraries often require class labels to be represented as integers
- typically small integers such as 0, 1, 2, ...
- the labels are not ordinal

import numpy as np

pemetaan_kelas = {label: idx for idx, label in enumerate(np.unique(df['kelas']))}
pemetaan_kelas

{'kelas1': 0, 'kelas2': 1}

# forward map
df['kelas'] = df['kelas'].map(pemetaan_kelas)
df

	warna	ukuran	harga	kelas
0	green	1	10.1	0
1	red	2	13.5	1
2	blue	3	15.3	0

# inverse of the class mapping
inv_pemetaan_kelas = {v: k for k, v in pemetaan_kelas.items()}
df['kelas'] = df['kelas'].map(inv_pemetaan_kelas)
df

	warna	ukuran	harga	kelas
0	green	1	10.1	kelas1
1	red	2	13.5	kelas2
2	blue	3	15.3	kelas1

We can use LabelEncoder from scikit-learn to convert class labels automatically:

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['kelas'].values)
df['kelas'] = y
df

	warna	ukuran	harga	kelas
0	green	1	10.1	0
1	red	2	13.5	1
2	blue	3	15.3	0

class_le.inverse_transform(y)

array(['kelas1', 'kelas2', 'kelas1'], dtype=object)

One-hot encoding of nominal features

Unlike class labels, however, nominal features (such as color) cannot simply be converted to integers.

A common mistake is to map nominal features directly to numerical values, e.g. for colors
- blue $\rightarrow$ 0
- green $\rightarrow$ 1
- red $\rightarrow$ 2

X = df[['warna', 'ukuran', 'harga']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

For categorical features it is important to preserve "equal distance"
- unless you have a reason not to

For example, for the colors red, green and blue we want to convert them to values such that every color is the same distance from every other color.

This cannot be done in 1D but can be done in 2D.

One-hot encoding is a straightforward way to do this, mapping a nominal feature with n values to n-dimensional binary vectors.
- blue $\rightarrow$ (1, 0, 0)
- green $\rightarrow$ (0, 1, 0)
- red $\rightarrow$ (0, 0, 1)
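The equal-distance argument can be checked numerically. The sketch below compares pairwise Euclidean distances under the integer coding and under one-hot coding; the helper name pairwise_distances is an assumption made for this example.

```python
import numpy as np
from itertools import combinations

colors = ['blue', 'green', 'red']

# integer coding: blue -> 0, green -> 1, red -> 2
int_codes = {c: np.array([float(i)]) for i, c in enumerate(colors)}
# one-hot coding: each color becomes one row of the identity matrix
one_hot = {c: np.eye(len(colors))[i] for i, c in enumerate(colors)}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of encoded colors."""
    return {(a, b): float(np.linalg.norm(encoding[a] - encoding[b]))
            for a, b in combinations(colors, 2)}

int_dist = pairwise_distances(int_codes)
hot_dist = pairwise_distances(one_hot)

print(int_dist)  # blue-red is twice as far apart as blue-green
print(hot_dist)  # every pair is the same distance apart
```

With integer codes a distance-based model would treat blue and red as less alike than blue and green, an ordering that the data does not contain.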

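from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# categorical_features was removed in scikit-learn 0.22; use ColumnTransformer on column 0
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough')
ct.fit_transform(X).astype(float)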

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

# the same encoding, done automatically with the get_dummies method in pandas (pd)
pd.get_dummies(df[['harga', 'warna', 'ukuran']])

	harga	ukuran	warna_blue	warna_green	warna_red
0	10.1	1	0	1	0
1	13.5	2	0	0	1
2	15.3	3	1	0	0

df

	warna	ukuran	harga	kelas
0	green	1	10.1	0
1	red	2	13.5	1
2	blue	3	15.3	0

Binning

Often we need to do the reverse of what we’ve done above. That is, convert continuous features to discrete values. For instance, we want to convert the output to 0 or 1 depending on the threshold.

from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names

Now we’ll binarize the sepal width with 0 or 1 indicating whether the current value is below or above mean.

X[:, 1]

array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])

from sklearn.preprocessing import Binarizer
X[:, 1:2] = Binarizer(threshold=X[:, 1].mean()).fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]

array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])
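Binarizing is the two-bin case of binning. As a sketch of the more general case, np.digitize assigns each value to one of several bins; the sample values and the cut points below are arbitrary choices for illustration, not thresholds suggested by the text.

```python
import numpy as np

sepal_width = np.array([2.0, 2.8, 3.0, 3.4, 4.4])
bin_edges = np.array([2.5, 3.0, 3.5])  # hypothetical cut points

# bin i holds values with bin_edges[i-1] <= x < bin_edges[i];
# values below the first edge get 0, values at or above the last edge get 3
bin_index = np.digitize(sepal_width, bin_edges)
print(bin_index)
```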

Partitioning a dataset into training and test sets

The training data is used to train the model.

The test data is used to evaluate the trained model.

We split the data in two to guard against over-fitting, so that we can check whether the model generalizes
- well-trained models should generalize to unseen test data

A validation set is used to select hyper-parameters
- parameters are learned by the algorithm
- hyper-parameters are chosen by a person
- this is explained in more detail in a later chapter
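A three-way partition into training, validation and test indices can be sketched with NumPy alone. The function name three_way_split and the 60:20:20 proportions are assumptions chosen for illustration.

```python
import numpy as np

def three_way_split(n_samples, val_size=0.2, test_size=0.2, seed=0):
    """Return shuffled, non-overlapping train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    n_val = int(round(n_samples * val_size))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

# 178 is the number of samples in the wine data used in these notes
train_idx, val_idx, test_idx = three_way_split(178)
print(len(train_idx), len(val_idx), len(test_idx))
```

The model is fitted on the training indices, hyper-parameters are compared on the validation indices, and the test indices are touched only once for the final evaluation.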

Wine dataset

This dataset is used to classify wines based on 13 features; it contains 178 samples.

import numpy as np
import pandas as pd

wine_data_remote = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
# alternatively, point this at a local copy of the file:
# wine_data_local = '../datasets/wine/wine.data'

df_wine = pd.read_csv(wine_data_remote,
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

Class labels [1 2 3]

	Class label	Alcohol	Malic acid	Ash	Alcalinity of ash	Magnesium	Total phenols	Flavanoids	Nonflavanoid phenols	Proanthocyanins	Color intensity	Hue	OD280/OD315 of diluted wines	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

How do we choose the proportions of training and test data?

How much training data is needed for the model to be accurate?

How much test data is needed for a reliable evaluation?

Common rules of thumb are 60:40, 70:30 and 80:20.

Larger datasets can devote a larger proportion to training
- e.g. 90:10

Other ways of partitioning are discussed in the chapter on model evaluation and parameter selection.

# sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)
    
print(X.shape)

import numpy as np
print(np.unique(y))

(178, 13)
[1 2 3]

Bringing features onto the same scale

Most machine learning algorithms perform better when the features are on the same scale.

Exceptions
- Decision trees
- Random forests

Example

Two features, on the scales [0 1] and [0 100000]

Think about what happens when we use
- perceptron
- KNN
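For KNN the effect can be seen directly in the distances. The three points below are made up for illustration: b differs strongly from a in the small-scale feature, c differs from a by only 1% of the range in the large-scale feature.

```python
import numpy as np

# feature 0 lies in [0, 1], feature 1 lies in [0, 100000]
a = np.array([0.1, 50000.0])
b = np.array([0.9, 50010.0])  # far from a in feature 0, almost equal in feature 1
c = np.array([0.1, 51000.0])  # equal to a in feature 0, 1% of the range away in feature 1

# raw Euclidean distances: feature 1 dominates
d_ab = np.linalg.norm(a - b)
d_ac = np.linalg.norm(a - c)

# distances after dividing each feature by its range (min-max scaling)
ranges = np.array([1.0, 100000.0])
d_ab_scaled = np.linalg.norm((a - b) / ranges)
d_ac_scaled = np.linalg.norm((a - c) / ranges)

print(d_ab, d_ac)                # b looks like the nearer neighbor of a
print(d_ab_scaled, d_ac_scaled)  # after scaling, c is the nearer neighbor
```

Without scaling, the first feature has almost no influence on which neighbors are chosen.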

There are two common approaches

Normalization

Min-max scaling: $$\frac{x - x_{min}}{x_{max} - x_{min}}$$

Standardization

Standard scaling: $$\frac{x - x_\mu}{x_\sigma}$$
- $x_\mu$: mean of the x values
- $x_\sigma$: standard deviation of the x values

Standardization is more commonly used than normalization because normalization is sensitive to outliers.

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

Worked example: standardization and normalization

import pandas as pd
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardization
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and StandardScaler
# use ddof=0 (population standard deviation)

# normalization
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardisasi', 'normalisasi']
ex

	input	standardisasi	normalisasi
0	0	-1.46385	0.0
1	1	-0.87831	0.2
2	2	-0.29277	0.4
3	3	0.29277	0.6
4	4	0.87831	0.8
5	5	1.46385	1.0
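The ddof remark in the code comments can be verified with NumPy on the same six input values. This is only a check of the arithmetic; the variable names are chosen for this example.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# population standard deviation (ddof=0), as used by NumPy's std and StandardScaler
z_pop = (x - x.mean()) / x.std(ddof=0)
# sample standard deviation (ddof=1), the pandas default
z_sample = (x - x.mean()) / x.std(ddof=1)

print(np.round(z_pop, 5))     # matches the 'standardisasi' column above
print(np.round(z_sample, 5))  # slightly smaller in magnitude
```

The two versions differ only by the constant factor sqrt(n / (n - 1)), which becomes negligible for large n.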

Selecting meaningful features

Overfitting is a common problem in machine learning.
- the model fits the training data very well but fails to generalize to real data
- the model is too complex for the given training data

Ways to address overfitting

Collect more training data (to reduce overfitting)
Reduce the complexity of the model, e.g. the number of parameters
Reduce the complexity of the model indirectly via regularization
Reduce the dimensionality of the data, which forces a simpler model

The amount of data should be sufficient relative to the model complexity.

Objective

We can add the loss and the regularization term into a total objective: $$\Phi(\mathbf{X}, \mathbf{T}, \Theta) = L\left(\mathbf{X}, \mathbf{T}, \mathbf{Y}=f(\mathbf{X}, \Theta)\right) + P(\Theta)$$

During training, the goal is to optimize the parameters $\Theta$ with respect to the given training data $\mathbf{X}$ and $\mathbf{T}$: $$\arg\min_\Theta \; \Phi(\mathbf{X}, \mathbf{T}, \Theta)$$ and to hope that the trained model generalizes well to future data.

Loss

Every machine learning task has a goal that can be expressed as a loss function: $$L(\mathbf{X}, \mathbf{T}, \mathbf{Y})$$ where $\mathbf{T}$ is some form of target or auxiliary information, such as:
- labels for supervised classification
- the number of clusters for unsupervised clustering
- the environment for reinforcement learning

Regularization

In addition to the objective, we often care about the simplicity of the model, for better efficiency and generalization (avoiding over-fitting). The complexity of a model can be measured by a penalty function: $$P(\Theta)$$ Common penalty functions depend on the number and/or the magnitude of the parameters.
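The decomposition of the objective into loss plus penalty can be written out for a small case. The sketch below assumes a linear model for f, mean squared error for L and an L2 penalty for P; these are illustrative choices, since the text leaves all three generic, and the weight lam is a made-up value.

```python
import numpy as np

def loss(T, Y):
    # L(X, T, Y): mean squared error between targets and predictions
    return float(np.mean((T - Y) ** 2))

def penalty(theta, lam=0.1):
    # P(Theta): squared L2 norm of the parameters, weighted by lam
    return float(lam * np.sum(theta ** 2))

def objective(X, T, theta, lam=0.1):
    # Phi(X, T, Theta) = L(X, T, Y = f(X, Theta)) + P(Theta), with f linear
    Y = X @ theta
    return loss(T, Y) + penalty(theta, lam)

X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T_toy = np.array([1.0, 2.0, 3.0])

theta_exact = np.array([1.0, 2.0])  # fits the targets exactly, zero loss
theta_small = np.array([0.9, 1.8])  # small loss, but smaller weights

phi_exact = objective(X_toy, T_toy, theta_exact)
phi_small = objective(X_toy, T_toy, theta_small)
print(phi_exact, phi_small)
```

Here the slightly shrunken parameters give a lower total objective than the exact fit, which shows how the penalty trades training error against model simplicity.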

Regularization

For the weight vector $\mathbf{w}$ of a model (e.g. perceptron or SVM)

$L_2$: $\|\mathbf{w}\|_2^2 = \sum_{k} w_k^2$

$L_1$: $\|\mathbf{w}\|_1 = \sum_{k} \left|w_k\right|$

$L_1$ tends to produce sparser solutions than $L_2$
- more zero weights
- closer to feature selection
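One way to see why $L_1$ produces exact zeros is to compare a single shrinkage step under each penalty. The sketch below uses the proximal steps of the two penalties (soft-thresholding for $L_1$, uniform rescaling for $L_2$); this is a standard optimization view brought in for illustration, not a method described in the text, and the weights and step size are made up.

```python
import numpy as np

w = np.array([0.05, -0.3, 0.0, 1.2])

l2_penalty = float(np.sum(w ** 2))     # ||w||_2^2
l1_penalty = float(np.sum(np.abs(w)))  # ||w||_1

def l1_shrink(w, t):
    # soft-thresholding: proximal step for t * ||w||_1
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_shrink(w, t):
    # proximal step for (t / 2) * ||w||_2^2: every weight is scaled by the same factor
    return w / (1.0 + t)

w_l1 = l1_shrink(w, 0.1)
w_l2 = l2_shrink(w, 0.1)

print(w_l1)  # the small weight 0.05 is set exactly to zero
print(w_l2)  # all non-zero weights shrink but stay non-zero
```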

We are more likely to bump into sharp corners of an object.

Experiment: drop a circle and a square into a flat floor. What is the probability of hitting any point on the shape?

How about a non-flat floor, e.g. concave or convex with different curvatures?

Regularization in scikit-learn

Many ML models support regularization with different
- methods (e.g. $L_1$ and $L_2$)
- strength (the $C$ value is inversely proportional to the regularization strength)

import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression

# l1 regularization
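# the default lbfgs solver does not support the l1 penalty, so liblinear is used
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')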
lr.fit(X_train_std, y_train)

# compare training and test accuracy to see if there is overfitting
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

# 3 sets of parameters due to one-versus-rest with 3 classes
lr.intercept_

array([-0.38379411, -0.15807589, -0.70040327])

# 13 coefficients for 13 wine features; notice many of them are 0
lr.coef_

array([[ 0.28031365,  0.        ,  0.        , -0.02824937,  0.        ,
         0.        ,  0.70991702,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.23620609],
       [-0.64400971, -0.06874145, -0.05721952,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.9267147 ,
         0.06019752,  0.        , -0.37102696],
       [ 0.        ,  0.06145033,  0.        ,  0.        ,  0.        ,
         0.        , -0.63649346,  0.        ,  0.        ,  0.4983057 ,
        -0.3581246 , -0.57078233,  0.        ]])

import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression

# l2 regularization
lr = LogisticRegression(penalty='l2', C=0.1)
lr.fit(X_train_std, y_train)

# compare training and test accuracy to see if there is overfitting
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

# notice the disappearance of 0 coefficients due to L2
lr.coef_

array([[ 0.58228361,  0.04305595,  0.27096654, -0.53333363,  0.00321707,
         0.29820868,  0.48418851, -0.14789735, -0.00451997,  0.15005795,
         0.08295104,  0.38799131,  0.80127898],
       [-0.71490217, -0.35035394, -0.44630613,  0.32199115, -0.10948893,
        -0.03572165,  0.07174958,  0.04406273,  0.20581481, -0.71624265,
         0.39941835,  0.17538899, -0.72445229],
       [ 0.18373457,  0.32514838,  0.16359432,  0.15802432,  0.09025052,
        -0.20530058, -0.53304855,  0.1117135 , -0.21005439,  0.62841547,
        -0.4911972 , -0.55819761, -0.04081495]])

Plot regularization

$C$ is inverse to the regularization strength

import warnings
warnings.filterwarnings('ignore') # suppress warnings in the Jupyter notebook

import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.subplot(111)
    
colors = ['blue', 'green', 'red', 'cyan', 
          'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4, 6,dtype  = float):
    lr = LogisticRegression(penalty='l1', C=10**c, solver='liblinear', random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
# plt.savefig('./figures/l1_path.png', dpi=300)
plt.show()

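[Figure: weight coefficient of each wine feature as a function of C (log scale) under L1 regularization]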

Dimensionality reduction

$L_1$ regularization implicitly selects features by zeroing out weights

Feature selection
- explicit: you specify how many features to select, and the algorithm picks the most relevant (not necessarily the most important) ones
- forward, backward
- next topic

Note: 2 important features might be highly correlated, so it can be sufficient to select only 1 of them

Feature extraction
- implicit
- can build new features, not just select original ones
- e.g. PCA
- next chapter

Sequential feature selection algorithms

Feature selection is a way of reducing the dimensionality of the data; in other words, it reduces the number of columns. How do we decide which features to keep? Naturally, we want to keep the relevant features and discard the irrelevant ones.

We can select the features sequentially, in either the forward or the backward direction.

Backward selection

Sequential backward selection (SBS) is a simple heuristic. The basic idea is to start with $n$ features, consider all possible subsets of $n-1$ features, and remove the feature that matters least for training the model.

We then move on to reduce the number of features further ($[n-2, n-3, \cdots]$) until we reach the desired number of features.

from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, 
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, 
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score

Below we apply the SBS class defined above. We use the KNN classifier, which can suffer from the curse of dimensionality.

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)

# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('./sbs.png', dpi=300)
plt.show()

# list the 5 most important features
k5 = list(sbs.subsets_[8]) # 5+8 = 13
print(df_wine.columns[1:][k5])

Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')

knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))

Training accuracy: 0.983870967742
Test accuracy: 0.944444444444

knn.fit(X_train_std[:, k5], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))

Training accuracy: 0.959677419355
Test accuracy: 0.962962962963

Note the improved test accuracy by fitting lower dimensional training/test data.

Forward selection

This is essentially the reverse of backward selection; we will leave this as an exercise.
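As a starting point for the exercise, the greedy loop of forward selection might look like the sketch below. It is deliberately decoupled from any model: score_fn is a hypothetical callable that returns a score for a tuple of feature indices, and the toy score is invented so that the example runs on its own. In practice score_fn would train and evaluate an estimator on those columns, much like _calc_score in the SBS class.

```python
def sequential_forward_selection(n_features, k_features, score_fn):
    """Greedily add the feature that gives the best score until k_features are chosen."""
    selected = ()
    remaining = set(range(n_features))
    history = []
    while len(selected) < k_features and remaining:
        # try each remaining feature and keep the one with the best score
        best_feature = max(sorted(remaining),
                           key=lambda f: score_fn(selected + (f,)))
        selected = selected + (best_feature,)
        remaining.remove(best_feature)
        history.append((selected, score_fn(selected)))
    return selected, history

# toy score: features 0, 2 and 3 are useful, every feature has a small cost
useful = {0: 0.5, 2: 0.3, 3: 0.1}

def toy_score(subset):
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, history = sequential_forward_selection(5, 3, toy_score)
print(selected)
```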

Assessing Feature Importances with Random Forests

Recall
- a decision tree is built by splitting nodes
- each node split is chosen to maximize information gain
- a random forest is a collection of decision trees with randomly selected features

Information gain (or impurity loss) at each node can measure the importance of the feature being split
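The impurity loss of a single split can be computed by hand. The sketch below uses Gini impurity as the impurity measure, which is one common choice and an assumption here since the text does not name one; the function names and toy labels are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [1, 1, 2, 2]
perfect = impurity_decrease(parent, [1, 1], [2, 2])        # split separates the classes
uninformative = impurity_decrease(parent, [1, 2], [1, 2])  # split changes nothing

print(perfect, uninformative)
```

A feature whose splits keep producing large decreases across the trees of the forest ends up with a high importance score.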

# feature_importances_ from random forest classifier records this info
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), 
        importances[indices],
        color='lightblue', 
        align='center')

plt.xticks(range(X_train.shape[1]), 
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('./random_forest.png', dpi=300)
plt.show()

 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916

[Figure: bar chart of the random forest feature importances in descending order]

threshold = 0.15
# the transform method of the forest itself is no longer available in scikit-learn;
# SelectFromModel performs the thresholding instead
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=threshold, prefit=True)
X_selected = sfm.transform(X_train)

X_selected.shape

(124, 3)

Now, let’s print the 3 features that met the threshold criterion for feature selection that we set earlier (note that this code snippet does not appear in the actual book but was added to this notebook later for illustrative purposes):

for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948

Summary

Data is important for machine learning: garbage in, garbage out. Preprocessing the data is therefore important. This chapter covers various topics in data preprocessing, such as handling missing data, treating different types of data (numerical, categorical), and avoiding over-fitting, which can improve both accuracy and speed.

Reading

PML Chapter 4
IML Chapter 6.2