100-Days-Of-ML-Code-Day 1

Day 1 – Data preprocessing

Original: https://github.com/Avik-Jain/100-Days-Of-ML-Code
Translation: https://github.com/MLEveryday/100-Days-Of-ML-Code


    # step 1: Import libraries
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer  # replaces Imputer (removed in scikit-learn 0.22)
    from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

    # step 2: Import dataset
    data_set = pd.read_csv('datasets/Data.csv')
    X = data_set.iloc[:, :3].values  # every column except the last one
    Y = data_set.iloc[:, 3].values   # the last column

    # step 3: Handle missing data (only columns 1 and 2, zero-based, contain missing values)
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer = imputer.fit(X[:, 1:3])
    X[:, 1:3] = imputer.transform(X[:, 1:3])
    # print(X)

    # step 4: Encode categorical data
    # Create dummy variables for column 0. OneHotEncoder(categorical_features=[0])
    # was removed in scikit-learn 0.22, so ColumnTransformer selects the column instead;
    # OneHotEncoder accepts strings directly, so no LabelEncoder pass over X is needed.
    ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                           remainder='passthrough', sparse_threshold=0)
    X = ct.fit_transform(X).astype(float)
    labelencoder_Y = LabelEncoder()
    Y = labelencoder_Y.fit_transform(Y)
    # print(X)
    # print(Y)

    # step 5: Divide the dataset into training set and test set
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

    # step 6: Feature scaling
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)  # reuse the training-set mean and std; do not refit on the test set

step 1: import libraries

  • Numpy: Mainly used to deal with arrays and matrices.
  • Pandas: Based on Numpy, it is mainly used to solve data analysis tasks.

step 2: import data

pandas.read_csv(): reads a CSV file into a DataFrame.

Common pandas methods for reading and writing files:

File format   Read method   Write method
CSV           read_csv      to_csv
JSON          read_json     to_json
HTML          read_html     to_html
Excel         read_excel    to_excel
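As a rough illustration of the read/write pairing, the sketch below reads a small CSV from an in-memory string and writes it back out. The column names other than Country and Purchased, and all of the values, are made up for the example; the real datasets/Data.csv may differ.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for datasets/Data.csv (values are invented).
csv_text = (
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
)

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Same slicing as in step 2: all columns but the last, then the last column.
X = df.iloc[:, :3].values
Y = df.iloc[:, 3].values

# to_csv is the matching writer; with no path it returns the CSV as a string.
round_trip = df.to_csv(index=False)
df_again = pd.read_csv(io.StringIO(round_trip))
```

The other formats follow the same pattern, for example read_json paired with to_json.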

step 3: deal with missing data

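  • sklearn (scikit-learn): A machine learning library that implements many different models and also provides preprocessing utilities and a few sample datasets.
  • sklearn.preprocessing: The preprocessing module.
  • sklearn.preprocessing.Imputer: A class that imputes missing values in the data. It was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer is its replacement.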

Code Explanation:

    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer = imputer.fit(X[:, 1:3])
    X[:, 1:3] = imputer.transform(X[:, 1:3])

The code above fits an imputer on columns 1 and 2 of X and then uses the fitted object to fill the missing values in those same columns. With strategy='mean', each NaN is replaced by the mean of the non-missing values in its own column. SimpleImputer always works column by column, which is what the older Imputer(missing_values='NaN', strategy='mean', axis=0) call did with axis=0. Here the imputer is fitted on and applied to the same array, but an imputer fitted on one array can also be applied to another array with the same columns.
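To make the column-wise mean replacement concrete, here is a minimal sketch on a tiny made-up array. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer (0.20 or later); the numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one missing value in each column.
X_demo = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit learns one mean per column from the non-missing values,
# transform writes that mean into the NaN positions.
X_filled = imputer.fit_transform(X_demo)

# Learned per-column means: (1 + 3) / 2 and (10 + 20) / 2.
column_means = imputer.statistics_
```

After this runs, the NaN in the first column becomes 2.0 and the NaN in the second column becomes 15.0.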

step 4: Encoding and parsing categorical data

  • Categorical data refers to variables that hold label values instead of numerical values, usually drawn from a fixed set such as "Yes" and "No". Labels like these cannot be used in a model's mathematical calculations, so they need to be converted into numbers.

  • sklearn.preprocessing.LabelEncoder(): Encodes labels as integers from 0 to n_classes - 1.

  • In the data, Country in X and Purchased in Y are category strings, which most machine learning models cannot process directly, so they are encoded: LabelEncoder converts Purchased to integers, and OneHotEncoder expands Country into one dummy column per country so that the model does not read an order into the categories.
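The two encoders described above can be sketched on a few made-up rows. The country names and labels below are invented for illustration and are not taken from Data.csv.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up categorical columns.
countries = np.array(["France", "Spain", "Germany", "Spain"])
purchased = np.array(["No", "Yes", "No", "Yes"])

# LabelEncoder sorts the classes and maps them to 0..n_classes-1.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(purchased)

# OneHotEncoder expects a 2-D input and returns a sparse matrix by default,
# so the column is reshaped and the result converted to a dense array.
onehot_encoder = OneHotEncoder()
dummies = onehot_encoder.fit_transform(countries.reshape(-1, 1)).toarray()

# One column per category, in sorted order: France, Germany, Spain.
category_order = onehot_encoder.categories_[0]
```

Each row of dummies contains a single 1 in the column of its country, so no country is treated as "larger" than another.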

step 5: Divide the dataset into training set and test set
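A small sketch of the split, using the same test_size=0.2 and random_state=0 as the code at the top but on made-up arrays: 20% of the rows are held out for testing, and fixing random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features; row i is [2*i, 2*i + 1] and has label i.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

# 8 rows go to the training set and 2 to the test set;
# features and labels are shuffled together, so rows stay paired.
```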

step 6: Feature scaling

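Feature scaling puts all features on a comparable scale so that features with large numeric ranges do not dominate the model. StandardScaler does this by standardization: each column is shifted to zero mean and rescaled to unit variance. The mean and standard deviation are learned from the training set and then reused for the test set, so both sets are transformed in the same way.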
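A minimal sketch of standardization on made-up numbers, assuming the usual pattern of fitting the scaler on the training data and only transforming the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second feature is on a much larger scale than the first.
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# fit_transform learns the per-column mean and standard deviation
# from the training set and applies (x - mean) / std.
X_train_scaled = scaler.fit_transform(X_train_demo)

# transform reuses the training statistics for the test set.
X_test_scaled = scaler.transform(X_test_demo)

# The same computation written out by hand, for comparison.
manual = (X_train_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
```

After scaling, both training columns have mean 0 and standard deviation 1, whatever their original units were.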
