Day 1 – Data Preprocessing
```python
# Note: these imports target an older scikit-learn release; newer versions
# replace Imputer with sklearn.impute.SimpleImputer and sklearn.cross_validation
# with sklearn.model_selection (see the sketches in the steps below).
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler

# step 2: Import dataset
data_set = pd.read_csv('datasets/Data.csv')
print(data_set)
X = data_set.iloc[:, :3].values   # all rows, every column except the last (end index is exclusive)
Y = data_set.iloc[:, 3].values    # all rows, the last column only

# step 3: Handle missing data
# (only columns 1 and 2 contain missing values)
imputer = Imputer(missing_values='NaN', strategy='mean', verbose=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# print(X)

# step 4: Encode categorical data
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# print(X)
# Create dummy variables for the Country column (index 0)
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# print(X)
# print(Y)

# step 5: Split the dataset into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# step 6: Feature scaling (standardization)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
```
Step 1: Import the libraries
- NumPy: mainly used for working with arrays and matrices.
- Pandas: built on top of NumPy, mainly used for data analysis tasks.
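A tiny illustration of the two libraries, using made-up values:

```python
import numpy as np
import pandas as pd

# A small NumPy array and a pandas DataFrame built from it (hypothetical values).
ages = np.array([44, 27, 30, 38])
df = pd.DataFrame({"Age": ages, "Country": ["France", "Spain", "Germany", "Spain"]})
print(df.describe())   # quick numeric summary provided by pandas
```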
Step 2: Import the dataset
pandas.read_csv(): reads a CSV file into a DataFrame.
Common pandas methods for reading and writing files:

| File format | Read method | Write method |
| --- | --- | --- |
| CSV | read_csv | to_csv |
| Excel | read_excel | to_excel |
| JSON | read_json | to_json |
| SQL | read_sql | to_sql |
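For example, reading the project's Data.csv and writing a copy back out (the output file name below is just an illustration):

```python
import pandas as pd

# Read the dataset used above into a DataFrame, then write it back to disk.
data_set = pd.read_csv('datasets/Data.csv')
data_set.to_csv('datasets/Data_copy.csv', index=False)   # hypothetical output path
```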
Step 3: Handle missing data
- sklearn (scikit-learn): a machine learning library that provides preprocessing tools, implementations of many models, and a number of example datasets.
- sklearn.preprocessing: the preprocessing module.
- sklearn.preprocessing.Imputer: a class that fills in (imputes) missing values in the data.
```python
imputer = Imputer(missing_values='NaN', strategy='mean', verbose=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```
The code above fits an Imputer on the numeric columns of X and then uses the fitted object to fill in the missing values in X. The strategy is to replace each NaN with the mean of its column (the Imputer's default axis=0 means the statistic is computed per column). Here the imputer is both fit on and applied to X, but an imputer fit on X could equally be used to transform another array.
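Note that in newer scikit-learn releases (0.22+) the Imputer class has been removed; its replacement is SimpleImputer in sklearn.impute. A minimal sketch of the same column-mean imputation on a small made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small example matrix with missing entries (np.nan) in the numeric columns.
X_num = np.array([[44.0, 72000.0],
                  [27.0, np.nan],
                  [np.nan, 54000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_num = imputer.fit_transform(X_num)   # each NaN is replaced by its column's mean
print(X_num)
```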
Step 4: Encode categorical data
Categorical data refers to variables that take label values rather than numerical values, and the set of possible values is usually fixed (for example "Yes" and "No"). Such labels cannot be used directly in the model's mathematical computations, so they need to be converted into numbers.
sklearn.preprocessing.LabelEncoder(): encodes labels by mapping each distinct label value to an integer in a fixed range (0 to n_classes - 1).
In the data, Country in X and Purchased in Y are categorical strings, which the model cannot process directly, so they are encoded with LabelEncoder and OneHotEncoder.
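The listing at the top uses the older OneHotEncoder(categorical_features=[0]) API; that argument has since been removed from scikit-learn. A minimal sketch of the same encoding with a newer API (assuming scikit-learn ≥ 0.20, where OneHotEncoder accepts string categories directly; the sample rows are made up to mirror Data.csv):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sample: a string Country column plus two numeric columns, and a string target.
X = np.array([['France', 44.0, 72000.0],
              ['Spain', 27.0, 48000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)
Y = np.array(['No', 'Yes', 'No'])

# LabelEncoder: strings -> integer codes (suitable for the target vector).
Y = LabelEncoder().fit_transform(Y)              # ['No', 'Yes', 'No'] -> [0, 1, 0]

# OneHotEncoder: one-hot encode only the Country column,
# then put the numeric columns back alongside the dummy variables.
country_dummies = OneHotEncoder().fit_transform(X[:, [0]]).toarray()
X = np.hstack([country_dummies, X[:, 1:].astype(float)])
print(X)
print(Y)
```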
Step 5: Split the dataset into a training set and a test set
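In newer scikit-learn releases train_test_split is imported from sklearn.model_selection rather than the sklearn.cross_validation module used in the listing above. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, binary target.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```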
Step 6: Feature scaling
This is the standardization (normalization) step: it prevents features with very large magnitudes from dominating the others and distorting the learning result. In practice the scaler is usually fit on the training set only and then reused to transform the test set, as shown in the sketch below.
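A minimal sketch of standardization on made-up numbers; unlike the listing at the top, the scaler here is fit only on the training data and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test matrices with features on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 350.0]])

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)   # learn mean and std on the training data
X_test = sc_X.transform(X_test)         # reuse the same statistics on the test data
print(X_train)
print(X_test)
```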