A summary of 50 frequently used pandas operations

Key points

  • This article does not aim to be exhaustive; it introduces only the functions and processing methods that are frequently used in practice.
  • The examples are for demonstration only and generally do not modify the original data. Where code does modify the original data, it is marked (modifies the original data); pay attention to whether the original data has changed when you use it yourself. If an error is reported, first check whether your code has altered the original data.

# Does not modify the original data
df.drop('name', axis=1)
# Modifies the original data
df.drop('name', axis=1, inplace=True)
# Modifies the original data (via reassignment)
df = df.drop('name', axis=1)

  • pandas is powerful because of its rich set of data-processing functions; they combine with one another flexibly and interoperate with scientific computing libraries such as numpy, matplotlib, sklearn, and pyspark. Hands-on practice is the only way to truly master it.
  • Original writing is not easy, and the coding is tiring too. If you think the article is good, 💗a like, comment, and share is much appreciated💗 ~ Thanks in advance.


DataFrame creation

Create empty DataFrame

Create an empty DataFrame with two columns and three rows of NaN values.

import pandas as pd

df = pd.DataFrame(columns=['name', 'age'], index=[0, 1, 2])

General DataFrame creation method

Only three common methods are introduced here: creating from an array, creating from a dictionary, and reading from an external file. When reading an xlsx file that contains multiple sheets, specify the desired sheet with the sheet_name parameter, as sketched after the block below.

# Create from an array
df = pd.DataFrame(data=[['Luffy', 18],
                        ['Zoro', 19],
                        ['Nami', 18]],
                  columns=['name', 'age'])
# Create from a dictionary
df = pd.DataFrame({'name': ['Luffy', 'Zoro', 'Nami'],
                   'age': [18, 19, 18]})
# Create from external files: csv, xlsx, json, etc.
df = pd.read_csv('XXX.csv')
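Reading a specific sheet from a multi-sheet workbook looks like the following sketch (the file and sheet names here are placeholders):

df = pd.read_excel('XXX.xlsx', sheet_name='Sheet1')  # sheet_name selects the sheet to read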

DataFrame storage

Common storage methods (csv, json, excel, pickle)

When saving, it is generally not necessary to save the index, because the index will be automatically generated when reading.

df.to_csv('test.csv', index=False)     # ignore the index
df.to_excel('test.xlsx', index=False)  # ignore the index
df.to_json('test.json')                # save as JSON
df.to_pickle('test.pkl')               # save in binary (pickle) format

Viewing DataFrame information

Show summary information

Before using a DataFrame, it is worth looking at the data's information. A personal favorite is info(), which shows the dataset's row and column information and the number of non-null values in each column.

df.info()

Show descriptive statistics

Gives an intuitive view of the basic statistics of the numeric columns.

df.describe()

Display first/last n lines

5 rows are displayed by default; the number of rows to display can be specified.

df.head(n)   # pass an integer to output the first n rows
df.tail(n)   # pass an integer to output the last n rows

Display index and column information

Displays basic information about indexes and columns.

df.columns   # column information 
df.index   # index information

Display the data type of each column

Displays the name of the column and the corresponding data type.

df.dtypes

Display the amount of memory occupied

Displays the memory occupied by each column, in bytes.

df.memory_usage()

locate a row of data

Important: with both loc and iloc, the key is to specify the row first, then the column, separated by a comma. For example, df.loc[:, :] fetches the data of all rows and all columns.

Use loc to locate

For example, to locate Zoro's row of data, any of the following works:

df.loc[1, :]            # loc[index, columns]: row index, column name; returns a Series
df.loc[df['age'] > 18]  # returns a DataFrame
# or df[df['age'] > 18]
# or df.loc[df['name'] == 'Zoro']

Use iloc to locate

Use iloc to fetch the second row (indexing starts from 0) and all columns.

df.iloc[1, :]   # iloc[index1, index2]: row index, column index

add a row of data

Use loc to locate and add

Use loc to locate the row at the next index, len(df), and assign values to it (modifies the original data).

df.loc[len(df)] = ['Chopper', 3]

add using append

When adding data with append, specify the column names and values; any column not specified is filled with NaN by default. A sketch follows.
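A minimal sketch (the row values are made up for illustration). Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where pd.concat is the replacement:

# Append a row as a dict; the unspecified 'age' column becomes NaN
# (DataFrame.append was removed in pandas 2.0; use pd.concat there)
df = df.append({'name': 'Chopper'}, ignore_index=True)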

delete data

Delete column based on column name

Use drop to delete a column: specify the axis and the name/index of the corresponding column/row.

df.drop('name', axis=1)           # delete a single column
df.drop(['name', 'age'], axis=1)  # delete multiple columns

delete row based on index

Similar to deleting a column above, except that an index is specified here.

df.drop(0, axis=0)   # delete a single row 
df.drop([0, 1], axis=0)   # delete multiple rows

Use loc to locate data and delete

First use loc to locate the rows matching a condition, take their index, and then use drop to delete them.

df.drop(df.loc[df['name'] == 'Nami'].index, axis=0)   # delete the located rows

delete column using del

del modifies the original data, so use it with care.

del df['age']

Delete rows and columns at the same time

drop can also delete rows and columns at the same time; here the first and second rows and the age column are deleted.

df.drop(columns=['age'], index=[0, 1])

remove duplicate values

  • If subset is specified, deduplication uses the specified column(s) as the reference: if two rows have the same value in column a, the second occurrence is deleted and only the first is kept.
  • If subset is not specified, all columns are used as the reference, and a row is deduplicated only if it is identical to another row.

df.drop_duplicates(subset=['a'], keep='first')
df.drop_duplicates(keep='first')

Screen for duplicates

sample data

df = pd.DataFrame({'name': ['Python',
                            'Python',
                            'Java',
                            'Java',
                            'C'],
                   'count': [2, 2, 6, 8, 10]})

Check if a column has duplicate values

Use value_counts() to count the occurrences of each value in the column. The result is sorted in descending order by default, so you only need to check whether the count in the first row is 1 to know whether duplicates exist. A one-line check based on this idea is sketched after the call below.

df['name'].value_counts()
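Based on that idea, a minimal sketch of the check (using the name column from the sample data):

# True if the most frequent value appears more than once, i.e. duplicates exist
has_duplicates = df['name'].value_counts().iloc[0] > 1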

Alternatively, use drop_duplicates() to drop duplicates, keeping only the first occurrence, and check whether the result equals the original df. False means the column contains duplicates.

df.equals(df.drop_duplicates(subset=['name'], keep='first'))

False

Check if a DataFrame has duplicate rows

drop_duplicates() is used again to drop duplicates, keeping only the first occurrence. This time the subset parameter is not set, so all columns are used by default. Check whether the result equals the original df; False means duplicate rows exist.

df.equals(df.drop_duplicates(keep='first'))

False

Count the number of duplicate rows

Note that all columns are considered here: only two completely identical rows count as duplicates, so the result is 1.

len(df) - len(df.drop_duplicates(keep="first"))

Show duplicate rows of data

First drop duplicate rows keeping only the first occurrence, which yields a dataset of unique rows. Then use drop_duplicates(keep=False) on df to drop every row that has a duplicate. Append the two results together and deduplicate again with keep=False; what remains is exactly the duplicated rows.

df.drop_duplicates(keep="first")\
  .append(df.drop_duplicates(keep=False))\
  .drop_duplicates(keep=False)

Missing value handling

Find missing values

Missing values are True, non-missing values are False.

df.isnull()

Find non-missing values

Non-missing values are True, missing values are False.

df.notnull()

Show rows with missing values

.isnull : finds missing values, marking their positions as True.
.T : transposes rows and columns in preparation for any().
.any : returns True if any element of a sequence is True.

df[df.isnull().T.any()]

remove missing values

The parameters here deserve extra attention; this is the key part.

  • axis : 0 for rows, 1 for columns
  • how
    • any : drop the row or column if it contains any NaN.
    • all : drop the row or column only if all of its values are NaN.
  • thresh : require at least this many non-NaN values; rows or columns with fewer are dropped (see the sketch after the code below).
  • subset : the data range to consider, e.g. when dropping rows with missing values, subset specifies which columns to check (default: all columns).
  • inplace : whether to modify the original data.

# If a row has any missing value, drop the row
df.dropna(axis=0, how='any')
# If a column has any missing value, drop the column
df.dropna(axis=1, how='any')
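A short sketch of thresh and subset, reusing the column names from earlier examples:

# Keep only rows that have at least 2 non-NaN values
df.dropna(thresh=2)
# Consider only the 'age' column when deciding which rows to drop
df.dropna(subset=['age'])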

Fill missing values

Number or string padding

Directly specify the number or string to be filled.

df.fillna(0)

Fill with values ​​before/after missing values

  • Fill with the value before the missing value (the value above it in the column); a missing value in the first row is left unfilled.
  • Fill with the value after the missing value (the value below it in the column); a missing value in the last row is left unfilled.

df.fillna(method='pad')    # forward fill ('pad' is equivalent to 'ffill')
df.fillna(method='bfill')  # backward fill

Fill with the mean/median, etc. of the column where the missing value is located

Missing values can also be filled with statistics of the column, such as the mean, median, max, min, or sum.

df.fillna(df.mean())
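For instance, swapping in the median is a small change; a sketch restricted to numeric columns:

# Fill each numeric column's NaNs with that column's median
df.fillna(df.median(numeric_only=True))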

column operation

Modify column names

df.columns directly assigns new column names, replacing all of them (modifies the original data).
rename() requires specifying each original name and its new name.

df.columns = ['new_name', 'new_age']
df.rename(columns={'name': 'new_name', 'age': 'new_age'})

Modify column type

Use astype to convert a column's type.

df['age'].astype(str)

Split a column to get multiple columns

Only string columns can be split, via str.split.

df[['name1', 'name2']] = df['name'].str.split(',', expand=True)

Combine multiple columns into a new column

Likewise, only string columns can be concatenated, hence the astype(str).

df['all'] = df['name'] + '/' + df['age'].astype(str)

Partitioning numeric columns

In practice, numeric values often need to be converted into labels according to specified ranges, such as pass/fail for a product metric or whether a grade is excellent. To use pd.cut, specify the value column, the bin edges, whether each edge is open or closed (in the example the default left=True with right=False makes the intervals left-closed and right-open), and finally the label names.

import numpy as np  # needed for np.inf

df['Evaluation'] = pd.cut(df['Grade'], [0, 60, 80, np.inf],
                          right=False, labels=['Failed', 'Good', 'Excellent'])

sort

index sort

Sort row index descending

df.sort_index(axis=0, ascending=False)

Sort column index in descending order

df.sort_index(axis=1, ascending=False)

reset index

Rebuild the index from 0; with drop=True the original index is discarded.

df.reset_index(drop=True)

Value ordering

First sort by name in descending order, and then sort the scores under the same name in descending order.

df.sort_values(by=['name', 'grade'], axis=0, ascending=False)

Create a ranking column

Use rank to create rankings. The meanings of the main values of the method parameter are as follows:

method   meaning
average  default; tied values share the average of their ranks, and subsequent ranks jump
min      tied values share the smallest rank in the group, and subsequent ranks jump
max      tied values share the largest rank in the group, and subsequent ranks jump
first    ranks follow the order of appearance in the data, so every value gets a distinct rank
dense    tied values share the same rank, with no jumps between rankings

Now rank the rows by the grade column and create a new rank column. The default average method is shown below; a sketch comparing the other methods follows it.

df['rank'] = df['grade'].rank(method='average', ascending=False)
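For comparison, the other methods differ only in the method argument; a brief sketch (the new column names are illustrative):

df['rank_min'] = df['grade'].rank(method='min', ascending=False)      # ties share the smallest rank
df['rank_dense'] = df['grade'].rank(method='dense', ascending=False)  # ties share a rank, no gaps
df['rank_first'] = df['grade'].rank(method='first', ascending=False)  # ties broken by order of appearance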

grouping

Group statistics on rows

Now group the scores by person and compute the sum, mean, and maximum respectively.

df.groupby(['name']).sum()
df.groupby(['name']).mean()
df.groupby(['name']).max()

Note: the index is now name. To reset the index, use the following.

df.groupby(['name']).sum().reset_index()

Use different statistical functions for different columns

agg() applies the specified function(s) to a series and returns scalar values.
apply() follows a split-apply-combine process (only a single function can be applied) and can return multidimensional data.

df.groupby(['name']).agg({'grade': ['sum', 'mean', 'max']})
df.groupby(['name']).apply(max)

DataFrame merge

The merge functions in pandas are mainly merge(), concat(), and append(), generally used to connect two or more DataFrames. concat() and append() connect DataFrame objects vertically, while merge() connects DataFrame objects horizontally.

Comparison of the three:

concat()

  • Concatenate multiple DataFrames
  • Can set specific keys to label each source

append()

  • Concatenate multiple DataFrames

merge()

  • Specify columns to join DataFrames

merge()

If on is specified, the column must appear in both DataFrames; by default the intersection of the two DataFrames' columns is used. In this example, even without specifying on, the merge would by default be performed on the name column.
Details of the how parameter:

  • inner: take the intersection on the column(s) specified by on.
  • outer: take the union on the column(s) specified by on.
  • left: left join on the column(s) specified by on.
  • right: right join on the column(s) specified by on.
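df1 and df2 are not shown in the original; a hypothetical pair that the calls below could run against:

# Hypothetical sample frames sharing a 'name' column
df1 = pd.DataFrame({'name': ['Luffy', 'Zoro', 'Nami'], 'age': [18, 19, 18]})
df2 = pd.DataFrame({'name': ['Luffy', 'Zoro', 'Chopper'], 'grade': [90, 80, 95]})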

pd.merge(df1, df2, on='name', how='inner')
pd.merge(df1, df2, on='name', how='outer')
pd.merge(df1, df2, on='name', how='left')
pd.merge(df1, df2, on='name', how='right')

concat()

concat() can merge multiple DataFrames, vertically or horizontally as the situation requires. See the examples below.

# Vertically merge multiple DataFrames, taking the intersection of columns
pd.concat([df1, df2], ignore_index=True, join='inner', axis=0)
# Vertically merge multiple DataFrames, taking the union of columns
pd.concat([df1, df2], ignore_index=True, join='outer', axis=0)
# Horizontally merge multiple DataFrames, taking the intersection of rows
pd.concat([df1, df2], ignore_index=False, join='inner', axis=1)
# Horizontally merge multiple DataFrames, taking the union of rows
pd.concat([df1, df2], ignore_index=False, join='outer', axis=1)

In addition, you can specify keys to add each source frame's name at the index level.

pd.concat([df1, df2], ignore_index=False, join='outer', keys=['df1', 'df2'])

append()

append() is commonly used for vertical merging and can also merge multiple DataFrames at once. (It was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement.)

df1.append(df2, ignore_index=True)
df1.append([df1, df2], ignore_index=True)

DataFrame time processing

sample data
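The original sample data is not shown; a hypothetical frame consistent with the examples below, with string timestamps spanning 2021 and a numeric count column (the exact values in the comments further down come from the original data and will differ):

import numpy as np
import pandas as pd

# Hypothetical: 200 timestamps across 2021 with a numeric 'count' column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'datetime': pd.date_range('2021-01-01', '2021-12-31 23:59:59', periods=200).astype(str),
    'count': rng.integers(0, 100, size=200),
})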

Convert string column to time series

The time column read from a csv or xlsx file is sometimes of string (object) type; it then needs to be converted to the datetime type to make subsequent time handling easier.

df['datetime'] = pd.to_datetime(df['datetime'])  # assign back to convert the column

use the time column as an index

For most time-series data, using the time column as the index greatly eases time-based operations. Here drop=False keeps the datetime column.

df = df.set_index('datetime', drop=False)  # assign back so the index takes effect below

January's data is obtained via the index; the first five rows are shown here.

df.loc['2021-1'].head()

Data from January through March is obtained via the index.

df.loc['2021-1':'2021-3'].info()

Get the properties of time

The properties commonly needed in practice are listed here, with an example of each further below.

Common property   Description
date              the date part
time              the time part
year              the year
month             the month
day               the day of the month
hour              the hour
minute            the minute
second            the second
dayofyear         the day of the year
weekofyear        the week of the year (use isocalendar().week in newer versions)
weekday           the day of the week as a number (Monday is 0)
day_name()        the day of the week in English
quarter           the quarter of the year
is_leap_year      whether the year is a leap year

Here the timestamp at row 100 is taken as an example; each attribute's result is shown as a comment.

df['datetime'].dt.date[100]
# datetime.date(2021, 4, 11)
df['datetime'].dt.time[100]
# datetime.time(11, 50, 58, 995000)
df['datetime'].dt.year[100]
# 2021
df['datetime'].dt.month[100]
# 4
df['datetime'].dt.day[100]
# 11
df['datetime'].dt.hour[100]
# 11
df['datetime'].dt.minute[100]
# 50
df['datetime'].dt.second[100]
# 58
df['datetime'].dt.dayofyear[100]
# 101
df['datetime'].dt.isocalendar().week[100]
# 14
df['datetime'].dt.weekday[100]
# 6
df['datetime'].dt.day_name()[100]
# 'Sunday'
df['datetime'].dt.quarter[100]
# 2
df['datetime'].dt.is_leap_year[100]
# False

Resampling resample()

There are two types of resampling : downsampling and upsampling .

Downsampling means sampling at a lower time frequency than the original series, and it involves aggregation. The example below takes the quarterly mean of the count column; 'Q' stands for quarter.

df.resample('Q', on='datetime')['count'].mean()

Note: the last output period ends at 06-30, not the 05-31 maximum in the actual data, but this does not affect the calculation.

Upsampling is the opposite of downsampling: sampling at a higher time frequency than the original series, which yields finer-grained time data. However, this usually produces a large number of null values, so it sees little practical use and is not covered here.

DataFrame traversal method

iterrows() – iterates over rows (index, sequence of column values)

Iterate row by row, obtaining the row index and a Series of column values. This is slow; see the example.

for index, row in df.iterrows():
    print(index)
    print(row)
    print(row['name'])
    print(row['age'])
    break  # for demonstration, only one row is shown

iteritems() – iterate over (column names, values)

Iterate column by column, obtaining each column name and the column's values. (iteritems was removed in pandas 2.0; use items() instead.)

for column, value in df.iteritems():
    print(column)
    print(value)
    break  # for demonstration, only one column is shown

itertuples() – iterates over rows (index, column value A, column value B…)

Iterate row by row, obtaining the row's index and column values as a single namedtuple. Unlike iterrows, the index and values come packed together, so info[0] is the index.

for info in df.itertuples():
    print(info[0])
    print(info[1])
    print(info[2])
    break  # for demonstration, only one row is shown


That's all for this article. If you found it helpful, ❤ a like, comment, and share is much appreciated! ❤ More practical content will be shared in the future; feel free to follow along so you don't miss it.
