Understand Pandas data merging, this piece is enough

Hits: 0

Data merging is a necessary link in the data processing process. As a powerful tool for data analysis, pandas provides four commonly used data merging methods. Let’s see how to use these methods!

1.concat

concat()It can be used for inline or outline splicing operations in the row/column direction DataFramebetween , and the default is to take the [union]It can be used for inline or outline splicing operations in the row/column directionDataFramebetween , and the default is to take the [union]y of rows (along the axis) .[]</p> <p><strong>How to use</strong></p> <pre><code>pd.concat( objs: Union[Iterable[~FrameOrSeries], Mapping[Union[Hashable, NoneType], ~FrameOrSeries]], axis=0, join='outer', ignore_index: bool = False, keys=None, levels=None, names=None, verify_integrity: bool = False, sort: bool = False, copy: bool = True, ) </code></pre> <p><strong>The main parameters</strong></p> <ul> <li><code>objs</code>: A sequence or a Map of Series, DataFrame objects.</li> <li><code>axis</code>: connected axis, 0 ('index', row), 1 ('columns', column), default is 0.</li> <li><code>join</code>: Connection mode, inner (intersection), outer (union), the default is outer.</li> <li><code>ignore_index</code>: Whether to reset the index value of the concatenated axis. If True, reset indices to 0, ..., n - 1.</li> <li><code>keys</code>: Create a hierarchical index. Can be a list or array of arbitrary values, an array of tuples, a list of arrays (if levels are set to a multilevel array)</li> <li><code>names</code>: The name of the level in the resulting hierarchical index.</li> </ul> <h4>Example</h4> <p>Create two <code>DataFrame</code>.</p> <pre><code>df1 = pd.DataFrame( {'char': ['a', 'b'], 'num': [1, 2]}) df2 = pd.DataFrame( {'char': ['b', 'c'], 'num': [3, 4]}) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257469.png" /></p> <p><code>concat()</code>By default, the splicing operation and connection method will be performed in the row direction <code>outer</code>.</p> <pre><code>pd.concat([d1, d2]) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257472.png" /></p> <p>Clear existing indexes and reset indexes.</p> <pre><code>pd.concat( [ d1, d2 ], ignore_index=True) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257474.png" /></p> <p>Add a hierarchical index at the outermost level of the data via the <code>keys</code>parameter .</p> <pre><code>pd.concat( [ d1, d2 ], keys=['d1', 'd2']) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257475.png" /></p> <p>Specify the <code>names</code>parameter to mark the created index key.</p> <pre><code>pd.concat( [ d1, d1 ], keys=['d1', 'd2'], names=['DF Name', 'Row ID']) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257476.png" /></p> <p>Combines two <code>DataFrame</code>with overlapping columns and returns everything. Column padding outside the intersection <code>NaN</code>.</p> <pre><code>df3 = pd.DataFrame( {'char': ['b', 'c'], 'float': [3.0, 4.0]}) pd.concat([df1, df3]) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257477.png" /></p> <p>Combines two <code>DataFrame</code>with overlapping columns, returning only the contents of the overlapping columns.</p> <pre><code>pd.concat( [df1, df3], join="inner") </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257478.png" /></p> <p>Specifies <code>axis=1</code>to group <code>DataFrame</code>objects horizontally along the x-axis.</p> <pre><code>df4 = pd.DataFrame( {'char': ['b', 'c', 'd'], 'num': [3, 4, 5]}, index=range(1, 4)) pd.concat([df1, df4], axis=1) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257479.png" /></p> <h3>2.merge()</h3> <p><code>merge()</code>It can only be used for inline or outer union operation <code>DataFrame</code>in columns. The default column is merged (along the <code>x</code>axis), and the intersection is taken (that is, the intersection of two <code>DataFrame</code>column names is used as the connection key)</p> <p><strong>How to use</strong></p> <pre><code>pd.merge( left, right, how: str = 'inner', on=None, left_on=None, right_on=None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes=('_x', '_y'), copy: bool = True, indicator: bool = False, validate=None, ) </code></pre> <p><strong>parameter</strong></p> <ul> <li><code>left</code>:DataFrame</li> <li><code>right</code>: DataFrame or Series with name</li> <li><code>how</code>: {'left', 'right', 'outer', 'inner'}, the default is 'inner', the connection method</li> <li><code>on</code>: The name of the column index used for the connection, which must exist in both the left and right DataFrames. By default, the intersection of the column names of the two DataFrames is used as the connection key.</li> <li><code>left_on</code>: The column name used for the connection key in the left DataFrame, this parameter is very useful when the left and right column names are different but represent the same meaning;</li> <li><code>right_on</code>: Column name in the right DataFrame to use for the join key</li> <li><code>left_index</code>: Default is False, do not use the row index in the left DataFrame as the join key (but it is better to use JOIN in this case)</li> <li><code>right_index</code>: defaults to False, does not use the row index in the right DataFrame as the join key (but it is better to use JOIN in this case)</li> <li><code>sort</code>: The default is False, the merged data will be sorted, setting it to False can improve performance</li> <li><code>suffixes</code>: A tuple of string values, used to specify the suffix name appended to the column name when the left and right DataFrames have the same column name, the default is ('_x', '_y')</li> <li><code>copy</code>: The default is True, always copy the data into the data structure, set to False to improve performance</li> <li><code>indicator</code>: Displays the source of the data in the combined data</li> <li><code>validate</code>:{"one_to_one" or "1:1", "one_to_many" or "1:m", "many_to_one" or "m:1", "many_to_many" or "m:m"} If specified, checks if the merge is Specify the type.</li> </ul> <h4>Example</h4> <p>Create two <code>DataFrame</code>.</p> <pre><code>df1 = pd.DataFrame( {'name': ['A1', 'B1', 'C1'], 'grade': [60, 70, 80]}) df2 = pd.DataFrame( {'name': ['B1', 'C1', 'D1'], 'grade': [70, 80, 100]}) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257481.png" /></p> <p><code>merge()</code>By default, it will be merged according to the columns that exist in the two <code>DataFrame</code>, and the merge method will take the intersection method.</p> <pre><code>df1.merge(df2) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257482.png" /></p> <p>The way to specify the merge is <code>outer</code>, take the union.</p> <pre><code>df1.merge(df2, how='outer') </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257483.png" /></p> <p>Create two more below <code>DataFrame</code>.</p> <pre><code>df1 = pd.DataFrame( {'name1': ['A1', 'B1', 'B1', 'C1'], 'grade': [60, 70, 80, 90]}) df2 = pd.DataFrame( {'name2': ['B1', 'C1', 'D1', 'E1'], 'grade': [70, 80, 90, 100]}) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257484.png" /></p> <p>Merge sums according to <code>name1</code>and columns . Columns are appended with the default suffixes and .<code>name2df1df2grade_x_y</code></p> <pre><code>df1.merge( df2, left_on='name1', right_on='name2') </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257485.png" /></p> <p>Combines <code>df1</code>and <code>df2</code>and appends the specified left and right suffixes to the end of overlapping columns.</p> <pre><code>df1.merge( df2, left_on='name1', right_on='name2', suffixes=('_1', '_2')) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257486.png" /></p> <h3>3.append()</h3> <p><code>append()</code>Can be used for splicing operations <code>DataFrame</code>between row directions (along the <code>y</code>axis), the default is union.</p> <p><strong>How to use</strong></p> <pre><code>df1.append( other, ignore_index=False, verify_integrity=False, sort= False ) </code></pre> <p><strong>parameter</strong></p> <ul> <li><code>other</code>: Specify the data to be added. DataFrame or Series object, or a list of these objects</li> <li><code>ignore_index</code>: Whether to ignore the index, if True, the axis will be reset to 0, 1, …, n - 1. Default is False</li> <li><code>verify_integrity</code>: If True, raise ValueError when creating an index with duplicates. Default is False</li> <li><code>sort</code>: If the columns of df1 and other are not aligned, sort the columns. Defaults to False.</li> </ul> <h4>Example</h4> <p>Create two <code>DataFrame</code>.</p> <pre><code>df1 = pd.DataFrame( [[1, 2], [3, 4]], columns=list('AB')) df2 = pd.DataFrame( [[5, 6], [7, 8]], columns=list('BC')) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257487.png" /></p> <p><code>append()</code>By default, two columns will be spliced ​​vertically along the y-axis <code>DataFrame</code>, <code>df1</code>, <code>df2</code>and the columns outside the intersection will be filled <code>NaN</code>.</p> <pre><code>df1.append(df2) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257488.png" /></p> <p><code>ignore_index</code>Set to True to achieve resetting the index of the axis.</p> <pre><code>df1.append(df2, ignore_index=True) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257489.png" /></p> <h3>4.join()</h3> <p><code>join()</code>Used for the stitching operation of two or <code>DataFrame</code>more intercolumn directions (along the <code>x</code>axis), the default is left stitching.</p> <p><strong>How to use</strong></p> <pre><code>df1.join( other, on=None, how='left', lsuffix='', rsuffix='', sort=False) </code></pre> <ul> <li><code>other</code>: Specify the data to be added. DataFrame or Series object, or a list of these objects</li> <li><code>on</code>: The column to join, the default is to use index join</li> <li><code>how</code>: {'left', 'right', 'outer', 'inner'}, the default is 'left', the connection method</li> <li><code>lsuffix</code>: defaults to an empty string, representing the suffix of repeated columns in df1</li> <li><code>rsuffix</code>: suffix for repeated columns in other</li> <li><code>sort</code>: lexicographically order the results on the join key. If False, the order of join keys depends on the join type (keyword).</li> </ul> <h4>Example</h4> <p>Create two <code>DataFrame</code>.</p> <pre><code>df1 = pd.DataFrame( { 'A' : [ 'A0' , 'A1' , 'A2' , 'A3' , 'A4' ], 'val' : [ 'V0' , 'V1' , 'V2' , 'V3' , 'V4 ' ]}) df2 = pd.DataFrame( { 'B' : [ 'B3' , 'B4' , 'B5' ], 'val' : [ 'V3' , 'V4' , 'V5' ]}) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257490.png" /></p> <p>If we want to join using <code>val</code>the column , we need to <code>val</code>set <code>df1</code>to <code>df2</code>the index in and .</p> <pre><code>df1.set_index('val').join( df2.set_index('val')) </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257491.png" /></p> <p>Another way to use <code>val</code>column joins is to specify the <code>on</code>parameter. Only<code>df1.join</code> indexes on can be used, but any column in can be used. So you can only convert the column in the index, and pass the connection column specified by the parameter .<code>df2df1df2valondf1val</code></p> <pre><code>df1.join( df2.set_index('val'), on='val') </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257492.png" /></p> <p>Use outer joins to connect <code>df1</code>,<code>df2</code></p> <pre><code>df1.join( df2.set_index('val'), on='val', how='outer') </code></pre> <p><img alt="" src="https://gitee.com/pythondy/pic_bed/raw/master/image/202111232257493.png" /></p> <h3>Summary of the four methods</h3> <ul> <li><code>concat()`PandasObjects can be connected along any axis , and a layer of hierarchical index can be added on the concatenated axis

  • join()Mainly used for column splicing based on row index
  • merge()Use database-style join merging, where joins are based on columns or indexes.
  • In general append(), join()it can be regarded concat()as merge()a simplified version of and , with fewer parameters and stronger ease of use.
  • For those who are just getting started Pythonor want to get started, you can contact the author through the small cardPython below to communicate and learn together. They all come from novices. Sometimes a simple question card takes a long time, but maybe someone else will suddenly realize it. , I sincerely hope that everyone can make progress together. There are also nearly a thousand sets of resume templates and hundreds of e-books waiting for you to collect them!

    👇🏻Follow the small card, reply to the “Communication Group”, and learn Python together 👇🏻

    You may also like...

    Leave a Reply

    Your email address will not be published.