[Python from zero to one] 15. Data preprocessing, Jieba tools and text clustering for text mining


Welcome to “Python from Zero to One”. In this series I plan to share about 200 Python articles, so that we can learn together and explore the interesting world of Python. Every article combines case studies, code, and the author’s own experience. I would like to pass on what I have learned in nearly ten years of programming, and I hope it helps you. Please forgive any shortcomings in the articles. The series is planned as 10 articles on basic grammar, 30 on web crawlers, 10 on visual analysis, 20 on machine learning, 20 on big data analysis, 30 on image recognition, 40 on artificial intelligence, 20 on Python security, and 10 on other skills. Your attention, likes, and reposts are the greatest support for Xiu Zhang. Knowledge is priceless and people are warm-hearted. I hope we can all be happy and grow together on the road of life.

The previous article covered the principles of classification algorithms with worked cases, including decision trees, KNN, and SVM, and summarized them through detailed classification comparison experiments and visual boundary analysis. This article explains data preprocessing, Jieba word segmentation, and text clustering in detail. It can serve as an introduction to text mining and natural language processing: a basic article of about 20,000 words that I hope will help you.


The author’s new “Nazhang AI Security Home” will focus on Python and security technology, mainly sharing articles on web penetration, system security, artificial intelligence, big data analysis, image recognition, malicious code detection, CVE reproduction, threat intelligence analysis, and more. Although the author is a technical novice, every article will be written with great care. I hope these basic articles help you, and that we make progress together on the road of Python and security.

The first part of this series introduced various methods for crawling network data with Python in detail, and the corpus crawled was all Chinese text. The earlier chapters of the second part described commonly used data analysis models and examples, but those examples all analyzed array or matrix data. So how do we analyze a Chinese text corpus? In this chapter, the author leads you into the field of text clustering analysis and walks through examples of text preprocessing and text clustering.

1. Overview of data preprocessing

Data analysis and data mining usually go through steps such as preliminary preparation, data crawling, data preprocessing, data analysis, data visualization, and evaluation. The work that comes before the analysis itself takes nearly half of a data engineer’s working time, and the quality of data preprocessing directly affects the quality of the subsequent model analysis.

Data preprocessing refers to the preliminary processing applied to data before analysis, including filling in missing values, handling noise, correcting inconsistent data, and Chinese word segmentation. The goal is to obtain more standardized, higher-quality data and to correct erroneous or abnormal values, thereby improving the results of the analysis.

Figure 1 shows the basic steps of data preprocessing, including Chinese word segmentation, part-of-speech tagging, data cleaning, feature extraction (vector space model storage), and weight calculation (TF-IDF).

1. Chinese word segmentation technology and Jieba tool
After obtaining the corpus, the first step is to segment the Chinese text. A Chinese sentence is a run of consecutive Chinese characters with no explicit boundary marks between words, so a word segmentation technique is needed to split the sentence into a sequence of words separated by spaces. This chapter introduces the commonly used Chinese word segmentation techniques and focuses on examples using Jieba, a popular word segmentation tool for Python.

2. Data cleaning and stop word filtering
The word corpus produced by Jieba may still contain dirty data and stop words. To get better analysis results, these data sets need data cleaning and stop word filtering. Here the Jieba library is used together with a stop word list to clean the data.

3. Part-of-speech tagging
Part-of-speech tagging assigns the correct part of speech to each word or phrase in the segmentation result, that is, it determines whether each word is a noun, verb, adjective, or another part of speech. It helps determine the role a word plays in its context and is a basic step in natural language processing and data preprocessing. Python also has libraries for part-of-speech tagging.

4. Feature extraction
Feature extraction converts the original features into a set of core features with clear physical or statistical meaning. The extracted features should represent the original corpus as well as possible, and they are usually stored in a vector space model. The vector space model represents a text as a vector, which turns Chinese text into numerical features. This chapter introduces the basics of feature extraction, vector space models, and cosine similarity, and explains them in depth with examples.

5. Weight calculation and TF-IDF
When building a vector space model, the choice of weights is particularly important. Common methods include Boolean weights, word frequency weights, TF-IDF weights, and the entropy weight method. This chapter describes the commonly used weight calculation methods and explains the calculation of TF-IDF in detail with examples.

Now suppose that the data set shown in Table 1 is stored in the local file test.txt. The whole chapter is built around this data set. It contains 9 lines of data covering 3 topics: Guizhou travel, big data, and love. Table 1 lists the sentences in English translation, while the corpus itself is Chinese text. Each step of data preprocessing is analyzed and explained below.

Guizhou Province is located in the southwest region of China and is referred to as "Qian" or "Gui".
Traveling all over the land of China, you will be intoxicated and colorful in Guizhou.
Guiyang City is the capital of Guizhou Province and has the reputation of "Forest City".
Data analysis is the product of the combination of mathematics and computer science.
Regression, clustering, and classification algorithms are widely used in data analysis.
Data scraping, data storage and data analysis are closely related processes.
The sweetest is love, and the most bitter is love.
An egg can be painted countless times, but can a love be?
True love is often treasured in the most ordinary, ordinary life.

2. Chinese word segmentation

After crawling a Chinese dataset with Python, the first step is Chinese word segmentation. English words are separated by spaces, so a sentence can be split into words directly and no word segmentation is needed. Chinese characters, by contrast, are written continuously, take on meaning in combination, and have no explicit boundary between words. Chinese word segmentation is therefore needed to split the sentences in the corpus into sequences of words separated by spaces. The following introduces Chinese word segmentation technology and the Jieba Chinese word segmentation tool in detail.

1. Chinese word segmentation technology

Chinese Word Segmentation refers to dividing a sequence of Chinese characters into individual words or word strings. It inserts separation marks, usually spaces, into Chinese strings that have no word boundaries. Chinese word segmentation is a very basic topic in data analysis and preprocessing, data mining, text mining, search engines, knowledge graphs, natural language processing, and related fields. Only after segmentation can a corpus be converted into mathematical vectors for further analysis. Because Chinese text involves semantics and ambiguity, it is harder to segment and much more complicated to handle than English. Let’s take a simple example and segment the sentence “I am a programmer”.

Input: 我是程序员 (I am a programmer)
Output 1: 我\是\程\序\员
Output 2: 我是\是程\程序\序员
Output 3: 我\是\程序员

These three outputs illustrate three approaches to Chinese word segmentation.

  • “我\是\程\序\员” uses the unigram word segmentation method, which splits the Chinese string into single characters;
  • “我是\是程\程序\序员” uses the bigram word segmentation method, which splits the string into overlapping pairs of adjacent characters;
  • “我\是\程序员” (I \ am \ programmer) is a more complicated but more practical word segmentation method. It segments according to Chinese semantics, and its result is more accurate.
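The first two methods are simple enough to sketch in a few lines. The example below is only an illustration, assuming the example sentence is the Chinese string 我是程序员 (“I am a programmer”); the function names unigram_split and bigram_split are hypothetical helpers, not part of any library.

```python
def unigram_split(text):
    """Split a string into single characters (unigram segmentation)."""
    return list(text)


def bigram_split(text):
    """Split a string into overlapping pairs of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "我是程序员"  # "I am a programmer"
print("\\".join(unigram_split(sentence)))
print("\\".join(bigram_split(sentence)))
```

Neither function knows anything about meaning, which is why the third, semantics-aware kind of segmentation needs a dictionary or a statistical model.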

There are many methods of Chinese word segmentation, common ones include:

  • Word segmentation method based on string matching
  • Statistical word segmentation method
  • Semantic-based word segmentation

The following introduces the classic word segmentation method based on string matching.

The word segmentation method based on string matching is also called the dictionary-based word segmentation method. It matches the Chinese string to be analyzed against the entries of a machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds and the corresponding word is identified. Matching strategies for this method include the maximum matching method (MM), the reverse maximum matching method (RMM), the word-by-word traversal method, the best matching method (OM), the parallel word segmentation method, and so on.

The steps of the forward maximum matching method are as follows, where n is the number of Chinese characters in the longest entry of the word segmentation dictionary.

  • ① Take the first n Chinese characters of the string still to be processed as the matching field and look it up in the word segmentation dictionary. If the dictionary contains this n-character word, the match succeeds and the matching field is split off as a word.
  • ② If the dictionary contains no such word, the match fails. Remove the last Chinese character from the matching field and look up the remaining characters as the new matching field.
  • ③ Repeat step ② until a match succeeds, then return to step ① for the rest of the string until the whole string has been segmented.

For example, take the sentence “北京理工大学生前来应聘” (“Beijing Institute of Technology students come to apply for a job”). Chinese word segmentation with the forward maximum matching method proceeds as follows.

Word segmentation algorithm: forward maximum matching method
Input characters: 北京理工大学生前来应聘
Word segmentation dictionary: 北京 (Beijing), 北京理工 (Beijing Institute of Technology), 理工 (science and technology), 大学 (university), 大学生 (college student), 生前 (before his death), 前来 (coming), 应聘 (applying)
Maximum length: 6

Matching process:

  • (1) Take a field of the maximum length 6, “北京理工大学”, and look it up in the dictionary. There is no match, so remove one character. The remaining “北京理工大” has no match either, so remove one more character to get “北京理工”. This word is in the dictionary, so the match succeeds. Result: match “北京理工”

  • (2) Take the next 6 characters, “大学生前来应”. There is no match in the dictionary, so keep removing characters from the end until the three characters “大学生” match. Result: match “大学生”

  • (3) The remaining string is “前来应聘”. There is no match in the dictionary, so keep removing characters from the end until “前来” matches. Result: match “前来”

  • (4) The last string, “应聘”, matches directly. Result: match “应聘”

  • Word segmentation result: 北京理工 \ 大学生 \ 前来 \ 应聘
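The forward maximum matching steps above can be sketched in plain Python. This is a minimal illustration rather than the implementation used by any particular tool. The function name forward_max_match is hypothetical, the dictionary is written out in Chinese with the eight entries of the example, and the fallback of emitting a single unmatched character is an assumption added so that the loop always advances.

```python
def forward_max_match(text, dictionary, max_len):
    """Segment text by forward maximum matching against a set of words."""
    words = []
    i = 0
    while i < len(text):
        # Start from the longest candidate and drop the last character on each failure.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption of this sketch: a single character is accepted even if it
            # is not in the dictionary, so the scan can always move forward.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                i += size
                break
    return words


# Beijing, Beijing Institute of Technology, science and technology, university,
# college student, before his death, coming, applying
dictionary = {"北京", "北京理工", "理工", "大学", "大学生", "生前", "前来", "应聘"}
result = forward_max_match("北京理工大学生前来应聘", dictionary, max_len=6)
print(" \\ ".join(result))
```

Because the method is greedy, the result depends entirely on the dictionary: the entry “生前” is never used here only because “大学生” is matched first.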

As Chinese data analysis becomes more and more popular and widely used, various Chinese word segmentation tools have been developed according to its semantic characteristics. Common word segmentation tools include:

  • Stanford Chinese word segmentation tool
  • Harbin Institute of Technology Language Cloud (LTP-cloud)
  • Chinese Lexical Analysis System of Chinese Academy of Sciences (ICTCLAS)
  • IKAnalyzer word segmenter
  • Pangu word segmenter
  • Paoding Jieniu word segmenter

Common Chinese word segmentation tools for Python include Pangu, Yaha, and Jieba, and their usage differs little. Jieba is fast and can import a dictionary of proper nouns, such as “Summer Palace” and “Huangguoshu Waterfall”, before segmenting, so this article uses the Jieba tool to explain Chinese word segmentation.

2. Jieba Chinese word segmentation usage

(1) Installation process
The author recommends that you use the PIP tool to install the Jieba Chinese word segmentation package. The installation statement is as follows:

pip install jieba

Running “pip install jieba” installs the jieba Chinese word segmentation package, as shown in the figure.

During installation, the progress of the packages and files being installed and configured is displayed as a percentage, and the message “Successfully installed jieba” indicates that the installation succeeded. You may run into various problems during installation. Learn to search for the answers yourself, so as to improve your ability to solve problems independently.

If you use the Anaconda Spyder integrated environment, open the “Anaconda Prompt” command line and enter “pip install jieba” there. If the package is already installed in your Python environment, you will be told that the Jieba Chinese word segmentation package already exists, as shown in the figure.

(2) Basic usage
First, let us look at a simple piece of Jieba word segmentation code. It uses the following functions.

  • jieba.cut(text,cut_all=True)
    Word segmentation function. The first parameter is the string to segment, and the second parameter selects full mode (True) or exact mode (False, the default). The result is an iterable generator: you can loop over it with for to get each word, and it is recommended that you convert it into a list before using it.
  • jieba.cut_for_search(text)
    Search engine mode word segmentation. The parameter is the string to segment. This method suits building the inverted index of a search engine, and its granularity is relatively fine.

#By:Eastmount CSDN
import jieba  

text = "Xiao Yang graduated from Beijing Institute of Technology and is engaged in Python artificial intelligence related work."  

#Full mode 
data = jieba.cut(text,cut_all= True )
print( u"[full mode]: " , "/" .join(data))

#precise mode   
data = jieba.cut(text,cut_all= False )
print( u"[exact mode]: " , "/" .join(data))

#Default is exact mode
data = jieba.cut(text)  
print( u"[default mode]: " , "/" .join(data))

#search engine mode
data = jieba.cut_for_search(text)    
print( u"[Search engine mode]: " , "/" .join(data))

#return list 
seg_list = jieba.lcut(text, cut_all= False )
print( "[return list]: {0}" .format(seg_list))

The output is shown below.

The final word segmentation result is good. The exact-mode output, shown here in English translation as “Xiao / Yang / graduated / from / Beijing Institute of Technology / , / engaged in / Python / artificial intelligence / related / work / .”, is the more accurate one. The three modes of Jieba word segmentation are briefly described below.

Full mode
This mode scans out every sequence in the sentence that can form a word. The advantage is that it is very fast. The disadvantage is that it cannot resolve ambiguity, so the segmentation result is not very accurate. The result is “Xiao / Yang / graduated / from / Beijing / Beijing Institute of Technology / Beijing Institute of Technology / Technology / Technology University / Technology University / Technology University / University / / / engaged in / Python / artificial / artificial intelligence / intelligence / related / work / /”, in which the overlapping sub-words of the university name all appear.

Exact mode
This mode tries to split the sentence as precisely as possible. It is suitable for text analysis and is the mode usually used for Chinese word segmentation. The result is “Xiao / Yang / graduated / from / Beijing Institute of Technology / , / engaged in / Python / artificial intelligence / related / work / .”. Complete nouns such as “Beijing Institute of Technology” and “artificial intelligence” are recognized accurately, but some words are not recognized. Importing a dictionary later makes it possible to recognize such proper nouns.

Search engine mode
This mode builds on exact mode and segments long words again to improve recall. It is suitable for word segmentation in search engines. The result is “Xiao / Yang / graduated / from / Beijing / Polytechnic / Technical University / University / Polytechnic University / Beijing Institute of Technology / , / engaged in / Python / artificial / intelligence / artificial intelligence / related / work / .”.

The Jieba Chinese word segmentation package for Python relies mainly on three techniques: efficient word-graph scanning over a Trie tree structure, which builds a directed acyclic graph (DAG) of the possible words; dynamic programming to find the maximum-probability path, that is, the best segmentation based on word frequency; and an HMM model, based on the ability of Chinese characters to form words, for words that are not in the dictionary. These algorithms are not described in detail here, because this article focuses on application cases. Jieba also supports Traditional Chinese word segmentation and custom dictionaries.

  • jieba.load_userdict(f)

(3) Chinese word segmentation example
The following code performs Chinese word segmentation on the corpus in Table 1. It reads the file line by line, calls Jieba to segment each line, and prints and collects the results.

#By:Eastmount CSDN
import jieba

content = []                                       # segmented lines of the corpus
with open("test.txt", "r", encoding="utf-8") as source:
    for line in source:
        line = line.rstrip("\n")
        if not line:                               # skip blank lines
            continue
        seglist = jieba.cut(line, cut_all=False)   # exact mode
        output = " ".join(seglist)                 # join the words with spaces
        print(output)
        content.append(output)
The output is shown in the figure, where you can see the corpus after word segmentation.

3. Data cleaning

1. Overview of Data Cleaning

Dirty data usually refers to data that is of low quality, inconsistent, or inaccurate, as well as erroneous data caused by human error. The author divides common dirty data into four categories:

  • Incomplete data
    Data with missing information, which usually needs to be filled in before it is written to the database or file. For example, when counting the sales data for the 30 days of September, the data for a few days may have been lost. In that case the data needs to be completed.
  • Duplicate data
    The data set may contain duplicate records. In this case, the duplicates should be exported for the customer to confirm and correct, so as to ensure the accuracy of the data. During the cleaning and conversion phase, do not decide lightly to delete duplicate items, and above all do not filter out important or business-relevant data. Verifying and confirming duplicates is essential.
  • Wrong data
    This type of dirty data often appears in website databases. It arises when the business system is not robust enough and writes input directly to the back-end database without validating it, or after a wrong operation. Examples are string data followed by a stray carriage return and dates in an incorrect format. Such errors can be picked out of the business system database with SQL statements and then handed over to the business department for correction.
  • Stop word data
    Not all words in the corpus after word segmentation are relevant to the document content. Words that appear frequently but carry little meaning are called stop words and need to be filtered out.

Data cleaning deals with dirty data and improves data quality. It is mainly used in data warehousing, data mining, data quality management, and similar fields. Readers can think of it simply: any process that helps solve data quality problems counts as data cleaning, although different fields define it somewhat differently. In short, the purpose of data cleaning is to ensure data quality and provide accurate data, and its task is to filter or correct data that does not meet the requirements, paving the way for subsequent data analysis.

To address the problems above, data cleaning methods are divided into:

  • Resolving missing data
    Null or missing data is usually filled in with an estimate. Common estimates include the sample mean, median, mode, maximum, and minimum. For example, the mean of all the data can be used to fill in the missing values. These methods introduce some error, and if there are too many null values the results will be affected and deviate from the actual situation.
  • Resolving duplicate data
    Simple duplicates can be identified manually, whereas resolving duplicates by computer is more complicated. It usually involves entity recognition technology: identify similar records that point to the same entity, and then correct these duplicates.
  • Resolving erroneous data
    Erroneous data is usually identified with statistical methods such as deviation analysis, regression equations, or the normal distribution. Alternatively, a simple rule base can check numerical ranges, and the constraints between attributes can be used to proofread the data.
  • Resolving stop words
    The concept of stop words was proposed by Hans Peter Luhn and has made a great contribution to information processing. Stop words are usually collected in a stop word list. They are often added manually based on experience and are general-purpose. Stop words are handled by filtering against a stop word dictionary or stop word list. For example, words such as “and”, “dang”, “di”, and “ah” have no specific meaning and need to be filtered, as do some phrases such as “we”, “but”, “don’t say”, and “and”.
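The first three cleaning methods above can be sketched with the standard library alone. The example below is a minimal illustration under stated assumptions: missing values are marked with None, duplicates are exact copies, and the range rule is a fixed lower and upper bound. The function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def find_duplicates(records):
    """Return the records that occur more than once, for manual confirmation."""
    counts = Counter(records)
    return [record for record, count in counts.items() if count > 1]


def out_of_range(values, low, high):
    """Return the values that violate a simple range rule."""
    return [v for v in values if not (low <= v <= high)]


daily_sales = [120, None, 130, 110, None]          # hypothetical sales figures
print(fill_missing(daily_sales))
print(find_duplicates(["a", "b", "a", "c", "b"]))
print(out_of_range([5, 40, -3], low=0, high=31))   # e.g. day-of-month values
```

The duplicate check only reports candidates instead of deleting them, in line with the advice above to confirm duplicates before removing anything.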

2. Chinese corpus cleaning

Once the Chinese text corpus crawled with Python has been segmented, it needs data cleaning, which usually includes stop word filtering and removal of special punctuation. For null and duplicate data, the author recommends making simple checks or filling in missing values during the data crawling process. The following is an example of data cleaning for the Chinese corpus in Table 1 (covering the three topics of Guizhou, big data, and love).

(1) Stop word filtering
The figure above shows the result of Chinese word segmentation with Jieba, but it still contains stop words that appear frequently without affecting the topic of the text. For example, in the sentence “data analysis is the product of the combination of mathematics and computer science”, words such as “is”, “and”, and “of” need to be filtered out during preprocessing.

Here the author defines a list of common stop words that fits the data set and then compares each word or phrase of the segmented sequence with the stop word list. A word that appears in the list is deleted, so that the text retained at the end reflects the topic of each line of the corpus as well as possible. The code is as follows:

#By:Eastmount CSDN
import jieba

# Stop word list. English glosses in order: of, or, etc., is, have, of, with,
# and, also, by, (question particle), in, in, most
stopwords = {'的', '或', '等', '是', '有', '之', '与',
             '和', '也', '被', '吗', '于', '中', '最'}

content = []                                       # full text
with open("test.txt", "r", encoding="utf-8") as source:
    for line in source:
        line = line.rstrip("\n")
        if not line:                               # skip blank lines
            continue
        seglist = jieba.cut(line, cut_all=False)   # exact mode
        final = []                                 # words left after removing stop words
        for seg in seglist:
            if seg not in stopwords:
                final.append(seg)
        output = " ".join(final)                   # join the words with spaces
        print(output)
        content.append(output)

The stopwords variable defines the stop word list. Only the common stop words relevant to our test.txt corpus are listed here. In real preprocessing, a general stop word list covering all kinds of stop words is usually imported from a file, and readers can search for such lists online.
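Loading the stop word list from a file can be sketched as follows. This is only an illustration: it assumes a UTF-8 text file with one stop word per line, the file name stopwords.txt and both helper functions are hypothetical, and a temporary file stands in for a real downloaded list.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]


# A temporary stand-in for a real stop word file (assumed format: one word per line).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("的\n是\n与\n\n")
    stopwords = load_stopwords(path)

# "data analysis / is / mathematics / and / computer science / combined / of / product"
tokens = ["数据分析", "是", "数学", "与", "计算机科学", "相结合", "的", "产物"]
filtered = remove_stopwords(tokens, stopwords)
print(" ".join(filtered))
```

A set is used because membership tests on a set take constant time on average, which matters once the list grows to thousands of entries.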

The core of the code is the for loop, which checks whether each segmented word is in the stop word list. If it is not, the word is appended to the new list final, so that only the filtered text is kept in the end, as shown in the figure.

(2) Remove punctuation marks
In text analysis, punctuation marks would otherwise be counted as features and distort the results, so they need to be filtered as well. The method is the same as for stop words: create a list of punctuation marks, or add them to the stopwords list. The stop word list then becomes:

stopwords = {'的', '或', '等', '是', '有', '之', '与',
             '和', '也', '被', '吗', '于', '中', '最',
             '“', '”', '。', ',', '?', '、', ';'}

At the same time, the text content is stored in the local result.txt file. The complete code is as follows:

# coding=utf-8
#By:Eastmount CSDN
import jieba

# Stop word list: the function words from before plus Chinese punctuation marks
stopwords = {'的', '或', '等', '是', '有', '之', '与',
             '和', '也', '被', '吗', '于', '中', '最',
             '“', '”', '。', ',', '?', '、', ';'}

content = []                                       # full text
with open("test.txt", "r", encoding="utf-8") as source, \
     open("result.txt", "w", encoding="utf-8") as result:
    for line in source:
        line = line.rstrip("\n")
        if not line:                               # skip blank lines
            continue
        seglist = jieba.cut(line, cut_all=False)   # exact mode
        final = []                                 # words left after removing stop words
        for seg in seglist:
            if seg not in stopwords:
                final.append(seg)
        output = " ".join(final)                   # join the words with spaces
        print(output)
        content.append(output)
        result.write(output + "\r\n")              # store the cleaned line

The output is shown in Figure 7. The resulting corpus is concise and reflects the text topics as far as possible: lines 1-3 are about Guizhou tourism, lines 4-6 about big data, and lines 7-9 about love.
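Listing every punctuation mark by hand is easy to get wrong. As a hedged alternative to the approach above, and not the author’s method, punctuation can be detected through its Unicode category with the standard unicodedata module. The helper name is_punctuation and the sample tokens are hypothetical.

```python
import unicodedata


def is_punctuation(token):
    """True if every character of the token is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


# Sample tokens as a segmenter might return them, including full-width punctuation.
tokens = ["贵州省", ",", "简称", "“", "黔", "”", "。", "Python", "?"]
kept = [token for token in tokens if not is_punctuation(token)]
print(" ".join(kept))
```

This covers both ASCII and full-width Chinese punctuation, but symbols such as “+” belong to other Unicode categories and would still need an explicit list.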

4. Feature extraction and vector space model

This section introduces the basics of feature extraction, the vector space model, and cosine similarity, and uses the corpus in Table 1 to calculate cosine similarity based on the vector space model.

1. Feature reduction

The corpus obtained after web crawling, Chinese word segmentation, and data cleaning is usually called the initial feature set. It usually consists of high-dimensional data, and not all of its features are important. High-dimensional data may contain irrelevant information that degrades the performance of the algorithm, and it can even lead to the curse of dimensionality, which affects the results of data analysis.

Research has found that reducing the redundant (weakly correlated) dimensions of the data, or extracting more valuable features, can effectively speed up computation and improve efficiency while preserving the accuracy of the experimental results. In academic terms this is called feature reduction.

Feature reduction means selecting the features relevant to the data analysis application so as to obtain the best performance with less processing effort. It consists of two tasks: feature selection and feature extraction. Both find the most effective features among the original ones, features that characterize the original dataset as well as possible.

(1) Feature extraction
Feature extraction is to convert the original features into a set of core features with obvious physical or statistical significance, and the extracted set of features can represent the original corpus as much as possible. Feature extraction is divided into linear feature extraction and nonlinear feature extraction. Common methods of linear feature extraction include:

  • PCA (principal component analysis). This method finds the optimal subspace representing the data distribution, reducing the dimension of the original data and extracting uncorrelated components. It is often used for dimensionality reduction; see the earlier article on clustering.
  • LDA Linear Discriminant Analysis method. This method finds the subspace with the largest separability criterion.
  • ICA independent component analysis method. This method reduces the dimensionality of the original data and extracts mutually independent attributes, looking for a linear transformation.

Common methods of nonlinear feature extraction include Kernel PCA, Kernel FDA, etc.

(2) Feature selection
Feature selection picks the most statistically significant features from the feature set to achieve dimensionality reduction. It usually consists of four parts: a generation procedure, an evaluation function, a stopping criterion, and a validation procedure. Traditional methods include the Information Gain (IG) method, random sequence generation selection algorithms, Genetic Algorithms (GA), and so on.
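To make the evaluation function concrete, the sketch below scores terms by information gain using the common definition IG(t) = H(C) − [P(t)·H(C|t) + P(¬t)·H(C|¬t)], where H is the Shannon entropy of the class labels. It is a minimal illustration under stated assumptions: each document is a set of terms, each document has one class label, and the tiny labeled corpus is hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def information_gain(term, docs, labels):
    """Entropy of the labels minus the expected entropy after splitting on the term."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = (
        len(with_term) / total * entropy(with_term)
        + len(without_term) / total * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical corpus: Guizhou/travel, Guizhou/capital, data/analysis, data/algorithm
docs = [{"贵州", "旅游"}, {"贵州", "省会"}, {"数据", "分析"}, {"数据", "算法"}]
labels = ["travel", "travel", "data", "data"]

for term in ["贵州", "分析", "爱情"]:
    print(term, round(information_gain(term, docs, labels), 4))
```

A term that separates the classes perfectly gets the highest score, and a term that appears in no document gets zero, so ranking terms by this score and keeping the top ones is one simple way to select features.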

The following figure shows an example from image processing: extracting the edge-line features of the Lena image. A limited number of features describes the outline of the whole person as well as possible, and the same principle applies in data analysis.

2. Vector space model

The Vector Space Model (VSM for short) represents a document as a vector, which converts Chinese text into numerical features for data analysis. As one of the most mature and widely used text representation models, it has been applied in data analysis, natural language processing, Chinese information retrieval, data mining, text clustering, and other fields, with good results.

A text corpus is represented by a vector space model, which converts a document (Document) or a web corpus (Web Dataset) into a series of keywords (Key) or feature items (Term) vectors.

  • Feature item (Term)
    A feature item is a basic language unit (a character, word, or phrase) contained in a document; together these units make up the content the document expresses. In the text representation model, such a basic language unit is called a feature item of the text. For example, a text Doc containing n feature items is expressed as:

    $$Doc = \{t_1, t_2, t_3, \dots, t_{n-1}, t_n\}$$

  • Feature weight (Term Weight)
    The feature weight wi is assigned to a feature item ti (1≤i≤n) in the document to indicate how important that feature item is to the document content. The higher the weight, the better the feature item reflects its importance in the document. The text Doc has n feature items, {t1, t2, t3, … , tn-1, tn}, which form an n-dimensional coordinate system; the weight wi of each feature item ti is then calculated and used as the coordinate value of the corresponding feature. The text is represented by its feature weights as follows, where WDoc is called the feature vector of the text Doc:

    $$W_{Doc} = \{w_1, w_2, \dots, w_n\}$$

  • Document Representation
    Once the feature items and feature weights have been obtained, a document is represented by the following formula. The document Doc contains n feature words and n weights in total; the ti (i=1,2,…,n) are feature words that differ from one another, and wi(d) is the weight of feature word ti in document d, which can usually be taken as the frequency with which ti appears in d.

    $$Doc = \{(t_1, w_1(d)),\ (t_2, w_2(d)),\ \dots,\ (t_n, w_n(d))\}$$

There are many different calculation methods for the feature item weight W. The simplest method is to use the number of occurrences of the feature item in the text as the weight of the feature item, which will be described in detail in the fifth part.

As you can see from the above figure, documents are stored as word frequency vectors of the form {1,0,1,0,…,1,1,0}. Selecting the feature items and calculating their weights are the two core issues of the vector space model. For the feature vector to reflect the meaning of the text well, reasonable feature items must be selected, and the weights must follow one principle: the more a feature item influences the content of the text, the greater its weight.

3. Cosine similarity calculation

After the vectors of two articles have been computed with the vector space model above, the similarity of the two articles can be measured by the cosine of the angle between the two vectors. The similarity of texts D1 and D2 is calculated as follows:

$$\cos(D_1, D_2) = \frac{\sum_{k=1}^{n} w_k(D_1)\, w_k(D_2)}{\sqrt{\sum_{k=1}^{n} w_k(D_1)^2}\,\sqrt{\sum_{k=1}^{n} w_k(D_2)^2}}$$

where the numerator is the dot product of the two vectors and the denominator is the product of their magnitudes. Once the cosine similarity has been calculated, the similarity of any two articles is known. Documents with high similarity can be assigned to the same topic, or a threshold can be set for cluster analysis. The idea of the method is to turn a language problem into a mathematical one.

The following figure is a vector space model diagram in which documents are drawn as vectors over the axes Term1, Term2, …, TermN; it shows how the cosine similarity between documents is calculated. The more similar two documents are, the smaller the angle θ between them and the closer the Cos value is to 1. When two documents point in exactly the same direction, the angle is 0° and the Cos value is 1.

Below we take “Beijing Institute of Technology college students come to apply” as the reference sentence and calculate its cosine similarity with two other sentences. Given the three sentences below, we want to see which of sentence 2 and sentence 3 is more similar to sentence 1; the more similar sentence is considered closer in topic. So how is the similarity between two sentences calculated?

Sentence 1: Beijing Institute of Technology college students come to apply

Sentence 2: Tsinghua University college students also come to apply

Sentence 3: I like to write code

The following uses the vector space model, word frequency, and cosine similarity to calculate the similarity of sentence 2 and of sentence 3 to sentence 1.

Step 1: Chinese word segmentation.

Sentence 1: Beijing Institute of Technology / college students / come / apply

Sentence 2: Tsinghua University / college students / also / come / apply

Sentence 3: I / like / write / code

Step 2: List all the words in the order in which they appear.

Beijing Institute of Technology / college students / come / apply / Tsinghua University / also / I / like / write / code

Step 3: Calculate the word frequency, as shown in the table.

Step 4: Write the word frequency vector.

Sentence 1: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

Sentence 2: [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Sentence 3: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

Step 5: Calculate the cosine similarity.

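The results show that the similarity between sentence 1 and sentence 2 is 0.67: the dot product of the two vectors is 3 and their lengths are √4 = 2 and √5, so the cosine is 3 / (2 × √5) ≈ 0.67, which indicates some similarity in topic. The similarity between sentence 1 and sentence 3 is 0 because they share no words, so they are completely dissimilar.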
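The five steps above can be sketched in plain Python. This is only a minimal illustration: the English glosses stand in for the segmented Chinese tokens, and the helper names `tf_vector` and `cosine_similarity` are made up for this example rather than taken from any library.

```python
import math

# Step 1: segmented sentences (English glosses of the Chinese tokens).
sentences = {
    "sentence1": ["Beijing Institute of Technology", "college students", "come", "apply"],
    "sentence2": ["Tsinghua University", "college students", "also", "come", "apply"],
    "sentence3": ["I", "like", "write", "code"],
}

# Step 2: list all words in the order in which they first appear.
vocabulary = []
for tokens in sentences.values():
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 3-4: word frequency vector of one sentence over the vocabulary.
def tf_vector(tokens):
    return [tokens.count(term) for term in vocabulary]

vectors = {name: tf_vector(tokens) for name, tokens in sentences.items()}

# Step 5: cosine similarity = dot product / product of the vector lengths.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

sim_12 = cosine_similarity(vectors["sentence1"], vectors["sentence2"])
sim_13 = cosine_similarity(vectors["sentence1"], vectors["sentence3"])
print(round(sim_12, 2), round(sim_13, 2))  # 0.67 0.0
```

Building the vocabulary in order of first appearance keeps the vector components in the same order as the word list in step 2, so the printed vectors can be compared directly with the ones written out by hand.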

In conclusion, cosine similarity is a very useful algorithm that can be applied whenever you want to know how similar two vectors are. The closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees and the more similar the two vectors are. However, as a simple similarity measure it also has drawbacks: the amount of computation can be large, and it does not consider the correlation between words.

5. Weight calculation

The word frequency weighting described above is very simple; this section introduces other weight calculation methods.

Weight calculation measures the importance of feature items in the document representation by assigning each feature word a weight. Commonly used weight calculation methods include Boolean weight, absolute word frequency, inverse document frequency, TF-IDF, TFC, entropy weight, etc.

1. Common weight calculation methods

(1) Boolean weight
Boolean weight is a relatively simple weight calculation method in which the weight is either 1 or 0. If the feature word appears in the text, the component of the text vector corresponding to that feature word is 1; if it does not appear, the component is 0. The formula is as follows, where wij represents the weight of the feature word ti in the text Dj.

$$w_{ij} = \begin{cases} 1, & t_i \in D_j \\ 0, & t_i \notin D_j \end{cases}$$

Suppose the feature words are:

  • {Beijing Institute of Technology, college students, come, apply, Tsinghua University, also, I, like, write, code}

To calculate the weights for the sentence “Beijing Institute of Technology college students come to apply”, each feature word that appears in the sentence gets a component of 1, and each feature word that does not appear gets 0. The resulting feature vector is:

  • {1,1,1,1,0,0,0,0,0,0}

However, in practical applications a 0/1 Boolean weight cannot reflect how important a feature word is in the text, which is why the word frequency method was introduced.
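As a small illustration of the Boolean weighting just described, the sketch below marks each feature word as present (1) or absent (0). The function name `boolean_weights` is made up for this example, and the English glosses stand in for the segmented Chinese tokens.

```python
feature_words = ["Beijing Institute of Technology", "college students", "come", "apply",
                 "Tsinghua University", "also", "I", "like", "write", "code"]

def boolean_weights(tokens, features):
    """Return 1 for every feature word that occurs in tokens, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in features]

# Segmented form of "Beijing Institute of Technology college students come to apply".
sentence = ["Beijing Institute of Technology", "college students", "come", "apply"]
weights = boolean_weights(sentence, feature_words)
print(weights)  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

A word that occurs five times gets the same weight as a word that occurs once, which is exactly the limitation noted above.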

(2) Absolute word frequency
The word frequency method, also called absolute word frequency (Term Frequency, TF for short), first counts how often each feature word appears in the document and then uses these counts to represent the text. It is usually written tfij, the frequency of the feature word ti in the training text Dj.

Suppose the sentence is “College students of Beijing Institute of Technology and college students of Tsinghua University come to apply”, and the corresponding feature words are: {Beijing Institute of Technology, college students, come, apply, Tsinghua University, also, I, like, write, code, of, and}. The corresponding word frequency vector is:

  • {1,2,1,1,1,0,0,0,0,0,2,1}

The previous example of calculating text cosine similarity using the vector space model also uses word frequency, which is one of the simplest and most effective methods of weight calculation.
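The absolute word frequency vector above can be reproduced with `collections.Counter`. This is a sketch under two assumptions: the English glosses stand in for the segmented Chinese tokens, and the eleventh feature word is taken to be the particle “of” (的), which occurs twice in the sentence.

```python
from collections import Counter

feature_words = ["Beijing Institute of Technology", "college students", "come", "apply",
                 "Tsinghua University", "also", "I", "like", "write", "code", "of", "and"]

# Assumed segmentation of the example sentence, in the original word order.
tokens = ["Beijing Institute of Technology", "of", "college students", "and",
          "Tsinghua University", "of", "college students", "come", "apply"]

counts = Counter(tokens)                         # occurrences of each token
tf = [counts[term] for term in feature_words]    # missing words count as 0
print(tf)  # [1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 2, 1]
```

Unlike the Boolean vector, this one records that “college students” and “of” each occur twice.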

(3) Inverted document frequency
The word frequency method cannot reflect the distinguishing power of low-frequency feature items. Some feature items occur very frequently but have little influence on the text, such as “we”, “but”, and “of”; other feature items occur rarely yet express the core idea of the whole text and play a vital role.

The Inverse Document Frequency (IDF) method is a classic method proposed by Spärck Jones in 1972 for weighting words according to how many documents they occur in. The formula is as follows:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in D_j\}|}$$

Here |D| is the total number of texts in the corpus, and |{j : ti ∈ Dj}| is the number of texts that contain the feature word ti.

In the inverse document frequency method, the weight decreases as the number of documents containing the feature word increases. Common words such as “we”, “but”, and “of” appear in almost all documents, so their IDF values are very low; a word that appears in every document gets log 1 = 0, which removes the effect of such common words. Conversely, if a word such as “Python” appears in only one document, its weight is very high.
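The behaviour described above can be checked with a few lines of Python. This is a minimal sketch: the toy corpus and the function name `inverse_document_frequency` are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math

# Hypothetical toy corpus: each document is a list of segmented words.
documents = [
    ["we", "study", "python"],
    ["we", "like", "data", "analysis"],
    ["we", "travel", "guizhou"],
]

def inverse_document_frequency(term, docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(docs) / containing)

idf_we = inverse_document_frequency("we", documents)          # occurs in every document
idf_python = inverse_document_frequency("python", documents)  # occurs in one document
print(idf_we, round(idf_python, 3))  # 0.0 1.099
```

The word that occurs in every document gets a weight of exactly 0, while the word that occurs in a single document gets the largest weight the corpus allows, log 3.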

There are many other weight calculation methods, including TF-IDF, entropy weight, TF-IWF, and error-driven feature weight algorithms, which readers can study on their own. Only the most basic methods are briefly introduced here.


2. TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a classic weight calculation technique widely used in data analysis and information processing. It calculates the importance of a feature word from the number of times the word appears in a text and from its document frequency across the whole corpus. Its advantage is that it filters out common but uninformative words while keeping as many influential feature words as possible.

TF (Term Frequency) is the frequency or number of times a keyword appears in an article. IDF (Inverse Document Frequency), also called the inverse text frequency, is derived from the reciprocal of the document frequency and is mainly used to reduce the effect of words that are common to all documents but have little influence on any one of them. The complete formula of TF-IDF is as follows:

$$tfidf_{i,j} = tf_{i,j} \times idf_i$$

where tfidfi,j is the product of the word frequency tfi,j and the inverse document frequency idfi. The TF-IDF weight grows with the frequency of the feature item in the document and shrinks as the number of documents in the corpus that contain the feature item grows. The larger the value of tfidfi,j, the more important the feature word is to the text.

The calculation formula of the TF word frequency is as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where ni,j is the number of times the feature word ti appears in the training text Dj, and the denominator is the total number of occurrences of all feature words in the text Dj. The result is the word frequency of the feature word.

Combining the two gives the full TF-IDF formula:

$$tfidf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{j : t_i \in D_j\}|}$$

The core idea of TF-IDF is that if a feature word has a high frequency TF in one article and rarely appears in other articles, the word or phrase distinguishes well between categories and is suitable for weight calculation. The TF-IDF algorithm is simple and fast, and its results usually match intuition. Its disadvantage is that measuring the importance of a word mainly by word frequency is not comprehensive: important words sometimes appear only a few times, and the algorithm cannot reflect the position of words in the text.
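To make the TF-IDF definition concrete, here is a minimal sketch that multiplies relative term frequency by log inverse document frequency. The toy corpus and the function name `tf_idf` are hypothetical, and the natural logarithm is an assumption.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of segmented words.
documents = [
    ["we", "study", "python", "python"],
    ["we", "like", "data", "analysis"],
    ["we", "travel", "guizhou"],
]

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)  # number of word occurrences in this document
        weights.append({term: (n / total) * math.log(n_docs / df[term])
                        for term, n in counts.items()})
    return weights

weights = tf_idf(documents)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

In the first document, “we” receives weight 0 because it occurs in every document, and “python” receives twice the weight of “study” because it occurs twice as often while both words are equally rare across the corpus.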

3. Calculating TF-IDF with Sklearn

Scikit-Learn is a Python-based machine learning module. Its basic functions fall into six parts: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing. For details, please refer to the documentation on the official website. The installation and usage of Scikit-Learn are described in detail earlier in this book. This section mainly uses two Scikit-Learn classes, CountVectorizer and TfidfTransformer, to calculate word frequency and TF-IDF values.

  • CountVectorizer
    This class converts the words of a text into a word frequency matrix. For example, the text “I am a teacher” contains four words, “I”, “am”, “a”, and “teacher”, each of which appears once, so each has a word frequency of 1. (With its default settings, CountVectorizer only keeps tokens of two or more characters, so single-character tokens such as “I” and “a” would be dropped.) CountVectorizer generates a matrix a[M][N] for M texts and N words, where a[i][j] is the frequency of word j in the i-th text. Call the fit_transform() function to count the occurrences of each word, and the get_feature_names_out() function (get_feature_names() in scikit-learn versions before 1.0) to obtain all the words in the vocabulary.

The code below calculates the word frequency of the result.txt text. The contents of result.txt, shown below, are the result of Chinese word segmentation and data cleaning applied to the data set in Table 1.

Guizhou Province is located in Southwest China, referred to as Qiangui
Traveling all over the land of Shenzhou, intoxicating and colorful Guizhou
Guiyang City is known as the capital of the forest in Guizhou Province.
Data analysis is a combination of mathematical computer science
Regression Clustering Classification Algorithms Widely Used in Data Analysis
Data crawling data storage data analysis closely related process
The sweetest love The bitterest love
An egg can paint countless times of love
True love often treasures the most ordinary ordinary life

The code is as follows:

#By:Eastmount CSDN
from sklearn.feature_extraction.text import CountVectorizer

# Read the corpus: each line of result.txt is one document
corpus = []
with open('result.txt', 'r', encoding="utf-8") as f:
    for line in f:
        corpus.append(line.strip())

# Convert the words in the text into a word frequency matrix
vectorizer = CountVectorizer()

# Count the number of times each word appears
X = vectorizer.fit_transform(corpus)

# Get all the words in the bag of words
# (use get_feature_names() on scikit-learn versions before 1.0)
word = vectorizer.get_feature_names_out()
for n in range(len(word)):
    print(word[n], end=" ")
print('')

# View the word frequency results
print(X.toarray())

The output is shown below.

  • TfidfTransformer
    After the CountVectorizer class has calculated the word frequency matrix, the TfidfTransformer class is used to calculate the TF-IDF value of each word in that matrix. The extended code is as follows.

#By:Eastmount CSDN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Read the corpus: each line of result.txt is one document
corpus = []
with open('result.txt', 'r', encoding="utf-8") as f:
    for line in f:
        corpus.append(line.strip())

vectorizer = CountVectorizer()             # Convert the words in the text into a word frequency matrix
X = vectorizer.fit_transform(corpus)       # Count the number of times each word appears
word = vectorizer.get_feature_names_out()  # All words in the bag of words (get_feature_names() before scikit-learn 1.0)
for n in range(len(word)):
    print(word[n], end=" ")
print('')
print(X.toarray())                         # View the word frequency results

# Calculate the TF-IDF values
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)       # Convert the word frequency matrix X into TF-IDF values
# View the data structure
print(tfidf.toarray())                     # tfidf[i][j] is the TF-IDF weight of word j in the i-th text

Part of the output is shown in the figure below.

The TF-IDF values are stored as a matrix: each row represents one text of the corpus, and each column in a row holds the weight of one feature. Once the TF-IDF values have been obtained, various data analysis algorithms can be applied, such as cluster analysis, LDA topic distribution, public opinion analysis, etc.
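The numbers printed by TfidfTransformer will generally not match a hand calculation with the textbook formula. As far as I understand the scikit-learn documentation, the default settings (`smooth_idf=True`, `norm='l2'`) apply a smoothed IDF, ln((1 + n) / (1 + df)) + 1, to the raw counts and then scale each row to unit Euclidean length. The sketch below reimplements that description in plain Python for illustration; `sklearn_style_tfidf` is a made-up name, and this is an approximation of the documented behaviour, not the library's actual code.

```python
import math

def sklearn_style_tfidf(count_matrix):
    """Smoothed IDF on raw counts, followed by L2 normalization of each row."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency of each column (term).
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so it never drops to 0.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_matrix:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Hypothetical word frequency matrix: 2 documents, 3 words.
counts = [
    [1, 1, 0],
    [1, 0, 2],
]
tfidf = sklearn_style_tfidf(counts)
for row in tfidf:
    print([round(v, 3) for v in row])
```

Two consequences are worth noting: a word that occurs in every document keeps a non-zero weight, and every row has length 1, which is why the values in the output above all lie between 0 and 1.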

6. Text Clustering

Once the TF-IDF values of the text have been obtained, they can be used for text clustering. This section briefly explains the process, which consists of the following five steps:

  • The first step is to generate a word frequency matrix from the corpus after Chinese word segmentation and data cleaning. The CountVectorizer class is used to calculate the word frequency matrix, and the generated matrix is X.
  • The second step is to call the TfidfTransformer class to calculate the TF-IDF values of the word frequency matrix X, which gives the weight matrix.
  • The third step is to call the KMeans class of the Sklearn machine learning package to perform clustering, with the number of clusters n_clusters set to 3, corresponding to the three topics of the corpus: Guizhou, data analysis, and love. The fit_predict() function then trains the model and assigns the predicted class labels to the y_pred array.
  • The fourth step is to call the PCA() function of the Sklearn library for dimensionality reduction. The TF-IDF result is a multidimensional array holding the weights of all features for the 9 lines of text, and these features need to be reduced to two dimensions, corresponding to the X and Y axes, before plotting.
  • The fifth step is to call Matplotlib functions to visualize the result: draw the clustering plot and set the plot parameters, title, axis labels, etc.

The code is as follows.

# coding:utf-8
#By:Eastmount CSDN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Step 1: generate the word frequency matrix
corpus = []
with open('result.txt', 'r', encoding="utf-8") as f:
    for line in f:
        corpus.append(line.strip())
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
for n in range(len(word)):
    print(word[n], end=" ")
print('')

# Step 2: calculate the TF-IDF values
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weight = tfidf.toarray()

# Step 3: KMeans clustering
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(weight)   # Train the model and predict the class labels
print(clf)
print(clf.cluster_centers_)        # Cluster centers
print(clf.inertia_)                # Sum of squared distances of samples to their nearest cluster center; smaller means tighter clusters
print(y_pred)                      # Predicted class labels

# Step 4: dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)          # Reduce to two dimensions for plotting
newData = pca.fit_transform(weight)
x = [n[0] for n in newData]
y = [n[1] for n in newData]

# Step 5: visualization
import matplotlib.pyplot as plt
plt.scatter(x, y, c=y_pred, s=100, marker='s')
plt.show()

The clustering output is shown in the figure.

A total of 6 points are drawn in the plot, and the data are clustered into three categories shown in different colors. The corresponding class labels are:

  • [2 0 2 0 0 0 1 1 0]

It groups lines 1 and 3 of the corpus together with class label 2; lines 2, 4, 5, 6, and 9 form a second group with class label 0; and lines 7 and 8 form the last group with class label 1. In the real data set, lines 1, 2, and 3 are about Guizhou, lines 4, 5, and 6 are about data analysis, and lines 7, 8, and 9 are about love, so the prediction contains some errors (lines 2 and 9 are misassigned). Such errors are to be expected in data analysis, and the goal is to reduce them as much as possible, much as a deep learning model improves through continued training.
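One simple way to put a number on this error is cluster purity: for each cluster, count the members of its most common true topic, then divide the total by the number of documents. Purity is a metric introduced here only for illustration and is not part of the original workflow. The sketch uses the labels printed above; since KMeans starts from random centers, another run may produce different labels.

```python
from collections import Counter

y_pred = [2, 0, 2, 0, 0, 0, 1, 1, 0]   # class labels printed above
y_true = ["Guizhou"] * 3 + ["data analysis"] * 3 + ["love"] * 3

def purity(pred, true):
    """Fraction of documents that belong to the majority topic of their cluster."""
    clusters = {}
    for p, t in zip(pred, true):
        clusters.setdefault(p, []).append(t)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(pred)

score = purity(y_pred, y_true)
print(round(score, 3))  # 0.778
```

Seven of the nine lines fall in a cluster dominated by their own topic, which matches the two misassigned lines (2 and 9) noted above.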

You may be wondering why only 6 points are plotted for 9 lines of data. The X and Y coordinates produced by dimensionality reduction for the 9 lines are listed below, and some of them are identical. This is because the 9 lines of the corpus contain few words, most of which occur only once, so different lines can end up with the same coordinates after the word frequency matrix and TF-IDF values are reduced to two dimensions. In a real analysis the corpus contains many more words, the points in the cluster plot are more scattered, and the plot reflects the results of the analysis more intuitively.

[[-0.19851936  0.594503  ]
 [-0.07537261  0.03666604]
 [-0.19851936  0.594503  ]
 [-0.2836149  -0.40631642]
 [-0.27797826 -0.39614944]
 [-0.25516435 -0.35198914]
 [ 0.68227073 -0.05394154]
 [ 0.68227073 -0.05394154]
 [-0.07537261  0.03666604]]
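The coordinates above can be checked for duplicates with a short script. This is only a sketch: the points are copied from the printed output, and exact tuple comparison works here because the values are literals; with freshly computed floats it would be safer to round them first.

```python
# Coordinates copied from the printed PCA output, one tuple per corpus line.
points = [
    (-0.19851936, 0.594503),
    (-0.07537261, 0.03666604),
    (-0.19851936, 0.594503),
    (-0.2836149, -0.40631642),
    (-0.27797826, -0.39614944),
    (-0.25516435, -0.35198914),
    (0.68227073, -0.05394154),
    (0.68227073, -0.05394154),
    (-0.07537261, 0.03666604),
]

# Group corpus line numbers (1-based) by their coordinates.
groups = {}
for row_number, point in enumerate(points, start=1):
    groups.setdefault(point, []).append(row_number)

print(len(groups))  # 6
for point, rows in groups.items():
    print(point, rows)
```

Lines 1 and 3, lines 2 and 9, and lines 7 and 8 share coordinates, so 9 lines collapse into 6 visible points.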

During his postgraduate studies, while researching knowledge graphs and entity alignment, the author used the KMeans clustering algorithm to cluster the text of encyclopedia data sets crawled for four topics. The clustering results are shown in the figure.

In the figure, red marks texts about tourist attractions, green marks texts about protected animals, blue marks texts about people and celebrities, and black marks texts about national geography. The figure shows that the four topics are grouped into four clusters. This is a simple example of text analysis, and I hope readers can apply the knowledge points in this chapter to the texts they study themselves.

7. Summary

The data analysis described in earlier articles was almost entirely based on numbers and matrices, but some data analysis involves text processing, especially Chinese text data. How should such data be handled? After obtaining a Chinese corpus with a web crawler, can we still perform data analysis? The answer is yes.

Unlike the earlier data analysis, however, the text must first go through Chinese word segmentation, data cleaning, feature extraction, the vector space model, weight calculation, and other steps that convert the Chinese data into mathematical vectors. These vectors are the corresponding numerical features, on which data analysis can then be carried out. The explanation in this chapter runs through a custom data set covering three themes (Guizhou, data analysis, and love) and uses the KMeans clustering algorithm as the worked example. I hope readers will study carefully, master the methods of Chinese corpus analysis, and learn how to convert their own Chinese data sets into vector matrices for further analysis.

Finally, I hope readers will reproduce every line of code, because only practice brings improvement. There is also much more to learn about clustering algorithms and their principles, which I hope readers will study on their own. I also recommend learning more about machine learning from the Sklearn official website and open source sites.

The download address of all codes in this series:

Thanks to everyone I have met along the road of learning. May we cherish these encounters and stay true to our original aspirations. This week’s message is an emotional one~

(By: Nazhang House Eastmount 2021-08-06 Night in Wuhan /Eastmount )


