Today’s article comes from a real question I ran into in an interview.
You have probably heard a lot about splitting data into sub-databases and sub-tables, but do you understand the read diffusion problem?
There are several issues involved here.
What is sub-database and sub-table splitting?
What is the read diffusion problem?
Why do sub-database sub-tables cause read diffusion problems?
How to solve the reading diffusion problem?
Can you please stop calling me Diao Mao in the comments?
Sorry, lost my temper.
These questions are rather interesting.
I believe you will run into them sooner or later too, brothers, hahaha.
Let’s start with the topic of sub-database and sub-table.
Sub-database and sub-table
In project development we usually start with a single data table. Generally speaking, once a table holds around 2kw rows (20 million; 1kw is 10 million), the underlying B+ tree may grow a level taller. Data pages at different levels are generally stored in different places on disk, so a taller tree means more disk IO per lookup and worse query performance. If you have any doubts about this, you can read the article I wrote earlier.
So when a single table has to hold more and more data, we need to consider splitting it into sub-tables. There are two kinds: horizontal sub-tables and vertical sub-tables.
Vertical splitting is relatively simple in principle. Typically some columns are moved into a new table, so each row gets smaller, more rows fit in a single B+ tree data page (16 KB by default), and a single table can hold more data.
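A back-of-the-envelope sketch of why row size matters. It assumes a three-level B+ tree, 16 KB pages, and about 14 bytes per entry in non-leaf pages (an 8-byte key plus a 6-byte page pointer). These figures and the function name are illustrative assumptions, and page headers and fill factor are ignored, so treat the result as a rough order of magnitude only.

```python
PAGE_SIZE = 16 * 1024  # assumed page size in bytes


def estimated_rows(row_size_bytes, levels=3, key_bytes=8, pointer_bytes=6):
    """Rough upper bound on the rows a B+ tree of the given height can hold."""
    # Each non-leaf page stores (key, pointer) entries that point at child pages.
    fanout = PAGE_SIZE // (key_bytes + pointer_bytes)
    # Each leaf page stores whole rows.
    rows_per_leaf = PAGE_SIZE // row_size_bytes
    return fanout ** (levels - 1) * rows_per_leaf


print(estimated_rows(1024))  # 1 KB rows
print(estimated_rows(512))   # rows halved by moving columns to another table
```

Under these assumptions, 1 KB rows give roughly 21.9 million rows, which is in the same range as the 2kw figure, and halving the row size doubles the estimate.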
There is not much more to say about vertical splitting. Below we focus on the most common case, horizontal sub-tables.
There are several ways to split a table horizontally, but whichever you choose, the essence is to turn the original user table into N small tables such as user0, user1, user2 ... userN.
Instead of reading and writing one large user table, we read and write N small tables.
Each small table holds only part of the data. How much is up to you, generally 500w to 2kw rows (5 million to 20 million).
So how do we do the splitting?
Divide the table according to the id range
I think the easiest approach to use is dividing the table by id range.
Assume each sub-table can hold 2kw rows. Then user0 holds the rows whose primary key id is 1 ~ 2kw, user1 holds id 2kw+1 ~ 4kw, user2 holds id 4kw+1 ~ 6kw, and userN holds id 2N kw+1 ~ 2(N+1) kw.
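The range rule above can be sketched in a few lines. The constant and function names are made up for illustration; only the 2kw-per-table rule comes from the text.

```python
ROWS_PER_TABLE = 20_000_000  # 2kw rows per sub-table


def table_for_id(user_id: int) -> str:
    """Map a primary key id to its sub-table under id-range splitting."""
    if user_id < 1:
        raise ValueError("ids are assumed to start at 1")
    # ids 1..2kw -> user0, 2kw+1..4kw -> user1, and so on.
    return f"user{(user_id - 1) // ROWS_PER_TABLE}"


print(table_for_id(1))
print(table_for_id(20_000_001))
```

Subtracting 1 before dividing keeps id 2kw in user0 and pushes id 2kw+1 into user1, which matches the boundaries listed above.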
As for the business code, it only knows that it is reading and writing a user table; it has no idea there are so many small tables underneath.
As for the database, it does not know it has been split; it only knows there are a lot of tables whose names happen to look alike.
This is just sub-tables within one database. Taking it a step further, the sub-tables can be spread across multiple databases. This is the so-called sub-database and sub-table scheme.
Whether the sub-tables live in one database or in several, routing can be done by a piece of middle-layer logic.
It really proves the old saying: there is no problem that cannot be solved by adding a middle layer.
If there is one, add another layer.
The implementation of this middle layer is fairly flexible. It can live inside the business code as a third-party ORM library.
It can also be a proxy service placed between MySQL and the business code.
With a third-party ORM library, a separate code base is needed for each programming language, so many big companies choose the latter and add a proxy, which does not care what language the upstream services use.
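To make the middle layer a little more concrete, here is a minimal sketch of the library flavor. The class `UserTableRouter` and the `by_range` rule are hypothetical names invented for this example; a real ORM plugin or proxy would also parse SQL, manage connections and merge results.

```python
from typing import Callable, Tuple


class UserTableRouter:
    """Hypothetical middle layer: callers see one logical `user` table."""

    def __init__(self, route: Callable[[int], str]):
        self._route = route  # maps a primary key id to a physical table name

    def select_by_id(self, user_id: int) -> Tuple[str, tuple]:
        table = self._route(user_id)
        # Return parameterized SQL; a real proxy would forward it to MySQL.
        return f"select * from {table} where id = %s", (user_id,)


def by_range(user_id: int) -> str:
    # Assumed rule: 2kw rows per sub-table, ids starting at 1.
    return f"user{(user_id - 1) // 20_000_000}"


router = UserTableRouter(by_range)
sql, params = router.select_by_id(20_000_001)
print(sql, params)
```

Because the routing rule is passed in as a function, the business code keeps calling `select_by_id` even if the rule is later swapped for a different one.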
Split the table by id modulo
At this point a brother may want to ask, “I see many schemes take the id modulo. Is your scheme incomplete?”
Modulo schemes are also common.
For example, suppose id=31 comes in and we have 5 tables in total, user0 to user4. 31 % 5 = 1, so the modulo is 1 and we know the user1 table should be read and written.
The advantage, of course, is simplicity, and reads and writes are spread very evenly across the sub-tables.
But the disadvantage is just as obvious. Suppose you want to expand from 5 tables to 8. For the same id=31, 31 % 8 = 7, so you now need to read and write the user7 table, which no longer matches the original user1.
That means data migration has to be considered. Enough to make you go bald.
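A small sketch of the modulo rule and of how much data changes tables when the table count changes. The helper `moved_fraction` is a made-up name used only for this illustration.

```python
def table_for_id(user_id: int, table_count: int) -> str:
    """Map an id to its sub-table under id-modulo splitting."""
    return f"user{user_id % table_count}"


def moved_fraction(ids, old_count: int, new_count: int) -> float:
    """Fraction of ids whose sub-table changes after resizing."""
    ids = list(ids)
    moved = sum(1 for i in ids if i % old_count != i % new_count)
    return moved / len(ids)


print(table_for_id(31, 5))   # before expansion
print(table_for_id(31, 8))   # after expansion
print(moved_fraction(range(1, 41), 5, 8))
```

For ids 1 to 40, only 5 ids keep the same table index when going from 5 tables to 8, so most of the rows would have to be migrated.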
To avoid expansion problems later, I have seen some businesses estimate a very large data volume up front and split into 100 tables from the start. If one table can store 2kw rows, 100 tables can store 2 billion.
Not that this cannot work. But by the time the business was finally abandoned it had stored only a few million rows, and every time you opened the database you saw an endless list of user_xx tables, which adds to the developers’ mental burden.
Dividing the table by id range, as described above, solves these problems nicely. When there is little data there are few tables, the number of tables grows gradually with the data, and tables can keep being added indefinitely.
Does that mean that the modulo method is useless?
Combining the above two methods
The biggest advantage of id modulo is that newly written data is scattered across multiple tables.
With id-range splitting, because ids are incrementing, newly written data generally lands in a single table. If your business writes data frequently, that table becomes a write hot spot.
At this point you can combine the id modulo method with the id range method.
We can introduce modulo within a given id range. For example, the user1 table used to cover 2kw ~ 4kw; now that range can be split into 5 tables, user1-0 through user1-4, and the id modulo 5 picks one of them.
For example, id=3kw falls into the user1 range, and then 3kw % 5 = 0, so we read and write the user1-0 table.
In this way, writes that used to hit a single table are spread across multiple tables.
This advantage is even more obvious with sub-databases. Different databases can be deployed on different machines, so the capacity of every machine gets used.
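The combined rule can be sketched as range first, modulo second. The constants and the function name are illustrative assumptions; the range boundaries follow the earlier rule where user1 starts at id 2kw+1.

```python
ROWS_PER_RANGE = 20_000_000  # 2kw ids per range
TABLES_PER_RANGE = 5         # each range is split into 5 sub-tables


def table_for_id(user_id: int) -> str:
    """Pick the range by id, then the sub-table inside it by id modulo."""
    range_index = (user_id - 1) // ROWS_PER_RANGE
    slot = user_id % TABLES_PER_RANGE
    return f"user{range_index}-{slot}"


print(table_for_id(30_000_000))  # the id=3kw example from the text
print(table_for_id(30_000_001))  # the next id lands in a different sub-table
```

Consecutive ids inside one range rotate through the 5 sub-tables, so a burst of new writes no longer hits a single table.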
Read Diffusion Problem
The table-splitting methods above all use the id column as the basis for splitting. This column is the so-called sharding key.
In practice we generally use the database primary key as the sharding key.
That way, ideally, once we know an id we can quickly locate which sub-table to read, no matter which rule is used.
But in many cases our queries do not look up only by primary key. Suppose my table has a name column with an ordinary (secondary) index on it.
Then I execute the following SQL:
select * from user where name = "Xiaobai";
Since name is not the sharding key, we cannot tell which sub-table to run the SQL against.
Therefore the SQL above is executed against all the sub-tables. Of course it is not executed serially; generally the queries run concurrently.
If I have 100 tables, execute sql 100 times.
If I have 200 tables, execute sql 200 times.
As the number of tables grows, so does the number of executions. This is the so-called read diffusion problem.
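Read diffusion can be simulated with in-memory lists standing in for the sub-tables. Everything here (`build_shards`, `query_by_name`, the sample rows) is a toy setup assumed for illustration, not a real database client.

```python
from concurrent.futures import ThreadPoolExecutor


def build_shards(rows, table_count):
    """Distribute rows into sub-tables by id modulo."""
    shards = [[] for _ in range(table_count)]
    for row in rows:
        shards[row["id"] % table_count].append(row)
    return shards


def query_by_name(shards, name):
    """Query every sub-table concurrently; return (rows, queries issued)."""
    def scan(shard):
        # Stands in for: select * from userN where name = ?
        return [r for r in shard if r["name"] == name]

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(scan, shards))
    matches = [r for part in results for r in part]
    return sorted(matches, key=lambda r: r["id"]), len(shards)


rows = [
    {"id": 1, "name": "Xiaobai"},
    {"id": 2, "name": "Xiaohei"},
    {"id": 4, "name": "Xiaobai"},
]
matches, queries = query_by_name(build_shards(rows, 100), "Xiaobai")
print([r["id"] for r in matches], queries)
```

Only two rows match, yet every one of the 100 sub-tables is queried, because nothing about the name tells the router where to look.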
This is an interesting question. It is indeed a problem, but most businesses do not deal with it. What is wrong with reading 100 times, and what is wrong with more reads as the data grows? The catch is that my business makes no money, so the data never grows that much anyway.
That’s true, but when the interviewer asks you, you have to know what to do.
Introduce a new sharded table
The core of the problem is that the primary key is the sharding key, while the ordinary index column takes no part in sharding.
That is easy to deal with. Let’s build a separate new set of sharded tables. The new table has only two columns, the primary key id of the old table and the ordinary index column, and this time the ordinary index column is the sharding key.
When we want to query by the ordinary index column, we first query this new sharded table to quickly find the matching primary key ids, then use those ids to fetch the data from the old sharded tables. The original aimless query spread across every table is reduced to queries on a few specific tables.
Take an example. In my original table, the id column is the primary key and the sharding key, and the name column has a non-primary-key index. To keep it simple, assume each sub-table holds three rows.
At this point, a query on name has to run against every sub-table to find the matching rows, here the ones with id=1,4,6.
But if we create a new table (nameX) for the name column, with name as the new sharding key, we can first query it by name to get the matching ids.
Then we take the ids from the result and run select * from user where id in (ids);. Even with many tables we can quickly locate a few specific ones, which reduces the number of queries.
However, the disadvantage of this approach is also obvious. You need to maintain two sets of tables, and whenever the ordinary index column is updated, both sets must be changed together.
That takes a certain amount of development work.
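The two-table scheme and its maintenance cost can be sketched with dictionaries standing in for the sub-tables. The class `ShardedUsers`, the table counts, and the use of CRC32 to shard by name are all assumptions made for this example.

```python
import zlib

USER_TABLES = 100  # user sub-tables, sharded by id
NAME_TABLES = 100  # name sub-tables, sharded by name


def name_slot(name: str) -> int:
    # A stable hash; Python's built-in hash() of str varies between runs.
    return zlib.crc32(name.encode("utf-8")) % NAME_TABLES


class ShardedUsers:
    def __init__(self):
        self.user_shards = [dict() for _ in range(USER_TABLES)]  # id -> row
        self.name_shards = [dict() for _ in range(NAME_TABLES)]  # name -> ids

    def insert(self, user_id, name):
        # Every write touches both sets of tables.
        self.user_shards[user_id % USER_TABLES][user_id] = {"id": user_id, "name": name}
        self.name_shards[name_slot(name)].setdefault(name, set()).add(user_id)

    def rename(self, user_id, new_name):
        row = self.user_shards[user_id % USER_TABLES][user_id]
        # Remove the old name -> id mapping, then add the new one.
        self.name_shards[name_slot(row["name"])][row["name"]].discard(user_id)
        row["name"] = new_name
        self.name_shards[name_slot(new_name)].setdefault(new_name, set()).add(user_id)

    def find_by_name(self, name):
        """Return (rows, number of sub-tables queried)."""
        ids = self.name_shards[name_slot(name)].get(name, set())
        touched = {i % USER_TABLES for i in ids}
        rows = [self.user_shards[i % USER_TABLES][i] for i in sorted(ids)]
        return rows, 1 + len(touched)


db = ShardedUsers()
for uid, uname in [(1, "Xiaobai"), (2, "Xiaohei"), (4, "Xiaobai"), (6, "Xiaobai")]:
    db.insert(uid, uname)
rows, queries = db.find_by_name("Xiaobai")
print([r["id"] for r in rows], queries)
```

The lookup costs one query on the name table plus one per user sub-table that actually holds a match, instead of one per sub-table. The price is visible in `insert` and `rename`, which must keep both sets of tables consistent; a real system would also have to handle a failure between the two writes.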
Is there an easier solution?
Use other more suitable storage
Our regular query fetches the name column by the id primary key. In the scheme above, the new table lets us first find the id by name and then fetch the full row by id. This is really like building a new index, and the idea of finding the original data through the name column is very similar to an inverted index.
In other words, we are using the idea of an inverted index to solve the query problem under sub-tables.
Looking back, our original requirement is nothing more than supporting queries on ordinary index columns, or on even more dimensions, when the data volume is large.
For that, es (Elasticsearch) is a better fit. es shards data natively and uses inverted indexes internally to speed up queries.
Oh? Brothers, here it is again, the inverted index. A very small detail, take notes.
For example, say I have rows with id, name and age. In MySQL you shard by id. To support queries on name and age without read diffusion, you have to build one sharded table for name and another for age.
With es, the data is sharded internally by id, and it also builds an inverted index from name to id and another from age to id. Isn’t that the same as what we did above?
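The name-to-id and age-to-id idea can be shown with plain dictionaries. This is a toy illustration of the inverted index concept, not a description of how es stores its indexes; the function names are made up.

```python
from collections import defaultdict


def build_inverted_index(rows, field):
    """Map each value of `field` to the set of ids that have it."""
    index = defaultdict(set)
    for row in rows:
        index[row[field]].add(row["id"])
    return index


def search(indexes, **conditions):
    """Return sorted ids matching every field=value condition."""
    id_sets = [indexes[f].get(v, set()) for f, v in conditions.items()]
    return sorted(set.intersection(*id_sets)) if id_sets else []


rows = [
    {"id": 1, "name": "Xiaobai", "age": 18},
    {"id": 4, "name": "Xiaobai", "age": 30},
    {"id": 6, "name": "Xiaobai", "age": 18},
    {"id": 7, "name": "Xiaohei", "age": 18},
]
indexes = {f: build_inverted_index(rows, f) for f in ("name", "age")}
print(search(indexes, name="Xiaobai"))
print(search(indexes, name="Xiaobai", age=18))
```

Each extra query dimension is just one more value-to-ids map, and combining conditions is a set intersection over the ids.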
Moreover, connecting MySQL to es is also very simple. We can use open-source tools to listen to MySQL’s binlog changes, then parse the data and write it to es, so that es can provide near real-time query capabilities.
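The sync step boils down to replaying row changes, in order, into the search index. In the sketch below the event shape (`type` plus `row`) is invented for illustration, and a dict stands in for the es index; real binlog tools emit their own formats and real writes go through an es client.

```python
def apply_change(index, event):
    """Apply one parsed row-change event to the search index."""
    kind = event["type"]
    row = event["row"]
    if kind in ("insert", "update"):
        # Upsert the full row image under its primary key.
        index[row["id"]] = dict(row)
    elif kind == "delete":
        index.pop(row["id"], None)
    else:
        raise ValueError(f"unknown event type: {kind}")


# Hypothetical stream of changes, in the order MySQL committed them.
events = [
    {"type": "insert", "row": {"id": 1, "name": "Xiaobai", "age": 18}},
    {"type": "insert", "row": {"id": 2, "name": "Xiaohei", "age": 20}},
    {"type": "update", "row": {"id": 1, "name": "Xiaobai", "age": 19}},
    {"type": "delete", "row": {"id": 2}},
]

search_index = {}
for event in events:
    apply_change(search_index, event)
print(search_index)
```

The index lags MySQL by however long it takes an event to be read, parsed and applied, which is why the result is near real-time rather than real-time.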
Think es plus MySQL is still cumbersome? Is there a more concise solution?
Don’t use MySQL, use TiDB instead. I believe everyone has heard the name; it is a distributed database.
It shards tables by introducing the concept of a Range. For example, the first shard holds ids 0 ~ 2kw and the second holds ids 2kw ~ 4kw.
Oh? Sound familiar? Isn’t this the id-range table splitting from the beginning of the article?
It supports ordinary indexes, and the ordinary indexes are sharded too, which is similar to the inverted index scheme above.
Another tiny detail.
The syntax of TiDB is almost the same as MySQL’s, and there are plenty of ready-made tools to help migrate data from MySQL to TiDB, so the development cost is not high.
Summary
- When a single table holds too much data, MySQL query performance deteriorates, so once the data volume becomes huge, horizontal table splitting needs to be considered.
- Horizontal splitting needs a sharding key, usually the primary key. The table is then split by id modulo or by id range.
- After MySQL tables are split horizontally, queries on non-sharding-key fields suffer from read diffusion. You can build a new sharded table with the ordinary index column as its sharding key, query it first to get the id, then go back to the original table by id. This essentially borrows the idea of the inverted index.
- To support queries on more dimensions, you can listen to MySQL’s binlog and write the data to es, which provides near real-time query capabilities.
- Of course, replacing MySQL with TiDB is also an option. TiDB really is a good thing. Many big companies reskin it, stick on a label, and call it their own self-developed database. I highly recommend learning it.
- Do not optimize prematurely. Do not split into 100 tables for no reason; in many cases they are never needed.