What is database partitioning?
Partitioning is dividing your tables and indexes into smaller pieces, and even subdivide it into even smaller pieces.
Think of it as having several million different magazines of different topics and different years (say 2000-2019) all in one big warehouse (one big table). Partitioning would mean that you would put them organized in different rooms inside that big warehouse. They still belong together inside the one warehouse, but now you group them on a logical level, depending on your database partitioning strategy.
What’s the advantage of using database partitioning?
A good partitioning strategy is feasible for several reasons, for instance better performance, better manageability, higher availability, or for load balancing.
- Better performance: If, in the example of the magazine warehouse from above, you are looking for the October 2009 issue of “Sports Illustrated”, and you have partitioned your warehouse on issued year, you will only have to look into the room or the partition of 2009 to find what your looking for. Now, imagine that inside that one “2009” room inside your warehouse, you’ve put several big boxes, and you put the magazines inside these boxes per theme, or per publication (“Sports Illustrated”, “Pro Photographer”, “Fishing Hero”,…). That would be creating a subpartition, and it would make it even easier to find the magazine your looking for, as you would no longer need to browser through ALL 2009 magazines, but you could go straight to the 2009 room, and straight to the “Sports Illustrated” box. And that is partitioning for performance. Now, if you would put the different magazines inside a cupboard with all the backs of the magazines (with the title and month/year) on the outside, that would be the same as adding an index. It would allow you to quickly go to the correct magazine and pick it out without having to pick out every magazine and read the whole front until you’ve found the one you need. And, if you need data (magazines) from two or more different partitions (rooms, boxes), you can have two or more processes (people) looking for it, while you are still sure they are not looking over the same part twice.
- Better manageability and load balancing: you can store different logical partitions on different physical disks or in some cases even servers. This way, working on one partition/disk/server doesn’t necessarily mean that the other partitions have to be offline. The same goes for maintenance, if you are keeping only the last 10 years of data, and you partition your table per year-range, you can easily just delete (or backup and delete) the oldest partition. To go back to the warehouse comparison, if you have your warehouse divided into rooms per year, and work is done in room 2011, all the other rooms are still accessible. Of if you want to remove all magazines of 2000, you can just take out everything from that one single room, instead of having to browse through each and every magazine of each and every year.
- Higher availability: downtime due to failure or anything does not necessarily have an impact on other partitions. The same goes for backup and recovery, which you can do per partition, independently from the other partitions, reducing downtime.
When should I partition my table?
There are several things to consider in your partitioning strategy:
- Big tables are always a good partitioning candidate, everything that goes over 2Gb is a start.
- If your table holds historical data, and data will be added chronologically. Even more so if you load the data per batch on month or day basis for instance. In this type of partitioning one could leave only the latest partition updateable, and have the other (older) partitions read only.
- If you have the need to distribute your table across different types of storage (or even servers in some cases). For instance, most queries run on the last three year partitions, but you need to keep 10 years worth of data inside your table. The oldest 7 years are rarely queried. This way you could put the newest three partitions on high performance SSD drives, and the oldest 7 partitions on older and cheaper, but slower HD drives.
Why shouldn’t I use an index instead of partitioning?
Well, indexes are good. They’re great even! And they are absolutely an indismissible tool in your database performance toolbox. But indexes are very good at retrieving you a small portion of your data, especially data scattered all over your table. Think of indexes as ultimately performing best if they can deliver you >20% of your data, and most probably the real turning point is somewhere around 1%-5% of the data. But of course, a lot depends on many different factors, such as the server, the setup, the data, the weather (no, not the weather!), …
And most importantly: don’t think in terms of partitioning OR indexing. In a good partitioning strategy, there is still a lot of space and opportunity to use indexes well. For instance, my example above of the magazine warehouse, partitioned per year inside the warehouse, and partitioned per publication inside the rooms. Leaving the opportunity to increase performance even more by making the title/month more visible inside the cupboard instead of box (which is creating an index).
On what can I base my database partitioning strategy?
Each row that you enter in your partitioned table, should unambiguously belong to one specific partition. The partitioning key defines in which partition your row will be stored. This partitioning key can be one single field, or can comprise of a combination of fields, as long as it is ambiguous, so unique.
To go back to the magazines example, you can use “year of publication” as a partition key, but you cannot use “contains pictures of car brand” as a partitioning key and partition per brand, since that would mean that the latest “Top Gear Magazine” would need to be in both the “Ferrari” and the “Aston Martin” partition, and even many more.
What are the different types of partitioning?
- Range partitions
- List partitions
- Hash partitions
- Sub partitions
1. Range Partitioning
Partitioning by a range of values. This means that you partition your table in such a way that a row belongs to a specific partition if the value of the partitioning key lies in the range of that partition.
For instance, the magazines: We divided the warehouse into rooms per year. This means we have range partitioning, where each partition ranges from 01-01-YYYY to 31-12-YYYY. So, if I have a magazine with publishing date May 2015, it will go into the range partition going from 01-01-2015 to 31-12-2015, as May 2015 falls within that range.
Interval Partitioning is a special type of range partitioning, where the ranges in interval partitioning are always time based, and are created automatically as you add new data new data to the table that should belong to a new interval range.
2. List Partitioning
Database partitioning by lists allows you to explicitly control and define which values go in which partition, by defining the partitioning key in the definition of the table and of each partition:
SQL> Create table magazines (Pub_Title varchar(150), Pub_Date (date)) partition by list(Pub_Title) (partition Pub_Title1 values(“Sports Illustrated”, “Mens Health”), partition Pub_Title2 values(“Vanity”, “Flair”), partition Pub_Title3 values(“Yoga Magazine”, “Zen Magazine”));
Or, to work on the magazines example, as you may be able to read from the SQL already: we divide the warehouse into rooms per publication title, but we don’t have enough rooms for each different publication title, so we “list” a few per room. In room one we put “Sports Illustrated” and “Mens Health”, in room two we put “Vanity” and “Flair”, and in room three we put “Yoga Magazine” and “Zen Magazine”. Makes sense, right?
3. Hash Partitioning
In hash partitioning, the row of data is mapped to a partition based on the hashed value of your partitioning key. This can be useful if you want to use partitioning on a table where there is no clear business or logical partitioning of the data, and you require an evenly distributed partitioning.
For instance, you have a whole warehouse filled with magazines, but publication titles and publication dates are at random. No full years of the same and so on. But you want to distribute them evenly over the ten different rooms of your warehouse, so that when you need to retrieve one (or more) of the magazines later on, you can send out ten people at the same time to go looking for them in the ten different rooms. Than hash partitioning comes in handy. You choose for instance the combination of the publication title and the publication date as your partitioning key, and you let your database hash this key, and select the appropriate partition, while evenly distributing all your magazines over all your rooms.
4. Sub Partitioning
Let’s go back to the magazine warehouse. We first divide the whole set of magazines per year in different rooms. This is a first partitioning, based on range partitioning. Next, in each of these year-range-rooms, we put boxes, and in each box, we put all of the magazines of the same publication title. Since in that room, we already only have publications from one specific year, we will only have magazines from one specific publication title and for one specific publication year inside that single box. So this makes it extremely easy and fast to find that “Yoga Magazine”, “November Edition” of “2017”.
Or, in code:
SQL> Create table magazines (Pub_Title(varchar(150)), Pub_Date(date) Partition by rage (Pub_Year(Pub_Date)) Subpartition by list(Pub_Title) ( partition Pub_Title1 values(“Sports Illustrated”, “Mens Health”), partition Pub_Title2 values(“Vanity”, “Flair”), partition Pub_Title3 values(“Yoga Magazine”, “Zen Magazine”) );
As you can see, partitioning is a valuable tool in addition to the many others in your database toolbox, and is very useful to improve your database performance, availability and maintainability.
Which partitioning strategy to choose depends of course on your data, but also on the technical possibilities of the specific database system you are going to implement on.