Relational Database Partitions
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
In a relational database, a partition divides logical database elements into separate physical parts. You typically partition a database to improve manageability, performance or availability. You can choose to partition your tables using either vertical or horizontal partitioning depending on the nature of the data and the type of performance improvement you need.
Need of Partitions
The following is a brief list of the primary advantages of using partitions:
Loading large data volumes: It has been a challenge for traditional method of loading warehouse particularly when there are very large volumes of extracted data. In partitions, the data in warehouse is organized based on partitioned column, such as transaction date. One day of transaction load will be loaded to one partition would minimize the total load time. The loading time will further be reduced if partition switching option is adopted, which would further optimize the loading time.
Archiving: Archiving is often used with time-based entries on fact tables. Archiving keeps the size of a table manageable by retaining detailed data only for the necessary time period. For example, if you need daily data for only two months, you could rows for previous months from the daily detail table. Deleting old data from a fact table by using traditional methods is slow, because the deletions are logged. By using partitions, you can simply remove old partitions from the logical table, possibly switching them to an archive table. Removing a partition from one table, or adding the partition to another table is a very fast operation.
Data Access Queries: Fast performance for data access queries is the critical to a data warehouse system. Achieving fast query performance is challenging when the data warehouse is large. Partitioning facilitates fast queries in two ways: first, partitions that are unrelated to the query can be ignored; second, multiple partitions can be queried in parallel.
Seamless access to distributed sources: When the data is distributed across different servers, partitioned views allow you to present it as if it were a single table. As with local partitions, the query optimizer is able to avoid querying source partitions that are not used in a query, but this optimization is even more critical with distributed data sources.
Most partitioning can be described as either horizontal or vertical.
Horizontal partitioning divides a logical table into multiple physical tables called partition members. Each table maintains an identical schema, but contains fewer rows. For example, a table containing one year of data can be partitioned to 12 partition member tables, each with 1 month of data. As another example, a table containing employee records can be split into two partition member tables, one with US employees and one with non-US employees. In a horizontally partitioned table, no two partition member tables contain the same data: there is no overlap between the partitions. A logical union of the partition member tables recreates the original logical table.
The partition commands in SQL Server all facilitate various types of horizontal partitioning, and can apply to various objects. You can partition a table or an index, or you can create a partitioned view that joins data from multiple sources into one logical table.
Vertical partitioning divides a logical table into multiple physical tables that contain fewer columns. The two types of vertical partitioning are normalization and row splitting. Normalization is the standard database process of removing redundant columns from a table and putting them in secondary tables to reduce update anomalies and redundancy and linked to the primary table by primary key and foreign key relationships.
Row splitting is a technique for grouping columns together based on their business meaning. With row splitting, you divide the original table into tables with the same number of rows, but fewer columns. Each logical row in a split table matches the same logical row in the others.
Like horizontal partitioning, vertical partitioning lets queries scan less data. This can query performance, provided that the query does not require columns from all the vertical partitions. As a simplistic example, table that contains four attributes for the key, of which only the first two are typically used in queries may benefit from splitting the infrequently used attribute columns into a separate table. This is, naturally, much more important when there are hundreds of attributes for a key. If most queries use columns from most or all of the partitions, then the overhead of adding a join more than offsets any benefit from splitting the rows.
This section details the implementation guidelines for horizontal partitioning. Vertical partitioning is nothing but separate table structures which would be accessed using joins as they both contain same cardinality but different columns combined with a key. There are no separate guidelines required for implementing vertical partitions.
There are three different approaches for implementing horizontal partitions, they are
A Partitioned View is also known as an Updateable Partitioned View or a Distributed Partitioned View. You create a view to logically union identically structured tables. You typically use a partitioned view in distribution environment where multiple servers are connected using linked servers. Rather than create a Partitioned View for a local database, you should create a Partitioned Table.
Steps to create Partitioned views
Creating a Partitioned View entails creating tables with identical schemas, and creating a view to union them together.
o Create tables with identical schemas on distributed servers. Each should have a column that can be used as a basis for logically dividing the data—a Partition column.
o Add a Check constraint to each table enforce the logical distinction between the partition tables—with no overlapping and no gap between the tables
o Add all the distributed servers as linked servers on the server where you will create the Partitioned view.
o Create a Partitioned views to union all the partitioned tables
Partition member table creation script (example)
CREATE TABLE T1 (F1 INT,
F2 INT CHECK (F2 = 1998))
Partition view creation script (example):
CREATE VIEW PartView AS
SELECT F1, F2 FROM T1
SELECT F1, F2 FROM Server.DB2.T2
Once you create the view, users can access the underlying data transparently, without knowing its physical location. You can make a Partitioned view writeable (updatable), but there are restrictions. For example, a single update command must affect only a single source table.
The concept of a partitioned table involves splitting a logical table into several physical units called partitions or partition members. The query engine can access each partition member independently of the others, limiting the scope of I/O intensive activities such as queries, data loads, and maintenance tasks, particularly those that require table level locks. The most common method of splitting data is horizontal partitioning, in which rows of a table matching mutually exclusive criteria (such as, range of dates or letters in alphabet, for date time and character data, respectively) are placed in designated partitions.
Steps to create Partitioned tables
The process of partitioning involves four main steps.
o The first, preliminary step consists of creating file groups and their corresponding files. During the second step, you define a partition function using the CREATE
o PARTITION FUNCTION T-SQL statement, with ranges of values (set by the FOR VALUES clause) that determine the beginning and end of each partition (and subsequently, their number)
CREATE PARTITION FUNCTION PartFunction (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);
o As a third step, create partition scheme using the CREATE PARTITION SCHEME T-SQL statement, which associates partition function to file groups, determining physical layout of data. The association between partitions and file groups is typically one to one, although other arrangements are also possible (including specifying extra unassigned partitions for future growth)
CREATE PARTITION SCHEME PartSchema AS PARTITION PartFunction
TO (FileGroup1, FileGroup2, FileGroup3, FileGroup4);
As a fourth step, create a partitioned table or index using CREATE TABLE or CREATE INDEX statements. The ON clause within these statements identifies the target partition scheme and partitioning column, whose values are compared against the definition of the partitioning function.
CREATE TABLE PartTable (F1 INT, F2 INT)
ON PartSchema (F2)
Indexes on each partition table will naturally be smaller than a corresponding index on the non partitioned table, which means fewer levels to traverse and fewer pages to scan, thereby boosting performance.
In addition, you can explicitly partition an index, even on an unpartitioned table. Secondary indexes can be set up completely separately from primary indexes. The syntax for creation is the same. When the indexes and partitions are within the same file group, the indexes are aligned. Alignment provides several advantages; most importantly, it provides a means for simplifying data backup. Query performance is better in aligned index systems, because the I/O aspects of query processing are increased.
The following example create partitioned index on table PartTable1 column F1, using partition schema partitioned on F2. Creating Partitioned schema, partition function is explained in the example given for creating Partitioned table
CREATE NONCLUSTERED INDEX IX_PartTable1_ReferenceF2 ON PartTable F1) ON PartSchema (F2);
Traditional Approach with no Partitions
Generally depending on the level of expertise there are various ways developers can design tables with no partitions
• Crude way to designing table is (1 FileGroup; 1 File; on only 1 Disk)
• Single table with multiple file groups, multiple files on multiple disks. To check for a record the query optimizer scans all the files groups as to check all the discs where the files are locate because upfront it doesn’t understand where the required data is stored.
Vertical Partition Guidelines
These guidelines relate to creating vertical partitions by splitting rows. Creating vertical partitions by normalizing tables is a widely understood practice and is not described in detail here.
Vertical partitioning moves some of the columns of a logical table into a separate physical table. None of the tables includes all the columns for a given row. There is no limit to the number of vertical partitions you can create for a logical table. The cardinality of the vertically partitioned tables needs to be same and share same Primary key to uniquely identify rows across. Vertical partitions are physical tables, so they can be distributed to different servers to take advantage of any performance gain available from parallelism.
Consider creating vertical partitions for a table with a large number of columns
Vertical partitioning is particularly useful if the number of columns of a table exceeds the limit that SQL Server supports. This often happens with employee or customer demographic information. When you partition because of a large number of columns, you may need to maintain separate queries for the different partitions, because of the limit in the number of columns in a view or query. Vertical partitioned tables do not help in warehouse structure as warehouse structure is denormalized in nature.
Consider creating vertical partitions when several of the columns are rarely used in queries
When a specific subset of the columns in a table is rarely used, consider moving the columns to a different physical table. Since there will be fewer columns left in the main table, more records can fit into a data page. As a result, queries that use only the frequently-used columns can be much faster. A reporting query that needs the infrequently used columns can still access them by joining the two tables together.
Consider creating vertical partitions when some of the columns are very large
Tables that contain very large text fields or binary data such as images are often very large. You may want to put very large columns into a separate vertical partition. This enhances performance for queries that don’t require the large data column, and also allow the database to manage the table space more effectively than storing very large columns together with small columns.
Be aware of the overhead of vertical partitions
Vertical partitioning can cause worse performance when loading data. Breaking a logical table into separate physical tables, means that each insert or delete operation requires multiple statements. This also increases application complexity, because you cannot execute a delete or update statements on a view that joins multiple tables.
Vertical partitioning requires consistent keys between the partitions of a single logical table. You can use the following methods to enforce that consistency:
Primary Key: Have same primary key on all the partitioned tables
Client Application Code: Client applications need to \insert and delete operations uniformly in all the partitions of a table. Update operations should apply only to the physical partitions that contain the relevant columns.
Cascade Operations: You can establish foreign key constraints between the partitions, so that deleting a record in one table will automatically delete the corresponding records in the other tables.
Foreign Key Constraints: All the partition tables should be tied together with foreign key constraints
Triggers: You can enforce consistent inserts and delete operations across all partitions by using triggers
Vertical partitioning can cause slower reporting queries. If you do need columns from each of the physical partition tables, you must add a join, which would be slower than retrieving the data from a single table.
Partitioned views guidelines
Keep frequently accessed data in one partition member
One of the major advantages of Partitioned views is the partitioned tables can be on any server, distributed to take advantage of parallelism, local disk I/O and processing power before merging the results using a partitioned view. If some of the data—for example, the current month—is frequently accessed, you can store that data one partitioned member, and move less frequently accessed data—such as prior month history—in a different partition member on a different server. This could help the performance for most of the queries as the cardinality would be reduced, but query results would still be extracted from a single table.
Keep locally accessed data in local partitions
Because you can store partitioned data on different servers, you can distribute the servers geographically to meet the needs of the local users. For example, if you put North America data in one partition, South America in another, Europe data in another, and Asia in a fourth, you could put each partition on a server within the region. If most queries are local, most queries will run exclusively off the local server. Enterprise-wide queries can still access all the partitions as needed.
Keep check constraints in a trusted state for tables in a partitioned view
Partition Views works on trusted constraints. A constraint becomes untrusted if you perform actions such as bulk-insert without a CHECK_CONSTRAINTS argument. Once a constraint is untrusted, the query processor has to scan all the base tables, as it has no way of verifying which base table contains which data. When using a partitioned view, be sure that the
Use partitioned column of partition views for optimizer to choose appropriate partition when queried
• When SELECT statements referencing the partitioned view specify a search condition on the partition column, the query optimizer uses the CHECK constraint definitions to determine which member table contains the rows and so the fetching is faster
• If the queries have a filtering key that corresponds to the partitioning key, response time will be faster. This is because the use of partitioning encourages the use of parallel operations and the partitioning key in the query predicate makes pruning data easier.
• If you are trying to improve performance and manageability for large subsets of data and there are defined access patterns, use the range partitioning mechanism. If your data has logical groupings – and you often work with only a few of these defined subsets at a time then define your ranges so the queries are isolated to work with only the appropriate data
Partitioned view member tables should have same schema, same partition column, non overlapped partition ranges and check constraints
• All member tables in the partitioned view should have same schema (columns, data type, size, collation), and a primary key defined on the same set of columns. Only the Check constraint should be different.
• The ranges defined in the check constraints on the member tables should not have gaps or overlap. The query manager could not determine where the data of the overlapped region exists, and this would defeat the purpose of the partitioned views
Use INSTEAD OF Triggers to update partitioned views
If you need to create partitioned views that don't meet the requirements for updates—for example, if the partitioning column isn't part of the underlying table's primary key or the view doesn't include all the base tables' columns—you can create INSTEAD OF triggers to modify the view. However, the query optimizer's plans for views with INSTEAD OF triggers might be less efficient than its plans for updateable views because some of the new optimizing techniques depend on the rules that make the views updateable
• To perform updates on a partitioned view, the partitioning column must be a part of the primary key of the base table. If a view is not updatable, you can create an INSTEAD OF trigger on the view that allows updates
• Check constraints will help query manager look for the specific member table instead of all the member tables defined in the partitioned view. Therefore, create check constraints on the member tables participating in the partitioned view and queries on partitioned view
• The partitioned column cannot be auto generated or computed such as identity, time stamp, and it should be part of the primary key.
• Limit each partition view to one constraint. More than one constraint on the partitioned column will eventually make query manager unable to recognize the view as a partitioned view
Choosing Partitioned views vs. Partition Table
• In most cases, if the data is stored locally, use a partitioned table rather than a partitioned view. If the data is distributed among remote servers, however, use a partitioned view.
• Queries that use partitioned tables will compile much faster than equivalent queries that use partitioned views. Query compilation time is proportional to the number of partitions when using partitioned views, while compilation is not affected by partition count when using partitioned tables
Use member tables for bulk load instead of using corresponding partitions views
• Bulk load data directly into the member tables rather than through the partitioned view
• When loading into a partitioned view or a partitioned table, load the data into separate staging tables—one for each partition.
• Add the loaded staging table to another new table having check constraint and modify partitioned view to have the new table be considered in partitioned view (make sure partition view requirements are followed such as no overlapping key, schema requirements
If you are using linked server for a member table in a partitioned view use sp_serveroption to defer schema validation
Consider setting the lazy schema validation for each linked server definition option by using the sp_serveroption stored procedure. This optimizes performance by making sure the query processor does not request metadata for any one of the linked tables until data is actually needed from that remote member table
Consider checking the availability of distributed servers
Make sure all table partitions spread across different servers are accessible all the time, or else you will not be able to use the partitioned view. Unavailability of a single server that partitioned view accesses the partitioned data from will make the partitioned view unusable until all the servers are accessible.
Apply consistent security at each individual server in distributed environment
Since you have your data partitioned across multiple database servers, you have to follow consistent security practices across all your servers. Typically these servers are linked from the database server where the partitioned view is created.
Create proper ranges across partitions
The ranges need to reflect data access patterns. The goal is to load balance distribution of queries by placing each of the ranges in a separate partition residing on a separate SQL server. Each partition takes the form of a table stored in a database residing on a separate server. If ranges are chosen properly, then for the majority of queries, it is sufficient to access a single partition only. For example, a table can contain data that is somehow related to a geographical location.
Place the data in correct partition for optimal usage
It is essential to determine optimal placement of data, such that related data is stored together. One way to accomplish this is to determine all foreign keys referencing rows in a local partition and store them on the same server. Another method can be used if you need to partition already existing set of tables. In such case, you can base your decision on observing and analyzing current data access patterns. Tables that do not qualify as good candidates for partitioning can be copied across federated servers and synchronized through the use of triggers or replication. Design partitions so that most SQL statements can be routed to one member server.
Partition Index Guidelines
Partitioning Clustered Indexes
When partitioning a clustered index, the clustering key must contain the partitioning column. If the clustered index is unique, you must explicitly specify that the clustered index key contain the partitioning column.
Partitioning Non Clustered Indexes
When partitioning a unique non clustered index, the index key must contain the partitioning column. When partitioning a non unique, non clustered index, SQL Server adds the partitioning column by default as a non key (included) column of the index to make sure the index is aligned with the base table. SQL Server does not add the partitioning column to the index if it is already present in the index.
Ensure the partitioning index are dropped or disabled prior loading the data
When data is loaded a partitioned table, it is better to drop the partitioned index before loading to get performance advantage and eliminate load time index rebuilt. Create the partitioned index on the partitioned table once the load it finished. You do not need to drop and recreate the index for lower order records transactions. This technique is particularly used for bulk loading.
Adding the data with indexes may cause index sorting and index be rebuilt
Having indexes when you load the data has a performance issue because the records are sorted
o Add data to partition and then create the index
o Ensure the partitioning index are dropped or disabled prior loading the data
o After loading the data, rebuild the indexes
o Rebuild and reorganize indexes on individual partitions, thereby facilitating better management of partitioned indexes
o It is a good practice to create partition of the tables and then create index over it to have it aligned with the data
o Partition Index alignment with partition tables is also important when SQL performs sorting based on partitioned index column. If it is aligned, it would take less memory as it does it by chunks (partitions) or else it would load and sort entirely. This process may get slower in memory constrained machines and happens only when non alignment case.
If a huge table cannot be partitioned, create a partitioned index
If the table already exists and contains large volume of data, moving it to different partitions is time consuming. If you can’t allow any downtime while moving the data, you can consider creating a partition index which would minimize the index look up effort (less index pages to be scanned) for finding the required record out of all data pages.
Use the same partition function for the table as well as for the index
Partition both the table and its indexes using the same partition function. When both the indexes and the table use the same partitioning function and the same partitioning columns (in the same order), the table and index are said to be aligned
• To achieve alignment two partitioned tables or indexes must have some correspondence between their respective partitions. They must use "equivalent" partition functions and some relation with respect to the partitioning columns. Two partition functions can be used to align data when:
o Both partition functions use the same number of arguments and partitions
o The partitioning key used in each function is of equal type (includes length, precision and scale if applicable, and collation if applicable)
o The boundary values of partition ranges are equivalent
NOTE: Even when two partition functions are designed to align data, you may end up with an unaligned index if it is not partitioned on the same column as the partitioned table.
• When the tables and indexes are defined with not only the same partition function but also the same partition scheme then they are said to be storage aligned. In general, having the tables and indexes on the same file or file group can often benefit where multiple CPUs can parallelize an operation across partitions, it is because one processor can deal with finding the data from the index from same file group. In the case of storage alignment and multiple CPUs, SQL Server can have each processor work directly on a specific file or file group – and know that there are no conflicts in data access as they typically exist on the same disk. This allows more processes to run in parallel without interruption and so the better performance.
• Indexes created on a partitioned table will also use the same partitioning scheme and partitioning column. When this is true, the index are said to be aligned with the table.
• If you are creating an index on a partitioned table, and do not specify a file group on which to place the index, the index is partitioned in the same manner as the underlying table. This is because indexes, by default, are placed on the same file groups as their underlying tables, and for a partitioned table in the same partition scheme that uses the same partitioning columns
• If you are creating a non-unique index then it will add the partition column to index keys so it can go to corresponding partition directly. When partitioning a non-unique, clustered index, the Database Engine by default adds any partitioning columns to the list of clustered index keys, if not already specified
Partition table guidelines:
Guidelines to determine if the partition is needed
o Not every table benefits from partitioning. Consider partitioning first for very large, frequently accessed tables.
o If queries typically extract from a predictable subset of the table, it may be a good candidate for partitioning.
o Before designing partitions, make sure that the table and index are properly designed and optimized.
o If possible, move old data to an archive to reduce the size of the table. This may eliminate the need to create partitions.
o Determine which tables are slow to load. This will typically be very large tables that have very large incremental loads
o If the Warehouse system cannot allow for any down-time or performance degradation while loading, partitions may be a good option.
o As a general rule, don’t consider tables smaller than two gigabytes for partitioning
o The most common tables for partitioning are fact tables. However, if a dimension table has more than 10 million members, you may benefit from partitioning it.
o Tables containing historical data—in which new data is added into the newest partition—are good candidates for partitioning. A typical example is a historical table where only the current month's data is updatable and previous months are read only
o Tables that contain a rolling time period—where an old period is dropped each time a new period is added—are good candidates for partitioning. In this case, executing a delete on a large table is prohibitively expensive, but dropping a partition is very efficient, particularly when the table is very large.
o Tables that are too large to effectively back-up are good candidates for partitioning.
o When dropping and rebuilding an index over a whole table is too slow, you may want to add partitions. Dropping and rebuilding indexes can be effective when working with bulk load processes.
o When query performance degrades because the index is too large, you may want to add partitions.
o If the existing table has already huge size of data and cannot be moved to different table cause of its size then consider partitioning indexes instead of table
Guidelines for creating file groups
o Think about hardware that would support placing file groups on different disks, though it is not necessary to put file groups on separate disks
o Put the most frequently used partitions on the best performing disk drives so that mostly used queries can run faster than seldom run queries
o One more reason of keeping partitions on different file groups would facilitate easy backups
o The number of file groups may be limited by hardware resources, so consider plans for scaling up or scaling out in the future to place each file group on a different disk for getting parallel disk IO advantage
o To simplify alignment between data files and indexes, keep the data file and index for each partition on the same file group.
o Partition scheme and partition function should have the same number of partitions.
Guidelines for Choosing a Partition Column
o Partition column should be deterministic
o The partitioned column value should not change with updates; select a persistent column
o Limitations on data type of partitioned column
Guidelines for choosing Partition ranges
o The column used for partitioning should be able to be broken into ranges. Additionally, the range of values ultimately determines how many partitions your table will support.
o Though there is a limit of 1000 partitions supported per table, having those many partitions would create over head and may induce performance drop for the queries. The partition column should be able to broken into range for creating partitions. Range need not be uniform for all the partitions.
o Make sure that partition ranges do not overlap
o Keep NULL values in one partition range preferably in the first partition range
o Do not mix NULL and non-Null values in the same partition range
o The maximum number of partitions in SQL Server is 1000, so be sure that there are fewer than 1000 partition ranges, particularly taking into consideration future growth potential.
o Analyze query patterns before deciding the partition granularity: how many queries take place within a week, month, or day boundary
o The partition ranges should be approximately equal size
o Separate mostly frequently used partitions from the other partitions by using the partition range. For example, if data for the current month is used more frequently than data for other months then do not combine the different months of data in the same partition; a good candidate for the partition column is Month
Use an explicit partition, not just separate file sets
Putting a table onto multiple file groups technically does physically partition the data, but it does not make the information available to the query optimizer. An explicit partition has all the benefits of putting a table into multiple file groups, with the added benefit of informing the query optimizer of the partition. The optimizer can then skip any physical partitions that are not needed for a specific query.
Consider using a commonly used join column as the partition column
If you frequently run queries that involve a join that involves the same column of a table you want to partition, you should use the column from the join as the partition column. As with any search, any partitions that are not needed for the join do not need to be searched.
Ensure your partitions are useful for underlying queries
Use Query Analyzer to verify that partitioning queries appropriate tables
• Check the query plan to see if it uses appropriate tables
• SET STATISTICS PROFILE ON command, run the query and invest
When determining your partition column, consider the total plan.
The number of subsets provides the realm of possibilities for long-term partition maintenance. One of the most common partitioning columns is a date column. It provides the most natural means for dividing table data. Determining the partition column would decide the fate of the performance of the queries running the partition table. Typically date is used as partition column to store each day of the day in a partition table, this would enable for easy partition switch when the end of the day synch is loaded to warehouse without affecting the existing system but extending the table partitions. When the daily data partitions are archived, it would be easy for the partitioned archive table to take multiple days of data into a single partition of archive tables.
Consider creating a file group for each partition
It is suggested to create file group for each partition defined for better performance and can be separated physically to multiple drives to take advantage of parallel read by hardware. If file groups are not used (ex: One file group for all the partitions), then though the concurrent read is possible (if multiple processors exist), disk parallel reading is not achieved unless the partitioned are kept in different physical disks.
Increase availability by blocking partitioning that only loaded
One of the major advantages of partitions is faster loading. In general when the data is loaded, the lock is escalated at table level based on the data volume inserts/updates. This would make the warehouse unavailable and could cause significant downtime for end users. In partitions the partitioned member can only participate in locking instead of entire table to reduce the downtime. It would facilitate to maintain high availability of existing data at the same time for faster loading of new data. Some of the guidelines on loading scenario on partitioned tables are as follows
• Results in lower availability: Block all other users from accessing the table (partition member) during data loading
• Have a Planned maintenance window for incrementally loading data
• Use partition switching if the data is loaded aligned with partition schema. Example: Table is partitioned on date. If the loading data is once per day then partition switching approach would greatly benefit particularly if the data volume is huge
Confine each partition to a separate file group to simplify back-up and restore
• SQL Server only allows piecemeal (i.e., incremental) restores It would provide the advantage of piecemeal backup and restore operations in SQL server to offer more flexibility to take backup of only required data instead of complete table. This particular file group can be made read only and perform bulk load particularly important for larger size loads.
• Mark the file group as read-only mode for piecemeal backup to work in a Simple Recovery model.
• For Bulk-Logged Recovery model or the Full Recovery model, it is necessary to back up the transaction log(s). Doing so is essential to restore the file group successfully.
Consider keeping Null values of Partitioned column in a single partition.
• If Null is a valid value for a partition, set the first (LEFT) boundary position as the lowest non-null value possible. For example, assuming a column that can contain only positive integers
CREATE PARTITION FUNCTION EmpPartition (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);
Here the column on which EmpPartition is based allows NULL values, so LEFT is used in the partition function and the lowest non null value expected for the partition column is 1. This ensures that the left most partition does not contain any non null value for the partition column assuming there is no negative numbers for the partition column. Putting all the Null values into a single partition allows the query optimizer to manage Nulls efficiently.
Guidelines for implementing a Sliding Window Scenario
Sliding window scenario is much useful when you want to add new data to a partitioned table and remove old data from the same partitioned table on a regular basis. This operation can involve large or small amounts of data in a variety of scenarios. If new data you are adding has to be loaded, scrubbed, or transformed, it can be treated as a separate entity before adding it as a partition. Regardless of how large or small the collection, the transfer is quick and efficient because, unlike an INSERT INTO SELECT FROM statement, the data is not physically moved. Only the metadata about where it is stored changes from one partition to another.
• The sliding window (or rolling window) scenario works best when data movement across partitions is minimized
• The sliding window scenario works only when the target table or partition is empty
• This requires a schema lock on the partitioned table.
• Partitions can be switched in and out when no other process has acquired a table-level lock on the partitioned table.
• Consider using Check Constraints on the archive table to ensure it holds only the specified data range
• The switch partitions will wait until the lock has been released by other process
• When partitioning is used and if you want to use the partition switch trick we have to know in which file group the partition resides and the id of that partition. In order to make the partitioning switch work, in fact, we have to create the two support table in the same file group where the partition of the main table we have to change resides
• Note also that any indexes that are not aligned with their respective tables must be dropped or disabled prior to the switch
• The sliding window scenario works best when data movement across partitions is minimized. The rows are moved between partitions by deleting the rows at the original position and by inserting the rows at the new position. The partitions involved in this switch are inaccessible during this period.
• The basic steps that include in sliding window scenario is to create an empty partition on the receiving partition table
Isolate the data to one partition of source partition table by splitting/merging the partitions accordingly
Move the data from source table to destination using a partition switch statement (which modifies the meta data, not exactly moving the entire content)
Remove the source partition which is moved to destination
Consider loading incoming data to an empty table in case of huge volume
Load data to an empty table and then merge with the partition table incase of huge data volume. There are ways to speed up the bulk load; however, these can directly affect all other users thereby sacrificing concurrency for speed. The partitioned approach entails the bulk loading of data into individual empty tables, each of which represents a determined partition range. These loaded tables are then added to the partitioned view or switched in to the partitioned table as a new partition.
Optimize the usage of processors
It is a general concern whether to partition evenly by size or by usage. Consider there are 10 evenly sized partitions with one per year. In such a case there is only one partition (for current year) where 90% of the queries will hit. The disadvantage here there is no partition on the current data so most of the queries will be queried on single partition. It will give parallel query execution except that it will eliminate the older data.
Consider there are 10 evenly partitioned tabled based on usages. In this case, the query result is extracted uniformly across all the partitions but the data searches on each partition are more than in the first scenario. Here the advantage if parallel execution for query but the data inside one partition is huge.
The idea scenario would be, have the seldom used data into one partition and partition the current and frequently queried data partitioned into multiple partitions. This way it would concurrently extract data as well as the partition size is minimized.
Move seldom used rows to other partitions
The general motive behind horizontal partitioning is to move seldom-used data into other partition
Consider moving the old or infrequently used data into a different partition so that there is not much data for the current partition that query optimizer can work on. This would give performance advantage as the data it queries on is less when compared to unpartitioned table. The other motives are time scaling.
Consider maintaining Current and Archive set for the warehouse data
Keep archiving the current data and keep it in different set of tables. In data warehouse, these tables aggregated by month or by week and stored in different set of tables leaving behind the current tables intact. Having this scenario help queries to run faster as most of the analysis queries would be on the current data. The duration of current data window is decided by the business requirements, such as business trend id closely observed for last 2 weeks, it means last 2 weeks is to be more granular and would act as a current set of tables. As the new data enters to new week, the previous week data is aggregated and keep it in a different data store. The history data is also queried to get business requirement but drill down to more granular is not available. This is acceptable as it is not practically feasible to have data maintained granular for large periods of time. Current set of tables as well history tables also partitioned but using different partition schema.
Consider placing partitions in different file groups
If you need to, you can put each partition table on a different file group and partition the data among multiple disks/RAID volumes/drive controllers.
Each partition can be created in a different set of file group. These file groups can be stored on different discs particularly to get parallel disk IO advantage. If these partitions are supported by hardware RAID, then it would be much more performance advantage cause of redundancy and fault tolerance. This is particularly useful for VLDB scenarios.
Consider creating file groups even in single disk environment. Though the file groups are typically stored in different set of disks but it is not a requirement. In case of single disk availability, the partitions can be created on a single disc, there is still advantage of unnecessary partition tables are eliminated by query optimizer and so there is a performance gain but disk parallel IO is not possible with single disk system.
Guidelines to improve Query Performance
o Have queries filtered on partitioned column and use the partitioned column as a first filtering condition in the query
o Have partitioned indexes created and aligned with underlying partitioned data
o Have file groups created on different disks for parallel IO
o Reduce the data size by adopting sliding window scenario or data archival process
o Create separate partition range for NULL values, suggested to be the first range in partition function with the usage of LEFT, having no overlapping with non null values
o Better partition range to have uniform distribution of data across partitions
o Hardware support such RAID
Determine which tables are performing poorly before partitioning
Having partition for smaller tables would reduce the performance of the queries cause of overhead to combine the data. Selection of table for partition would depend on the usage and data volume. If the queries are performing poorly mainly cause of huge data volume then partition will suit for the requirement. For smaller tables, if the query performs badly, then consider creating proper indexes instead of partitioning.
Consider non partitioned table size before partitioning
When you actually create and set up partitioning, you must create new tables. This may be a problem in systems where partitioning is needed but you can’t rebuild the table structure. While it is not possible to partition the existing table, a new partition table can be created and existing table data can be moved the new set of table. It is understood that large data tables data moving to different partition is a time consuming job, partition strategy should be part of the data modal before start loading data in the later phase.
Partitioning column limitations
You cannot use SQL CLR, timestamp, image, or ntext types as the partitioning column
Data types such as SQL CLR, time stamp, image, ntext types are not possible as they are not deterministic. TimeStamp column can be converted to integer by casting to big int and can be used as a partition column but the over head is queries while using the time stamp column should first convert it to big int for optimizer to recognize it as a partition column and also its value will change from each update resulting record restructure between two partitions. Another solution of using binary columns as partitioned column is to hash them or check_sum them to a different column and use this column as a partitioned column. The idea behind to not allow above data types is the ability of query optimizer to determine the partition based on the value of partitioned column of incoming record.
The columns must be deterministic and persisted
Partitioned column is to be deterministic and its value is to be persisted incase of modifications. Derived column can not be used as partitioned column as its value it not deterministic independently, depending on calculations based on more than one column.
Scenarios to avoid Partition
Following are the guidelines when to avoid partitions
• For smaller table, do not go for partitioning as there is no advantage of parallel runs but definitely partition overhead with switch
• Partition schema should be in such a way that data is better organized based on query requirements and uniformly distributed - else if queries get data from all partitions it is as good as no partition scenario
• MOLAP structure does not need partitions as the storage is not relational in nature – if all the queries are running against the cube and the source for the cube is relational then you do not need to create partitions for MOLAP and consider avoiding partitions on the relational.
• Table partitioning allows you to spread the load across disks. However, all data will still be managed by a single server. If your performance bottleneck is associated with CPU or memory, this method (partition) probably isn't your best choice
• Do not create partitions for smaller cardinality tables as it does not give advantage but overhead to maintain partitions. Typically done on fact tables; however, very large dimension tables (more than 10 million members) may benefit from partitioning
Partitions are unavoidable implementation particularly when the data volume is huge, so as is typically in warehouse systems. This chapter details the guidelines for implementing the various types of partitions in different scenarios to get loading time reduced and data access queries run faster. Sliding window scenario is much useful for loading huge volumes and archival of fact tables.