I have heard many debates about using integers v unique identifiers as primary keys. When asked my opinion, as a consultant, the only correct as answer is this: “it depends”. There is no right or wrong answer and each situation must be examined on its own merits. I will state, however, that most people don’t have thorough enough knowledge of SQL to understand the implications of using guids as primary keys, so their arguments quickly lose validity. The concepts are really quite simple.
Let’s start with discussing the difference between a primary key and a clustered index. When a primary key is created on SQL server, it is created as the clustered index by default. The clustered index is simply the physical sorting of the data. Many people don’t realize that the clustered index may be changed and it does not have to be the primary key. This affects index fragmentation and is directly related to the fill factor of the data pages. Fill factor is simply the percentage of each page utilized to store data and is directly related to indexing and performance. If the fill factor of a database is set to 80, then 20 percent of the page will be left empty for index expansion.
Consider the default fill factor of 0, however, which is the server default and is the same as 100, meaning that each page is filled to capacity. When your clustered index is set to an auto-incrementing column, each new record will be inserted on the next sequential line of the data page. It’s similar to writing a book- each new line comes after the next. Indexing is very easy because once a record is there, that is where it stays (unless records are deleted, but the effect of deleting records is the same regardless of the type of column your clustered index is on, so it doesn’t need to be discussed for the purposes of this discussion). Let’s assume that you can have 100 records on 1 data page. Records 1-100 are on page 1, 101-200 are on page 2, etc. We know that if there are 250 records, the next record, 251, comes right after 250 and will be on the third page. Thus, the indexes stay nice and neat.
Now consider using unique identifiers. Let’s start with defining a guid- it is a quasi-random number generated based on a series of variants which virtually guarantees uniqueness across all space and time until something like the year 2070 (one quick note- guids were not invented by Microsoft. I’ve heard so many mainframe people complain about having a Microsoft concept forced upon them- Microsoft calls them Guids (Globally Unique Identifiers) and the rest of the world calls them UUID (Universally Unique Identifiers) and is an open standard (ITU-T Rec. X.667 | ISO/IEC 9834-8)). The important item to note in that description is the word “random”. For the sake of this discussion, we can safely consider guids to be random numbers. Now that we understand what a clustered index is and have briefly reviewed the concept of fill factor, think about what happens when you insert a random number into your tables. No longer can we say that the next record is number 251 and will be on page 3. It could be number 2 on page 1, or number 2043 on page x- then the next number is completely different, etc. This means that each new record is not happily placed sequentially on the “line”, but is placed wherever it belongs in accordance to its random number. When this happens, your indexes become fragmented. This means that when the index was created and there were 1000 records, the database knew where each record lived. With sequential clustered indexes, each new record is added to the next line or page. With random clustered indexes- each new record is randomly placed somewhere in the pages and the index will no longer know where anything is and eventually becomes useless. The more “broken” the index becomes, the higher the level of fragmentation. The higher the fragmentation, the more SQL has to rely on table scans to find the data.
So now that we know that, let’s discuss guids v ints for a minute. Yes, using ints will always be faster than guids. And yes, using ints will always take less space than guids. I think that’s common sense and doesn’t take a $150 an hour consultant to tell us that. But in the year 2008, that’s simply not a good enough argument to say “never use guids as keys”; it’s a simplistic and uninformed statement. Let’s start with space issue first. Space was something that needed closely considered in the old “green screen” mainframe days. Yes, size mattered back then. But today we live in a world where you can buy a terabyte hard drive for $150. Size is simply now a commodity. That being said, I’ll repeat what I said earlier: “It depends”. Ignoring size is not acceptable for all systems- if you’re building databases for Google or Walmart, you still need to be mindful of such things in your data architecture and data modeling. For most systems, however, it is a non-issue.
Now let’s discuss speed. Yes, using ints will always be faster than using guids. This is a true statement and will never be disputed. The question, however, is “how much faster?”. On most systems you are talking about nanoseconds, at which point the speed differences is negligible. I will create another benchmark to demonstrate this to you at a later time, and I very much encourage you to do it on your own, but I have done this before with SQL 2005 and I benchmarked a database with 1,000,000 records and the difference in performance was measured in the nanoseconds. Again, this add up when you’re talking about hundreds of millions of records or even billions of records, so if you’re Google or Walmart, pay close attention to this. As for the rest of the world, it usually doesn’t matter.
So now I know what most of you are thinking at this point: “but I’ve seen systems will less than 1M records and the performance was horrible!”. Yup, and I can tell you exactly the reason why: index fragmentation. The developer or dba who built the tables using guids as primary keys didn’t understand the two things we talked about above: 1) use guids as your primary keys but change your clustered index to something more practical like a timestamp or other numeric value, and 2) if there aren’t any appropriate tables to build your clustered index on you keep it on the unique identifier- you MUST establish maintenance plans for your indexes. Each system must have its usage patterns evaluated to determine the best maintenance plan- hourly, nightly, weekly, monthly? Defrag (can be done online without locking) v Rebuild (locks table access)?
The last thing that most of you will be thinking right now is that it just seems like a lot of work to use a guid. Well, sure- there is extra work involved, but you get a lot of benefits from using guids. Especially in today’s world of SOA where systems are crossing organizational boundaries, the guaranteed uniqueness is a huge benefit that very often outweighs any of the cons associated with using guids. Another very practical benefit is that by being able create your guid in your business objects, you can fully build out parent / child relationships in the business tier without ever having to connect to the data tier to get the ID of your next object. And some developers will tell you they have a hard time reading guids and it’s easier to use ints- well, for that one all I can say is that they’ve never really done it. I’ve been using guids for a long time and it’s as simple as reading 2-5 digit number.
As for my own preferences, most of my tables will default to guids as primary keys (not clustered indexes) for true data, and will ints for things such as lookup tables and system data. But as I said, a good consultant will evaluate everything about a system and make an appropriate recommendation. There is no right or wrong, and if you’re married to one answer over the other, you’re not doing your job.
Elegant Software Solutions