Skip to main content

Statistics In SQL

If you know nothing about Statistics, you've come to the right place. I'm not going to bore you with long technical definitions that make no sense to people who don't know anything about it. I'm going to show everything with examples and visuals. You just sit back and relax.

I have a Customers table. When I search for 'Abigail Parks' I get 5 records.
And I have 3 indexes for this table. First is the clustered index, which is created automatically for the primary key column. The second and third ones are non-clustered indexes, one for the NameSurname column and the other one is for the Birthdate column. Now I'm wondering which index SQL used for my query. Was it able to find NameSurname index? When I click on Display Estimated Execution Plan button I get this:
SQL was able to find the right index. In my next query, I try to find how many customers were born on the same day with the first 'Abigail Parks' on my list, which is 1973-06-05.
I have 512 people who were born that day. Now I check again if SQL was able to find the right index.

It could find the right Index again. What if I say: Select * From Customers Where NameSurname = 'Abigail Parks' and Birthdate = '1973-06-05'? Which Index is going to choose? Smart choice would be choosing the NameSurname Index since it has fewer rows. Let's see which one it's going to choose.

How does SQL know which index has fewer rows? It knows it thanks to the statistics. Every index in SQL has its own statistics recorded under the Indexes section. When I click on the properties of IX_NameSurname Statistics, I can see the histogram of the NameSurname column. 


To optimize our query performance, we need to keep our statistics up to date. Evey time we rebuild or reorganize our indexes, these statistics get updated automatically. Now I want to confuse SQL and see what happens. I'm going to add 1000 more 'Abigail Parks' to my table with stored procedure and see which index SQL is going to choose. 

After adding 1000 more 'Abigail Parks' I asked the same question again.

It was a wrong choice this time. The reason for that, those statistics are not updated. When I checked the statistics, I saw that 'Abigail Parks' was still recorded as 5 rows. So what's the price that we pay for not updated statistics? The good thing is we can choose which index to use at our query and by using SET STATISTICS IO ON formula we can see how many pages it has to read for every index.
This is the price that we pay. If my statistics were updated, SQL was going to read 107 pages for that query. But it wasn't updated and it used NameSurname column and had to read 2088 pages. If it was updated, my query would work 20 times faster. 

What to do to force SQL to use the right index? Using the WITH condition every time wouldn't be effective. The only way to do it effectively is by updating statistics. We have 4 different ways to do that. We can rebuild or reorganize our indexes. But it might not be possible to do it every day, because when our system is working, rebuilding or reorganizing indexes slows our system. Luckily, we can update our statistics without touching our indexes by using SP_UPDATESTATS. This command updates every statistic in our server. Or we can choose which table's statistics to update by using the UPDATE command: UPDATE STATISTICS 'TableName'

To optimize our queries we need to update our statistics regularly. But doing it every time by writing a query might not be the best option. In my opinion, the best option is to add it to our database maintenance plan, which is a subject for another post.

Comments

Popular posts from this blog

Manipulating Grand Totals with ISINSCOPE in Power BI

ISINSCOPE function was introduced in with November 2018 Power BI update. We can say it's a very young player in the game, but it's definitely magic. It can be used to manage lots of issues related to hierarchies, but in this post, I'm going to focus on manipulating the grand total line with it.  This is my budget forecast table. The budget forecast that you see for every month is not the forecast for every month, but it's the forecast for the whole year. The budget forecast for 2020 has changed every month. That's why we have different numbers for every row. Below you see the table visual with the year-month column and a measure that shows the values for every month. The thing that I want you to pay attention to is the total line. I wanted to see the last month's value for the total line and this is the measure for that.  This measure calculates this: If the year-month column has one value then take the average of that value. If it doesn't have one value, li...

How to Add Flags to Power BI Reports

Adding flags to Power BI Reports might be easier said than done. If you have a dataset which contains data of various countries, adding country flags looks cool. The process to do it, on the other hand, might be compelling. Now I am going to share with you the obstacles I  have faced during the process and of course, how I overcame them. First I made an excel list of countries that my dataset has. Then I have downloaded to my computer all the flags I need from Wikipedia and converted them to Base64 link via an image encoder website  base64-image.de  and added those links to my excel file, which looks like this: After I added my excel file to Power BI, I converted the data category of URL column from 'uncategorized' to 'image URL', which might be a small step for you but it's a big step for the success of the flag operation. Then I created one to many relationship with DimLocation table and chose 'both' as filter direction, so that I can filter from both t...

Solving Data Type Conflicts in Power BI and SQL

Whether we like it or not, error messages are part of our lives. The other day, I faced one of those little horrid yellow messages. For me, it's impossible to stay chill when I see that message. For a second, I feel alarmed and panicked. Let me show you what I'm talking about. The first column is the customer code, quite self-explanatory. The Registration Date column shows the date when the customer registered to my shop. The First Sale Date column shows the date when the customer bought something for the first time. And this is what I'm trying to see in the Sales Day column: If Registration Date and First Sale Date are the same, meaning customer bought something at the same day customer registered to my shop, print  "First Day", otherwise show the data of First Sale Date column. But computer says no! It says "Expressions that yield variant data-type cannot be used to define calculated columns." If I need to translate it in simplified English, it means: ...