Index pruning techniques

Last week I ran across two different blog posts discussing removing duplicate and useless indexes from a database. Coincidentally, I have a nagging TODO item to clean up some indexes in the schema on one of the applications we have been developing. So, with inspiration in hand, here are some techniques to clean up your indexes.

Finding unused indexes

This is sort of my classic approach to index pruning, as in a lot of cases indexes are created based on an expected need, but the actual need turns out to be something different. This method should work against any recent version of postgres, though it relies on having postgresql statistics gathering turned on, and that you have been running production level traffic against the database for some time. The query is as follows:

select 
    indexrelid::regclass as index, relid::regclass as table 
from 
    pg_stat_user_indexes 
    JOIN pg_index USING (indexrelid) 
where 
    idx_scan = 0 and indisunique is false;

You’ll notice that we ignore Primary Key and Unique indexes, because they are required to enforce uniqueness, even if they are not used. I’d also note that the “idx_scan = 0” is used because it is pretty black and white, either the index has been used or it hasn’t. If you’re concerned that you may have indexes that have been used in the past that are no longer valid, you can either up that threshold, or reset your postgresql statistics and then have a fresh look at things once you have collected some data. In both cases, beware of the “little used but extremely necessary” index; for example an index needed to run a monthly report.

Finding duplicate indexes

This next method is actually a pretty good idea, especially during application development, where you don’t have the luxury of real queries running against your data. I’ve worked out some special instances of this query in the past, but I think this version will work well for general purposes. What this query does is look for true duplicate indexes, Ie. same table, same columns, same column order.

select 
    indrelid::regclass, array_accum(indexrelid::regclass) 
from 
    pg_index 
group by 
    indrelid, indkey 
having 
    count(*) > 1;

What this returns is the table name, and then a list of indexes with the above matching criteria (generated using array_accum). What it doesn’t do is make any assumptions about which is the best index to drop. For example, if you have two duplicate indexes, but one is a unique index and the other is not, you would probably want to drop the non-unique index, but that might be application dependent; this will get you close to the problem, but you’ll need to decide that last part.

Now, this is good for finding true duplicates, but we can also take that a step further. In this next query, we list out all of the indexes that have a matching multi-column index, meaning if you are indexing column a, and columns a,b, we will pop that up on the list.

select 
    a.indrelid::regclass, a.indexrelid::regclass, b.indexrelid::regclass 
from 
    (select \*,array_to_string(indkey,' ') as cols from pg_index) a 
    join (select \*,array_to_string(indkey,' ') as cols from pg_index) b on 
        ( a.indrelid=b.indrelid and a.indexrelid > b.indexrelid 
        and 
            ( 
                (a.cols LIKE b.cols||'%' and coalesce(substr(a.cols,length(b.cols)+1,1),' ')=' ') 
                or 
                (b.cols LIKE a.cols||'%' and coalesce(substr(b.cols,length(a.cols)+1,1),' ')=' ') 
            ) 
        ) 
order by 
    indrelid;

The result format is table name, first index, second index, though like before, the order of the indexes does not indicate which index is better or worse than the other, because there are too many cases where one index might be better, or maybe even both are needed. If you’re running this against a live system, I’d recommend checking in the stats… in some cases a dual column index might never be used if a single column index exists (especially when you factor in postgresql’s in-memory index bitmapping), but the opposite is just as true. Also beware that in some cases, you might have a dual column index for performance, but a single column index to enforce a unique constraint; in that case you have to keep both indexes.

Finding useless indexes

The last method listed here is an attempt to determine which indexes are unnecessary based on the data contained within the database. To use it, you’ll once again need to have real data available for looking at (though it can be just a copy of production; it doesn’t have to actually be production), and you’ll want to make sure all of your tables are freshly analyzed (with a default_stats_target of at least 100 ;-D ). What this query does it look at the relative distinctness of the values in a given indexes column and the show indexes where there is not enough relative distinctness that the planner is likely to use the index. Warning, this query isn’t as solid as the others:

select 
    starelid::regclass, indexrelid::regclass, array_accum(staattnum), relpages, reltuples, array_accum(stadistinct) 
from 
    pg_index 
    join pg_statistic on (starelid=indrelid and staattnum = ANY(indkey)) 
    join pg_class on (indexrelid=oid) 
where 
    case when stadistinct < 0 then stadistinct > -.8 else reltuples/stadistinct > .2 end 
    and 
    not (indisunique or indisprimary) 
    and 
    (relpages > 100 or reltuples > 1000) 
group by 
    starelid, indexrelid, relpages, reltuples 
order by 
    starelid ;

When I say it isn’t as solid, that’s mostly because I didn’t spend as much time trying to get it right as the others. I found that it really didn’t give me anything that the queries above wouldn’t have flagged, and it has most of the same warnings as above, like columns that are not terribly distinct might be used in bitmap-joins, plus additional issues like statistical correlation for multi-column indexes being pretty gray. I think the idea behind this method is pretty sound though, so I’d be interested in seeing a newer version that has some better maths applied to it (especially if you can work things out for multiple columns).

Summary

Index pruning is a very application specific task, that relies not only on knowing your database, but your data and your business rules as well. None of these items are magic bullet solutions, but should be generally helpful in finding trouble spots that deserve a second look.