DB Index Selectivity — Why Indexes Don't Always Make Queries Faster

single

You created an index on a column, ran EXPLAIN, and the database still did a full table scan. What happened? The answer lies in a concept called selectivity.


What Is Selectivity?

Selectivity measures how "unique" the data in a column is:

Selectivity = Distinct values / Total rows

The result falls between 0 and 1. Closer to 1 means the index is more useful.

For example, a users table with 1,000,000 rows:

ColumnDistinctSelectivityIndex effective?
user_id1,000,0001.000000✅ Very high
email999,8000.999800✅ Very high
age800.000080⚠️ Low
country500.000050❌ Very low
gender20.000002❌ Extremely low

Why Does Low Selectivity Make an Index Useless?

From Selectivity → Estimated Rows

The query optimizer uses selectivity to estimate how many rows a query will return:

Estimated rows = Total rows ÷ Distinct values

For gender (only 2 values: M / F):

Estimated rows = 1,000,000 ÷ 2 = 500,000 rows

In other words, WHERE gender = 'M' returns 50% of the table. At that point, the index is no longer helpful.

Random I/O vs Sequential I/O

This is the physical reason. When using an index, the DB must:

  1. Traverse the B-tree (root → branch → leaf) to get a list of addresses for matching rows
  2. Jump to each of those addresses on disk to read the actual row data

With 500K rows, step 2 generates 500,000 random disk reads — each one jumping to a different location on disk.

A full table scan does the opposite — it reads the file sequentially from start to finish. Both HDDs and SSDs handle sequential I/O significantly faster than random I/O.

Index scan:  ~500K random I/O   ← slow
Full scan:   ~125K sequential I/O ← faster

Rule of thumb: If a query returns more than 10–15% of total rows, the optimizer will typically skip the index and do a full scan instead.


How Many I/Os Does One Index Lookup Cost?

Many people assume an index lookup = 1 I/O. It's actually not. With a secondary index, each row lookup goes through these steps:

I/O #1  →  Root node (B-tree)
I/O #2  →  Branch node(s)
I/O #3  →  Leaf node  →  retrieve PK value
I/O #4  →  Clustered index root
I/O #5  →  Data page  →  full row data

Total: 3–5 I/Os per row. Multiply by 500,000 rows → 1.5–2.5 million random I/Os, while a full scan only costs around 125,000 sequential reads.

Note: The root node and branch nodes are usually cached in the buffer pool (RAM), so in practice it's often 2–3 I/Os per row. But leaf nodes and data pages almost always require disk reads.


How Is Data Physically Stored on Disk?

The database does not store rows individually. The smallest unit of read/write is a page:

  • InnoDB (MySQL): page = 16 KB
  • PostgreSQL: page = 8 KB

Each page holds multiple rows. With ~200-byte rows, a 16 KB page fits roughly 80 rows. Even if you only need 1 row, the DB loads the entire page into RAM.

With 20 rows (~4 KB total), the entire table fits in a single page. The B-tree index also has just one level — the root node is also the leaf node. Every query costs just 1–2 I/Os.

As the table grows, data spreads across more pages and the B-tree gains depth — that's when the gap between random and sequential I/O starts to matter.


Real-World Example: orders Table

Assume an orders table with 2,000,000 rows:

Query 1 — lookup by order_id (selectivity = 1.0)

SELECT * FROM orders WHERE order_id = 'ORD-000001';
-- Estimated rows: 2,000,000 ÷ 2,000,000 = 1 row
-- → Index used ✅

Query 2 — order history for one customer (selectivity = 0.2)

SELECT * FROM orders WHERE customer_id = 'CUS-7823';
-- Estimated rows: 2,000,000 ÷ 400,000 = 5 rows
-- → Index used ✅

Query 3 — all pending orders (selectivity = 0.0000025)

SELECT * FROM orders WHERE status = 'pending';
-- Estimated rows: 2,000,000 ÷ 5 = 400,000 rows (20% of table)
-- → Index skipped ❌

Composite Index — Rescuing Low-Selectivity Columns

The same status column becomes useful when combined with customer_id:

SELECT * FROM orders
WHERE customer_id = 'CUS-7823'
AND   status = 'pending';

Create a composite index:

CREATE INDEX idx ON orders (customer_id, status);
--                          ^^^^^^^^^^^  ^^^^^^
--                          high         low

customer_id narrows the result to ~5 rows first, then status filters further — down to ~1 row. The index is now extremely effective.

Rule: always put the column with higher selectivity first in a composite index.


Covering Index — Eliminating Data Page I/O Entirely

If the index leaf node already contains all the columns the query needs, the DB skips the clustered index lookup altogether:

-- Query only needs customer_id and status
SELECT customer_id, status FROM orders WHERE customer_id = 'CUS-7823';

-- Covering index: both columns stored in the leaf node
CREATE INDEX idx_covering ON orders (customer_id, status);

I/O steps #4 and #5 (data page lookup) are eliminated — dropping from 3–5 I/Os down to 1–3 I/Os per row.


Summary

ScenarioOptimizer decision
High selectivity, few rows returnedUse index
Low selectivity, >10–15% of tableFull scan
Composite index, high-selectivity column firstEffective even with one low-selectivity column
Covering indexSkip data page lookup entirely

An index is not "fast just because it exists." Understanding selectivity tells you when an index is genuinely useful, when it's wasted space, and how to design composite indexes in the right column order.


To check real selectivity in your database:

-- PostgreSQL
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'users';

-- MySQL / InnoDB
SHOW INDEX FROM users;
-- The Cardinality column = estimated distinct values
thongvmdev_M9VMOt
WRITTEN BY

thongvmdev

Share and grow together