You created an index on a column, ran EXPLAIN, and the database still did a full table scan. What happened? The answer lies in a concept called selectivity.
What Is Selectivity?
Selectivity measures how "unique" the data in a column is:
Selectivity = Distinct values / Total rows
The result falls between 0 and 1. Closer to 1 means the index is more useful.
For example, a users table with 1,000,000 rows:
| Column | Distinct | Selectivity | Index effective? |
|---|---|---|---|
user_id | 1,000,000 | 1.000000 | ✅ Very high |
email | 999,800 | 0.999800 | ✅ Very high |
age | 80 | 0.000080 | ⚠️ Low |
country | 50 | 0.000050 | ❌ Very low |
gender | 2 | 0.000002 | ❌ Extremely low |
Why Does Low Selectivity Make an Index Useless?
From Selectivity → Estimated Rows
The query optimizer uses selectivity to estimate how many rows a query will return:
Estimated rows = Total rows ÷ Distinct values
For gender (only 2 values: M / F):
Estimated rows = 1,000,000 ÷ 2 = 500,000 rows
In other words, WHERE gender = 'M' returns 50% of the table. At that point, the index is no longer helpful.
Random I/O vs Sequential I/O
This is the physical reason. When using an index, the DB must:
- Traverse the B-tree (root → branch → leaf) to get a list of addresses for matching rows
- Jump to each of those addresses on disk to read the actual row data
With 500K rows, step 2 generates 500,000 random disk reads — each one jumping to a different location on disk.
A full table scan does the opposite — it reads the file sequentially from start to finish. Both HDDs and SSDs handle sequential I/O significantly faster than random I/O.
Index scan: ~500K random I/O ← slow Full scan: ~125K sequential I/O ← faster
Rule of thumb: If a query returns more than 10–15% of total rows, the optimizer will typically skip the index and do a full scan instead.
How Many I/Os Does One Index Lookup Cost?
Many people assume an index lookup = 1 I/O. It's actually not. With a secondary index, each row lookup goes through these steps:
I/O #1 → Root node (B-tree) I/O #2 → Branch node(s) I/O #3 → Leaf node → retrieve PK value I/O #4 → Clustered index root I/O #5 → Data page → full row data
Total: 3–5 I/Os per row. Multiply by 500,000 rows → 1.5–2.5 million random I/Os, while a full scan only costs around 125,000 sequential reads.
Note: The root node and branch nodes are usually cached in the buffer pool (RAM), so in practice it's often 2–3 I/Os per row. But leaf nodes and data pages almost always require disk reads.
How Is Data Physically Stored on Disk?
The database does not store rows individually. The smallest unit of read/write is a page:
- InnoDB (MySQL): page = 16 KB
- PostgreSQL: page = 8 KB
Each page holds multiple rows. With ~200-byte rows, a 16 KB page fits roughly 80 rows. Even if you only need 1 row, the DB loads the entire page into RAM.
With 20 rows (~4 KB total), the entire table fits in a single page. The B-tree index also has just one level — the root node is also the leaf node. Every query costs just 1–2 I/Os.
As the table grows, data spreads across more pages and the B-tree gains depth — that's when the gap between random and sequential I/O starts to matter.
Real-World Example: orders Table
Assume an orders table with 2,000,000 rows:
Query 1 — lookup by order_id (selectivity = 1.0)
SELECT * FROM orders WHERE order_id = 'ORD-000001'; -- Estimated rows: 2,000,000 ÷ 2,000,000 = 1 row -- → Index used ✅
Query 2 — order history for one customer (selectivity = 0.2)
SELECT * FROM orders WHERE customer_id = 'CUS-7823'; -- Estimated rows: 2,000,000 ÷ 400,000 = 5 rows -- → Index used ✅
Query 3 — all pending orders (selectivity = 0.0000025)
SELECT * FROM orders WHERE status = 'pending'; -- Estimated rows: 2,000,000 ÷ 5 = 400,000 rows (20% of table) -- → Index skipped ❌
Composite Index — Rescuing Low-Selectivity Columns
The same status column becomes useful when combined with customer_id:
SELECT * FROM orders WHERE customer_id = 'CUS-7823' AND status = 'pending';
Create a composite index:
CREATE INDEX idx ON orders (customer_id, status); -- ^^^^^^^^^^^ ^^^^^^ -- high low
customer_id narrows the result to ~5 rows first, then status filters further — down to ~1 row. The index is now extremely effective.
Rule: always put the column with higher selectivity first in a composite index.
Covering Index — Eliminating Data Page I/O Entirely
If the index leaf node already contains all the columns the query needs, the DB skips the clustered index lookup altogether:
-- Query only needs customer_id and status SELECT customer_id, status FROM orders WHERE customer_id = 'CUS-7823'; -- Covering index: both columns stored in the leaf node CREATE INDEX idx_covering ON orders (customer_id, status);
I/O steps #4 and #5 (data page lookup) are eliminated — dropping from 3–5 I/Os down to 1–3 I/Os per row.
Summary
| Scenario | Optimizer decision |
|---|---|
| High selectivity, few rows returned | Use index |
| Low selectivity, >10–15% of table | Full scan |
| Composite index, high-selectivity column first | Effective even with one low-selectivity column |
| Covering index | Skip data page lookup entirely |
An index is not "fast just because it exists." Understanding selectivity tells you when an index is genuinely useful, when it's wasted space, and how to design composite indexes in the right column order.
To check real selectivity in your database:
-- PostgreSQL SELECT attname, n_distinct, correlation FROM pg_stats WHERE tablename = 'users'; -- MySQL / InnoDB SHOW INDEX FROM users; -- The Cardinality column = estimated distinct values
