Optimization is a complex task because ultimately it requires understanding
of the entire system to be optimized. Although it may be possible to perform
some local optimizations with little knowledge of your system or
application, the more optimal you want your system to become, the more you
must know about it.
This chapter tries to explain and give some examples of different ways to
optimize MySQL. Remember, however, that there are always additional ways to
make the system even faster, although they may require increasing effort to
achieve.
The most important factor in making a system fast is its basic design. You
must also know what kinds of processing your system is doing, and what
its bottlenecks are. In most cases, system bottlenecks arise from these
sources:
Disk seeks. It takes time for the disk to find a piece of data. With
modern disks, the mean time for this is usually lower than 10ms, so
we can in theory do about 100 seeks a second. This time improves
slowly with new disks and is very hard to optimize for a single
table. The way to optimize seek time is to distribute the data onto
more than one disk.
Disk reading and writing. When the disk is at the correct position, we
need to read the data. With modern disks, one disk delivers at least
10–20MB/s throughput. This is easier to optimize than seeks because
you can read in parallel from multiple disks.
CPU cycles. When we have the data in main memory, we need to process
it to get our result. Having small tables compared to the amount of
memory is the most common limiting factor. But with small tables,
speed is usually not the problem.
Memory bandwidth. When the CPU needs more data than can fit in the CPU
cache, main memory bandwidth becomes a bottleneck. This is an
uncommon bottleneck for most systems, but one to be aware of.
7.1.1. MySQL Design Limitations and Tradeoffs
When using the MyISAM storage engine, MySQL
uses extremely fast table locking that allows multiple readers or a
single writer. The biggest problem with this storage engine occurs when
you have a steady stream of mixed updates and slow selects on a single
table. If this is a problem for certain tables, you can use another
storage engine for them. See
Chapter 14, Storage Engines and Table Types.
MySQL can work with both transactional and non-transactional tables. To
make it easier to work smoothly with non-transactional tables (which
cannot roll back if something goes wrong), MySQL has the following
rules. Note that these rules apply only
when you use the
IGNORE specifier for
INSERT or UPDATE.
All columns have default values.
If you insert an inappropriate or out-of-range value into a column,
MySQL sets the column to the “best possible
value” instead of reporting an error. For numerical values,
this is 0, the smallest possible value or the largest possible
value. For strings, this is either the empty string or as much of
the string as can be stored in the column.
All calculated expressions return a value that can be used instead
of signaling an error condition. For example, 1/0 returns
NULL.
Because all SQL servers implement different parts of standard SQL, it
takes work to write portable database applications. It is very easy to
achieve portability for very simple selects and inserts, but becomes
more difficult the more capabilities you require. If you want an
application that is fast with many database systems, it becomes even
more difficult.
All database systems have some weak points. That is, they have different
design compromises that lead to different behavior.
To make a complex application portable, you need to determine which SQL
servers it must work with, and then determine what features those
servers support. You can use the MySQL
crash-me program to find
functions, types, and limits that you can use with a selection of
database servers. crash-me
does not check for every possible feature, but it is still reasonably
comprehensive, performing about 450 tests. An example of the type of
information crash-me can
provide is that you should not use column names that are longer than 18
characters if you want to be able to use Informix or DB2.
The crash-me program and
the MySQL benchmarks are all very database independent. By taking a look
at how they are written, you can get a feeling for what you must do to
make your own applications database independent. The programs can be
found in the sql-bench directory of MySQL
source distributions. They are written in Perl and use the DBI database
interface. Use of DBI in itself solves part of the portability problem
because it provides database-independent access methods. See
Section 7.1.4, “The MySQL Benchmark Suite”.
If you strive for database independence, you need to get a good feeling
for each SQL server's bottlenecks. For example, MySQL is very fast in
retrieving and updating rows for
MyISAM tables, but has a problem in mixing
slow readers and writers on the same table. Oracle, on the other hand,
has a big problem when you try to access rows that you have recently
updated (until they are flushed to disk). Transactional database systems
in general are not very good at generating summary tables from log
tables, because in this case row locking is almost useless.
To make your application really
database independent, you should define an easily extendable interface
through which you manipulate your data. For example, C++ is available on
most systems, so it makes sense to use a C++ class-based interface to
the databases.
If you use some feature that is specific to a given database system
(such as the REPLACE statement, which is
specific to MySQL), you should implement the same feature for other SQL
servers by coding an alternative method. Although the alternative might
be slower, it enables the other servers to perform the same tasks.
With MySQL, you can use the /*! */ syntax
to add MySQL-specific keywords to a statement. The code inside
/* */ is treated as a comment (and ignored)
by most other SQL servers. For information about writing comments, see
Section 9.4, “Comment Syntax”.
If high performance is more important than exactness, as for some Web
applications, it is possible to create an application layer that caches
all results to give you even higher performance. By letting old results
expire after a while, you can keep the cache reasonably fresh. This
provides a method to handle high load spikes, in which case you can
dynamically increase the cache size and set the expiration timeout
higher until things get back to normal.
In this case, the table creation information should contain information
about the initial cache size and how often the table should normally be
refreshed.
An attractive alternative to implementing an application cache is to use
the MySQL query cache. By enabling the query cache, the server handles
the details of determining whether a query result can be reused. This
simplifies your application. See
Section 5.13, “The MySQL Query Cache”.
7.1.3. What We Have Used MySQL For
This section describes an early application for MySQL.
During MySQL initial development, the features of MySQL were made to fit
our largest customer, which handled data warehousing for a couple of the
largest retailers in Sweden.
From all stores, we got weekly summaries of all bonus card transactions,
and were expected to provide useful information for the store owners to
help them find how their advertising campaigns were affecting their own
customers.
The volume of data was quite huge (about seven million summary
transactions per month), and we had data for 4–10 years that we needed
to present to the users. We got weekly requests from our customers, who
wanted instant access to new reports from this data.
We solved this problem by storing all information per month in
compressed “transaction tables.” We had a set
of simple macros that generated summary tables grouped by different
criteria (product group, customer id, store, and so on) from the tables
in which the transactions were stored. The reports were Web pages that
were dynamically generated by a small Perl script. This script parsed a
Web page, executed the SQL statements in it, and inserted the results.
We would have used PHP or mod_perl
instead, but they were not available at the time.
For graphical data, we wrote a simple tool in C that could process SQL
query results and produce GIF images based on those results. This tool
also was dynamically executed from the Perl script that parses the Web
pages.
In most cases, a new report could be created simply by copying an
existing script and modifying the SQL query that it used. In some cases,
we needed to add more columns to an existing summary table or generate a
new one. This also was quite simple because we kept all
transaction-storage tables on disk. (This amounted to about 50GB of
transaction tables and 200GB of other customer data.)
We also let our customers access the summary tables directly with ODBC
so that the advanced users could experiment with the data themselves.
This system worked well and we had no problems handling the data with
quite modest Sun Ultra SPARCstation hardware (2×200MHz). Eventually the
system was migrated to Linux.
7.1.4. The MySQL Benchmark Suite
This benchmark suite is meant to tell any user what operations a given
SQL implementation performs well or poorly. You can get a good idea for
how the benchmarks work by looking at the code and results in the
sql-bench directory in any MySQL source distribution.
Note that this benchmark is single-threaded, so it measures the minimum
time for the operations performed. We plan to add multi-threaded tests
to the benchmark suite in the future.
To use the benchmark suite, the following requirements must be
satisfied:
The benchmark scripts are written in Perl and use the Perl DBI
module to access database servers, so DBI must be installed. You
also need the server-specific DBD drivers for each of the servers
you want to test. For example, to test MySQL, PostgreSQL, and DB2,
you must have the
DBD::mysql,
DBD::Pg, and DBD::DB2 modules
installed. See
Section 2.14, “Perl Installation Notes”.
After you obtain a MySQL source distribution, you can find the benchmark
suite located in its sql-bench
directory. To run the benchmark tests, build MySQL, and then change
location into the sql-bench
directory and execute the run-all-tests
script:
shell> cd sql-bench
shell> perl run-all-tests --server=server_name
server_name should be the name
of one of the supported servers. To get a list of all options and
supported servers, invoke this command:
shell> perl run-all-tests --help
The crash-me script also
is located in the
sql-bench directory.
crash-me tries to
determine what features a database system supports and what its
capabilities and limitations are by actually running queries. For
example, it determines:
You should definitely benchmark your application and database to find
out where the bottlenecks are. After fixing one bottleneck (or by
replacing it with a “dummy” module), you can
proceed to identify the next bottleneck. Even if the overall performance
for your application currently is acceptable, you should at least make a
plan for each bottleneck and decide how to solve it if someday you
really need the extra performance.
For examples of portable benchmark programs, look at those in the MySQL
benchmark suite. See
Section 7.1.4, “The MySQL Benchmark Suite”. You can take any program
from this suite and modify it for your own needs. By doing this, you can
try different solutions to your problem and test which really is fastest
for you.
It is very common for a problem to occur only when the system is very
heavily loaded. We have had many customers who contact us when they have
a (tested) system in production and have encountered load problems. In
most cases, performance problems turn out to be due to issues of basic
database design (for example, table scans are not good under high load)
or problems with the operating system or libraries. Most of the time,
these problems would be much easier to fix if the systems were not
already in production.
To avoid problems like this, you should put some effort into
benchmarking your whole application under the worst possible load. You
can use Super Smack, available at
http://jeremy.zawodny.com/mysql/super-smack/. As suggested by its
name, it can bring a system to its knees, so make sure to use it only on
your development systems.
First, one factor affects all statements: The more complex your
permissions setup, the more overhead you have. Using simpler permissions
when you issue GRANT statements enables
MySQL to reduce permission-checking overhead when clients execute
statements. For example, if you do not grant any table-level or
column-level privileges, the server need not ever check the contents of
the tables_priv and
columns_priv tables. Similarly, if you place
no resource limits on any accounts, the server does not have to perform
resource counting. If you have a very high statement-processing load, it
may be worth the time to use a simplified grant structure to reduce
permission-checking overhead.
If your problem is with a specific MySQL expression or function, you can
perform a timing test by invoking the
BENCHMARK() function using the
mysql client program. Its
syntax is
BENCHMARK(loop_count,expression).
The return value is always zero, but mysql
prints a line displaying approximately how long the statement took to
execute. For example:
mysql> SELECT BENCHMARK(1000000,1+1);
+------------------------+
| BENCHMARK(1000000,1+1) |
+------------------------+
| 0 |
+------------------------+
1 row in set (0.32 sec)
This result was obtained on a Pentium II 400MHz system. It shows that
MySQL can execute 1,000,000 simple addition expressions in 0.32 seconds
on that system.
All MySQL functions should be highly optimized, but there may be some
exceptions. BENCHMARK() is an excellent
tool for finding out if some function is a problem for your queries.
7.2.1. Optimizing Queries with EXPLAIN
EXPLAIN tbl_name
Or:
EXPLAIN [EXTENDED] SELECT select_options
The EXPLAIN statement can be used either as
a synonym for DESCRIBE or as a way to
obtain information about how MySQL executes a
SELECT
statement:
EXPLAIN
tbl_name is
synonymous with DESCRIBE
tbl_name or
SHOW COLUMNS FROM
tbl_name.
When you precede a SELECT statement
with the keyword EXPLAIN, MySQL
displays information from the optimizer about the query execution
plan. That is, MySQL explains how it would process the
SELECT, including information about how
tables are joined and in which order.
With the help of EXPLAIN, you can see where
you should add indexes to tables to get a faster
SELECT that uses indexes to find rows. You
can also use EXPLAIN to check whether the
optimizer joins the tables in an optimal order. To force the optimizer
to use a join order corresponding to the order in which the tables are
named in the SELECT
statement, begin the statement with SELECT
STRAIGHT_JOIN rather than just
SELECT.
If you have a problem with indexes not being used when you believe that
they should be, you should run ANALYZE TABLE
to update table statistics such as cardinality of keys, that can affect
the choices the optimizer makes. See
Section 13.5.2.1, “ANALYZE TABLE Syntax”.
EXPLAIN returns a row of information for
each table used in the SELECT statement.
The tables are listed in the output in the order that MySQL would read
them while processing the query. MySQL resolves all joins using a
single-sweep multi-join method. This means that MySQL reads a row
from the first table, and then finds a matching row in the second table,
the third table, and so on. When all tables are processed, MySQL outputs
the selected columns and backtracks through the table list until a table
is found for which there are more matching rows. The next row is read
from this table and the process continues with the next table.
In MySQL version 4.1, the EXPLAIN output
format was changed to work better with constructs such as
UNION statements, subqueries, and derived
tables. Most notable is the addition of two new columns:
id and select_type.
You do not see these columns when using servers older than MySQL 4.1.
EXPLAIN syntax also was augmented to allow
the EXTENDED keyword. When this keyword is
used, EXPLAIN produces extra information
that can be viewed by issuing a SHOW WARNINGS
statement following the EXPLAIN statement.
This information displays how the optimizer qualifies table and column
names in the SELECT statement, what the
SELECT looks like after the application of rewriting and
optimization rules, and possibly other notes about the optimization
process.
Each output row from EXPLAIN provides
information about one table, and each row contains the following
columns:
id
The SELECT identifier. This is the
sequential number of the SELECT within
the query.
select_type
The type of SELECT, which can be any of
those shown in the following table:
SIMPLE
Simple SELECT
(not using UNION or subqueries)
PRIMARY
Outermost
SELECT
UNION
Second or later
SELECT statement in a
UNION
DEPENDENT UNION
Second or later
SELECT statement in a
UNION, dependent on outer query
UNION RESULT
Result of a
UNION.
SUBQUERY
First
SELECT in subquery
DEPENDENT SUBQUERY
First
SELECT in subquery, dependent on outer query
The join type. The different join types are listed here, ordered
from the best type to the worst:
system
The table has only one row (= system table). This is a special
case of the const join type.
const
The table has at most one matching row, which is read at the
start of the query. Because there is only one row, values from
the column in this row can be regarded as constants by the rest
of the optimizer.
const tables are very fast because
they are read only once.
const is used when you compare all
parts of a PRIMARY KEY or
UNIQUE index to constant values. In
the following queries,
tbl_name can be used
as a
const table:
SELECT * FROM tbl_name WHERE primary_key=1;
SELECT * FROM tbl_name
WHERE primary_key_part1=1 AND primary_key_part2=2;
eq_ref
One row is read from this table for each combination of rows
from the previous tables. Other than the
system and
const
types, this is the best possible join type. It is used when all
parts of an index are used by the join and the index is a
PRIMARY KEY or
UNIQUE index.
eq_ref can be used for indexed
columns that are compared using the =
operator. The comparison value can be a constant or an
expression that uses columns from tables that are read before
this table. In the following examples, MySQL can use an
eq_ref join to process
ref_table:
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
ref
All rows with matching index values are read from this table for
each combination of rows from the previous tables.
ref is used if the join uses only a leftmost prefix of
the key or if the key is not a
PRIMARY KEY or
UNIQUE index (in other words, if
the join cannot select a single row based on the key value). If
the key that is used matches only a few rows, this is a good
join type.
ref can be used for indexed columns
that are compared using the = or
<=> operator. In the following
examples, MySQL can use a ref join
to process ref_table:
SELECT * FROM ref_table WHERE key_column=expr;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
ref_or_null
This join type is like ref, but
with the addition that MySQL does an extra search for rows that
contain NULL values. This join type
optimization was added for MySQL 4.1.1 and is used mostly when
resolving subqueries. In the following examples, MySQL can use a
ref_or_null
join to process ref_table:
SELECT * FROM ref_table
WHERE key_column=expr OR key_column IS NULL;
This type replaces ref for some
IN subqueries of the following
form:
value IN (SELECT primary_key FROM single_table WHERE some_expr)
unique_subquery is just an index
lookup function that replaces the subquery completely for better
efficiency.
index_subquery
This join type is similar to
unique_subquery. It replaces
IN subqueries, but it works for
non-unique indexes in subqueries of the following form:
value IN (SELECT key_column FROM single_table WHERE some_expr)
range
Only rows that are in a given range are retrieved, using an
index to select the rows. The key
column in the output row indicates which index is used. The
key_len contains the longest key part that was used. The
ref column is
NULL for this type.
range can be used when a key column
is compared to a constant using any of the
=, <>,
>, >=,
<, <=,
IS NULL,
<=>,
BETWEEN, or
IN
operators:
SELECT * FROM tbl_name
WHERE key_column = 10;
SELECT * FROM tbl_name
WHERE key_column BETWEEN 10 and 20;
SELECT * FROM tbl_name
WHERE key_column IN (10,20,30);
SELECT * FROM tbl_name
WHERE key_part1= 10 AND key_part2 IN (10,20,30);
index
This join type is the same as ALL,
except that only the index tree is scanned. This usually is
faster than ALL because the index
file usually is smaller than the data file.
MySQL can use this join type when the query uses only columns
that are part of a single index.
ALL
A full table scan is done for each combination of rows from the
previous tables. This is normally not good if the table is the
first table not marked
const, and usually
very bad in all other
cases. Normally, you can avoid ALL
by adding indexes that allow row retrieval from the table based
on constant values or column values from earlier tables.
possible_keys
The possible_keys column indicates
which indexes MySQL can choose from use to find the rows in this
table. Note that this column is totally independent of the order of
the tables as displayed in the output from
EXPLAIN. That means that some of the
keys in possible_keys might not be
usable in practice with the generated table order.
If this column is NULL, there are no
relevant indexes. In this case, you may be able to improve the
performance of your query by examining the
WHERE clause to check whether it refers
to some column or columns that would be suitable for indexing. If
so, create an appropriate index and check the query with
EXPLAIN again. See
Section 13.1.2, “ALTER TABLE Syntax”.
To see what indexes a table has, use SHOW
INDEX FROM tbl_name.
key
The key column indicates the key
(index) that MySQL actually decided to use. The key is
NULL if no index was chosen. To force
MySQL to use or ignore an index listed in the
possible_keys column, use
FORCE INDEX, USE INDEX, or
IGNORE INDEX in your query. See
Section 13.2.7, “SELECT Syntax”.
The key_len column indicates the length
of the key that MySQL decided to use. The length is
NULL if the key
column says NULL. Note that the value
of
key_len enables you to determine how
many parts of a multiple-part key MySQL actually uses.
ref
The ref column shows which columns or
constants are compared to the index named in the
key column to select rows from the
table.
rows
The rows column indicates the number of
rows MySQL believes it must examine to execute the query.
Extra
This column contains additional information about how MySQL resolves
the query. Here is an explanation of the values that can appear in
this column:
Distinct
MySQL is looking for distinct values, so it stops searching for
more rows for the current row combination after it has found the
first matching row.
Not exists
MySQL was able to do a LEFT JOIN
optimization on the query and does not examine more rows in this
table for the previous row combination after it finds one row
that matches the LEFT JOIN
criteria. Here is an example of the type of query that can be
optimized this way:
SELECT * FROM t1 LEFT JOIN t2 ON t1.id=t2.id
WHERE t2.id IS NULL;
Assume that t2.id is defined as
NOT NULL. In this case, MySQL scans
t1 and looks up the rows in
t2 using the values of
t1.id. If MySQL finds a matching
row in t2, it knows that
t2.id can never be
NULL, and does not scan through the
rest of the rows in t2 that have
the same id value. In other words,
for each row in t1, MySQL needs to
do only a single lookup in t2,
regardless of how many rows actually match in
t2.
range checked for each record (index map:
N)
MySQL found no good index to use, but found that some of indexes
might be used after column values from preceding tables are
known. For each row combination in the preceding tables, MySQL
checks whether it is possible to use a
range access method to retrieve rows. The applicability
criteria are as described in
Section 7.2.5, “Range Optimization”, with the exception that
all column values for the preceding table are known and
considered to be constants.
This is not very fast, but is faster than performing a join with
no index at all.
Using filesort
MySQL must do an extra pass to find out how to retrieve the rows
in sorted order. The sort is done by going through all rows
according to the join type and storing the sort key and pointer
to the row for all rows that match the
WHERE clause. The keys then are sorted and the rows are
retrieved in sorted order. See
Section 7.2.9, “ORDER BY
Optimization”.
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional
seek to read the actual row. This strategy can be used when the
query uses only columns that are part of a single index.
Using temporary
To resolve the query, MySQL needs to create a temporary table to
hold the result. This typically happens if the query contains
GROUP BY and
ORDER BY clauses that list columns
differently.
Using where
A WHERE clause is used to restrict
which rows to match against the next table or send to the
client. Unless you specifically intend to fetch or examine all
rows from the table, you may have something wrong in your query
if the Extra
value is not Using where and the
table join type is ALL or
index.
If you want to make your queries as fast as possible, you should
look out for Extra values of
Using filesort and Using temporary.
Using index for group-by
Similar to the Using index way of
accessing a table, Using index for
group-by indicates that MySQL found an index that can be
used to retrieve all columns of a
GROUP BY or
DISTINCT query without any extra
disk access to the actual table. Additionally, the index is used
in the most efficient way so that for each group, only a few
index entries are read. For details, see
Section 7.2.10, “GROUP BY
Optimization”.
You can get a good indication of how good a join is by taking the
product of the values in the rows column of
the EXPLAIN output. This should tell you
roughly how many rows MySQL must examine to execute the query. If you
restrict queries with the
max_join_size system variable, this row
product also is used to determine which multiple-table
SELECT statements to execute and which to
abort. See
Section 7.5.2, “Tuning Server Parameters”.
The following example shows how a multiple-table join can be optimized
progressively based on the information provided by
EXPLAIN.
Suppose that you have the SELECT statement
shown here and that you plan to examine it using
EXPLAIN:
EXPLAIN SELECT tt.TicketNumber, tt.TimeIn,
tt.ProjectReference, tt.EstimatedShipDate,
tt.ActualShipDate, tt.ClientID,
tt.ServiceCodes, tt.RepetitiveID,
tt.CurrentProcess, tt.CurrentDPPerson,
tt.RecordVolume, tt.DPPrinted, et.COUNTRY,
et_1.COUNTRY, do.CUSTNAME
FROM tt, et, et AS et_1, do
WHERE tt.SubmitTime IS NULL
AND tt.ActualPC = et.EMPLOYID
AND tt.AssignedPC = et_1.EMPLOYID
AND tt.ClientID = do.CUSTNMBR;
For this example, make the following assumptions:
The columns being compared have been declared as follows:
Table
Column
Data Type
tt
ActualPC
CHAR(10)
tt
AssignedPC
CHAR(10)
tt
ClientID
CHAR(10)
et
EMPLOYID
CHAR(15)
do
CUSTNMBR
CHAR(15)
The tables have the following indexes:
Table
Index
tt
ActualPC
tt
AssignedPC
tt
ClientID
et
EMPLOYID
(primary key)
do
CUSTNMBR (primary key)
The tt.ActualPC values are not evenly
distributed.
Initially, before any optimizations have been performed, the
EXPLAIN statement produces the following
information:
table type possible_keys key key_len ref rows Extra
et ALL PRIMARY NULL NULL NULL 74
do ALL PRIMARY NULL NULL NULL 2135
et_1 ALL PRIMARY NULL NULL NULL 74
tt ALL AssignedPC, NULL NULL NULL 3872
ClientID,
ActualPC
range checked for each record (key map: 35)
Because type is ALL
for each table, this output indicates that MySQL is generating a
Cartesian product of all the tables; that is, every combination of rows.
This takes quite a long time, because the product of the number of rows
in each table must be examined. For the case at hand, this product is 74
× 2135 × 74 × 3872 = 45,268,558,720 rows. If the tables were bigger, you
can only imagine how long it would take.
One problem here is that MySQL can use indexes on columns more
efficiently if they are declared as the same type and size. (For
ISAM tables, indexes may not be used at all
unless the columns are declared the same.) In this context,
VARCHAR and CHAR
are considered the same if they are declared as the same size.
tt.ActualPC is declared as
CHAR(10) and
et.EMPLOYID
is CHAR(15), so there is a length mismatch.
To fix this disparity between column lengths, use
ALTER TABLE to lengthen ActualPC
from 10 characters to 15 characters:
mysql> ALTER TABLE tt MODIFY ActualPC VARCHAR(15);
Now tt.ActualPC and
et.EMPLOYID are both
VARCHAR(15). Executing the
EXPLAIN statement again produces this
result:
table type possible_keys key key_len ref rows Extra
tt ALL AssignedPC, NULL NULL NULL 3872 Using
ClientID, where
ActualPC
do ALL PRIMARY NULL NULL NULL 2135
range checked for each record (key map: 1)
et_1 ALL PRIMARY NULL NULL NULL 74
range checked for each record (key map: 1)
et eq_ref PRIMARY PRIMARY 15 tt.ActualPC 1
This is not perfect, but is much better: The product of the
rows values is less by a factor of 74. This
version executes in a couple of seconds.
A second alteration can be made to eliminate the column length
mismatches for the tt.AssignedPC = et_1.EMPLOYID
and tt.ClientID = do.CUSTNMBR comparisons:
After that modification, EXPLAIN produces
the output shown here:
table type possible_keys key key_len ref rows Extra
et ALL PRIMARY NULL NULL NULL 74
tt ref AssignedPC, ActualPC 15 et.EMPLOYID 52 Using
ClientID, where
ActualPC
et_1 eq_ref PRIMARY PRIMARY 15 tt.AssignedPC 1
do eq_ref PRIMARY PRIMARY 15 tt.ClientID 1
At this point, the query is optimized almost as well as possible. The
remaining problem is that, by default, MySQL assumes that values in the
tt.ActualPC column are evenly distributed, and that is not the
case for the
tt table. Fortunately, it is easy to tell
MySQL to analyze the key distribution:
mysql> ANALYZE TABLE tt;
With the additional index information, the join is perfect and
EXPLAIN produces this result:
table type possible_keys key key_len ref rows Extra
tt ALL AssignedPC NULL NULL NULL 3872 Using
ClientID, where
ActualPC
et eq_ref PRIMARY PRIMARY 15 tt.ActualPC 1
et_1 eq_ref PRIMARY PRIMARY 15 tt.AssignedPC 1
do eq_ref PRIMARY PRIMARY 15 tt.ClientID 1
Note that the rows column in the output
from
EXPLAIN is an educated guess from the MySQL
join optimizer. You should check whether the numbers are even close to
the truth by comparing the rows
product with the actual number of rows that the query returns. If the
numbers are quite different, you might get better performance by using
STRAIGHT_JOIN in your
SELECT statement and trying to list the
tables in a different order in the FROM
clause.
7.2.2. Estimating Query Performance
In most cases, you can estimate query performance by counting disk
seeks. For small tables, you can usually find a row in one disk seek
(because the index is probably cached). For bigger tables, you can
estimate that, using B-tree indexes, you need this many seeks to find a
row:
log(row_count)
/ log(index_block_length / 3 ×
2 / (index_length +
data_pointer_length)) + 1.
In MySQL, an index block is usually 1,024 bytes and the data pointer is
usually four bytes. For a 500,000-row table with an index length of
three bytes (the size of
MEDIUMINT), the formula indicates
log(500,000)/log(1024/3×2/(3+4)) + 1 =
4 seeks.
This index would require storage of about 500,000 × 7 × 3/2 = 5.2MB
(assuming a typical index buffer fill ratio of 2/3), so you probably
have much of the index in memory and so need only one or two calls to
read data to find the row.
For writes, however, you need four seek requests to find where to place
a new index value and normally two seeks to update the index and write
the row.
Note that the preceding discussion does not mean that your application
performance slowly degenerates by log N. As long as everything is cached
by the OS or the MySQL server, things become only marginally slower as
the table gets bigger. After the data gets too big to be cached, things
start to go much slower until your applications are only bound by
disk-seeks (which increase by log N).
To avoid this, increase the key cache size as the data grows. For
MyISAM tables, the key cache size is
controlled by the key_buffer_size system
variable. See
Section 7.5.2, “Tuning Server Parameters”.
7.2.3. Speed of SELECT Queries
In general, when you want to make a slow SELECT
... WHERE query faster, the first thing to check is whether you
can add an index. All references between different tables should usually
be done with indexes. You can use the
EXPLAIN statement to determine which
indexes are used for a SELECT. See
Section 7.2.1, “Optimizing Queries with EXPLAIN”,
and
Section 7.4.5, “How MySQL Uses Indexes”.
Some general tips for speeding up queries on
MyISAM tables:
To help MySQL better optimize queries, use
ANALYZE TABLE or run myisamchk
--analyze on a table after it has been loaded with
data. This updates a value for each index part that indicates the
average number of rows that have the same value. (For unique
indexes, this is always 1.) MySQL uses this to decide which index to
choose when you join two tables based on a non-constant expression.
You can check the result from the table analysis by using
SHOW INDEX FROM tbl_name
and examining the Cardinality value.
myisamchk --description --verbose
shows index distribution information.
To sort an index and data according to an index, use
myisamchk --sort-index
--sort-records=1
(assuming that you want to sort on index 1). This is a good way to
make queries faster if you have a unique index from which you want
to read all rows in order according to the index. The first time you
sort a large table this way, it may take a long time.
7.2.4. WHERE Clause Optimization
This section discusses optimizations that can be made for processing
WHERE clauses. The examples use
SELECT statements, but the same
optimizations apply for WHERE clauses in
DELETE and UPDATE
statements.
Work on the MySQL optimizer is ongoing, so this section is incomplete.
MySQL performs a great many optimizations, not all of which are
documented here.
Some of the optimizations performed by MySQL follow:
Removal of unnecessary parentheses:
((a AND b) AND c OR (((a AND b) AND (c AND d))))
-> (a AND b AND c) OR (a AND b AND c AND d)
Constant folding:
(a<b AND b=c) AND a=5
-> b>5 AND b=c AND a=5
Constant condition removal (needed because of constant folding):
(B>=5 AND B=5) OR (B=6 AND 5=5) OR (B=7 AND 5=6)
-> B=5 OR B=6
Constant expressions used by indexes are evaluated only once.
COUNT(*) on a single table without a
WHERE is retrieved directly from the
table information for MyISAM and
MEMORY (HASH)
tables. This is also done for any NOT NULL
expression when used with only one table.
Early detection of invalid constant expressions. MySQL quickly
detects that some SELECT
statements are impossible and returns no rows.
HAVING is merged with
WHERE if you do not use
GROUP BY or aggregate functions (COUNT(),
MIN(), and so on).
For each table in a join, a simpler WHERE
is constructed to get a fast WHERE
evaluation for the table and also to skip rows as soon as possible.
All constant tables are read first before any other tables in the
query. A constant table is any of the following:
An empty table or a table with one row.
A table that is used with a WHERE
clause on a PRIMARY KEY or a
UNIQUE index, where all index parts
are compared to constant expressions and are defined as
NOT NULL.
All of the following tables are used as constant tables:
SELECT * FROM t WHERE primary_key=1;
SELECT * FROM t1,t2
WHERE t1.primary_key=1 AND t2.primary_key=t1.id;
The best join combination for joining the tables is found by trying
all possibilities. If all columns in ORDER BY
and GROUP BY clauses come from the same
table, that table is preferred first when joining.
If there is an ORDER BY clause and a
different GROUP BY clause, or if the
ORDER BY or GROUP
BY
contains columns from tables other than the first table in the join
queue, a temporary table is created.
If you use the SQL_SMALL_RESULT option,
MySQL uses an in-memory temporary table.
Each table index is queried, and the best index is used unless the
optimizer believes that it is more efficient to use a table scan. At
one time, a scan was used based on whether the best index spanned
more than 30% of the table, but a fixed percentage no longer
determines the choice between using an index or a scan. The
optimizer now is more complex and bases its estimate on additional
factors such as table size, number of rows, and I/O block size.
In some cases, MySQL can read rows from the index without even
consulting the data file. If all columns used from the index are
numeric, only the index tree is used to resolve the query.
Before each row is output, those that do not match the
HAVING clause are skipped.
Some examples of queries that are very fast:
SELECT COUNT(*) FROM tbl_name;
SELECT MIN(key_part1),MAX(key_part1) FROM tbl_name;
SELECT MAX(key_part2) FROM tbl_name
WHERE key_part1=constant;
SELECT ... FROM tbl_name
ORDER BY key_part1,key_part2,... LIMIT 10;
SELECT ... FROM tbl_name
ORDER BY key_part1 DESC, key_part2 DESC, ... LIMIT 10;
MySQL resolves the following queries using only the index tree, assuming
that the indexed columns are numeric:
SELECT key_part1,key_part2 FROM tbl_name WHERE key_part1=val;
SELECT COUNT(*) FROM tbl_name
WHERE key_part1=val1 AND key_part2=val2;
SELECT key_part2 FROM tbl_name GROUP BY key_part1;
The following queries use indexing to retrieve the rows in sorted order
without a separate sorting pass:
SELECT ... FROM tbl_name
ORDER BY key_part1,key_part2,... ;
SELECT ... FROM tbl_name
ORDER BY key_part1 DESC, key_part2 DESC, ... ;
The range access method uses a single index
to retrieve a subset of table rows that are contained within one or
several index value intervals. It can be used for a single-part or
multiple-part index. The following sections give a detailed description
of how intervals are extracted from the
WHERE clause.
7.2.5.1. The Range Access Method for Single-Part Indexes
For a single-part index, index value intervals can be conveniently
represented by corresponding conditions in the
WHERE clause, so we speak of
range conditions rather than “intervals.”
The definition of a range condition for a single-part index is as
follows:
For both BTREE and
HASH indexes, comparison of a key
part with a constant value is a range condition when using the
=, <=>,
IN, IS NULL,
or
IS NOT NULL operators.
For BTREE indexes, comparison of a
key part with a constant value is a range condition when using
the >, <,
>=, <=,
BETWEEN, !=,
or
<> operators, or
LIKE 'pattern'
(where
'pattern'
does not start with a wildcard).
For all types of indexes, multiple range conditions combined with
OR or
AND form a range condition.
“Constant value” in the preceding
descriptions means one of the following:
A constant from the query string
A column of a const or
system table from the same join
The result of an uncorrelated subquery
Any expression composed entirely from subexpressions of the
preceding types
Here are some examples of queries with range conditions in the
WHERE clause:
SELECT * FROM t1
WHERE key_col > 1
AND key_col < 10;
SELECT * FROM t1
WHERE key_col = 1
OR key_col IN (15,18,20);
SELECT * FROM t1
WHERE key_col LIKE 'ab%'
OR key_col BETWEEN 'bar' AND 'foo';
Note that some non-constant values may be converted to constants
during the constant propagation phase.
MySQL tries to extract range conditions from the
WHERE clause for each of the possible
indexes. During the extraction process, conditions that cannot be
used for constructing the range condition are dropped, conditions
that produce overlapping ranges are combined, and conditions that
produce empty ranges are removed.
Consider the following statement, where
key1 is an indexed column and
nonkey is not indexed:
SELECT * FROM t1 WHERE
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR key1 LIKE '%b')) OR
(key1 < 'bar' AND nonkey = 4) OR
(key1 < 'uux' AND key1 > 'z');
The extraction process for key key1 is as
follows:
Start with original WHERE clause:
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR key1 LIKE '%b')) OR
(key1 < 'bar' AND nonkey = 4) OR
(key1 < 'uux' AND key1 > 'z')
Remove nonkey = 4 and
key1 LIKE '%b' because they cannot be used for a range
scan. The correct way to remove them is to replace them with
TRUE, so that we do not miss any matching rows when doing
the range scan. Having replaced them with
TRUE, we get:
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR TRUE)) OR
(key1 < 'bar' AND TRUE) OR
(key1 < 'uux' AND key1 > 'z')
Collapse conditions that are always true or false:
(key1 LIKE 'abcde%' OR TRUE) is
always true
(key1 < 'uux' AND key1 > 'z')
is always false
Replacing these conditions with constants, we get:
(key1 < 'abc' AND TRUE) OR (key1 < 'bar' AND TRUE) OR (FALSE)
Removing unnecessary TRUE and
FALSE constants, we obtain:
(key1 < 'abc') OR (key1 < 'bar')
Combining overlapping intervals into one yields the final
condition to be used for the range scan:
(key1 < 'bar')
In general (and as demonstrated by the preceding example), the
condition used for a range scan is less restrictive than the
WHERE clause. MySQL performs an
additional check to filter out rows that satisfy the range condition
but not the full WHERE clause.
The range condition extraction algorithm can handle nested
AND/OR
constructs of arbitrary depth, and its output does not depend on the
order in which conditions appear in WHERE
clause.
7.2.5.2. The Range Access Method for Multiple-Part Indexes
Range conditions on a multiple-part index are an extension of range
conditions for a single-part index. A range condition on a
multiple-part index restricts index rows to lie within one or
several key tuple intervals. Key tuple intervals are defined over a
set of key tuples, using ordering from the index.
For example, consider a multiple-part index defined as
key1(key_part1,
key_part2,
key_part3), and the
following set of key tuples listed in key order:
The interval covers the 4th, 5th, and 6th tuples in the preceding data
set and can be used by the range access method.
By contrast, the condition
key_part3
= 'abc' does not define a single interval and cannot be used
by the range access method.
The following descriptions indicate how range conditions work for
multiple-part indexes in greater detail.
For HASH indexes, each interval
containing identical values can be used. This means that the
interval can be produced only for conditions in the following
form:
key_part1cmpconst1
AND key_part2cmpconst2
AND ...
AND key_partNcmpconstN;
Here, const1,
const2, … are constants,
cmp is one of the
=, <=>,
or
IS NULL comparison operators, and the
conditions cover all index parts. (That is, there are
N conditions, one for
each part of an N-part
index.) For example, the following is a range condition for a
three-part HASH index:
key_part1 = 1 AND key_part2 IS NULL AND key_part3 = 'foo'
For a BTREE index, an interval might
be usable for conditions combined with
AND, where each condition compares a
key part with a constant value using =,
<=>, IS NULL,
>, <,
>=, <=,
!=, <>,
BETWEEN, or
LIKE 'pattern'
(where
'pattern'
does not start with a wildcard). An interval can be used as long
as it is possible to determine a single key tuple containing all
rows that match the condition (or two intervals if
<> or
!= is used). For example, for this
condition:
key_part1 = 'foo' AND key_part2 >= 10 AND key_part3 > 10
It is possible that the created interval contains more rows than
the initial condition. For example, the preceding interval
includes the value ('foo', 11, 0),
which does not satisfy the original condition.
If conditions that cover sets of rows contained within intervals
are combined with OR, they form a
condition that covers a set of rows contained within the union
of their intervals. If the conditions are combined with
AND, they form a condition that covers a set of rows
contained within the intersection of their intervals. For
example, for this condition on a two-part index:
(key_part1 = 1 AND key_part2 < 2) OR (key_part1 > 5)
In this example, the interval on the first line uses one key part
for the left bound and two key parts for the right bound. The
interval on the second line uses only one key part. The
key_len column in the
EXPLAIN output indicates the maximum
length of the key prefix used.
In some cases, key_len may indicate
that a key part was used, but that might be not what you would
expect. Suppose that
key_part1 and
key_part2 can be
NULL. Then the
key_len column displays two key part
lengths for the following condition:
MySQL can perform the same optimization on
col_name
IS NULL
that it can use for col_name=
constant_value. For example, MySQL can use indexes and
ranges to search for
NULL with IS NULL.
Examples:
SELECT * FROM tbl_name WHERE key_col IS NULL;
SELECT * FROM tbl_name WHERE key_col <=> NULL;
SELECT * FROM tbl_name
WHERE key_col=const1 OR key_col=const2 OR key_col IS NULL;
If a WHERE clause includes a
col_name
IS NULL
condition for a column that is declared as NOT
NULL, that expression is optimized away. This optimization does
not occur in cases when the column might produce
NULL anyway; for example, if it comes from a table on the right
side of a LEFT JOIN.
MySQL 4.1.1 and up can also optimize the combination
col_name
=
expr AND
col_name IS NULL, a
form that is common in resolved subqueries.
EXPLAIN shows
ref_or_null
when this optimization is used.
This optimization can handle one IS NULL
for any key part.
Some examples of queries that are optimized, assuming that there is an
index on columns a and
b of table t2:
SELECT * FROM t1 WHERE t1.a=expr OR t1.a IS NULL;
SELECT * FROM t1, t2 WHERE t1.a=t2.a OR t2.a IS NULL;
SELECT * FROM t1, t2
WHERE (t1.a=t2.a OR t2.a IS NULL) AND t2.b=t1.b;
SELECT * FROM t1, t2
WHERE t1.a=t2.a AND (t2.b=t1.b OR t2.b IS NULL);
SELECT * FROM t1, t2
WHERE (t1.a=t2.a AND t2.a IS NULL AND ...)
OR (t1.a=t2.a AND t2.a IS NULL AND ...);
ref_or_null works by first doing a read on
the reference key, and then a separate search for rows with a
NULL key value.
Note that the optimization can handle only one IS
NULL level. In the following query, MySQL uses key lookups only
on the expression (t1.a=t2.a AND t2.a IS NULL)
and is not able to use the key part on
b:
SELECT * FROM t1, t2
WHERE (t1.a=t2.a AND t2.a IS NULL)
OR (t1.b=t2.b AND t2.b IS NULL);
7.2.7. DISTINCT Optimization
DISTINCT combined with
ORDER BY needs a temporary table in many cases.
In most cases, a DISTINCT clause can be
considered as a special case of GROUP BY.
For example, the following two queries are equivalent:
SELECT DISTINCT c1, c2, c3 FROM t1 WHERE c1 > const;
SELECT c1, c2, c3 FROM t1 WHERE c1 > const GROUP BY c1, c2, c3;
Due to this equivalence, the optimizations applicable to
GROUP BY queries can be also applied to
queries with a DISTINCT clause. Thus, for
more details on the optimization possibilities for
DISTINCT queries, see
Section 7.2.10, “GROUP BY Optimization”.
When combining LIMIT
row_count with
DISTINCT, MySQL stops as soon as it finds
row_count unique rows.
If you do not use columns from all tables named in a query, MySQL stops
scanning any unused tables as soon as it finds the first match. In the
following case, assuming that
t1 is used before t2
(which you can check with EXPLAIN), MySQL
stops reading from t2 (for any particular
row in t1) when it finds the first row in
t2:
SELECT DISTINCT t1.a FROM t1, t2 where t1.a=t2.a;
7.2.8. LEFT JOIN and
RIGHT JOIN Optimization
MySQL implements a A
LEFT JOIN B join_condition
as follows:
Table B is set to depend
on table
A and all tables on which
A depends.
Table A is set to depend
on all tables (except B)
that are used in the LEFT JOIN
condition.
The LEFT JOIN condition is used to
decide how to retrieve rows from table
B. (In other words, any
condition in the WHERE clause is not
used.)
All standard join optimizations are performed, with the exception
that a table is always read after all tables on which it depends. If
there is a circular dependence, MySQL issues an error.
All standard WHERE optimizations are
performed.
If there is a row in A
that matches the WHERE clause, but
there is no row in B that
matches the
ON condition, an extra
B row is generated with
all columns set to NULL.
If you use LEFT JOIN to find rows that
do not exist in some table and you have the following test:
col_name
IS NULL in the WHERE part, where
col_name is a column that
is declared as NOT NULL, MySQL stops
searching for more rows (for a particular key combination) after it
has found one row that matches the LEFT JOIN
condition.
The implementation of RIGHT JOIN is
analogous to that of LEFT JOIN with the
roles of the tables reversed.
The join optimizer calculates the order in which tables should be
joined. The table read order forced by LEFT JOIN
or STRAIGHT_JOIN helps the join optimizer
do its work much more quickly, because there are fewer table
permutations to check. Note that this means that if you do a query of
the following type, MySQL does a full scan on
b because the LEFT
JOIN
forces it to be read before d:
SELECT *
FROM a JOIN b LEFT JOIN c ON (c.key=a.key) LEFT JOIN d ON (d.key=a.key)
WHERE b.key=d.key;
The fix in this case is reverse the order in which
a and b are
listed in the
FROM clause:
SELECT *
FROM b JOIN a LEFT JOIN c ON (c.key=a.key) LEFT JOIN d ON (d.key=a.key)
WHERE b.key=d.key;
Starting from 4.0.14, for a LEFT JOIN , if
the WHERE condition is always false for the
generated NULL row, the
LEFT JOIN is changed to a normal join. For example, the
WHERE clause would be false in the
following query if t2.column1 were
NULL:
SELECT * FROM t1 LEFT JOIN t2 ON (column1) WHERE t2.column2=5;
Therefore, it is safe to convert the query to a normal join:
SELECT * FROM t1, t2 WHERE t2.column2=5 AND t1.column1=t2.column1;
This can be made faster because MySQL can use table
t2 before table t1
if doing so would result in a better query plan. To force a specific
table order, use STRAIGHT_JOIN.
7.2.9. ORDER BY Optimization
In some cases, MySQL can use an index to satisfy an
ORDER BY clause without doing any extra
sorting.
The index can also be used even if the ORDER BY
does not match the index exactly, as long as all of the unused portions
of the index and all the extra
ORDER BY columns are constants in the
WHERE clause. The following queries use the
index to resolve the ORDER BY part:
SELECT * FROM t1
ORDER BY key_part1,key_part2,... ;
SELECT * FROM t1
WHERE key_part1=constant
ORDER BY key_part2;
SELECT * FROM t1
ORDER BY key_part1 DESC, key_part2 DESC;
SELECT * FROM t1
WHERE key_part1=1
ORDER BY key_part1 DESC, key_part2 DESC;
In some cases, MySQL cannot use
indexes to resolve the ORDER BY, although
it still uses indexes to find the rows that match the
WHERE
clause. These cases include the following:
You use ORDER BY on different keys:
SELECT * FROM t1 ORDER BY key1, key2;
You use ORDER BY on non-consecutive
parts of a key:
SELECT * FROM t1 WHERE key2=constant ORDER BY key_part2;
You mix ASC and
DESC:
SELECT * FROM t1 ORDER BY key_part1 DESC, key_part2 ASC;
The key used to fetch the rows is not the same as the one used in
the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
You are joining many tables, and the columns in the
ORDER BY are not all from the first
non-constant table that is used to retrieve rows. (This is the first
table in the EXPLAIN output that does
not have a const join type.)
You have different ORDER BY and
GROUP BY expressions.
The type of table index used does not store rows in order. For
example, this is true for a HASH
index in a MEMORY table.
With EXPLAIN SELECT ... ORDER BY, you can
check whether MySQL can use indexes to resolve the query. It cannot if
you see Using filesort in the
Extra column. See
Section 7.2.1, “Optimizing Queries with EXPLAIN”.
Prior to MySQL 4.1, in those cases where MySQL must sort the result, it
uses the following filesort
algorithm:
Read all rows according to key or by table scanning. Rows that do
not match the WHERE clause are skipped.
For each row, store a pair of values in a buffer (the sort key and
the row pointer). The size of the buffer is the value of the
sort_buffer_size system variable.
When the buffer gets full, run a qsort (quicksort) on it and store
the result in a temporary file. Save a pointer to the sorted block.
(If all pairs fit into the sort buffer, no temporary file is
created.)
Repeat the preceding steps until all rows have been read.
Do a multi-merge of up to MERGEBUFF (7)
regions to one block in another temporary file. Repeat until all
blocks from the first file are in the second file.
Repeat the following until there are fewer than
MERGEBUFF2 (15) blocks left.
On the last multi-merge, only the pointer to the row (the last part
of the sort key) is written to a result file.
Read the rows in sorted order by using the row pointers in the
result file. To optimize this, we read in a big block of row
pointers, sort them, and use them to read the rows in sorted order
into a row buffer. The size of the buffer is the value of the
read_rnd_buffer_size
system variable. The code for this step is in the
sql/records.cc source file.
One problem with this approach is that it reads rows twice: One time
when evaluating the WHERE clause, and again
after sorting the pair values. And even if the rows were accessed
successively the first time (for example, if a table scan is done), the
second time they are accessed randomly. (The sort keys are ordered, but
the row positions are not.)
In MySQL 4.1 and up, a filesort
optimization is used that records not only the sort key value and row
position, but also the columns required for the query. This avoids
reading the rows twice. The modified
filesort algorithm works like this:
Read the rows that match the WHERE
clause.
For each row, record a tuple of values consisting of the sort key
value and row position, and also the columns required for the query.
Sort the tuples by sort key value
Retrieve the rows in sorted order, but read the required columns
directly from the sorted tuples rather than by accessing the table a
second time.
Using the modified filesort algorithm, the
tuples are longer than the pairs used in the original method, and fewer
of them fit in the sort buffer (the size of which is given by
sort_buffer_size). As a result, it is possible for the extra I/O
to make the modified approach slower, not faster. To avoid a slowdown,
the optimization is used only if the total size of the extra columns in
the sort tuple does not exceed the value of the
max_length_for_sort_data system variable.
(A symptom of setting the value of this variable too high is that you
should see high disk activity and low CPU activity.)
If you want to increase ORDER BY speed,
check whether you can get MySQL to use indexes rather than an extra
sorting phase. If this is not possible, you can try the following
strategies:
Increase the size of the sort_buffer_size
variable.
Increase the size of the
read_rnd_buffer_size variable.
Change tmpdir to point to a dedicated
filesystem with large amounts of empty space. If you use MySQL 4.1
or later, this option accepts several paths that are used in
round-robin fashion. Paths should be separated by colon characters
(‘:’) on Unix and semicolon characters
(‘;’) on Windows, NetWare, and OS/2.
You can use this feature to spread the load across several
directories. Note: The paths
should be for directories in filesystems that are located on
different physical disks, not
different partitions on the same disk.
By default, MySQL sorts all GROUP BY
col1,
col2, ... queries as if
you specified ORDER BY
col1,
col2, ... in the query
as well. If you include an ORDER BY clause
explicitly that contains the same column list, MySQL optimizes it away
without any speed penalty, although the sorting still occurs. If a query
includes GROUP BY but you want to avoid the
overhead of sorting the result, you can suppress sorting by specifying
ORDER BY NULL. For example:
INSERT INTO foo
SELECT a, COUNT(*) FROM bar GROUP BY a ORDER BY NULL;
The most general way to satisfy a GROUP BY
clause is to scan the whole table and create a new temporary table where
all rows from each group are consecutive, and then use this temporary
table to discover groups and apply aggregate functions (if any). In some
cases, MySQL is able to do much better than that and to avoid creation
of temporary tables by using index access.
The most important preconditions for using indexes for
GROUP BY are that all
GROUP BY columns reference attributes from the same index, and
that the index stores its keys in order (for example, this is a
BTREE index, and not a
HASH index). Whether use of temporary
tables can be replaced by index access also depends on which parts of an
index are used in a query, the conditions specified for these parts, and
the selected aggregate functions.
There are two ways to execute a GROUP BY
query via index access, as detailed in the following sections. In the
first method, the grouping operation is applied together with all range
predicates (if any). The second method first performs a range scan, and
then groups the resulting tuples.
7.2.10.1. Tight index scan
A tight index scan may be either a full index scan or a range index
scan, depending on the query conditions.