nice machine with 2 gb of ram, 800 megabytes in 2 logfiles. single word as search phrase. polish utf-8 locale (pl_PL.UTF-8), gnu grep 2.5.1. results?
=> time grep -in reloading postgresql-2007-10-22_000000.log postgresql-2007-10-22_120909.log
postgresql-2007-10-22_000000.log:40001:2007-10-22 10:50:13.528 CEST @ 24681 LOG: received SIGHUP, reloading configuration files
postgresql-2007-10-22_120909.log:1215696:2007-10-22 12:15:21.769 CEST @ 24681 LOG: received SIGHUP, reloading configuration files

real 1m21.212s
user 1m20.909s
sys 0m0.284s
same check, without -i:
=> time grep -n reloading postgresql-2007-10-22_000000.log postgresql-2007-10-22_120909.log
postgresql-2007-10-22_000000.log:40001:2007-10-22 10:50:13.528 CEST @ 24681 LOG: received SIGHUP, reloading configuration files
postgresql-2007-10-22_120909.log:1215696:2007-10-22 12:15:21.769 CEST @ 24681 LOG: received SIGHUP, reloading configuration files

real 0m1.147s
user 0m0.868s
sys 0m0.268s
after setting locale to C:
=> time grep -in reloading postgresql-2007-10-22_000000.log postgresql-2007-10-22_120909.log
postgresql-2007-10-22_000000.log:40001:2007-10-22 10:50:13.528 CEST @ 24681 LOG: received SIGHUP, reloading configuration files
postgresql-2007-10-22_120909.log:1215696:2007-10-22 12:15:21.769 CEST @ 24681 LOG: received SIGHUP, reloading configuration files

real 0m1.209s
user 0m0.896s
sys 0m0.316s
all tests were repeated many times, to get all data in memory and to check for extreme values.
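for anyone who wants to try the "locale set to C" variant without touching the rest of the session: a minimal sketch, assuming a posix shell and a throwaway sample file (the file and its contents are made up for illustration, they are not the real logs). prefixing the command with LC_ALL=C changes the locale for that single grep call only.

```shell
# hypothetical sample file, standing in for the real postgresql logs
log=$(mktemp)
printf '%s\n' \
    'LOG: received SIGHUP, Reloading configuration files' \
    'LOG: checkpoint starting' > "$log"

# LC_ALL=C is set only in the environment of this one grep invocation;
# the locale of the surrounding shell stays as it was
matches=$(LC_ALL=C grep -in reloading "$log")
echo "$matches"

# clean up the sample file
rm -f "$log"
```

one caveat: in the C locale, -i folds only ascii letters, so this is a safe shortcut only when the search phrase is plain ascii, as it is here.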
does anybody need more proof that the locale “thing” is broken? of course it might be that only the locale handling in grep is bad, but anyway – it's still a locale issue.
This is a consequence of the non-trivial case folding algorithm of UTF-8 in comparison to ASCII. Not surprising.
@Markus Bertheau:
i know. i just didn’t expect it to have *that* big an influence on timing.
Just a note for anyone still reading this: this bug was fixed in GNU grep 2.7.
Marti: thanks for the info, good to know.