Problems of grep in Emacs

By Xah Lee.
xah talk show 2022-01-12 why unix grep sucks, history of regex, WolframLang StringExpression

This page describes the problems of calling unix grep in emacs, and why an emacs lisp version is more flexible and superior.

Unix grep util is quite useful. In emacs, it's even better, because emacs has commands that act as wrappers to unix grep, with the advantage that the output is colored and file names are linked. [see Emacs: Search Text in Directory]

However, calling unix grep inside emacs has some problems, whether directly by the shell-command command, or indirectly by emacs lisp wrapper commands {grep, rgrep, lgrep, grep-find, etc}.

External Program Problem

On Windows, calling an external unix util is a major problem, because by default there is no unix grep installed. This makes a large part of critical emacs features unusable.

The user has to install Cygwin or something similar. Then emacs goes thru several layers (Cygwin, Windows). Unicode, or text that contains quote chars or escapes, almost always gets screwed up along the way. [see Installing Cygwin Tutorial]

Unix Shell Quote Escape Problem

Today, i want to grep with this regex height="[0-9]+" /> for HTML image tags.

On my Microsoft Windows machine with Cygwin installed, when calling emacs's grep command, i gave this to the prompt:

grep -ie -nH 'height\=\"[0\-9]\+\" \/\>' *html

It doesn't work. I tried many variations: with the backslash in different places, double backslash, single/double quotes. Sometimes the error is about Cygwin detecting DOS style slash. Sometimes it silently creates a file of 0 length named ' in your directory, due to your bad escapes.
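For comparison, here is a minimal sketch of the same search done by a script instead of a shell command. It is in Python, and the names grep_files and IMG_HEIGHT are made up for illustration. It assumes the files are UTF-8 and sit directly in one directory. The point is that the regex is written once, as a plain string, with no shell quoting layer between you and the regex engine.

```python
import re
from pathlib import Path

# The regex from the text, as a raw string. Nothing here is escaped for a shell.
IMG_HEIGHT = re.compile(r'height="[0-9]+" />')


def grep_files(pattern, directory, glob="*html"):
    """Return (file name, line number, line) for every line that matches pattern."""
    hits = []
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        # Assumption: files are UTF-8. Undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, lineno, line))
    return hits
```

Calling grep_files(IMG_HEIGHT, ".") returns the matches for the current directory, which can then be printed in the familiar file:line:text form.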

Dependence on Familiarity of Unix Shell Syntax

Unix utils syntax is incomprehensible to non-unix users, but emacs depends on it for basic features such as searching text in a dir. Users not familiar with the cryptic shell syntax won't be able to use it. For example, emacs grep command prompts this: grep -nH -e ▮.

emacs grep prompt 2020-12-30

Also, to list files in emacs dired, the only command for that is find-dired. [see Linux: Walk Dir: find, xargs] This depends on familiarity with unix find/xargs commands. For example, it prompts “Run find (with args):”, where the user is supposed to type something like -name "*html". [see Emacs: Inconsistency of Search Features]

Regex Not Compatible to Emacs Regex

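Grep's regex syntax is different from emacs's regex [see Emacs: Regular Expression], such as the one you use with Alt+x query-replace-regexp. For example, in emacs regex a plain + means “one or more”, but in grep's default basic regex a plain + is a literal plus sign.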

Unicode String Problem

When using unix grep on MS Windows for processing Unicode text, there are many encoding problems. On Windows with Cygwin, the char encoding in the stream gets messed up thru the various layers.

For example, grep fails when searching for │ (U+2502). This is calling Cygwin grep from emacs on Windows. It's too complex to figure out exactly why it fails.

With Unicode, you have to deal with unix environment variable “locale”, emacs's own various encoding settings, MS Windows's locale and “codepage” setup. There's complex interplay of environment variables among {emacs, emacs's inferior shells, Cygwin, Windows}. [see Emacs in Microsoft Windows FAQ]

Even on Linux terminal, shell tools have issues with Unicode. See: Linux Shell Util uniq Unicode Bug
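When the search runs inside a single process, the encoding is stated once, at the place where the file is read, and the search string never passes thru a shell, a locale setting, or a codepage. Here is a minimal sketch in Python. It assumes the files are UTF-8, and the names MARKER and files_containing are made up for illustration.

```python
from pathlib import Path

MARKER = "\u2502"  # BOX DRAWINGS LIGHT VERTICAL, written by code point


def files_containing(char, directory, glob="*html", encoding="utf-8"):
    """Return sorted names of files in directory whose decoded text contains char."""
    found = []
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        # The encoding is given explicitly here. No environment variable decides it.
        text = path.read_text(encoding=encoding, errors="replace")
        if char in text:
            found.append(path.name)
    return found
```

If some files use another encoding, that is a change to the single encoding argument, not to several layers of environment setup.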

Problem with Long Search String

Sometimes you want to search for a string that's part of source code. It may be long, containing 300 chars or more. (For example, a snippet of HTML that contains JavaScript and spans multiple lines.) You could put your search string in a file and have grep read it using --file=filename, but this is not convenient. Here's an example of a string i need to do a literal search on:

<div class="chtk"><script>ch_client="polyglut";ch_width=550;ch_height=90;ch_type="mpu";ch_sid="Chitika Default";ch_backfill=1;ch_color_site_link="#00C";ch_color_title="#00C";ch_color_border="#FFF";ch_color_text="#000";ch_color_bg="#FFF";</script><script src="http://scripts.chitika.net/eminimalls/amm.js"></script></div>

Shell-escaping this string would be very inconvenient, and it's more complex when the shell is used inside emacs.
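In a script, a long search string can be pasted as is and matched literally, including across line breaks. Here is a minimal sketch in Python. The names NEEDLE and find_literal are made up for illustration, NEEDLE is only the first part of the string above, and the files are assumed to be UTF-8.

```python
from pathlib import Path

# Pasted verbatim inside a raw triple-quoted string: no escaping of " or ; needed.
NEEDLE = r'''<div class="chtk"><script>ch_client="polyglut";ch_width=550;'''


def find_literal(needle, directory, glob="*html"):
    """Return (file name, occurrence count) for each file containing needle verbatim."""
    results = []
    for path in sorted(Path(directory).glob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        count = text.count(needle)  # plain substring count, no regex involved
        if count:
            results.append((path.name, count))
    return results
```

Because the whole file is searched as one string, a needle that contains newlines works the same way as a one-line needle.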

Grep Not Flexible for Specifying Files in Directories

grep is not very flexible for working with all files in a directory. There's the -r option, but then you can't specify a file pattern (for example, *html). You have to do it like this: grep -r 'xyz' --include='*html' dirname [see Linux: Text Processing: grep, cat, awk, uniq]

Sometimes you need to work on a list of files, sometimes by a pattern, sometimes you want to exclude some files by list or by pattern, sometimes only the first 2 levels, or a combination of the above in a specific order. Some unix tools provide these options, sometimes by a combination of tools (for example, find/xargs), but their order and syntax are complex and tool specific. With a script in Perl, Python, or elisp, it's much easier to control.
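As a rough illustration of that kind of control, here is a minimal sketch in Python that combines an include pattern, exclude patterns, and a depth limit in one function. The function name select_files and its parameters are made up for illustration. Here max_depth counts directory levels, so max_depth=2 means the given directory plus its immediate subdirectories.

```python
import fnmatch
import os
from pathlib import Path


def select_files(root, include="*html", exclude=(), max_depth=None):
    """Return sorted paths (relative to root) of files whose name matches include,
    skipping names that match any pattern in exclude."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)
        level = len(rel_dir.parts) + 1  # root itself is level 1
        if max_depth is not None and level >= max_depth:
            dirnames[:] = []  # tell os.walk not to descend any further
        for name in filenames:
            if not fnmatch.fnmatch(name, include):
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue
            selected.append((rel_dir / name).as_posix())
    return sorted(selected)
```

For example, select_files(".", "*html", exclude=("draft*",), max_depth=2) lists html files in the first 2 levels, except those whose name starts with draft. The result can be fed to whatever search or replace step comes next.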

Too Many Incompatible Versions of Grep

There are too many versions and varieties of grep. The primary 2 are BSD vs GNU. Mac OS X comes with BSD versions, but some utils are GNU versions. Linuxes typically come with GNU versions. The different versions accept different options. Also, GNU grep supports a variety of confusing regex flavors (“--basic-regexp”, “--extended-regexp”, “--perl-regexp”). It's too painful to figure them out and remember their details.

grep is Not Powerful Enough for Nested Syntax (For example, HTML/XML)

Unix grep and associated tools (sort, wc, uniq, pipe, sed, awk, etc.) are not flexible when your need is slightly more complex. For example, suppose i need to find all occurrences of HTML “img” tag that are not wrapped by a <div> tag. This is impossible with unix tools. (Extending the limit of unix tools is how Perl was born in 1987.)
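With a scripting language, the “img not wrapped by a div” case becomes a matter of tracking nesting depth while walking the tags. Here is a minimal sketch in Python using the standard library's html.parser module. The class and function names are made up for illustration, and it assumes the <div> tags in the input are balanced.

```python
from html.parser import HTMLParser


class BareImgFinder(HTMLParser):
    """Record the line number of each <img> tag that has no enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # how many <div> elements are currently open
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "div":
            self.div_depth += 1
        elif tag == "img" and self.div_depth == 0:
            self.lines.append(self.getpos()[0])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def bare_img_lines(html_text):
    """Return line numbers (1-based) of <img> tags that are not inside any <div>."""
    finder = BareImgFinder()
    finder.feed(html_text)
    finder.close()
    return finder.lines
```

This treats an img as wrapped if a div is open anywhere above it, not only as the direct parent. Checking only the direct parent would need a stack of open tags instead of a counter.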

Example of a Real World Problem Using Grep Inside Emacs

Here's a concrete example of a grep problem.

In my vocabulary page Wordy English — the Making of Belles-Lettres, i use the Unicode BOX DRAWINGS LIGHT VERTICAL “│” as a temp marker for processing the word list. Today i need to grep pages containing that character.

Calling Meta+x grep in emacs with grep -inH -e "│" *html returns an error:

-*- mode: grep; default-directory: "c:/Users/xah/web/xahlee_org/emacs/" -*-
Grep started at Tue Apr 05 15:37:47

grep "│" *html
warning: extra args ignored after 'grep "│\'

Grep finished with no matches found at Tue Apr 05 15:37:47

Starting shell in emacs (which runs Microsoft cmd.exe in Windows Vista) doesn't work either. (It works fine when grepping an ASCII string.) Here's a session log:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

c:\Users\xah\web\xahlee_org\emacs>grep "│" *html
grep "â\224? *html

It got stuck there. Ctrl+c Ctrl+c doesn't get out. I had to kill the buffer.

Calling msys-shell works. (msys-shell is bundled with ErgoEmacs. It calls bash in MinGW, which is a subset of Cygwin port.) Here's a log:

sh-3.2$ grep "│" *html
antonymous_synonyms.html:<li> cry, decry │ you can cry, as in crying out loud, but you can also decry, by crying out loud </li>
antonymous_synonyms.html:<li> linear, rectilinear │ linear algebra, rectilinear motion. Rectilinear is the linearness of motion.</li>
…

Calling it in Cygwin Bash running inside Windows Console also works.

So, this means the problem isn't grep not understanding Unicode. Something went wrong when emacs talks to Cygwin. Though, what exactly is the problem? Well, i'm not about to spend a few hours to find out.

In PowerShell, it also works. For example, with this command: select-string -path *.html -pattern "│". However, calling PowerShell thru emacs does not work.

Here's my system setup:

O, Complexity and Tedium of Software Engineering.

Addendum: Adding the option -P also worked. For example: call emacs “grep” command, then give grep -inH -e -P "│" *html. Thanks to “blandest” (gnu.emacs.help).

Emacs Lisp Solves All Problems

Here's pure emacs lisp for grep/sed: Emacs: Xah Find Replace (xah-find.el)

This page started when i wrote a grep in emacs lisp: ELisp: Write grep, and people are asking why.

Emacs Modernization