Vlad Lazarenko

... making all this up as I go along

Smart Command Line Options Parsing

A countless number of command line programs have been written since the invention of computers. Most of them parse command line options and spew out errors and usage messages if a user enters an unknown option or simply makes a typo. Examples are not far to seek. Make a typo trying to commit your changes into SVN repository and Subversion would bail out with errors:

1
2
3
$ svn commir
Unknown command: 'commir'
Type 'svn help' for usage.

There are command line tools that are standing out among others, impress with their intelligence and deliver unforgettable user experience.

One of them is Git. If it encounters an unknown option, it can analyze it, guess what user meant and suggest correct options:

1
2
3
4
5
$ git commir
git: 'commir' is not a git command. See 'git --help'.

Did you mean this?
  commit

To see this kind of feature in a command line tool is a pleasant surprise and could catch user’s attention. But if we take one step back and try to reckon programs with similar features, there will be a lot. Any decent word processors would have spell-checking feature that not only can detect mistakes but also suggest a correct spelling. Proofreading tools. Search engines often guess what we are looking for as well.

So what kind of sorcery is this?

String Metrics

All of the similar algorithms are based on string metric algorithms that measure similarity or dissimilarity between two strings for approximate matching or comparison and in fuzzy string searching. The similarity or dissimilarity between two strings of text is commonly known as string distance.

There are a number of different string metrics algorithms. The most widely known string metric is the Levenshtein Distance, also known as Edit Distance.

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. One could apply this algorithm to determine which one of the possible words is the most similar by comparing a number of character edits it requires to transform one word into another.

How Git Does This

Since Git is free and open source, we can check how it does the command line argument suggestion, and what algorithm is being used for that.

The Git source code can be downloaded or viewed online at https://github.com/git/git. All of the command line options handling is implemented in help.c file, and the suggestion is being made by help_unknown_cmd() function on line 293.

After quickly grasping through the code, it turns out that Git is using a Damerau–Levenshtein distance algorithm to calculate similarities between an unknown command and a list of supported commands. The implementation of this algorithm consists of two small files - levenshtein.h and levenshtein.c.

Use It In Your Program

To see how hard it would be to implement something similar, we have decided to borrow the algorithm implementation and use it to implement a simple command line tool in C++.

The basic idea is to accept command line options from a user and function that corresponds to a command name. In case the command name is unknown, use Damerau-Levenshtein algorithm and compare the unknown command to a list of supported ones. The algorithm returns an integer number that is a string distance. The bigger it is, less similar two strings are. Therefore, after comparing all of the strings, we simply store a result in std::map and take the first element as the most likely command user meant to type, but didn’t quite get it right.

Since we are more focused on C++ rather than C, the first is to transform levenshtein.c into a C++ code (alternatively, a mixing of C and C++ code could have been used) and get rid of Git-specific xmalloc() function. Here are the changes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
--- ../git/levenshtein.c    2013-12-03 19:09:52.000000000 -0500
+++ levenshtein.cc   2013-12-03 19:49:40.000000000 -0500
@@ -1,5 +1,6 @@
-#include "cache.h"
-#include "levenshtein.h"
+#include "levenshtein.hh"
+#include <cstdlib> // for malloc and free
+#include <cstring> // for strlen

 /*
  * This function implements the Damerau-Levenshtein algorithm to
@@ -42,9 +43,9 @@
      int w, int s, int a, int d)
 {
  int len1 = strlen(string1), len2 = strlen(string2);
-    int *row0 = xmalloc(sizeof(int) * (len2 + 1));
-    int *row1 = xmalloc(sizeof(int) * (len2 + 1));
-    int *row2 = xmalloc(sizeof(int) * (len2 + 1));
+    int *row0 = (int*)malloc(sizeof(int) * (len2 + 1));
+    int *row1 = (int*)malloc(sizeof(int) * (len2 + 1));
+    int *row2 = (int*)malloc(sizeof(int) * (len2 + 1));
  int i, j;

  for (j = 0; j <= len2; j++)

And a simple command line tool that uses Git’s levenshtein() function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
//
// An example of using Damerau-Levenshtein algorithm
// to suggest a correct command if user enters something
// that we don't know how to handle.
//

#include <cstdlib>
#include <string>
#include <map>
#include <unordered_map>
#include <iostream>
#include "levenshtein.h"

// Define command handler function type.
static void on_help() {
    std::cout << "Sure thing! What can I do for you?\n";
}

static void on_version() {
    std::cout << "I am not versioned, sorry.\n";
}

static void on_verbose() {
    std::cout << "If you need verbose, go read Twitter!\n";
}

static void on_say() {
    std::cout << "Oops, my speech speed synthesizer is broken :(\n";
}

// This function does not need an introduction.
int main(int argc, const char *argv[]) {
    // Check that we have at least one argument.
    if (argc < 2) {
        std::cerr << "Please enter a command\n";
        return EXIT_FAILURE;
    }

    // Deduce command from command line.
    std::string cmd{argv[1]};

    // Create a list of commands.
    const std::unordered_map<
        std::string,
        std::function<void()>
    > cmd_map{
        { "help", &on_help },
        { "say", &on_say },
        { "version", &on_version },
        { "verbose", &on_verbose }
    };

    // Find a command.
    auto cmd_it = cmd_map.find(cmd);
    if (cmd_it != cmd_map.end()) {
        // Command was found. Run the handler and exit.
        (cmd_it->second)();
        return EXIT_SUCCESS;
    }

    // If we get there, the command is unknown.
    // First, tell it to the user.
    std::cerr << "Unknown command: " << cmd << '\n';

    // Use Damerau-Levenshtein algorithm to get similarity
    // score with all of the commands we support, then pick
    // the best few to suggest to the user.
    std::multimap<
        int /* Levenshtein distance, less is better. */,
        std::string /* Supported command */
    > dist;

    for (auto& p : cmd_map) {
        dist.emplace(
            levenshtein(
                cmd.c_str(), // Command entered by the user.
                p.first.c_str(), // One of the supported commands.
                0, // swap cost
                2, // substitution cost
                1, // insertion cost
                3  // deletion cost
            ),
            p.first
        );
    }

    // Take the first best guess (lowest distance = better match).
    // Suggest a list of commands with the same Levenshtein distance.
    // This can be tweaked to show more options, etc..
    if (!dist.empty()) {
        auto best_score = dist.begin()->first;
        std::cerr << "Did you mean one of these?\n";
        for (auto& p : dist) {
            if (p.first != best_score) {
                // For more option, one could use something like
                // `p.first - best_score > N` where N is max.
                // allowed difference.
                break;
            }
            std::cout << '\t' << p.second << '\n';
        }
    }

    return EXIT_FAILURE;
}

Compile two files into a binary using Clang:

1
clang++ -Wall -Wextra -pedantic -std=c++11 -o test ./smart_help.cc

The tool supports only three commands — “help”, “version”, “verbose” and “say”. Let’s say user only types “ver”, in that case it correctly suggest two possible alternatives:

1
2
3
4
5
$ ./test ver
Unknown command: ver
Did you mean one of these?
  verbose
  version

It doesn’t lose face when handling “he” and “s” as well:

1
2
3
4
5
6
7
8
$ ./test he
Unknown command: he
Did you mean one of these?
  help
$ ./test s
Unknown command: s
Did you mean one of these?
  say

Of course, it is possible to tweak and tune this program further - do more error checking, play with different cost weights, do less or more aggressive guessing and so one.

The main point is – it isn’t all that difficult to make your command line option parsing more intelligent and user friendly by using this or similar algorithms. A great user experience matters a lot.

Maybe one day this would become a standard feature of option parsing libraries like Boost’s Program Options, Python’s argparse and even GNU getopt.

As always, thanks for reading!