A countless number of command line programs have been written since the invention of computers. Most of them parse command line options and spew out errors and usage messages if a user enters an unknown option or simply makes a typo. Examples are not far to seek. Make a typo trying to commit your changes into SVN repository and Subversion would bail out with errors:
1 2 3
There are command line tools that are standing out among others, impress with their intelligence and deliver unforgettable user experience.
One of them is Git. If it encounters an unknown option, it can analyze it, guess what user meant and suggest correct options:
1 2 3 4 5
To see this kind of feature in a command line tool is a pleasant surprise and could catch user’s attention. But if we take one step back and try to reckon programs with similar features, there will be a lot. Any decent word processors would have spell-checking feature that not only can detect mistakes but also suggest a correct spelling. Proofreading tools. Search engines often guess what we are looking for as well.
So what kind of sorcery is this?
All of the similar algorithms are based on string metric algorithms that measure similarity or dissimilarity between two strings for approximate matching or comparison and in fuzzy string searching. The similarity or dissimilarity between two strings of text is commonly known as string distance.
There are a number of different string metrics algorithms. The most widely known string metric is the Levenshtein Distance, also known as Edit Distance.
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other. One could apply this algorithm to determine which one of the possible words is the most similar by comparing a number of character edits it requires to transform one word into another.
How Git Does This
The Git source code can be downloaded or viewed online at https://github.com/git/git. All of the command line options handling is implemented in help.c file, and the suggestion is being made by
help_unknown_cmd() function on line 293.
After quickly grasping through the code, it turns out that Git is using a Damerau–Levenshtein distance algorithm to calculate similarities between an unknown command and a list of supported commands. The implementation of this algorithm consists of two small files - levenshtein.h and levenshtein.c.
Use It In Your Program
To see how hard it would be to implement something similar, we have decided to borrow the algorithm implementation and use it to implement a simple command line tool in C++.
The basic idea is to accept command line options from a user and function that corresponds to a command name. In case the command name is unknown, use Damerau-Levenshtein algorithm and compare the unknown command to a list of supported ones. The algorithm returns an integer number that is a string distance. The bigger it is, less similar two strings are. Therefore, after comparing all of the strings, we simply store a result in
std::map and take the first element as the most likely command user meant to type, but didn’t quite get it right.
Since we are more focused on C++ rather than C, the first is to transform levenshtein.c into a C++ code (alternatively, a mixing of C and C++ code could have been used) and get rid of Git-specific
xmalloc() function. Here are the changes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
And a simple command line tool that uses Git’s
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
Compile two files into a binary using Clang:
The tool supports only three commands — “help”, “version”, “verbose” and “say”. Let’s say user only types “ver”, in that case it correctly suggest two possible alternatives:
1 2 3 4 5
It doesn’t lose face when handling “he” and “s” as well:
1 2 3 4 5 6 7 8
Of course, it is possible to tweak and tune this program further - do more error checking, play with different cost weights, do less or more aggressive guessing and so one.
The main point is – it isn’t all that difficult to make your command line option parsing more intelligent and user friendly by using this or similar algorithms. A great user experience matters a lot.
As always, thanks for reading!