Hi, and welcome to my space on the Internet ...
-
More on "histogram diff", and a working program
This is a follow-on to my earlier post explaining the details of how histogram diff works, including a textual description of the algorithm and pseudocode.
The original version of histogram diff was implemented
jgit
by the late Shawn Pearce in 2010, with the comment that “HistogramDiff is an alternative implementation of patience diff, performing a search over all matching locations and picking the longest common subsequence that has the lowest occurrence count. If there are unique common elements, its behavior is identical to that of patience diff.”I have written a stand-alone diff implementation that I believe exactly duplicates the behavior of histogram diff in
jgit
, and usually creates an identical (and otherwise very close) output to whatgit diff --histogram
produces. You can find my code here. -
How "histogram diff" actually works
I’ve been interested in file comparison (diff) algorithms for quite a while. I’d assumed for some time that the classical approach, which uses a longest-common-subsequence (LCS) algorithm, is the best way to do this. The original Unix diff used an algorithm by James W. Hunt and M.D. (Doug) McIlroy. After Eugene Myers published his O(ND) algorithm in 1986, it was adopted, with adaptation and variations, as the most common algorithm for file comparison, in gnu diff and many other places. For a long time, the Myers algorithm seemed to be the best solution for diff and diff-like programs.
In 2007, BitTorrent creator Bram Cohen published on his blog a brief description of a non-LCS-based diff algorithm he called “patience diff”. This approach paired up lines unique to each file being compared, followed by a pass to retain only the longest sequence of such pairs that appear in order in both files. (The pairing of unique lines was originally published by Paul Heckel in 1978, though he did not attempt to remove “crossing” pairs, but retained them as moves between files.) What happened in patience diff after the initial pairing of unique lines in both files was somewhat fluid.
The patience diff algorithm can be slow, and it’s not clear to me what it can do if there are few or no unique matching lines. I believe it is generally implemented with a “fallback” method (usually Myers LCS?) to use if no matching unique lines are found, or to use between the matching unique lines.
In 2010, the
jgit
project added the HistogramDiff class.This was claimed to be “An extended form of Bram Cohen’s patience diff algorithm.” It seems to make pretty good diffs pretty fast and made me question the primacy of the LCS approach. I searched the WWWeb in vain for a clear and detailed explanation of how it actually works. The description in the document linked just above is pretty unclear. I wanted to know enough about it to implement it myself. Finally I examined the Java source in
jgit
and analyzed what it actually does.Here is a text and pseudocode desciption.
Further discussion and a working histogram diff program are in a later post.
-
wak - An awk implementation released
A production release
I posted over a year ago that an awk implementation called wak was “coming soon?”
It’s finally in good enough shape I feel OK saying it’s here. Of course it’s not finished; nothing like that is ever quite finished. Brian Kernighan maintained his original awk up until last year (with over 200 fixes over nearly 40 years!), and it’s still being maintained. I do hope to keep improving wak.
The code is in https://github.com/raygard/wak and a writeup on its development is at https://www.raygard.net/awkdoc/.
-
Setting up Windows and Linux dual boot on my laptop
Why dual boot Windows and Linux
I’ve been using Windows for a long time, but want to get serious about learning Linux. I’ve used it under WSL (Windows Subsystem for Linux) but want the full desktop experience. So I decided to set up Linux Mint Cinnamon on my laptop alongside Windows 10.
So just download the Mint .iso image, put it on a thumb drive, boot from thumb drive, install, right?
-
wak - An awk implementation
Introduction
coming soon?