Duplicate lines are one of the most common problems when working with plain text, and in most cases we want to get rid of them. Here I present a few simple solutions to the problem.
Throughout this article, let's assume that in_file is the input file containing repeated lines and out_file is the expected result file.
Method 1 - Using awk
awk is a programming language in itself, created for the sole purpose of text processing. The shell command below is all you need to eliminate duplicates from the input file.
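One classic one-liner that does exactly this, keeping only the first occurrence of each line and preserving the original order, is:

    # print a line only the first time it is seen
    awk '!seen[$0]++' in_file > out_file

The expression !seen[$0]++ is true only on a line's first appearance, and awk's default action for a true pattern is to print the line, so every later duplicate is silently skipped.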
Method 2 - Core Utils
Every Linux/GNU distribution comes bundled with coreutils, which is fully capable of accomplishing this task. One word of caution: this method removes duplicates and sorts the file, the latter of which may not be desired.
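A minimal sketch using sort with its -u (unique) flag:

    # sort the file and drop duplicate lines in one step
    sort -u in_file > out_file

An equivalent pipeline is sort in_file | uniq > out_file; uniq only collapses adjacent duplicates, which is why the sort is needed, and also why the output ends up sorted as cautioned above.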
Method 3 - Using sed
sed - stream editor for filtering and transforming text. That's what the man page says. Doing this in sed is a little more complicated, since it treats its input as a stream rather than as lines. Here is how it is done with sed; a link to the source is below.
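One well-known recipe, taken from the widely circulated sed one-liners collection, deletes duplicate non-consecutive lines while preserving the original order:

    # keep every line seen so far in the hold space; delete a line already present there
    sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' in_file > out_file

Because the hold space accumulates the whole file, large inputs can overflow it on some sed implementations; GNU sed is safe. Note also that the [ -~] class only covers printable ASCII, so lines containing other characters slip through undeduplicated.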
Method 4 - perl
perl is the scripting language that popularized regular expressions. It, too, was created for text processing, and it is extremely powerful at what it does. There is a one-liner for almost any complex text manipulation; safe to say, it has retained its niche. Here is one way to get it done in perl. Of course, there are more than a dozen ways to do it in perl.
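A minimal perl counterpart to the awk approach, using a hash to remember which lines have already been printed:

    # -n wraps the code in a line-reading loop; print only first occurrences
    perl -ne 'print unless $seen{$_}++' in_file > out_file

The %seen hash plays the same role as awk's seen array, and the original line order is preserved.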
Conclusion
I have presented a handful of methods to eliminate duplicate lines. There is always more than one way to solve a problem.