Duplicate lines are one of the most common problems when working with plain text, and in most cases we want to get rid of them. Here I present a few simple solutions to the problem.
Throughout this article, let's assume that in_file is the input file containing repeated lines and out_file is the expected result file.
Method 1 - Using awk
awk is a programming language in itself, created for the sole purpose of text processing. The shell command below is all you need to eliminate duplicates from the input file.
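One classic one-liner that does exactly this, keeping only the first occurrence of each line and preserving the original order, is:

    # print a line only the first time it is seen
    awk '!seen[$0]++' in_file > out_file

The expression !seen[$0]++ is true only on a line's first appearance, and awk's default action for a true pattern is to print the line, so every later duplicate is silently skipped.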
Method 2 - Core Utils
Every Linux/GNU distribution comes bundled with coreutils, which is fully capable of accomplishing this task. One word of caution: this method removes duplicates and sorts the file, the latter of which may not be desired.
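A minimal sketch using sort with its -u (unique) flag:

    # sort the file and drop duplicate lines in one step
    sort -u in_file > out_file

An equivalent pipeline is sort in_file | uniq > out_file; uniq only collapses adjacent duplicates, which is why the sort is needed, and also why the output ends up sorted as cautioned above.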
Method 3 - Using sed
sed - stream editor for filtering and transforming text. That's what the man page says. Doing this in sed is a little more complicated, since it treats its input as a stream rather than as lines. Here is how it is done with sed; a link to the source is below.
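One well-known recipe, taken from the widely circulated sed one-liners collection, deletes duplicate non-consecutive lines while preserving the original order:

    # keep every line seen so far in the hold space; delete a line already present there
    sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' in_file > out_file

Because the hold space accumulates the whole file, large inputs can overflow it on some sed implementations; GNU sed is safe. Note also that the [ -~] class only covers printable ASCII, so lines containing other characters slip through undeduplicated.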
Method 4 - perl
perl is the scripting language that popularized regular expressions. It, too, was created for text processing, and it is extremely powerful at what it does. There is a one-liner for almost any complex text manipulation; safe to say, it has retained its niche. Here is one way to get it done in perl. Of course, there are more than a dozen ways to do it in perl.
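A minimal perl counterpart to the awk approach, using a hash to remember which lines have already been printed:

    # -n wraps the code in a line-reading loop; print only first occurrences
    perl -ne 'print unless $seen{$_}++' in_file > out_file

The %seen hash plays the same role as awk's seen array, and the original line order is preserved.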
Conclusion
I have presented a handful of methods to eliminate duplicate lines. There is always more than one way to solve a problem.