
Linux Archive and Compression Commands

Introduction

You know that feeling when your hard drive is screaming for space, or you need to send a bunch of files to a colleague? That’s where Linux archive and compression commands become your best friends. I’ve been working with these tools for years, and honestly, they’ve saved me more times than I can count.

In this guide, we’ll explore the essential archiving and compression tools that every Linux user should know. Whether you’re backing up your home directory, sharing project files, or just trying to understand what that .tar.gz file actually is, I’ve got you covered.

Understanding the Basics: Archiving vs Compression

Before we dive into the commands, let’s clear up something that confuses a lot of people. Archiving and compression are two different things, though they often work together:

Archiving combines multiple files into one single file. Think of it like putting a bunch of loose papers into a folder. The papers don’t get smaller, but they’re easier to carry around.

Compression actually makes files smaller by removing redundant information. It’s like vacuum-sealing your clothes before packing them.

Here’s the interesting part: you can do either one independently, but most of the time, we do both. First, we archive multiple files into one, then we compress that archive. That’s why you see file extensions like .tar.gz (archived with tar, then compressed with gzip).

Note: Strictly speaking, undoing compression is called decompressing, while pulling files back out of an archive is called unarchiving (or extracting). With a .tar.gz you end up doing both.
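
To make the distinction concrete, here is a minimal sketch (the ~/Documents path is just an example) that archives and compresses in two separate steps, then does the same thing in one command:

# Step 1: archive only -- docs.tar is roughly the combined size of the originals
tar -cf docs.tar ~/Documents/

# Step 2: compress the archive -- produces docs.tar.gz
gzip docs.tar

# The same result in a single command (covered in detail in the tar section below)
tar -czf docs.tar.gz ~/Documents/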

The gzip Family: Fast and Reliable

gzip and gunzip Commands

The gzip command uses the Lempel-Ziv compression algorithm, which offers a great balance between speed and compression ratio. It’s been around forever, and that’s because it just works.

# Basic compression (WARNING: replaces the original file!)
gzip report.txt
# Result: report.txt.gz (original file is gone)

# Preserve the original file using -c option
gzip -c budget.txt > budget.txt.gz
# Result: both budget.txt and budget.txt.gz exist

# Decompress a file
gunzip report.txt.gz
# Result: report.txt (the .gz file is removed)

# Check compression statistics without decompressing
gunzip -l archive.txt.gz
# Shows: compressed size, uncompressed size, and compression ratio

Warning: By default, gzip replaces your original file with the compressed version. Always use the -c option if you want to keep the original!
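
If your gzip is version 1.6 or newer, it also supports -k (--keep), which keeps the original without redirecting output; on older versions, stick with -c:

# Keep the original alongside the compressed copy (gzip 1.6+)
gzip -k budget.txt
# Result: both budget.txt and budget.txt.gz exist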

Here’s a real-world example I use constantly:

# Compress all log files in a directory while keeping originals
for file in /var/log/myapp/*.log; do
    gzip -c "$file" > "$file.gz"
done

# This is perfect for log rotation scenarios

The zcat Command: View Without Extracting

Sometimes you just want to peek inside a compressed file without actually extracting it. That’s where zcat comes in handy:

# View a compressed file
zcat error.log.gz

# Search through compressed logs without extracting
zcat access.log.gz | grep "404"

# Combine multiple compressed files
zcat file1.gz file2.gz file3.gz > combined.txt

# View uncompressed files too (with -f flag)
zcat -f mixed_files.txt

I use zcat all the time when troubleshooting production issues. You can grep through gigabytes of compressed logs without filling up your disk with extracted files.
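
Most gzip installations also ship small wrapper scripts such as zgrep and zless that hide the zcat pipe for you; a quick sketch using the same hypothetical log names:

# Equivalent to: zcat access.log.gz | grep "404"
zgrep "404" access.log.gz

# Count matches across a whole directory of rotated logs
zgrep -c "ERROR" /var/log/myapp/*.log.gz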

The bzip2 Family: Better Compression, More CPU

bzip2 and bunzip2 Commands

The bzip2 command uses the Burrows-Wheeler block sorting algorithm. Translation: it compresses files smaller than gzip, but it takes longer to do it. If storage space is more critical than CPU time, this is your tool.

# Compress with verbose output to see compression ratio
bzip2 -v large_database.sql
# Output: large_database.sql: 3.147:1, 2.540 bits/byte, 68.22% saved

# Preserve original file (same concept as gzip)
bzip2 -cv document.txt > document.txt.bz2

# Decompress multiple files using wildcards
bunzip2 -v backups/*.bz2

# Test integrity of compressed files
bzip2 -tv archive.bz2

Real-world scenario: I once had to archive 5 years of database backups. Using bzip2 instead of gzip saved us about 20% more space, which translated to significant storage costs over time.

The bzcat Command

Just like zcat, but for bzip2 files:

# View compressed file contents
bzcat system_backup.tar.bz2 | less

# Extract specific files from a compressed archive
bzcat backup.tar.bz2 | tar -xf - specific_file.txt

The xz Family: Maximum Compression

xz and unxz Commands

The xz command is the heavyweight champion of compression ratios. It uses the LZMA2 algorithm and can squeeze files incredibly small, but it demands the most CPU power.

# Compress files (default is -z for compression)
xz -z important_data.txt
# Or simply: xz important_data.txt

# Compress with maximum compression level (1-9, where 9 is best)
xz -9 massive_file.dat

# Decompress files
xz -d compressed_file.xz
# Or use: unxz compressed_file.xz

# Keep original file during compression
xz -k source_file.txt

# Compress multiple files
xz *.log
# Note: directories are skipped automatically

The xzcat Command

View xz-compressed files without extraction:

# Display compressed file
xzcat data.txt.xz

# Pipe to other commands
xzcat archive.txt.xz | grep "ERROR" | wc -l

Compression comparison example:

# Let's compare all three methods on the same file
cp large_file.txt test1.txt
cp large_file.txt test2.txt
cp large_file.txt test3.txt

gzip test1.txt    # Fast, good compression
bzip2 test2.txt   # Slower, better compression
xz test3.txt      # Slowest, best compression

ls -lh test*
# Output shows: test1.txt.gz (45MB) > test2.txt.bz2 (38MB) > test3.txt.xz (32MB)
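
The listing above compares only output size; if speed matters too, re-creating the test copies and wrapping each tool in time makes the trade-off visible (timings will vary with your hardware and data):

# Fresh copies, then time each compressor
cp large_file.txt t1.txt && cp large_file.txt t2.txt && cp large_file.txt t3.txt
time gzip t1.txt
time bzip2 t2.txt
time xz t3.txt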

The tar Command: The Swiss Army Knife

The tar command (short for “tape archive”) is probably the most important archiving tool in Linux. It was originally designed for backing up to magnetic tapes, but today it’s used for everything.

Creating Archives

# Create a basic tar archive (no compression)
tar -cf documents_backup.tar ~/Documents/
# -c = create, -f = file (must be the last option!)

# Create with gzip compression and verbose output
tar -cvzf documents_backup.tar.gz ~/Documents/
# -v = verbose (shows files being added), -z = gzip compression

# Create with bzip2 compression
tar -cjf projects_backup.tar.bz2 ~/Projects/
# -j = bzip2 compression

# Create with xz compression
tar -cJf archives_backup.tar.xz ~/Archives/
# -J = xz compression (capital J!)

# Exclude certain files or directories
tar -czf backup.tar.gz --exclude='*.log' --exclude='temp/*' ~/myapp/

Here’s my personal backup script that I run weekly:

#!/bin/bash
# Automated backup script

DATE=$(date +%Y%m%d)
BACKUP_DIR="/backups"
SOURCE_DIR="/home/username/important_stuff"

# Create compressed archive with timestamp
tar -czf "$BACKUP_DIR/backup_$DATE.tar.gz" \
    --exclude='*.tmp' \
    --exclude='.cache' \
    "$SOURCE_DIR"

# Keep only backups from the last 30 days
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete

echo "Backup completed: backup_$DATE.tar.gz"

Listing Archive Contents

Before extracting an archive, it’s smart to see what’s inside:

# List files in archive
tar -tf backup.tar.gz

# List with detailed information (like ls -l)
tar -vtf backup.tar.gz
# Shows permissions, ownership, size, modification date

# Search for specific files
tar -tf backup.tar.gz | grep "config"

Tip: Always list the archive contents before extracting, especially if it came from an external source. This prevents the “tar bomb” scenario where files extract all over your current directory instead of into a subdirectory.
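
A quick way to spot a potential tar bomb is to look only at the top-level entries; a small sketch (the archive name is just an example):

# Show the unique top-level entries in the archive
tar -tzf backup.tar.gz | awk -F/ '{print $1}' | sort -u
# One directory name = extracts cleanly into a single folder
# Many names = it will scatter files into your current directory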

Extracting Archives

# Extract all files to current directory
tar -xf backup.tar.gz
# -x = extract

# Extract with verbose output
tar -xvf backup.tar.gz

# Extract to specific directory
tar -xf backup.tar.gz -C /tmp/restore/

# Extract only specific files
tar -xf backup.tar.gz path/to/specific/file.txt

# Extract files matching a pattern
tar -xf backup.tar.gz --wildcards '*.conf'

Advanced tar Techniques

# Create archive and compress in one pipeline
tar -czf - /home/user | ssh remote@server 'cat > /backup/home.tar.gz'
# This backs up directly to a remote server!

# Split large archives into smaller chunks
tar -czf - /large/directory | split -b 1G - backup.tar.gz.part
# Creates backup.tar.gz.partaa, backup.tar.gz.partab, etc.

# Combine and extract split archives
cat backup.tar.gz.part* | tar -xzf -

# Incremental backups using timestamps
tar -czf backup.tar.gz --newer-mtime='2025-01-01' /data/
# Only archives files modified after specified date

# Add files to existing archive (uncompressed archives only)
tar -rf existing.tar new_file.txt

The zip Family: Cross-Platform Champion

The zip command is special because it works seamlessly across Windows, macOS, and Linux. Unlike gzip and bzip2, it doesn’t replace your original files, which is actually quite convenient.

# Create a zip archive
zip my_files.zip file1.txt file2.txt file3.txt
# Extension .zip is added automatically

# Zip entire directory recursively
zip -r project.zip project_folder/
# -r = recursive (include subdirectories)

# Zip with maximum compression
zip -9 -r compressed.zip large_directory/

# Add password protection
zip -er secure.zip sensitive_data/
# -e = encrypt (will prompt for password)

# Update existing zip with new files
zip -u archive.zip updated_file.txt

# Delete files from zip archive
zip -d archive.zip unwanted_file.txt

Extracting zip files:

# Extract to current directory
unzip archive.zip

# Extract to specific directory
unzip archive.zip -d /tmp/extracted/

# List contents without extracting
unzip -l archive.zip

# Test archive integrity
unzip -t archive.zip

# Extract only specific files
unzip archive.zip "*.txt"

# Quietly extract (no output)
unzip -q archive.zip

Cross-platform sharing example:

# Create a zip file that works perfectly on Windows
zip -r shared_project.zip project/ -x "*.DS_Store" "*.git/*"
# Excludes macOS and git files that Windows users don't need

The cpio Command: The POSIX Standard

The cpio command is less common nowadays, but it’s still important because it follows the POSIX specification and works on all Unix systems. It’s particularly useful for creating backups and moving directory structures.

Copy-Out Mode (Create Archive)

# Basic archive creation
ls | cpio -ov > archive.cpio
# -o = copy-out, -v = verbose

# Archive with find command (more powerful)
find . -depth -print | cpio -ov > backup.cpio
# -depth ensures directories are processed after their contents

# Archive specific file types
find . -name "*.txt" -print | cpio -ov > text_files.cpio

# Archive with compression
find . -print | cpio -ov | gzip > archive.cpio.gz

Copy-In Mode (Extract Archive)

# Extract archive
cpio -idv < archive.cpio
# -i = input (extract), -d = create directories, -v = verbose

# Extract specific files
cpio -idv "*.conf" < archive.cpio

# List archive contents without extracting
cpio -tv < archive.cpio
# -t = list (table of contents)

Copy-Pass Mode (Direct Copy)

This mode copies files directly without creating an archive file:

# Copy directory structure to another location
find ~/source_dir | cpio -pdm /backup/destination/
# -p = pass-through mode, -d = create directories, -m = preserve modification times

# This is useful for preserving permissions and timestamps

Practical backup scenario:

# Backup system configuration files
find /etc -type f -name "*.conf" | cpio -ov | \
    bzip2 > /backup/system_configs_$(date +%F).cpio.bz2

The dd Command: Low-Level Disk Operations

The dd command (sometimes called “disk destroyer” because of its power) operates at the bit level. It’s not really a compression tool, but it’s essential for certain archiving tasks.

Danger: The dd command can destroy data instantly if used incorrectly. There’s no undo button. Always double-check your if= (input file) and of= (output file) parameters!
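
Before running any dd command that writes to a block device, I double-check which device is which; a minimal sanity check (assuming /dev/sdb is the intended target):

# Confirm device names, sizes, and mount points before writing anything
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Inspect just the target device's partition table
sudo fdisk -l /dev/sdb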

Basic dd Operations

# Create a swap file filled with zeros
dd if=/dev/zero of=/tmp/swapfile bs=1M count=500
# Creates a 500MB file
# if = input file, of = output file, bs = block size, count = number of blocks

# Clone an entire hard drive
sudo dd if=/dev/sda of=/dev/sdb bs=4M status=progress
# Copies everything from sda to sdb, including boot sector
# status=progress shows real-time progress

# Create ISO image from CD/DVD
dd if=/dev/cdrom of=~/backup.iso bs=2048
# Preserves bootable media structure

# Backup Master Boot Record (MBR)
sudo dd if=/dev/sda of=~/mbr_backup.img bs=512 count=1
# Backs up first 512 bytes (MBR + partition table)

# Restore MBR
sudo dd if=~/mbr_backup.img of=/dev/sda bs=512 count=1

Advanced dd Techniques

# Create bootable USB from ISO
sudo dd if=ubuntu.iso of=/dev/sdb bs=4M status=progress oflag=sync
# oflag=sync ensures data is written before command completes

# Securely wipe a disk (be VERY careful!)
sudo dd if=/dev/urandom of=/dev/sdb bs=4M status=progress
# Fills entire disk with random data

# Benchmark disk write speed
dd if=/dev/zero of=testfile bs=1G count=1 oflag=dsync
# Tests actual disk write performance

# Create sparse file (appears large but uses minimal space)
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=10G
# Creates 10GB file that uses almost no disk space initially

# Convert and backup with compression
sudo dd if=/dev/sda bs=4M | gzip > disk_backup.img.gz

# Monitor dd progress from another terminal
sudo watch -n 5 'kill -USR1 $(pgrep ^dd)'
# Sends signal to dd to print progress
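
To confirm that the sparse file from the listing above really is sparse, compare its apparent size with the space it actually occupies (file name matches the earlier example):

# Apparent size: reports the full 10GB
ls -lh sparse.img

# Actual disk usage: typically just a few kilobytes
du -h sparse.img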

Real-world scenario: Cloning a system disk before major upgrade:

#!/bin/bash
# System disk backup script

SOURCE="/dev/sda"
BACKUP_DIR="/mnt/backups"
DATE=$(date +%Y%m%d_%H%M%S)

echo "Creating full disk image of $SOURCE..."
sudo dd if=$SOURCE bs=4M status=progress | \
    gzip -c > "$BACKUP_DIR/system_backup_$DATE.img.gz"

if [ $? -eq 0 ]; then
    echo "Backup completed successfully!"
    echo "File: $BACKUP_DIR/system_backup_$DATE.img.gz"
else
    echo "Backup failed!"
    exit 1
fi

Combining Commands for Power Users

Here’s where things get really interesting. You can combine these tools to create powerful workflows:

# Backup database, compress, and encrypt
mysqldump -u root -p mydb | gzip | openssl enc -aes-256-cbc -out db_backup.sql.gz.enc

# Backup over SSH with compression
tar -czf - /important/data | ssh user@remote 'cat > /backup/data.tar.gz'

# Split large backup into DVD-sized chunks
tar -czf - /large/directory | split -b 4480M - backup_dvd_

# Monitor compression in real-time
tar -czf - /logs | pv | dd of=logs.tar.gz
# pv shows throughput and ETA

# Parallel compression for faster processing
tar -cf - /data | pigz -9 > data.tar.gz
# pigz is parallel gzip (uses multiple CPU cores)

# Create incremental backups
tar -czf incremental.tar.gz --listed-incremental=snapshot.file /data/
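
For completeness, restoring the encrypted database dump from the first pipeline above is the same chain run in reverse; a sketch that assumes the same file name, database name, and cipher options used to encrypt:

# Decrypt, decompress, and load the dump back into MySQL in one pipeline
openssl enc -d -aes-256-cbc -in db_backup.sql.gz.enc | gunzip | mysql -u root -p mydb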

Archive Command Comparison Table

Command | Extension | Algorithm        | Compression Ratio | Speed  | Best For
gzip    | .gz       | Lempel-Ziv       | Good              | Fast   | General purpose, logs
bzip2   | .bz2      | Burrows-Wheeler  | Better            | Medium | Long-term storage
xz      | .xz       | LZMA2            | Best              | Slow   | Maximum compression
tar     | .tar      | None (archive)   | N/A               | N/A    | Combining with compression
zip     | .zip      | DEFLATE          | Good              | Fast   | Cross-platform sharing
cpio    | .cpio     | None (archive)   | N/A               | N/A    | System backups, POSIX
dd      | N/A       | None (bit copy)  | N/A               | Fast   | Disk cloning, low-level

Automated Backup Strategy

Here’s a complete backup script I use in production:

#!/bin/bash
# comprehensive-backup.sh
# Full system backup with rotation

# Configuration
BACKUP_ROOT="/backups"
SOURCE_DIRS=("/home" "/etc" "/var/www")
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/backup.log"

# Functions
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Main backup routine
main() {
    log "Starting backup process..."
    
    for dir in "${SOURCE_DIRS[@]}"; do
        dir_name=$(basename "$dir")
        backup_file="$BACKUP_ROOT/${dir_name}_${DATE}.tar.gz"
        
        log "Backing up $dir to $backup_file"
        
        # Create compressed archive with error checking
        tar -czf "$backup_file" \
            --exclude='*.tmp' \
            --exclude='cache/*' \
            "$dir" 2>&1 | tee -a "$LOG_FILE"
        
        if [ ${PIPESTATUS[0]} -eq 0 ]; then
            # Verify archive integrity
            tar -tzf "$backup_file" > /dev/null
            if [ $? -eq 0 ]; then
                log "Successfully backed up $dir"
                
                # Calculate and log file size
                size=$(du -h "$backup_file" | cut -f1)
                log "Backup size: $size"
            else
                log "ERROR: Archive verification failed for $backup_file"
            fi
        else
            log "ERROR: Backup failed for $dir"
        fi
    done
    
    # Cleanup old backups
    log "Removing backups older than $RETENTION_DAYS days..."
    find "$BACKUP_ROOT" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
    
    log "Backup process completed"
    
    # Send summary email
    mail -s "Backup Report - $(date +%Y-%m-%d)" admin@example.com < "$LOG_FILE"
}

# Run main function
main

Best Practices and Tips

1. Always test your backups

# Don't just create backups, verify them!
tar -tzf backup.tar.gz > /dev/null && echo "Archive is valid"

2. Use appropriate compression levels

# For frequently accessed files
gzip -1 fast_compression.txt  # Fastest compression

# For archival storage
xz -9e long_term_storage.dat  # Maximum compression

3. Include metadata in backup filenames

# Good filename structure
backup_$(hostname)_$(date +%Y%m%d_%H%M%S).tar.gz
# Example: backup_webserver01_20251002_143000.tar.gz

4. Document your archives

# Create a manifest file
tar -czf backup.tar.gz /data && \
    tar -tzf backup.tar.gz > backup_manifest.txt

5. Use rsync for incremental backups

# More efficient than tar for large directory structures
rsync -avz --delete /source/ /backup/destination/

Troubleshooting Common Issues

Issue 1: “Cannot create directory: Permission denied”

Problem: Extracting archives without proper permissions.

# Solution: Extract to a location you own
tar -xzf archive.tar.gz -C ~/temp/

# Or use sudo if needed (be careful!)
sudo tar -xzf archive.tar.gz -C /opt/

Issue 2: Archive appears to have no files

Problem: Creating tar archive with absolute paths.

# Wrong way (stores absolute paths)
tar -czf backup.tar.gz /home/user/documents

# Right way (stores relative paths)
cd /home/user && tar -czf backup.tar.gz documents/

# Or use -C option
tar -czf backup.tar.gz -C /home/user documents/

Issue 3: Out of space during extraction

Problem: Archive is larger than available disk space.

# Check archive size before extracting
tar -tzf large.tar.gz | wc -l  # Count files
gunzip -l large.tar.gz          # Check uncompressed size

# Extract to different partition
tar -xzf large.tar.gz -C /mnt/external/

Issue 4: Corrupted archive

Problem: Incomplete download or disk error.

# Test archive integrity
gzip -t file.gz      # For gzip files
bzip2 -t file.bz2    # For bzip2 files
tar -tzf file.tar.gz # For tar archives

# Try to recover what you can (extraction stops at the first corrupted block)
gunzip -c corrupted.tar.gz | tar -xvf -

Conclusion

Mastering these archive and compression commands gives you serious power over your Linux system. Whether you’re backing up critical data, transferring files efficiently, or managing disk space, these tools are essential.

The key takeaways:

  • gzip for quick, everyday compression
  • bzip2 when you need better compression and don’t mind waiting
  • xz for maximum compression on long-term archives
  • tar for combining files and creating portable archives
  • zip when sharing across different operating systems
  • cpio for POSIX-compliant backups
  • dd for low-level disk operations (use with extreme caution!)

Remember: always test your backups and verify archive integrity. A backup you can’t restore is worse than no backup at all.
