Gzip ( .gz ) is a widely used, open-source algorithm and file format developed in 1992 by Jean-loup Gailly and Mark Adler to replace proprietary compression tools. It is the standard for web compression and is frequently used to shrink large, text-heavy files, such as CSVs, to save storage space and increase transfer speeds.
The algorithm scans data to find repeating patterns. Instead of storing the repeated data twice, it replaces subsequent occurrences with a pointer (a pair of numbers: distance and length) to the initial occurrence. cskvdhdgzip
No data is lost; decompressing restores the exact original file. Gzip (
After LZ77 identifies the repetitions, Huffman coding assigns shorter binary codes to frequently appearing characters or patterns and longer codes to rarer ones. Instead of storing the repeated data twice, it
import pandas as pd # Write DataFrame to Gzip CSV df.to_csv("data.csv.gz", index=False, compression="gzip") # Read Gzip CSV df = pd.read_csv("data.csv.gz", compression="gzip") Use code with caution. Copied to clipboard Gzip vs. Other Formats How Gzip Compression Works
import gzip import shutil # Compress with open('data.csv', 'rb') as f_in: with gzip.open('data.csv.gz', 'wb') as f_out: shutil.copyfileobj(f_in, f_out) # Decompress with gzip.open('data.csv.gz', 'rb') as f_in: with open('data_restored.csv', 'wb') as f_out: shutil.copyfileobj(f_in, f_out) Use code with caution. Copied to clipboard Working with Pandas
Gzip operates using a combination of two primary algorithms, often referred to as :