Working with VCF files using tabix

tabix indexes files in formats common in NGS analysis to speed up access. It supports not only VCF files but also BED, GFF, SAM and other formats.

The installation process is as follows:

tar xjvf tabix-0.2.6.tar.bz2
cd tabix-0.2.6/
make

Download the source code, unpack it, and compile it. After a successful build, the directory contains two executables, tabix and bgzip.

Because the number of SNP sites is huge, the corresponding VCF file is also very large. The most common way to save storage space is compression. bgzip can compress VCF files; the usage is as follows:

bgzip view.vcf

After compression, the original view.vcf file is replaced by view.vcf.gz, with the suffix .gz. To decompress it, either of the following two commands works:

bgzip -d view.vcf.gz
gunzip view.vcf.gz

The bgzip format is compatible with gzip: a bgzip-compressed file is a series of concatenated gzip blocks. So besides bgzip itself, gunzip can also decompress bgzip-compressed files.

Note that the compatibility only goes one way. tabix can only index files that were compressed with bgzip, so gzip cannot be used in place of bgzip when compressing VCF files.
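To tell the two kinds of .gz file apart, one option is to look at the header bytes. The following is a minimal sketch, not part of tabix: the function name is_bgzf is made up for illustration, and it assumes the 'BC' extra subfield comes first in the header, which is how bgzip writes it.

```shell
# Return success if the file starts with a BGZF block header.
# is_bgzf is a hypothetical helper, not a tabix/bgzip command.
is_bgzf() {
  # Bytes 0-3: gzip magic (1f 8b), deflate method (08), FEXTRA flag (04).
  magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
  # Bytes 12-13: the 'BC' subfield identifier used by BGZF
  # (assumed to be the first extra subfield).
  tag=$(od -An -c -j12 -N2 "$1" 2>/dev/null | tr -d ' \n')
  [ "$magic" = "1f8b0804" ] && [ "$tag" = "BC" ]
}
```

For example, `is_bgzf view.vcf.gz && echo "bgzip" || echo "not bgzip"` reports which tool most likely produced the file. A file written by plain gzip normally fails the check because its header does not set the extra-field flag.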

For large VCF files, quickly accessing individual records is also a problem. tabix can index the VCF file, and once the index is built, access is much faster. To index a VCF file with tabix:

tabix -p vcf view.vcf.gz

Note that the input must be a bgzip-compressed VCF file. The generated index file is view.vcf.gz.tbi, with the suffix .tbi.
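The compress, index, and query steps can be tried end to end on a small file. The sketch below assumes bgzip and tabix are on PATH (it skips the run otherwise); the file name toy.vcf and its two records are invented for illustration. It uses bgzip -c, which writes to standard output so the uncompressed file is kept.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny tab-separated VCF with made-up records, sorted by position.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '11\t2343540\t.\tA\tG\t.\t.\t.\n'
  printf '11\t2343596\t.\tC\tT\t.\t.\t.\n'
} > toy.vcf

if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c toy.vcf > toy.vcf.gz &&     # compress, keeping toy.vcf
    tabix -p vcf toy.vcf.gz &&         # writes toy.vcf.gz.tbi
    tabix toy.vcf.gz 11                # print the records on chromosome 11
else
  echo "bgzip/tabix not found on PATH; skipping" >&2
fi
```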

After the index is built, the records in a specified region can be retrieved quickly. Examples:

1. Obtain the SNP sites located on chromosome 11

tabix view.vcf.gz 11

2. Obtain the SNP sites on chromosome 11 whose position is greater than or equal to 2343545

tabix view.vcf.gz 11:2343545

3. Obtain the SNP sites on chromosome 11 with positions from 2343540 to 2343596

tabix view.vcf.gz 11:2343540-2343596
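For comparison, the same kind of selection can be written as a linear scan over an uncompressed VCF, which makes the region semantics explicit (1-based, both ends inclusive). This is only a sketch: scan_region is a made-up helper name, and it matches on the POS column alone, so it agrees with tabix for single-base records such as SNPs but not necessarily for longer variants. Unlike the indexed query, it reads the whole file every time.

```shell
# scan_region FILE CHROM START END
# Print VCF records on CHROM with START <= POS <= END (hypothetical helper).
scan_region() {
  awk -F '\t' -v c="$2" -v s="$3" -v e="$4" '
    /^#/ { next }                          # skip header lines
    ($1 "") == (c "") &&                   # chromosome, compared as a string
    ($2 + 0) >= (s + 0) &&                 # position, compared as a number
    ($2 + 0) <= (e + 0)
  ' "$1"
}
```

With an uncompressed copy of the file, `scan_region view.vcf 11 2343540 2343596` selects the same SNP records as example 3.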

Many programs that work with VCF files recognize the index built by tabix and use it to speed up processing. The VCF files of many large-scale projects are also distributed compressed with bgzip and indexed with tabix.

