HPC:kClust

kClust

kClust is a program for fast and sensitive clustering of large protein sequence databases. kClust v1.0 is installed on all HPC nodes.

Usage

kClust can be loaded as a module:

[asrini@node062 ~]$ module show kClust-1.0
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/kClust-1.0:

module-whatis	 kClust: fast and sensitive clustering of large protein sequence databases. This version is compiled against our version of GCC and our architecture.
prepend-path	 PATH /opt/software/kClust/1.0/bin
-------------------------------------------------------------------

[asrini@node062 ~]$ module load kClust-1.0

[asrini@node062 ~]$ which kClust
/opt/software/kClust/1.0/bin/kClust
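
For scripted use, the check above can be wrapped in a small shell function. This is a minimal sketch and not part of kClust or the module: the function name run_kclust is hypothetical, and creating the output directory beforehand is a precaution rather than a documented requirement. It uses only the two required kClust arguments, -i and -d.

```shell
#!/bin/sh
# Hypothetical helper: run kClust only if the module has put it on PATH.
run_kclust() {
    fasta=$1    # input sequence database in FASTA format (-i)
    outdir=$2   # directory for temporary and result files (-d)

    if [ -z "$fasta" ] || [ -z "$outdir" ]; then
        echo "usage: run_kclust <fasta-db-file> <directory>" >&2
        return 2
    fi
    # Fail early with a hint if the module has not been loaded.
    if ! command -v kClust >/dev/null 2>&1; then
        echo "kClust not found on PATH; try: module load kClust-1.0" >&2
        return 127
    fi
    # Assumption: create the result directory up front as a precaution.
    mkdir -p "$outdir" || return 1
    kClust -i "$fasta" -d "$outdir"
}
```

Example call (the file and directory names are placeholders): run_kclust proteins.fasta kclust_out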

[asrini@node062 ~]$ kClust --help
Usage: ./kClust -i [fasta-db-file] -d [directory] [options]

Version 1.0

kClust is a clustering program for protein sequences.
Written by Christian Mayer (christian.eberhard.mayer@googlemail.com) and Maria Hauser (mhauser@genzentrum.lmu.de)

Required arguments:
 -i                   [fasta-db-file]     : Sequence database in fasta format or directory with the output of the previous kClust run if -P option is set.
 -d                   [directory]         : Directory for temporary and result files.

Optional arguments:
 -M                   [megabytes]         : Memory limit for clustering (default=3500MB).
 -P                                       : Cluster profiles computed from existing alignment files (default=false).
 -sc                                      : Use sequence background frequency score correction for the k-mer scores (default=false).
 -td                  [directory]         : Directory for temporary files (default=WORKING_DIR/tmp)
 -s                   [float]             : Clustering threshold (score per column) (default=1.12 half bits ~ 30% sequence identity). Set to zero for the clustering based only on the e-value of the hit.
 -e                   [float]             : Clustering E-value threshold (default=1.0e-4).
 -c                   [float]             : Alignment coverage of the longer sequence (default=0.8).
 --merge-ncbi-headers                     : Compress NCBI headers in representatives database, creating a merged header instead of the representative sequence header.
 --merge-uniprot-headers                     : Compress Uniprot headers in representatives database, creating a merged header instead of the representative sequence header.
 --write-time-benchmark                     : Write time benchmark files, containing sequences which consume the most computation time (default=false).

Expert arguments:
 --filter-k           [integer]           : Length of k-mers for similarity scoring filter (default=6).
 --filter-T           [float]             : Similarity threshold for filter k-mer generation (default=4.3 half bits).
 --filter-t           [float]             : k-mer score threshold for prefiltering (default=0.55 half bits).
 --kdp-k              [integer]           : Length of k-mers for kDP alignments (default=4).
 --kdp-T              [float]             : Similarity threshold for kDP k-mer generation (default=2.9 half bits).
 --kdp-G              [float]             : Gap open penalty (default=12.0 half bits).
 --kdp-E              [float]             : Gap extension penalty (default=2.0 half bits).
 --kdp-F              [float]             : Intra-diagonal gap penalty (default=0.27 half bits).
 --kdp-delta          [integer]           : Width of delta window (default=50).

Sequence identity ~ score per column (see -s option):
20%   30%   40%   50%   60%   70%   80%   90%   99%
0.52  1.12  1.73  2.33  2.93  3.53  4.14  4.74  5.28
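
The identity table at the end of the help output can be turned into a small lookup when choosing a value for -s. The following is a sketch under stated assumptions: the function name kclust_score_for_identity and the file names are hypothetical, and only the identities tabulated in the help text are supported (no interpolation between them is attempted).

```shell
#!/bin/sh
# Hypothetical helper: map a target sequence identity (percent) to the
# score-per-column value listed in the kClust help text for the -s option.
kclust_score_for_identity() {
    case "$1" in
        20) echo 0.52 ;;
        30) echo 1.12 ;;
        40) echo 1.73 ;;
        50) echo 2.33 ;;
        60) echo 2.93 ;;
        70) echo 3.53 ;;
        80) echo 4.14 ;;
        90) echo 4.74 ;;
        99) echo 5.28 ;;
        *)  echo "unsupported identity: $1 (use a value from the table)" >&2
            return 1 ;;
    esac
}

# Build (but do not run) a command line for clustering at about 50% identity.
# proteins.fasta and kclust_out are placeholder names.
score=$(kclust_score_for_identity 50)
echo "kClust -i proteins.fasta -d kclust_out -s $score"
```

Printing the command first makes it easy to check the threshold before submitting a long clustering run. Remove the echo and the quotes around the command to execute it once the module is loaded.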
