NGS

Resource usage to deduplicate BAM files with UMI-tools

Even with access to an HPC cluster, resources are not infinite, and our HPC reasonably prioritizes queued jobs by how much computation you have requested recently. This provides an incentive to request only as much as a task needs, but for me that hasn’t been easy: estimating resource usage is not trivial, and being as frugal as possible sometimes results in time-consuming jobs being terminated after several hours or days because too little memory or run time was requested.

Downloading multiple samples from an entry at GEO Dataset

Openly shared data are invaluable. They provide a way for others to test the reproducibility of an analysis and reduce the need for repeated screening experiments. Besides, these data are also an excellent training ground for amateurs like me.

Subsampling a fastq file with awk

What you are going to find here: a minimal introduction to the awk command on Linux and Mac. (For Mac users, installing GNU awk might be necessary; it adds some functions, such as asort() for sorting an array.)
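
To give a flavour of the idea, here is a minimal sketch of subsampling with plain awk. It assumes a well-formed FASTQ file with exactly four lines per record, and it keeps every Nth record rather than sampling at random. The function name subsample_fastq and its arguments are made up for illustration, and it avoids GNU-only features so it should also run with the awk that ships on a Mac.

```shell
# Hypothetical helper: keep 1 record out of every <step> from a FASTQ file.
# Assumes exactly 4 lines per record (header, sequence, '+', qualities).
subsample_fastq() {
    # $1 = step (keep every $1-th record, starting with the first)
    # $2 = path to the input FASTQ file
    awk -v step="$1" '
        # 0-based record index: lines 1-4 are record 0, lines 5-8 are record 1, ...
        { rec = int((NR - 1) / 4) }
        # Print all four lines of the records we keep
        rec % step == 0 { print }
    ' "$2"
}
```

For example, subsample_fastq 10 reads.fastq > subset.fastq would keep roughly a tenth of the reads (reads.fastq being a placeholder file name). Systematic sampling like this is simple and reproducible, but it is not a random sample, so it might not suit every downstream analysis.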