I got a small set of 16S rRNA sequence data from my professor.
(I said small, but it's still 1.8GB!!)
As a first step, I installed the necessary packages and ran QC with FastQC and NanoPlot.
I'm running everything on my personal laptop.
(AMD Ryzen 5 4500U … & 16+4GB RAM)
It took a long time… I don't know exactly how long: I watched a movie, and it was still running.
So I just went to bed, and it was done by morning.
What should I look for in the results, and why?
1) Length Information
- a. Mean Length
- b. Median Length
- c. N50
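("a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value." - Wikipedia)
That definition is written for assemblies. For read data like mine, read it as: 50% of all sequenced bases are in reads of this length or longer.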
We use this to decide the --min_length value for Filtlong.
Also, the 16S amplicon should be approx. 1500 bp long, so my result (all around 1500) looks good!
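
To make sure I really understood N50, I wrote a tiny sketch of how it can be computed from a list of read lengths. The function name `read_n50` and the toy lengths are my own invention for illustration; this is not how NanoPlot implements it internally, just the definition turned into code.

```python
def read_n50(lengths):
    """Return the read N50: the length L such that reads of length >= L
    contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up lengths for illustration only (not from my data).
toy_lengths = [1400, 1450, 1500, 1520, 1550, 300]
print(read_n50(toy_lengths))  # prints 1500
```

Notice that the one short 300 bp read barely moves the N50, because the statistic is weighted by bases rather than by read count.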
2) Q-Score (Read Quality)
- Q10: 90% accuracy
- Q20: 99% accuracy
- Q30: 99.9% accuracy
...
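
The numbers in this list come from the Phred scale, where Q = -10 * log10(P) and P is the probability that a base call is wrong. Here is a small sketch of the conversion in both directions; the function names are mine, chosen just for this example.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0..1)."""
    return 1 - 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)


# Print the accuracy for a few common cutoffs.
for q in (10, 15, 20, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.1%} accuracy")
```

Every 10 points on the Q scale means ten times fewer errors, which is why Q15 sits between 90% and 99%.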
My result says:
- >Q10 is only 39%
- >Q15 is 0%
So it's low-quality data → we need to make Filtlong's filtering more generous (by setting --keep_percent).
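
To see what a ">Q10" percentage means, here is a rough sketch that computes a mean quality per read and then the fraction of reads above a cutoff. I am assuming the usual FASTQ encoding (ASCII offset 33) and that the per-read mean is taken over error probabilities rather than over raw Q-scores, which is my understanding of how NanoPlot does it. The function names and the toy quality strings are made up for illustration.

```python
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred score of one read, averaged in error-probability space.

    Assumes Phred+33 encoding (an assumption, but the standard for FASTQ).
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def fraction_above(quality_strings, threshold):
    """Fraction of reads whose mean quality is above the threshold."""
    passing = sum(1 for qs in quality_strings if mean_read_quality(qs) > threshold)
    return passing / len(quality_strings)


# Toy quality strings (not my data): '5' = Q20, '+' = Q10, '&' = Q5.
toy_reads = ["5555", "++++", "&&&&", "5+&&"]
print(fraction_above(toy_reads, 12))  # prints 0.25
```

Averaging error probabilities means a few bad bases pull a read's mean quality down a lot, so the last toy read ends up below Q10 even though it contains a Q20 base.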
In the next article, I’ll proceed to Filtlong with the data I got today!
