This container track helps call out sections of the genome that often cause problems or
confusion when working with the genome. The hg19 genome has a track with the same name, but with
many more subtracks, as the GeT-RM and Genome-in-a-Bottle artifact variants do not exist yet
for hg38, to our knowledge. If you are missing a track here that you know from
hg19 and have an idea how to add it to hg38, do not hesitate to contact us.
Problematic Regions
The Problematic Regions track contains the following subtracks:
- The UCSC Unusual Regions subtrack contains annotations collected at UCSC,
put together from other tracks, our experiences and support email list
requests over the years. For example, it contains the most well-known gene
clusters (IGH, IGL, PAR1/2, TCRA, TCRB, etc.) and annotations for the GRC
fixed sequences, alternate haplotypes, unplaced
contigs, pseudo-autosomal regions, and mitochondria. These loci can yield alignments with
low-quality mapping scores and discordant read pairs, especially for short-read sequencing data.
This data set was manually curated, based on the Genome Browser's
assembly description, the FAQs about assembly, and the
NCBI RefSeq "other" annotations
track data.
- The ENCODE Blacklist subtrack contains a comprehensive set of regions which are troublesome
for high-throughput Next-Generation Sequencing (NGS) aligners. These regions tend to have a very
high ratio of multi-mapping to unique mapping reads and high variance in mappability due to
repetitive elements such as satellite, centromeric and telomeric repeats.
- The GRC Exclusions subtrack contains a set of regions that have been flagged by the GRC as
containing false duplications or contamination sequences. The GRC has now removed these sequences from
the files that it uses to generate the reference assembly; however, removing the sequences from the
GRCh38/hg38 assembly would trigger the next major release of the human assembly. In order to
help users recognize these regions and avoid them in their analyses, the GRC has produced a masking
file to be used as a companion to GRCh38, and the BED file is available from the
GenBank FTP site.
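As an illustration of how a BED file of problematic regions (such as an exclusion mask) might be used to drop overlapping features from an analysis, here is a minimal Python sketch. The function names and the coordinates are hypothetical and not taken from any of the subtracks; the only assumption about the input is the standard BED convention of 0-based, half-open intervals in the first three columns.

```python
import bisect
from collections import defaultdict

def parse_bed(lines):
    """Parse BED lines into {chrom: sorted, merged list of (start, end)}."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, ivs in regions.items():
        ivs.sort()
        out = [ivs[0]]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:  # overlapping or touching: extend the last interval
                out[-1] = (out[-1][0], max(out[-1][1], e))
            else:
                out.append((s, e))
        merged[chrom] = out
    return merged

def overlaps(regions, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any region on chrom."""
    ivs = regions.get(chrom, [])
    i = bisect.bisect_left(ivs, (end,))  # index of first region starting at or after end
    return i > 0 and ivs[i - 1][1] > start

# Made-up mask and features, for illustration only.
mask = parse_bed([
    "chr1\t100\t200",
    "chr1\t150\t300",
    "chr2\t0\t50",
])
features = [("chr1", 50, 100), ("chr1", 299, 400), ("chr2", 60, 70)]
kept = [f for f in features if not overlaps(mask, *f)]
```

Merging the mask intervals first keeps the overlap test to a single binary search per feature. For real analyses, an established interval tool is likely a better choice than hand-written code.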
Highly Reproducible Regions
The Highly Reproducible Regions track highlights regions and variants
from eight samples that can be used to assess variant detection pipelines. The
"Highly Reproducible Regions" subtrack comprises the intersection of the reproducible
regions across all eight samples, while the "Variants" subtracks contain the reproducible
variants from each assayed sample. Both tracks contain data from the following samples:
- a Chinese Quartet, samples CQ-5, CQ-6, CQ-7, CQ-8
- a HapMap Trio, samples NA10385, NA12248, NA12249
- a Genome in a Bottle sample, NA12878
Please refer to the Pan et al. reference for more information on how
these regions were defined.
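To make the idea of an intersection of regions across samples concrete, the following Python sketch intersects several per-sample interval lists on a single chromosome. It is only an illustration of interval intersection, not the procedure used by Pan et al.; the function name and the coordinates are made up.

```python
from functools import reduce

def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical reproducible regions for three samples on one chromosome.
per_sample = [
    [(0, 100), (200, 300)],
    [(50, 250)],
    [(60, 90), (210, 400)],
]
shared = reduce(intersect, per_sample)  # regions present in every sample
```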
GIAB Problematic Regions
The Genome in a Bottle (GIAB) Problematic Regions tracks provide stratifications of the
genome to evaluate variant calls in complex regions. They are designed for use with Global Alliance
for Genomics and Health (GA4GH) benchmarking tools like
hap.py
and includes regions with low complexity, segmental duplications, functional regions,
and difficult-to-sequence areas. Developed in collaboration with GA4GH, the
Genome in a Bottle (GIAB) consortium, and the
Telomere-to-Telomere Consortium (T2T), the dataset aims to standardize the
analysis of genetic variation by offering pre-defined BED files for stratifying true and false
positives in genomic studies, facilitating accurate assessments in complex areas of the genome.
The GIAB Problematic Regions tracks are created with a pipeline and configuration that
generate stratification BED files. These files categorize genomic regions by specific challenges,
such as low complexity or difficult mapping, to facilitate accurate benchmarking of variant calls.
For more information on the pipeline and configuration used, please visit the following webpage:
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/README.md.
If you have questions or comments, please write to Justin Zook (jzook@nist.gov).
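As a rough illustration of what stratifying true and false positives means, the Python sketch below computes precision separately for each stratum. Benchmarking tools such as hap.py perform this kind of stratification themselves; this toy version only shows the idea. The stratum names, coordinates, and function names are hypothetical, and positions are assumed to be already converted to 0-based coordinates to match BED intervals.

```python
from collections import Counter

def in_regions(regions, chrom, pos):
    """True if 0-based position pos lies in any half-open (chrom, start, end) region."""
    return any(c == chrom and s <= pos < e for c, s, e in regions)

def precision_by_stratum(calls, strata):
    """Compute TP / (TP + FP) per stratum.

    calls:  iterable of (chrom, pos, label), label being 'TP' or 'FP'
    strata: {name: [(chrom, start, end), ...]}
    Returns {name: precision, or None if no calls fall in the stratum}.
    """
    result = {}
    for name, regions in strata.items():
        counts = Counter(label for chrom, pos, label in calls
                         if in_regions(regions, chrom, pos))
        total = counts["TP"] + counts["FP"]
        result[name] = counts["TP"] / total if total else None
    return result

# Hypothetical strata and labeled calls, for illustration only.
strata = {
    "low_complexity": [("chr1", 100, 200)],
    "segdup": [("chr1", 500, 600)],
    "unused": [("chr2", 0, 10)],
}
calls = [
    ("chr1", 120, "TP"),
    ("chr1", 150, "FP"),
    ("chr1", 199, "TP"),
    ("chr1", 200, "TP"),  # outside low_complexity: the end coordinate is exclusive
    ("chr1", 550, "TP"),
]
precision = precision_by_stratum(calls, strata)
```

The linear scan in `in_regions` is fine for a handful of regions; genome-scale stratification files would call for an indexed interval lookup instead.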