The BED (Browser Extensible Data) format is a simple and flexible way to represent genomic regions. It is a line-based, tab-delimited text format designed for annotating genomic data.
How can the BED file be used in a study?
Besides being a good format for storing different kinds of annotations about a given region, BED can be used for very specific tasks. In genomic studies, for example, a BED file delimits exactly the genome regions (e.g. genes) you want to study, ignoring everything else. In addition, after an alignment step, measurements such as sample coverage, and the target regions themselves, are tied directly to the positions contained in the BED file. The BED file is also widely used to represent repeats in the genome, protein isoforms, ORF regions or even transcription factor binding regions.
To manipulate a BED file, the Bedtools program stands out.
The first three columns are mandatory and have a standard format; together they identify a genomic region. Any other columns the file may have vary according to the type of analysis to be performed and the program that will be used. In addition, each line corresponds to a single annotation.
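As a minimal sketch of this layout, the Python snippet below splits one BED line on tabs, converts the two position columns to integers and keeps any remaining columns untouched. The names `BedRecord` and `parse_bed_line`, and the sample line, are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Tuple


class BedRecord(NamedTuple):
    """One BED line: three mandatory columns plus any optional ones."""
    chrom: str               # 1st column: name of the genomic fragment
    start: int               # 2nd column: start position
    end: int                 # 3rd column: end position
    extra: Tuple[str, ...]   # remaining columns, left uninterpreted


def parse_bed_line(line: str) -> BedRecord:
    """Split a tab-delimited BED line into its mandatory and optional parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three tab-separated columns")
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


# Illustrative line with one optional column (an assumed name field).
record = parse_bed_line("chr5\t100\t250\tgeneA\n")
print(record)
```

Leaving the optional columns as plain strings reflects the fact that their meaning depends on the analysis and on the program consuming the file.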
In most cases, the BED file must be sorted by sequence name (first column) and then by start position (second column).
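A rough sketch of this ordering in Python, assuming the records are already held in memory as `(name, start, end)` tuples. The function name `sort_bed_records` and the sample records are hypothetical.

```python
def sort_bed_records(records):
    """Sort by sequence name (as text), then by start position (as a number)."""
    return sorted(records, key=lambda rec: (rec[0], rec[1]))


# Illustrative, unsorted records: (name, start, end).
records = [
    ("chr2", 500, 600),
    ("chr1", 300, 400),
    ("chr1", 100, 200),
]

sorted_records = sort_bed_records(records)
for name, start, end in sorted_records:
    # Write the fields back out tab-delimited, as in a BED file.
    print(f"{name}\t{start}\t{end}")
```

Note that the name is compared as text, so "chr10" sorts before "chr2". Whether that is the expected order depends on the program that will read the file, so it is worth checking its documentation.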
The mandatory columns are:
- 1st column – Genomic fragment where the annotation can be found (e.g. chr5; scaffold SCAF01; contig NGAT753783);
- 2nd column – Start position of the region of interest, counted from zero (0-based). This differs from some other commonly used formats, such as VCF and GFF, in which positions are 1-based;
– 0-based means that the first base of the genomic fragment is numbered zero;
- 3rd column – End position of the region of interest, counted from one (1-based);
– Because the end position is 1-based, the base that carries this number in 0-based counting is not included by the programs, even though the value appears in the file. In other words, the end position is exclusive.
E.g.: we want the first 30 bases of chromosome 21. In BED format, with the fields separated by tabs, this would be written as:
chr21	0	30
The programs would use bases 0 to 29 of chromosome 21 (30 bases), and not bases 0 to 30 (which would be 31 bases).
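The arithmetic behind this example can be sketched in Python as follows. The helper names are hypothetical. The conversion to 1-based coordinates assumes the target format counts both start and end from one and includes the end base, as GFF does.

```python
def bed_interval_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval (end is exclusive)."""
    return end - start


def bed_to_one_based(start: int, end: int) -> tuple:
    """Convert BED coordinates to 1-based, end-inclusive coordinates."""
    return start + 1, end


def covered_positions(start: int, end: int) -> range:
    """0-based positions actually covered by the interval."""
    return range(start, end)


# The chromosome 21 example: start 0, end 30.
start, end = 0, 30
length = bed_interval_length(start, end)
one_based = bed_to_one_based(start, end)
positions = covered_positions(start, end)

print(length)                       # 30
print(one_based)                    # (1, 30)
print(positions[0], positions[-1])  # 0 29
```

One practical consequence of this convention is that the length of a region is simply the end minus the start, with no "+1" correction.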
About the author:
Livia Moura has a Ph.D. in bioinformatics, has worked on genome curation and script development at the University of California, Berkeley, and now works as a bioinformatician at Albert Einstein Hospital.