sec2pri/mapping_preprocessing

Processing mapping files for the omics FixID tool

This repository contains the source code that processes database releases into identifier (ID) mapping files for secondary IDs (outdated, deprecated, split, or merged). The following databases are included in this project:

Datasource | License | Citation
ChEBI (config) | CC BY 4.0 | Hastings J, Owen G, Dekker A, et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res. 2016 Jan;44(D1):D1214-9. doi: 10.1093/nar/gkv1031. PMID: 26467479; PMCID: PMC4702775.
HMDB (config) | CC0 | Wishart DS, Guo A, Oler E, Wang F, Anjum A, Peters H, Dizon R, Sayeeda Z, Tian S, Lee BL, Berjanskii M, Mah R, Yamamoto M, Jovel J, Torres-Calzada C, Hiebert-Giesbrecht M, Lui VW, Varshavi D, Varshavi D, Allen D, Arndt D, Khetarpal N, Sivakumaran A, Harford K, Sanford S, Yee K, Cao X, Budinski Z, Liigand J, Zhang L, Zheng J, Mandal R, Karu N, Dambrova M, Schiöth HB, Greiner R, Gautam V. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022 Jan 7;50(D1):D622-D631. doi: 10.1093/nar/gkab1062. PMID: 34986597; PMCID: PMC8728138.
HGNC (config) | link | Seal RL, Braschi B, Gray K, Jones TEM, Tweedie S, Haim-Vilmovsky L, Bruford EA. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D1003-D1009. doi: 10.1093/nar/gkac888. PMID: 36243972; PMCID: PMC9825485.
NCBI (config) | link | Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, Connor R, Funk K, Kelly C, Kim S, Madej T, Marchler-Bauer A, Lanczycki C, Lathrop S, Lu Z, Thibaud-Nissen F, Murphy T, Phan L, Skripchenko Y, Tse T, Wang J, Williams R, Trawick BW, Pruitt KD, Sherry ST. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022 Jan 7;50(D1):D20-D26. doi: 10.1093/nar/gkab1112. PMID: 34850941; PMCID: PMC8728269.
UniProt (config) | CC BY 4.0 | UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D480-D489. doi: 10.1093/nar/gkaa1100. PMID: 33237286; PMCID: PMC7778908.
Wikidata (config) | CC0 | Vrandecic, D., Krotzsch, M. Wikidata: a free collaborative knowledgebase. Communications of the ACM. 2014. doi: 10.1145/2629489.

You can access the executable libraries to create mapping files here.
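
To illustrate what a secondary-to-primary mapping is for, the sketch below looks up the current ID for an outdated one. It is only an illustration: the IDs are made up, the two-column tab-separated layout is an assumption (the files produced by this project may be organised differently), and `to_primary` is a hypothetical helper.

```shell
#!/bin/sh
# Made-up mapping table for illustration; IDs and column layout are assumptions.
mapping="$(mktemp)"
printf 'primaryID\tsecondaryID\n'  > "$mapping"
printf 'DB:0001\tDB:0099\n'       >> "$mapping"   # DB:0099 was merged into DB:0001
printf 'DB:0002\tDB:0150\n'       >> "$mapping"   # DB:0150 was split into DB:0002 ...
printf 'DB:0003\tDB:0150\n'       >> "$mapping"   # ... and DB:0003

# Print the primary ID(s) for a given ID; echo the ID back when it is not
# listed as secondary (it is then assumed to be current already).
to_primary() {
  awk -F '\t' -v id="$1" '
    NR > 1 && $2 == id { print $1; found = 1 }
    END { if (!found) print id }
  ' "$mapping"
}

to_primary "DB:0099"   # merged ID: one primary
to_primary "DB:0150"   # split ID: two primaries
to_primary "DB:0001"   # already primary: returned unchanged
```

A split ID yields more than one primary ID, so callers of such a lookup should be prepared to handle several result lines.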

Contributing

To develop the code further, install the source code. It requires a Java 8 (or 11) JRE, depending on the version used in BridgeDb.
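
Before building, it can help to confirm which Java version is installed. The following is a minimal sketch of such a check, not part of the repository: `java_major` and `is_supported_java` are hypothetical helper names, and the check simply accepts the two versions named above (8 and 11).

```shell
#!/bin/sh
# Extract the major Java version from a version string such as
# "1.8.0_392" (Java 8) or "11.0.21" (Java 11).
java_major() {
  version="$1"
  case "$version" in
    1.*) version="${version#1.}" ;;   # legacy scheme: 1.8.x means Java 8
  esac
  echo "${version%%.*}"               # keep everything before the first dot
}

# Succeeds only for the versions named in this README (8 or 11).
is_supported_java() {
  case "$(java_major "$1")" in
    8|11) return 0 ;;
    *)    return 1 ;;
  esac
}

# Check the local installation, if java is on the PATH.
if command -v java >/dev/null 2>&1; then
  installed="$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')"
  if is_supported_java "$installed"; then
    echo "Java $installed is supported"
  else
    echo "Java $installed may not match the BridgeDb build" >&2
  fi
fi
```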

  1. Clone the code from this repository
  2. Add the project to Eclipse and build it with Maven using 'clean install', or run the build from the command line:

Build from Command Line

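sudo apt update
sudo apt install gh
gh repo clone sec2pri/mapping_preprocessing
sudo apt install openjdk-8-jre-headless # or: sudo apt install openjdk-11-jre-headless
# if Maven reports that no compiler is available, install the matching JDK package (e.g. openjdk-8-jdk-headless)
sudo apt install maven # to build the code
cd mapping_preprocessing
mvn clean install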

This creates an executable JAR file called 'mapping_preprocessing-0.0.1.jar'.

Create ID mapping files

Run the commands below from the directory that contains the 'target' folder, which is where the executable JAR file is located.

# sudo apt-get install gzip # if not available

# 1) ChEBI SDF
RELEASE_NUMBER="247"
wget "http://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel${RELEASE_NUMBER}/SDF/chebi_3_stars.sdf.gz"
gunzip chebi_3_stars.sdf.gz
inputFile="chebi_3_stars.sdf"
outputDir="$(pwd)"
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.chebi_sdf "$inputFile" "$outputDir"

# 2) HMDB XML (set inputFile to the HMDB input first)
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.hmdb_xml "$inputFile" "$outputDir"

# 3) NCBI txt (set inputFile to the NCBI input first)
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.ncbi_txt "$inputFile" "$outputDir"

inputFile: the directory and file name of the input. Prepare it per datasource. ChEBI: download the SDF and unzip it. HMDB: download the XML, unzip it, and split it into one XML file per entry. NCBI: download the data.

outputDir: the directory in which the output file(s) should be saved.
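
The HMDB preparation step (splitting the unzipped XML into one file per entry) could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's own script: the sample file, the `hmdb_split` directory, and the `entry_N.xml` naming are hypothetical, and it assumes that entries are wrapped in `<metabolite>` elements whose opening and closing tags each sit on their own line.

```shell
#!/bin/sh
set -eu

# Hypothetical working directory and file names, for illustration only.
workDir="$(mktemp -d)"
sample="$workDir/sample_metabolites.xml"
splitDir="$workDir/hmdb_split"
mkdir -p "$splitDir"

# Tiny stand-in for the unzipped HMDB XML (the real file is far larger).
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<hmdb>
<metabolite>
  <accession>HMDB0000001</accession>
  <secondary_accessions>
    <accession>HMDB00001</accession>
  </secondary_accessions>
</metabolite>
<metabolite>
  <accession>HMDB0000002</accession>
</metabolite>
</hmdb>
EOF

# Write each <metabolite>...</metabolite> block to its own file.
# Line-based: assumes the opening and closing tags are on separate lines.
awk -v dir="$splitDir" '
  /<metabolite>/   { n++; out = sprintf("%s/entry_%d.xml", dir, n); inside = 1 }
  inside           { print > out }
  /<\/metabolite>/ { inside = 0; close(out) }
' "$sample"

entryCount="$(ls "$splitDir" | wc -l | tr -d ' ')"
echo "Wrote $entryCount entry files to $splitDir"
```

A line-based split like this is fast but fragile. If the tag layout of the download differs from the assumption, an XML-aware tool is the safer choice.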

Releases

The mapping files are released and archived on Zenodo (link to be announced).
