This repository contains the source code for the data processing that creates identifier (ID) mapping files for secondary IDs (outdated, deprecated, split, or merged). The following databases are included in this project:
| Datasource | license | citation |
|---|---|---|
| ChEBI (config) | CC BY 4.0 | Hastings J, Owen G, Dekker A, et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2016 Jan;44(D1):D1214-9. DOI: 10.1093/nar/gkv1031. PMID: 26467479; PMCID: PMC4702775. |
| HMDB (config) | CC0 | Wishart DS, Guo A, Oler E, Wang F, Anjum A, Peters H, Dizon R, Sayeeda Z, Tian S, Lee BL, Berjanskii M, Mah R, Yamamoto M, Jovel J, Torres-Calzada C, Hiebert-Giesbrecht M, Lui VW, Varshavi D, Varshavi D, Allen D, Arndt D, Khetarpal N, Sivakumaran A, Harford K, Sanford S, Yee K, Cao X, Budinski Z, Liigand J, Zhang L, Zheng J, Mandal R, Karu N, Dambrova M, Schiöth HB, Greiner R, Gautam V. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022 Jan 7;50(D1):D622-D631. doi: 10.1093/nar/gkab1062. PMID: 34986597; PMCID: PMC8728138. |
| HGNC (config) | link | Seal RL, Braschi B, Gray K, Jones TEM, Tweedie S, Haim-Vilmovsky L, Bruford EA. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D1003-D1009. doi: 10.1093/nar/gkac888. PMID: 36243972; PMCID: PMC9825485. |
| NCBI (config) | link | Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, Connor R, Funk K, Kelly C, Kim S, Madej T, Marchler-Bauer A, Lanczycki C, Lathrop S, Lu Z, Thibaud-Nissen F, Murphy T, Phan L, Skripchenko Y, Tse T, Wang J, Williams R, Trawick BW, Pruitt KD, Sherry ST. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022 Jan 7;50(D1):D20-D26. doi: 10.1093/nar/gkab1112. PMID: 34850941; PMCID: PMC8728269. |
| UniProt (config) | CC BY 4.0 | UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D480-D489. doi: 10.1093/nar/gkaa1100. PMID: 33237286; PMCID: PMC7778908. |
| Wikidata (config) | CC0 | Vrandečić, D., Krötzsch, M. Wikidata: a free collaborative knowledgebase. Communications of the ACM. 2014. doi: 10.1145/2629489. |
You can access the executable libraries to create mapping files here.
If you wish to develop the code further, install the source code. It requires Java 8 or 11 as the JRE (depending on the version used in BridgeDb).
- Clone the code from this repository
- Add this project in Eclipse and build from maven using 'clean install', or run the build from your command line:
sudo apt update
sudo apt install gh
gh repo clone sec2pri/mapping_preprocessing
sudo apt install openjdk-8-jre-headless #or: sudo apt install openjdk-11-jre-headless
sudo apt install maven #to build the code
Building the project with Maven ('clean install') will create an executable Java file called 'mapping_preprocessing-0.0.1.jar'.
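As a quick check after the build, a small shell sketch can confirm that the JAR is where the later commands expect it. The function name `find_jar` is hypothetical and not part of this repository; it assumes the build was run with `mvn clean install` from the project root and that the JAR keeps the file name given above.

```shell
# Hypothetical helper (not part of this repository):
# print the path of the built JAR, or fail with a hint if it is missing.
find_jar() {
  project_dir="${1:-.}"   # project root; defaults to the current directory
  jar="$project_dir/target/mapping_preprocessing-0.0.1.jar"
  if [ -f "$jar" ]; then
    printf '%s\n' "$jar"
  else
    echo "JAR not found at $jar; run 'mvn clean install' first" >&2
    return 1
  fi
}
```

For example, `jar="$(find_jar .)" || exit 1` stops a script early when the build output is missing, instead of failing later inside the `java` call.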
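The executable Java file is located in the 'target' folder. The commands below reference it by the relative path 'target/mapping_preprocessing-0.0.1.jar', so run them from the directory that contains 'target'.
1) ChEBI SDF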
#sudo apt-get install gzip #if not available
RELEASE_NUMBER="247"
wget "http://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel${RELEASE_NUMBER}/SDF/chebi_3_stars.sdf.gz"
gunzip chebi_3_stars.sdf.gz
inputFile="chebi_3_stars.sdf"
outputDir="$(pwd)"
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.chebi_sdf "$inputFile" "$outputDir"
2) HMDB XML
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.hmdb_xml "$inputFile" "$outputDir"
3) NCBI txt
java -cp target/mapping_preprocessing-0.0.1.jar org.sec2pri.ncbi_txt "$inputFile" "$outputDir"
inputFile: the path (directory and file name) of the input file. ChEBI: download and unzip the SDF file; HMDB: download and unzip the XML file, then split it into individual XML files, one per entry; NCBI: download the data.
outputDir: the directory in which the output file(s) should be saved.
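The HMDB preparation step (splitting the XML into one file per entry) could be sketched as follows. This is an illustration under stated assumptions, not the method used by this project: the function name `split_hmdb_xml` and the `entry_NNNNNN.xml` file names are hypothetical, and the sketch assumes every `<metabolite>` and `</metabolite>` tag sits on its own line in the unzipped download. The resulting files contain only the entry element, without an XML declaration or the enclosing root element, so check what the `hmdb_xml` class expects before relying on it.

```shell
# Hypothetical helper (not part of this repository):
# split an HMDB XML dump into one file per <metabolite> entry.
# Assumes the opening and closing <metabolite> tags are each on their own line.
split_hmdb_xml() {
  input="$1"    # unzipped HMDB XML file
  outdir="$2"   # directory that receives one XML file per entry
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    # start a new output file at every opening tag
    /<metabolite>/   { n++; out = sprintf("%s/entry_%06d.xml", outdir, n); writing = 1 }
    # copy lines while inside an entry
    writing          { print > out }
    # close the file at the closing tag to avoid running out of file handles
    /<\/metabolite>/ { if (writing) close(out); writing = 0 }
    # report the number of entries written
    END              { print n + 0 }
  ' "$input"
}
```

Calling `split_hmdb_xml hmdb_metabolites.xml hmdb_entries` (both names are placeholders) prints the number of entries written and leaves the individual files in `hmdb_entries`.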
The mapping files are released and archived on Zenodo (link to be announced).