This project concerns parsing the XML of the MEDLINE data into an RDBMS database in order to perform further and high-volume interrogation of this information. 0. License 1. Background 2. Resources 3. MySQL 4. Running 5. DTD Notes ========== 0. License Copyright 2010 Institute for Systems Biology Seattle, Washington, USA. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ========== 1. Background Teams at Stanford and Berkeley published a paper about doing just this 7 October 2004: "Tools for loading Medline into a local relational database." Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst, BMC Bioinformatics 2004, (7Oct2004) This project updates the Java project applicable to MEDLINE 2004 (available from ) to conform to schema changes and import MEDLINE 2010 data. This project also updates the original to take advantage of Java 1.6 and ties to a MySQL RDBMS (whereas the original expected DB2). ========== 2. Resources * The original paper: "Tools for loading MEDLINE into a local relational database" * The original software in Java (for MEDLINE 2004) and Perl (for MEDLINE 2003): * Information about the MEDLINE data: ========== 3. MySQL The schema creation script at src/main/sql/medline-schema-mysql51.sql will create the tables and their indexes within a MySQL database. It does not create the database itself; you will have to do so manually. Although the schema creation was written against MySQL 5.1, it likely works fine against other versions of MySQL. a. Character Sets While the MEDLINE XML data uses UTF-8 encoding (and the generated SQL files likewise use this encoding), the MySQL database itself should use 'latin1' encoding with the default collation. Attempting to use other character sets (such as 'utf8') may cause issues with index creation. For more information, see "9.1.3. Specifying Character Sets and Collations." b. Storage Engines The schema creation script does not specify which storage engine to use for each table, so the default engine will be used. Unless you specified otherwise, the engine being used is likely the MyISAM engine. This is a good choice, since it provides very fast reads and is easy to back up. However, it doesn't support FOREIGN KEY constraints. These constraints are present in the schema creation script, but are ignored by the storage engine. This is probably just fine, but you should be aware of it nonetheless. For more information regarding storage engines, see "Chapter 13. Storage Engines." ========== 4. Running This project may be built using Maven2: a. Switch to the root directory and build the project with: mvn package This will build the archive medlineParser-2.0-SNAPSHOT.jar in the ./target directory. b. Switch to the target directory: cd target c. Run the application. Running the application always requires the MySQL Connector/J jar (available from ). The full path to this jar will be represented below by $MYSQL_CJ. This code has been tested with version 5.1.13 of the connector. If your database is named 'medline' and your MySQL server is running at 'localhost:3306' (the default) running the parser is simple: java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \ biotextEngine.xmlparsers.medline.MedlineParser \ /path/to/medline-data.xml To specify a different database, use the '-url' flag: java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \ biotextEngine.xmlparsers.medline.MedlineParser \ -url=jdbc:mysql://hostname:port/database \ /path/to/medline-data.xml The contents of the '-url' parameter are passed directly to Connector/J. See the URL syntax for Connector/J at: It's faster to generate SQL files from the XML then import them into the database. To do this, use the '-file' parameter: java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \ biotextEngine.xmlparsers.medline.MedlineParser \ -file=medline-inserts.sql \ /path/to/medline-data.xml If an output file name is not given, the name of the input file is used with '-insert.sql' appended. NOTE: even when generating SQL files, you will need to have a MySQL instance for the program to connect to. The MySQL JDBC driver is used to generate the SQL; this is a limitation of the program. ========== 5. DTD Notes a. DataBank Under MedlineCitation/Article is a list of DataBank information (see DataBankList[1]). The DTD specifies that a DataBank must have a name (DataBankName) but a list of AccessionNumber is optional. However, the java code as written will only import data when at least one AccessionNumber exists. Therefore, if an Article were to have a DataBank with a name and no number, that would not be reflected in the database. [1] b. NameID and Pagination The 2010 DTD defines a few elements which are not currently used and reserved for future use. These elements are not represented in the code and would be ignored if encountered. These are: NameID in Author and Investigator, and StartPage and EndPage in Pagination. c. Gene Symbols Gene Symbols are associated with each PMID and represented in the table medline_gene_symbol_list. There are cases where the gene symbols only differ by case (e.g., aprt and APRT). The database is case agnostic and will see them as the same (if one were, for example, to search). This is something to keep in mind when designing searches concerns these symbols. ========== $Id$