This project concerns parsing the XML of the MEDLINE data into an RDBMS
database in order to perform further and high-volume interrogation of this
information.

0. License
1. Background
2. Resources
3. MySQL
4. Running
5. DTD Notes

==========
0. License

  Copyright 2010 Institute for Systems Biology
                 Seattle, Washington, USA.

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

==========
1. Background

Teams at Stanford and Berkeley published a paper about doing just this 7
October 2004:

  "Tools for loading Medline into a local relational database." Diane E.
  Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst,
  BMC Bioinformatics 2004, (7Oct2004)
  <http://www.biomedcentral.com/1471-2105/5/146/abstract>

This project updates the Java project applicable to MEDLINE 2004 (available
from <http://biotext.berkeley.edu/software.html>) to conform to schema changes
and import MEDLINE 2010 data.

This project also updates the original to take advantage of Java 1.6 and ties
to a MySQL RDBMS (whereas the original expected DB2).

==========
2. Resources

* The original paper: "Tools for loading MEDLINE into a local relational
  database" <http://www.biomedcentral.com/1471-2105/5/146>

* The original software in Java (for MEDLINE 2004) and Perl (for MEDLINE
  2003): <http://biotext.berkeley.edu/software.html>

* Information about the MEDLINE data:
  <http://www.nlm.nih.gov/bsd/licensee/medpmmenu.html>

==========
3. MySQL

The schema creation script at src/main/sql/medline-schema-mysql51.sql will
create the tables and their indexes within a MySQL database.  It does not
create the database itself; you will have to do so manually.

Although the schema creation was written against MySQL 5.1, it likely works
fine against other versions of MySQL.

a. Character Sets

While the MEDLINE XML data uses UTF-8 encoding (and the generated SQL files
likewise use this encoding), the MySQL database itself should use 'latin1'
encoding with the default collation.  Attempting to use other character sets
(such as 'utf8') may cause issues with index creation.

For more information, see "9.1.3. Specifying Character Sets and Collations."
<http://dev.mysql.com/doc/refman/5.1/en/charset-syntax.html>

b. Storage Engines

The schema creation script does not specify which storage engine to use for
each table, so the default engine will be used.  Unless you specified
otherwise, the engine being used is likely the MyISAM engine.  This is a good
choice, since it provides very fast reads and is easy to back up.  However, it
doesn't support FOREIGN KEY constraints.  These constraints are present in the
schema creation script, but are ignored by the storage engine.  This is
probably just fine, but you should be aware of it nonetheless.

For more information regarding storage engines, see "Chapter 13. Storage
Engines." <http://dev.mysql.com/doc/refman/5.1/en/storage-engines.html>

==========
4. Running

This project may be built using Maven2: <http://maven.apache.org/>

a. Switch to the root directory and build the project with: mvn package
   This will build the archive medlineParser-2.0-SNAPSHOT.jar in the ./target
   directory.

b. Switch to the target directory: cd target

c. Run the application.

Running the application always requires the MySQL Connector/J jar (available
from <http://dev.mysql.com/downloads/connector/j/>).  The full path to this
jar will be represented below by $MYSQL_CJ.  This code has been tested with
version 5.1.13 of the connector.

If your database is named 'medline' and your MySQL server is running at
'localhost:3306' (the default) running the parser is simple:

    java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \
      biotextEngine.xmlparsers.medline.MedlineParser \
      /path/to/medline-data.xml
  
To specify a different database, use the '-url' flag:

    java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \
      biotextEngine.xmlparsers.medline.MedlineParser \
      -url=jdbc:mysql://hostname:port/database \
      /path/to/medline-data.xml

The contents of the '-url' parameter are passed directly to Connector/J.  See
the URL syntax for Connector/J at:
<http://dev.mysql.com/doc/refman/5.1/en/connector-j-reference-configuration-properties.html>

It's faster to generate SQL files from the XML then import them into the
database.  To do this, use the '-file' parameter:

    java -cp medlineParser-2.0-SNAPSHOT.jar:$MYSQL_CJ \
      biotextEngine.xmlparsers.medline.MedlineParser \
      -file=medline-inserts.sql \
      /path/to/medline-data.xml

If an output file name is not given, the name of the input file is used with
'-insert.sql' appended.

NOTE: even when generating SQL files, you will need to have a MySQL instance
for the program to connect to.  The MySQL JDBC driver is used to generate the
SQL; this is a limitation of the program.

==========
5. DTD Notes

a. DataBank

Under MedlineCitation/Article is a list of DataBank information (see
DataBankList[1]).  The DTD specifies that a DataBank must have a name
(DataBankName) but a list of AccessionNumber is optional. However, the java
code as written will only import data when at least one AccessionNumber
exists. Therefore, if an Article were to have a DataBank with a name and no
number, that would not be reflected in the database. 

[1] <http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html#databanklist>

b. NameID and Pagination

The 2010 DTD defines a few elements which are not currently used and reserved
for future use.  These elements are not represented in the code and would be
ignored if encountered.  These are: NameID in Author and Investigator, and
StartPage and EndPage in Pagination.

c. Gene Symbols

Gene Symbols are associated with each PMID and represented in the table
medline_gene_symbol_list.  There are cases where the gene symbols only differ
by case (e.g., aprt and APRT).  The database is case agnostic and will see
them as the same (if one were, for example, to search).  This is something to
keep in mind when designing searches concerns these symbols.

==========
  $Id$