Decima
Data type for effective screening and indexing of time-series data
Overview
We designed PostgreSQL data type for storing time-series and longitudinal data in relational database. We also design GiST index type to facilitate similarity search among time series.
This project aims to contribute time-series and longitudinal data management algorithms to PostgreSQl RDBMS, with features like:
- time-series and longitudinal data storage in relational database
- relational operators for search and comparison of time-series and longitudinal data
- conversion of time-series and longitudinal data to string representation based upon SAX
- indexing algorithm for facilitation similarity/dissimilarity search
We designed this data type to be used with the time series data representing gene expression profiles from a plant biology experiment, but it will work just as well in all situations where a set of numbers ordered on a line is viewed as a single entity, and where information is thought to be encoded in the shape of the series, rather than in each individual point. Such can be the distributions of physical properties along the DNA double helix, or along a polypeptide chain. In all these applications, the series is a unit of information, and it is far more efficient to store and process it as a single unit, than it is to spread it across multiple basic units requiring complicated external logic.
An important goal of this project is to provide support for .
This project targets release 8.3 and 8.4 of the PostgreSQL database. Source code downloads are available from the postgres.org site
Status
The initial source has been imported into SVN on July 2nd, 2009. There are plenty of bugs to fix and features to implement more properly. But most of the features outlined above are available and are going to be improved in the next couple of weeks/months. Source code releases will be made available regularly.
At the moment development of the project is supported by RBRF grant 11-04-01436-a, "The role of electrostatic properties of DNA in bacterial genome functioning"
Install datatype
You'll need source code installation of PostgreSQL RDBMS. Download PostgreSQL 8.3 or 8.4 to the folder PGSQLSRC. Install Postgres from source code according to instructions (see also); download Decima source code and execute following commands:
% cd PGSQLSRC/contrib % tar xjvf decima.tar.bz2 % cd ts % make % make install % make installcheck
Now you need to add type and function definitions to your database:
% psql DBNAME < PGSQLSRC/contrib/ts/st.sql
Setup test database
To demonstrate abilities of data types we have created test database. Postgres dump of the database is available here. Download it (488MB in archive and more then 2GB uncompressed) and install into your server:
% tar -xjf test.tar.bz2 % createdb testDecima % psql -d testDecima -f test.dump >inst.log 2>&1 &
You should run the last command in background as Index creation for such big database (1,111,011 time series and 4,444,044 SAX strings) could take up to 5h.