TY - GEN
T1 - Migrating a (large) science database to the cloud
AU - Thakar, Ani
AU - Szalay, Alex
PY - 2010
Y1 - 2010
AB - We report on attempts to put an existing scientific (astronomical) database - the Sloan Digital Sky Survey (SDSS) science archive [1] - in the cloud. Based on our experience, it is either very frustrating or impossible at this time to migrate an existing, complex SQL Server database into current cloud service offerings such as Amazon (EC2) and Microsoft (SQL Azure). Certainly it is impossible to migrate a large database in excess of a TB, but even with (much) smaller databases, the limitations of cloud services make it very difficult to migrate the data to the cloud without making changes to the schema and settings (for example, inability to migrate a spatial indexing library, and several other user-defined functions and stored procedures) that would invalidate performance comparisons between cloud and on-premise versions. So it is not surprising that our preliminary performance comparisons show a very large (an order of magnitude) performance discrepancy with the Amazon cloud version of the SDSS database. We have also not yet investigated the performance tweaks that could be possible within the cloud. Although we managed to successfully migrate (a subset of) the SDSS catalog database to Amazon EC2, we were not able to access the database in a meaningful way from the outside world. Even though this was advertised as a public dataset on the AWS blog, it was not clear how other users or the public would be able to access this data in a meaningful way, if at all. These difficulties suggest that much work and coordination need to occur between cloud service providers and their potential database clients before science databases can successfully and effectively be deployed in the cloud. This is true not just for large scientific databases but for all databases that make extensive use of advanced database management system (DBMS) features for performance and user convenience.
KW - Cloud
KW - Data in the cloud
KW - Cloud computing
KW - Databases
KW - Scientific databases
UR - http://www.scopus.com/inward/record.url?scp=78650033918&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650033918&partnerID=8YFLogxK
U2 - 10.1145/1851476.1851539
DO - 10.1145/1851476.1851539
M3 - Conference contribution
AN - SCOPUS:78650033918
SN - 9781605589428
T3 - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
SP - 430
EP - 434
BT - HPDC 2010 - Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
T2 - 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010
Y2 - 21 June 2010 through 25 June 2010
ER -