Solving the Big Science Checklist

What’s ailing big science these days? Stonebraker provides a glance in his recent posting on [ “databases for big science”]. (Thanks to [ Neil] for the pointer.) It’s no surprise to me that he says what’s needed is for big science to help define just what the big science projects need, so that idea can be turned into a general solution and commercialized; after all, that’s what Stonebraker does. But I think it might be more prudent for universities to turn toward open source, and specifically PostgreSQL.

Enter Floris Sluiter. Floris Sluiter works in the Netherlands on the [ EcoGRID] project, and his thinking is that he is moving into the big science realm. I’m not sure he is at the level Stonebraker is looking at (Stonebraker mentioned petabyte-sized problem sets, Sluiter mentions 300 million rows of data; without knowing how large the rows are it’s hard to be sure), but in any case, Sluiter does list the current problems he has with his PostgreSQL 8.1 based solution, and several of them look to me like solvable problems:
* GIS rasters (or multidimensional matrices with neighbourhood indices and functions). ISO 19123
* Time-interval datatype for timestamps, including operators for before, after, and overlap.
* Space/time datatypes for moving species (GPS tracks), or time series for GPS tracks that can grow while measuring, that we can couple with GIS linestrings.
It’s not clear to me what is missing from PostgreSQL’s current interval type (just the operators?), but all of these should be implementable in PostgreSQL, either as custom datatypes or in conjunction with PostGIS. ISTM this screams for a couple of undergrads to be tasked with working on it, or perhaps it could be fodder for a nice Summer of Code project.
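As a rough sketch of what the custom-datatype route could look like, here’s a minimal “period” type with before/overlap operators built from plain `CREATE TYPE` and `CREATE FUNCTION` (all the names here are my own invention, not anything that exists today; the `OVERLAPS` predicate itself is standard SQL):

```sql
-- Hypothetical sketch: a time-span type with the operators Sluiter asks for.
CREATE TYPE period AS (t_start timestamp, t_end timestamp);

-- "overlap": lean on the SQL-standard OVERLAPS predicate
CREATE FUNCTION period_overlaps(a period, b period) RETURNS boolean AS $$
    SELECT (a.t_start, a.t_end) OVERLAPS (b.t_start, b.t_end);
$$ LANGUAGE sql IMMUTABLE;

-- "before" (and "after" is just the arguments swapped)
CREATE FUNCTION period_before(a period, b period) RETURNS boolean AS $$
    SELECT a.t_end <= b.t_start;
$$ LANGUAGE sql IMMUTABLE;

-- expose overlap as an operator
CREATE OPERATOR && (leftarg = period, rightarg = period,
                    procedure = period_overlaps);
```

Indexing support is the harder part, but as a first cut this kind of thing is a weekend project, not a research program.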
* Functionality to synchronize login/password between postgresql clusters.
This might be better off being done slightly outside the database; I’m thinking of using LDAP to centrally manage authentication, or perhaps of using 8.3’s new GSSAPI support to do single sign-on that way.
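For the LDAP route, the knob lives in `pg_hba.conf`; in later PostgreSQL releases a line delegating password checks to a central LDAP server looks roughly like this (the hostname and DN pieces are placeholders, not anything from Sluiter’s setup):

```
# pg_hba.conf sketch: every cluster points at the same LDAP server,
# so accounts stay in sync without copying passwords between clusters.
host  all  all  10.0.0.0/8  ldap  ldapserver=ldap.example.org  ldapprefix="uid="  ldapsuffix=",ou=people,dc=example,dc=org"
```

You still need the roles to exist in each cluster, but the passwords live in one place.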
* Remote tables: efficient querying of large tables stored in remote clusters: (SQL/MED) ISO/IEC 9075-9:2003
This is one of those very hard problems. You can certainly do inefficient querying right now with dblink, but most of the systems out there for querying remote databases have similar issues with query planning across systems.
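For reference, the inefficient-but-working dblink version looks something like this (connection string, table, and column names are all made up; the catch is that the whole remote result set is shipped over before the local planner ever sees it):

```sql
-- Sketch: query a table on a remote cluster via contrib/dblink.
SELECT *
FROM dblink('host=remote.example.org dbname=ecogrid',
            'SELECT id, observed_at, species FROM observations')
     AS obs(id integer, observed_at timestamp, species text);
```

What SQL/MED promises, and what dblink can’t do, is pushing restrictions and joins down to the remote side.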
* Parallel query processing: when we could access several remote tables on different clusters/servers in parallel, it could provide a significant speed-up. I am thinking of remote tables that are children inherited from one parent table. The children would be stored on different servers.
Somewhere between PgCluster II, PL/Proxy, and Bizgres MPP there is a solution for PostgreSQL to help solve this problem.
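The local half of the inheritance scheme he describes already works today with constraint exclusion; the unsolved part is putting the children on other servers and scanning them in parallel. A sketch of the local half (all names invented):

```sql
-- One parent, children partitioned by region; in Sluiter's scheme each
-- child would live on a different server.
CREATE TABLE observations (
    id          integer,
    observed_at timestamp,
    region      text
);

CREATE TABLE observations_eu (CHECK (region = 'eu')) INHERITS (observations);
CREATE TABLE observations_na (CHECK (region = 'na')) INHERITS (observations);

-- With constraint exclusion on, the planner skips children whose CHECK
-- constraint rules them out, so only matching partitions get scanned.
SET constraint_exclusion = on;
SELECT count(*) FROM observations WHERE region = 'eu';
```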
* Multi-master replication with automated recovery after restoring backups. Replication of only parts (schema/table) of a database is a requirement.
Bucardo gets you close, but recovery from backup seems tricky. Probably not unobtainable though… if you initially set up the new master to always lose with respect to conflict resolution, it seems fairly plausible; I just wonder how Bucardo is about changing conflict resolution on the fly.

Not mentioned in his list, but I think lingering near the surface, is the need to solve the in-place upgrades problem (a pain I am all too familiar with, given that TB-level databases are already out there). Again, there’s obviously work here that needs to be done, and some of these things sound like very hard problems, but if you could get several universities to commit to a small part, I think everyone could get there.