COPPE/UFRJ, Rio de Janeiro
Tuesday, March 17 2015, 11am, IBC (LIRMM Bat.5) room 01/124
Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application domain, such as HDF5, NetCDF and FITS. These formats are supported by a variety of programming languages, libraries and programs. Because the files are so large, analyzing them requires writing a specific program. Scientific data analysis systems such as database management systems (DBMS) are not well suited, because of time-consuming data loading, data transformation at large scale, and legacy code that is incompatible with a DBMS. Recently, there have been several proposals for indexing and querying raw data files without the overhead of using a DBMS. Systems like noDB, SDS and RAW offer query support over a raw data file after a scientific program has generated it. However, these solutions focus on the analysis of a single large file. When a large number of related files are all required for the evaluation of one scientific hypothesis, the relationships among them must be managed manually or by writing specific programs. In this talk, we will discuss current approaches for raw data analysis and present our approach, which combines DBMS and raw data analysis. It takes advantage of the provenance database support provided by scientific workflow management systems (SWfMS). When scientific applications are managed by a SWfMS, the data being generated is registered in the provenance database. This provenance data can therefore act as a description of these files. When the SWfMS is dataflow-aware and also registers selected data and pointers to domain data in the same database, the resulting database becomes an important access path to the large number of files generated by the scientific workflow execution. This makes it a complementary approach to single raw data file analysis support.
*Joint work with Vitor Silva, Daniel Oliveira and Patrick Valduriez
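
As a rough illustration of the idea in the abstract, the sketch below shows how a provenance database that also stores selected domain values and file pointers could serve as an access path to many raw files. It is a minimal sketch under stated assumptions, not the system presented in the talk: the schema, the table and column names, the `files_where` helper, and the sample paths and values are all hypothetical, and SQLite stands in for whatever DBMS a SWfMS would actually use.

```python
import sqlite3

# Hypothetical provenance schema (illustrative names, not from any specific SWfMS):
# one row per activity execution, one row per raw file it produced,
# and selected domain values extracted from each file at generation time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity_execution (
    exec_id  INTEGER PRIMARY KEY,
    activity TEXT NOT NULL,
    status   TEXT NOT NULL
);
CREATE TABLE raw_file (
    file_id INTEGER PRIMARY KEY,
    exec_id INTEGER NOT NULL REFERENCES activity_execution(exec_id),
    path    TEXT NOT NULL,   -- pointer to the raw file, not its contents
    format  TEXT NOT NULL
);
CREATE TABLE extracted_value (
    file_id INTEGER NOT NULL REFERENCES raw_file(file_id),
    name    TEXT NOT NULL,
    value   REAL NOT NULL
);
""")

# Made-up sample data standing in for what a workflow run would register.
conn.executemany(
    "INSERT INTO activity_execution VALUES (?, ?, ?)",
    [(1, "simulate", "FINISHED"), (2, "simulate", "FINISHED")],
)
conn.executemany(
    "INSERT INTO raw_file VALUES (?, ?, ?, ?)",
    [(10, 1, "run1/out.h5", "HDF5"), (11, 2, "run2/out.h5", "HDF5")],
)
conn.executemany(
    "INSERT INTO extracted_value VALUES (?, ?, ?)",
    [(10, "max_pressure", 3.2), (11, "max_pressure", 7.9)],
)


def files_where(conn, name, threshold):
    """Return paths of files from finished executions whose extracted
    value `name` exceeds `threshold`, without opening any raw file."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM raw_file f
        JOIN extracted_value v ON v.file_id = f.file_id
        JOIN activity_execution e ON e.exec_id = f.exec_id
        WHERE v.name = ? AND v.value > ? AND e.status = 'FINISHED'
        ORDER BY f.path
        """,
        (name, threshold),
    ).fetchall()
    return [row[0] for row in rows]


selected = files_where(conn, "max_pressure", 5.0)
print(selected)
```

Only the files returned by such a query would then need to be opened with a raw data tool, which is where single-file analysis support would come in.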