Contenuto
Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricercaINIZIO_TESTO_DA_INDICIZZARE
RESEARCH PROGRAM
italiano - inglese
Research Units
- Università degli Studi di ROMA "La Sapienza"
INFORMATICA E SISTEMISTICA
- Università degli Studi di PADOVA
INGEGNERIA DELL'INFORMAZIONE
- Università degli Studi ROMA TRE
INFORMATICA E AUTOMAZIONE
- Università degli Studi di ROMA "Tor Vergata"
INFORMATICA, SISTEMI E PRODUZIONE
- Università degli Studi di PISA
INFORMATICA
Similar research programs:
- 1 - Algorithms for the Next Generation Internet and Web: Methodologies, Design, and Experiments
- 2 - Web Ram: Web Retrieval and Mining
- 3 - Similarity-based Methods for Computer Vision and Pattern Recognition: Theory, Algorithms, Applications
- 4 - Models and algorithms for robust network optimization
- 5 - Design methodologies for platform-based multi-processor systems-on-chip
- 6 - Learning Hierarchical, Abstract Models from Temporal or Spatial Data
- 7 - The geomatics in support of the actions of Government of the territory
- 8 - Peer to peeR beyOnd FILE Sharing (PROFILES)
- 9 - Cryptographic databases
- 10 - Data mining and optimization models for biolife and healthcare problems
Scientific and education field classification
International Patent Classification
- PHYSICS
- COMPUTING; CALCULATING; COUNTING (score computers for games A63; combinations of writing applicances with computing devices B43K29/08)
- ELECTRICAL DIGITAL DATA PROCESSING (computers in which a part of the computation is effected hydraulically or pneumatically G06D; optically G06E; self-contained input or output peripheral equipment G06K; impedance networks using digital techniques H03H) [C9603]
- INFORMATION STORAGE
- STATIC STORES (information storage based on relative movement between record carrier and transducer G11B; semiconductor devices for storage H01L, e.g. H01L27/108 to H01L27/115; pulse technique in general H03K, e.g. electronic switches H03K17/00 [N: using a static store as a picture recording medium H04N5/907; Calculators 42P; see provisionally 42M37G])
- COMPUTING; CALCULATING; COUNTING (score computers for games A63; combinations of writing applicances with computing devices B43K29/08)
Geographical classification
- Region: Lazio
Bibliografia
[ACS87] A. Aggarwal, A. Chandra, M. Snir. Hierarchical Memory with Block transfer. Proc. FOCS, 1987[ADRR04] G. Aggarwal, M. Datar, S. Rajagopalan, M. Ruhl. On the streaming model augmented with a sorting primitive. Proc. FOCS, 540-549, 2004
[AV88] A. Aggarwal, J.S. Vitter. The input/output complexity of sorting and related problems. Comm. ACM, 31(9):1116-1127, 1988
[ADM06] D.Ajwani, R. Dementiev, U. Meyer. A computational study of external-memory BFS algorithms. Proc. SODA, 601-610, 2006
[ADE01] P.R. Amestoy, I.S. Duff, J.Y. L'Excellent, X.S. Li. Analysis and comparison of two general sparse solvers for distributed memory computers, ACM Trans. Math. Softw., 27(4):388-421, 2001
[ABF05] L. Arge, G.S. Brodal, R. Fagerberg. Cache-oblivious data structures. Handb. Data Str. and Appl., 34, 2005
[AFI05] G. Ausiello, P.G. Franciosa and G.F. Italiano. Small Stretch Spanners on Dynamic Graphs. Proc. ESA, 2005
[B05] D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1, 3rd Ed., 2005
[B+03] M. Bianco, G. Bilardi, F. Pesavento, G. Pucci, B.A. Schrefler. A frontal solver tuned for fully-coupled non-linear hygro-thermo-mechanical problems. Int. J. Numer. Meth. in Eng., 57(13):1801-1818, 2003
[BEP02] G. Bilardi, K. Ekanadham, P. Pattnaik, Optimal organizations for pipelined hierarchical memories, Proc. SPAA, 109-116, 2002
[BP01] G. Bilardi, E. Peserico, A Characterization of Temporal Locality and Its Portability across Memory Hierarchies, Proc. ICALP, 128-139, 2001
[BF03] G.S. Brodal, R. Fagerberg. On the limits of cache-obliviousness. Proc. STOC, 307-315, 2003
[BW94] M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Res. Report 124, DSRC, 1994
[CMV05] B. Catania, A. Maddalena, A. Vakali. XML document indexes: a classification. IEEE Internet Comp., 2005
[CLGKP94] P. M. Chen, E. L. Lee, G. A. Gibson, R. H. Katz, D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Comp. Surv., 26(2):145-185, 1994
[CDPP05] P.F. Cortese, G. Di Battista, M. Patrignani, M. Pizzonia. Clustering Cycles into Cycles of Clusters, JGAA, 9:391-413, 2005
[DDFLPP99] C. Demetrescu, G. Di Battista, I. Finocchi, G. Liotta, M. Patrignani, M. Pizzonia: Infinite Trees and the Future. Proc. GD, 379-391, 1999
[DFIN02] C. Demetrescu, I. Finocchi, G.F. Italiano, and S. Naeher. Visualization in Algorithm Engineering: Tools and Techniques. In "Experim. Algorithmics, From Algorithm Design to Robust and Efficient Softw.", LNCS 2547, 2002
[DFR06] C. Demetrescu, I. Finocchi, and A. Ribichini. Trading off space for passes in graph streaming problems. Proc. SODA, 714-723, 2006
[DI04] C. Demetrescu and G. F. Italiano. A new approach to dynamic all pairs shortest paths. JACM, 51(6), 968-992, 2004
[DEG99] J.W. Demmel, S.C. Eisenstat, J.R. Gilbert, X.S. Li, J.W.H. Liu. A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. and App., 20(3):720-755, 1999
[DDGL05a] E. Di Giacomo, W. Didimo, L. Grilli, G. Liotta: WhatsOnWeb: Using Graph Drawing to Search the Web. Proc. GD, 480-491, 2005
[DDLM04] E. Di Giacomo, W. Didimo, G. Liotta, H. Meijer: Computing Radial Drawings on the Minimum Number of Circles. Proc. GD, 251-261, 2004
[D59] E.W. Dijkstra. A note on two problems in connexion with graphs. Num. Mathematik, 1959
[FPP05] C. Fantozzi, A. Pietracaprina, G. Pucci. Translating Submachine Locality into Locality of Reference. JPDC S.I. IPDPS'04 Best Papers, 2005
[FPRU94] U. Feige, P. Raghavan, D. Peleg, E. Upfal. Computing with noisy information. SIAM J. Comp., 23:1001-1018, 1994
[FKMSZ04] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, J. Zhang. On graph problems in a semi-streaming model. Proc. ICALP, 531-543, 2004
[FCE95] Q.W. Feng, R.F. Cohen, P. Eades. Planarity for Clustered Graphs. Proc. ESA, 213-226, 1995
[FLMM05] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. FOCS, 2005
[FM05] P. Ferragina, G. Manzini. Indexing Compressed Text. JACM, 52:552-581, 2005
[FI04] I. Finocchi, G. F. Italiano. Sorting and searching in the presence of memory faults (without redundancy). Proc. STOC, 101-110, 2004
[FGMP02] G. Franceschini, R. Grossi, J. I. Munro, L. Pagli, Implicit B-trees: New results for the dictionary problem. Proc. FOCS and JCSS S.I., 145-154, 2003
[FJ98] M. Frigo, S. Johnson, FFTW: An adaptive software architecture for the FFT, Proc. ICASSP, 3:1381-1384, 1998
[FLPR99] M. Frigo, C.E. Leiserson, H. Prokop, S. Ramachandran. Cache-Oblivious Algorithms. Proc. FOCS, 285-297, 1999
[GLS05] M.T. Goodrich, G.S. Lueker, J.Z. Sun. C-planarity of extrovert clustered graphs. Proc. GD, 211-222, 2005
[GV05] R. Grossi, J.S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM J. Comp., 35(2):378-407, 2005.
[HEW98] M. L. Huang, P. Eades, J. Wang: On-line Animated Visualization of Huge Graphs using a Modified Spring Algorithm. J. Vis. Lang. and Comp., 9(6): 623-645, 1998.
[ISO89] International Standards Organization: "Intra-Domain IS-IS Routing Protocol", ISO-IEC JTC1/SC6 WG2N323, 1989
[K05] M. Kaki. Findex: search result categories help users when document ranking fails. Proc. CHI, 131-140, 2005
[K99] V. King. Fully dynamic algorithms for maintaining all-pairs shortest paths and transitive closure in digraphs. Proc. FOCS, 1999
[LRP95] J. Lamping, R. Rao, P. Pirolli. A focus-context technique based on hyperbolic geometry for visualization large hierarchies. Proc. InterCHI, 401-408, 1995.
[MS00] L. Marsan and M.-F. Sagot, Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. J. Comp. Bio., 7:345-360, 2000.
[MZ03] U. Meyer, N. Zeh. I/O-efficient undirected shortest paths. Proc. ESA, 434-445, 2003
[M+02] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon. Network motifs: simple building blocks of complex networks. Science 298(5594): 824-827, 2002
[M99] J. Moy, "OSPF: Anatomy of an Internet Routing Protocol", Addison-Wesley, 1999
[M05] S. Muthukrishnan. Data streams: algorithms and applications. http://athos.rutgers.edu/~muthu.
[NPW04] E. Nardelli, G. Proietti, and P. Widmayer. Nearly linear time minimum spanning tree maintenance for transient node failures. Algoritmica, 40:119-132, 2004
[PS05] N. Pisanti, M.F. Sagot: Inference of Network Expressions. In Applied Combinatorics on Words, 5:241-267, 2005
[PSWZ04] G. Prencipe, N. Santoro, P. Widmayer, T. Zuva. Computing all the Best Swap Edges Distributively. Proc. OPODIS, 2004, 154-168
[PZ03] A. Pietracaprina, D. Zandolin., Mining Frequent Itemsets using Patricia Tries, Proc. FIMI, 2003
[R+05] M. T. Ross et al., The DNA sequence of the human X chromosome Nature 434:325-337, 2005
[RMC91] G. G. Robertson, J.D. Mackinlay, S. K. Card, Cone trees: animated 3D visualizations of hierarchical information. Proc. CHI, 189-193, 1991
[R03] J. Ruhl. Efficient Algorithms for New Computational Models. PhD thesis, MIT, 2003
[SB94] M. Sarkar, M.H. Brown, Graphical fisheye views. Comm. of ACM, 37:73-84, 1994
[SGG05] A. Silberschatz, P.B. Galvin, G. Gagna. Operating Systems Concepts, 7th Edition, Wiley, 2005
[Vi01] J.S. Vitter. External Memory Algorithms and Data Structures: Dealing with MASSIVE DATA, ACM Comp. Surv., 33(2):209-271, 2001
[WSS99] J.T.L.Wang and B.A. Shapiro and D. Shasha, Pattern Discovery in Biomolecular Data, Oxford University Press, 1999
[We05] S. Wernike: A faster algorithm for detecting network motifs. Proc. WABI, 1-12, 2005
[WPD01] R. Whaley, A. Petitet, J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project. Par. Comp., 27(1-2):3-35, 2001
[ZE99] O. Zamir and O. Etzioni. Grouper: a dynamic clustering interface to web search results. Comp. Networks, 31(11-16):1361-1374, 1999
[ZL77] J. Ziv, A. Lempel. A universal algorithm for sequential data compression, IEEE Tr. Inf. Theory, 23(3):337-343, 1977
Keywords
ALGORITHMS, DATA STRUCTURES, COMPLEX MEMORY SYSTEMS, EXTERNAL MEMORY, DATA COMPRESSION, GRAPHS, DATA VISUALIZATIONMAINSTREAM: Algorithms for massive information structures and data streams
Università degli Studi di Roma "La Sapienza"Abstract
This research project brings together five leading groups in algorithmic research in Italy in a joint effort that aims at:(1) Discovering new algorithmic techniques and methodologies for processing very large information structures;
(2) Identifying and solving key algorithmic problems in important applications that deal with massive data sets; and
(3) Contributing to the accelerated transfer of advanced algorithmic technologies through experiments and the engineering of efficient algorithmic code.
As a natural consequence of our methodological and application-oriented approach, the research plan is organized into two mutually related Work Parts (WP):
WP1 - COMPUTATIONAL MODELS AND ALGORITHMIC METHODOLOGIES
In this Work Part we will address a variety of computation models used for managing massive information structures, such as external memory models, hierarchical memory models, pipelined memories, faulty memories and data streaming. In particular we will study how algorithm design techniques and technological aspects of complex memory systems interact and adapt to each other. We also plan to study general algorithmic techniques such as compression and space-efficient data representation, general methods for storing and visualizing very large graphs, and pattern discovery techniques. Another goal will be to develop tools and techniques for algorithmic engineering on massive data sets. Our activity in this context will be organized in three Tasks, corresponding to the research topics described above:
Task 1.1 - Complex memory systems
Task 1.2 - Data streams
Task 1.3 - Algorithmic techniques and tools for massive data sets
WP2 - ALGORITHMS FOR SPECIFIC APPLICATION DOMAINS
In many real-world application domains, the capacity to process efficiently massive data is playing a crucial role in order to save costs and to improve efficiency and reliability. In this Work Part we will concentrate on the design of algorithms for specific types of massive data occurring in real world application domains, such as (i) large collections of documents and sequences, (ii) large graphs and networks and (iii) simulation of multi-physics problems. Our work in this context will be organized in three Tasks, corresponding to the three application domains addressed above:
Task 2.1 - Very large collections of documents and sequences
Task 2.2 - Very large graphs and networks
Task 2.3 - Very large computational problems
All research activities presented in the description of the project will be divided in two phases, corresponding to the two years of duration of the project. <<<
Principal Investigator
Giorgio Ausiello Università degli Studi di ROMA "La Sapienza"Research Objectives
During the last ten years the developments in computer applications have put more and more emphasis on the management of massive data sets, such as very long strings, huge collections of texts and documents, very large graphs and matrices, etc. The need for handling such data sets arises, for example, in bioinformatics, in meteorology, in the solution of challenging computational problems, in the exploration and visualization of very large computer networks (such as Internet) or social networks. For various reasons, designing efficient algorithms for dealing with huge information structures represents a major challenge. In most cases data are stored in external memory and we have to devise new algorithmic techniques for handling them minimizing disk accesses. Besides, in such circumstances, new techniques for compression and visualization are required. Note that in some cases data are so large that they cannot even fit in external memory but have to be analyzed 'on the fly'. Finally, when massive data are considered, new algorithms for dealing with faulty data are needed because error probability is not anymore negligible.This research project brings together five leading groups in algorithmic research in Italy in a joint effort that aims at:
(1) Discovering new algorithmic techniques and methodologies for processing very large information structures;
(2) Identifying and solving key algorithmic problems in important applications that deal with massive data sets; and
(3) Contributing to the accelerated transfer of advanced algorithmic technologies through experiments and the engineering of efficient algorithmic code.
The main emphasis of the project will be on a novel combination of application-oriented research in important domains such as networking, large documents management and simulation of large-scale physical phenomena with innovative methodological work on algorithm engineering, information visualization and general algorithmic techniques. As a natural consequence of our methodological and application-oriented approach, the research plan is organized into two mutually related Work Parts (WP):
WP1 - COMPUTATIONAL MODELS AND ALGORITHMIC METHODOLOGIES
We will address a variety of computation models used for managing massive information structures, such as external memory models, hierarchical memory models, pipelined memories, faulty memories and data streaming. In particular we will study how algorithm design techniques and technological aspects of complex memory systems interact and adapt each other. We also plan to study general algorithmic techniques such as compression and space-efficient data representation, general methods for storing and visualizing very large graphs, and pattern discovery techniques. Another goal will be to develop tools and techniques for algorithmic engineering on massive data sets. Our activity in this context will be organized in three Tasks, corresponding to the research topics described above:
Task 1.1 - Complex memory systems
Task 1.2 - Data streams
Task 1.3 - Algorithmic techniques and tools for massive data sets
WP2 - ALGORITHMS FOR SPECIFIC APPLICATION DOMAINS
In many real-world application domains, the capacity to process efficiently massive data is playing a crucial role in order to save costs and to improve efficiency and reliability. In this Work Part we will concentrate on the design of algorithms for specific types of massive data occurring in real world application domains, such as (i) large collections of documents and sequences, (ii) large graphs and networks and (iii) simulation of multi-physics problems. Our work in this context will be organized in three Tasks, corresponding to the three application domains addressed above:
Task 2.1 - Very large collections of documents and sequences
Task 2.2 - Very large graphs and networks
Task 2.3 - Very large computational problems
All research activities presented in the description of the project will be divided in two phases, corresponding to the two years of duration of the project.
The consortium is composed of about 53 researchers, for a total of 605 person-months in a 2-year period. Of these, 528 months are contributed by 43 researchers (faculty, postdocs, Ph.D. students and other university research personnel) whose salary is not charged against this project, and 77 months by research personnel supported by the project. The total cost of the project is EUR 282,600, out of which EUR 197,900 are requested to MIUR and EUR 84,700 are charged to the research units. <<<
Timescale
24 monthsNational and international background
In the following we will illustrate, for each task, the scientific background on which the project is based.WP1 - COMPUTATIONAL MODELS AND ALGORITHMIC METHODOLOGIES
Task 1.1 - Complex memory systems
Contemporary computing platforms exhibit rather complex memory systems typically organized as a hierarchy of increasingly larger and slower memory levels. Adapting the structure of the computation to the structure of the memory system is crucial to performance, especially when massive data sets or streams are involved. Moreover, faults are likely to occur in large and complex memories, and suitable strategies must be adopted to guarantee the correctness of the computation. We summarize below recent results, relevant to the project goals, regarding adaptivity (at both algorithmic and virtual memory management levels) and fault-tolerant algorithms.
Algorithmic adaptivity
Two well known approaches for the design of adaptive algorithms in memory hierarchies are based on the External Memory (EM) (see [AV88]) and the Cache-Oblivious (CO) (see [FLPR99]) frameworks. Both frameworks are based on a two-level hierarchy model formed by a fast but small level and a large but slow level. Data blocks in the slow memory can only be accessed by first copying them into the fast memory, and the goal is to minimize copy operations. In the EM framework, parameters such as fast memory size and block size can be explicitly used in an algorithm, and copy operations are under programmer's control; while in the CO framework hierarchy parameters are unknown to the programmer and copy operations are automatically managed by the memory system. Under some assumptions, a CO algorithm minimizing transfers in a two level hierarchy also minimizes transfers between any pair of adjacent levels in deeper hierarchies typical of real machines. The bounds of many EM algorithms have been matched by corresponding CO algorithms [ABF05]; this is not always possible, as shown in [BP01,BF03]. For relevant results, see also [ACS87, Vi01,MZ03,AFI05,ADM06].
A number of software packages (FFTW [FJ98] and ATLAS [WPD01] among others) have been developed in recent years which achieve optimal performance adapting the value of some internal parameters to the memory system and, more generally, to the whole executing platform.
Past theoretical work on adaptive and oblivious algorithms has focussed mostly on adapting to the variable size and speed of the memory hierarchy, assuming however that a single memory request, read or write, can be pending at any given time. It is also interesting to investigate adaptivity for memory systems exhibiting considerable concurrency. In this regard, pipelined hierarchies have been studied in [BEP02], while a very intimate connection between hierarchical and distributed memory (without pipelining) has been established in [FPP05].
Adaptivity in VMM
One key subsystem of operating systems is the virtual memory manager (VMM). The VMM makes sure that pages in the virtual space of the processes are correctly represented, either in (main) memory or on disk. Since only pages resident in memory are accessible by processes, when a non-resident page is referenced a page fault occurs and that page has to be loaded from the disk into some free memory frame. To maintain a supply of free frames, resident pages have to be occasionally evicted (in steady state, one per fault) and stored on disk. The various aspects of efficient virtual memory management have been very extensively studied (see [SGG05] and its bibliography). The optimal control theory methodology that we plan to bring to bear on the studies proposed here can be found in [B05].
Algorithms for unreliable memory
The inexpensive memories of today's computer platforms are characterized by high error rates, that cannot be underestimated as the memory size becomes larger. Environmental conditions, such as cosmic rays and alpha particles, can indeed affect the memory behavior, resulting in unpredictable failures, which may jeopardize the correctness of computational results. The problem of computing in the presence of faulty components has been previously investigated in a variety of settings [CLGKP94, FPRU94]. Past research, however, has mostly focused on error recovery in a static setting, using data replication and error correcting codes [CLGKP94]: this can be very inefficient in dynamic settings in which the objects to be managed are large and complex. It seems thus natural to ask whether it is possible to design fault-tolerant algorithms and dynamic data structures that do not exploit data replication in order to achieve resilience to memory faults. Upper and lower bounds for the problems of sorting and searching in faulty memories without replicating data appear in [FI04].
Task 1.2 - Data streams
Data stream processing has gained increasing popularity in the last few years as an effective paradigm for processing massive data sets. A wide range of applications in computational sciences generate huge and rapidly changing data streams that need to be continuously monitored in order to support exploratory analyses and to detect correlations, rare events, fraud, intrusion, unusual or anomalous activities. Relevant examples include monitoring network traffic, online auctions, transaction logs, telephone call records, automated bank machine operations, atmospheric and astronomical events [M05]. Data streams need to be processed in one or few sequential passes using a limited amount of working memory. Despite the heavy restrictions on time and space resources imposed by this data access model, some data sketching and statistics problems, such as frequency moments, quantiles, histograms, and wavelets can be approximated using a working space and a number of passes polylogarithmic in the size of the input stream [M05]. More recently, there has been growing interest in solving graph problems [FKMSZ04], which appear to be especially hard in a streaming setting. Motivated by practical factors, some researchers have recently proposed less restrictive streaming models, where algorithms are allowed to write intermediate streams, or even use sorting as a primitive, devising efficient algorithms in external memory based on the streaming paradigm [ADRR04, DFR06].
Task 1.3 - Algorithmic techniques and tools for massive data sets
In recent years the amount of information processed by computers has grown so dramatically that new technologies for dealing with this world-wide phenomenon are needed [Vi01]. Specifically, compression, graph visualization, and frequent pattern extraction, as well as the development of algorithm engineering tools, represent challenging problems, and will be investigated in this task of the project. Below, we survey a number of relevant known results related to these problems.
Compression and space-efficient techniques
The area of data compression is a classic one and a heavily used part of computer science. We believe that a new area in computer science has emerged: algorithmics on compressed data, where the goal is to deal efficiently with objects that are already compressed. These new approaches are based on Lempel-Ziv (LZ) and the Burrows-Wheeler transform (BWT), which are at the heart of the best compressors, such as gzip and bzip. They will also stimulate further research in classical data compression for finding extensions of such basic techniques [BW94,ZL77]. Indeed, a paradigm shift is occurring, whereby it is becoming clear that some compression methods can actively facilitate indexing and querying of large corpora of data. Recent results in this direction are found in [FM05,GV05].
Drawing of large graphs
Most visualization systems fall short when coping with very large graphs and rely mainly on geometric windows. The most common strategy is to provide a virtual (very large) page to fit the layout of the whole graph, and then provide a small window and scroll bars to allow the user to navigate through the visualization. Some alternative techniques have also been proposed: fish-eye views [SB94] or hyperbolic browser technique [LRP95], or three-dimensional methods, such as cone trees [RMC91]. A methodology relying on "topological windows" was first envisaged in [HEW98] and also applied to the visualization of trees in [DDFLPP99]. Concerning the visualization of the subgraphs shown in each topological window we intend to focus our attention on radial drawings which are suitable to highlight the "centrality" of the different vertices in the topological window. Radial drawings have been studied, e.g., in [DDLM04]. In most of the networks it is possible to find semantic relationships among components that allow grouping them into clusters. Most of the existing graph drawing methodologies are based on graph planarity. However, the possibility of their application to the visualization of clustered graphs, which is of considerable practical interest, conflicts against the limited development of a clustered planarity theory. The problem of deciding whether a clustered graph admits a planar representation (c-planarity) and the problem of finding a planar representation for it are still open in the general case [FCE95, GLS05, CDPP05].
Tools for algorithm engineering on massive data sets
Developing efficient and robust algorithmic software is a complex and delicate task fraught with many pitfalls. This task becomes even more difficult when the data to be processed is very large, since the computation may need to access data stored at different levels of a memory hierarchy. Visual tools, largely lacking in conventional development environments, can be of invaluable help in analyzing, tuning, and debugging codes in this context. While many software systems and libraries have been proposed in the last decade for visualizing different aspects of algorithmic codes [DFIN02], the problem of dealing with software for massive data sets still remains a challenging research area.
Motif extraction in networks
An important task when performing information retrieval on massive data structures is detecting frequent patterns, called "motifs" in certain applications, that occur in significantly higher frequencies than in a random structure [PZ03, PS05]. Motifs are often searched for in approximated form, leading to computationally hard problems. In the literature several methods are proposed, designed for specific instances of the data structure supporting data and for different goals. These methods are based on consolidated algorithmic techniques, such as the whole class of pattern discovery algorithms that use suffix trees, and share the principle that the degree of feasible approximation is bounded. The problem of finding motifs in networks has been recently investigated in [M+02,We05]. To some extent this is a generalization of the classical problem of detecting motifs in biological sequences, also studied in Task 2.1 of this project.
WP2 - ALGORITHMS FOR SPECIFIC APPLICATION DOMAINS
Task 2.1 - Very large collections of documents and sequences
Processing very large strings of symbols is a fundamental problem that arises in several modern applications. For example, a large fraction of the data published over the Web is maintained in the form of text documents. Another important application scenario where the data is maintained as a sequence of symbols is computational biology. Searching and extracting patterns are fundamental problems in these settings. Below, we survey a number of relevant known results related to these problems.
XML documents
Today XML is the standard for information representation, exchange and publishing over the Web. This comes at a threefold cost for XML documents: first, they have a natural tree structure; second, they are verbose; third, they have mixed elements with both text and numerical or categorical attributes. As a result, XML queries are richer than commonly used SQL queries, see for example [CMV05].
Biological sequences
Another rich field of investigation is motif discovery in sequences, with one main streamline aimed at detecting parts that are particularly frequent, or shared by distinct sequences [WSS99]. Applications range from the detection of transcription factors binding sites [MS00] to the understanding of mammalian evolution [R+05], just to mention two extreme cases.
Task 2.2 - Very large graphs and networks
The sheer size and the dynamic nature of modern communication networks in several applications raises challenging algorithmic questions, some of which will be addressed in this project. Below we survey some of the most fundamental previous results related to the problems we plan to study.
Routing and broadcasting in networks
Maintaining information about large dynamic networks poses many challenging algorithmic problems. While fast algorithms for computing shortest paths in directed graphs have been known for decades [D59], surprisingly no general efficient method for updating them in a graph subject to dynamic changes (such as link failures) has been devised until recently [K99, DI04]. While the sheer size of networks arising in Internet routing applications often requires algorithms to work with data stored on secondary storage, surprisingly no dynamic external memory algorithms are known in the literature.
In other scenarios such as peer-to-peer networks, communication is typically achieved using suitable interconnecting topologies such as minimum spanning trees. Since a single node failure is enough to interrupt a message transmission, and typically failures happen one at a time, one may precompute replacement edges for each possible failure, to be used when a failure of a node occurs. A sequential algorithm following this approach has been studied in [NPW04], while distributed algorithms solving similar problems have been developed e.g., in [PSWZ04].
Visual analysis of networks
The limitations of traditional Web search engines are widely documented (see, e.g., [K05]). New-generation search engines are proposed as a solution. Web-meta search clustering engines are discussed in [DDGL05a, ZE99]. Despite the graphical user interface of a Web meta-search clustering engines plays a fundamental role for the effectiveness of the system, the vast majority of Web meta-search clustering engines have a GUI in which the set of clusters is represented as a tree of directories and subdirectories. Preliminary results towards applying graph drawing techniques to this domain have been presented, e.g., in [DDGL05a].
Task 2.3 - Very large computational problems
Other applications where input instances are typically of very large size include physics simulations. Among them, the solution of the very large non-linear systems of PDEs arising in the FE simulation of multi-physics problems has seen the last few years the emergence of a number of multifrontal solvers, among the others, MUMPS, SuperLU, PaStiX, MFACT, SPOOLES, PSPASES, WSM (see [ADE01] and references therein). However, only two prominent solvers, MUMPS and SuperLU_DIST [DEG99], work in parallel on distributed-memory architectures and are capable of solving general unsymmetric linear systems, which is our case of interest. MUMPS implements a multifrontal algorithm based on a parallel hierarchical LU decomposition approach, with a dynamic allocation of computational subtasks to the computing processors. In contrast, SuperLU_DIST partitions the system matrix by means of a different technique, based on the identification of supernodes. The performance of both solvers degrades rapidly as the size of the parallel machine scales. <<<



