Vai al contenuto| Home page|

   Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca
INIZIO_TESTO_DA_INDICIZZARE

RESEARCH PROGRAM

italiano - inglese
Similar research programs:
Scientific and education field classification
International Patent Classification
Geographical classification
Keywords
ALGORITHMS; DATA STRUCTURES; ALGORITHM ENGINEERING; GRAPH ALGORITHMS; INTERNET ALGORITHMICS; WEB ALGORITMICS; COMPUTATIONAL MODELS; INFORMATION VISUALIZATION

Algorithms for the Next Generation Internet and Web: Methodologies, Design, and Experiments

Università degli Studi di Roma "Tor Vergata"
Abstract
This research project brings together five leading groups in algorithmic research in Italy in a joint effort that aims at:

(1) Discovering new algorithmic techniques and methodologies;

(2) Identifying and solving key algorithmic problems in important applications; and

(3) Contributing to the accelerated transfer of advanced algorithmic technologies into tools, libraries and systems.

The main emphasis of the project will be on a novel combination of application-oriented research in two important domains (the next generation Internet and the Web) with innovative methodological work on algorithm engineering, information visualization and general algorithmic techniques. As a natural consequence of our methodological and application-oriented approach, the research plan is organized into three mutually related Work Parts (WP):

WP1 - Advanced Methodologies for Internet and the Web. We plan to identify and develop key algorithmic methodologies that arise from Internet and Web applications, which are general enough and thus widely applicable. Furthermore, we plan to produce also methodologies and testbeds for algorithmic engineering.

WP2 - Internet Algorithmics: Design and Experiments. In the next generation Internet, the capacity to optimize the use of the network resources will play a crucial role in order to save costs and to improve efficiency and reliability. We thus plan to design and to deploy new algorithmic techniques capable of ensuring efficient communication and effective usage of resources in next generation networks.

WP3 - Web Algorithmics: Design and Experiments. The wealth of information increasingly available on the Web has triggered a considerable effort for developing efficient algorithms for several challenging data mining, information retrieval and visualization problems. We plan to study the linked structure of the Web, in order to investigate and validate models of the Web graph, to design algorithmic techniques that are capable of searching and updating massive data stored in complex memory systems, and to realize new systems for visualizing and browsing Web Communities.

The consortium is composed of about 50 researchers, for a total of 456 person-months in a 2-year period. Of these, 384 months are contributed by faculty, postdocs, Ph.D. students and other university research personnel, whose salary is not charged against this project, and 72 months by research personnel supported by the project. The total cost of the project is EUR 276.000, out of which EUR 193.200 are requested to MIUR and EUR 82.400 are charged to the research units. <<<

Principal Investigator
Giuseppe Francesco ITALIANO Università degli Studi di ROMA "Tor Vergata"
Research Objectives
Traditional algorithms and algorithm design methods seem to be inadequate for the algorithmic challenges of the near future. Massive data sets and sophisticated retrieval operations on these data arise in many important applications, such as Web Information Retrieval. In this framework, new algorithmic techniques, which better suit the architecture of real computing platforms, seem also deeply needed. The next generation Internet will pose new challenging algorithmic problems as well: communications patterns in networks are radically changing, and the very large and heterogenous networks of tomorrow must be controlled in a more efficient way than what is currently possible.

This project aims at discovering new algorithmic techniques and methodologies, identifying and solving key algorithmic problems in important applications, and contributing to the accelerated transfer of advanced algorithmic technologies into practical systems. In more detail, the main objectives of the project are:

(1) To contribute to the science and engineering of algorithms and data structures that, spurred by the above domains, will be enriched by new methodologies and techniques.

(2) To design and analyze new and efficient algorithms for tackling specific Internet and Web problems, in order to have an impact on key applications in these areas.

(3) To assess the effectiveness of our solutions not only by analytical means, but also by developing prototypes and by experimenting them against test suites taken from the practical realm.

In order to achieve these goals, we shall combine application-oriented research in two important application domains (the next generation Internet and the Web) with innovative methodological work on algorithm engineering, information visualization and general algorithmic techniques. As a natural consequence of our combined methodological and application-oriented approach, we shall focus on three specific problem domains:

1. Advanced Methodologies for Internet and the Web
Internet and the Web motivate the algorithmic solution of a wealth of computational problems, some of which we plan to investigate in this project. In several cases, while challenging, the development of efficient algorithms can be approached within established models of computation and by consolidated design paradigms. In other cases, however, new models or paradigms are desirable, if not necessary. Then, a substantial investigation effort to develop the appropriate framework is a key prerequisite to algorithm design. In this workpart, we shall identify and develop key algorithmic methodologies that arise from Internet and Web applications, which are of a general nature and thus widely applicable. Furthermore, we shall produce methodologies and testbeds for experimental algorithmics, and in particular for algorithms that exploit the memory hierarchy of modern computing platforms.

2. Internet Algorithmics: Design and Experiments
Next Generation Internet will pose new algorithmic challenges, as the networks will be required to be, among other aspects:
- Capable of delivering different services with different quality of service in terms of bandwidth, delay, jitter, and reliability.
- Fault-tolerant, and thus capable of recovering quickly from link failures, congestion, and routing instabilities.
As a consequence, the capacity to optimize the use of the network resources in order to save costs and to improve efficiency and reliability will play a crucial role. We thus plan to design and to deploy novel algorithmic techniques capable of ensuring efficient communication and effective usage of resources in next generation networks.

3. Web Algorithmics: Design and Experiments
The wealth of information increasingly available on the Web and in other contexts has triggered a considerable effort for developing efficient algorithms for several challenging data mining, information retrieval, and visualization problems. A large body of research has also focused on the study of properties of the Web and in particular of the Web Graph. Realistic models of the Web graph make it possible to study its hyper-link structure and the way it develops over time. They are useful for crucial Web applications, such as search engines. Furthermore, the study of the linked structure of the Web also allows one to infer Web Communities, an important step to understand how the information is distributed over the Web. We shall investigate and validate models for the Web graph, and we shall design a new system for the visualization and the browsing of Web Communities.
Web applications are also a rich source of new algorithmic problems. Such applications indeed typically run on large data sets that are stored on computer platforms with complex memory systems. We plan to design algorithmic techniques that are capable of searching and updating large data sets, such as HTML and XML repositories, by directly exploiting the memory hierarchy of computing platforms.

A fully integrated approach to modelling, design, analysis, implementation and engineering seems crucial for the success of an ambitious project of this kind. Indeed, modelling implies trade-offs between capturing the essence of computer systems and networks and making the tasks of algorithm design and analysis tractable. However, algorithm design, while inspired by a particular model, should not overlook implementation issues: the measured performance of the implementation should be the ultimate benchmark for the modelling and design efforts. Furthermore, the increased level of sophistication required by the modelling and design phases might make the scientific results more difficult to access and to implement by non-specialists; hence, algorithmic research might have a limited impact if its initial transfer to libraries, tools and systems is not accomplished as part of the research effort.

This integrated approach is also ensured by the fact that researchers in the consortium have leading-edge research expertise in algorithmic design and algorithmic software development, and these two research skills are quite often combined in the very same person. The capability of accomplishing the project objectives is ensured also by the fact that all the research units in the consortium have access to adequate computing equipment and software platforms (just to name a few, research unit 2 owns a parallel machine with a large memory system; research unit 4 belongs to a network of excellence on the design of the Next Generation Internet, and can therefore access network testbeds and simulators, other research units have access to large data sets from the Internet and the Web). This probably allows the consortium to deliver the different phases of the research, from the algorithm design to its simulation analysis and its embedding in a real network environment. We believe that this can give a strong indication that most of the objectives, if not all, can be succesfully met by the proponents. <<<
First Results
Below we list the main results expected for Phase 1, organized accordingly to the different tasks of the project.

Task 1.1 Models and Algorithmic Paradigms
- A first release of an open source algorithmic software library to handle massive graphs
- An Architecturally Parametrized Model of Computation (APMC)
- Design of novel fault-tolerant algorithms for fundamental combinatorial problems

Task 1.2 Visualization Methodologies
- Characterization of classes of clustered graphs for which c-planarity testing can be solved in polynomial time
- New results about properties and relationships of k-dense trees and k-trees
- Polynomial time algorithms for computing radial drawings of graphs with predefined vertex levels

Task 2.1 Algorithms for Internet Routing
- New algorithms for dynamic single-source shortest paths and dynamic t-spanners
- The theoretical and experimental study of Nash equilibria and price of anarchy of congestion management policies
- Fast IP lookup and classification methods

Task 2.2 Discovery and Visualization of the Internet
- Models and drawing algorithms for the visualization of the network at different abstraction levels
- Design of algorithms and a platform for IPv4-IPv6 network discovery
- Models of inter-domain routing dynamics and studies about routing instabilities and their inference from archives of BGP updates

Task 3.1. Algorithms for Web Searching
- Framework to assess the performance of algorithms for mining frequent/interesting patterns
- Theoretical and experimental analysis of the stability of link analysis ranking algorithms
- Efficient methods for using patterns to detect text similarity

Task 3.2 Discovery and Visualization of the Web
- A prototype system for visualizing and browsing Web Communities
- Mapping of the components of the Web GraphBelow we list the main results expected for Phase 2, organized accordingly to the different tasks of the project.

Task 1.1 Models and Algorithmic Paradigms
- Second release of the algorithmic software library (open source) for massive graphs, and empirical study of properties of samples from the Web
- Enhancements to the APMC model, and experimental evaluation (on 3 different hardware platforms) of the parametrized algorithms developed as case studies.
- Design of fault-tolerant external memory models and algorithms.

Task 1.2 - Visualization Methodologies
- Efficient c-planar drawing algorithms for the classes of clustered graphs for which the problem is tractable.
- Efficient and effective algorithms to draw k-dense trees and k-trees in 2D and 3D space.
- Polynomial time algorithms for computing planar radial drawings of graphs using the minimum number of levels.

Task 2.1 - Algorithms for Internet Routing
- Engineering of the algorithms devised in Phase 1 for Internet routing.
- Cost sharing mechanisms for single source rent-or-buy network design that use randomization and share the expected cost of the network infrastructure.
- Experimental benchmark of IP lookup and classification schemes

Task 2.2 - Discovery and Visualization of the Internet
- Experiments of network discovery in a mixed IPv4-IPv6 environment with a new software library developed for this purpose, and reports on the IPv6 protocol deployment.
- A system for the visual exploration of the network showing both router and AS-level information
- A system for monitoring the AS network in search for causes of instabilities

Task 3.1. Algorithms for Web Searching
- External-memory algorithms for mining interesting/frequent patterns in large datasets.
- Analysis of link analysis algorithms on specific models for the Web Graph.
- Booster tools for compressing data and efficient data structures for (text) indexing.

Task 3.2 - Discovery and Visualization of the Web
- A Web service for the discovery, the visualization, and the browsing of Web Communities.
- A framework for studying the temporal evolution of the Web Graph. <<<
Timescale
24 months
National and international background
Algorithmic technologies are playing a vital role in the spectacular growth of the Internet and of the Web: this is witnessed not only by a rich body of scientific literature which has been flourishing over this area, but also by a number of Internet companies (e.g., Akamai, Google, Yahoo), which have based a major part of their business success on new algorithmic ideas. This research project deals with the design of algorithms and data structures for applications arising from the Internet and the Web domains. In particular, we shall focus on advanced algorithmic methodologies, and on the design and experimentation of algorithmic technologies for the Internet and the Web. This state-of-the-art section will primarily refer to those areas, briefly surveying previous work which is at the basis of our research plan.

Advanced Methodologies for Internet and the Web

Most of the applications arising from the Internet and the Web, like Web searching, generate ever larger data sets, of the order of PetaBytes. In such a scenario, algorithms for massive data problems must take explicitly into account that the size of the internal memory is limited, and that access to external memory (e.g., hard disks) is currently about six order of magnitudes slower than access to internal memory. Furthermore, modern computing platforms have a quite complex memory hierarchy system, which basically rewards locality of reference, as data is transferred in blocks along this hierarchy. A number of models of computations have been investigated that explicitly expose some architectural parameters: internal and external memory hierarchy [AC+94, VS94] and bandwidth-latency properties of the interconnection network [V90, BF+01].

In this framework the memory hierarchy of modern computing platforms has a direct impact on the efficiency of the codes that need to run on large data sets: if the data do not fit in internal memory, algorithms that are fully aware of the parameters of the particular memory hierarchy system they are running on (e.g., block size, memory-level size, cache line), can possibly gain considerable speed-ups. However, algorithms designed for a particular computing platform (or a particular machine model) are not necessarily efficient on another platform. They need to be targeted and engineered with that particular computing architecture in mind, as they depend heavily on the underlying memory system. For this reason, they typically need quite substantial work and a fair amount of re-engineering before being even portable to a different computing system, which may limit their overall impact.

One possible approach is to use cache-oblivious algorithms, as proposed in [F+99]. Although cache-oblivious algorithms are aware of the existence of a memory hierarchy, they do not have access to the memory system's parameters. We remark that algorithms designed in this model can be simultaneously optimal across a range of parameter values [F+99,BD+00,AB+02,FG03]. However, achieving near optimal performance with a single (non adaptive) code across several platforms unfortunately is not always possible, as shown in [BP01,BF03]. As an alternative to cache-obliviousness, the idea of designing software that adapts to the available platform for performance has become popular in recent years, due to the success of packages such as FFTW [FJ98] for Fourier transforms and ATLAS [WPD01] for linear algebra. Unlike their cache-oblivious counterparts, adaptive algorithms produce differently engineered codes on different platforms, thus optimizing the resource usage of the particular computing systems. To make this completely transparent to the user, adaptive algorithms have built-in parameters to match the structure of the platform and best harness its computational potential. This and similar software heavily relies on problem-specific intuition and knowledge on the part of the designer. Currently, there is no general methodology on how to parameterize an algorithm to make it adaptive and only preliminary results are available on how to search the parameter space [Y+03].

The large and inexpensive memories typical of modern computing platforms are also error prone. Classical algorithms quite often fail to produce always a correct output in faulty memories: for instance, designing efficient fault-tolerant sorting and searching data is a real problem in search engines, such as Google [F02]. The problem of computing in the presence of faulty components has been previously investigated in a variety of settings [AD91,CLGKP94,FI04]. Past research, however, either has focused on error recovering in a static setting, using data replication and error correcting codes [CLGKP94], or has addressed the problem of designing algorithms that can tolerate transient ALU failures, such as comparator failures [AD91]. These approaches are not suitable for dealing with highly dynamic destructive memory faults. Very recently, the design of sorting algorithms resilient to destructive memory faults has been investigated in [FI04].

Visualization tools can help us coping with the high complexity of the Internet and the Web. The difficulty of visualizing such a kind of networks mainly arises from the heterogeneous nature of their huge topologies, which consist of many subgraphs with different densities and connectivity factors. Applying standard graph drawing models and techniques [DETT99,KW01] to the visualization of such networks is actually hard. It is thus crucial to explore new methodological approaches to improve the effectiveness of models, algorithms, and systems developed so far [DK01,JM03]. Some drawing paradigms that are receiving a growing attention include the following:
- Clustering visualization techniques allow it to cope with the huge number of nodes of the Internet: only a few results are however known about the representation of clustered graphs [D98,DDM01].
- Properties of k-dense trees and k-trees [B98,LP02] allow it to deal with the problem of representing topologies with very heterogeneous density factors.
- Algorithms for computing compact drawings of k-trees are described in [DM03,WD03], but a lot of work remains to be done w.r.t. other aesthetic criteria. The radial drawing paradigm appear to be quite natural to visualize effectively the hierarchical nature of a network, but the research in this area is still at its early stages [BBF03,DD03].

Internet Algorithmics

The complexity of the basic operations that routers will be asked to perform in the next generation Internet will increase significantly. In fact, the network will be required to be fault-tolerant, quickly recovering from failures, congestion, and instabilities. Also, it will be required to be flexible enough to guarantee the access from heterogeneous technologies, delivering different quality of service levels in terms of bandwidth, delay, jitter, and reliability. Finally, it will most probably be an "all IP" network, without relying on global coordination, and facing a slow transition from the current IPv4 network protocol to IPv6. In this respect it is crucial both to enhance the algorithmic foundation of router-critical operations (which are essentially routing tables update, packet forwarding, and congestion control), and to design visualization tools for discovering, exploring, and monitoring the network.

The next generation network routing methods will most probably aim at performing continuous routing table updates in response to dynamic link cost changes. Many Internet routing schemes such as OSPF (Open Shortest Paths First) [ISO89,M99] deliver packets along shortest paths in the network to avoid traffic congestions: each node tries to deliver its packets towards their destination by using minimum distance paths. While fast algorithms for computing shortest paths have been known for decades [D59], surprisingly no general efficient method for updating them in a graph subject to dynamic changes has been devised until recently [K99,DI01,DI03]. Most of the algorithms designed so far apply to all-pairs shortest path problems, and there are very few algorithms and techniques for single source problems, that would be instead very useful for Internet routing. Furthermore, the dynamic structures should be maintained in a distributed manner, which is still the subject of active research. For instance, the approach of [NPW03], where a suitable swap edge is precomputed for each failing edge, guarantees 1-fault tolerant routing tables, but its sequential formulation does not lend itself to a distributed implementation.

Much attention has been recently devoted also to applying techniques from Game Theory to the design of network protocols, especially for congestion-control algorithms. Concepts from Game Theory are indeed natural candidates to study the behavior of network agents and to design algorithms suitable for encouraging cooperation between selfish users in anarchic networks, such as in the case of Internet. Internet switching and routing problems, bandwidth allocation, mechanisms implementing a fair share of the cost of network services and infrastructures [JV01] are prime candidates for studying such selfish behavior. Previous work analyzes switch service disciplines with specific traffic distributions in the fluid model [SH95, GK02] showing that traditional First-In-First-Out (FIFO) and selfish behavior do not guarantee efficiency and fairness, leading to a congestion collapse.

Concerning network discovery and visualization, both research fields have a rich literature (see for example [BC99, SMW02] and [CDDMP02]). However, the task of network discovery in a mixed IPv4-IPv6 environment has rarely been addressed [CDP04] and never yielded a visual exploration tool. On the contrary, since the replacement of the current IPv4 network protocol is expected to be very slow and complex, monitoring the transition phase will be a crucial effort. Also, much is known about the analysis of the graph of the Autonomous Systems (ASes), the independent organizations whose cooperation guarantees the delivery of the IP packets in the Internet. However, the inferred knowledge about the kind of commercial relationships between ASes [SARK02, DPP03] has never been used to help the analysis of inter-domain routing instabilities, and the results of this analysis have never been visualized, possibly integrated with information at a lower level of abstraction.

Web Algorithmics

The huge amount of data available on the Web requires efficient techniques for searching and updating large data sets, such as HTML and XML repositories. Substantial gain in performance can be attained by directly exploiting the memory hierarchy, via innovative algorithmic methods and data compression [FM00,GV00]. In several Web Information Retrieval (Web IR) applications, it is also crucial to identify repeated patterns, and it is necessary to detect similar rather than identical patterns: this problem is hard as the number of similar patterns can be exponential. Patterns in their general form can be expressed as regular expressions, which have become the preferable way of defining flexible patterns, both for natural language text repositories and for structured data (XML documents). Since [AIS93], a wide body of algorithms has been proposed for mining frequent patterns in transactional datasets. Some algorithms [OPPS02], refine the breadth-first pattern generation strategy featured in the A-priori algorithm initially proposed in [AS94], by introducing novel data structures and frequency counting strategies. Other algorithms [HPY00,LPWH02] adopt instead a depth-first approach, efficiently combining the candidate generation and frequency counting phases. Recently, it has been argued in [PZ03] that compressed tries attain best space and time performance in almost all cases. The most advanced algorithms for mining frequent itemsets have been presented and compared in [FIMI03]. The mining of sequential patterns is a related problem where the input is a collection of sequences, and frequent subsequences are sought. This problem arises, for example, in the extraction of patterns from weblog data [HK01]. A first strategy inspired by the previously mentioned A-priori algorithm for frequent itemset mining was proposed in [AS95] and refined in [SA96]. Improved algorithms have been subsequently developed in [Z01,P+01].

Motivated by search engine technologies, a large body of research has also focused on the measurement of the properties of the Web graph [BMR+00, LLM+03]. The study of the topological structure of this graph has revealed the self-similar structure of the Web, with the ubiquitous presence of power laws that are typical of scale-free structures. A fascinating picture of the Web shows a Bow Tie structure with a large strongly connected component including about 25% of the pages, an IN and an OUT region of almost equal size. Most of these studies have concentrated on large samples of the Web collected over few months of crawling activity, and little work [KH03] has been devoted to analyzing the temporal evolution of the properties of the Web. The temporal evolution of the Webgraph is naturally correlated with the problem of the stability of link analysis algorithms, that are a key ingredient in modern web search engines. The underlying idea beyond link analysis ranking algorithms [K98,BP98] is that an authoritative page is one that is pointed by good hubs, i.e. a page that points to many good authority pages. Link analysis algorithms are based on spectral methods, pages are ranked by eigenvector computation on the adjacency matrix of the Webgraph or on the co-citation graph. An important feature of document collections such as the Web is its dynamic feature. Recent work [NZJ01, BRRT03] has addressed the problem whether link analysis provide a robust notion of authoritativeness in the sense that the resulting rank is stable with respect to small perturbations of the link structure.

Growing attention has been also devoted to the design of effective and efficient algorithms for the discovery of Web Communities [B03,FLG00,GKR98]. Drawings produced by existing systems for the visualization of Web maps and sites[DK01,RCMC00] are mainly suitable for graphs that have a tree-like structure or that are sparse; for huge and dense graphs, these drawings may have a high number of crossings and bends. In particular, neither effective systems nor services have been specifically developed for the visualization of the complex topologies of Web Communities: this requires optimizing vertex dimensions and labelling, and taking into account planarization and clustering constraints [BDLN02,DDM01,DETT99].Finally, Web maps are widely used to enhance the perception of the content of Web sites [MB98,SH00]). In spite of the impressive array of techniques devised in the last decade, it is typically difficult for the user to get a customized Web map representation of some topic of interest. <<<