Vai al contenuto| Home page|

   Ti trovi in: HOME »Programmi, progetti e risultati »I progetti »PRIN - Programmi di ricerca di Rilevante Interesse Nazionale»Programma di ricerca»Unità di ricerca
INIZIO_TESTO_DA_INDICIZZARE

UNITA' DI RICERCA

italiano - english

Research program

Web Ram: Web Retrieval and Mining
University Co-ordinator
Università degli Studi di MILANO - SCIENZE DELL'INFORMAZIONE - ()
Research Unit Leader
Sebastiano Vigna
Description
---- Research goals ----

Link analysis applied to non-web domains

In the last years, several techniques to analyse the link structure of the world-wide web have emerged with application to ranking (e.g., PageRank and HITS), community recognition (bipartite subgraphs and cores), and spam detection.

Our aim is to explore the usage of such tools to other domains including, but not limited to, software analysis (static call graph and dynamic execution traces), text mining (automatic query expansion, text summarisation and tagging), social network analysis (reputation and trust systems), and thread and trends detection in news and emails.


Models for self-similar finite graphs

The self-similarity of the web graph is considered as folklore. Nonetheless there is a lack of theoretical foundations both in defining and proving such a claim. The main obstacle towards such a goal is that the usual notions of self-similarity are given for infinite objects (Hausdorff vs. topological dimension).

We would like to suggest some new definitions in this directions and apply them to the web graph to empirically corroborate the existing statements about its fractal nature.


Personalised ranking

Global ranking functions are usually difficult to compute on a web scale and their customisation is limited to a fixed set of topics. Personalised ranking aims at providing each user with a different ranking scheme based on the part of the web that is most interesting for the user.

However, defining a precise notion of such ranking is not trivial and should take into account several sources of information such as bookmarks, previous queries and navigation history. Moreover, it should be possible to update incrementally the ranking so that the computational effort on the user side is not too large.

Another important issue is privacy: unless personalised ranking is computed locally, the user would be forced to disclose a significant amount of confidential information.


Axiomatisation of ranking systems

Ranking systems for the web are very similar in their nature to voting systems and social choice systems. However, the latter have been studied using axiomatic approaches and obtaining celebrated results such as Arrow's Impossibility Theorem. Attempts to apply similar principled approaches to PageRank have appeared recently (Altman and Tennenholtz, "Ranking Systems: The PageRank Axioms"), but presently they need a number of additional assumptions.

Axiomatising the ranking process has the advantage of making evident links between different schemes, and makes it possible to formulate more precise theorems. In voting theory, impossibility theorems such as Arrow's would helped in identifying those voting systems (e.g., Condorcet) that are as close as possible to be optimal.

We plan to extend the current limited axiomatisations by including more features, lifting their limitations.


---- Practical Duties ----

Data collection

All research directions above require consistent data collection. We have already a significant experience in this field, and we intend to pursue further this activity. Besides crawling the web, we plan to create software that is able to build graph structures out of email and social networks (e.g., collaborative bookmarking).

Part of the data gathering activities will be aimed at creating data sets for the other members of the project, who need frequent or specialised snapshots of the web.


Software

We plan to develop software fully compatible with the new WARC (Web ARChive) format developed by the Internet Archive (currently a Work Item for ISO approval). In this way, snapshots collected will be usable by a larger community. We also plan to release as free software the implementation of link analysis algorithms that will be necessary for our research goal. We have already made available statistical software for large-scale data and basic PageRank computation software, and we plan to increase the range of techniques implemented.